Improving Classifier Performance by Imputing Missing Values using Discretization Method


E. CHANDRA BLESSIE
Assistant Professor, Department of Computer Science, D.J. Academy for Managerial Excellence, Coimbatore, Tamil Nadu, India

DR. E. KARTHIKEYAN
Assistant Professor, Department of Computer Science, Government Arts College, Udumalpet, Tamil Nadu, India

DR. V. THAVAVEL
HOD and Assistant Professor (SG), Department of Computer Application, School of Computer Science and Technology, Karunya University, Tamil Nadu, India

ISSN : Vol. 4 No.03 March

Abstract

The presence of missing values in a dataset can affect the performance of a classifier. Missing values can be replaced with values estimated from the information available in the dataset. Several methods have been proposed to deal with missing values. In this paper, six different approaches are presented to fill the missing values. We also propose a discretization based method which can increase the relevancy between the instances and attributes. An experimental analysis is carried out on four datasets to evaluate the performance of the C4.5 classifier, measured by its accuracy. The datasets are taken from the UCI ML repository.

Keywords: Data Mining, C4.5, Discretization, Preprocessing, Classifier

1. Introduction

Many learning algorithms perform poorly when the training data are incomplete [Kalton and Kasprzyk (1986)][Mundfrom and Whitcomb (1998)]. Missing attribute values are common in real-world datasets. They may arise from the data collection process, redundant diagnostic tests, unknown data and so on. One standard approach is to impute the missing values and then give the completed data to the learning algorithm. In general, the methods for treating missing values can be divided into three categories [Mehala, et al. (2009)]:

1) Ignoring or discarding the incomplete data, which is the easiest and most commonly applied approach.
2) Parameter estimation, where maximum likelihood procedures are used to estimate the parameters of a model.
3) Imputation techniques, where missing values are replaced with estimated ones. The objective is to use known relationships that can be identified in the valid values of the dataset to assist in estimating the missing values.

The rest of the paper is organized as follows. Section 2 discusses the previous work. Section 3 explains the proposed discretization based method. The experimental analysis and the comparison results are described in section 4. The conclusion and a discussion of the results are given in section 5.

2. Review of the previous work

This section surveys some commonly used imputation methods [Jerzy, et al. (2005)]. Mean imputation is one of the most frequently used methods [6]. It consists of replacing the missing data for a given feature (attribute) by the mean of all known values of that attribute in the class to which the instance with the missing attribute belongs. Suppose the value $x_{ij}$, belonging to the k-th class $C_k$, is missing; it is then replaced by

$$\hat{x}_{ij} = \frac{1}{n_k} \sum_{x_{ij} \in C_k} x_{ij} \qquad (1)$$

where $n_k$ represents the number of non-missing values in the j-th attribute of the k-th class.

Another two methods discard the data having missing values. The first is known as complete case analysis; it discards all instances having missing values [Tresp, et al. (1998)]. The second determines the extent of the missing values before deleting the affected data.

The CN2 algorithm [Clark and Niblett (1989)] fills the missing values of an attribute with the most frequently occurring value of that attribute. This most common attribute value method pays no attention to the relationship between attributes and a decision. The concept most common attribute value method restricts the first method to the concept, i.e., to all examples with the same decision value as the example with the missing attribute value.

CART replaces a missing value of a given attribute with the corresponding value of a surrogate attribute, namely the one that has the highest correlation with the original attribute. C4.5 uses a probabilistic approach to handle missing data in both the training and the test sample [Quinlan (1993)].

3. Proposed system

3.1 Discretization

Discretization [Liu and Setiono (1997)] is a technique that partitions continuous attributes into a finite set of adjacent intervals in order to generate attributes with a small number of distinct values. Each interval can then be treated as one value of a new discrete attribute. Discretizing attributes can reduce the learning complexity and help to understand the dependencies between the attributes and the target class.

Definition: Assuming a dataset consisting of N instances and S target classes, a discretization algorithm discretizes the continuous attribute F in the dataset into n discrete intervals $\{[d_0, d_1], (d_1, d_2], \ldots, (d_{n-1}, d_n]\}$, where $d_0$ is the minimal value and $d_n$ is the maximal value of attribute F.
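As a small illustration of the definition above, the sketch below maps a value of F to the index of the interval that contains it. The function name `interval_index` and the example boundaries are hypothetical, and the sketch assumes that a shared boundary point belongs to the interval on its left, i.e. that every interval after the first is open on the left and closed on the right.

```python
from bisect import bisect_left


def interval_index(value, boundaries):
    """Return the 0-based index of the interval that contains `value`.

    `boundaries` is the sorted list [d0, d1, ..., dn]. Assumed convention:
    [d0, d1], (d1, d2], ..., (d_{n-1}, dn], so a boundary point belongs to
    the interval on its left.
    """
    if not boundaries[0] <= value <= boundaries[-1]:
        raise ValueError("value lies outside [d0, dn]")
    inner = boundaries[1:-1]  # d1 .. d_{n-1} separate neighbouring intervals
    # bisect_left counts the inner boundaries strictly below `value`.
    return bisect_left(inner, value)


boundaries = [0.0, 2.5, 5.0, 10.0]  # three intervals
print([interval_index(v, boundaries) for v in (0.0, 2.5, 3.0, 5.0, 7.0, 10.0)])
# [0, 0, 1, 1, 2, 2]
```

Each interval index can then serve as one value of the new discrete attribute.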
The resulting set of intervals $\{[d_0, d_1], (d_1, d_2], \ldots, (d_{n-1}, d_n]\}$ is called a discretization scheme D on attribute F. CAIM [Kurgan and Cios (2004)] and CACC [Tsai, et al. (2008)] find the cutting points by taking the midpoint between each pair of adjacent values and initializing these midpoints as boundary points for the intervals. NAD [Blessie, et al. (2010)], in contrast, takes the midpoint only between two consecutive values that have different class values and initializes those as boundary points. Because fewer candidate boundaries are generated, the time complexity is reduced.

3.2 Imputation using Discretization

Let $D = \{d_1, d_2, d_3, \ldots, d_n\}$ be the dataset and let the attributes be $A = \{A_1, A_2, A_3, \ldots, A_m\}$, where m is the number of attributes. The proposed system consists of two phases. In the first phase, the data of each attribute are sorted. Initial cutting points are placed between each pair of consecutive values in the attribute that have different class values [Blessie, et al. (2010)]. The next step is to find the mean value within each interval for each class, instead of the mean of all non-missing values in the dataset. The minimum of these interval means for a class is then used to fill the missing values belonging to that class. This increases the relevancy between the instances and attributes. In the second phase, the dataset with the filled-in values is classified using the C4.5 classifier and the accuracy of the classifier is analyzed.

3.3 Pseudocode

Let D be the training data set with continuous features $F_i$ and S classes.
For every $F_i$ do:
Phase 1
Step 1
1.1 Find the maximum ($d_n$) and minimum ($d_0$) values.
1.2 Sort all distinct values of $F_i$ in ascending order.
1.3 Initialize all possible interval boundaries, B, with the minimum, the maximum and the midpoints between values that have different classes, giving the set $B = \{[d_0, d_1], (d_1, d_2], \ldots, (d_{n-1}, d_n]\}$.
Step 2
2.1 For every interval $[d_i, d_j]$, where i is the lower bound and j is the upper bound, find the mean value corresponding to a single class value:

$$\hat{x}_{ij} = \frac{1}{n_k} \sum_{x_{ij} \in C_k} x_{ij} \qquad (2)$$

2.2 Find the minimum of all the mean values corresponding to each class $C_k$.
2.3 Fill the missing values of each class $C_k$ with the minimum mean value of the same class $C_k$.
Phase 2
Step: Calculate the misclassification rate and accuracy by giving the filled-in complete dataset to a classifier.
End

4. Experimental Analysis

Our experiments were carried out using four datasets taken from the UCI Machine Learning Repository: the Diabetes, Breast Cancer, Lung Cancer and Iris datasets. Table 1 gives the number of instances and the number of attributes of the datasets used in this paper. The main objective of the experiments is to analyze the efficiency of the C4.5 classification algorithm. In these experiments, missing values are introduced artificially at different rates in different attributes: datasets without missing values are taken and a few values are removed from them at random. The rate of removed values ranges from 2% to 4%.

Table 1. Datasets used for analysis
Datasets         Instances   Attributes
Diabetes         7           9
Iris
Breast Cancer
Lung Cancer

A. Performance comparison of Diabetes dataset

The original dataset without missing values yields a classification accuracy of 73.83% and the proposed method increases the accuracy rate to .22%. The performance comparison of five different methods, together with the time taken to execute, is shown in table 2.

Table 2 : comparison using the diabetes dataset
Methods              Time   Misclassification rate
Discend (Proposed)

B. Performance comparison of Breast Cancer dataset

The original dataset without missing values yields a classification accuracy of 94.56% and the proposed method increases the accuracy rate to 94.71%. The performance comparison of five different methods, together with the time taken to execute, is shown in table 3.

Table 3 : comparison using the Breast Cancer dataset
Methods              Time   Misclassification rate
Discend (Proposed)

C. Performance comparison of IRIS dataset

The original dataset without missing values yields a classification accuracy of 96%, while the Most often method and the proposed method both reach an accuracy rate of 95.33%. The performance comparison of five different methods, together with the time taken to execute, is shown in table 4.

Table 4 : comparison using the IRIS dataset
Methods              Time   Misclassification rate
Discend (Proposed)

D. Performance comparison of Lung Cancer dataset

The original dataset without missing values yields a classification accuracy of .13% and the proposed method increases the accuracy rate to 79.42%. The performance comparison of five different methods, together with the time taken to execute, is shown in table 5.

Table 5 : comparison using the Lung Cancer dataset
Methods              Time   Misclassification rate
Discend (Proposed)
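For readers who wish to reproduce the comparisons above, Phase 1 of the pseudocode in Section 3.3 might be sketched for a single attribute as follows. This is one reading of the steps under stated assumptions, not the implementation used in the experiments (which were run in other tools): the function names are hypothetical, `None` marks a missing value, a boundary is also placed next to a value that occurs with more than one class, and a class with no observed value is left unfilled.

```python
from bisect import bisect_left
from collections import defaultdict


def class_boundary_cut_points(values, labels):
    """Midpoints between consecutive distinct values whose classes differ.

    Assumption: a boundary is also placed after a value that occurs with
    more than one class, since tie handling is not specified in the text.
    """
    classes_at = defaultdict(set)
    for v, c in zip(values, labels):
        if v is not None:  # None marks a missing value
            classes_at[v].add(c)
    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        if classes_at[lo] != classes_at[hi] or len(classes_at[lo]) > 1:
            cuts.append((lo + hi) / 2)
    return cuts


def impute_min_interval_mean(values, labels):
    """Fill each missing value with the smallest per-interval mean of its class."""
    cuts = class_boundary_cut_points(values, labels)
    # totals[(class, interval index)] -> [sum of values, number of values]
    totals = defaultdict(lambda: [0.0, 0])
    for v, c in zip(values, labels):
        if v is not None:
            cell = totals[(c, bisect_left(cuts, v))]
            cell[0] += v
            cell[1] += 1
    # Step 2.2: minimum over intervals of the class-specific means.
    min_mean = {}
    for (c, _interval), (total, count) in totals.items():
        mean = total / count
        min_mean[c] = min(mean, min_mean.get(c, mean))
    # Step 2.3: a class with no observed value stays missing (None).
    return [min_mean.get(c) if v is None else v for v, c in zip(values, labels)]


values = [1.0, 2.0, None, 6.0, 7.0, None, 10.0]
labels = ["a", "a", "a", "b", "b", "b", "a"]
print(class_boundary_cut_points(values, labels))  # [4.0, 8.5]
print(impute_min_interval_mean(values, labels))
# [1.0, 2.0, 1.5, 6.0, 7.0, 6.5, 10.0]
```

In the toy example, class "a" has interval means 1.5 and 10.0, so its missing value is filled with 1.5; class "b" has a single interval mean of 6.5. The filled column would then be passed to a classifier for Phase 2.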

Fig : 1a Percentage of accuracy for Diabetes dataset
Fig : 1b Percentage of accuracy for Breast Cancer dataset
Fig : 1c Percentage of accuracy for IRIS dataset
Fig : 1d Percentage of accuracy for Lung Cancer dataset
Fig 1a-1d : Comparison result of C4.5 for 6 methods using 4 datasets

5. Conclusion and Discussion

From the comparison above, the classification rate of the C4.5 classifier using the proposed method seems to be better than that of the remaining methods for three datasets, the exception being the IRIS dataset. Our experiment for filling the missing values was conducted using MATLAB and the classifier performance was analyzed using Weka 3.6. The missing value problem must be solved before a dataset is used, as incomplete data may lead to a high misclassification rate. This work analyses the classification performance of the C4.5 classifier. The proposed approach uses only the numerical attributes to impute the missing values; in future work it can be extended to handle categorical attributes. From the above comparison, the proposed method seems to be better than the other methods, as the accuracy rate increases for every dataset except IRIS. Also, because the missing values are filled with values found within the same class, the relevancy between the instances and the attributes can be increased, which gives better results.

References

[1] Acuna, E.; Rodriguez, C. (2004): The treatment of missing values and its effect in the classifier accuracy. In: W. Gaul, D. Banks, L. House, F.R. McMorris, P. Arabie (Eds.) Classification, Clustering and Data Mining Applications, Springer-Verlag Berlin-Heidelberg.
[2] Blessie, C.E.; Karthikeyan, E.; Selvaraj, B. (2010): NAD - A Discretization approach for improving interdependency, Journal of Advanced Research in Computer Science, 2(1).
[3] Clark, P.; Niblett, T. (1989): The CN2 induction algorithm. Machine Learning 3.
[4] Jerzy, W. Grzymala-Busse; Ming Hu (2005): A Comparison of Several Approaches to Missing Attribute Values in Data Mining, W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI, Springer-Verlag Berlin Heidelberg.
[5] Kalton, G.; Kasprzyk, D. (1986): The treatment of missing survey data. Survey Methodology 12.
[6] Kurgan, L.; Cios, K.J. (2004): CAIM discretization algorithm, IEEE Transactions on Knowledge and Data Engineering 16(2).
[7] Liu, H.; Setiono, R. (1997): Feature selection via discretization, IEEE Transactions on Knowledge and Data Engineering 9(4).
[8] Mehala, B.; Ranjit Jeba Thangaiah, P.; Vivekanandan, K. (2009): Selecting Scalable Algorithms to Deal with Missing Values, International Journal of Recent Trends in Engineering, 1(2).
[9] Mundfrom, D.J.; Whitcomb, A. (1998): Imputing missing values: The effect on the accuracy of classification. Multiple Linear Regression Viewpoints 25(1).
[10] Quinlan, J.R. (1993): C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo CA.
[11] Tresp, V.; Neuneier, R.; Ahmad, S. (1998): Efficient methods for dealing with missing data in supervised learning. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in NIPS 7. MIT Press.
[12] Tsai, C.J.; Lee, C.; Yang, W.P. (2008): A Discretization algorithm based on Class-Attribute Contingency Coefficient, Information Sciences, 1(3).

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis

Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach