Improving Imputation Accuracy in Ordinal Data Using Classification

Shafiq Alam 1, Gillian Dobbie 2, and XiaoBin Sun 2

1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand
shafiq.alam@whitireia.ac.nz
2 Department of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
gill@cs.auckland.ac.nz, allensun421@gmail.com

Abstract. Tackling missing data is one of the fundamental data pre-processing steps. Data analysis and pattern extraction are affected by the underlying differences between instances with and without missing data. This is a particular problem with ordinal data, where, for example, a sample of a population may all have failed to answer a specific question in a questionnaire. Existing methods such as listwise deletion, mean attribute substitution and regression substitution impute data naively: they fail to take the relationships between attributes into account and instead consider only the distribution of the attribute with missing values, so they do not impute plausible values. In this paper we introduce the use of Classifier-based Nominal Imputation (CNI) to replace missing values with plausible values in ordinal data. The results show that not only does the CNI-based technique outperform existing approaches for imputing missing values in ordinal data, it also helps to improve the classification accuracy of machine learning algorithms.

1 Introduction

Statisticians and data analysts often distinguish the quantities that can be represented in an attribute using different levels of measurement: nominal, ordinal, interval and ratio quantities [1]. A set of data is said to be ordinal if the contained values can be ranked or have a rating scale attached. The categories of an ordinal data set have a natural order, but the distinction between neighbouring values is not always the same. Examples of such data are often found in questionnaires; in the well-quoted biscuit example, suppose a group of people were asked how much they liked a biscuit on a rating scale of 1 to 5. A rating of 5 indicates more enjoyment than a rating of 4. However, the difference in enjoyment expressed by giving a rating of 2 rather than 1 might be much less than the difference expressed by giving a rating of 4 rather than 3. Mining ordinal data has many applications. For example, in empirical studies, surveys and questionnaires are one of the primary sources of data collection. Most of the data collected in this way is ordinal and involves human participants, and the analysis of such ordinal data is the main tool for understanding the psychological behaviour of participants.

However, humans tend to omit responding to some of the questions in questionnaires, and this leads to missing values in the ordinal data. There can be various reasons for the existence of missing values in a dataset: respondents may not want to answer questions on a sensitive topic, or there may be data transmission problems, or human or machine faults when the data is recorded. The mechanisms concerning the randomness of the missing data can be divided into three classes [2]. Missing Completely at Random (MCAR) occurs when the probability of an instance having a missing value for an attribute is unrelated both to the value of the variable itself and to the values of any other variables in the dataset. Missing at Random (MAR) occurs when the probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself. Not Missing at Random (NMAR) occurs when the probability of an instance having a missing value for an attribute is related to the value of the attribute itself [3].

De Leeuw et al. [4] have investigated the problems caused by missing data in statistical analyses, which include biased parameter estimates and inflated standard errors. These issues can be particularly serious when the cause of a missing response is directly related to the value itself. Also, when there are large amounts of missing data, the power of the statistical analysis can be severely compromised. In Data Mining and Knowledge Discovery from Databases (KDD), data quality is a major concern [5]. Despite the frequent occurrence of missing data in real-world data sets, most machine learning algorithms handle the problem naively, which introduces bias into the analysis. The attributes of most datasets are not independent of each other; therefore, missing values could be determined from the relationships among attributes.

Imputation is the procedure of replacing the missing values in a data set by some other value. Given the potential problems that arise from the presence of missing data, a variety of methods have been suggested for dealing with them. The most common and simplest approach is to omit the cases with missing data and run the analysis on the remaining dataset; this approach is usually called listwise deletion. A second approach is mean substitution, which simply substitutes the average value of an attribute for each of its missing values. There are problems with this approach: it adds no new information, it leads to an underestimate of error, and the substituted value may not be a plausible value, so it is not appropriate for ordinal data. A third approach is regression substitution, which uses the existing variables to make a prediction and then substitutes the predicted value as if it were an actually observed value. With this method the imputed value is in some way conditional on other information in the data set; however, the problem of error variance remains. Su et al. [6] proposed Classifier-based Nominal Imputation (CNI), an imputation technique for nominal data that views imputation as a classification problem. Their experimental results show that it improves the accuracy of imputation on nominal data and also improves the classification performance of some machine learning algorithms.

The objective of this paper is to apply the CNI technique proposed by Su et al. to ordinal features instead of nominal data, and then to compare and analyse the experimental results to understand its performance on ordinal datasets. We report two measures: firstly, estimation accuracy, which shows how accurate the imputation is for ordinal datasets compared with a baseline imputation technique, most common value imputation (MCI); and secondly, classification accuracy, which shows to what extent the CNI technique helps to improve the classification performance of some machine learning algorithms on ordinal datasets with missing data.
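As a point of reference for the MCI baseline used throughout the paper, the following minimal sketch (written in Python with pandas rather than the authors' WEKA setup; the function name and the toy data are ours) fills each attribute's missing entries with that attribute's most common observed value:

```python
import pandas as pd

def mci_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Most common value imputation (MCI): replace each column's missing
    entries with that column's mode (its most frequent observed value)."""
    imputed = df.copy()
    for col in imputed.columns:
        mode = imputed[col].mode(dropna=True)
        if not mode.empty:
            imputed[col] = imputed[col].fillna(mode.iloc[0])
    return imputed

# Hypothetical Likert-style ratings (1-5) with missing answers.
ratings = pd.DataFrame({"q1": [5, 4, None, 5, 4],
                        "q2": [2, None, 2, 3, 2]})
print(mci_impute(ratings))
```

Because the imputed value depends only on the distribution of the single attribute, this baseline illustrates exactly the limitation discussed above.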

Section 2 of the paper describes related work, Section 3 gives the background of the proposed technique, Section 4 explains the experimental procedure, the experiments, the experimental results and a discussion of those results, and Section 5 draws conclusions and proposes future directions for research.

2 Related Work

Most machine learning methods for classification assume the class values are unordered, but in many real-world applications the data do exhibit a natural order. In [1] the authors propose using the ordering information in the data by converting the ordinal class problem into binary class problems for machine-learning-based classification algorithms. The reported results show better classification accuracy than naive approaches. Moreover, compared to other algorithms that were specifically designed for the ordinal classification problem, their method can be applied to ordinal datasets without changing the underlying learning scheme. Huhn and Hullermeier [7] have shown that the method proposed by Frank and Hall [1] for ordinal classification is indeed able to exploit order information about classes.

Imputation for ordinal datasets such as questionnaires is common because ordinal data frequently suffers from missing values. In [8] the authors suggested methods to deal with the problem of missing data. In [9] the authors evaluated different methods for the imputation of missing values in Likert data, a kind of ordinal data that is collected for understanding human behaviour. Their results show that two methods, item-mean imputation and person-mean substitution, perform well for smaller ratios of missing data (less than 20%). The authors of [10] emphasized the significance of imputation and proposed relative mean substitution for Likert data; their comparisons show that, on the datasets they tested, it outperforms simple mean-based imputation. Su et al. [11] used accurate imputation techniques, such as Bayesian multiple imputation (BMI), predictive mean matching (PMM) and Expectation Maximization (EM), to preprocess missing data for machine learning algorithms. Their results show that EM and BMI impute data effectively in the case of data missing completely at random (MCAR). The limitation of this work is that it has been evaluated on nominal data only, and its effect on ordinal data is not known. Their proposed nominal imputation technique, Classifier-based Nominal Imputation (CNI), learns a classifier for a nominal feature and then applies the trained classifier to predict missing attribute values. They investigated different classification algorithms for imputation, and their results show that SVM and decision trees have the best performance. These learners also achieve much higher imputation accuracies than other commonly used nominal imputation techniques, such as k-nearest neighbour imputation (KNNI) and most common imputation (MCI). Their results further show an improvement in classification accuracy after such imputation has been performed to treat the missing data.

Fig. 1: Classifier Based Nominal Imputation (CNI) [6]

The comparison of different imputation methods for numerical and nominal data is common in this research area. Batista and Monard [12] investigated the use of the k-nearest neighbour algorithm as an imputation method for nominal datasets; their experimental results show that it outperforms the internal mechanisms used by C4.5 and CN2 to treat missing data, and that it also outperforms mean or mode imputation, a widely used method for treating missing values. The area of ordinal data imputation and its effects on learning is still under-explored. This paper intends to fill that gap: it proposes the use of CNI-based imputation for treating missing values in ordinal data and explores the effect of this imputation on learning for data mining.

3 Classifier Based Nominal Imputation Algorithm (CNI)

Classifier Based Nominal Imputation (CNI) was proposed for treating missing data in nominal attributes. The basic idea of CNI is to use classification for imputation. For each attribute f_i that contains missing values, a classifier c_i(.) is trained which takes the remaining n-1 attributes {f_j : j != i} of a tuple as input and returns a value for f_i of that tuple. CNI(L) uses the learning algorithm L to train the classifier c_i(.) on the tuples with non-missing values for f_i; it then uses c_i(.) to predict the missing values of this attribute for the remaining tuples. Algorithm 1 describes the functionality of the CNI imputation algorithm. After the missing data have been imputed using CNI(L), the imputed dataset is passed to a base learner B, which trains a classifier for the testing set of ordinal data; this classifier is then able to predict the class of tuples that contain incomplete ordinal data.
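To make the idea concrete, here is a minimal sketch of classifier-based imputation in Python with scikit-learn. It is illustrative only, not the authors' WEKA implementation of Algorithm 1; the name cni_impute and the mode-filling of the feature side are our own simplifying assumptions.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def cni_impute(df: pd.DataFrame, learner=None) -> pd.DataFrame:
    """CNI sketch: for each attribute with missing values, train a classifier
    on the tuples where it is observed (using the remaining attributes as
    features) and predict the missing entries as class labels."""
    if learner is None:
        learner = DecisionTreeClassifier()  # a J48/C4.5-like default
    result = df.copy()
    # Simplification (our assumption, not part of the paper): fill the
    # feature-side gaps with each column's mode so every classifier sees
    # complete inputs.
    features = df.apply(lambda c: c.fillna(c.mode(dropna=True).iloc[0]))
    for col in df.columns:
        missing = df[col].isna()
        if not missing.any():
            continue
        X = features.drop(columns=[col])
        c_i = clone(learner)                 # the classifier c_i(.)
        c_i.fit(X[~missing], df.loc[~missing, col])
        result.loc[missing, col] = c_i.predict(X[missing])
    return result
```

Passing, for example, RandomForestClassifier() or LogisticRegression() as the learner plays the role of the different CNI(L) variants compared in Section 4. The sketch assumes the ordinal attributes are numerically coded; string-valued attributes such as those of the CAR dataset would first need label encoding.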

4 Experimental setup

This section describes the datasets, the types of experiment and the experimental results. We performed two types of experiment: one to assess the imputation accuracy, and a second to assess the classification accuracy of learning algorithms on the imputed ordinal data. Five real-world ordinal datasets, drawn from the UCI machine learning repository and the WEKA dataset repository, have been explored (see the dataset descriptions in Table 1a). The dataset CAR contains 5 ordinal attributes and 2 non-ordinal attributes; all the other datasets contain ordinal attributes only. Except for CAR, the ordinal values in each dataset are represented by numbers whose values indicate rank information. Table 1b summarises the attributes of the CAR dataset.

Table 1: Datasets from the UCI machine learning repository and WEKA

(a) Dataset details: number of instances, number of attributes (ordinal/all) and number of classes for CAR, ERA, ESL, LSV and SWD.

(b) The CAR dataset

  ATTRIBUTE   VALUES
  buying      vhigh, high, med, low
  doors       2, 3, 4, 5more
  lug_boot    small, med, big
  maint       vhigh, high, med, low
  persons     2, 4, more
  safety      low, med, high
  class       unacc, acc, good, vgood

In order to analyse the behaviour of the CNI technique and to measure both estimation accuracy and classification accuracy at different missing-ratio levels, four datasets containing missing data were extracted from each original dataset by randomly deleting observed values with probabilities of 0.05, 0.1, 0.2 and 0.3 (missing ratios of 5%, 10%, 20% and 30%), using 10-fold cross-validation.

4.1 Experimental Results

Estimation accuracy: Since the datasets with missing values were generated from the original complete datasets, the ground truth for each missing value is known; the CNI algorithm can therefore be evaluated by comparing each imputed value against the corresponding true value. Estimation accuracy is calculated using the following equation:

EstimationAccuracy (in %) = \frac{1}{N} \sum_{j=1}^{N} I(X_j = Y_j)    (1)

where the function I returns 1 if X_j = Y_j and 0 otherwise, X_j is the imputed value of the j-th instance produced by the imputer, Y_j is the true value of the j-th instance from the original complete dataset, and N is the total number of imputed values.
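The missing-data generation and the accuracy measure of Equation (1) can be sketched as follows (again illustrative Python rather than the authors' WEKA pipeline; the helper names inject_mcar and estimation_accuracy are ours):

```python
import numpy as np
import pandas as pd

def inject_mcar(df: pd.DataFrame, p: float, seed: int = 0) -> pd.DataFrame:
    """Delete observed values completely at random with probability p
    (p = 0.05, 0.1, 0.2, 0.3 for the 5%-30% missing ratios)."""
    rng = np.random.default_rng(seed)
    drop = (rng.random(df.shape) < p) & df.notna().to_numpy()
    return df.mask(drop)              # dropped cells become NaN

def estimation_accuracy(original, imputed, corrupted) -> float:
    """Equation (1): share of artificially deleted cells whose imputed value
    equals the true value in the original complete dataset."""
    deleted = corrupted.isna() & original.notna()
    n = int(deleted.to_numpy().sum())
    correct = (imputed[deleted] == original[deleted]).to_numpy().sum()
    return 100.0 * correct / n if n else float("nan")
```

Since Equation (1) is reported as a percentage, the sketch multiplies the fraction of correctly recovered cells by 100.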

To measure the estimation accuracy of the CNI technique, five different imputation learners L have been used for each dataset: most common imputation (MCI), decision tree (C4.5/J48), support vector machine (SVM), random forest (RF) and logistic regression (LR). Applying these learners to each attribute of a dataset in WEKA yields the complete imputed dataset, and the estimation accuracy of the CNI technique at each missing ratio is obtained by comparing the imputed dataset with the original dataset. The results show that the CNI algorithm gives a significant improvement in estimation accuracy over the baseline most common imputation (MCI) technique on ordinal data: the imputed values are much closer to the original datasets than those produced by MCI. Also, estimation accuracy decreases as the missing-data ratio increases. Table 2 shows the imputation accuracies.

Table 2: Estimation accuracy for different missing ratios

(a) Missing ratio 5%
        MCI      J48      LR       RF       SVM
  CAR   33.37%   46.63%   44.19%   43.93%   44.89%
  LSV   27.20%   62.88%   49.16%   72.68%   42.92%
  ESL   32.08%   52.00%   55.00%   50.17%   54.83%
  ERA   12.24%   76.40%   77.60%   81.47%   74.20%
  SWD   39.82%   84.73%   58.09%   94.37%   53.82%

(b) Missing ratio 10%
        MCI      J48      LR       RF       SVM
  CAR   34.11%   40.12%   43.82%   41.39%   38.61%
  LSV   30.31%   55.64%   46.06%   68.08%   41.10%
  ESL   29.14%   49.25%   51.29%   47.62%   53.74%
  ERA   12.44%   72.93%   73.53%   79.73%   67.67%
  SWD   39.41%   77.55%   57.36%   92.09%   55.68%

(c) Missing ratio 20%
        MCI      J48      LR       RF       SVM
  CAR   35.75%   39.63%   41.77%   40.55%   38.99%
  LSV   28.90%   39.46%   42.95%   59.79%   37.33%
  ESL   30.10%   45.03%   46.60%   47.21%   48.57%
  ERA   12.80%   53.90%   65.87%   72.83%   57.63%
  SWD   39.14%   62.09%   53.50%   84.64%   50.73%

(d) Missing ratio 30%
        MCI      J48      LR       RF       SVM
  CAR   35.78%   38.81%   39.34%   39.27%   38.92%
  LSV   29.18%   33.33%   41.51%   52.44%   34.75%
  ESL   31.36%   42.87%   44.93%   45.25%   46.30%
  ERA   12.76%   37.07%   57.71%   67.31%   50.27%
  SWD   41.33%   52.70%   53.12%   75.85%   49.81%

For the missing ratio of 5%, the average estimation accuracy of the baseline imputer MCI is 28.94%, while the average estimation accuracy over all CNI imputers is 61%, a large improvement over MCI. The lowest value, 42.92%, is obtained using support vector machine (SVM) as the imputer on dataset LSV, and the highest estimation accuracy, 94.37%, is obtained using random forest (RF) as the imputer on dataset SWD. For the missing ratio of 10%, the average estimation accuracy of MCI is 29.08% and the average over all CNI imputers is 57.66%; the lowest value is 38.61%, using SVM as the imputer on dataset CAR, while the highest is 92.09%, using RF on dataset SWD. For the missing ratio of 20%, the average estimation accuracy of MCI is 29.34% and the average over all CNI imputers is 51.45%; the lowest value is 37.33%, using SVM on dataset LSV, while the highest is 84.64%, using RF on dataset SWD. For the missing ratio of 30%, the average estimation accuracy of MCI is 30.08% and the average over all CNI imputers is 47.08%; the lowest value is 33.33%, using decision tree (J48) on dataset LSV, while the highest is 75.85%, using RF on dataset SWD. All CNI-based algorithms outperform the baseline MCI algorithm, with significantly higher estimation accuracy.
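Purely as a usage illustration of the sketches above (not the authors' actual experiment code), the protocol behind Table 2 amounts to a grid over imputation learners and missing ratios:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative scikit-learn stand-ins for the WEKA learners of Table 2.
learners = {
    "MCI": None,  # baseline, handled by mci_impute below
    "J48": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}

def estimation_grid(datasets, ratios=(0.05, 0.10, 0.20, 0.30)):
    """Estimation accuracy for every (dataset, learner, missing ratio) cell,
    reusing the mci_impute, cni_impute, inject_mcar and estimation_accuracy
    helpers sketched earlier in this paper's illustrations."""
    results = {}
    for name, original in datasets.items():
        for p in ratios:
            corrupted = inject_mcar(original, p)
            for label, learner in learners.items():
                imputed = (mci_impute(corrupted) if learner is None
                           else cni_impute(corrupted, learner))
                results[(name, label, p)] = estimation_accuracy(
                    original, imputed, corrupted)
    return results
```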

4.2 Classification Accuracy

Classification accuracy is used to determine how much the CNI technique can improve classification performance when compared with classification on the missing-only datasets and on data imputed with the baseline technique, most common value imputation. Classification accuracy is calculated using the following equation:

ClassificationAccuracy (in %) = \frac{1}{N} \sum_{j=1}^{N} I(M_j = K_j)    (2)

where N is the total number of class values in the testing set, M_j is the class assigned to the j-th instance of the testing set by the classifier, K_j is the true value in the original non-missing testing set, and the function I returns 1 if M_j = K_j and 0 otherwise. We used the imputed datasets and randomly separated each dataset into a training set and a testing set (50%-50% and 20%-80% splits). We investigated the classification performance on the imputed data for three machine learners on incomplete ordinal data: the k-nearest neighbour algorithm (KNN-5), the Naive Bayes algorithm (NB) and the multilayer perceptron neural network algorithm (MLP). We measured the classification accuracy on both the original datasets, which contain no missing values, and the datasets that contain missing values at different ratios, and then compared these results.
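A corresponding sketch of this evaluation (illustrative Python rather than the authors' WEKA procedure; the exact corruption and splitting details here are our simplifying assumptions) corrupts and imputes the feature attributes of the training split and scores a base learner on the untouched test split:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def classification_accuracy(original, class_col, p=0.1, learner=None,
                            test_size=0.5, seed=0):
    """Equation (2) sketch: KNN-5 (by default) trained on CNI-imputed data,
    evaluated against the true classes of the held-out complete test set."""
    if learner is None:
        learner = KNeighborsClassifier(n_neighbors=5)
    train, test = train_test_split(original, test_size=test_size,
                                   random_state=seed)
    feature_cols = [c for c in original.columns if c != class_col]
    # Corrupt only the ordinal feature attributes of the training split,
    # then repair them with the cni_impute and inject_mcar sketches above.
    corrupted = train.copy()
    corrupted[feature_cols] = inject_mcar(train[feature_cols], p, seed=seed)
    repaired = corrupted.copy()
    repaired[feature_cols] = cni_impute(corrupted[feature_cols])
    learner.fit(repaired[feature_cols], repaired[class_col])
    predicted = learner.predict(test[feature_cols])
    return 100.0 * accuracy_score(test[class_col], predicted)
```

With test_size=0.5 this corresponds to the 50-50 split and with test_size=0.8 to the 20-80 split; swapping in, for example, GaussianNB or MLPClassifier as the learner mirrors the NB and MLP experiments reported below.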

Fig. 2: Classification accuracy for the five datasets (CAR, ESL, LSV, ERA and SWD), each shown for a 50-50 and a 20-80 train-test split.

4.3 K Nearest Neighbour Algorithm (KNN-5)

In our first experiment we evaluated the results using the k-nearest neighbour algorithm. For the dataset CAR, the original classification accuracy is 89.58%. For the same dataset containing missing values, the classification accuracy is considerably lower (78%), and with MCI imputation it drops further, to 74%. When the different CNI imputers are used as pre-processors, however, the classification accuracy improves and moves closer to the original classification accuracy at all missing ratios from 5% to 30%. The lift varies from a smallest value of 1.06% (using the CNI algorithm with J48 as the imputer at the 5% missing ratio) to a largest value of 11.24% (using the CNI algorithm with J48 as the imputer at the 20% missing ratio). Figure 2 shows the results of the experiment. For the same dataset CAR with 20% of the data as the training set and the remaining 80% as the testing set, the classification accuracy on the original dataset is 82.49%, and on the dataset containing missing values it decreases to 77%. Similar to the 50-50 split, the baseline imputation technique MCI performs worse on this split as well, whereas with the different CNI imputers as pre-processors the classification accuracies improve at all missing ratios from 5% to 30% compared to the dataset containing missing values; the lift varies from a smallest value of 0.18% to a largest value of 8.86%.

For the dataset ESL, the classification accuracy is 57.79% on the original data set with no missing values. MCI performs worst among all the imputers, while for the CNI-based algorithms the lift varies from a smallest value of 2.22% (using random forest (RF) as the imputer at the 10% missing ratio) to a largest value of 27.81% (using logistic regression (LR) as the imputer at the 30% missing ratio). For the dataset LSV, the classification accuracy is 60% on the original data set and drops to 50% on the dataset with missing values; using the different CNI imputers as pre-processors, the classification accuracy improves, with lifts from 1.36% (using RF as the imputer at the 5% missing ratio) to 23.58% (using support vector machine (SVM) as the imputer at the 30% missing ratio). For the dataset ERA, the classification accuracy is 22.80% on the original data set. On the dataset with missing values MCI performs worst, while the CNI imputers as pre-processors increase the classification accuracy by between 4.67% (using SVM as the imputer at the 5% missing ratio) and 57.89% (using LR as the imputer at the 20% missing ratio). For the dataset SWD, the classification accuracy is 56.80% on the original data. Using the different CNI imputers as pre-processors, the results overall show better classification accuracy, with the smallest lift of 0.71% occurring with decision tree (J48) as the imputer at the 5% missing ratio and the largest lift of 13.11% with SVM as the imputer at the 20% missing ratio.

Besides KNN, we also used the Naive Bayes algorithm and the multilayer perceptron neural network algorithm to assess the improvement in classification accuracy on the imputed datasets. Table 3 shows these results. The accuracy on the original data without missing values is 85.07% for Naive Bayes and 97.34% for MLP. For the sake of brevity, the detailed results have been omitted.

Table 3: Classification accuracy for Naive Bayes and Multilayer Perceptron Neural Network (MLP) on dataset CAR

(a) Naive Bayes
        Miss-Only   MCI      J48      LR       RF       SVM
  5%    83.10%      81.60%   86.23%   85.30%   84.26%   85.76%
  10%   82.06%      81.53%   86.46%   84.61%   84.95%   86.81%
  20%   79.86%      76.12%   85.30%   85.76%   83.22%   86.46%
  30%   80.32%      73.50%   81.48%   84.95%   81.02%   85.53%

(b) MLP
        Miss-Only   MCI      J48      LR       RF       SVM
  5%    93.40%      88.08%   95.97%   95.45%   97.11%   94.44%
  10%   90.51%      87.32%   94.33%   93.64%   92.71%   92.36%
  20%   84.03%      83.87%   88.54%   87.38%   86.57%   87.85%
  30%   80.44%      75.81%   85.88%   85.65%   84.72%   84.49%

The estimation accuracy of the different imputers varies with the missing ratio of each dataset, but on average random forest and logistic regression are among the top performers, appearing in the top two most frequently when compared with the baseline MCI technique.

Across all the CNI imputers and the MCI imputer investigated here, the ranking by average estimation accuracy is: RF, LR, J48, SVM and MCI. Note that these results are based on our 10-fold cross-validation. The results also show that estimation accuracy decreases as the missing ratio increases, which indicates that the correctness of CNI imputation suffers at higher ratios of missing data.

For the classification accuracy, the aim was to evaluate how much these imputers could improve the classification performance of different kinds of machine learning algorithms. K-nearest neighbour (KNN), Naive Bayes (NB) and the multilayer perceptron neural network (MLP) were used as the base learning algorithms. The results show that for KNN the classification accuracy decreases dramatically as the missing ratio increases; this happens even when the baseline imputation technique MCI is applied to handle the missing data, and in most cases the result is worse than on the original incomplete dataset. With the help of the CNI algorithm, however, the k-nearest neighbour algorithm achieves a significant improvement in classification accuracy compared with the results on the original incomplete datasets. This holds for all the missing ratios (5% to 30%) and for both splitting percentages (50%-50% and 20%-80%). All the CNI-based imputations show that datasets imputed using these techniques significantly improve the learning of machine learning algorithms. Overall, the results show that using a high-quality imputation technique such as the CNI algorithm to pre-process incomplete ordinal data improves the classification performance of machine learning algorithms.

5 Conclusions

Incomplete ordinal data is very common and is problematic in many fields of research, biasing data analysis because of underlying differences between instances with and without missing data. Many methods have been suggested, including listwise deletion, in which incomplete cases are discarded from the dataset, mean substitution, and regression substitution, which replace the missing values in a data set by some other value. However, these techniques ignore the relationships between attributes and instead impute an ordinal value based on a single attribute's distribution. In this work we investigated two research questions: firstly, how good CNI-based algorithms are at imputing ordinal data, and secondly, whether such imputation improves classification results. To tackle the first question we introduced the use of Classifier-based Nominal Imputation (CNI), an imputation technique for nominal data that views imputation as a classification task, for tackling the problem of missing ordinal data. We applied the CNI technique to five different real-world ordinal datasets, and the experimental results show that the CNI imputers have better imputation accuracy than the baseline imputation method, most common value imputation. We also assessed how these imputations improve the classification accuracy of machine learning algorithms on ordinal datasets with missing data, evaluating three classification algorithms: k-nearest neighbour, Naive Bayes and the multilayer perceptron neural network. The results show that the classification accuracy improves with CNI-based imputation as compared to the classification accuracy obtained on a dataset with missing ordinal data.

In some cases the classification accuracy of the CNI imputers even surpassed the classification accuracy on the original data with no missing values. In future work, it would be interesting to see how the CNI algorithm performs on mixed data types. It would also be beneficial to investigate ways of improving the algorithm, for example by using previously imputed values in later iterations of imputation, guided by the distribution of the attributes or by how dense or informative the attributes are.

References

1. E. Frank and M. Hall, "A simple approach to ordinal classification," Springer.
2. R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, Inc., New York.
3. J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, p. 147.
4. E. D. De Leeuw, J. Hox, and M. Huisman, "Prevention and treatment of item nonresponse," Journal of Official Statistics (Stockholm), vol. 19, no. 2.
5. W. H. Finch, "Imputation methods for missing categorical questionnaire data: A comparison of approaches," Journal of Data Science, vol. 8, no. 3.
6. X. Su, R. Greiner, T. M. Khoshgoftaar, and A. Napolitano, "Using classifier-based nominal imputation to improve machine learning," in Advances in Knowledge Discovery and Data Mining, Springer, 2011.
7. J. C. Huhn and E. Hullermeier, "Is an ordinal class structure useful in classifier learning?" International Journal of Data Mining, Modelling and Management, vol. 1, no. 1.
8. E. D. De Leeuw, "Reducing missing data in surveys: An overview of methods," Quality and Quantity, vol. 35, no. 2.
9. R. G. Downey and C. V. King, "Missing data in Likert ratings: A comparison of replacement methods," The Journal of General Psychology, vol. 125, no. 2.
10. Q. A. Raaijmakers, "Effectiveness of different missing data treatments in surveys with Likert-type data: Introducing the relative mean substitution approach," Educational and Psychological Measurement, vol. 59, no. 5.
11. X. Su, T. M. Khoshgoftaar, and R. Greiner, "Using imputation techniques to help learn accurate classifiers," in 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, IEEE, 2008.
12. G. E. Batista and M. C. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, 2003.


More information

Filter methods for feature selection. A comparative study

Filter methods for feature selection. A comparative study Filter methods for feature selection. A comparative study Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, and María Tombilla-Sanromán University of A Coruña, Department of Computer Science, 15071 A Coruña,

More information

Global Journal of Engineering Science and Research Management

Global Journal of Engineering Science and Research Management A NOVEL HYBRID APPROACH FOR PREDICTION OF MISSING VALUES IN NUMERIC DATASET V.B.Kamble* 1, S.N.Deshmukh 2 * 1 Department of Computer Science and Engineering, P.E.S. College of Engineering, Aurangabad.

More information

Machine Learning and Pervasive Computing

Machine Learning and Pervasive Computing Stephan Sigg Georg-August-University Goettingen, Computer Networks 17.12.2014 Overview and Structure 22.10.2014 Organisation 22.10.3014 Introduction (Def.: Machine learning, Supervised/Unsupervised, Examples)

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

Handling Data with Three Types of Missing Values:

Handling Data with Three Types of Missing Values: Handling Data with Three Types of Missing Values: A Simulation Study Jennifer Boyko Advisor: Ofer Harel Department of Statistics University of Connecticut Storrs, CT May 21, 2013 Jennifer Boyko Handling

More information

Prognosis of Lung Cancer Using Data Mining Techniques

Prognosis of Lung Cancer Using Data Mining Techniques Prognosis of Lung Cancer Using Data Mining Techniques 1 C. Saranya, M.Phil, Research Scholar, Dr.M.G.R.Chockalingam Arts College, Arni 2 K. R. Dillirani, Associate Professor, Department of Computer Science,

More information

CLASSIFICATION JELENA JOVANOVIĆ. Web:

CLASSIFICATION JELENA JOVANOVIĆ.   Web: CLASSIFICATION JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is classification? Binary and multiclass classification Classification algorithms Naïve Bayes (NB) algorithm

More information

The Performance of Multiple Imputation for Likert-type Items with Missing Data

The Performance of Multiple Imputation for Likert-type Items with Missing Data Journal of Modern Applied Statistical Methods Volume 9 Issue 1 Article 8 5-1-2010 The Performance of Multiple Imputation for Likert-type Items with Missing Data Walter Leite University of Florida, Walter.Leite@coe.ufl.edu

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Jesse Read 1, Albert Bifet 2, Bernhard Pfahringer 2, Geoff Holmes 2 1 Department of Signal Theory and Communications Universidad

More information

Data corruption, correction and imputation methods.

Data corruption, correction and imputation methods. Data corruption, correction and imputation methods. Yerevan 8.2 12.2 2016 Enrico Tucci Istat Outline Data collection methods Duplicated records Data corruption Data correction and imputation Data validation

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Development of Improved ID3 Algorithm with TkNN Clustering Using Car Dataset

Development of Improved ID3 Algorithm with TkNN Clustering Using Car Dataset Development of ID3 Algorithm with Clustering Using Car Dataset M.Jayakameswaraiah, and S.Ramakrishna Abstract - Data mining involves an integration of techniques from multiple disciplines such as database

More information

Attribute Reduction using Forward Selection and Relative Reduct Algorithm

Attribute Reduction using Forward Selection and Relative Reduct Algorithm Attribute Reduction using Forward Selection and Relative Reduct Algorithm P.Kalyani Associate Professor in Computer Science, SNR Sons College, Coimbatore, India. ABSTRACT Attribute reduction of an information

More information