Improving Imputation Accuracy in Ordinal Data Using Classification

Shafiq Alam 1, Gillian Dobbie 2, and XiaoBin Sun 2

1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand
shafiq.alam@whitireia.ac.nz
2 Department of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
gill@cs.auckland.ac.nz, allensun421@gmail.com

Abstract. Tackling missing data is one of the fundamental data pre-processing steps. Data analysis and pattern extraction are affected by the underlying differences between instances with and without missing data. This is a particular problem with ordinal data, where, for example, a sample of a population may all have failed to answer a specific question in a questionnaire. Existing methods such as listwise deletion, mean attribute substitution and regression substitution impute data naively: they fail to take the relationships between attributes into account and instead consider only the distribution of the attribute with missing values, so they do not impute plausible values. In this paper we introduce the use of Classifier-based Nominal Imputation (CNI) to replace missing values with plausible values in ordinal data. The results show that not only does the CNI-based technique outperform existing approaches for imputing missing values in ordinal data, it also helps to improve the classification accuracy of machine learning algorithms.

1 Introduction

Statisticians and data analysts often distinguish the quantities that can be represented in an attribute using different levels of measurement: nominal, ordinal, interval and ratio quantities [1]. A set of data is said to be ordinal if the contained values can be ranked or have a rating scale attached. The categories of an ordinal data set have a natural order, but the distinction between neighbouring values is not always the same. Examples of such data are often found in questionnaires; in the well-quoted biscuit example, suppose a group of people were asked how much they liked a biscuit on a rating scale of 1 to 5. A rating of 5 indicates more enjoyment than a rating of 4. However, the difference in enjoyment expressed by giving a rating of 2 rather than 1 might be much less than the difference expressed by giving a rating of 4 rather than 3. Mining ordinal data has many applications. For example, in empirical studies, surveys and questionnaires are one of the primary sources of data collection. Most of the data collected in this way is ordinal and involves human participants, and the analysis of such ordinal data is the main tool for understanding the psychological behaviour of participants.

However, humans tend to omit responding to some of the questions in questionnaires, and this leads to missing values in the ordinal data. There can be various reasons for the existence of missing values in a dataset: respondents may not want to answer questions on a sensitive topic, or there may be data transmission problems, or human or machine faults when the data is recorded. The mechanisms concerning the randomness of the missing data can be divided into three classes [2]. Missing Completely at Random (MCAR) occurs when the probability of an instance having a missing value for an attribute is unrelated both to the value of the variable itself and to the values of any other variables in the dataset. Missing at Random (MAR) occurs when the probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself. Not Missing at Random (NMAR) occurs when the probability of an instance having a missing value for an attribute is related to the value of the attribute itself [3].

De Leeuw et al. [4] have investigated the problems caused by missing data in statistical analyses, which include biased parameter estimates and inflated standard errors. These issues can be particularly serious when the cause of a missing response is directly related to the value itself. Also, when there are large amounts of missing data, the power of the statistical analysis can be severely compromised. In Data Mining and Knowledge Discovery from Databases (KDD), data quality is a major concern [5]. Despite the frequent occurrence of missing data in real-world data sets, most machine learning algorithms handle the problem naively, which introduces bias into the analysis. The attributes of most datasets are not independent of each other; therefore, missing values could be determined from the relationships among attributes.

Imputation is the procedure of replacing the missing values in a data set by some other value. Given the potential problems that arise from the presence of missing data, a variety of methods have been suggested for dealing with them. The most common and simplest approach is to omit the cases with missing data and run the analysis on the remaining dataset; this approach is usually called listwise deletion. A second approach is mean substitution, which simply substitutes the average value of an attribute for each of its missing values. There are problems with this approach: it adds no new information, it leads to an underestimate of error, and the substituted value may not be a plausible value, so it is not appropriate for ordinal data. A third approach is regression substitution, which uses the existing variables to make a prediction and then substitutes the predicted value as if it were an actually observed value. With this method the imputed value is in some way conditional on other information in the data set; however, the problem of error variance remains. Su et al. [6] proposed Classifier-based Nominal Imputation (CNI), an imputation technique for nominal data that views imputation as a classification problem. Their experimental results show that it improves the accuracy of imputation on nominal data and also improves the classification performance of some machine learning algorithms.

The objective of this paper is to apply the CNI technique proposed by Su et al. to ordinal features instead of nominal data, and then to compare and analyse the experimental results to understand its performance on ordinal datasets. We report two measures: firstly, estimation accuracy, which shows how accurate the imputation is for ordinal datasets compared with a baseline imputation technique, most common value imputation (MCI); and secondly, classification accuracy, which shows to what extent the CNI technique helps to improve the classification performance of some machine learning algorithms on ordinal datasets with missing data.
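As a point of reference for the MCI baseline used throughout the paper, the following minimal sketch (written in Python with pandas rather than the authors' WEKA setup; the function name and the toy data are ours) fills each attribute's missing entries with that attribute's most common observed value:

```python
import pandas as pd

def mci_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Most common value imputation (MCI): replace each column's missing
    entries with that column's mode (its most frequent observed value)."""
    imputed = df.copy()
    for col in imputed.columns:
        mode = imputed[col].mode(dropna=True)
        if not mode.empty:
            imputed[col] = imputed[col].fillna(mode.iloc[0])
    return imputed

# Hypothetical Likert-style ratings (1-5) with missing answers.
ratings = pd.DataFrame({"q1": [5, 4, None, 5, 4],
                        "q2": [2, None, 2, 3, 2]})
print(mci_impute(ratings))
```

Because the imputed value depends only on the distribution of the single attribute, this baseline illustrates exactly the limitation discussed above.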

Section 2 of the paper describes related work, Section 3 gives the background of the proposed technique, Section 4 explains the experimental procedure, the experiments, the experimental results and a discussion of those results, and Section 5 draws conclusions and proposes future directions for research.

2 Related Work

Most machine learning methods for classification assume the class values are unordered, but in many real-world applications the data do exhibit a natural order. In [1] the authors propose using the ordering information in the data by converting the ordinal class problem into binary class problems for machine-learning-based classification algorithms. The reported results show better classification accuracy than naive approaches. Moreover, compared to other algorithms that were specifically designed for the ordinal classification problem, their method can be applied to ordinal datasets without changing the underlying learning scheme. Huhn and Hullermeier [7] have shown that the method proposed by Frank and Hall [1] for ordinal classification is indeed able to exploit order information about classes.

Imputation for ordinal datasets such as questionnaires is common because ordinal data frequently suffers from missing values. In [8] the authors suggested methods to deal with the problem of missing data. In [9] the authors evaluated different methods for the imputation of missing values in Likert data, a kind of ordinal data that is collected for understanding human behaviour. Their results show that two methods, item-mean imputation and person-mean substitution, perform well for smaller ratios of missing data (less than 20%). The authors of [10] emphasized the significance of imputation and proposed relative mean substitution for Likert data; their comparisons show that, on the datasets they tested, it outperforms simple mean-based imputation. Su et al. [11] used accurate imputation techniques, such as Bayesian multiple imputation (BMI), predictive mean matching (PMM) and Expectation Maximization (EM), to preprocess missing data for machine learning algorithms. Their results show that EM and BMI impute data effectively in the case of data missing completely at random (MCAR). The limitation of this work is that it has been evaluated on nominal data only, and its effect on ordinal data is not known. Their proposed nominal imputation technique, Classifier-based Nominal Imputation (CNI), learns a classifier for a nominal feature and then applies the trained classifier to predict missing attribute values. They investigated different classification algorithms for imputation, and their results show that SVM and decision trees have the best performance. These learners also achieve much higher imputation accuracies than other commonly used nominal imputation techniques, such as k-nearest neighbour imputation (KNNI) and most common imputation (MCI). Their results further show an improvement in classification accuracy after such imputation has been performed to treat the missing data.

Fig. 1: Classifier Based Nominal Imputation (CNI) [6]

The comparison of different imputation methods for numerical and nominal data is common in this research area. Batista and Monard [12] investigated the use of the k-nearest neighbour algorithm as an imputation method for nominal datasets; their experimental results show that it outperforms the internal mechanisms used by C4.5 and CN2 to treat missing data, and that it also outperforms mean or mode imputation, a widely used method for treating missing values. The area of ordinal data imputation and its effects on learning is still under-explored. This paper intends to fill that gap: it proposes the use of CNI-based imputation for treating missing values in ordinal data and explores the effect of this imputation on learning for data mining.

3 Classifier Based Nominal Imputation Algorithm (CNI)

Classifier Based Nominal Imputation (CNI) was proposed for treating missing data in nominal attributes. The basic idea of CNI is to use classification for imputation. For each attribute f_i that contains missing values, a classifier c_i(.) is trained which takes the remaining n-1 attributes {f_j : j != i} of a tuple as input and returns a value for f_i of that tuple. CNI(L) uses the learning algorithm L to train the classifier c_i(.) on the tuples with non-missing values for f_i; it then uses c_i(.) to predict the missing values of this attribute for the remaining tuples. Algorithm 1 describes the functionality of the CNI imputation algorithm. After the missing data have been imputed using CNI(L), the imputed dataset is passed to a base learner B, which trains a classifier for the testing set of ordinal data; this classifier is then able to predict the class of tuples that contain incomplete ordinal data.
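To make the idea concrete, here is a minimal sketch of classifier-based imputation in Python with scikit-learn. It is illustrative only, not the authors' WEKA implementation of Algorithm 1; the name cni_impute and the mode-filling of the feature side are our own simplifying assumptions.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def cni_impute(df: pd.DataFrame, learner=None) -> pd.DataFrame:
    """CNI sketch: for each attribute with missing values, train a classifier
    on the tuples where it is observed (using the remaining attributes as
    features) and predict the missing entries as class labels."""
    if learner is None:
        learner = DecisionTreeClassifier()  # a J48/C4.5-like default
    result = df.copy()
    # Simplification (our assumption, not part of the paper): fill the
    # feature-side gaps with each column's mode so every classifier sees
    # complete inputs.
    features = df.apply(lambda c: c.fillna(c.mode(dropna=True).iloc[0]))
    for col in df.columns:
        missing = df[col].isna()
        if not missing.any():
            continue
        X = features.drop(columns=[col])
        c_i = clone(learner)                 # the classifier c_i(.)
        c_i.fit(X[~missing], df.loc[~missing, col])
        result.loc[missing, col] = c_i.predict(X[missing])
    return result
```

Passing, for example, RandomForestClassifier() or LogisticRegression() as the learner plays the role of the different CNI(L) variants compared in Section 4. The sketch assumes the ordinal attributes are numerically coded; string-valued attributes such as those of the CAR dataset would first need label encoding.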

4 Experimental setup

This section describes the datasets, the types of experiment and the experimental results. We performed two types of experiment: one to assess the imputation accuracy, and a second to assess the classification accuracy of learning algorithms on the imputed ordinal data. Five real-world ordinal datasets, drawn from the UCI machine learning repository and the WEKA dataset repository, have been explored (see the dataset descriptions in Table 1a). The dataset CAR contains 5 ordinal attributes and 2 non-ordinal attributes; all the other datasets contain ordinal attributes only. Except for CAR, the ordinal values in each dataset are represented by numbers whose values indicate rank information. Table 1b summarises the attributes of the CAR dataset.

Table 1: Datasets from the UCI machine learning repository and WEKA

(a) Dataset details: number of instances, number of attributes (ordinal/all) and number of classes for CAR, ERA, ESL, LSV and SWD.

(b) The CAR dataset

  ATTRIBUTE   VALUES
  buying      vhigh, high, med, low
  doors       2, 3, 4, 5more
  lug_boot    small, med, big
  maint       vhigh, high, med, low
  persons     2, 4, more
  safety      low, med, high
  class       unacc, acc, good, vgood

In order to analyse the behaviour of the CNI technique and to measure both estimation accuracy and classification accuracy at different missing-ratio levels, four datasets containing missing data were extracted from each original dataset by randomly deleting observed values with probabilities of 0.05, 0.1, 0.2 and 0.3 (missing ratios of 5%, 10%, 20% and 30%), using 10-fold cross-validation.

4.1 Experimental Results

Estimation accuracy: Since the datasets with missing values were generated from the original complete datasets, the ground truth for each missing value is known; the CNI algorithm can therefore be evaluated by comparing each imputed value against the corresponding true value. Estimation accuracy is calculated using the following equation:

EstimationAccuracy (in %) = \frac{1}{N} \sum_{j=1}^{N} I(X_j = Y_j)    (1)

where the function I returns 1 if X_j = Y_j and 0 otherwise, X_j is the imputed value of the j-th instance produced by the imputer, Y_j is the true value of the j-th instance from the original complete dataset, and N is the total number of imputed values.
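The missing-data generation and the accuracy measure of Equation (1) can be sketched as follows (again illustrative Python rather than the authors' WEKA pipeline; the helper names inject_mcar and estimation_accuracy are ours):

```python
import numpy as np
import pandas as pd

def inject_mcar(df: pd.DataFrame, p: float, seed: int = 0) -> pd.DataFrame:
    """Delete observed values completely at random with probability p
    (p = 0.05, 0.1, 0.2, 0.3 for the 5%-30% missing ratios)."""
    rng = np.random.default_rng(seed)
    drop = (rng.random(df.shape) < p) & df.notna().to_numpy()
    return df.mask(drop)              # dropped cells become NaN

def estimation_accuracy(original, imputed, corrupted) -> float:
    """Equation (1): share of artificially deleted cells whose imputed value
    equals the true value in the original complete dataset."""
    deleted = corrupted.isna() & original.notna()
    n = int(deleted.to_numpy().sum())
    correct = (imputed[deleted] == original[deleted]).to_numpy().sum()
    return 100.0 * correct / n if n else float("nan")
```

Since Equation (1) is reported as a percentage, the sketch multiplies the fraction of correctly recovered cells by 100.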

To measure the estimation accuracy of the CNI technique, five different imputation learners L have been used for each dataset: most common imputation (MCI), decision tree (C4.5/J48), support vector machine (SVM), random forest (RF) and logistic regression (LR). Applying these learners to each attribute of a dataset in WEKA yields the complete imputed dataset, and the estimation accuracy of the CNI technique at each missing ratio is obtained by comparing the imputed dataset with the original dataset. The results show that the CNI algorithm gives a significant improvement in estimation accuracy over the baseline most common imputation (MCI) technique on ordinal data: the imputed values are much closer to the original datasets than those produced by MCI. Also, estimation accuracy decreases as the missing-data ratio increases. Table 2 shows the imputation accuracies.

Table 2: Estimation accuracy for different missing ratios

(a) Missing ratio 5%
        MCI      J48      LR       RF       SVM
  CAR   33.37%   46.63%   44.19%   43.93%   44.89%
  LSV   27.20%   62.88%   49.16%   72.68%   42.92%
  ESL   32.08%   52.00%   55.00%   50.17%   54.83%
  ERA   12.24%   76.40%   77.60%   81.47%   74.20%
  SWD   39.82%   84.73%   58.09%   94.37%   53.82%

(b) Missing ratio 10%
        MCI      J48      LR       RF       SVM
  CAR   34.11%   40.12%   43.82%   41.39%   38.61%
  LSV   30.31%   55.64%   46.06%   68.08%   41.10%
  ESL   29.14%   49.25%   51.29%   47.62%   53.74%
  ERA   12.44%   72.93%   73.53%   79.73%   67.67%
  SWD   39.41%   77.55%   57.36%   92.09%   55.68%

(c) Missing ratio 20%
        MCI      J48      LR       RF       SVM
  CAR   35.75%   39.63%   41.77%   40.55%   38.99%
  LSV   28.90%   39.46%   42.95%   59.79%   37.33%
  ESL   30.10%   45.03%   46.60%   47.21%   48.57%
  ERA   12.80%   53.90%   65.87%   72.83%   57.63%
  SWD   39.14%   62.09%   53.50%   84.64%   50.73%

(d) Missing ratio 30%
        MCI      J48      LR       RF       SVM
  CAR   35.78%   38.81%   39.34%   39.27%   38.92%
  LSV   29.18%   33.33%   41.51%   52.44%   34.75%
  ESL   31.36%   42.87%   44.93%   45.25%   46.30%
  ERA   12.76%   37.07%   57.71%   67.31%   50.27%
  SWD   41.33%   52.70%   53.12%   75.85%   49.81%

For the missing ratio of 5%, the average estimation accuracy of the baseline imputer MCI is 28.94%, while the average estimation accuracy over all CNI imputers is 61%, a large improvement over MCI. The lowest value, 42.92%, is obtained using support vector machine (SVM) as the imputer on dataset LSV, and the highest estimation accuracy, 94.37%, is obtained using random forest (RF) as the imputer on dataset SWD. For the missing ratio of 10%, the average estimation accuracy of MCI is 29.08% and the average over all CNI imputers is 57.66%; the lowest value is 38.61%, using SVM as the imputer on dataset CAR, while the highest is 92.09%, using RF on dataset SWD. For the missing ratio of 20%, the average estimation accuracy of MCI is 29.34% and the average over all CNI imputers is 51.45%; the lowest value is 37.33%, using SVM on dataset LSV, while the highest is 84.64%, using RF on dataset SWD. For the missing ratio of 30%, the average estimation accuracy of MCI is 30.08% and the average over all CNI imputers is 47.08%; the lowest value is 33.33%, using decision tree (J48) on dataset LSV, while the highest is 75.85%, using RF on dataset SWD. All CNI-based algorithms outperform the baseline MCI algorithm, with significantly higher estimation accuracy.
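Purely as a usage illustration of the sketches above (not the authors' actual experiment code), the protocol behind Table 2 amounts to a grid over imputation learners and missing ratios:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative scikit-learn stand-ins for the WEKA learners of Table 2.
learners = {
    "MCI": None,  # baseline, handled by mci_impute below
    "J48": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}

def estimation_grid(datasets, ratios=(0.05, 0.10, 0.20, 0.30)):
    """Estimation accuracy for every (dataset, learner, missing ratio) cell,
    reusing the mci_impute, cni_impute, inject_mcar and estimation_accuracy
    helpers sketched earlier in this paper's illustrations."""
    results = {}
    for name, original in datasets.items():
        for p in ratios:
            corrupted = inject_mcar(original, p)
            for label, learner in learners.items():
                imputed = (mci_impute(corrupted) if learner is None
                           else cni_impute(corrupted, learner))
                results[(name, label, p)] = estimation_accuracy(
                    original, imputed, corrupted)
    return results
```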

4.2 Classification Accuracy

Classification accuracy is used to determine how much the CNI technique can improve classification performance when compared with classification on the missing-only datasets and on data imputed with the baseline technique, most common value imputation. Classification accuracy is calculated using the following equation:

ClassificationAccuracy (in %) = \frac{1}{N} \sum_{j=1}^{N} I(M_j = K_j)    (2)

where N is the total number of class values in the testing set, M_j is the class assigned to the j-th instance of the testing set by the classifier, K_j is the true value in the original non-missing testing set, and the function I returns 1 if M_j = K_j and 0 otherwise. We used the imputed datasets and randomly separated each dataset into a training set and a testing set (50%-50% and 20%-80% splits). We investigated the classification performance on the imputed data for three machine learners on incomplete ordinal data: the k-nearest neighbour algorithm (KNN-5), the Naive Bayes algorithm (NB) and the multilayer perceptron neural network algorithm (MLP). We measured the classification accuracy on both the original datasets, which contain no missing values, and the datasets that contain missing values at different ratios, and then compared these results.
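A corresponding sketch of this evaluation (illustrative Python rather than the authors' WEKA procedure; the exact corruption and splitting details here are our simplifying assumptions) corrupts and imputes the feature attributes of the training split and scores a base learner on the untouched test split:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def classification_accuracy(original, class_col, p=0.1, learner=None,
                            test_size=0.5, seed=0):
    """Equation (2) sketch: KNN-5 (by default) trained on CNI-imputed data,
    evaluated against the true classes of the held-out complete test set."""
    if learner is None:
        learner = KNeighborsClassifier(n_neighbors=5)
    train, test = train_test_split(original, test_size=test_size,
                                   random_state=seed)
    feature_cols = [c for c in original.columns if c != class_col]
    # Corrupt only the ordinal feature attributes of the training split,
    # then repair them with the cni_impute and inject_mcar sketches above.
    corrupted = train.copy()
    corrupted[feature_cols] = inject_mcar(train[feature_cols], p, seed=seed)
    repaired = corrupted.copy()
    repaired[feature_cols] = cni_impute(corrupted[feature_cols])
    learner.fit(repaired[feature_cols], repaired[class_col])
    predicted = learner.predict(test[feature_cols])
    return 100.0 * accuracy_score(test[class_col], predicted)
```

With test_size=0.5 this corresponds to the 50-50 split and with test_size=0.8 to the 20-80 split; swapping in, for example, GaussianNB or MLPClassifier as the learner mirrors the NB and MLP experiments reported below.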

Fig. 2: Classification accuracy for the five datasets (CAR, ESL, LSV, ERA and SWD), each shown for a 50-50 and a 20-80 train-test split.

4.3 K Nearest Neighbour Algorithm (KNN-5)

In our first experiment we evaluated the results using the k-nearest neighbour algorithm. For the dataset CAR, the original classification accuracy is 89.58%. For the same dataset containing missing values, the classification accuracy is considerably lower (78%), and with MCI imputation it drops further, to 74%. When the different CNI imputers are used as pre-processors, however, the classification accuracy improves and moves closer to the original classification accuracy at all missing ratios from 5% to 30%. The lift varies from a smallest value of 1.06% (using the CNI algorithm with J48 as the imputer at the 5% missing ratio) to a largest value of 11.24% (using the CNI algorithm with J48 as the imputer at the 20% missing ratio). Figure 2 shows the results of the experiment. For the same dataset CAR with 20% of the data as the training set and the remaining 80% as the testing set, the classification accuracy on the original dataset is 82.49%, and on the dataset containing missing values it decreases to 77%. Similar to the 50-50 split, the baseline imputation technique MCI performs worse on this split as well, whereas with the different CNI imputers as pre-processors the classification accuracies improve at all missing ratios from 5% to 30% compared to the dataset containing missing values; the lift varies from a smallest value of 0.18% to a largest value of 8.86%.

For the dataset ESL, the classification accuracy is 57.79% on the original data set with no missing values. MCI performs worst among all the imputers, while for the CNI-based algorithms the lift varies from a smallest value of 2.22% (using random forest (RF) as the imputer at the 10% missing ratio) to a largest value of 27.81% (using logistic regression (LR) as the imputer at the 30% missing ratio). For the dataset LSV, the classification accuracy is 60% on the original data set and drops to 50% on the dataset with missing values; using the different CNI imputers as pre-processors, the classification accuracy improves, with lifts from 1.36% (using RF as the imputer at the 5% missing ratio) to 23.58% (using support vector machine (SVM) as the imputer at the 30% missing ratio). For the dataset ERA, the classification accuracy is 22.80% on the original data set. On the dataset with missing values MCI performs worst, while the CNI imputers as pre-processors increase the classification accuracy by between 4.67% (using SVM as the imputer at the 5% missing ratio) and 57.89% (using LR as the imputer at the 20% missing ratio). For the dataset SWD, the classification accuracy is 56.80% on the original data. Using the different CNI imputers as pre-processors, the results overall show better classification accuracy, with the smallest lift of 0.71% occurring with decision tree (J48) as the imputer at the 5% missing ratio and the largest lift of 13.11% with SVM as the imputer at the 20% missing ratio.

Besides KNN, we also used the Naive Bayes algorithm and the multilayer perceptron neural network algorithm to assess the improvement in classification accuracy on the imputed datasets. Table 3 shows these results. The accuracy on the original data without missing values is 85.07% for Naive Bayes and 97.34% for MLP. For the sake of brevity, the detailed results have been omitted.

Table 3: Classification accuracy for Naive Bayes and Multilayer Perceptron Neural Network (MLP) on dataset CAR

(a) Naive Bayes
        Miss-Only   MCI      J48      LR       RF       SVM
  5%    83.10%      81.60%   86.23%   85.30%   84.26%   85.76%
  10%   82.06%      81.53%   86.46%   84.61%   84.95%   86.81%
  20%   79.86%      76.12%   85.30%   85.76%   83.22%   86.46%
  30%   80.32%      73.50%   81.48%   84.95%   81.02%   85.53%

(b) MLP
        Miss-Only   MCI      J48      LR       RF       SVM
  5%    93.40%      88.08%   95.97%   95.45%   97.11%   94.44%
  10%   90.51%      87.32%   94.33%   93.64%   92.71%   92.36%
  20%   84.03%      83.87%   88.54%   87.38%   86.57%   87.85%
  30%   80.44%      75.81%   85.88%   85.65%   84.72%   84.49%

The estimation accuracy of the different imputers varies with the missing ratio of each dataset, but on average random forest and logistic regression are among the top performers, appearing in the top two most frequently when compared with the baseline MCI technique.

Across all the CNI imputers and the MCI imputer investigated here, the ranking by average estimation accuracy is: RF, LR, J48, SVM and MCI. Note that these results are based on our 10-fold cross-validation. The results also show that estimation accuracy decreases as the missing ratio increases, which indicates that the correctness of CNI imputation suffers at higher ratios of missing data.

For the classification accuracy, the aim was to evaluate how much these imputers could improve the classification performance of different kinds of machine learning algorithms. K-nearest neighbour (KNN), Naive Bayes (NB) and the multilayer perceptron neural network (MLP) were used as the base learning algorithms. The results show that for KNN the classification accuracy decreases dramatically as the missing ratio increases; this happens even when the baseline imputation technique MCI is applied to handle the missing data, and in most cases the result is worse than on the original incomplete dataset. With the help of the CNI algorithm, however, the k-nearest neighbour algorithm achieves a significant improvement in classification accuracy compared with the results on the original incomplete datasets. This holds for all the missing ratios (5% to 30%) and for both splitting percentages (50%-50% and 20%-80%). All the CNI-based imputations show that datasets imputed using these techniques significantly improve the learning of machine learning algorithms. Overall, the results show that using a high-quality imputation technique such as the CNI algorithm to pre-process incomplete ordinal data improves the classification performance of machine learning algorithms.

5 Conclusions

Incomplete ordinal data is very common and is problematic in many fields of research, biasing data analysis because of underlying differences between instances with and without missing data. Many methods have been suggested, including listwise deletion, in which incomplete cases are discarded from the dataset, mean substitution, and regression substitution, which replace the missing values in a data set by some other value. However, these techniques ignore the relationships between attributes and instead impute an ordinal value based on a single attribute's distribution. In this work we investigated two research questions: firstly, how good CNI-based algorithms are at imputing ordinal data, and secondly, whether such imputation improves classification results. To tackle the first question we introduced the use of Classifier-based Nominal Imputation (CNI), an imputation technique for nominal data that views imputation as a classification task, for tackling the problem of missing ordinal data. We applied the CNI technique to five different real-world ordinal datasets, and the experimental results show that the CNI imputers have better imputation accuracy than the baseline imputation method, most common value imputation. We also assessed how these imputations improve the classification accuracy of machine learning algorithms on ordinal datasets with missing data, evaluating three classification algorithms: k-nearest neighbour, Naive Bayes and the multilayer perceptron neural network. The results show that the classification accuracy improves with CNI-based imputation as compared to the classification accuracy obtained on a dataset with missing ordinal data.

In some cases the classification accuracy of the CNI imputers even surpassed the classification accuracy on the original data with no missing values. In future work, it would be interesting to see how the CNI algorithm performs on mixed data types. It would also be beneficial to investigate ways of improving the algorithm, for example by using previously imputed values in later iterations of imputation, guided by the distribution of the attributes or by how dense or informative the attributes are.

References

1. E. Frank and M. Hall, "A simple approach to ordinal classification," Springer.
2. R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, Inc., New York.
3. J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, p. 147.
4. E. D. De Leeuw, J. Hox, and M. Huisman, "Prevention and treatment of item nonresponse," Journal of Official Statistics (Stockholm), vol. 19, no. 2.
5. W. H. Finch, "Imputation methods for missing categorical questionnaire data: A comparison of approaches," Journal of Data Science, vol. 8, no. 3.
6. X. Su, R. Greiner, T. M. Khoshgoftaar, and A. Napolitano, "Using classifier-based nominal imputation to improve machine learning," in Advances in Knowledge Discovery and Data Mining, Springer, 2011.
7. J. C. Huhn and E. Hullermeier, "Is an ordinal class structure useful in classifier learning?" International Journal of Data Mining, Modelling and Management, vol. 1, no. 1.
8. E. D. De Leeuw, "Reducing missing data in surveys: An overview of methods," Quality and Quantity, vol. 35, no. 2.
9. R. G. Downey and C. V. King, "Missing data in Likert ratings: A comparison of replacement methods," The Journal of General Psychology, vol. 125, no. 2.
10. Q. A. Raaijmakers, "Effectiveness of different missing data treatments in surveys with Likert-type data: Introducing the relative mean substitution approach," Educational and Psychological Measurement, vol. 59, no. 5.
11. X. Su, T. M. Khoshgoftaar, and R. Greiner, "Using imputation techniques to help learn accurate classifiers," in 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, IEEE, 2008.
12. G. E. Batista and M. C. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, 2003.


More information

Filter methods for feature selection. A comparative study

Filter methods for feature selection. A comparative study Filter methods for feature selection. A comparative study Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, and María Tombilla-Sanromán University of A Coruña, Department of Computer Science, 15071 A Coruña,

More information

Global Journal of Engineering Science and Research Management

Global Journal of Engineering Science and Research Management A NOVEL HYBRID APPROACH FOR PREDICTION OF MISSING VALUES IN NUMERIC DATASET V.B.Kamble* 1, S.N.Deshmukh 2 * 1 Department of Computer Science and Engineering, P.E.S. College of Engineering, Aurangabad.

More information

Machine Learning and Pervasive Computing

Machine Learning and Pervasive Computing Stephan Sigg Georg-August-University Goettingen, Computer Networks 17.12.2014 Overview and Structure 22.10.2014 Organisation 22.10.3014 Introduction (Def.: Machine learning, Supervised/Unsupervised, Examples)

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

Handling Data with Three Types of Missing Values:

Handling Data with Three Types of Missing Values: Handling Data with Three Types of Missing Values: A Simulation Study Jennifer Boyko Advisor: Ofer Harel Department of Statistics University of Connecticut Storrs, CT May 21, 2013 Jennifer Boyko Handling

More information

Prognosis of Lung Cancer Using Data Mining Techniques

Prognosis of Lung Cancer Using Data Mining Techniques Prognosis of Lung Cancer Using Data Mining Techniques 1 C. Saranya, M.Phil, Research Scholar, Dr.M.G.R.Chockalingam Arts College, Arni 2 K. R. Dillirani, Associate Professor, Department of Computer Science,

More information

CLASSIFICATION JELENA JOVANOVIĆ. Web:

CLASSIFICATION JELENA JOVANOVIĆ.   Web: CLASSIFICATION JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is classification? Binary and multiclass classification Classification algorithms Naïve Bayes (NB) algorithm

More information

The Performance of Multiple Imputation for Likert-type Items with Missing Data

The Performance of Multiple Imputation for Likert-type Items with Missing Data Journal of Modern Applied Statistical Methods Volume 9 Issue 1 Article 8 5-1-2010 The Performance of Multiple Imputation for Likert-type Items with Missing Data Walter Leite University of Florida, Walter.Leite@coe.ufl.edu

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Jesse Read 1, Albert Bifet 2, Bernhard Pfahringer 2, Geoff Holmes 2 1 Department of Signal Theory and Communications Universidad

More information

Data corruption, correction and imputation methods.

Data corruption, correction and imputation methods. Data corruption, correction and imputation methods. Yerevan 8.2 12.2 2016 Enrico Tucci Istat Outline Data collection methods Duplicated records Data corruption Data correction and imputation Data validation

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Development of Improved ID3 Algorithm with TkNN Clustering Using Car Dataset

Development of Improved ID3 Algorithm with TkNN Clustering Using Car Dataset Development of ID3 Algorithm with Clustering Using Car Dataset M.Jayakameswaraiah, and S.Ramakrishna Abstract - Data mining involves an integration of techniques from multiple disciplines such as database

More information

Attribute Reduction using Forward Selection and Relative Reduct Algorithm

Attribute Reduction using Forward Selection and Relative Reduct Algorithm Attribute Reduction using Forward Selection and Relative Reduct Algorithm P.Kalyani Associate Professor in Computer Science, SNR Sons College, Coimbatore, India. ABSTRACT Attribute reduction of an information

More information