Faculty of Sciences. Holger Cevallos Valdiviezo


Faculty of Sciences

Handling of missing data in the predictor variables when using tree-based techniques for training and generating predictions

Holger Cevallos Valdiviezo

Master dissertation submitted to obtain the degree of Master of Statistical Data Analysis

Promoter: Prof. Dr. Stefan Van Aelst
Department of Applied Mathematics and Computer Science

Academic year

The author and the promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Any other use falls under the restrictions of copyright, in particular concerning the obligation to explicitly mention the source when using results of this master dissertation.

Holger Cevallos Valdiviezo

Stefan Van Aelst

Foreword

This work introduces tree-based techniques for dealing with missing data in the feature variables, either through a prior imputation or within the learning process itself. Very little work has been done in the prediction context, which we address in this thesis. We investigate the prediction performance of the proposed tree-based techniques through simulation studies and real datasets. I would like to thank my promoter, Prof. Dr. Stefan Van Aelst, for his supervision in the completion of this work.

To God for his grace and to my parents with love and gratitude

Table of Contents

Summary
Introduction
Objectives of the study
Approach
Methodology
Description of the techniques to be implemented
Classification and Regression Tree (CART)
Technique 1: Rely on the CART learning algorithm to deal with missing values in its training phase
Random Forest
Rely on the Random Forest learning algorithm to deal with missing values in its training phase
Technique 2: Impute missing values by median/mode
Technique 3: Impute missing values in predictor data using proximity matrix
Technique 4: Bagging
Imputing missing data
Multivariate Imputation by Chained Equations (MICE) implementation
Multiple Imputation for missing data via sequential Regression Trees
Technique 5: Multiple Imputation for missing data via sequential Regression Trees (MICE based on CART) and subsequent training with CART technique
Implementation
An intuition behind the algorithm
Technique 6: Multiple imputation for missing data via sequential Regression Trees: impute all missing values before training with Random Forest
Implementation
An intuition behind the algorithm
Bootstrap methods to impute missing data before training with CART/Random Forest
Implementation of the Bootstrap imputation method
Technique 7: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution: impute all missing values before training with CART
Technique 8: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution: impute all missing values before training with Random Forest

Technique 9: Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution: impute all missing values before training with CART
Technique 10: Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution: impute all missing values before training with Random Forest
Implementation of the techniques proposed
Simulation datasets
Simulated regression setting
Test set and learning set
Missing data
Objectives and methodology
Results of the analysis
Assessment of prediction and selection of the model
Some other details
Real datasets
Regression setting: Abalone dataset
Test set and learning set
Missing data
Objective and methodology
Data manipulation
Results of the analysis and selection of the model
Assessment of prediction and selection of the model
Classification setting: Vertebral Column dataset
Data manipulation
Test set and learning set
Missing data
Objective and methodology
Results of the analysis and selection of the model
Assessment of prediction and selection of the model
Conclusions
Discussion and Future Work
Reference List

List of Tables

Table 1: Summary of the techniques dealing with missing data to be applied throughout this study
Table 2: Correlation structure of the simulated dataset, output extracted from R
Table 3: Structure of the generation of NMAR data for the simulated regression setting
Table 4: Missing variables and their corresponding predictors selected for imputing with the bootstrap imputation methods
Table 5: The first two techniques showing the best prediction performance in each of the scenarios analysed in this study, together with their test estimates of prediction error
Table 6: Performance of tree-based techniques fitted to the complete datasets
Table 7: All the scenarios to be analysed under the MCAR mechanism in our simulated dataset
Table 8: All the scenarios to be analysed under the MAR mechanism in our simulated dataset
Table 9: All the scenarios to be analysed under the NMAR mechanism in our simulated dataset
Table 10: Estimated prediction performance in our simulated dataset of each of the proposed techniques across a number of different missing data scenarios. A hyphen (-) indicates that no output could be obtained from the statistical software R
Table 11: Brief description of the variables gathered in our Abalone dataset
Table 12: Brief description of the structure of missing data generation under the MAR mechanism for the Abalone dataset
Table 13: Correlation structure of the Abalone dataset, output from R
Table 14: Variance-covariance matrix of the Abalone dataset, output from R
Table 15: Missing variables and their corresponding predictors selected for imputing with the bootstrap imputation methods, for the Abalone dataset
Table 16: The first two techniques showing the best prediction performance in each of the scenarios analysed in the Abalone dataset, together with their test estimates of prediction error

Table 17: Performance of tree-based techniques fitted to the complete Abalone dataset
Table 18: All the scenarios to be analysed under the MCAR mechanism in the Abalone dataset
Table 19: All the scenarios to be analysed under the MAR mechanism in the Abalone dataset
Table 20: All the scenarios to be analysed under the stochastically right censored NMAR mechanism in the Abalone dataset
Table 21: All the scenarios to be analysed under the stochastically mixed censored NMAR mechanism in the Abalone dataset
Table 22: Estimated prediction performance in the Abalone dataset of each of the proposed techniques across a number of different missing data scenarios. A hyphen (-) indicates that no output could be obtained from the statistical software R
Table 23: Correlation structure of the Vertebral Column dataset, output from R
Table 24: Variance-covariance matrix of the Vertebral Column dataset, output from R
Table 25: Missing variables and their corresponding predictors selected for imputing with the bootstrap imputation methods, Vertebral Column dataset
Table 26: The first two techniques showing the best prediction performance in each of the scenarios analysed in the Vertebral Column dataset, together with their misclassification estimates based on a test set
Table 27: Performance of tree-based techniques fitted to the complete Vertebral Column dataset
Table 28: All the scenarios to be analysed under the MCAR mechanism in the Vertebral Column dataset
Table 29: All the scenarios to be analysed under the MAR mechanism in the Vertebral Column dataset
Table 30: All the scenarios to be analysed under the NMAR mechanism in the Vertebral Column dataset
Table 31: Estimated misclassification rate based on a test set of each of the proposed techniques across a number of different missing data scenarios, Vertebral Column dataset

Summary

Missing data is a common issue in real datasets. In the prediction context, handling this problem incorrectly can lead to biased error estimates and poor prediction results. In our study, we focus on missing data present in the feature variables. To tackle this problem, we compare ten techniques that deal with missing values, either by themselves or through an imputation method, and that eventually use Classification and Regression Trees (CART) or Random Forest (RF) to generate predictions. The techniques in question are: CART (surrogate splits); Random Forest imputing missing values by median/mode; Random Forest imputing missing values using the proximity matrix; Bagging (surrogate splits); Multiple Imputation for missing data via sequential Regression Trees using CART for prediction; Multiple Imputation for missing data via sequential Regression Trees using RF for prediction; the bootstrap method to impute missing data based on the mean of the predictive distribution using CART for prediction; the same bootstrap method using RF for prediction; the bootstrap method to impute missing data based on random draws from the predictive distribution using CART for prediction; and the same bootstrap method using RF for prediction. In reality, we do not know how the missing data were generated, that is, the mechanism that produced the missing values. Thus, we look for techniques showing good prediction results in different scenarios of missingness. For the selected techniques, we also explore how the imputations perform compared to the equivalent tree-based technique fitted to the full data.
In particular, we fitted each of the ten techniques in scenarios formed by all combinations of three missingness mechanisms: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR); and three percentages of missing fields: 5%, 10% and 25%. We implemented these techniques on a simulated dataset (continuous response), a real dataset with a continuous response and a real dataset with a class response. We evaluated the performance of the techniques in terms of prediction error by using a test set, larger than the learning set, on which we computed the test estimate of the prediction error. After implementation, the technique Multiple Imputation of missing data via sequential Regression Trees fitting RF for prediction showed the best prediction performance

in almost all the scenarios of the regression settings, and good prediction performance in the classification setting. The imputations made by this technique proved fairly good, and its test estimates of the prediction error were very stable across different percentages of missing data and different missingness mechanisms. In second place, the technique Bootstrap method to impute missing data based on the mean of the predictive distribution using Random Forest for prediction showed very good prediction results on the classification dataset and good prediction results on the regression datasets for almost all the scenarios, namely the ones with a low amount of missing values. These first two techniques were computationally very intensive. In third place, Random Forest imputing missing values in the predictor data using the proximity matrix emerged as an alternative, being computationally fast and showing good prediction results on the regression and classification real datasets in some of the missing data scenarios considered. However, there is no guarantee that it will perform equally well in different data settings. In general, the naïve CART and RF techniques showed acceptable prediction performance with a low amount of missing data in some of the datasets analysed, but for medium to large amounts of missing data the predictions became very poor. This suggests that the missing values issue should be addressed in an appropriate way, and that skipping this step can be too detrimental.

1 Introduction

Having observations with missing values for one or more feature variables constitutes a major problem before implementing a prediction technique, or before starting to analyse a dataset with appropriate statistical tools. As statisticians, we always aim to obtain valid inference results about a population. However, besides the uncertainty produced in the sampling process, we are now confronted with the uncertainty of those missing fields, whose true values are unknown. This definitely makes a statistical analysis harder; the variability around an estimator/prediction is higher and only less precise information can be obtained from the sample. In this project, we are primarily interested in assessing the prediction power of ten techniques applied to datasets with missing values. Each of them has its own strategy for dealing with missing data and generating predictions. For the latter, we will use Classification and Regression Trees (CART) and Random Forest (RF) in this study. Furthermore, we focus on missing data in the input features. Clearly, the prediction capability will be reduced in the presence of missing data due to the uncertainty added to the predictions (the prediction variability will be increased). Thus, we still have to find a way of dealing with those missing observations before (or perhaps during) the training phase, in order to be able to generate predictions from the sample. The first three techniques are CART, Bagging and Random Forest, which are not only used for prediction in classification and regression settings, but also serve as learning algorithms that deal with missing values in the input features by themselves. In other words, they handle missing values directly during the learning process on the training set.
On the other hand, we also want to investigate whether we can further improve the prediction capability of the tree-based techniques in the presence of missing data by first filling in the missing values with an imputation technique, and afterwards fitting CART or Random Forest on the imputed dataset to obtain predictions. In particular, the fill-in step in the proposed techniques will be based on the following three imputation methods: (1) Multiple Imputation for missing data via sequential Regression Trees (Burgette and Reiter, 2010), (2) the non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (He, 2006), and (3) non-parametric bootstrapping to impute missing data based on random draws from the predictive distribution. Multiple Imputation for missing

data via sequential Regression Trees is a novel imputation technique that uses CART in the imputation process to approximate the conditional distribution of any missing variable given multiple predictors. The non-parametric bootstrap method to impute missing data starts by imputing the missing values with an unconditional mean imputation on each bootstrap sample. Next, the imputations are updated, based either on the mean of the predictive distribution or on random draws from the predictive distribution of the missing variables. This project uses the latter when we want to smooth out the rigid linear relationship, possibly not the true one, assumed between the missing target variable and its predictors. After imputation, CART/Random Forest is applied on each imputed bootstrap sample, and predictions are produced by averaging the trees/forests (regression) or by majority vote among the trees/forests (classification), as in Bagging, with the hope of averaging out the uncertainty around the sample and around the imputation model. The literature (He, 2006) states that important advantages of the non-parametric bootstrap method are that it does not depend on any missing data mechanism, which is considered a disadvantage of multiple imputation methods, and that it requires no knowledge of either the probability distribution or the model structure of the full data (observed and missing values). In this study we will explore all these properties. Likewise, we want to examine the predictive performance of two naïve approaches: CART fitted only on complete records and Random Forest fitted only on complete records. It is generally known that these naïve techniques can be used when the amount of missing data is low, and that otherwise they should be avoided. Hence the need to treat the missing values in order to still obtain good prediction results when less information is available.
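The bootstrap-imputation-then-aggregate idea described above can be sketched in a few lines. The following is our own illustrative Python sketch, not the thesis's R code: a simple one-predictor least-squares fit stands in for CART/Random Forest, missing entries are encoded as None, and the function names are hypothetical.

```python
import random

def mean_impute(column):
    """Unconditional mean imputation: replace None with the observed mean."""
    observed = [v for v in column if v is not None]
    mu = sum(observed) / len(observed) if observed else 0.0
    return [mu if v is None else v for v in column]

def fit_line(x, y):
    """Stand-in for CART/RF: one-predictor least squares (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return b, my - b * mx

def bootstrap_impute_predict(x, y, x_new, B=50, seed=1):
    """Draw B bootstrap samples, mean-impute each, fit a model on each,
    and average the B predictions (as in Bagging)."""
    rng = random.Random(seed)
    n, preds = len(y), []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        xb = mean_impute([x[i] for i in idx])       # impute within the sample
        yb = [y[i] for i in idx]
        b, a = fit_line(xb, yb)
        preds.append(a + b * x_new)
    return sum(preds) / B                           # aggregate, regression case
```

In the full method, the initial mean imputation would additionally be updated from the predictive distribution (its mean, or a random draw) before refitting; the sketch keeps only the bootstrap-impute-fit-aggregate skeleton.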
We divide this task into 9 sections. We describe each of the proposed techniques intuitively, with some hints of theory, throughout Section 5. Then, in Section 6, we fit each of them to a simulated regression setting, a real regression dataset and a real classification dataset. At the end of each implementation we analyse the performance of each technique. In practice, we do not know which missing data situation we face when we have a dataset with missing values at hand. Thus, from our analysis, we want to identify the technique that performs best in most of the missing data scenarios considered, in terms of prediction and imputation quality.

2 Objectives of the study

Assess the prediction capability, in regression and classification settings (either simulated or real datasets), of ten techniques dealing with missing data (some of them by themselves and others through a prior imputation procedure), each using a tree-based model (either CART or RF) in the training process for generating predictions, under all combinations of three missingness mechanisms: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR); and three percentages of missing fields: 5%, 10% and 25%.

Evaluate the imputation quality of the winning techniques under the same combinations of scenarios described above, using real and simulated datasets. From our analysis we thus hope to come up with a technique that performs best in most of the scenarios considered, in terms of prediction and imputation quality.

We know from previous studies on missing data that the missingness mechanism plays an important role in the validity of imputations, and thus in the predictions that can be obtained from the (imputed) sample. This thesis will therefore pay special attention to the effect of each missingness mechanism on both the quality of the imputations/surrogate splits and the prediction quality of each of the techniques to be implemented. Finally, we want to assess the most naïve approach: simply discard the observations with missing values and train on the complete cases using CART/Random Forest. This approach is used most of the time, and if we are not aware of the consequences of the mechanism that generated the missingness for the inferences/predictions, we may often produce wrong results. However, it is generally argued that this approach can still be used if the relative amount of missing data is small. We will see how it performs in our simulated dataset and in the real regression and classification datasets to be analysed.
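The grid of scenarios described above can be enumerated directly. A minimal Python sketch (the labels below are our own shorthand, not identifiers from the thesis):

```python
from itertools import product

# three missingness mechanisms crossed with three missingness percentages
mechanisms = ["MCAR", "MAR", "NMAR"]
rates = [0.05, 0.10, 0.25]

# nine scenarios per dataset, as described in the objectives
scenarios = list(product(mechanisms, rates))
```

Each of the ten techniques is then fitted once per scenario, giving a 10 x 9 grid of fits for every dataset analysed.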
3 Approach

In this study, we artificially generate missing values in the covariates using different mechanisms (MCAR, MAR, NMAR). Then, in order to handle the missing data and generate predictions using tree-based techniques, we use the different strategies summarized in Table 1:

Table 1: Summary of the techniques dealing with missing data to be applied throughout this study

1. Strategy: CART (surrogate splits). Training technique: CART. Approach: rely on the learning algorithm to deal with missing values in its training phase.
2. Strategy: Random Forest, impute missing values by median/mode (option na.roughfix in the randomForest library in R). Training technique: Random Forest. Approach: rely on the learning algorithm to deal with missing values in its training phase.
3. Strategy: Random Forest, impute missing values using the proximity matrix (rfImpute function in the randomForest library in R). Training technique: Random Forest. Approach: rely on the learning algorithm to deal with missing values in its training phase.
4. Strategy: Bagging (surrogate splits). Training technique: Bagging. Approach: rely on the learning algorithm to deal with missing values in its training phase.
5. Strategy: Multiple Imputation for missing data via sequential Regression Trees. Training technique: CART. Approach: impute all missing values before training.
6. Strategy: Multiple Imputation for missing data via sequential Regression Trees. Training technique: Random Forest. Approach: impute all missing values before training.
7. Strategy: non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution. Training technique: CART. Approach: impute all missing values before training.
8. Strategy: non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution. Training technique: Random Forest. Approach: impute all missing values before training.
9. Strategy: non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution. Training technique: CART. Approach: impute all missing values before training.
10. Strategy: non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution. Training technique: Random Forest. Approach: impute all missing values before training.
11. Strategy: none. Training technique: CART. Approach: discard all observations with any missing values and then apply CART on this new dataset.
12. Strategy: none. Training technique: Random Forest. Approach: discard all observations with any missing values and then apply Random Forest on this new dataset.

4 Methodology

In order to evaluate the prediction performance of
each technique in each scenario, we will use the Mean Squared Prediction Error (MSPE) in regression settings and the misclassification rate, both estimated on a test set. The loss of prediction capability due to missing data is caused by the interaction of two factors:

1. The tree-based prediction model fitted on an incomplete dataset, on multiply imputed datasets or on imputed bootstrap samples will produce less accurate predictions than a fit on the complete data, and

2. It will be more difficult to predict future cases that have missing values or whose missing fields have already been imputed.

In this study, we only measure the first factor separately, by generating missing data under a given missingness scenario only on the learning set, and leaving the test set complete. Thus, we want to measure the loss of prediction capability (accuracy) caused by constructing inaccurate tree(s)/forest(s) on incomplete/imputed datasets. In order to assess the loss in prediction accuracy caused by the second factor separately, we would have to create missing data only on the test set. Then, if the technique requires an imputation step, we could impute the test set before dropping its cases down the corresponding tree-based model fitted on the complete dataset, generating a prediction for each case. If the technique does not require imputation, as in CART and Bagging (they use surrogates), we could drop each case of the incomplete test set down the tree built by CART, or the set of trees built by Bagging, to generate a prediction for each case of the test set. For techniques like CART or Bagging, which do not need a prior imputation step, this is a more genuine assessment. However, for techniques where we would first have to impute the missing test set fields, we would lose the complete structure of the techniques, since imputation would be done on one set (the test set) and fitting on the other (the learning set). Since we want to show the complete sequence of steps taken when fitting each proposed technique, we limit our discussion to generating missing data exclusively on the learning set. We could also delete values in both the learning set and the test set. We could then fit the techniques on the learning set with missing data, and drop the cases of the test set (possibly after imputing them first) down the fit of the corresponding technique.
After that, we could compare the error estimate for the case where the test set has missing data with the test error estimate for the situation where the test set is complete. This would allow us to measure the loss in accuracy of the test error estimates of the prediction when there is missing data in the test set. Of course, this goes one step further; in this study we limit our analysis to missing data in the learning set only. Thus, the test set error estimate computed for each scenario will reflect the loss of prediction capability (accuracy) caused by constructing inaccurate tree(s)/forest(s) in the presence of missing data under that scenario. We will do this on simulated and real datasets.
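As a concrete illustration of this setup, one can delete learning-set entries completely at random and then score any fitted model on the untouched test set. A minimal Python sketch of our own (MCAR only; None marks a missing field; the function names are hypothetical):

```python
import random

def make_mcar(column, rate, seed=0):
    """MCAR deletion: each entry is set to missing (None) independently
    with probability `rate`, regardless of any observed or unobserved value."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else v for v in column]

def mspe(y_true, y_pred):
    """Mean squared prediction error, computed on a (complete) test set."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

MAR and NMAR deletion would instead make the missingness probability depend on other observed covariates, or on the (possibly unobserved) value itself, respectively; only the MCAR case is this simple.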

In order to evaluate the quality of the fill-in procedure within a technique, or the quality of the surrogate splits in CART and Bagging, we will compare the test error estimate of the whole technique fitted in the presence of missing data with the performance of an equivalent technique fitted on the complete dataset (full data analysis). By doing so, we measure the impact of the presence of missing data on the test error of the equivalent tree-based method fitted on a complete dataset. We will do this for the different missingness scenarios, using real and simulated datasets. If the imputation/surrogate is of good quality, we may expect this test error not to increase much. We will also consider variants in the imputation step of some techniques; these will be detailed when describing each procedure. In addition, a table summarizing each technique with its equivalent tree-based technique will be provided in the following sections. We will not be able to assess the imputation quality of all the proposed techniques, since for some of them we could not find an equivalent tree-based method.
5 Description of the techniques to be implemented

5.1 Classification and Regression Tree (CART)

This method is mostly known for its flexibility and its ability to capture complex interaction structures in the data. Basically, it partitions the predictor space by a sequence of binary splits into terminal nodes, so that at each partition a measure of impurity is minimized. The process of successively partitioning the predictor space is well illustrated by a tree representation, in which the division of nodes into child nodes represents a partition (each node showing its splitting criterion) and the division of nodes into terminal

nodes or leaves at the bottom represents the final regions into which the feature space is divided (the final fits are given at the terminal nodes). To perform the best split at each node, one looks for the split variable and the split point that give the most homogeneous child nodes possible. Thus, CART is built by iteratively splitting nodes so as to maximize the decrease in node impurity. A constant is fitted at each node, e.g. the majority class in each region in classification settings and the average of the outcomes of the cases in that region in regression settings. From this, we can compute the resubstitution impurity measure within each node (misclassification rate or sum of squared errors) used to make the splitting decision (other measures, such as the Gini index or cross-entropy for classification, can also be used). In order to generate predictions, we look at the terminal nodes. In classification settings the (test) observations in leaf m (m = 1,..., M) are assigned to the majority class in that leaf, whereas in regression settings the prediction at each terminal node m is the average of the outcomes of the observations within that terminal node. Therefore, the constant fitted in each region is the conditional expectation of the response given the corresponding predictor measurements, and the outcome values in each terminal node represent the conditional distribution of the outcome for units in the data whose predictors satisfy the partitioning criteria that define that leaf. The latter definition will be important in MICE based on trees (see Section 5.4.2), in which the imputations are drawn from the approximate conditional distribution of the missing target variable given multiple predictors. The question then arises of how large the tree should be in order to generate good predictions.
If we build very large trees, we will certainly overfit the training data, giving low bias but high variability. On the other hand, if we construct very small trees, we will certainly introduce a large bias, albeit with less variability. Thus, a good compromise between bias and variability has to be found (bias-variance trade-off). Normally, we grow a very large tree (usually until there are at most five cases at each terminal node in regression settings, or until reaching pure nodes in classification settings). Then we may prune the tree through weakest-link pruning, successively collapsing the internal node that produces the smallest per-node increase in the cost-complexity criterion, continuing until we reach the single-node tree. We define the cost-complexity criterion as:

C_α(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α|T|

where α ≥ 0 is a tuning parameter governing the trade-off between tree size and goodness of fit to the data, |T| is the number of terminal nodes in subtree T, N_m is the number of cases in terminal node m, and Q_m(T) is the impurity measure at node m. For each α, it can be shown that there is a unique smallest subtree that minimizes the cost-complexity criterion. We choose the value of α that minimizes the cross-validation estimate of the error. Although we aim to construct more stable trees by pruning, the CART procedure tends to be very unstable. Very often, small changes in the data can lead to very different sequences of splits. The reason is the hierarchical nature of the procedure: an error in a split at the top of the tree is propagated down to all the splits below it, eventually generating very different fits and predictions.

Technique 1: Rely on the CART learning algorithm to deal with missing values in its training phase

One of the advantages of CART is its ability to deal with missing data in its training phase, by allowing a case with missing data to be sent down the tree through surrogate splits. Hence, in the search for the best split at each node, we consider all the predictors x_d (d = 1,..., p), and for each of them we use only the cases having a value for x_d. If a case is missing the value of the variable selected for the primary split (so that the best split is not defined for that case), then among all non-missing variables of that case we can find the predictor and corresponding split point that best mimics the split of the training data achieved by the primary (best) split, that is, the split having the highest measure of predictive association λ (λ > 0). This measure is the relative reduction obtained by using the surrogate split to predict the primary split, compared with the simplest prediction rule for the best split, max(u_L; u_R), which simply sends the case to the child node with the largest relative frequency at that node (Breiman et al., 1984).
Thus, if an observation is missing all the surrogate splits, the rule max(ul; ur) is used in this study. In R, we use the rpart function in the rpart library to perform the CART algorithm. It includes features such as surrogate splits and the use of the max(ul; ur) rule (option usesurrogate = 2). The option surrogatestyle was set to 0 (default), so that the program uses the total number of cases correctly sent to the child nodes as the criterion for choosing a surrogate variable. This criterion severely penalizes covariates with a large number of missing values.
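The surrogate scoring just described can be made concrete with a short sketch. The following is a minimal Python illustration (our own toy; the study itself relies on R's rpart) of scoring a surrogate candidate by its predictive association with the primary split, relative to the majority rule:

```python
def predictive_association(primary_left, surrogate_left):
    """Score a surrogate split against the primary split, as in Breiman et
    al. (1984): the relative error reduction of the surrogate over the
    naive majority rule that sends every case to the larger child."""
    n = len(primary_left)
    p_left = sum(primary_left) / n
    baseline = max(p_left, 1 - p_left)            # accuracy of the majority rule
    agree = sum(p == s for p, s in zip(primary_left, surrogate_left)) / n
    return (agree - baseline) / (1 - baseline)    # undefined if baseline == 1

def best_surrogate(primary_left, candidates):
    """Pick the candidate split with the highest predictive association."""
    scores = {name: predictive_association(primary_left, s)
              for name, s in candidates.items()}
    return max(scores, key=scores.get), scores
```

With a primary split sending half the cases left, a perfectly agreeing candidate scores λ = 1 and a constant candidate scores λ = 0; only candidates with λ > 0 are retained as surrogates.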

5.2 Random Forest

Random Forest consists of growing an ensemble (a forest) of trees on bootstrap samples, where each tree is built using random feature selection. Hence, when growing a tree on a bootstrapped dataset, before each split we select g ≤ p of the input variables at random as candidates for splitting. This technique was introduced with the idea of reducing the instability produced when building a single tree, by averaging out the variability of trees built on different sets of data sampled from an approximate distribution. It is also meant to improve on the variance reduction achieved by Bagging, by building de-correlated trees in the forest. If the random subset of candidate features for splitting is sufficiently smaller than the whole set, overfitting of the training data is mitigated, since at each split we do not provide the algorithm with all the available information, but only a random part of it. However, this is achieved at the price of introducing more bias into the predictions. Thus, the need arises to find the most appropriate number of predictors for the random feature selection, trading off bias against variance.

Rely on the Random Forest learning algorithm to deal with missing values in its training phase

Random Forest also allows us to handle missing data in its learning phase. The function for the algorithm available in R is randomForest in the randomForest library, which provides two ways of dealing with missing data.

Technique 2: Impute missing values by median/mode

This is a very simple way of handling missing data. For numerical variables NAs are replaced with column medians, for factor variables NAs are replaced with the most frequent levels (breaking ties at random), and if a data matrix contains no NAs, it is returned unaltered. This is performed in R by the function na.roughfix in the randomForest library.
Once the data are filled in, we implement Random Forest and generate predictions. Although this can be seen as a pure imputation procedure, we consider it part of the Random Forest algorithm, since the na.roughfix function belongs to the library implementing the Random Forest algorithm.
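The rough fix can be sketched in a few lines of Python (our own toy version of the idea behind na.roughfix; unlike the R function, statistics.mode breaks ties by first occurrence rather than at random):

```python
from statistics import median, mode

def rough_fix(column):
    """Median/mode imputation in the spirit of na.roughfix: None marks a
    missing entry; numeric columns get the median of the observed values,
    categorical columns the most frequent observed level."""
    observed = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = median(observed)
    else:
        fill = mode(observed)
    return [fill if v is None else v for v in column]
```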

5.2.3 Technique 3: Impute missing values in predictor data using the proximity matrix

A more sophisticated way of handling missing data in the randomForest library in R makes use of the freely available proximity matrix. In growing the forest, an n × n proximity matrix is created for the training data (where n is the number of cases in the training data). Each time a pair of observations shares a terminal node in a tree of the forest, their proximity increases by one. This is an intrinsic proximity measure, inherent in the data and in the Random Forest algorithm. Thus, each entry in the normalized proximity matrix shows the proportion of trees in which a pair of observations falls in the same terminal node. The algorithm starts by imputing NAs using the na.roughfix function. Then, the Random Forest algorithm is run on the initially imputed dataset. Next, the proximity matrix from the Random Forest algorithm is used to update the imputation of the NAs. For continuous predictors, the imputed value is the weighted average of the non-missing observations, where the weights are given by the proximities. For categorical predictors, the imputed value is the category with the largest average proximity. This process is iterated 5 times by default; in this way, cases that are most like the cases with missing data are given the greatest weight. We will use the option values iter = 10 and ntree = 500 in R when implementing the techniques (Section 6). The option ntree refers to the number of trees used for constructing the proximity matrix. Although we find these learning methods that deal with missing data in their training phase interesting, since no imputation is needed before training, it is argued in the literature that they lack a sound theoretical rationale (He, 2006). We will thus evaluate their prediction capability through simulations and real datasets in later sections.
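The proximity-based update of Technique 3 for a continuous predictor can be sketched as follows (our own minimal Python version of one update step; in the full algorithm this update is alternated with refitting the forest for the given number of iterations):

```python
def proximity_update(values, missing, proximity):
    """One iteration of the proximity-weighted update for a continuous
    predictor: each missing entry becomes the average of the observed
    entries, weighted by the case's proximities to them."""
    observed = [j for j in range(len(values)) if j not in missing]
    out = list(values)
    for i in missing:
        weights = [proximity[i][j] for j in observed]
        total = sum(weights)
        if total > 0:
            out[i] = sum(w * values[j] for w, j in zip(weights, observed)) / total
    return out
```

For instance, a case whose proximities to two observed cases are 0.8 and 0.2 receives 80% of the first case's value and 20% of the second's.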
5.3 Technique 4: Bagging

Bagging is a tree-based technique for reducing the variance of the estimated prediction function. It generates bootstrap samples and fits a tree (CART) on each of them. Predictions for each (new) case are obtained by dropping that case down each of the trees, obtaining a fit from each tree, and then averaging all the fits. For classification, each tree contributes a vote for the predicted class. Bagging was developed, as an alternative to CART, to reduce the instability that the latter shows when fitted on slightly different datasets. Bagging is also a technique that deals with missing data by itself. In R, the bagging function in the ipred library performs the same task as the rpart function. It uses surrogate splits for

classifying an observation that is missing the variable of the primary split, in each of the many trees fitted on bootstrap samples.

5.4 Imputing missing data

Imputations are means of, or draws from, a predictive distribution of the missing values, and require a method of creating a predictive distribution for the imputation based on the observed data (Little and Rubin, 2002). In practical applications, the major dilemma resides precisely in the derivation of an appropriate predictive distribution. There are two generic approaches to generating such distributions: explicit modelling and implicit modelling. Explicit modelling bases the predictive distribution on a formal statistical model (e.g. multivariate normal), whereas implicit modelling focuses on an algorithm providing an underlying predictive model. One can impute missing data only once (single imputation) or several times (multiple imputation). A single imputation can thus be the mean of the estimated predictive distribution or simply a draw from it. An important drawback of single imputation methods is that the standard variance formulas of a statistic applied to the filled-in data systematically underestimate the uncertainty around the statistic, even if the model used to generate the imputations is correct. After imputation, the filled-in data are treated as if they were observed (fixed), ignoring the variability due to the imputation, which itself introduces additional uncertainty into estimates and predictions from the response model. Hence, the need arises to take this additional uncertainty into account, and this is done precisely by performing multiple imputations (creating many training sets differing only in the imputed fields).
However, for this to hold, the missing data must have been generated through the missing at random (MAR) mechanism, the model used to generate the imputed values must be correct in some sense, and the model used for the analysis must agree in some sense with the model used for the imputation (Rubin 1987, 1996), conditions that are usually very difficult to meet and to assess. The fact that most multiple imputation procedures rely on the MAR mechanism is a great limitation, since in reality this condition is violated in most cases. In practice, data are missing for reasons beyond the control of the researchers, and thus one can never be sure which missingness mechanism generated the missing data. In fact, to speak of a single missingness mechanism (as this assumption suggests) can be misleading, because in

most studies the missingness happens for a variety of reasons, some of which may even be entirely unrelated to the data at hand. To generate imputations for the missing fields, a probability model for the full data must be imposed. Most traditional multiple imputation methods assume this joint distribution to have a parametric form (e.g. multivariate normal), but in reality data rarely conform to this. As mentioned above, the imputation model should be rich enough to preserve the relations in the data, so that the model for the post-imputation analysis yields good prediction results. However, as we will see in Section 5.4.2, this is sometimes difficult to achieve, since there may be complex and even interactive relations in the data which traditional multiple imputation procedures are not able to handle. The literature states that MICE using CART, a non-parametric approach to multiple imputation, was formulated to address this problem (Burgette and Reiter, 2010). Another difficulty for imputation models arises when the study involves a large number of variables, especially a large number of categorical variables. Then, the problems of the curse of dimensionality and sparse cells can easily occur when performing the conditional regressions (He, 2006), and imprecise imputations can certainly be obtained (there is no region of the high-dimensional predictor space where we can fit well). We address this problem in the bootstrap imputation methods by using the best subset selection procedure for continuous missing target variables, and stepwise model selection by the exact AIC criterion for factor variables, so as to reduce the predictor space and focus on the variables that are most helpful in predicting the missing target variable.
Some other problems that may crop up while imputing missing data are:
- Many predictors with missing values, leading to a sequence of imputation problems
- Variables of different types (e.g. categorical, binary, nominal, ordered, continuous). Consequently, one should perform the conditional regressions under an appropriate theoretical model, e.g. logistic regression if the missing target variable is categorical, Gaussian regression or a non-parametric regression if the target variable is continuous, Poisson regression if the target variable is a count. The choice will also depend on which imputation technique is being used.
- Factor variables defined with many measurement levels (> 2)

- Circular dependence can occur between predictors, for instance when imputing X1 given X2 and imputing X2 given X1 (e.g. correlated variables), even conditionally on other variables
- Especially with large p (number of predictors) and small n (sample size), collinearity and empty cells may occur
- The imputations could themselves produce impossible combinations
- The order in which variables are imputed may be of particular importance

In our study, some techniques will focus on the imputation of multivariate data using novel and reformulated single and multiple imputation procedures, with the aim of completing the datasets before fitting a tree-based prediction technique, for classification and regression settings. In particular, we will use three imputation procedures: Multiple Imputation for missing data via sequential Regression Trees, the non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution, and the non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution. Table 1 (techniques 5 to 10) shows in which of the proposed techniques each of these procedures is implemented. Although the bootstrap imputation method produces multiple imputed bootstrap samples, the imputation algorithm used there actually performs a single imputation on each bootstrap sample.

Multivariate Imputation by Chained Equations (MICE) implementation

To understand how the MICE based on CART imputation procedure works (see Section 5.4.2), we first present how the traditional MICE algorithm works. Suppose that we have an n × (p+1) multivariate random data matrix M, whose first column contains the fully observed outcomes, whose columns d = 2, ..., p+1 represent predictors, and whose rows represent observations or subjects.
Further suppose that from d = 2 onwards we arrange M such that on the left we encounter the columns corresponding to feature variables with missing data (partially observed variables), the Xi block, and on the right the columns corresponding to feature variables that are completely observed, the Xc block: M = (Y, Xi, Xc), where i = 2, ..., r and c = r+1, ..., p+1. Additionally, suppose the columns of Xi are arranged such that, from left to right, the input features have an increasing number of missing fields. Then, the missing values are imputed in a 4-step strategy:

1. Going from left to right, we fill in initial values for the variables (columns) of the Xi block of matrix M. We begin the process by filling in initial values for the

missing fields in X2 with draws from the predictive distribution of X2 conditional on Y, Xc. Next, we fill in initial values for the missing fields in X3 with draws from the conditional distribution given Y, Xc and the completed version of X2. We continue this process until X_r, where we fill in initial values for the missing fields with draws from the conditional distribution given Y, Xc and all the initially imputed variables X2, X3, ..., X_{r-1}.

2. For i = 2, ..., r, replace the originally missing values of Xi with draws from the predictive distribution given M_{-i}, the matrix M without its column i after step 1. In other words, we now take draws from the predictive distribution of Xi (a variable that originally belonged to the partially observed block Xi) conditional on all the originally complete variables and on the variables initially imputed in step 1 (except Xi itself, of course). We update the initially imputed values for all variables in the block Xi, one by one, by repeating this process.

3. Repeat step 2 for a number of iterations w.

4. Repeat steps 1-3 m times, so that we produce m imputed datasets.

This process makes use of a Gibbs sampler, which aims to bring convergence in distribution. We remark that the ordering of the predictor variables in the matrix M (according to an increasing number of missing values) only serves to have as much information as possible when building the imputation model for variables with a relatively larger number of missing fields. It has been reported that this algorithm reaches satisfactory convergence at w = 10 (Burgette and Reiter, 2010).
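To make the four steps concrete, here is a deliberately tiny Python sketch of the chained-equations loop, assuming just two incomplete columns and simple linear regressions with Gaussian draws as the conditional models (our own toy; production implementations such as the mice package are far more general):

```python
import random
from statistics import mean, pstdev

def fit_ols(x, y):
    # least-squares fit of y = a + b*x, plus the residual spread
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    a = my - b * mx
    return a, b, pstdev([yi - (a + b * xi) for xi, yi in zip(x, y)])

def impute_column(target, predictor, rng):
    # draw imputations for the None entries of `target` from the predictive
    # distribution of a simple regression of `target` on `predictor`
    obs = [i for i, v in enumerate(target) if v is not None]
    a, b, s = fit_ols([predictor[i] for i in obs], [target[i] for i in obs])
    return [v if v is not None else a + b * predictor[i] + rng.gauss(0, s)
            for i, v in enumerate(target)]

def mice_toy(x2, x3, m=5, w=10, seed=0):
    # steps 1-4 in miniature: initial fill, w Gibbs-style sweeps, m repeats
    datasets = []
    for rep in range(m):
        rng = random.Random(seed + rep)
        start3 = mean([v for v in x3 if v is not None])   # crude initial fill
        cur3 = [v if v is not None else start3 for v in x3]
        cur2 = impute_column(x2, cur3, rng)               # step 1
        for _ in range(w):                                # steps 2-3
            cur2 = impute_column(x2, cur3, rng)
            cur3 = impute_column(x3, cur2, rng)
        datasets.append((cur2, cur3))                     # step 4
    return datasets
```

Each of the m returned dataset pairs preserves the observed entries and differs only in the drawn imputations.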
The literature states that MICE creates statistically independent imputations, and thus no iterations need to be wasted on reaching independence between draws, as is the case in Markov chain methods (Groothuis-Oudshoorn and Van Buuren, 2011). As can be seen, in the standard MICE implementation the imputer specifies a set of conditional distributions for the missing data, modelled through conditional regressions. In this way the multivariate problem is split into a series of univariate problems (that is where the chained equations come in). Thus, there is no need to specify a multivariate distribution for the entire dataset (observed and partially observed variables), as in some traditional multiple imputation techniques. We assume that the multivariate distribution exists, and that sampling from it can be attained by iteratively sampling from the conditional models. It can be shown that the MICE algorithm produces the

posterior distribution of the vector of parameters θ by iteratively sampling from the conditional models (Groothuis-Oudshoorn and Van Buuren, 2011). These parameters are the ones that completely specify the joint distribution of the data. The algorithm is very convenient, since it is clearly easier to specify conditional distributions than a hypothetical joint distribution of all the data. Multivariate imputation models may lack the flexibility to address specific features of the data, which can be seen as an argument in favour of MICE (Schafer, 1997). However, this freedom may lead us to specify a set of conditional distributions for which no known joint distribution exists. When the specified conditional distributions are incompatible, and thus there is no distribution to converge to, the algorithm will alternate between isolated conditional distributions. For conditional linear regressions, for instance, such incompatibility occurs only exceptionally: given some specific regularity conditions, the corresponding joint distribution would be multivariate normal. In the literature it is also argued that the order in which variables are imputed should be chosen sensibly to avoid potential incompatibility of the conditional distributions (the mice package actually provides a facility for this; Groothuis-Oudshoorn and Van Buuren, 2011). Despite all these potential drawbacks, MICE is widely used due to its easy implementation and flexibility.

Multiple Imputation for missing data via sequential Regression Trees

This technique belongs to the family of imputation procedures, although we want to produce predictions afterwards by fitting CART on each imputed dataset. In particular, this technique belongs to the family of multiple imputation methods.
This technique is a reformulated version of the previously released Multiple Imputation by Chained Equations (MICE) procedure, which, as mentioned before, is implemented by specifying sequential conditional regression models for all variables with missing data. However, with a large number of variables, specifying those models is not easy, since there may be interactions and nonlinear relations among the variables (complex models), and so MICE cannot guarantee success. Motivated by these difficulties, a non-parametric approach to MICE using CART was developed. CART seeks to approximate the conditional distribution of the missing values given multiple predictors, and has several other interesting features that suggest using it as an imputation engine in MICE. As mentioned before, CART is flexible enough to capture

interactions, non-linear relations, and complex distributions without any parametric assumption or data transformation. As shown in the literature, MICE using CART gives more reliable inferences than traditional MICE (Burgette and Reiter, 2010). Since CART provides the conditional distribution of the missing cases given various value combinations of the corresponding predictors, it can effectively result in models with many interaction effects. Hence, if the real response analyses to be performed on the imputed datasets involve flexible and interactive relations, as is often the case (they rarely involve a too rigid and strictly linear relationship), the resulting imputation model may be in accordance with the real response analysis. The latter is a desirable property of imputation techniques, as mentioned before. In addition, the fact that CART is used as the imputation engine makes the algorithm non-parametric. If a parametric assumption is adequate, the algorithm may experience a decrease in efficiency relative to parametric models. Basically, the idea of the MICE algorithm is now translated to a scenario where CART is used to approximate the conditional distribution of the missing variable given multiple predictors. In particular, the values of the missing target variable (outcome) in each terminal node (leaf) of the tree represent the conditional distribution of the missing target variable for a particular case whose predictor values satisfy the partitioning criteria defining that leaf. Thus, to implement MICE based on CART, we use CART in steps 1-4 of the MICE implementation (Section 5.4.1) instead of the conditional regressions. For instance, in step 1, where we fill in initial values for the missing fields, for, say, x3 (a missing realization), we sample elements from the leaf that corresponds to the values of y, xc (row vector) and the (initially imputed) variable x2 of the case of interest.
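The leaf-sampling step can be sketched in a few lines (our own minimal Python version of drawing one imputed value from a leaf's donor pool, including the Bayesian bootstrap reweighting of the donors used by Burgette and Reiter (2010); the Dirichlet(1, ..., 1) weights are generated from the gaps between sorted uniforms):

```python
import random

def bayesian_bootstrap_draw(leaf_values, rng):
    """Draw one imputed value from a leaf's donor pool, after reweighting
    the donors by a Bayesian bootstrap to reflect uncertainty about the
    leaf's population distribution."""
    n = len(leaf_values)
    cuts = sorted(rng.random() for _ in range(n - 1))
    weights = [b - a for a, b in zip([0.0] + cuts, cuts + [1.0])]
    return rng.choices(leaf_values, weights=weights, k=1)[0]
```

Every draw is an observed donor value from the leaf, so imputations automatically respect the support of the data.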
Additionally, in order to reflect the uncertainty about the population distributions in the leaves, MICE based on CART performs a Bayesian bootstrap within each leaf before sampling. As we aim to make imputations with CART, the objective of looking for a good tree size for good interpretations and predictions can be put aside. Rather, we may want to grow large trees to impute, so as to minimize the bias. One difficulty for CART, when used as an imputation engine, is the presence of categorical predictors with many levels, as CART scans through all possible partitions of the input variables

when selecting primary splits. A categorical variable with a large number of levels can result in a tree with an excessive number of potential partitions.

Technique 5: Multiple Imputation for missing data via sequential Regression Trees (MICE based on CART) and subsequent training with CART

Implementation

Once we have done the imputations, we fit CART on each of the m imputed datasets. Each tree thus yields a prediction for each (new) observation, and by majority voting among the m trees in classification settings, or by taking the average fit among the m trees in regression settings, we obtain the overall prediction for each (new) observation. For assessing the prediction capability of the technique using a test set (as we do in this study), we simply drop the test set cases down each of the trees grown on the imputed datasets. Then, by majority voting among the m trees (classification case) or by taking the average fit among the m trees (regression case), we get the overall prediction for each test set case. Next, the predicted class/value is compared with the real class/numerical value for each unit in the test set to compute the overall test-set estimate of the prediction error.

An intuition behind the algorithm

It is well known that with single imputation the sampling variability under the imputation model for non-response is not taken into account, and thus the variances of statistics are underestimated. Similarly, one can state in the prediction context that with a single imputation the estimate of the prediction error will be too small, because the variance under the imputation model is not taken into account, and hence only optimistic prediction error measures can be obtained. MICE based on CART takes precisely this into account.
We model a predictive distribution for the missing target variable given multiple predictors, from which we randomly draw values to impute the missing data several times. This eventually yields m imputed datasets, which preserve the original non-missing values and differ only in the fields where imputations were made. By fitting CART on these imputed datasets, we take into account the variability coming from the imputation model, and thus gain stability by averaging the fitted trees. However, in doing this, we are still fitting a very unstable method (CART), whose fit may vary greatly when implemented on a slightly different

dataset. In this case, the m datasets differ from one another due to the different draws for imputation, but not due to the variability among datasets that results when sampling from the (approximate) population distribution (bootstrap samples recreate this process). In other words, the variability of trees built on different datasets sampled from the (approximate) distribution is not averaged out, and so we may expect that fitting CART on top of this imputation technique still returns a very unstable prediction, just like fitting a single plain CART. Generally speaking, multiple imputation procedures assume that the mechanism generating the missing data is missing at random (MAR). In the literature it is stated that MICE, through its implementation in R (the mice package), can handle both MAR and NMAR mechanisms, although under NMAR additional modelling assumptions that influence the generated imputations may be required (Groothuis-Oudshoorn and Van Buuren, 2011). The CART-based MICE imputation technique, however, has not yet been explored with regard to the influence of the missingness mechanisms on post-imputation analysis, as is pursued in this study. In addition, the MICE based on CART imputation technique was not initially proposed for interpretation or for making inferences, but to provide sensible imputations and preserve complexity in the data. However, we want to build a tree-based prediction model from the imputations made by the CART-based MICE algorithm, a procedure that is used here as a building block to make predictions.

Technique 6: Multiple imputation for missing data via sequential Regression Trees: impute all missing values before training with Random Forest

Implementation

In this technique we again make use of MICE based on CART to impute missing data.
Once we have imputed the dataset several times (say m times), we fit the Random Forest algorithm on each of the imputed datasets, eventually yielding m fitted forests. This is a very computationally intensive procedure, since a whole forest is grown on each imputed dataset, which also implies producing a number of bootstrap samples for each imputed dataset. However, it may help to reduce the instability that remained after implementing CART on the imputed datasets.

An intuition behind the algorithm

We argued previously that when fitting CART on CART-based MICE imputations to generate predictions, the prediction may remain unstable, since the variability due to the noisy fit of CART is not averaged out. Now, we take random samples from the approximate distribution on top of the imputations made. That is, once the imputed datasets are available, we take from each of them a number of random samples with replacement (simulating the sampling process from the distribution of each imputed dataset, i.e. bootstrapping), eventually forming a number of bootstrap samples for each imputed dataset. Then, we fit CART with random feature selection on each of the bootstrap samples of the s-th imputed dataset (s = 1, ..., m), resulting in a forest for each of the m imputed datasets. When averaging the m forests, we average out not only the variability of the trees built on the bootstrap samples drawn from the (approximate) distribution of the s-th imputed dataset (s = 1, ..., m), but also the variability involved in fitting a random forest on each of the m different imputed datasets. We are actually fitting trees on a large number of samples (the number of bootstrap samples taken from each imputed dataset times m) drawn from the estimated joint distribution of the data, thereby considering both the variability due to the imputation model (between-variability) and the variability due to sampling from the distribution of the s-th imputed dataset (within-variability). In other words, there is variability on top of variability that is averaged out. Since both sources of variability are averaged out, we intuitively expect this technique to give good prediction results.

Bootstrap methods to impute missing data before training with CART/Random Forest

This procedure was introduced by He (2006). It is an ensemble method, in the sense that it is an ensemble of CART or Random Forest models.
He (2006) argues that the imputation method involved in these techniques does not depend on any missingness mechanism, in contrast to other imputation procedures. In addition, it is stated that no distributional assumptions about the data have to be made in order to impute the missing data. The first statement will, however, be checked in this study in the context of prediction, by performing the non-parametric bootstrap imputation method using the mean of / a random draw from the predictive distribution and fitting CART/RF to predict on simulated and real

datasets. In total, 4 techniques (7-10 in Table 1) that base their predictions on a previous bootstrap imputation step will be analysed.

Implementation of the Bootstrap imputation method

1. Draw B bootstrap samples from the original incomplete sample.
2. For each bootstrap sample, b = 1, 2, ..., B, impute the missing values as follows:
   2.i. Replace missing values with the mean (if the predictor is quantitative) or with the mode (if the predictor is qualitative), a.k.a. the "rough fix".
   2.ii. Regress each variable originally containing missing fields on a correlated set of selected predictors (e.g. using logistic regression if the target variable is categorical, Gaussian regression if the target variable is continuous, Poisson regression if the target variable is a count). The selection of the predictors for each missing target variable is made through forward best subset selection (continuous missing target variable) or stepwise model selection by the exact AIC criterion (factor variables).
   2.iii. Update the imputations by filling each originally missing field with a) the predicted value from the corresponding regression equation (the mean of the predictive distribution), or b) the predicted value from the corresponding regression equation plus some noise (a random draw from the predictive distribution).
3. Apply CART/Random Forest to each imputed bootstrap sample. Drop any new case down each tree/forest to obtain a fitted value/fitted class.
4. Use majority voting among the B trees/forests to generate a predicted class (classification settings), or take the average of the fits among the B trees/forests to generate a predicted value for each new case.

With this procedure, we first perform the bootstrapping and afterwards perform the imputation on each bootstrap sample. Although we consider a single imputation on each bootstrap sample, we end up with many imputed bootstrap samples, which may reproduce the variability under the imputation model.
Afterwards, either CART or RF is fitted on each imputed bootstrap sample, so as to account for the variability of constructing trees on (slightly) different datasets. In this way, when obtaining predictions, we may be considering both the variability under the imputation model and the variability caused by fitting trees on slightly different datasets, which we expect to be averaged out. As with the technique in 5.4.4, there is variability on top of variability that may be averaged out.
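Steps 1-4 can be summarized in a structural Python sketch (our own skeleton: the rough fix of step 2.i stands in for the full imputation of steps 2.i-2.iii, and a plug-in learner stands in for CART/Random Forest):

```python
import random
from statistics import mean, mode

def rough_fix(column):
    # step 2.i stand-in: mean-fill a quantitative column (None = missing)
    observed = [v for v in column if v is not None]
    fill = mean(observed) if observed else 0.0   # guard for degenerate resamples
    return [fill if v is None else v for v in column]

def bootstrap_impute_predict(X, y, fit_predict, new_case, B=25, seed=0):
    """Skeleton of the bootstrap imputation procedure: bootstrap first,
    impute within each bootstrap sample, fit a learner on each sample,
    and combine the B predictions by majority vote (classification)."""
    n, rng, votes = len(y), random.Random(seed), []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]                # step 1
        Xb = [rough_fix([col[i] for i in idx]) for col in X]      # step 2 (2.i only)
        yb = [y[i] for i in idx]
        votes.append(fit_predict(Xb, yb, new_case))               # step 3
    return mode(votes)                                            # step 4
```

A trivial majority-class learner, fit_predict = lambda Xb, yb, c: mode(yb), already exercises the skeleton; in the study the learner is CART or a Random Forest and step 2 also runs the regression update of step 2.iii.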

When updating the imputations in 2.iii with the predicted value from the regression equation, we may be imposing a very rigid linear relationship between the missing target variable and its most important predictors, which in reality may not hold. We could try to smooth that relationship by updating the originally missing values in 2.iii using a random draw from the predictive distribution. According to He (2006), this approach to imputing missing data should be applied with caution, since it assumes large samples, and requires around 2000 bootstrap replications to obtain reasonable numerical accuracy if the bootstrap estimator is non-normal. The latter makes this a very computationally intensive procedure. One limitation of the bootstrap imputation method developed by He (2006) arises when updating the imputations in 2.iii using logistic, Gaussian or Poisson regression: only the most correlated predictors were required to be included in the imputation equations, and thus there was no assurance that all relevant covariates entered the regression equation. In this study, we perform step 2.iii using forward best subset selection (continuous missing target variable) or stepwise model selection by the exact AIC criterion (factor variables), applied only to select models with main effects. By doing this, we try to avoid imputed values with large errors that could reduce the accuracy of prediction. Other solutions that could improve the imputations within this framework would be to allow more complex relations with the missing target variable, e.g. by fitting models with interactions, or by fitting more flexible regressions such as regression splines or smoothing splines to impute the missing target variables. This can be the object of future work on this procedure.
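A stripped-down version of this selection step can be sketched as follows (our own greedy forward search by Gaussian AIC over main effects for a continuous imputation target; the study itself uses R's model selection routines):

```python
import math

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting, for small systems
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols_rss(cols, y):
    # residual sum of squares of y on an intercept plus the given columns
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bb * xx for bb, xx in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def aic(rss, n, n_coef):
    # Gaussian log-likelihood up to a constant; +1 counts the error variance
    return n * math.log(rss / n) + 2 * (n_coef + 1)

def forward_select(candidates, y):
    """Greedy forward selection by AIC: repeatedly add the predictor that
    lowers AIC the most; stop when no candidate improves it."""
    n = len(y)
    ybar = sum(y) / n
    best = aic(sum((yi - ybar) ** 2 for yi in y), n, 1)   # intercept-only model
    chosen, pool = [], dict(candidates)
    while pool:
        scores = {name: aic(ols_rss([col for _, col in chosen] + [c], y), n,
                            len(chosen) + 2)
                  for name, c in pool.items()}
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        best = scores[name]
        chosen.append((name, pool.pop(name)))
    return [name for name, _ in chosen]
```

The search returns the names of the predictors retained for the imputation equation, in the order in which they were added.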
In our simulation and real datasets study (see Results section), we want to empirically investigate how each of the four techniques proposed on the basis of the bootstrap imputation method behaves in terms of prediction power in different missing data scenarios.

Technique 7: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution: impute all missing values before training with CART

In this technique, we update the imputations in 2.iii with the predicted value from the corresponding regression equation. Once we have the imputed bootstrap samples at hand,

we fit CART to obtain predictions as described in steps 3 and 4. In this way, we expect the variability due to noisy fits, along with the variability under the imputation model, to be averaged out.

Technique 8: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution: impute all missing values before training with Random Forest

The second variant also performs the imputations of the originally missing values in 2.iii using the predicted value from the corresponding regression equation (as in ). Now, with the aim of stressing further the uncertainty surrounding imputations and tree fits, we fit a whole forest on each bootstrap sample. First, we mimic the sampling process from the distribution of the data (bootstrapping); then we take a number of bootstrap samples again, but now from each already imputed bootstrap sample after step 2.iii, and on these bootstrap samples we eventually fit de-correlated trees (Random Forest).

Technique 9: Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution: impute all missing values before training with CART

The third variant of this procedure consists of updating the imputations of the originally missing values in 2.iii using a random draw from the predictive distribution. We then fit CART on each imputed bootstrap sample to generate predictions. We want to introduce the variability under the imputation model by ending up with imputed bootstrap samples on which we fit CART. By constructing CART on each imputed bootstrap sample, we mimic the variability incurred when fitting trees on (slightly) different datasets.
Now, by adding some noise to the fitted values from the regression equation, we impose a more flexible relationship between the missing target variable and its predictors when imputing the missing data.

Technique 10: Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution: impute all missing values before training with Random Forest

The last variant of this procedure corresponds to the situation where the imputations of the originally missing values are updated in 2.iii using a random draw from the predictive distribution. Once we have the imputed bootstrap samples at hand, we fit the Random Forest algorithm to each of them. As in , we aim to stress further the uncertainty

surrounding imputations and tree fits by fitting a forest on each imputed bootstrap sample. But now, we impose a more flexible relationship between the missing target variable and its predictors when imputing the missing data, through random draws from the predictive distribution.

6 Implementation of the techniques proposed

6.1 Simulation datasets

We want to empirically assess the prediction performance of each of the proposed methods by means of simulated datasets under different scenarios.

Simulated regression setting

We first simulated a dataset of 5500 observations, with a continuous response, in which we considered seven predictors. For this study, we used the following function:

f(x) = 10 sin(x_1) + 5 e^{x_2} + 16 (x_3 + 1)^{…} + … x_4 … 8 x_5 + 0 · x_6 + 0 · x_7

This function has a nonlinear additive dependence on the first three variables, a linear dependence on the next two, and is independent of the last two (pure noise) variables. For generating the continuous response values, the following model was applied:

y_i = f(x_i) + ε_i, (i = 1, 2, …, N);

where ε_i is an i.i.d. random term following a normal distribution with mean 0 and variance equal to 2 (ε_i ~ N(0, 2)). Variables X_1, X_6 and X_7 follow normal distributions with common variance equal to 1, but with different means:

X_1 ~ N(5, 1)
X_6 ~ N(0, 1)
X_7 ~ N(15, 1)

Variables X_2, X_3 and X_4 are correlated; each pair has a correlation approximately equal to 0.8. They were simulated as correlated uniformly distributed variables:

X_2 ~ U(0, 1)
X_3 ~ U(0, 1)
X_4 ~ U(0, 1)

Variable X_5 is a factor with 3 levels: 0, 1, 2. It was generated by sampling from the binomial distribution as follows:

X_5 ~ Bin(2, 0.4)

Intuitively, we generate X_5 by performing two Bernoulli trials, each with a success probability of 0.4, and counting the successes; the higher the success probability, the more likely it is to observe two successes in the two trials. Table 2 shows the correlation structure of our simulated dataset.

Table 2: Correlation structure of the simulated dataset. Output extracted from R.

We wanted our first simulated dataset in the regression context to be a very complex function, in which some variables are highly correlated (X_2 and X_3, X_2 and X_4, X_3 and X_4), some others almost uncorrelated (e.g. X_1 and X_2, X_1 and X_3, X_1 and X_4, X_2 and X_6, X_4 and X_6), and which includes some noisy variables (X_6 and X_7), in order to evaluate the performance of the ten proposed techniques in the presence of missing data.

Test Set and Learning Set

The simulated dataset of 5500 observations is split at random into two groups of unequal size. The first group consists of 500 observations taken at random from the original dataset, which we call the learning set. This is the training data to which the techniques under study are applied. The second group consists of the remaining 5000 observations. We use these observations to generate predictions from the model built on the training dataset. These predictions are subsequently used to calculate the test set estimate of the prediction error.

Missing data

As mentioned before, in this study we exclusively generate missing data in the learning set and leave the test set complete. By doing this, we want to measure the loss of prediction accuracy caused by constructing inaccurate tree(s)/forest(s) on incomplete/imputed datasets.
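The simulated design and the learning/test split described above can be reproduced with the following sketch, written in Python with the standard library only (the thesis used R). The correlated uniforms are obtained through a Gaussian common-factor construction, a hypothetical but standard choice; the factor loading of 0.81 is a rough tuning that yields pairwise uniform correlations near 0.8.

```python
import math, random

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_predictors(n=5500, seed=1):
    """Generate the seven predictors: X1~N(5,1), X6~N(0,1), X7~N(15,1),
    X2, X3, X4 correlated U(0,1) (pairwise correlation near 0.8), X5~Bin(2, 0.4)."""
    rng = random.Random(seed)
    a = math.sqrt(0.81)        # shared-factor loading; rough choice for ~0.8 uniform correlation
    rows = []
    for _ in range(n):
        x1 = rng.gauss(5.0, 1.0)
        f = rng.gauss(0.0, 1.0)    # common Gaussian factor for X2, X3, X4
        u = [normal_cdf(a * f + math.sqrt(1 - a * a) * rng.gauss(0.0, 1.0)) for _ in range(3)]
        x5 = sum(rng.random() < 0.4 for _ in range(2))   # two Bernoulli(0.4) trials
        x6 = rng.gauss(0.0, 1.0)
        x7 = rng.gauss(15.0, 1.0)
        rows.append([x1, u[0], u[1], u[2], x5, x6, x7])
    return rows

def split_learning_test(rows, n_learn=500, seed=2):
    """Random split into a learning set (500 cases) and a test set (the rest)."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    return [rows[i] for i in idx[:n_learn]], [rows[i] for i in idx[n_learn:]]
```

With `n=5500` and `n_learn=500` this reproduces the 500/5000 learning/test design; the response itself is omitted here because the exact simulation function is only partially recoverable from the source.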

We artificially generated missing data in the learning set by means of the MCAR, MAR and NMAR mechanisms, deleting data so that we ended up with 5%, 10% and 25% of the total fields missing. We deleted fields in three variables of the learning set: X_1, X_4 and X_5. We selected these variables to be missing in order to obtain a rich scenario for evaluating our techniques. Thus, we have a missing predictor variable (X_4) with a strong correlation with some other variables (X_2, X_3, Y), and a missing predictor variable (X_1) showing a very low correlation with all the other variables. To add more variety to the missing features, we also chose X_5, a categorical variable with three levels. The average number of variables deleted per case and the percentage of complete cases for each scenario are given in Tables 7 to 9. A case is called incomplete if it has at least one missing predictor variable. We tried to keep the proportion of missing data as similar as possible across the missing variables X_1, X_4 and X_5, while still reaching the different overall percentages of missing data established for this study (5%, 10% and 25%) for the MCAR, MAR and NMAR mechanisms. By trial and error, we achieved the defined percentages of missing data, and we more or less satisfied the condition of an equal proportion of missing data for each missing variable. For generating missing data with the MAR mechanism, determining variables are needed. Variable X_7 was selected as the determining variable for generating missing values in X_1: high values of X_7 make it more likely that X_1 is missing in that case. These two variables show a very low negative correlation, as illustrated in Table 2 (-0.015). The variable X_2 was chosen as the determining variable for X_4.
Only values of X_2 higher than or equal to 0.4 can make X_4 missing in that record (for 5% and 10% of data missing), and the higher the value, the more likely it is that X_4 is missing. When 25% of the data is missing, only values of X_2 higher than or equal to 0.25 can make X_4 missing in that record, and as before, the higher the value, the more likely it is that X_4 is missing. The correlation between them is rather high (almost 0.80). For X_5, the factor variable with three levels, X_6 was chosen as the determining variable: high values of X_6 make it more likely that X_5 is missing in that case. These determining variables were selected in order to obtain a rich scenario: one correlated determining variable (X_2 for X_4), one uncorrelated determining variable (X_7 for X_1) and one continuous variable that determines the missingness of the factor covariate (X_6 for X_5). One can also see that the determining variables are uncorrelated between

themselves. This helps to obtain a good number of missing fields under this design of determining variables. Data are NMAR if the mechanism resulting in their omission depends on their (unobserved) value. In our simulation study, we generated missing data under the NMAR mechanism on each of the target variables X_1, X_4 and X_5, as shown in Table 3:

Table 3: Structure of the generation of NMAR data for the simulated regression setting

Missing target variable | Type of variable | % Missing | NMAR missing data generation
X_1 | Continuous | 5% - 10% - 25% | High values of X_1 were more likely to be missing
X_4 | Continuous | 5% | Only values of X_4 ≥ 0.4 can make X_4 missing. The higher the value of X_4 (when X_4 ≥ 0.4), the more likely it is that X_4 results in a missing case.
X_4 | Continuous | 10% | The same as for 5% missing
X_4 | Continuous | 25% | Only values of X_4 ≥ … can make X_4 missing. The higher the value of X_4, the more likely it is that X_4 results in a missing case.
X_5 | Factor | 5% | prob.x5.0 = 0.25, prob.x5.1 = 0.08, prob.x5.2 =
X_5 | Factor | 10% | prob.x5.0 = 0.40, prob.x5.1 = , prob.x5.2 =
X_5 | Factor | 25% | prob.x5.0 = 0.75, prob.x5.1 = 0.68, prob.x5.2 =

Objectives and methodology

The main objective of this analysis is to predict the response variable generated through function (2), using seven simulated covariates, in the presence of different missing data scenarios, by fitting each of the ten techniques described in Table 1. These techniques deal with missing fields either by themselves or through a previous imputation procedure, and eventually use a tree-based model in the training process for generating predictions. The missing data scenarios whose performances will be analysed consist of all possible combinations of:

- the three different percentages of missing data (5%, 10% and 25%);
- the three different missingness mechanisms for generating the missing data (MCAR, MAR and NMAR).

As mentioned before, some of the covariates were relevant in the generation of the outcomes, whereas some others were not at all.
In fact, there were two pure noise variables, X_6 and X_7, generated and included in the response function with a parameter coefficient of 0.
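The MAR deletion scheme described above for X_4 (missingness driven by the observed determining variable X_2, with a threshold and a probability increasing in X_2) can be sketched as follows. This is an illustrative Python version under assumed settings: the scaling constant `rate`, which controls how fast the deletion probability grows above the threshold, is hypothetical and would be tuned to hit a target share of deleted fields.

```python
import random

def delete_mar(x2, x4, threshold=0.4, rate=0.5, seed=3):
    """Set X4 to None when the determining variable X2 is >= threshold, with a
    deletion probability that grows linearly in X2 above the threshold."""
    rng = random.Random(seed)
    out = []
    for u, v in zip(x2, x4):
        if u >= threshold and rng.random() < rate * (u - threshold) / (1.0 - threshold):
            out.append(None)   # X4 missing; missingness depends only on observed X2 -> MAR
        else:
            out.append(v)
    return out
```

Lowering `threshold` to 0.25 corresponds to the 25%-missing setting described in the text; an analogous function with X_4 itself as the driver would give the NMAR scheme of Table 3.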

In order to judge the performance of the proposed techniques, we use the test set estimate of the prediction error. The technique obtaining the lowest test set error estimate in a missing data scenario is the one showing the best prediction performance in that scenario. Therefore, the overall selected/preferred technique (if any) will be the one producing good predictions in most of the scenarios analysed. We will also compare how good or bad the estimate of the prediction error becomes when we fit each of the proposed techniques to data: with different missingness mechanisms in general, with a higher percentage of missing data within a specific missingness mechanism group, and with the same percentage of missing data but different missingness mechanisms. In this way, we can also assess how the imputation quality of a technique is affected by, for example, the missingness mechanism or the percentage of missing data.

Results of the analysis

The results of fitting the techniques proposed in Table 1 are shown in Table 10 (page 37). The first thing we notice under each of the three missingness mechanisms is that, in general, the higher the percentage of missing data, the higher (worse) the test error estimate of the prediction error. This makes sense: the smaller the amount of information and/or the more distorted the sample with respect to the true population, the less accurately we can predict the outcomes with any fitted model.
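For reference, the test set estimate of the prediction error used throughout is the mean squared prediction error over the 5000 test cases; in our notation:

```latex
\widehat{\mathrm{MSPE}} \;=\; \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \bigl( y_i - \hat{f}(x_i) \bigr)^2, \qquad n_{\text{test}} = 5000,
```

where \(\hat{f}\) denotes the tree(s)/forest(s)-based predictor trained on the (incomplete or imputed) learning set.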
We divide this section into two parts: the prediction assessment and the imputation quality assessment.

Assessment of prediction and selection of the model

Although we allow all these techniques to have a little bias by using trees, the key to improving prediction for each of them is to successfully average out sources of prediction variability, such as the noisy fits produced by trees and the variability under the imputation model (in cases where the method involves an imputation step). Since the variance of the generated response variable is equal to , we could expect the test error estimate of each of the fitted techniques to be high. As long as this estimate is lower than the sample variance of the response variable, we can still consider it worthwhile to fit that technique. However, once the test error estimate of a technique exceeds the sample variance of the response variable, it would be better to generate predictions with the mean value of the outcomes. This can arise whenever we have a lot of missing information in the predictor variables and/or whenever the training data give a

distorted picture of the true population (e.g. missing data in the covariates generated with the NMAR mechanism). The first situation occurred when we fitted Bagging to data with 25% of the fields deleted using the MCAR and MAR mechanisms. The prediction of Bagging was even worse when we deleted 25% of the data using the NMAR mechanism. One can see that the error estimate does not behave very differently for most of the techniques when we compare the MCAR and MAR mechanisms matched by the same percentage of missing data, except for techniques and scenarios such as: CART (10% and 25% of fields deleted); Bagging with ntree = 1000 and ntree = 500 (5%, 10% and 25% of fields missing); the CART-based MICE algorithm generating 500 imputed datasets and fitting CART on each of them (25% of fields deleted); the non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (CART and Random Forest fitted to each imputed bootstrap sample, 25% of fields deleted); and the naïve approaches CART (5%, 10% and 25% of fields deleted) and Random Forest with ntree = 1000 and ntree = 500 (25% of fields deleted), both fitted only on cases without missing data. In our prediction context, we do not clearly see an improvement in predictions by techniques applying multiple imputation before training when fitted to missing data generated by the MAR mechanism, as compared to fitting them to data with missing values generated by the MCAR mechanism. We notice a small improvement in predictions when fitting the CART-based MICE algorithm using CART to predict on data with 5% and 10% of fields MAR, compared to data with 5% and 10% of fields MCAR, but when the percentage of missingness grows to 25%, we get the opposite result.
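The worthwhile-to-fit criterion applied here — a technique is only useful if its test error beats always predicting the mean of the outcomes — amounts to comparing the MSPE against the error of a constant mean predictor (approximately the response variance). A small generic Python illustration, not tied to the thesis results:

```python
def mspe(y_true, y_pred):
    """Mean squared prediction error on the test set."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def beats_mean_baseline(y_test, y_pred, y_train):
    """True if the model's MSPE is below the error of always predicting
    the training mean (roughly the sample variance of the response)."""
    mean = sum(y_train) / len(y_train)
    baseline = mspe(y_test, [mean] * len(y_test))
    return mspe(y_test, y_pred) < baseline
```

A model failing this check, as Bagging did at 25% missingness above, adds nothing over the trivial mean prediction.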
In the case of the CART-based MICE algorithm (500 imputed datasets) using RF to predict, fitted to data with 5%, 10% and 25% of fields MAR, it shows a slightly better performance than when fitted to MCAR data with the same percentages of fields missing. Thus, when imputing the missing values of X_1, X_4 and X_5 with a complex and flexible regression (as the CART-based MICE algorithm does) and fitting CART and RF, predictions may be improved with 5% and 10% of data MAR as compared to 5% and 10% of data MCAR; at least this applies to our simulation dataset. In contrast, when imputing the missing data of X_1, X_4 and X_5 with a rigid regression (as the non-parametric imputation method does) and fitting both CART and RF, in all cases the predictions became worse with 10% and 25% of data MAR as compared to 10% and 25% of data MCAR. The bootstrap imputation methods and the CART-based MICE

algorithm, both fitted with CART, showed a worse performance for 25% of data MAR as compared to 25% of data MCAR. As mentioned before, the bootstrap imputation methods use best subset selection for continuous missing target variables and likelihood ratio tests for categorical missing target variables to choose the predictor variables in the regression imputation (although for simplicity in this study we only permit main effects models). Table 4 shows the predictors chosen when implementing the bootstrap imputation methods, for each of the variables declared missing in our simulation:

Table 4: Missing variables and their corresponding predictors selected for imputing with the bootstrap imputation methods

Missing covariate | Type of variable | Predictors chosen for imputing | Method used for choosing the predictors
X_1 | Continuous | X_2, X_3, X_4, X_5, Y | Best subset selection
X_4 | Continuous | X_1, X_2, X_3, X_5, Y | Best subset selection
X_5 | Factor (3 levels) | X_1 + X_2 + X_3 + X_4 + Y | Likelihood ratio tests

In general, the prediction performance of each of the proposed techniques fitted to MCAR and MAR data was better than when fitted to NMAR data, comparing scenarios matched by the same percentage of missing data. Hence, the fact that the observed values of each missing variable are effectively censored under the NMAR mechanism, so that the available sample may give a distorted picture of the true population, causes the predictive capability of almost all the proposed techniques to decrease substantially compared to their predictive capability under the MCAR and MAR mechanisms. Since in reality we do not know the censoring mechanism, the imputation step becomes much more difficult.
However, the CART-based MICE algorithm fitting CART for prediction shows a similar predictive performance in the scenarios with 5% and 10% of data NMAR as in the scenarios with 5% and 10% of data MCAR and MAR. Regarding the non-parametric bootstrap methods, the literature (He, 2006) states that they do not depend on any missing-data mechanism. However, judging from the empirical prediction results of our simulation, one can conclude that the predictions of the bootstrap imputation methods using CART or Random Forest for prediction, fitted under the NMAR scenarios, are clearly worse than under the MCAR and MAR scenarios (only slightly worse with 5% missing data). As mentioned before, these techniques also showed a worse predictive performance under the 10% of data MAR scenario as compared to

the 10% of data MCAR scenario. Thus, it seems that the non-parametric bootstrap methods do not perform equally well for all missingness mechanisms. Table 5 contains the two best techniques in terms of predictive performance, along with their respective test error estimates, for each of the possible combinations of scenarios proposed. The non-parametric bootstrap imputation method based on random draws from the predictive distribution (500 bootstrap samples) with Random Forest fitted to each imputed bootstrap sample (ntree = 500) reached the podium three times in our analysis of scenarios with missing data, each time in second place: in the scenario with 25% of data missing generated by the MCAR mechanism, and in the scenarios with 5% and 10% of data missing generated by the NMAR mechanism. The non-parametric bootstrap imputation method based on the mean of the predictive distribution (500 bootstrap samples) with Random Forest fitted to each imputed bootstrap sample (ntree = 500) reached the podium four times, with first and second places: it won in the scenarios with 5% and 10% of data missing generated by the MCAR mechanism, and achieved second place in the scenarios with 5% and 10% of data missing generated by the MAR mechanism. We might have expected this relatively strong performance, since this method may not only average out the variability incurred when fitting a tree on bootstrap samples (noisy fits), but may also reproduce the imputation uncertainty when imputing bootstrap samples, which may eventually be averaged out as well. This procedure shows good prediction results with 5% and 10% of missing data generated by the MCAR and MAR mechanisms.
However, compared to these situations, it fails to predict well with 25% of the fields MCAR or MAR. In addition, for the bootstrap-RF-based techniques, imputing with random draws from the predictive distribution constructed in each imputed bootstrap sample gives better results than imputing with the mean of the predictive distribution when 25% of the data is missing and/or under all NMAR scenarios. In contrast, for the bootstrap-RF-based techniques, imputing with the mean of the predictive distribution produced a better prediction performance than imputing with random draws from the predictive distribution in the scenarios with 5% and 10% of the

data MCAR and MAR (that technique was actually on the podium in those scenarios). The CART-based MICE algorithm producing 500 imputed datasets, with Random Forest fitted to each imputed dataset, is the technique that wins in most of the scenarios studied. Despite being very computationally intensive, this procedure seems to average out not only the variability incurred in fitting the (de-correlated) trees, each built on a bootstrap sample drawn from the (approximate) distribution of the s-th imputed dataset (s = 1, …, m), but also the variability incurred when fitting a Random Forest to each of the m different imputed datasets (variability due to the imputation model). In other words, there is variability on top of variability that is averaged out. Thus, empirically one could argue that this technique succeeded in averaging out these important sources of variability, and as a result it obtained the best predictions under almost all scenarios. In fact, the CART-based MICE algorithm using Random Forest for generating predictions shows very stable error estimates, with very little increase from scenario to scenario, excluding the NMAR mechanism, which causes the test error estimates of all the analysed techniques to increase. This technique is the winner in almost all the scenarios studied, showing the lowest estimate of the prediction error in all scenarios except those with 5% and 10% of data missing generated by the MCAR mechanism, and there only by a small margin (it actually achieved second place in both scenarios). In fact, in the scenario with 5% of data missing generated by the NMAR mechanism, we considered this same technique to be the winner (as noted with the same colour in Table 5), since the only change in that variant is that 10 datasets were imputed instead of 500.
For our simulated dataset, the procedure imputing 500 datasets shows a slightly better performance than the variant imputing only 10 datasets in almost all scenarios, except for the scenario with 5% of data missing generated by the NMAR mechanism. It can be the object of future studies to determine whether the number of imputed datasets in this technique significantly influences the prediction error estimate. We also want to point out that the prediction results of the Naïve CART and Naïve RF approaches were generally not good. As is generally known, these naïve approaches can sometimes give acceptable prediction results when only a small amount of data is missing. As we can see in Table 10 (page 37), the Naïve CART fitted with 5% and 10% of fields deleted by the MAR mechanism still shows good prediction results compared to CART fitted to the

full data. However, as the percentage of missing data increases to 25% within this mechanism, bad predictive results are obtained compared to CART fitted to the full data. We never obtain good prediction results when fitting this naïve approach under the MCAR and NMAR mechanisms. On the other hand, the Naïve RF approach fitted to data with 5% of missing values under the MCAR and MAR mechanisms gives the only cases in which good results are obtained compared to the RF technique fitted to the full data. As can be seen, if we are not aware of this and simply delete cases with any missing fields, predicting on the remaining data can have disastrous consequences, especially if the data do not have only a small amount of missing values. Thus, the technique selected to predict the outcome variable generated using function (1), based on 7 predictors generated as in (3), (4) and (5), which in general gave good prediction results in most of the analysed missingness scenarios, was Multiple Imputation for missing data via sequential Regression Trees, with 500 imputed datasets and Random Forest fitted to each imputed dataset (ntree = 500).

Table 5: The two techniques showing the best prediction performance in each of the scenarios analysed in this study. Next to the technique, we also present its MSPE estimate of the prediction error based on the test set.

Scenario: 5% of data missing, generated by the MCAR mechanism
  1st: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample (ntree = 500)
  2nd: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)

Scenario: 10% of data missing, generated by the MCAR mechanism
  1st: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample (ntree = 500)
  2nd: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)

Scenario: 25% of data missing, generated by the MCAR mechanism
  1st: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)
  2nd: Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample (ntree = 500)

Scenario: 5% of data missing, generated by the MAR mechanism
  1st: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)
  2nd: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample (ntree = 500)

Scenario: 10% of data missing, generated by the MAR mechanism
  1st: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)
  2nd: Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample (ntree = 500)

Scenario: 25% of data missing, generated by the MAR mechanism
  1st: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)
  2nd: Random Forest: impute missing values by median/mode (ntree = 500)

Scenario: 5% of data missing, generated by the NMAR mechanism
  1st: Multiple Imputation for missing data via sequential Regression Trees, 10 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)
  2nd: Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample (ntree = 500)

Scenario: 10% of data missing, generated by the NMAR mechanism
  1st: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)
  2nd: Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample (ntree = 500)

Scenario: 25% of data missing, generated by the NMAR mechanism
  1st: Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset (ntree = 500)
  2nd: Random Forest: impute missing values by median/mode (ntree = 1000)

Some other details

We would like to point out that the default settings of the rfImpute function in R (iter = 5, ntree = 300) and the settings iter = 10, ntree = 500 give us approximately the same prediction performance (for the ntree = 1000 and ntree = 500 options of RF).
Of course, the latter

happens whenever a comparison is possible, since the rfImpute function shows problems when we use its default settings under the scenarios with 25% of data missing generated by the MAR mechanism and 25% of data missing generated by the NMAR mechanism. Apparently, this function has an implementation problem when the data contain a large amount of missing values.

Table 6: Performance of tree-based techniques fitted to the complete datasets

% Data missing | Technique fitted | Technique settings | MSPE estimate of prediction error based on the test set
0% | Single Tree: Full Data Analysis | minsplit = 5, the rest default settings in R |
0% | Bagging | nbagg = |
0% | Bagging | nbagg = |
0% | Random Forest: Full Data Analysis | ntree = 1000, the rest default settings in R |
0% | Random Forest: Full Data Analysis | ntree = 500, the rest default settings in R |

Table 7: Scenarios analysed under the MCAR mechanism in our simulated dataset

% Data missing | Number of variables deleted per case on average | Percentage of complete cases
5% | | %
10% | | %
25% | | %

Table 8: Scenarios analysed under the MAR mechanism in our simulated dataset

% Data missing | Number of variables deleted per case on average | Percentage of complete cases
5% | | %
10% | | %
25% | | %

Table 9: Scenarios analysed under the NMAR mechanism in our simulated dataset

% Data missing | Number of variables deleted per case on average | Percentage of complete cases
5% | | %
10% | | %
25% | | %

Table 10: Estimated prediction performance in our simulated dataset of each of the proposed techniques across the different missing data scenarios. For each technique, the MSPE estimate of the prediction error based on the test set is reported for 5%, 10% and 25% of data missing under each of the MCAR, MAR and NMAR mechanisms. A hyphen (-) for a technique in a scenario indicates that no output could be obtained from the statistical software R.

Technique fitted | Technique settings in R
CART surrogate splits | minsplit = 5, the rest default settings in R
Random Forest: impute missing values by median/mode (option na.roughfix in library randomForest in R) | randomForest: ntree = 1000, the rest default settings in R
Random Forest: impute missing values by median/mode (option na.roughfix in library randomForest in R) | randomForest: ntree = 500, the rest default settings in R
Random Forest: impute missing values in predictor data using proximity matrix (rfImpute function in library randomForest in R) | rfImpute: default settings in R; randomForest: ntree = 1000, the rest default settings in R
Random Forest: impute missing values in predictor data using proximity matrix (rfImpute function in library randomForest in R) | rfImpute: default settings in R; randomForest: ntree = 500, the rest default settings in R
Random Forest: impute missing values in predictor data using proximity matrix (rfImpute function in library randomForest in R) | rfImpute: iter = 10, ntree = 500; randomForest: ntree = 1000, the rest default settings in R
Random Forest: impute missing values in predictor data using proximity matrix (rfImpute function in library randomForest in R) | rfImpute: iter = 10, ntree = 500; randomForest: ntree = 500, the rest default settings in R
Bagging | nbagg = 1000, the rest default settings in R
Bagging | nbagg = 500, the rest default settings in R
Multiple Imputation for missing data via sequential Regression Trees, 10 imputed datasets, and CART fitted to each imputed dataset | treemi function: ITER = 10, mincut = 5, mindev = 1e-04, startcut = 10, startdev = 1e-04; CART: minsplit = 5, the rest default settings in R
Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and CART fitted to each imputed dataset | treemi function: ITER = 10, mincut = 5, mindev = 1e-04, startcut = 10, startdev = 1e-04; CART: minsplit = 5, the rest default settings in R
Multiple Imputation for missing data via sequential Regression Trees, 10 imputed datasets, and Random Forest fitted to each imputed dataset | treemi function: ITER = 10, mincut = 5, mindev = 1e-04, startcut = 10, startdev = 1e-04; Random Forest: ntree = 500, the rest default settings in R
Multiple Imputation for missing data via sequential Regression Trees, 500 imputed datasets, and Random Forest fitted to each imputed dataset | treemi function: ITER = 10, mincut = 5, mindev = 1e-04, startcut = 10, startdev = 1e-04; Random Forest: ntree = 500, the rest default settings in R
Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (500 bootstrap samples), and CART fitted to each imputed bootstrap sample | CART: minsplit = 5, the rest default settings in R

Table 10 (continued):

Technique fitted | Technique settings in R
Non-parametric bootstrap method to impute missing data based on the mean of the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample | randomForest: ntree = 500, the rest default settings in R
Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution (500 bootstrap samples), and CART fitted to each imputed bootstrap sample | CART: minsplit=5, the rest default settings in R
Non-parametric bootstrap method to impute missing data based on random draws from the predictive distribution (500 bootstrap samples), and Random Forest fitted to each imputed bootstrap sample | randomForest: ntree = 500, the rest default settings in R
Naïve approach: discarding incomplete observations before training with CART | CART: minsplit=5, the rest default settings in R
Naïve approach: discarding incomplete observations before training with Random Forest | randomForest: ntree = 1000, the rest default settings in R
Naïve approach: discarding incomplete observations before training with Random Forest | randomForest: ntree = 500, the rest default settings in R

Real datasets

Regression setting: Abalone Dataset

This dataset was taken from the website of the Machine Learning Repository and comes from an original (non-machine-learning) study (Nash et al., 1994). The complete dataset consists of 4177 cases. It contains eight physical attributes of abalones, which serve as the feature variables. Each case also contains the number of rings of the corresponding abalone, which is the value to predict. A brief description of the predictor and outcome variables is given in Table 11.

Table 11: Brief description of the variables gathered in our Abalone Dataset

Name | Attribute type | Measurement unit | Description
Sex | Nominal | - | M, F, and I (infant)
Length | Continuous | mm | Longest shell measurement
Diameter | Continuous | mm | Perpendicular to length
Height | Continuous | mm | With meat in shell
Whole weight | Continuous | grams | Whole abalone
Shucked weight | Continuous | grams | Weight of meat
Viscera weight | Continuous | grams | Gut weight (after bleeding)
Shell weight | Continuous | grams | After being dried
Rings | Integer | - | +1.5 gives the age in years

The objective of the analysis of this dataset is to predict the age of the abalone from the eight physical attributes available (seven continuous measurements and one nominal feature with three levels). In practice, the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope. The Abalone Dataset already provides the number of rings for each abalone, from which we can compute the age of the mollusc in years by adding 1.5. The other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (and hence food availability), might be required to solve the problem fully, but is not available in this dataset.

Test set and learning set

The original dataset of 4177 observations is split at random into a learning set and a test set, in such a way that the test set contains enough cases to estimate the prediction error accurately. Thus, 500 observations were randomly assigned to the learning set and 3177 observations were randomly assigned to the test set.

Missing data

In the original data, the cases with missing values had already been removed (the majority having the predicted value missing) and the ranges of the continuous variables had been scaled for use with an ANN (by dividing by 200).
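The random learning/test split and the test-set estimate of the mean squared prediction error (MSPE) used throughout can be sketched as follows. This is an illustrative Python stand-in for the R workflow; the data and the trivial mean predictor below are hypothetical placeholders for an actual tree-based fit:

```python
import numpy as np

rng = np.random.default_rng(2024)

n = 4177
X = rng.normal(size=(n, 8))           # hypothetical predictor matrix
y = X[:, 0] + rng.normal(size=n)      # hypothetical outcome ("rings")

# split at random: 500 cases for learning, 3177 cases for testing
perm = rng.permutation(n)
learn = perm[:500]
test = perm[500:500 + 3177]

# any fitted technique yields predictions y_hat on the test set;
# here a trivial mean predictor stands in for a tree-based model
y_hat = np.full(test.size, y[learn].mean())

# test-set estimate of the mean squared prediction error
mspe = np.mean((y[test] - y_hat) ** 2)
print(round(mspe, 2))
```

Each of the techniques compared in this study is scored with exactly this kind of test-set MSPE, only with the tree-based model replacing the placeholder predictor.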
In order to assess the loss of prediction accuracy caused by constructing tree-based prediction methods from incomplete/imputed data, we artificially generate missing values in some predictor variables of the training set, and we check the performance of each technique on the 3177-case test sample of complete data vectors. We generate missingness by means of the MCAR, MAR and NMAR mechanisms, deleting 5%, 10% and 25% of the total fields for each of these

mechanisms. Specifically, we delete fields in four covariates of the Abalone dataset: Length, Diameter, Shucked weight and Viscera weight. All the covariates in this dataset are highly correlated with each other, as shown in Table 13, so it will be interesting to see how our techniques perform in this situation. Table 18 to Table 21 show the number of variables deleted per case on average and the corresponding percentage of complete cases for each of the proposed scenarios. A case is incomplete if it has at least one missing predictor variable. We tried to give each of the missing variables the same proportion of missing data as far as possible, while still reaching the overall percentages of missing data established for this study (5%, 10% and 25%). By trial and error, we reached these overall percentages, and we approximately equalized the proportion of missing data across the missing variables. For generating missing data with the MAR mechanism, determining variables are needed. For missing data in Length, the determining variable was Height: high values of Height made it more likely for Length to be missing in that case. These two variables showed a high correlation. The variable Whole weight was chosen as the determining variable for Diameter. Only (standardized) values of Whole weight greater than or equal to 0 could make Diameter missing for that record; the higher the (standardized) value of Whole weight (when Whole weight ≥ 0), the more likely it becomes that Diameter is missing. The correlation between them is rather high (= 0.93). For Shucked weight, we let the missingness depend on the observed values of Shell weight: high values of Shell weight made it more likely for Shucked weight to be missing in that case.
These two variables are also highly correlated. Finally, we let the missingness of the variable Viscera weight depend on the observed values of the factor variable Sex, which has three levels: cases with the value female for Sex are more likely to have Viscera weight missing. All the determining variables are uncorrelated with each other as well. A brief summary of the generation of missing data under the MAR mechanism for the Abalone dataset is shown in Table 12.

Table 12: Brief description of the structure of missing data generation under the MAR mechanism for the Abalone dataset

Missing target variable | Type of variable | % Missing | Determining variable (type of variable) | MAR missing data generation
Length | Continuous | 5% - 10% - 25% | Height (continuous) | High values of Height make it more likely that Length is missing in that case
Diameter | Continuous | 5% - 10% - 25% | Whole weight (continuous) | Only values of Whole weight greater than or equal to 0 can make Diameter missing; the higher the value of Whole weight (when Whole weight ≥ 0), the more likely it is that Diameter is missing
Shucked weight | Continuous | 5% - 10% - 25% | Shell weight (continuous) | High values of Shell weight make it more likely that Shucked weight is missing in that case
Viscera weight | Continuous | 5% - 10% - 25% | Sex (factor, 3 levels) | 5%: prob.sex.female = 0.22, prob.sex.infant = 0.08, prob.sex.male = ; 10%: prob.sex.female = 0.45, prob.sex.infant = 0.18, prob.sex.male = ; 25%: prob.sex.female = 0.96, prob.sex.infant = 0.65, prob.sex.male = 0.15

To generate missing values under the NMAR mechanism, we let the missingness of each target variable depend on its own (unobserved) values: high values of the four missing target variables Length, Diameter, Shucked weight and Viscera weight are more likely to be missing. Since the missing covariates are highly positively correlated, letting the missingness depend exclusively on high values of each of the four target variables tends to concentrate missing fields in the same cases: if one variable is missing in a case, the others are very likely to be missing as well, because they probably also take high values. This missingness generation is a form of stochastic right censoring and, as can be seen from Table 20, it leaves a larger number of complete cases than the same scenarios in our previously simulated dataset.
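The MAR scheme described above, in which the probability of a field being missing increases with an observed determining variable, can be sketched as follows. This is an illustrative Python stand-in for the thesis's R code, using hypothetical data and a hypothetical logistic link:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
height = rng.normal(size=n)                  # observed determining variable
length = 0.8 * height + rng.normal(size=n)   # correlated missing target variable

# MAR: the probability that Length is missing grows with the observed Height,
# so the mechanism depends only on observed data
p_miss = 1 / (1 + np.exp(-(height - 1.0)))
mask = rng.random(n) < p_miss
length_obs = np.where(mask, np.nan, length)

# cases with high Height lose Length more often than cases with low Height
high = height > np.median(height)
print(round(mask[high].mean(), 2), round(mask[~high].mean(), 2))
```

The same template yields NMAR missingness if the deletion probability is driven by the target variable's own value instead of a determining variable.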
To compare this with a situation where more incomplete cases are present under the NMAR mechanism, we generate more incomplete observations with a form of mixed censoring: values of Length and Shucked weight are stochastically right censored while values of Diameter and Viscera weight are stochastically left censored, i.e. high values of Length and Shucked weight are more likely to be missing, and low values of Diameter and Viscera weight are more likely to be missing. This indeed yields more incomplete records. Table 21 presents a summary of the percentage of complete cases as well as the number of variables deleted per case on average for 5%, 10% and 25% of values missing under this mixed-censoring way of generating NMAR data.
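A minimal sketch of this mixed stochastic censoring (the missingness depends on each variable's own, possibly unobserved, value, which makes it NMAR); again an illustrative Python stand-in with hypothetical data and a hypothetical logistic link:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000
# columns stand in for: Length, Shucked weight, Diameter, Viscera weight
X = rng.normal(size=(n, 4))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X_mis = X.copy()
# right censoring: high values of Length and Shucked weight more likely missing
for j in (0, 1):
    X_mis[rng.random(n) < sigmoid(X[:, j] - 1.5), j] = np.nan
# left censoring: low values of Diameter and Viscera weight more likely missing
for j in (2, 3):
    X_mis[rng.random(n) < sigmoid(-X[:, j] - 1.5), j] = np.nan

pct_complete = 100 * (~np.isnan(X_mis).any(axis=1)).mean()
print(round(pct_complete, 1))
```

Because the right- and left-censored variables are censored in opposite tails, the missing fields are spread over more cases than under pure right censoring, producing more incomplete records.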

Objective and methodology

The main objective of this analysis is to predict the age of the abalone (i.e. its number of rings) using the available features under the scenarios described in the Objectives section: with missing data in the predictor variables generated by different missingness mechanisms and with different percentages of missing data. In particular, we artificially generate missing data in the training set by means of the MCAR, MAR and NMAR mechanisms, deleting data so that we end up with 5%, 10% and 25% of the total fields missing. As before, we study all possible combinations of missingness mechanism and percentage of missing data. To this incomplete Abalone dataset we fit each of the ten techniques described in Table 1 (Approach); each of them deals with missing data, either by itself or through a prior imputation procedure, and eventually uses a tree-based model in the training process to generate predictions. Prediction accuracy for each technique is assessed by the test-set estimate of the prediction error. The technique producing the best prediction results in a given missing data scenario is the one with the lowest test-set error estimate, so the overall selected/preferred technique (if any) will be the one showing good prediction results (a low test error estimate) in most of the scenarios analysed. We also compare how the estimate of the prediction error for each technique changes when we fit it to data with different missingness mechanisms in general, with a higher percentage of missing data within a specific missingness mechanism, and with the same percentage of missing data but different missingness mechanisms. This way, we can also assess how the imputation/surrogate quality of each technique is affected by, e.g., the missingness mechanism and the percentage of missing data.
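The selection rule above, lowest test-set MSPE within a scenario, can be stated as a one-liner; the technique names and error values below are purely hypothetical:

```python
# hypothetical test-set MSPE estimates for three techniques in one scenario
mspe = {
    "CART surrogate splits": 0.71,
    "RF + median/mode imputation": 0.64,
    "Bagging": 0.66,
}

# the preferred technique in this scenario is the one with the lowest MSPE
best = min(mspe, key=mspe.get)
print(best)  # -> RF + median/mode imputation
```

Repeating this over all nine scenarios (three mechanisms times three missing-data percentages) and looking for the technique that wins, or stays close to the winner, most often gives the overall preferred technique.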
One of the reasons for selecting this dataset is that it shows high correlations among the predictor variables, as shown in Table 13, which could give rise to a multicollinearity problem. Specifically, we want to see the effect of highly correlated predictors on the predictive capability of each of the ten techniques described in Table 1. However, a stricter analysis of the effect of the relations between covariates on the predictive capabilities of our techniques can be done by means of simulations, where we control those relations ourselves. This and other stricter analyses are left for future studies.
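The kind of correlation structure reported in Table 13 can be computed as below; Python/NumPy is used for illustration (the thesis obtains the matrix in R), and the deliberately correlated "size" measurements are hypothetical stand-ins for the abalone covariates:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# hypothetical measurements driven by a common latent "size", hence correlated
size = rng.normal(size=n)
length = size + 0.1 * rng.normal(size=n)
diameter = size + 0.1 * rng.normal(size=n)
whole_weight = size + 0.2 * rng.normal(size=n)

X = np.column_stack([length, diameter, whole_weight])
corr = np.corrcoef(X, rowvar=False)   # 3 x 3 correlation matrix, as in Table 13
print(np.round(corr, 2))
```

Off-diagonal entries close to 1, as in the Abalone data, are exactly the situation in which multicollinearity concerns arise.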

Table 13: Correlation structure of the Abalone dataset, output from R

One can also see from Table 13 that there is a moderate linear relation between the predictors and the outcome variable (Rings). Tree-based methods generally give better predictions when there is a non-linear relationship between predictors and the outcome variable, owing to the flexible nature of the tree fit (usually we do not even know in advance which shape the fit will take). Thus, there might be one or more data mining methods that predict better on this dataset than CART, Bagging or Random Forest, and that deal better with the potential instability that multicollinearity may produce. Even in the presence of missing values, some techniques may fit better than the CART/RF-based techniques involving a prior imputation step proposed in this study¹. However, in this study we want to focus on the overall performance of techniques that use trees for prediction. Evidently, there may be other tree-based techniques involving variants of the imputation step, but we will limit the scope of our study to these.

Data manipulation

We started by standardizing our data, since some variables (e.g. Whole weight and Rings) present a higher variability than others (as illustrated in Table 14). In addition, the fact that some variables were measured in different units (as shown in Table 11) makes standardization of the data a recommended step.

Table 14: Variance-Covariance Matrix of the Abalone dataset, output from R

¹ Of course, in that case we will first have to complete the dataset by using the corresponding imputation procedure (for comparison purposes) before fitting the technique being tested
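The standardization step can be sketched as follows (Python for illustration; the thesis performs it in R, and the two columns below are hypothetical stand-ins for variables on very different scales):

```python
import numpy as np

rng = np.random.default_rng(5)
# hypothetical columns on very different scales, e.g. a weight and a ring count
X = np.column_stack([
    200 * rng.random(100),            # "Whole weight"-like variable
    rng.integers(1, 30, 100)          # "Rings"-like variable
]).astype(float)

# standardize: subtract the column mean, divide by the column standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z.mean(axis=0), 0), np.allclose(Z.std(axis=0), 1))  # -> True True
```

After this step every variable has mean 0 and standard deviation 1, so differences in measurement units no longer dominate distance- or variance-based computations.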


More information

Handling Data with Three Types of Missing Values:

Handling Data with Three Types of Missing Values: Handling Data with Three Types of Missing Values: A Simulation Study Jennifer Boyko Advisor: Ofer Harel Department of Statistics University of Connecticut Storrs, CT May 21, 2013 Jennifer Boyko Handling

More information

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Classification: Basic Concepts, Decision Trees, and Model Evaluation Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Warehousing and Mining Lecture 4 by Hossen Asiful Mustafa Classification: Definition Given a collection of records (training set

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Advanced and Predictive Analytics with JMP 12 PRO. JMP User Meeting 9. Juni Schwalbach

Advanced and Predictive Analytics with JMP 12 PRO. JMP User Meeting 9. Juni Schwalbach Advanced and Predictive Analytics with JMP 12 PRO JMP User Meeting 9. Juni 2016 -Schwalbach Definition Predictive Analytics encompasses a variety of statistical techniques from modeling, machine learning

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

More information

Machine Learning. A. Supervised Learning A.7. Decision Trees. Lars Schmidt-Thieme

Machine Learning. A. Supervised Learning A.7. Decision Trees. Lars Schmidt-Thieme Machine Learning A. Supervised Learning A.7. Decision Trees Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 /

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Topics in Machine Learning-EE 5359 Model Assessment and Selection

Topics in Machine Learning-EE 5359 Model Assessment and Selection Topics in Machine Learning-EE 5359 Model Assessment and Selection Ioannis D. Schizas Electrical Engineering Department University of Texas at Arlington 1 Training and Generalization Training stage: Utilizing

More information

Allstate Insurance Claims Severity: A Machine Learning Approach

Allstate Insurance Claims Severity: A Machine Learning Approach Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has

More information

Ensemble Learning: An Introduction. Adapted from Slides by Tan, Steinbach, Kumar

Ensemble Learning: An Introduction. Adapted from Slides by Tan, Steinbach, Kumar Ensemble Learning: An Introduction Adapted from Slides by Tan, Steinbach, Kumar 1 General Idea D Original Training data Step 1: Create Multiple Data Sets... D 1 D 2 D t-1 D t Step 2: Build Multiple Classifiers

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN NonMetric Data Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical

More information

Model combination. Resampling techniques p.1/34

Model combination. Resampling techniques p.1/34 Model combination The winner-takes-all approach is intuitively the approach which should work the best. However recent results in machine learning show that the performance of the final model can be improved

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

High dimensional data analysis

High dimensional data analysis High dimensional data analysis Cavan Reilly October 24, 2018 Table of contents Data mining Random forests Missing data Logic regression Multivariate adaptive regression splines Data mining Data mining

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea Chapter 3 Bootstrap 3.1 Introduction The estimation of parameters in probability distributions is a basic problem in statistics that one tends to encounter already during the very first course on the subject.

More information

CSC 411 Lecture 4: Ensembles I

CSC 411 Lecture 4: Ensembles I CSC 411 Lecture 4: Ensembles I Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 04-Ensembles I 1 / 22 Overview We ve seen two particular classification algorithms:

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Missing Data Missing Data Methods in ML Multiple Imputation

Missing Data Missing Data Methods in ML Multiple Imputation Missing Data Missing Data Methods in ML Multiple Imputation PRE 905: Multivariate Analysis Lecture 11: April 22, 2014 PRE 905: Lecture 11 Missing Data Methods Today s Lecture The basics of missing data:

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Predicting Rare Failure Events using Classification Trees on Large Scale Manufacturing Data with Complex Interactions

Predicting Rare Failure Events using Classification Trees on Large Scale Manufacturing Data with Complex Interactions 2016 IEEE International Conference on Big Data (Big Data) Predicting Rare Failure Events using Classification Trees on Large Scale Manufacturing Data with Complex Interactions Jeff Hebert, Texas Instruments

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

An Empirical Comparison of Ensemble Methods Based on Classification Trees. Mounir Hamza and Denis Larocque. Department of Quantitative Methods

An Empirical Comparison of Ensemble Methods Based on Classification Trees. Mounir Hamza and Denis Larocque. Department of Quantitative Methods An Empirical Comparison of Ensemble Methods Based on Classification Trees Mounir Hamza and Denis Larocque Department of Quantitative Methods HEC Montreal Canada Mounir Hamza and Denis Larocque 1 June 2005

More information

Stat 342 Exam 3 Fall 2014

Stat 342 Exam 3 Fall 2014 Stat 34 Exam 3 Fall 04 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed There are questions on the following 6 pages. Do as many of them as you can

More information

Part I. Instructor: Wei Ding

Part I. Instructor: Wei Ding Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set

More information

Package CALIBERrfimpute

Package CALIBERrfimpute Type Package Package CALIBERrfimpute June 11, 2018 Title Multiple Imputation Using MICE and Random Forest Version 1.0-1 Date 2018-06-05 Functions to impute using Random Forest under Full Conditional Specifications

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

arxiv: v1 [stat.me] 29 May 2015

arxiv: v1 [stat.me] 29 May 2015 MIMCA: Multiple imputation for categorical variables with multiple correspondence analysis Vincent Audigier 1, François Husson 2 and Julie Josse 2 arxiv:1505.08116v1 [stat.me] 29 May 2015 Applied Mathematics

More information

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Decision Tree CE-717 : Machine Learning Sharif University of Technology Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete

More information

Cyber attack detection using decision tree approach

Cyber attack detection using decision tree approach Cyber attack detection using decision tree approach Amit Shinde Department of Industrial Engineering, Arizona State University,Tempe, AZ, USA {amit.shinde@asu.edu} In this information age, information

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

A Systematic Overview of Data Mining Algorithms

A Systematic Overview of Data Mining Algorithms A Systematic Overview of Data Mining Algorithms 1 Data Mining Algorithm A well-defined procedure that takes data as input and produces output as models or patterns well-defined: precisely encoded as a

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION Introduction CHAPTER 1 INTRODUCTION Mplus is a statistical modeling program that provides researchers with a flexible tool to analyze their data. Mplus offers researchers a wide choice of models, estimators,

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Learning and Evaluating Classifiers under Sample Selection Bias

Learning and Evaluating Classifiers under Sample Selection Bias Learning and Evaluating Classifiers under Sample Selection Bias Bianca Zadrozny IBM T.J. Watson Research Center, Yorktown Heights, NY 598 zadrozny@us.ibm.com Abstract Classifier learning methods commonly

More information