Tools for Statistical Analysis with Missing Data: Application to a Large Medical Database
Cristian Preda a, Alain Duhamel a, Monique Picavet a, Tahar Kechadi b

a Faculté de Médecine, France
b Department of Computer Science, University College Dublin

Abstract

Missing data are a common feature of large data sets in general and of medical data sets in particular. Depending on the goal of the statistical analysis, various techniques can be used to tackle this problem. Imputation methods substitute plausible or predicted values for the missing values, so that the completed data can then be analysed with any chosen data mining procedure. In this work, we study imputation in the context of multivariate data and evaluate a number of methods available in today's standard statistical software packages. Imputation using multivariate classification, multiple imputation and imputation by factorial analysis are compared using simulated data and a large medical database (from the diabetes field) with numerous missing values. Our main result is a control chart for assessing data quality after the imputation process. To this end, we developed an algorithm whose input is a set of parameters describing the underlying data (e.g., covariance matrix, distribution) and whose output is a chart plotting the change in the prediction error against the proportion of missing values. The chart is built by means of an iterative algorithm involving four steps: (1) a sample of simulated data is drawn using the input parameters; (2) missing values are generated at random; (3) an imputation method is used to fill in the missing data; and (4) the prediction error is computed. Steps 1 to 4 are repeated in order to estimate the distribution of the prediction error. The control chart was established for the three imputation methods studied here, assuming a multivariate normal distribution of the data.
The use of this tool on a large medical database was then investigated. We show how the control chart can be used to assess the quality of the imputation process in the pre-processing step upstream of data mining procedures.

Keywords: Statistical models; Databases; Data mining; Missing values; Imputation.

1. Introduction

Dealing with missing data is a major problem in Knowledge Discovery in Databases (KDD). This type of operation must be performed with caution in order to avoid degrading the performance of data mining procedures. The area has attracted much research interest over recent years, and the mainstream statistical analysis software packages are starting to offer solutions (Celeux [1], Hox [2]). There are three main strategies for dealing with missing data in the KDD process. The first consists in eliminating incomplete
observations, and it has two major limitations. Firstly, the resulting information loss can be considerable if many of the variables have missing values for various individuals. Secondly, this method runs the risk of introducing bias if the process behind the missing values is not completely random (Missing Completely At Random, MCAR [3]), i.e. if the subset analysed is not representative of the sample as a whole. The second strategy consists in using a method specifically adapted to the data mining algorithm employed (for example, CART [4]). These methods implicitly presuppose an MCAR mechanism, and most have not been critically appraised. The third strategy is imputation: the missing data problem is tackled in the pre-processing step of the KDD process by replacing each missing value with a predicted value. This method is particularly well suited to the KDD process, since the completed database can then be analysed with any chosen data mining procedure. The goal of the present work is to compare different imputation methods according to their predictive power, as a function of the proportion of missing values in the database analysed. We consider here the case of quantitative variables and MAR-type missing data (Missing At Random [3]). The process is said to be MAR if, conditional on the observed data for certain variables X, the occurrence of a missing value for Y is random. There are numerous imputation methods: single imputation using the mean, median or mode (Schafer [5]), regression-based methods (Horton [6]) and more complex methods such as those based on classification procedures (Benali [7]), the NIPALS algorithm (Tenenhaus [8]), multiple imputation (Rubin [9], [10], Allison [11], Donzé [12]) or association rules (Ragel [13]). Here, we studied three methods: multiple imputation, imputation via the NIPALS algorithm and imputation based on classification.
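The bias risk of the first strategy can be made concrete with a small simulation. In the sketch below (an illustrative Python example; the MAR mechanism, missingness probabilities and variable names are all assumptions, not taken from the paper), x is deleted more often when the observed y is large, so missingness is MAR but not MCAR; eliminating the incomplete observations then underestimates the variance of y, because the retained subset is not representative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y = rng.normal(size=n)                       # fully observed variable
x = 0.8 * y + rng.normal(scale=0.6, size=n)  # correlated covariate
# MAR mechanism: x is missing with probability 0.5 when y > 1, else 0.1,
# so missingness depends only on the observed y.
p_miss = np.where(y > 1, 0.5, 0.1)
x_obs = np.where(rng.random(n) < p_miss, np.nan, x)
complete = ~np.isnan(x_obs)                  # rows kept by strategy 1
var_available = y.var()                      # variance from all y values
var_complete = y[complete].var()             # complete cases only: biased low
```

Because the kept rows underrepresent large values of y, var_complete falls noticeably below var_available, which is exactly the non-representativeness described above.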
These methods are suitable in cases where the missing values are present across several of the database's variables, and all can be implemented easily with standard statistical software. Following a brief introduction to each method in section 2, section 3 introduces an indicator based on the mean square imputation error, which serves as a criterion for comparison. The methods are compared using simulated data and the large DiabCare medical database, generated under the auspices of the WHO for improving care provision to diabetic patients (section 4).

2. On multiple imputation, NIPALS and classification

2.1 Multiple imputation

One of the great disadvantages of single imputation (i.e. imputation of a single value) is that one is not aware of the uncertainty in predicting the unknown missing value. This can lead to significant bias - for example, systematic underestimation of the variance of the "imputed" variable. Multiple imputation enables this uncertainty to be taken into account when predicting the missing values. The basic idea, developed by Rubin [9], is as follows: (a) impute the missing values using a suitable model which incorporates random variation; (b) repeat this operation m times (3 to 5 times, in general) in order to obtain m complete data files. Statistical analyses are then carried out on each completed file, and the results are combined in order to obtain the final model. In our simulation study (section 3), a missing value was predicted by the mean of the m predicted values generated in step (b). Different models can be used for imputation of the missing values, such as the MCMC (Markov Chain Monte Carlo) model and models based on the EM (Expectation Maximization) algorithm. Version 8.2 of SAS includes two new procedures which enable multiple imputation (the MI and MIANALYZE procedures [14]). The MVA (Missing Value Analysis) module in SPSS only performs single imputation (m=1).
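Steps (a) and (b) can be sketched as follows. This is a minimal Python illustration, not the SAS MI procedure: the regression-with-noise imputation model, m = 5 and all variable names are assumptions made for the example. A variable y is imputed from a covariate x by linear regression plus a random residual draw, the draw is repeated m times, and the m predictions are averaged as in our simulation study:

```python
import numpy as np

def multiple_imputation(x, y, m=5, rng=None):
    """Impute missing values of y from x by regression with random noise,
    repeated m times; return the m completed copies of y."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    # Fit a simple linear regression y ~ x on the observed cases.
    b, a = np.polyfit(x[obs], y[obs], 1)          # slope, intercept
    resid_sd = np.std(y[obs] - (a + b * x[obs]))
    completed = []
    for _ in range(m):
        y_imp = y.copy()
        miss = np.isnan(y_imp)
        # Add random variation so the imputation uncertainty is preserved.
        y_imp[miss] = a + b * x[miss] + rng.normal(0, resid_sd, miss.sum())
        completed.append(y_imp)
    return completed

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
y[:40] = np.nan                                   # 20% of y deleted
copies = multiple_imputation(x, y, m=5, rng=1)
# As in section 3, a point prediction is the mean of the m draws.
point = np.mean([c[:40] for c in copies], axis=0)
```

Each of the m completed files could then be analysed separately and the results pooled; averaging the draws, as here, gives only the point prediction used in our comparison.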
2.2 The NIPALS algorithm

The aim of the NIPALS (nonlinear iterative partial least squares) algorithm is to perform principal component analysis in the presence of missing data (Tenenhaus [8]). Given a rectangular data table of size n × p, let us denote by X = {x_ij}, 1 <= i <= n, 1 <= j <= p, the matrix representing the observed values of the variables x.j for n statistical units. If X is of rank a, then the decomposition formula for the principal component analysis of X is

X = Σ_{h=1..a} t_h p_h',

where t_h = (t_h1, ..., t_hi, ..., t_hn) and p_h = (p_h1, ..., p_hj, ..., p_hp) are the principal components and principal factors, respectively. The NIPALS algorithm therefore estimates a missing value in cell (i, j) as

x̂_ij = Σ_{l=1..k} t_li p_lj,

where k (k <= a) is determined by cross-validation. Implementation of the NIPALS algorithm is very simple, since it is based only on simple linear regressions. The complexity of the algorithm is of order O(a·n·p·C), where C is the number of iterations required for convergence. The NIPALS algorithm is implemented in the SIMCA-P software (release 10) but not yet in SAS. We programmed a C application which implements NIPALS and used it to compare the method with the other approaches.

2.3 Imputation by classification

The principle is to classify the data as a whole using the k-means clustering method, taking the missing values into account in the distance calculations via an appropriate metric (the FASTCLUS procedure in SAS). Each individual is assigned to a unique cluster, and a missing value for the variable X is then replaced by the mean of X calculated over all the individuals in the cluster.

3. Comparison of imputation methods

Imputation quality depends on a range of parameters, the most important of which are i) the number of missing data, ii) the distribution of the random vector which describes the data table and iii) the distribution of the missing values.
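To make the rank-k reconstruction of section 2.2 concrete, here is a sketch of one-component-at-a-time NIPALS imputation in Python. Our own implementation was in C; this NumPy version, with its fixed iteration count, column-mean centring and deflation scheme, is an illustrative assumption rather than the original code. Each loading and each score is obtained by a simple linear regression restricted to the observed cells:

```python
import numpy as np

def nipals_impute(X, k=1, n_iter=100):
    """Impute missing cells of X by a rank-k NIPALS reconstruction:
    x_hat[i, j] = sum_l t[l, i] * p[l, j], fitted on observed cells only."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)                        # observed cells
    mu = np.nanmean(X, axis=0)
    R = np.where(mask, X - mu, 0.0)            # centred residuals, 0 at holes
    X_hat = np.tile(mu, (X.shape[0], 1))
    for _ in range(k):
        t = R[:, 0].copy()                     # start from the first column
        for _ in range(n_iter):
            # Loadings: slope of each column on t, observed cells only.
            p = (R * t[:, None]).sum(axis=0) / (mask * t[:, None] ** 2).sum(axis=0)
            p /= np.linalg.norm(p)
            # Scores: slope of each row on p, observed cells only.
            t = (R * p).sum(axis=1) / (mask * p ** 2).sum(axis=1)
        X_hat += np.outer(t, p)                # add the h-th component
        R = np.where(mask, R - np.outer(t, p), 0.0)
    return np.where(mask, X, X_hat)

# Rank-1 toy table: row i is a_i * (1, 2, 3); delete one cell of known value.
X = np.outer(np.arange(1.0, 11.0), np.array([1.0, 2.0, 3.0]))
X[0, 0] = np.nan                               # true value was 1.0
X_imp = nipals_impute(X, k=1)
```

Because the toy table is rank 1, a single component recovers the deleted cell to within the distortion introduced by the observed-cell column means.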
Let us suppose that the data are normally distributed with zero mean and covariance matrix S, and that the missing values are uniformly distributed. In order to assess and compare the imputation methods as a function of the proportion of missing data, we suggest a method comprising the following steps:

Step 1. Using simulations, one generates a table T of n lines (individuals) and p columns (variables) representing n realizations of the random vector X ~ N(0, S) (N designates a normal distribution). We chose n=100 so as to obtain a sufficiently large sample.

Step 2. One generates a fixed percentage p_m of missing values distributed uniformly within table T.

Step 3. The missing values are imputed using the chosen technique (here, one of the three above-mentioned methods).

Step 4. In order to measure the precision of the imputation, one calculates the mean square error (MSE) defined by:

MSE = (1 / (n·p·p_m)) · Σ_{j=1..p} [ Σ_{i=1..n} (x_ij - x̂_ij)² / Σ_{i=1..n} (x_ij - x̄_j)² ]
where n designates the number of individuals, p the number of variables and p_m the proportion of missing values. x̂_ij is the imputation of x_ij if x_ij is missing, and x̂_ij = x_ij otherwise. x̄_j = (1/n) Σ_{i=1..n} x_ij is the mean of the variable X_j (prior to random selection of missing values). In the MSE expression and for a given variable, the term between square brackets represents the ratio between the sum of squares of the imputation errors and the variable's variance (in order to take account of the measurement scale of each variable). We then obtain the mean square error by dividing by the number of missing data (n·p·p_m). In order to study the behaviour of the MSE as a function of the percentage of missing data, steps 1 to 4 are repeated K times for each percentage p_m from 1% to 15% in 1% steps. As with bootstrap methods, we set K to 1000. For each p_m, we thus obtain a series of 1000 MSE observations, for which we calculate the quartiles and the mean. Figure 1 presents the results obtained for imputation using the NIPALS algorithm (the two other methods gave similar results). The algorithm can be applied to any covariance matrix whatsoever. By way of an example, we chose the matrix S calculated from Fisher's Iris data [15]. This file (comprising 150 individuals and 4 numerical variables) is frequently used by statisticians to assess statistical methods. In order to assess the present method's robustness, we also applied the algorithm to Fisher's data: only step 1 is modified, and the corresponding table T thus refers to real data.

Figure 1: the NIPALS method. Change in MSE as a function of p_m (the proportion of missing values). Q1 and Q3 represent the first and third quartiles, respectively. The dotted curve represents the median MSE for the real data from the Iris file.
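The chart-building loop (steps 1 to 4, repeated K times) can be sketched as follows. In this Python illustration, mean imputation stands in for the chosen imputation method, the reduced K and the identity covariance matrix in the usage line are assumptions made for speed, and the MSE follows the formula above with the realized hole count in place of its expectation n·p·p_m:

```python
import numpy as np

def mse_chart(S, n=100, p_m=0.05, K=100, rng=0):
    """Steps 1-4, repeated K times: simulate X ~ N(0, S), delete a fraction
    p_m of cells uniformly at random, impute, and compute the normalised MSE.
    Returns the quartiles (Q1, median, Q3) of the K MSE observations."""
    rng = np.random.default_rng(rng)
    p = S.shape[0]
    errors = []
    for _ in range(K):
        X = rng.multivariate_normal(np.zeros(p), S, size=n)  # step 1
        holes = rng.random(X.shape) < p_m                    # step 2
        Xm = np.where(holes, np.nan, X)
        X_hat = np.where(holes, np.nanmean(Xm, axis=0), X)   # step 3
        # Step 4: per variable, squared imputation error over total sum of
        # squares; then average over the realized number of missing cells.
        num = ((X - X_hat) ** 2).sum(axis=0)
        den = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
        errors.append((num / den).sum() / holes.sum())
    return np.percentile(errors, [25, 50, 75])

q1, med, q3 = mse_chart(np.eye(4), n=100, p_m=0.05, K=100)
```

Running the loop over p_m = 1%, ..., 15% and plotting the three quartiles against p_m reproduces the shape of a chart such as Figure 1 for the stand-in method.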
One can observe that, as expected, the imputation error increases as the proportion of missing data increases. Even though the MSE is calculated under a hypothesis of multinormality, the proximity of the two median curves (simulated data and Iris data) indicates a certain robustness of this indicator (the real Iris data do not follow a multinormal distribution). One can then use the MSE to predict the order of magnitude of the imputation error simply from the covariance matrix of the observed data. Let us again take Fisher's Iris data as an example (150 subjects and 4 parameters, i.e. 600 data items, with S the covariance matrix). Let us then suppose that for each of the 4 parameters, 12% of the data is missing: we must therefore impute 72 values. If we choose to impute with the NIPALS method, we use Figure 1, where the median MSE is 0.20%. If the variables were standardized (or if their variances were equal), the imputation would then introduce an error estimated at 4.4% of the total variance. For each imputation method, steps 1 to 4 can easily be programmed so as to dynamically obtain a graph such as that shown in Figure 1.
The algorithm's input parameters are the matrix S (which can be estimated from the available data) and the size n of the multinormally distributed sample to be generated.

4. Comparison of imputation methods using a large medical database

The imputation methods studied here were applied to the DiabCare database (40000 individuals, 250 variables) set up under the auspices of the WHO (the EuroDiabCare program) in order to assess the quality of care in diabetes. Our work follows on from the DATADIAB research program supported by the French Ministry of Research (ACI 2000) [16]. Here, we focussed on French type II diabetics (249 individuals). The database suffers from missing values for numerous variables. We present the results concerning variables considered to be important for the follow-up of diabetic patients. The variables and the corresponding proportions of missing values are as follows: age (%), body mass index (5%), blood cholesterol level (9%), blood creatinine level (%), time since diabetes onset (5%), glycated haemoglobin (7%), height (4%), blood triglyceride level (9%), weight (2%), diastolic blood pressure (5%) and systolic blood pressure (4%). In all, there are 3352 missing values, i.e. 5.7% of the data. The real values of the missing data are not available. We used the method described in section 3 for a priori estimation of the imputation error, supposing that the data follow a multinormal distribution. The table below gives the median MSEs; the covariance matrix S was estimated from the observed data.

Imputation method                           NIPALS   Multiple imputation   Classification
Median MSE (%)                              -        -                     -
Estimated total error (% of the variance)   -        -                     -

One can note that the NIPALS and multiple imputation methods give similar results, whereas imputation by classification seems less precise (as shown by a PCA of the imputation data, in which the statistical units are the missing values and the variables are the imputation methods).
We also compared the means, medians and variances calculated first for the available data (one simply eliminates the variable's missing values), then for the complete cases, and finally after imputation by the three methods. The results can be summarised as follows: for the mean, all the estimations are similar. In contrast, for the variance, calculations on complete cases led to systematic underestimation with respect to available cases, as expected. Imputation with the three methods produces variances close to those calculated using available cases, except for the creatinine variable. The latter included the highest proportion of missing data (%); the NIPALS method strongly overestimated its variance, whereas classification underestimated it.

5. Discussion

We studied three imputation methods: multiple imputation (via the SAS procedure MI), imputation by classification (via the SAS procedure FASTCLUS) and imputation with the NIPALS algorithm. These methods have the advantage of being well suited to MAR cases and of being practicable with mainstream software. When the methods were compared using both simulated and real data sets, none appeared to differ significantly from the others in terms of the quality of the results. The strong points of the MI procedure are that it is quick and easy to use and does not artificially decrease the data variance. Imputation via the FASTCLUS procedure is based on a simple idea, but one is obliged to choose a number of classes in order to optimize the estimations - and the cost in calculation time can be high for large databases. As for the NIPALS algorithm, it is easy to implement in standard
programming languages (C, for example). Since this method is based on data reconstitution using PCA, NIPALS imputation takes the data's multivariate nature into account. It can be criticized for being poorly known to end users - except perhaps for fans of the PLS approach. Unsurprisingly, one observes a drop in performance for all methods when the proportion of missing values is high. We have thus developed a method which enables assessment of the imputation error as a function of the percentage of missing data. What we have, in fact, is a "control chart" for the a priori estimation of the quality of imputation of missing values, for a given method and a given covariance matrix S. This "control chart" appears to us to be a highly valuable tool for use prior to the statistical analysis of databases which may be completed with imprecise values. We intend to continue this research by broadening the range of techniques used. An initial approach will consist in developing missing data processing techniques based on nonlinear tools, such as the kernel methods widely used in statistical learning, and neural networks. We have already developed a methodology based on a recurrent, multicontext neural network [17]. This has been validated in different fields (notably for monitoring energy saving), and we believe that such an approach is very well suited to missing data processing. Of course, imputation is only useful if analysis of the completed database with statistical methods or data mining procedures gives reliable results. This is why the methods' respective performances will be judged according to the results obtained with the completed database for the prediction of the macro- and microvascular complications of diabetes. We shall consider two types of validation criteria: mathematical criteria and the judgement of medical experts.

6. References

[1] Celeux G. Le traitement des données manquantes dans le logiciel SICLA. Rapports Techniques, INRIA, France.
[2] Hox J. A review of current software for handling missing data. Kwantitatieve Methoden 1999; 62.
[3] Little RJA, Rubin DB. Statistical analysis with missing data. New York: Wiley, 1987.
[4] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Belmont: Wadsworth, 1984.
[5] Schafer JL. Imputation procedures for missing data. University of Pennsylvania, USA, 1999.
[6] Horton NJ, Lipsitz SR. Multiple imputation in practice: comparison of software packages for regression models with missing variables. Statistical Computing Software Reviews. The American Statistician 2001; 55(3).
[7] Benali H, Escofier B. Nouvelle étape de traitement des données manquantes en analyse factorielle des correspondances multiples dans le système portable d'analyse de données. Rapports Techniques, INRIA, France.
[8] Tenenhaus M. La régression PLS: Théorie et pratique. Editions Technip, 1998.
[9] Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley, 1987.
[10] Rubin DB. Multiple imputation after 18+ years. Journal of the American Statistical Association 1996; 91.
[11] Allison PD. Multiple imputation for missing data: a cautionary tale. Sociological Methods and Research 2000; 28. University of Pennsylvania, USA.
[12] Donzé L. Imputation multiple et modélisation: quelques expériences tirées de l'enquête 1999 KOF/ETHZ sur l'innovation. Ecole polytechnique fédérale de Zurich.
[13] Ragel A. MVC - a preprocessing method to deal with missing values. Knowledge-Based Systems 1999; 12.
[14] SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA.
[15] Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics 1936; 7.
[16] Duhamel A, Nuttens MC, Devos P, Picavet M, Beuscart R. A preprocessing method for improving data mining techniques.
Application to a large medical diabetes database. Studies in Health Technology and Informatics 2003, IOS Press.
[17] Huang BQ, Rashid T, Kechadi T. A new modified network based on the Elman network. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications 2004, Innsbruck, Austria.

Address for correspondence

Cristian Preda, CERIM, Faculté de médecine, Place de Verdun, F Lille cedex, France, cpreda@univ-lille2.fr
More informationRecitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002
Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Introduction Neural networks are flexible nonlinear models that can be used for regression and classification
More informationUSING REGRESSION TREES IN PREDICTIVE MODELLING
Production Systems and Information Engineering Volume 4 (2006), pp. 115-124 115 USING REGRESSION TREES IN PREDICTIVE MODELLING TAMÁS FEHÉR University of Miskolc, Hungary Department of Information Engineering
More informationStatistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte
Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,
More informationThe Piecewise Regression Model as a Response Modeling Tool
NESUG 7 The Piecewise Regression Model as a Response Modeling Tool Eugene Brusilovskiy University of Pennsylvania Philadelphia, PA Abstract The general problem in response modeling is to identify a response
More informationFaculty of Sciences. Holger Cevallos Valdiviezo
Faculty of Sciences Handling of missing data in the predictor variables when using Tree-based techniques for training and generating predictions Holger Cevallos Valdiviezo Master dissertation submitted
More informationMultiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health
Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options
More informationStatistical Matching using Fractional Imputation
Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:
More informationComparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis
Comparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis Luis Gabriel De Alba Rivera Aalto University School of Science and
More informationJMP Book Descriptions
JMP Book Descriptions The collection of JMP documentation is available in the JMP Help > Books menu. This document describes each title to help you decide which book to explore. Each book title is linked
More informationOn the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme
arxiv:1811.06857v1 [math.st] 16 Nov 2018 On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme Mahdi Teimouri Email: teimouri@aut.ac.ir
More informationMissing Data Analysis for the Employee Dataset
Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1
More informationPaper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by
Paper CC-016 A macro for nearest neighbor Lung-Chang Chien, University of North Carolina at Chapel Hill, Chapel Hill, NC Mark Weaver, Family Health International, Research Triangle Park, NC ABSTRACT SAS
More informationR package plsdepot Principal Components with NIPALS
R package plsdepot Principal Components with NIPALS Gaston Sanchez www.gastonsanchez.com/plsdepot 1 Introduction NIPALS is the acronym for Nonlinear Iterative Partial Least Squares and it is the PLS technique
More informationChapter 1. Using the Cluster Analysis. Background Information
Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,
More informationSTATISTICS (STAT) Statistics (STAT) 1
Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).
More informationVisual object classification by sparse convolutional neural networks
Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.
More informationComparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data
Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data Donsig Jang, Amang Sukasih, Xiaojing Lin Mathematica Policy Research, Inc. Thomas V. Williams TRICARE Management
More informationDynamic Thresholding for Image Analysis
Dynamic Thresholding for Image Analysis Statistical Consulting Report for Edward Chan Clean Energy Research Center University of British Columbia by Libo Lu Department of Statistics University of British
More informationWe deliver Global Engineering Solutions. Efficiently. This page contains no technical data Subject to the EAR or the ITAR
Numerical Computation, Statistical analysis and Visualization Using MATLAB and Tools Authors: Jamuna Konda, Jyothi Bonthu, Harpitha Joginipally Infotech Enterprises Ltd, Hyderabad, India August 8, 2013
More informationProcessing Missing Values with Self-Organized Maps
Processing Missing Values with Self-Organized Maps David Sommer, Tobias Grimm, Martin Golz University of Applied Sciences Schmalkalden Department of Computer Science D-98574 Schmalkalden, Germany Phone:
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationSELECTION OF A MULTIVARIATE CALIBRATION METHOD
SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper
More informationPerformance of Sequential Imputation Method in Multilevel Applications
Section on Survey Research Methods JSM 9 Performance of Sequential Imputation Method in Multilevel Applications Enxu Zhao, Recai M. Yucel New York State Department of Health, 8 N. Pearl St., Albany, NY
More informationResearch on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a
International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,
More informationSENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY
Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 6 SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE
More informationMissing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA
Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA ABSTRACT Statistical analyses can be greatly hampered by missing
More informationOpen Access Research on the Prediction Model of Material Cost Based on Data Mining
Send Orders for Reprints to reprints@benthamscience.ae 1062 The Open Mechanical Engineering Journal, 2015, 9, 1062-1066 Open Access Research on the Prediction Model of Material Cost Based on Data Mining
More informationMachine Learning and Data Mining. Clustering (1): Basics. Kalev Kask
Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of
More informationLouis Fourrier Fabien Gaie Thomas Rolf
CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted
More informationAssignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions
ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan
More informationEvaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München
Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics
More informationStatistical matching: conditional. independence assumption and auxiliary information
Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional
More informationApplication of Clustering as a Data Mining Tool in Bp systolic diastolic
Application of Clustering as a Data Mining Tool in Bp systolic diastolic Assist. Proffer Dr. Zeki S. Tywofik Department of Computer, Dijlah University College (DUC),Baghdad, Iraq. Assist. Lecture. Ali
More informationHybrid Feature Selection for Modeling Intrusion Detection Systems
Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,
More informationSupplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International
Supplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International Part A: Comparison with FIML in the case of normal data. Stephen du Toit Multivariate data
More informationGraphical Analysis of Data using Microsoft Excel [2016 Version]
Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.
More informationFitting Fragility Functions to Structural Analysis Data Using Maximum Likelihood Estimation
Fitting Fragility Functions to Structural Analysis Data Using Maximum Likelihood Estimation 1. Introduction This appendix describes a statistical procedure for fitting fragility functions to structural
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationD-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview
Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,
More informationAn Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm
Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy
More informationBIOINF 585: Machine Learning for Systems Biology & Clinical Informatics
BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics Lecture 12: Ensemble Learning I Jie Wang Department of Computational Medicine & Bioinformatics University of Michigan 1 Outline Bias
More informationLatent variable transformation using monotonic B-splines in PLS Path Modeling
Latent variable transformation using monotonic B-splines in PLS Path Modeling E. Jakobowicz CEDRIC, Conservatoire National des Arts et Métiers, 9 rue Saint Martin, 754 Paris Cedex 3, France EDF R&D, avenue
More informationTHE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann
Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG
More informationABSTRACT 1. INTRODUCTION 2. METHODS
Finding Seeds for Segmentation Using Statistical Fusion Fangxu Xing *a, Andrew J. Asman b, Jerry L. Prince a,c, Bennett A. Landman b,c,d a Department of Electrical and Computer Engineering, Johns Hopkins
More informationCluster Tendency Assessment for Fuzzy Clustering of Incomplete Data
EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data Ludmila Himmelspach 1 Daniel Hommers 1 Stefan Conrad 1 1 Institute of Computer Science,
More informationCART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology
CART Classification and Regression Trees Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART CART stands for Classification And Regression Trees.
More informationAnnotated multitree output
Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version
More informationA Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis
Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach
More informationPerformance Evaluation of Various Classification Algorithms
Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------
More informationMissing Data. SPIDA 2012 Part 6 Mixed Models with R:
The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca
More informationHigh dimensional data analysis
High dimensional data analysis Cavan Reilly October 24, 2018 Table of contents Data mining Random forests Missing data Logic regression Multivariate adaptive regression splines Data mining Data mining
More informationAnalysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset
Int. Journal of Math. Analysis, Vol. 5, 2011, no. 45, 2217-2227 Analysis of Imputation Methods for Missing Data in AR(1) Longitudinal Dataset Michikazu Nakai Innovation Center for Medical Redox Navigation,
More information