R software and examples


Handling Missing Data in R with MICE

Stef van Buuren
Methodology and Statistics, FSBS, Utrecht University
Netherlands Organization for Applied Scientific Research TNO, Leiden
Winnipeg, June, 7

Why this course?
- Missing data are everywhere
- Ad-hoc fixes often do not work
- Multiple imputation is broadly applicable, yields correct statistical inferences, and there is good software
- Goal of the course: get comfortable with a modern and powerful way of solving missing data problems

Course materials

Reading materials
- Van Buuren, S. and Groothuis-Oudshoorn, C.G.M. (). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 5(),
- Van Buuren, S. (). Flexible Imputation of Missing Data (FIMD). Chapman & Hall/CRC, Boca Raton, FL. Chapters 6,.

R software and examples
- R: Install from
- RStudio: Install from
- R package mice. or higher: from CRAN or from
- More examples:

Time table (morning)
Session  L/P  Description
         L    Overview
I        L    Introduction to missing data
I        P    Ad hoc methods + MICE
              PAUSE
II       L    Multiple imputation
II       P    Boys data
              PAUSE

Time table (afternoon)
Session  L/P  Description
III      L    Generating plausible imputations
III      P    Algorithmic convergence and pooling
              PAUSE
IV       L    Imputation in practice
IV       P    Post-processing and passive imputation
V        L    Guidelines for reporting

SESSION I

Why are missing data interesting?
- "Obviously the best way to treat missing data is not to have them." (Orchard and Woodbury 97)
- "Sooner or later (usually sooner), anyone who does statistical analysis runs into problems with missing data" (Allison, )
- Missing data problems are the heart of statistics

Causes of missing data
- Respondent skipped the item
- Data transmission/coding error
- Drop out in longitudinal research
- Refusal to cooperate
- Sample from population
- Question not asked, different forms
- Censoring

Consequences of missing data
- Less information than planned
- Enough statistical power?
- Different analyses, different n's
- Cannot calculate even the mean
- Systematic biases in the analysis
- Appropriate confidence interval, P-values?
In general, missing data can severely complicate interpretation and analysis.

Listwise deletion
- Analyze only the complete records
- Also known as Complete Case Analysis (CCA)
Advantages
- Simple (default in most software)
- Unbiased under MCAR
- Correct standard errors, significance levels
- Two special properties in regression
Disadvantages
- Wasteful
- Large standard errors
- Biased under MAR, even for simple statistics like the mean
- Inconsistencies in reporting

Mean imputation
- Replace the missing values by the mean of the observed data
Advantages
- Simple
- Unbiased for the mean, under MCAR
[Figure: histogram of Ozone (ppb); scatterplot of Ozone (ppb) against Solar Radiation (lang)]
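The two ad hoc methods above can be tried in a few lines of base R. This is a minimal sketch, not the course's own code: it assumes the Ozone and Solar Radiation figures refer to R's built-in airquality data, and all object names are illustrative.

```r
# Assumption: the Ozone / Solar Radiation example corresponds to the
# built-in airquality data set (datasets package, loaded by default).
oz <- airquality$Ozone

# Listwise deletion: keep only records that are complete on both variables
cc <- na.omit(airquality[, c("Ozone", "Solar.R")])
n_total    <- nrow(airquality)
n_complete <- nrow(cc)
fit_cc <- lm(Ozone ~ Solar.R, data = cc)   # complete-case analysis

# Mean imputation: replace each missing Ozone value by the observed mean
oz_mean_imp <- ifelse(is.na(oz), mean(oz, na.rm = TRUE), oz)

# The mean is preserved, but the spread shrinks because every imputed
# value sits exactly at the mean
sd_observed <- sd(oz, na.rm = TRUE)
sd_mean_imp <- sd(oz_mean_imp)
print(c(n_total = n_total, n_complete = n_complete))
print(c(sd_observed = sd_observed, sd_mean_imp = sd_mean_imp))
```

Comparing the two standard deviations illustrates the "underestimates the variance" disadvantage of mean imputation, and comparing the two record counts shows how wasteful listwise deletion can be.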

Mean imputation
Disadvantages
- Disturbs the distribution
- Underestimates the variance
- Biases correlations to zero
- Biased under MAR
AVOID (unless you know what you are doing)

Regression imputation
- Also known as prediction
- Fit model for $Y_{obs}$ under listwise deletion
- Predict $Y_{mis}$ for records with missing Y's
- Replace missing values by prediction
Advantages
- Unbiased estimates of regression coefficients (under MAR)
- Good approximation to the (unknown) true data if explained variance is high
- Prediction is the favorite among non-statisticians
Disadvantages
- Artificially increases correlations
- Systematically underestimates the variance
- Too optimistic P-values and too short confidence intervals
AVOID. Harmful to statistical inference
[Figure: histogram of Ozone (ppb); scatterplot of Ozone (ppb) against Solar Radiation (lang)]

Stochastic regression imputation
- Like regression imputation, but adds appropriate noise to the predictions to reflect uncertainty
Advantages
- Preserves the distribution of $Y_{obs}$
- Preserves the correlation between Y and X in the imputed data
Disadvantages
- Symmetric and constant error restrictive
- Single imputation does not take uncertainty imputed data into account, and incorrectly treats them as real
- Not so simple anymore
[Figure: histogram of Ozone (ppb); scatterplot of Ozone (ppb) against Solar Radiation (lang)]

Single imputation methods, wrapup
- Underestimate uncertainty caused by the missing data
- Unbiased only under restrictive assumptions
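The difference between regression imputation and stochastic regression imputation is easy to see in code. The sketch below is an illustration under assumptions: it again uses the built-in airquality data, restricts attention to rows where Solar.R is observed, and uses an arbitrary seed.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the noise

# Assumption: work with rows where the predictor Solar.R is observed
aq  <- airquality[!is.na(airquality$Solar.R), c("Ozone", "Solar.R")]
mis <- is.na(aq$Ozone)

# Fit the model on the observed Y (lm drops incomplete rows by default)
fit  <- lm(Ozone ~ Solar.R, data = aq)
pred <- predict(fit, newdata = aq[mis, ])

# Regression imputation: plug in the prediction
oz_reg <- aq$Ozone
oz_reg[mis] <- pred

# Stochastic regression imputation: prediction plus normal noise with
# the residual standard deviation of the fitted model
sigma_hat <- summary(fit)$sigma
oz_sto <- aq$Ozone
oz_sto[mis] <- pred + rnorm(sum(mis), mean = 0, sd = sigma_hat)

# Compare correlations with the predictor
cor_observed <- cor(aq$Ozone, aq$Solar.R, use = "complete.obs")
cor_reg      <- cor(oz_reg, aq$Solar.R)
cor_sto      <- cor(oz_sto, aq$Solar.R)
print(c(observed = cor_observed, regression = cor_reg, stochastic = cor_sto))
```

Because the regression imputations lie exactly on the fitted line, the correlation in the completed data is pushed upward. The noise term counteracts that, although a symmetric normal error can produce implausible (for example negative) ozone values, which is the restriction noted above.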

Alternatives
- Maximum Likelihood, Direct Likelihood
- Weighting
- Multiple Imputation
Little, R.J.A. and Rubin, D.B. () Statistical Analysis with Missing Data. Second Edition. John Wiley & Sons, New York.

SESSION II

Rising popularity of multiple imputation
[Figure: number of publications (log) by year; series: early publications, 'multiple imputation' in abstract, 'multiple imputation' in title]

Main steps used in multiple imputation
Incomplete data -> Imputed data -> Analysis results -> Pooled results

Steps in mice
Stage             Function  Class of result
incomplete data             data frame
imputed data      mice()    mids
analysis results  with()    mira
pooled results    pool()    mipo

Estimand
- $Q$ is a quantity of scientific interest in the population.
- $Q$ can be a vector of population means, population regression weights, population variances, and so on.
- $Q$ may not depend on the particular sample, thus $Q$ cannot be a standard error, sample mean, p-value, and so on.

Goal of multiple imputation
- Estimate $Q$ by $\hat{Q}$ or $\bar{Q}$ accompanied by a valid estimate of its uncertainty.
- What is the difference between $\hat{Q}$ or $\bar{Q}$?
- $\hat{Q}$ and $\bar{Q}$ both estimate $Q$
- $\hat{Q}$ accounts for the sampling uncertainty
- $\bar{Q}$ accounts for the sampling and missing data uncertainty

Pooled estimate $\bar{Q}$
- $\hat{Q}_\ell$ is the estimate of the $\ell$-th repeated imputation
- $\hat{Q}_\ell$ contains $k$ parameters and is represented as a $k \times 1$ column vector
- The pooled estimate $\bar{Q}$ is simply the average
$$\bar{Q} = \frac{1}{m} \sum_{\ell=1}^{m} \hat{Q}_\ell \tag{1}$$
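The mice() -> with() -> pool() chain described above can be run end to end on the small nhanes data set that ships with the mice package. This is a minimal sketch: the analysis model, the number of imputations and the seed are arbitrary choices made for illustration, not values taken from the course.

```r
library(mice)

# Step 1: incomplete data frame -> multiply imputed data (class "mids")
# m = 5 and seed = 123 are illustrative choices
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Step 2: fit the complete-data model on each imputed data set (class "mira")
# The model chl ~ age + bmi is a hypothetical analysis model
fit <- with(imp, lm(chl ~ age + bmi))

# Step 3: pool the m sets of estimates into one result (class "mipo")
est <- pool(fit)

print(class(imp))
print(class(fit))
print(class(est))
print(summary(est))
```

Each stage returns an object of the class listed in the table, so the object class tells you how far along the workflow a given result is.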

Within-imputation variance
- Average of the complete-data variances as
$$\bar{U} = \frac{1}{m} \sum_{\ell=1}^{m} \bar{U}_\ell \tag{2}$$
  where $\bar{U}_\ell$ is the variance-covariance matrix of $\hat{Q}_\ell$ obtained for the $\ell$-th imputation
- $\bar{U}_\ell$ is the variance of the estimate, not the variance in the data
- The within-imputation variance is large if the sample is small

Between-imputation variance
- Variance between the $m$ complete-data estimates is given by
$$B = \frac{1}{m-1} \sum_{\ell=1}^{m} (\hat{Q}_\ell - \bar{Q})(\hat{Q}_\ell - \bar{Q})' \tag{3}$$
  where $\bar{Q}$ is the pooled estimate (c.f. equation 1)
- The between-imputation variance is large if there are many missing data

Total variance
- The total variance is not simply $T = \bar{U} + B$
- The correct formula is
$$T = \bar{U} + B + B/m = \bar{U} + \left(1 + \frac{1}{m}\right) B \tag{4}$$
  for the total variance of $\bar{Q}$, and hence of $(Q - \bar{Q})$ if $\bar{Q}$ is unbiased
- The term $B/m$ is the simulation error

Three sources of variation
In summary, the total variance $T$ stems from three sources:
1. $\bar{U}$, the variance caused by the fact that we are taking a sample rather than the entire population. This is the conventional statistical measure of variability;
2. $B$, the extra variance caused by the fact that there are missing values in the sample;
3. $B/m$, the extra simulation variance caused by the fact that $\bar{Q}$ itself is based on finite $m$.

Variance ratios (1)
- Proportion of the variation attributable to the missing data
$$\lambda = \frac{B + B/m}{T} \tag{5}$$
- Relative increase in variance due to nonresponse
$$r = \frac{B + B/m}{\bar{U}} \tag{6}$$
- These are related by $r = \lambda / (1 - \lambda)$.

Variance ratios (2)
- Fraction of information about $Q$ missing due to nonresponse
$$\gamma = \frac{r + 2/(\nu + 3)}{1 + r} \tag{7}$$
- This measure needs an estimate of the degrees of freedom $\nu$.
- Relation between $\gamma$ and $\lambda$
$$\gamma = \frac{\nu + 1}{\nu + 3} \lambda + \frac{2}{\nu + 3} \tag{8}$$
- The literature often confuses $\gamma$ and $\lambda$.

Statistical inference for $\bar{Q}$ (1)
- The $100(1 - \alpha)\%$ confidence interval of a $\bar{Q}$ is calculated as
$$\bar{Q} \pm t_{(\nu, 1 - \alpha/2)} \sqrt{T} \tag{9}$$
  where $t_{(\nu, 1 - \alpha/2)}$ is the quantile corresponding to probability $1 - \alpha/2$ of $t_\nu$.
- For example, use the quantile $t_{(\nu, 0.975)}$ for the 95% confidence interval.

Statistical inference for $\bar{Q}$ (2)
- Suppose we test the null hypothesis $Q = Q_0$ for some specified value $Q_0$. We can find the p-value of the test as the probability
$$P_s = \Pr\left[ F_{1,\nu} > \frac{(Q_0 - \bar{Q})^2}{T} \right] \tag{10}$$
  where $F_{1,\nu}$ is an F distribution with $1$ and $\nu$ degrees of freedom.
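The pooling rules above can be checked by hand for a scalar estimate. The sketch below uses made-up complete-data estimates and variances from m = 5 imputations (hypothetical numbers, not course data), and it assumes the large-sample degrees of freedom $\nu = (m - 1)/\lambda^2$ for the interval and the test.

```r
# Hypothetical results from m = 5 imputed data sets (illustrative numbers)
qhat <- c(10.2, 9.8, 10.5, 10.1, 9.9)     # complete-data estimates
u    <- c(0.50, 0.48, 0.52, 0.49, 0.51)   # their squared standard errors
m    <- length(qhat)

qbar  <- mean(qhat)                # pooled estimate
ubar  <- mean(u)                   # within-imputation variance
b     <- var(qhat)                 # between-imputation variance (divisor m - 1)
total <- ubar + (1 + 1 / m) * b    # total variance T

lambda <- (b + b / m) / total      # proportion of variance due to missing data
r      <- (b + b / m) / ubar       # relative increase in variance

# Assumption: large-sample degrees of freedom, nu = (m - 1) / lambda^2
nu    <- (m - 1) / lambda^2
gamma <- (r + 2 / (nu + 3)) / (1 + r)   # fraction of missing information

# 95% confidence interval and p-value for H0: Q = q0 (q0 is hypothetical)
ci <- qbar + c(-1, 1) * qt(0.975, df = nu) * sqrt(total)
q0 <- 9
p_value <- pf((q0 - qbar)^2 / total, df1 = 1, df2 = nu, lower.tail = FALSE)

print(c(qbar = qbar, ubar = ubar, b = b, total = total))
print(c(lambda = lambda, r = r, gamma = gamma, nu = nu))
print(ci)
print(p_value)
```

In practice pool() in mice performs these computations for every model coefficient; writing them out once makes it clear where each variance component enters.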

Degrees of freedom (1)
- With missing data, $n$ is effectively lower. Thus, the degrees of freedom in statistical tests need to be adjusted.
- The old formula assumes $n = \infty$:
$$\nu_{old} = (m - 1)\left(1 + \frac{1}{r}\right)^2 = \frac{m - 1}{\lambda^2} \tag{11}$$
- The new formula is
$$\nu = \frac{\nu_{old}\,\nu_{obs}}{\nu_{old} + \nu_{obs}} \tag{12}$$

Degrees of freedom (2)
where the estimated observed-data degrees of freedom that accounts for the missing information is
$$\nu_{obs} = \frac{\nu_{com} + 1}{\nu_{com} + 3}\,\nu_{com}(1 - \lambda) \tag{13}$$
with $\nu_{com} = n - k$.

How large should m be?
The legacy
- Classic advice: m =, 5,.
- More recently: set m higher:.

Some advice
- Use m = 5 or m = if the fraction of missing information is low, <..
- Develop your model with m = 5. Do final run with m equal to percentage of incomplete cases.
- Repeat the analysis with m = 5 with different seeds. If there are large differences for some parameters, this means that the data contain little information about them.

Introductions to multiple imputation
- Schafer, J.L. (999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8(), 5.
- Sterne et al (9). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ, 8, b9.
- Van Buuren, S. (). Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL.

SESSION III

[Figure, panel a: Relation between temperature and gas consumption. We delete gas consumption of observation 7; the deleted observation is marked.]
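The adjusted degrees of freedom and the rule of thumb for the final number of imputations can both be written down in a few lines. This is a sketch under assumptions: mi_df() is a hypothetical helper name, the inputs are illustrative, and the built-in airquality data stand in for a real incomplete data set.

```r
# Hypothetical helper: degrees of freedom for pooled inference
# m = number of imputations, lambda = proportion of variance due to
# missing data, n = sample size, k = number of parameters
mi_df <- function(m, lambda, n, k) {
  nu_old <- (m - 1) / lambda^2                  # assumes n = infinity
  nu_com <- n - k                               # complete-data df
  nu_obs <- (nu_com + 1) / (nu_com + 3) * nu_com * (1 - lambda)
  nu     <- nu_old * nu_obs / (nu_old + nu_obs) # adjusted df
  c(nu_old = nu_old, nu_com = nu_com, nu_obs = nu_obs, nu = nu)
}

# Illustrative inputs, not taken from the course
df_small <- mi_df(m = 5, lambda = 0.3, n = 100, k = 2)
print(df_small)

# Rule of thumb from the advice above: final m equal to the percentage
# of incomplete cases (airquality used as a stand-in data set)
pct_incomplete <- 100 * mean(!complete.cases(airquality))
m_final <- max(5, round(pct_incomplete))
print(m_final)
```

The adjusted value is always smaller than both the old and the observed-data degrees of freedom, which matters most in small samples where the old formula can return a value far larger than n.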

[Figure, panel b: Predict imputed value from regression line]
[Figure, panel c: Predicted value + noise]
[Figure, panel d: Predicted value + noise + parameter uncertainty]
[Figure, panel e: Imputation based on two predictors]

Predictive mean matching: Y given X
- Add two regression lines
- Predicted given 5 C,
- Define a matching range $\hat{y} \pm$

- Select potential donors

Bayesian PMM
- Draw a line
- Define a matching range $\hat{y} \pm$
- Select potential donors

Imputation of a binary variable
- logistic regression
$$\Pr(y_i = 1 \mid X_i, \beta) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)} \tag{14}$$
- Fit logistic model
- Draw parameter estimate
- Read off the probability
[Figures: Probability against Linear predictor for each of the three steps]
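The three steps for a binary variable (fit the logistic model, draw a parameter estimate, read off the probability) can be sketched in base R. The data below are simulated purely for illustration, and the parameter draw uses a normal approximation around the maximum likelihood estimate, which is an assumption of this sketch rather than a statement of how any particular package implements it.

```r
set.seed(2)  # arbitrary seed

# Hypothetical toy data: binary y with 50 values deleted, complete predictor x
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
y[sample(n, 50)] <- NA
mis <- is.na(y)

# Step 1: fit the logistic model on the observed cases
fit <- glm(y ~ x, family = binomial, subset = !mis)

# Step 2: draw a parameter estimate around the MLE
# (normal approximation using the estimated covariance matrix)
beta_hat  <- coef(fit)
V         <- vcov(fit)
beta_star <- beta_hat + drop(t(chol(V)) %*% rnorm(length(beta_hat)))

# Step 3: read off the probability for each incomplete case
X_mis <- cbind(1, x[mis])
p     <- plogis(drop(X_mis %*% beta_star))

# Draw the imputations as Bernoulli outcomes with those probabilities
y_imp <- y
y_imp[mis] <- rbinom(sum(mis), size = 1, prob = p)
print(table(imputed = y_imp[mis]))
```

Drawing the parameters before predicting is what separates a proper imputation from simply plugging in the fitted probabilities: repeating the whole procedure gives imputations that also reflect uncertainty about the model.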

Impute ordered categorical variable
- K ordered categories $k = 1, \ldots, K$
- ordered logit model, or proportional odds model
$$\Pr(y_i = k \mid X_i, \beta) = \frac{\exp(\tau_k + X_i \beta)}{\sum_{k=1}^{K} \exp(\tau_k + X_i \beta)} \tag{15}$$
- Fit ordered logit model
- Read off the probability
[Figures: Probability against Linear predictor]

Other types of variables
- Count data
- Semi-continuous data
- Censored data
- Truncated data
- Rounded data

Univariate imputation in mice
Method       Description                         Scale type
pmm          Predictive mean matching            numeric
norm         Bayesian linear regression          numeric
norm.nob     Linear regression, non-Bayesian     numeric
norm.boot    Linear regression with bootstrap    numeric
mean         Unconditional mean imputation       numeric
L.norm       Two-level linear model              numeric
logreg       Logistic regression                 factor, levels
logreg.boot  Logistic regression with bootstrap  factor, levels
polyreg      Multinomial logit model             factor, > levels
polr         Ordered logit model                 ordered, > levels
lda          Linear discriminant analysis        factor
sample       Simple random sample                any

Problems in multivariate imputation
- Predictors themselves can be incomplete
- Mixed measurement levels
- Order of imputation can be meaningful
- Too many predictor variables
- Relations could be nonlinear
- Higher order interactions
- Impossible combinations

Three general strategies
- Monotone data imputation
- Joint modeling
- Fully conditional specification (FCS)

Imputation of monotone pattern
[Figure: monotone missing data pattern over columns X, Y, Y, Y, Y]
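The univariate methods in the table are selected per column through the method argument of mice(). The sketch below uses the nhanes2 data shipped with mice; switching bmi to "norm" is an arbitrary choice made only to show how a default is overridden, and m and the seed are illustrative.

```r
library(mice)

# Default methods chosen from the type of each incomplete column
meth <- make.method(nhanes2)
print(meth)

# Override one default: impute bmi by Bayesian linear regression
# instead of predictive mean matching (illustrative choice)
meth["bmi"] <- "norm"

imp <- mice(nhanes2, method = meth, m = 5, seed = 123, printFlag = FALSE)
print(imp$method)

# One completed data set, with the imputations filled in
completed <- complete(imp, 1)
print(head(completed))
```

Columns without missing values get an empty method string, so only the incomplete columns need attention when a default is unsuitable.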

Imputation of monotone pattern
[Figures: monotone missing data pattern over columns X, Y, Y, Y, Y, imputed column by column]

Joint Modeling (JM)
- Specify joint model $P(Y, X, R)$
- Derive $P(Y_{mis} \mid Y_{obs}, X, R)$
- Use MCMC techniques to draw imputations $Y_{mis}$

Joint modeling: Software
R/S Plus     norm, cat, mix, pan, Amelia
SAS          proc MI, proc MIANALYZE
STATA        MI command
Stand-alone  Amelia, solas, norm, pan

Joint Modeling: Pro's
- Yield correct statistical inference under the assumed JM
- Efficient parametrization (if the model fits)
- Known theoretical properties
- Works very well for parameters close to the center
- Many applications

Joint Modeling: Con's
- Lack of flexibility
- May lead to large models
- Can assume more than the complete data problem
- Can impute impossible data

Fully Conditional Specification (FCS)
- Specify $P(Y_{mis} \mid Y_{obs}, X, R)$
- Use MCMC techniques to draw imputations $Y_{mis}$

Multivariate Imputation by Chained Equations (MICE)
- MICE algorithm
- Specify imputation model for each incomplete column
- Fill in starting imputations
- And iterate
- Model: Fully Conditional Specification (FCS)
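The MICE steps listed above (one imputation model per incomplete column, starting imputations, then iteration) can be illustrated with a deliberately simplified base R loop. This is a teaching sketch under stated assumptions, not the mice implementation: it uses the built-in airquality data, a linear model with normal noise for every incomplete column, a fixed number of five iterations, and it ignores parameter uncertainty.

```r
set.seed(3)  # arbitrary seed

dat  <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
miss <- is.na(dat)                               # fixed missingness indicator
incomplete <- names(which(colSums(miss) > 0))    # columns that need a model

# Starting imputations: random draws from the observed values of each column
for (v in incomplete) {
  dat[miss[, v], v] <- sample(dat[!miss[, v], v], sum(miss[, v]), replace = TRUE)
}

# Iterate: re-impute each incomplete column given the current values of the others
for (iteration in 1:5) {
  for (v in incomplete) {
    obs  <- !miss[, v]
    form <- reformulate(setdiff(names(dat), v), response = v)
    fit  <- lm(form, data = dat[obs, ])
    mu   <- predict(fit, newdata = dat[!obs, ])
    # Simplification: noise only, no draw of the regression parameters
    dat[!obs, v] <- mu + rnorm(sum(!obs), mean = 0, sd = summary(fit)$sigma)
  }
}

print(colSums(is.na(dat)))
```

Running the loop once yields a single completed data set; repeating it m times with different random draws gives the multiple imputations, and mice() additionally draws the model parameters and supports a different method per column.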

Fully Conditional Specification: Con's
- Theoretical properties only known in special cases
- Cannot use computational shortcuts, like sweep-operator
- Joint distribution may not exist (incompatibility)

Fully Conditional Specification: Pro's
- Easy and flexible
- Imputes close to the data, prevents impossible data
- Subset selection of predictors
- Modular, can preserve valuable work
- Works well, both in simulations and practice

Fully Conditional Specification (FCS): Software
R            mice, transcan, mi, VIM, baboon
SPSS         V7 procedure multiple imputation
SAS          IVEware, SAS 9.
STATA        ice command, multiple imputation command
Stand-alone  Solas, Mplus

How many iterations?
- Quick convergence
- 5 iterations is adequate for most problems
- More iterations if $\lambda$ is high
- Inspect the generated imputations
- Monitor convergence to detect anomalies

[Figures: Non-convergence and Convergence; trace plots of the mean and sd of the imputed hgt and wgt against Iteration]

SESSION IV

Imputation model choices
1. MAR or MNAR
2. Form of the imputation model
3. Which predictors
4. Derived variables
5. What is m?
6. Order of imputation
7. Diagnostics, convergence
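Convergence of the algorithm is usually monitored with trace plots of the mean and standard deviation of the imputations per iteration. A minimal sketch with the nhanes data from mice follows; the number of iterations and the seed are illustrative choices.

```r
library(mice)

# Run 10 iterations (illustrative) and keep the chain statistics
imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

# Trace plots: one line per imputation, mean and sd of each imputed variable.
# Healthy convergence shows freely intermingling lines without a trend.
print(plot(imp))

# If in doubt, continue the same chains for another 10 iterations
imp2 <- mice.mids(imp, maxit = 10, printFlag = FALSE)
print(c(first_run = imp$iteration, continued = imp2$iteration))
```

Continuing with mice.mids() avoids restarting from scratch, which is convenient when the first few iterations suggest that more are needed.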

Which predictors?
1. Include all variables that appear in the complete-data model
2. In addition, include the variables that are related to the nonresponse
3. In addition, include variables that explain a considerable amount of variance
4. Remove from the variables selected in steps 2 and 3 those variables that have too many missing values within the subgroup of incomplete cases.
Function quickpred() and flux()

Derived variables
- ratio of two variables
- sum score
- index variable
- quadratic relations
- interaction term
- conditional imputation
- compositions

How to impute a ratio?
- weight/height ratio: whr = wgt/hgt kg/m.
- Easy if only one of wgt or hgt or whr is missing
Methods
- POST: Impute wgt and hgt, and calculate whr after imputation
- JAV: Impute whr as just another variable
- PASSIVE1: Impute wgt and hgt, and calculate whr during imputation
- PASSIVE2: As PASSIVE1 with adapted predictor matrix

Method POST
> imp <- mice(boys)
> long <- complete(imp, "long", inc = TRUE)
> long$whr <- with(long, wgt/(hgt/))
> imp <- longmids(long)

Method JAV: Just another variable
> boys$whr <- boys$wgt/(boys$hgt/)
> imp.jav <- mice(boys, m =, seed = 9, maxit = )
[Figure: Weight/Height (kg/m) against Height (cm); panels JAV, passive, passive]

Method PASSIVE
> meth["whr"] <- "~I(wgt/(hgt/))"

Method PASSIVE, predictor matrix
[Figure: predictor matrix with rows and columns age, hgt, wgt, hc, gen, phb, tv, reg, whr]
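The POST and PASSIVE strategies can be compared in a runnable form. The sketch below makes several assumptions that are not in the slides: height in the boys data is taken to be in centimetres, so the ratio is computed as wgt/(hgt/100); the bmi column is dropped to mirror the variable list shown in the predictor matrix figure; as.mids() is used to convert the long data back to a mids object; and m, maxit and the seed are kept small and arbitrary so the example runs quickly.

```r
library(mice)

# Assumption: drop bmi so the variables match the figure (age ... reg, whr)
boys0 <- boys[, names(boys) != "bmi"]

# POST: impute wgt and hgt, compute whr after imputation
imp  <- mice(boys0, m = 2, maxit = 2, seed = 123, printFlag = FALSE)
long <- complete(imp, "long", include = TRUE)
long$whr <- with(long, wgt / (hgt / 100))   # assumes hgt in cm
imp_post <- as.mids(long)

# PASSIVE: whr is part of the data and is recomputed during imputation
boys1 <- boys0
boys1$whr <- boys1$wgt / (boys1$hgt / 100)
meth <- make.method(boys1)
meth["whr"] <- "~ I(wgt / (hgt / 100))"

# Break the feedback loop: do not use whr to impute its own components
pred <- make.predictorMatrix(boys1)
pred[c("wgt", "hgt"), "whr"] <- 0

imp_pas <- mice(boys1, method = meth, predictorMatrix = pred,
                m = 2, maxit = 2, seed = 123, printFlag = FALSE)

# In the passively imputed data the ratio stays consistent with its parts
d <- complete(imp_pas, 1)
print(all.equal(d$whr, d$wgt / (d$hgt / 100)))
```

The predictor matrix edit here is the simplest possible one. The adapted matrix in the second passive method removes further entries, and which ones to remove is a modelling decision that depends on the analysis.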

Method PASSIVE
[Figure: Weight/Height (kg/m) against Height (cm); panels JAV, passive, passive]

Method PASSIVE, predictor matrix
> pred[c("wgt", "hgt", "hc", "reg"), ""] <-
> pred[c("gen", "phb", "tv"), c("hgt", "wgt", "hc")] <-
> pred[, "whr"] <
[Figure: predictor matrix with rows and columns age, hgt, wgt, hc, gen, phb, tv, reg, whr]

Method PASSIVE
[Figure: Weight/Height (kg/m) against Height (cm); panels JAV, passive, passive]

Derived variables: summary
- Derived variables pose special challenges
- Plausible values respect data dependencies
- If you can, create derived variables after imputation
- If you cannot, use passive imputation
- Break up direct feedback loops using the predictor matrix

Standard diagnostic plots in mice
Since mice.5, plots for imputed data:
- one-dimensional scatter: stripplot
- box-and-whisker plot: bwplot
- densities: densityplot
- scattergram: xyplot

Stripplot
> library(mice)
> imp <- mice(nhanes, seed = 998)
> stripplot(imp, pch = c(, 9))
[Figure: stripplot(imp, pch=c(,9)); panels age, chl, hyp against Imputation number]
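The four diagnostic plot functions named above share the same calling pattern and return lattice objects. A minimal sketch on the nhanes data follows; the seed, the plotting symbols and the formula in xyplot() are illustrative choices.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Observed values are drawn in blue and imputed values in red by default
p_strip   <- stripplot(imp, pch = c(1, 19))     # one-dimensional scatter
p_box     <- bwplot(imp)                        # box-and-whisker per imputation
p_density <- densityplot(imp)                   # observed vs imputed densities
p_xy      <- xyplot(imp, chl ~ bmi | .imp)      # scattergram per imputation

print(p_strip)
print(p_xy)
```

Imputations that look very different from the observed data are not necessarily wrong under MAR, but they are worth a closer look before the analysis is run.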

A larger dataset
> imp <- mice(boys, seed =, maxit = )
> bwplot(imp)
[Figure: bwplot(imp); panels age, hgt, hc, wgt, tv against Imputation number]

densityplot(imp)
[Figure: densityplot(imp); panels hgt, wgt, hc, tv; vertical axis Density]

SESSION V

Reporting guidelines
1. Amount of missing data
2. Reasons for missingness
3. Differences between complete and incomplete data
4. Method used to account for missing data
5. Software
6. Number of imputed datasets
7. Imputation model
8. Derived variables
9. Diagnostics
10. Pooling
11. Listwise deletion
12. Sensitivity analysis
