Missing Data: What Are You Missing?

Similar documents
Bootstrap and multiple imputation under missing data in AR(1) models

Handling missing data for indicators, Susanne Rässler 1

MISSING DATA AND MULTIPLE IMPUTATION

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Handling Data with Three Types of Missing Values:

Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data

Missing Data Techniques

The Importance of Modeling the Sampling Design in Multiple. Imputation for Missing Data

Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

Missing Data Missing Data Methods in ML Multiple Imputation

The Performance of Multiple Imputation for Likert-type Items with Missing Data

MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES

Multiple-imputation analysis using Stata s mi command

Statistical Analysis Using Combined Data Sources: Discussion JPSM Distinguished Lecture University of Maryland

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS

Multiple imputation using chained equations: Issues and guidance for practice

R software and examples

NORM software review: handling missing values with multiple imputation methods 1

Missing Data Analysis with SPSS

Missing Data. Where did it go?

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

Performance of Sequential Imputation Method in Multilevel Applications

Missing Data Analysis for the Employee Dataset

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999.

SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY

Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis

Multiple Imputation with Mplus

Missing Data and Imputation

Missing Data in Orthopaedic Research

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background

The Use of Sample Weights in Hot Deck Imputation

Package midastouch. February 7, 2016

Research with Large Databases

Survival Analysis with PHREG: Using MI and MIANALYZE to Accommodate Missing Data

HANDLING MISSING DATA

Group Level Imputation of Statistic Maps, Version 1.0. User Manual by Kenny Vaden, Ph.D. musc. edu)

Handbook of Statistical Modeling for the Social and Behavioral Sciences

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Approaches to Missing Data

Missing Data Analysis with the Mahalanobis Distance

3.6 Sample code: yrbs_data <- read.spss("yrbs07.sav",to.data.frame=true)

Missing data analysis. University College London, 2015

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Analysis of Incomplete Multivariate Data

Types of missingness and common strategies

Analysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset

PSY 9556B (Jan8) Design Issues and Missing Data Continued Examples of Simulations for Projects

Handling Missing Data

CHAPTER 1 INTRODUCTION

Tools for Imputing Missing Data

Statistical and Computational Challenges in Combining Information from Multiple data Sources. T. E. Raghunathan University of Michigan

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU.

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Amelia multiple imputation in R

Accounting for Complex Sample Designs in Multiple Imputation Using the Finite. Population Bayesian Bootstrap. Hanzhi Zhou

Improving Imputation Accuracy in Ordinal Data Using Classification

BMC Medical Research Methodology 2012, 12:184

Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex

Statistics, Data Analysis & Econometrics

Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data

Tools for Statistical Analysis with Missing Data: Application to a Large Medical Database

Using Imputation Techniques to Help Learn Accurate Classifiers

WELCOME! Lecture 3 Thommy Perlinger

A Fast Multivariate Nearest Neighbour Imputation Algorithm

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

Faculty of Sciences. Holger Cevallos Valdiviezo

Statistical matching: conditional. independence assumption and auxiliary information

PcAux. Kyle M. Lang, Jacob Curtis, Daniel E Bontempo Institute for Measurement, Methodology, Analysis & Policy at Texas Tech University May 5, 2017

Missing Data Analysis for the Employee Dataset

Lecture 26: Missing data

11.0 APPENDIX-B: COMPUTATION

Module I: Clinical Trials a Practical Guide to Design, Analysis, and Reporting 1. Fundamentals of Trial Design

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis

STATISTICS (STAT) Statistics (STAT) 1

Longitudinal Modeling With Randomly and Systematically Missing Data: A Simulation of Ad Hoc, Maximum Likelihood, and Multiple Imputation Techniques

Statistical modelling with missing data using multiple imputation. Session 2: Multiple Imputation

Didacticiel - Études de cas

Package CALIBERrfimpute

Data analysis using Microsoft Excel

Development of weighted model fit indexes for structural equation models using multiple imputation

Hierarchical Mixture Models for Nested Data Structures

IBM SPSS Missing Values 21

Product Catalog. AcaStat. Software

WHAT DO WE DO WITH MISSING DATA?SOME OPTIONS FOR ANALYSIS OF INCOMPLETE DATA

Machine Learning: An Applied Econometric Approach Online Appendix

Analysis of Complex Survey Data with SAS

An Algorithm for Creating Models for Imputation Using the MICE Approach:

Generalized least squares (GLS) estimates of the level-2 coefficients,

Missing data analysis: - A study of complete case analysis, single imputation and multiple imputation. Filip Lindhfors and Farhana Morko

STATISTICS (STAT) 200 Level Courses. 300 Level Courses. Statistics (STAT) 1

Correctly Compute Complex Samples Statistics

A STOCHASTIC METHOD FOR ESTIMATING IMPUTATION ACCURACY

STATISTICS (STAT) 200 Level Courses Registration Restrictions: STAT 250: Required Prerequisites: not Schedule Type: Mason Core: STAT 346:

Supplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International

Transcription:

Missing Data: What Are You Missing? Craig D. Newgard, MD, MPH Jason S. Haukoos, MD, MS Roger J. Lewis, MD, PhD Society for Academic Emergency Medicine Annual Meeting San Francisco, CA May 006 INTRODUCTION Missing data are ubiquitous in clinical research. While many researchers consider missing data a nuisance, ignoring missing data is fraught with potential complications. Naïve methods for handling missing data can introduce substantial bias, reduce precision of estimates, and reduce study power, any of which may lead to invalid study conclusions. MECHANISMS OF MISSING DATA Missing Completely At Random (MCAR): when the probability a value is missing is independent of all other observed and unobserved characteristics of the sample. Missing At Random (MAR): when missing values can be completely explained by observed values in the sample (i.e., independent of unobserved values). Most valid missing data procedures require data to be missing by MCAR or MAR mechanisms. Missing Not At Random (MNAR): when missingness is related to missing values rather than observed values. It is not possible to distinguish between MAR and MNAR mechanisms. PATTERNS OF MISSING DATA Monotone Obs 3 4 6 7 3 4

Missing Data: What Are You Missing? Non-monotone (general) Obs 3 4 6 7 3 4 Two variables never jointly observed (variables 4 and ) Obs 3 4 6 7 3 4 METHODS FOR HANDLING MISSING DATA Complete Case Analysis: excludes all observations with missing values for any variable(s) of interest (e.g., for multivariable analyses), thus limiting the analysis to those observations for which all values are observed, often resulting in biased estimates and loss of precision. Available Case Analysis: complete case analysis for univariate comparisons. Weighted Complete Case Analysis: weighting of complete cases to adjust for nonresponse and to correct for potential bias (at expense of efficiency), generally used in surveys where cases of nonresponse have no available data. Simple (Single) Imputation: imputing one plausible value for each missing variable within a dataset, then conducting the analyses as if all data were originally observed. Simple imputation generally results in inappropriately small variances and may produce biased results. There are several types of simple imputation, including: -Mean and median imputation: replacing missing values with the mean or median values from the observed data - can lead to biased parameter estimates because missing values are replaced with values at the center of the distribution for a given variable.

Missing Data: What Are You Missing? 3 -Regression imputation: involves replacing each missing value with a single predicted value from a regression analysis in which fully observed cases are used to determine the predicted values - underestimates the variance because no residual error is assumed around the regression line. -Stochastic regression imputation: replaces missing values with the predicted value from a regression analysis plus its residual error - incorporates uncertainty into the predicted value, thus improving upon the primary limitation of simple regression imputation. -Hot deck imputation: a non-parametric method of matching cases with missing values to observations with observed values for the same variable. Works well in certain settings, though still has a potential for bias, need for variance correction, and need for having an adequate number of complete cases for matching. -Cold deck imputation: replacing missing values with a value(s) from an external data source using methodology similar to the hot deck technique. Data from the external source may differ systematically from the primary dataset, adding additional bias to estimates. -Last observation carry forward: used during longitudinal repeated measures research - involves imputing missing follow-up measures with the last recorded value for a respondent. Although simple, this method assumes (often mistakenly) that the value remains unchanged after the patient drops out. -Worst-case analysis: imputes the worst-case value for observations with missing values, most commonly for missing outcome data. Multiple Imputation (MI): MI is an extension of simple imputation, where each missing value is replaced by a set of m > simulated values (generally -0) that exist in m complete datasets. Each of the m datasets is then analyzed by standard analytic methods, and the results from each dataset are combined using a standard set of rules that appropriately account for the uncertainty (i.e., variance) in the MI process.

Missing Data: What Are You Missing? 4 Assumptions: -MCAR or MAR mechanism for missingness -Multivariate normal distribution for variables -Adequate sample size The Data Model: The first step in MI is to generate a probability model that relates the complete data set Y (consisting of missing values Y mis and observed values Y obs ) to a set of parameters. Using Bayesian methods, observed values are used to generate a predictive distribution for missing values, after which multiple random draws are made (equivalent to the m number of complete data sets generated), thus producing the multiple imputations for each subject originally missing data. Selection of variables for the data model: -target variables: variables to be included in subsequent analyses, including the outcome measure -auxiliary variables: variables highly predictive of target variables or important in explaining the missingness of variables -sample design variables: variables important in sampling design (e.g., clusters and strata) Determining the number of imputed values/datasets to generate: selection of the number of imputations (m) is based on the desired relative efficiency of MI estimates and is approximated using the formula: = ( + /m) - where is the rate of missing data. In almost all instances, m = 0 provides a high rate of efficiency. Proportion missing data ( ) # imputations (m) 0% 30% 0% 70% 90% 3.97.9.86.8.77.98.94.9.88.8 0.99.97.9.93.9 0.00.99.98.97.96 30.00.99.98.98.97 Analyzing Multiply Imputed Data: Once the m imputed datasets have been created, the data in each dataset can be analyzed using standard statistical procedures. Parameter estimates (Q) are averaged over the m number of datasets: aveq m = m Qi The within-imputation variance (U m ) is represented by the average of the m complete data variances: The between-imputation variance (B m ) is: aveu m = m U i B m = (m ) (Q i aveq m )

Missing Data: What Are You Missing? The total variance (T m ) is generated by combining both the within-imputation and between-imputation variances, T m = aveu m + ( + m )B m An interval estimate (e.g., 9% confidence interval) is generated by: aveq m t v ( /)T m / where t v ( /) represents the upper 00( /) percentage point of the student t distribution with v degrees of freedom: v = (m ) [ + aveu m /( + m )B m ] Special Situations with Multiple Imputation: -Imputer versus Analyst -Coding missing -Probability samples:

Missing Data: What Are You Missing? 6 -Interaction terms (parallel chains of MI): -Split sample MI MI versus Maximum likelihood (ML): -ML methods are model specific -ML is computationally intensive -ML accounts for missing values during the analysis (one step) -ML is slightly more efficient than MI, though the difference is generally not noticeable -software using ML may be less able to integrate auxiliary variables, endangering the MAR assumption HOW MUCH IS TOO MUCH MISSING DATA? -difficult to generate guidelines for what constitutes a small portion of missing data -small amounts (e.g., 3%) of missing data may still generate substantial bias and reduce precision -there is no proportion of missing data under which valid results and preservation of study power can be guaranteed EXAMPLES: National Automotive Sampling System Crashworthiness Data System (NASS CDS) a probability sample of all drivers 6 years involved in MVCs with passenger vehicles or light trucks during the years 99-003 (n = 38,67). For the majority of examples, the same 0-variable logistic regression model (outcome = Injury Severity Score 6) was used. The figures demonstrate how various missing data factors affect one categorical (RESTRAINT USE) and one continuous (DELTA V) variable included in such models.

Odds Ratio Odds Ratio % sample available for analysis Missing Data: What Are You Missing? 7 Example. Effect of increasing the number of variables in a multivariable logistic regression model on results for RESTRAINT USE (MAR pattern, baseline rate of restraint missing = 30%). 00% 90% 80% 70% 60% 0. 0% 40% 30% 0% 0% 0.0 3 7 9 3 Number of variables in model 0% Truth Complete Case Multiple Imputation Example. The effect of different underlying mechanisms for missing values on study results for RESTRAINT USE in a multivariable logistic regression model (baseline rate of restraint missing = 30%). 0. MCAR MAR MNAR 0.0 Truth Complete Case Multiple Imputation

Mean DeltaV Odds Ratio Patients Missing Data: What Are You Missing? 8 Example 3. The effect of increasing the proportion of missing data on study results for RESTRAINT USE in a multivariable logistic regression model (MAR pattern). 3000 30000 000 0. 0000 000 0000 000 0.0 0% 0% 0% 30% 40% 0% 60% 70% 80% 90% % missing data 0 Truth Multiple Imputation Complete Case Subjects available for CC analysis Example 4. The effect of increasing the proportion of missing data on study results for mean DELTA V (MAR pattern, univariate analysis). 40 38 36 34 3 30 8 6 4 0 0% 0% 0% 30% 40% 0% 60% 70% 80% 90% % missing data Truth Complete Case Simple Imputation Multiple Imputation

Odds Ratio Missing Data: What Are You Missing? 9 Example. The effect of increasing the proportion of missing data for restraint use on results for a separate covariate (LATERAL IMPACT) with a fixed proportion of missing data (4%) in a multivariable logistic regression model (MAR pattern). 0 0. 0% 0% 0% 30% 40% 0% 60% 70% 80% 90% % missing data Truth Complete Case Multiple Imputation USEFUL IMPUTATION INFO: The imputation listserv: http://lists.utsouthwestern.edu/mailman/listinfo/impute Joseph Schafer s list of multiple imputation software (including free NORM download and S-Plus macros): http://www.stat.psu.edu/~jls/misoftwa.html Sample of software that offer multiple imputation procedures and methods to analyze/combine results: -SAS v8 and v9 (PROC MI and PROC MIANALYZE) -IVEware (free SAS-callable software: http://www.isr.umich.edu/src/smp/ive/) -Multiple Imputation for Chained Equations (MICE) for S-Plus and R -macros for S-Plus (Joseph Schafer) -NORM (free stand-alone windows package) -SPSS -LISREL -SOLAS Additional software that do not perform MI, but can analyze multiply imputed data: -SUDAAN v9 -Stata

Missing Data: What Are You Missing? 0 CONCLUSIONS Missing data is a major problem in clinical research. Ignoring missing values can increase bias, reduce study power, and reduce precision of estimates, sometimes drastically. Inappropriate methods for handling missing values (e.g., complete case analysis and simple imputation) can potentially invalidate study results. Provided certain assumptions are met, multiple imputation is one method for generating valid estimates and preserving study power, while appropriately accounting for the uncertainty in the imputation process.

Missing Data: What Are You Missing? SELECT REFERENCES: Texts:. Little RJA, Rubin DB. Statistical analysis with missing data. New York: Wiley, 989.. Allison PD. (00) Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-36. Thousand Oaks, CA: Sage. 3. Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley, 987. 4. Schafer JL. Analysis of incomplete multivariate data. Boca Raton, Florida: Chapman & Hall/CRC, 997. SAS Multiple imputation coding:. Yuan YC. Multiple imputation for missing data: concepts and new development. P67-. SAS Institute Inc. IVEware background and multiple imputation coding: 6. Raghunathan TE, Lepkowski, Van Hoewyk J, Solenberger PW. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 00;7:8-9. 7. Raghunathan TE, Solenberger PW, Van Hoewyk JV. IVEware: Imputation and variance estimation software user guide. Survey Methodology Program, Institute for Social Research, University of Michigan, March 00. Additional missing data references: 8. Sinharay S, Stern HS, Russell D. The use of multiple imputation for the analysis of missing data. Psychological Methods. 00;6:37-39. 9. Schafer JL. Multiple imputation: a primer. Statistical Methods in Medical Research. 999;8:3-. 0. Collins LM, Schafer JL, Kam CM. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods. 00;6:330-3.. Reiter JP, Raghunathan TE. Multiple imputation for missing data in surveys with complex sample designs. Technical Report, Institute for Social Research, University of Michigan, Ann Arbor, MI. 00.. Rubin DB, Schenker N. Multiple imputation in health-care databases: an overview and some applications. Stat Med 99;0:8-98. 3. Horton NJ, Lipsitz SR. Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician. 00;:44-4.