Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background


An Introduction to Multiple Imputation and its Application

Craig K. Enders
University of California - Los Angeles, Department of Psychology
cenders@psych.ucla.edu

Background
Work supported by Institute of Educational Sciences award R35D

Missing Data Theory
Rubin's (1976) missing data mechanisms describe different relations between the probability of nonresponse and the data.
Each incomplete variable consists of observed scores and hypothetical values for the nonresponders.
Mechanisms describe different relations between nonresponse and the observed / hypothetical scores.

Motivating Example
The data come from participants in a smoking cessation study.
Participants report the number of years smoking and the number of cigarettes smoked.
25% of respondents do not report the number of cigarettes smoked.
[Data table: Years and Cigarettes scores, with NA marking unreported cigarette counts]

Missing Completely At Random (MCAR)
The probability of missing data on Y is unrelated to other variables and to the hypothetical values of Y.
There are no systematic determinants of missingness.
e.g., a data collection app failed to capture or transmit data, or participants forgot to respond for idiosyncratic reasons.

Missing At Random (MAR)
The probability of missing data on Y is related to observed scores but not to the hypothetical values of Y.
Differences between the complete cases and nonresponders vanish after controlling for observed variables.
e.g., long-time smokers have a greater tendency for nonresponse, but the number of cigarettes smoked carries no additional information.

Not Missing At Random (NMAR)
The probability of missing data on Y is related to the hypothetical values of Y itself.
The complete cases and nonresponders systematically differ even after controlling for observed variables.
e.g., participants who smoke more frequently are less likely to respond, even after adjusting for years smoking.

Why Mechanisms Matter
Mechanisms function as analysis assumptions; estimates are biased when the assumptions are violated.
Older approaches such as deletion require MCAR; other methods make no attempt to satisfy any mechanism.
Multiple imputation, maximum likelihood, and Bayesian estimation assume MAR.
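The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the population model, the logistic nonresponse curves, and the name `delete_values` are assumptions chosen for the example, not values from the smoking study. It deletes cigarette scores under each mechanism and compares the complete-case mean with the full-data mean.

```python
import math
import random


def delete_values(years, cigs, mechanism, rng):
    """Return a copy of cigs with None wherever the value is made missing."""
    incomplete = []
    for y, c in zip(years, cigs):
        if mechanism == "MCAR":
            p_missing = 0.25                               # same for everyone
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-(y - 14) / 2))  # depends on observed years
        elif mechanism == "NMAR":
            p_missing = 1 / (1 + math.exp(-(c - 13) / 2))  # depends on cigs itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        incomplete.append(None if rng.random() < p_missing else c)
    return incomplete


rng = random.Random(1)
years = [rng.uniform(1, 20) for _ in range(2000)]
# Hypothetical population model (an assumption for illustration).
cigs = [5 + 0.5 * y + rng.gauss(0, 2) for y in years]
full_mean = sum(cigs) / len(cigs)

results = {}
for mechanism in ("MCAR", "MAR", "NMAR"):
    incomplete = delete_values(years, cigs, mechanism, rng)
    observed = [c for c in incomplete if c is not None]
    results[mechanism] = {
        "missing_rate": 1 - len(observed) / len(cigs),
        "complete_case_mean": sum(observed) / len(observed),
    }
    print(mechanism, results[mechanism])
print("full-data mean:", full_mean)
```

Under these assumed settings the complete-case mean stays close to the full-data mean only for MCAR, which is why deletion needs that mechanism.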

Missing at Random Example
Number of years smoking and number of cigarettes smoked.
25% of respondents do not report the number of cigarettes smoked.
The likelihood of nonresponse increases with years smoking.
[Data table: Years and Cigarettes scores, with NA marking the incomplete cases]
[Figure: Scatterplot of Observed and Hypothetical Scores, distinguishing complete cases from incomplete cases]
[Figures: Impact of Deletion, comparing the full (hypothetical) data with the complete cases (observed data) and the incomplete cases (missing values)]

Multiple Imputation and Maximum Likelihood
Multiple imputation and maximum likelihood are MAR-based methods that are widely available in software.
Multiple imputation fills in the data prior to analysis; maximum likelihood estimates parameters directly from the observed data.
The methods are equivalent with normally distributed data.

Why Imputation?
Imputation is often better because it allows researchers to tailor the missing data handling procedure to honor the features of the data and/or a particular analysis.
e.g., mixtures of continuous and categorical variables, scale scores computed from questionnaire items, multilevel data.

Multiple Imputation Overview
Multiple imputation generates several complete data sets (e.g., M = or more), each with different imputations.
Unique regression coefficients generate each data set.
Analyzing multiple complete data sets provides a mechanism to adjust standard errors for missing data.

Multiple Imputation Steps: Imputation
The imputation phase creates multiple copies of the data, each with different replacement values.
[Figure: an incomplete data set with variables X, Y, and Z, followed by three filled-in copies]

Multiple Imputation Steps: Analysis
In the analysis phase the researcher analyzes and obtains estimates from each complete data set.
[Figure: three imputed data sets (X, Y, Z), each yielding its own estimate $\theta_1$, $\theta_2$, $\theta_3$ of the relation between X and Y]

Multiple Imputation Steps: Pooling
The pooling phase combines estimates and standard errors into a single set of results.
$\hat{\theta} = \frac{\theta_1 + \theta_2 + \theta_3}{3}$

Multiple Imputation: Imputation Phase

Imputation Phase
The imputation phase uses a regression model to define a distribution of plausible replacement scores.
Imputations are randomly sampled from this distribution.
An iterative Bayesian estimation algorithm provides unique parameter estimates for each round of imputation.

Imputation Example: Years =
Imputation = Predicted Score + Noise
$\hat{Y}_{Cigs} = \beta_0 + \beta_1(Years) = 9.77$
$Cigs_{mis} \sim N(\hat{Y}_{Cigs}, \sigma^2_\varepsilon) \sim N(9.77, .17)$
$Cigs_{mis} = \hat{Y}_{Cigs} + \varepsilon$

Imputation Example: Years = 1
Imputation = Predicted Score + Noise
$\hat{Y}_{Cigs} = \beta_0 + \beta_1(Years) = 1.69$
$Cigs_{mis} \sim N(\hat{Y}_{Cigs}, \sigma^2_\varepsilon) \sim N(1.69, .17)$
$Cigs_{mis} = \hat{Y}_{Cigs} + \varepsilon$
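The "predicted score + noise" recipe can be sketched in a few lines. This is a minimal illustration, assuming hypothetical coefficients and a hypothetical residual variance (the values 7.0, 0.3, and 4.0 and the name `impute_once` are not taken from the slides): each call computes the predicted score for a given number of years and adds a normal residual.

```python
import random
from statistics import mean, variance


def impute_once(years, b0, b1, resid_var, rng):
    """Draw one imputation: predicted score plus a random normal residual."""
    predicted = b0 + b1 * years
    noise = rng.gauss(0.0, resid_var ** 0.5)
    return predicted + noise


rng = random.Random(2024)
# Hypothetical regression parameters (assumptions for illustration only).
b0, b1, resid_var = 7.0, 0.3, 4.0

# Many draws for one respondent with 10 years of smoking: the draws scatter
# around the predicted score with spread set by the residual variance.
draws = [impute_once(10, b0, b1, resid_var, rng) for _ in range(5000)]
print(round(mean(draws), 2), round(variance(draws), 2))
```

Respondents with the same number of years share a predicted score, so the noise term is what keeps their imputations from all landing on the regression line.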

Imputation Example: Years =
Imputation = Predicted Score + Noise
$\hat{Y}_{Cigs} = \beta_0 + \beta_1(Years) = 11.59$
$Cigs_{mis} \sim N(\hat{Y}_{Cigs}, \sigma^2_\varepsilon)$, a normal distribution centered at 11.59
$Cigs_{mis} = \hat{Y}_{Cigs} + \varepsilon$

Imputation Scatterplot
[Figure: scatterplot of the observed and imputed scores]

Updating Parameter Values
The next round of imputation requires new regression parameters.
A Bayesian estimation algorithm samples new estimates from a distribution of plausible values.
This is akin to estimating the regression from the filled-in data and randomly perturbing the estimates.

Alternate Regression Lines
[Figure: alternate (perturbed) regression lines plotted around the complete-data regression line]

Imputation Example: Years =
$\hat{Y}_{Cigs} = \beta_0 + \beta_1(Years) = 1.76$ (updated regression line)
$Cigs_{mis} \sim N(\hat{Y}_{Cigs}, \sigma^2_\varepsilon)$, a normal distribution centered at 1.76

Bayesian Estimation Steps for a Single Iteration
1. Update residual variance: $\sigma^{2(t)}_\varepsilon \sim P(\sigma^2_\varepsilon \mid \beta^{(t-1)}, Cigs^{(t-1)}_{imp}, Years)$
2. Update coefficients: $\beta^{(t)} \sim P(\beta \mid \sigma^{2(t)}_\varepsilon, Cigs^{(t-1)}_{imp}, Years)$
3. Update missing values: $Cigs^{(t)}_{mis} \sim N(\beta^{(t)}_0 + \beta^{(t)}_1(Years), \sigma^{2(t)}_\varepsilon)$

Burn-in Interval
[Figure: iterations 1, 2, 3, ... run through the burn-in interval, after which the first data set is saved]
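The three-step iteration, together with a burn-in period and a fixed number of iterations between saved data sets, can be sketched as follows. This is a minimal illustration, not Blimp's implementation: it assumes noninformative priors, centers Years so the intercept and slope can be drawn independently, and uses hypothetical names (`gibbs_impute`, `n_burn`, `n_thin`, `n_sets`) and simulated data.

```python
import random


def gibbs_impute(years, cigs, n_burn, n_thin, n_sets, seed=0):
    """Impute None entries of cigs from a regression on years; return n_sets filled-in copies."""
    rng = random.Random(seed)
    n = len(years)
    missing = [i for i, c in enumerate(cigs) if c is None]
    observed = [c for c in cigs if c is not None]
    x_mean = sum(years) / n
    x = [v - x_mean for v in years]      # centering makes the two coefficients independent
    sxx = sum(v * v for v in x)

    start = sum(observed) / len(observed)
    y = [start if c is None else c for c in cigs]   # crude starting imputations
    b0, b1 = start, 0.0
    saved = []
    total = n_burn + (n_sets - 1) * n_thin
    for t in range(1, total + 1):
        # 1. Residual variance given the previous coefficients and filled-in data.
        ssr = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
        sigma2 = ssr / rng.gammavariate(n / 2, 2)   # SSR divided by a chi-square(n) draw
        # 2. Coefficients given the new residual variance and filled-in data.
        b0_hat = sum(y) / n
        b1_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
        b0 = rng.gauss(b0_hat, (sigma2 / n) ** 0.5)
        b1 = rng.gauss(b1_hat, (sigma2 / sxx) ** 0.5)
        # 3. Missing values given the new parameters.
        for i in missing:
            y[i] = rng.gauss(b0 + b1 * x[i], sigma2 ** 0.5)
        # Save after the burn-in interval, then once every n_thin iterations.
        if t >= n_burn and (t - n_burn) % n_thin == 0:
            saved.append(list(y))
    return saved


# Simulated data (assumed values); every fourth cigarette score is deleted.
data_rng = random.Random(7)
years = [data_rng.uniform(1, 20) for _ in range(200)]
cigs_full = [7 + 0.3 * v + data_rng.gauss(0, 2) for v in years]
cigs_obs = [None if i % 4 == 0 else c for i, c in enumerate(cigs_full)]

data_sets = gibbs_impute(years, cigs_obs, n_burn=200, n_thin=50, n_sets=3)
print(len(data_sets), [round(d[0], 2) for d in data_sets])
```

Observed scores are identical across the saved data sets; only the imputed entries change, because each saved copy comes from a different set of parameter draws.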

Thinning (Between-Imputation) Interval
[Figure: after the first data set is saved, the algorithm iterates through a thinning interval and then saves data set 2]

Thinning Interval, Continued
[Figure: the algorithm iterates through another thinning interval and then saves data set 3]

Multivariate Missing Data
Fully conditional specification (chained equations or sequential regression imputation) imputes incomplete variables one at a time in a sequence.
Joint model imputation uses multivariate regression to impute the incomplete variables in a single step.
Variable-by-variable and multivariate imputation are equivalent with normally distributed data.

FCS Imputation Scheme
FCS imputation uses a series of univariate regression models to impute incomplete variables in a sequence.
Update Y1 parameters, then impute Y1 given Y2 and Y3.
Update Y2 parameters, then impute Y2 given Y1 and Y3.
Update Y3 parameters, then impute Y3 given Y1 and Y2.
Save a data set.

Smoking Data
Smoking cessation study where number of cigarettes smoked and efficacy to quit are incomplete.
[Data table: Years, Cigs, and Efficacy scores, with NA marking missing values]

Pattern   Years   Cigs   Efficacy
1         O       O      O
2         O       M      O
3         O       O      M
4         O       M      M
(O = observed, M = missing)

Algorithmic Steps for a Single Iteration of FCS
1: Number of Cigarettes Smoked (parameters $\beta_0, \beta_1, \beta_2, \sigma^2_\varepsilon$)
$Cigs^{(t)}_{mis} \sim N(\hat{Y}_{Cigs}, \sigma^2_\varepsilon)$
$\hat{Y}_{Cigs} = \beta_0 + \beta_1(Years) + \beta_2(SE^{(t-1)}_{imp})$
2: Self-Efficacy to Quit (parameters $\gamma_0, \gamma_1, \gamma_2, \sigma^2_e$)
$SE^{(t)}_{mis} \sim N(\hat{Y}_{SE}, \sigma^2_e)$
$\hat{Y}_{SE} = \gamma_0 + \gamma_1(Years) + \gamma_2(Cigs^{(t)}_{imp})$
[Figure: the Years, Cigs, and Efficacy data at successive steps of the iteration]
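The variable-by-variable cycle can be sketched for the smoking example, where Cigs and Efficacy are incomplete and Years is complete. This is a simplified illustration, not Blimp's algorithm: it treats both variables as normal, assumes noninformative priors, draws each regression's parameters from their posterior given the current filled-in data, and uses hypothetical names (`update_and_impute`, `fcs_impute`) and simulated data.

```python
import random


def mean_fill(values):
    """Starting values: replace None with the observed mean."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]


def update_and_impute(y, x1, x2, missing, rng):
    """Draw parameters for y regressed on x1 and x2, then re-impute y[missing] in place."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, y))
    s2y = sum(b * v for b, v in zip(c2, y))
    det = s11 * s22 - s12 * s12
    g1 = (s22 * s1y - s12 * s2y) / det          # least-squares slopes
    g2 = (s11 * s2y - s12 * s1y) / det
    ssr = sum((v - my - g1 * a - g2 * b) ** 2 for v, a, b in zip(y, c1, c2))
    sigma2 = ssr / rng.gammavariate((n - 3) / 2, 2)   # residual variance draw
    # Coefficient draws: intercept is independent of the slopes after centering;
    # the slopes come from a bivariate normal via a 2 x 2 Cholesky factor.
    b0 = rng.gauss(my, (sigma2 / n) ** 0.5)
    v11, v22, v12 = sigma2 * s22 / det, sigma2 * s11 / det, -sigma2 * s12 / det
    l11 = v11 ** 0.5
    l21 = v12 / l11
    l22 = (v22 - l21 * l21) ** 0.5
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    b1 = g1 + l11 * z1
    b2 = g2 + l21 * z1 + l22 * z2
    for i in missing:
        y[i] = rng.gauss(b0 + b1 * c1[i] + b2 * c2[i], sigma2 ** 0.5)


def fcs_impute(years, cigs, efficacy, n_iter, seed=0):
    rng = random.Random(seed)
    cig_missing = [i for i, v in enumerate(cigs) if v is None]
    eff_missing = [i for i, v in enumerate(efficacy) if v is None]
    cigs_f, eff_f = mean_fill(cigs), mean_fill(efficacy)
    for _ in range(n_iter):
        update_and_impute(cigs_f, years, eff_f, cig_missing, rng)   # Cigs | Years, Efficacy
        update_and_impute(eff_f, years, cigs_f, eff_missing, rng)   # Efficacy | Years, Cigs
    return cigs_f, eff_f


# Simulated data (assumed values) reproducing the four missing data patterns.
data_rng = random.Random(11)
years = [data_rng.uniform(1, 20) for _ in range(300)]
cigs_full = [7 + 0.3 * v + data_rng.gauss(0, 2) for v in years]
eff_full = [6 - 0.1 * c + data_rng.gauss(0, 1) for c in cigs_full]
cigs = [None if i % 4 == 0 else c for i, c in enumerate(cigs_full)]
efficacy = [None if i % 5 == 0 else e for i, e in enumerate(eff_full)]

cigs_imp, eff_imp = fcs_impute(years, cigs, efficacy, n_iter=100)
print(round(cigs_imp[0], 2), round(eff_imp[0], 2))
```

Within one iteration the efficacy step already uses the cigarette imputations drawn moments earlier, mirroring the (t) superscript on the imputed cigarette scores in step 2.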

Multiple Imputation: Analysis and Pooling Phases

Analysis and Pooling
In the analysis phase the researcher analyzes and obtains estimates from each complete data set.
The pooling phase combines the estimates and standard errors into a single set of results.
Significance tests are performed on the pooled values.

Pooling Estimates
The multiple imputation point estimate is the arithmetic average of the M complete-data estimates.
$\hat{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m$
Here $\hat{\theta}$ is the pooled estimate, $\hat{\theta}_m$ is the estimate from imputed data set m, and M is the number of data sets.

Example: Descriptives and Example: Correlations
[Tables: M, SD, and N, and the correlation matrices, for Years, Cigs, and SE in Data Sets 1, 2, and 3, each followed by a Pooled Estimates table; the worked averages shown end in 1.3 and .37]

Pooling Standard Errors
Averaging standard errors underestimates sampling variability because the component standard errors are computed from complete data sets.
The imputation standard error combines complete-data sampling error and missing data uncertainty.
$SE = \sqrt{V_T} = \sqrt{V_W + V_B + V_B / M}$
$V_W$ is the average squared standard error, $V_B$ is the variance of the estimates across imputed data sets, and $V_B / M$ is a correction for using a finite number of imputations.

Standard Error Decomposition
Imputation standard errors consist of two components.
Within-imputation variance estimates complete-data sampling error, and between-imputation variance captures additional noise from the missing data.
$V_W + V_B + V_B / M = V_T$

Significance Test
A test statistic is based on the pooled estimate and standard error.
$t \ (\text{or } z) = \frac{\hat{\theta} - \theta_0}{SE}$
Here $\hat{\theta}$ is the pooled estimate, $\theta_0$ is the hypothesized value, and SE is the pooled standard error.
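The pooling formulas translate directly into code. The sketch below is illustrative: the function name `pool` and the three estimates and standard errors are hypothetical, not results from the slides. It averages the estimates, combines within- and between-imputation variance into the total variance, and forms the test statistic.

```python
import math
from statistics import mean, variance


def pool(estimates, std_errors, null_value=0.0):
    """Pool M complete-data estimates and standard errors."""
    m = len(estimates)
    theta = mean(estimates)                      # pooled point estimate
    v_w = mean([se ** 2 for se in std_errors])   # within-imputation variance
    v_b = variance(estimates)                    # between-imputation variance (M - 1 divisor)
    v_t = v_w + v_b + v_b / m                    # total sampling variance
    se = math.sqrt(v_t)
    return theta, se, (theta - null_value) / se


# Hypothetical estimates and standard errors from M = 3 imputed data sets.
theta, se, t_stat = pool([0.50, 0.30, 0.40], [0.10, 0.10, 0.10])
print(round(theta, 3), round(se, 4), round(t_stat, 3))
```

With these inputs the pooled standard error exceeds every complete-data standard error, because the estimates disagree across data sets and that disagreement enters through the between-imputation variance.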


Single-Level Imputation with the Blimp Graphical Interface

Blimp Software and Data
The Blimp application for Mac OS and Windows was developed with support from Institute of Educational Sciences award R35D1556.
Blimp can accommodate mixtures of categorical (nominal or ordinal) and continuous variables in data sets with up to three levels.
Software, raw data, and analysis scripts are available at appliedmissingdata.com/multilevel-imputation.html

Motivating Example
A math problem solving intervention randomly assigns students to an intervention or a control curriculum.
$Probsolv = \beta_0 + \beta_1(Efficacy) + \beta_2(Disab2) + \beta_3(Disab3) + \beta_4(Teachexp) + \beta_5(Txcode) + \varepsilon$
The analysis is a regression model that predicts problem-solving scores from the intervention code and covariates.

Input Data
Variable    Description                                       Metric
school      School identifier variable                        Nominal
txcode      Treatment code (0 = control, 1 = intervention)    Nominal
pctminor    Percentage of minority students                   Numeric
teachexp    Teacher experience                                Numeric
stanmath    Standardized math scores                          Numeric
probsolv    End-of-year problem-solving scores                Numeric
efficacy    Math self-efficacy (6-point rating scale)         Ordinal
disab       Disability classification (three groups)          Nominal

Import Data
Choose Import Data from the File menu, then select the location of the input text file.


Data View
From the Data View tab, specify the delimiter (space or comma), enter the missing value code, and click Import.

Variable View
From the Variable View tab, assign names and scales to the variables, then click Done.

Specifying an Imputation Model
From the pull-down, select Specify Model. An interface will appear that allows you to specify the variables to be included in the imputation model as well as various algorithmic options.

Model Tab
From the Model tab, click Single-Level Imputation and use the right (left) arrow to select (remove) variables from the imputation model.

MCMC Tab
From the MCMC tab, specify the algorithmic options. The radio buttons at the bottom of the page can be left at their default values.
The tab contains the following settings:
  Preliminary iterations
  Iterations separating each data set
  Number of imputed data sets
  Seed for random number generator
  Default estimation settings (no need to change in most cases)

Output Tab
From the Output tab, specify a name and format for the imputed data sets. Click the PSR ratio button for convergence diagnostics.
The tab contains the following settings:
  Imputations in comma or space delimited files
  Imputations in a stacked file (R, SPSS, SAS) or separate files (Mplus)
  Potential scale reduction (PSR) factor diagnostic tables

Blimp Command Script
Clicking the Done button on the Output tab generates a Blimp command script that reflects the options selected from the graphical interface.

Running Blimp
From the pull-down, select Run. A dialog box will prompt you to save the Blimp command script.

Blimp Output Window
Blimp will begin running immediately after saving the file. An output window will appear that displays computational progress, the variable order for the imputed data set(s), and diagnostic tables (if selected).

Mplus Analysis of Blimp Data

    data:
      file = imputationslist.csv;
      type = imputation;
    variable:
      names = school txcode pctminor teachexp stanmath probsolv efficacy disab;
      usevariables = probsolv efficacy teachexp txcode disab2 disab3;
    define:
      ! dummy codes for disability group (group 1 is the reference)
      if (disab eq 1) then disab2 = 0;
      if (disab eq 1) then disab3 = 0;
      if (disab eq 2) then disab2 = 1;
      if (disab eq 2) then disab3 = 0;
      if (disab eq 3) then disab2 = 0;
      if (disab eq 3) then disab3 = 1;
      center efficacy disab2 disab3 teachexp (grandmean);
    model:
      probsolv on efficacy disab2 disab3 teachexp txcode;
    output:
      stdyx;

R Analysis of Blimp Data

    # load libraries
    library(mitml)
    library(nlme)

    # read stacked blimp file
    path <- c("~/desktop/example/imputations.csv")
    impdata <- read.csv(file = path, header = FALSE, sep = ",")
    names(impdata) <- c("imp", "school", "txcode", "pctminor", "teachexp",
                        "stanmath", "probsolv", "efficacy", "disab")

    # dummy codes for disability group (group 1 is the reference)
    impdata$disab2[impdata$disab == 1] <- 0
    impdata$disab3[impdata$disab == 1] <- 0
    impdata$disab2[impdata$disab == 2] <- 1
    impdata$disab3[impdata$disab == 2] <- 0
    impdata$disab2[impdata$disab == 3] <- 0
    impdata$disab3[impdata$disab == 3] <- 1

    # split stacked data into separate files
    implist <- split(impdata, impdata$imp)
    implist <- as.mitml.list(implist)

    # regression with lm
    model <- with(implist, lm(probsolv ~ efficacy + disab2 + disab3 + teachexp + txcode))

    # complete-data denominator degrees of freedom
    # n must equal the number of cases; the value printed here is incomplete
    n <- 1
    numpredictors <- 5
    dfdenom <- n - numpredictors - 1

    # pooled estimates
    testEstimates(model, df.com = dfdenom)

SAS Analysis of Blimp Data

    * read data and compute dummy codes;
    data imputations;
      infile '/folders/myfolders/imputations.csv' delimiter = ',';
      input _imputation_ school txcode pctminor teachexp stanmath probsolv efficacy disab;
      disab2 = 0;
      disab3 = 0;
      if disab = 2 then disab2 = 1;
      if disab = 3 then disab3 = 1;
    run;

    * estimate regression model;
    proc reg data = imputations outest = estimates covout;
      model probsolv = efficacy disab2 disab3 teachexp txcode;
      by _imputation_;
    run;

    * pool estimates (edf is the complete-data error degrees of freedom);
    proc mianalyze data = estimates edf = 99;
      modeleffects Intercept efficacy disab2 disab3 teachexp txcode;
    run;

SPSS Analysis of Blimp Data

    * read data and compute dummy codes.
    data list free file = '/users/craig/desktop/example/imputations.csv'
      /imputation_ school txcode pctminor teachexp stanmath probsolv efficacy disab.
    compute disab2 = 0.
    compute disab3 = 0.
    if (disab = 2) disab2 = 1.
    if (disab = 3) disab3 = 1.
    exe.

    * split file into separate data sets.
    sort cases by imputation_.
    split file layered by imputation_.

    * analysis and pooling.
    regression
      /descriptives mean stddev corr sig n
      /dependent probsolv
      /method=enter efficacy disab2 disab3 teachexp txcode.

Two-Level Imputation with the Blimp Graphical Interface

Motivating Example
A math problem solving intervention randomly assigns schools to an intervention or a control curriculum.
$Probsolv_{ij} = \beta_0 + \beta_1(Efficacy_{ij}) + \beta_2(Disab2_{ij}) + \beta_3(Disab3_{ij}) + \beta_4(Teachexp_j) + \beta_5(Txgrp_j) + u_{0j} + u_{1j}(Efficacy_{ij}) + \varepsilon_{ij}$
The analysis is a random slope regression model that predicts problem-solving scores from the intervention code and covariates.

Input Data
Variable    Description                                       Metric
school      School identifier variable                        Nominal
txcode      Treatment code (0 = control, 1 = intervention)    Nominal
pctminor    Percentage of minority students                   Numeric
teachexp    Teacher experience                                Numeric
stanmath    Standardized math scores                          Numeric
probsolv    End-of-year problem-solving scores                Numeric
efficacy    Math self-efficacy (6-point rating scale)         Ordinal
disab       Disability classification (three groups)          Nominal

Import Data
Choose Import Data from the File menu, then select the location of the input text file.

Data View
From the Data View tab, specify the delimiter (space or comma), enter the missing value code, and click Import.

Variable View
From the Variable View tab, assign names and scales to the variables, then click Done.

Specifying an Imputation Model
From the pull-down, select Specify Model. An interface will appear that allows you to specify the variables to be included in the imputation model as well as various algorithmic options.


Model Tab
From the Model tab, click Single-Level Imputation and use the right (left) arrow to select (remove) variables from the imputation model.

Model Tab
From the Model tab, move the level-2 identifier variable to the Cluster-Level Identifier Variables box, and use the right (left) arrow to select (remove) variables from the imputation model.

Specifying a Random Slope
Select Random Slopes from the Build Terms dropdown, and select the pair of variables that have a random association.

MCMC Tab
From the MCMC tab, specify the algorithmic options. The radio buttons at the bottom of the page can be left at their default values.
The tab contains the following settings:
  Preliminary iterations
  Iterations separating each data set
  Number of imputed data sets
  Seed for random number generator
  Default estimation settings (no need to change in most cases)

Output Tab
From the Output tab, specify a name and format for the imputed data sets. Click the PSR ratio button for convergence diagnostics.
The tab contains the following settings:
  Imputations in comma or space delimited files
  Imputations in a stacked file (R, SPSS, SAS) or separate files (Mplus)
  Potential scale reduction (PSR) factor diagnostic tables

Blimp Command Script
Clicking the Done button on the Output tab generates a Blimp command script that reflects the options selected from the graphical interface.

Running Blimp
From the pull-down, select Run. A dialog box will prompt you to save the Blimp command script.

Blimp Output Window
Blimp will begin running immediately after saving the file. An output window will appear that displays computational progress, the variable order for the imputed data set(s), and diagnostic tables (if selected).

Mplus Analysis of Blimp Data

    data:
      file = imputationslist.csv;
      type = imputation;
    variable:
      names = school txgrp pctminor teachexp stanmath probsolv efficacy disab;
      usevariables = probsolv efficacy teachexp txgrp disab2 disab3;
      cluster = school;
      within = efficacy disab2 disab3;
      between = teachexp txgrp;
    define:
      ! dummy codes for disability group (group 1 is the reference)
      if (disab eq 1) then disab2 = 0;
      if (disab eq 1) then disab3 = 0;
      if (disab eq 2) then disab2 = 1;
      if (disab eq 2) then disab3 = 0;
      if (disab eq 3) then disab2 = 0;
      if (disab eq 3) then disab3 = 1;
      center efficacy teachexp (grandmean);

Mplus Analysis of Blimp Data, Continued

    analysis:
      type = twolevel random;
    model:
      %within%
      effslope | probsolv on efficacy;
      probsolv on disab2 disab3;
      %between%
      probsolv on teachexp txgrp;
      probsolv;
      effslope;
      probsolv with effslope;

R Analysis of Blimp Data

    # load libraries
    library(mitml)
    library(nlme)

    # read stacked blimp file
    path <- c("~/desktop/example/imputations.csv")
    impdata <- read.csv(file = path, header = FALSE, sep = ",")
    names(impdata) <- c("imp", "school", "txcode", "pctminor", "teachexp",
                        "stanmath", "probsolv", "efficacy", "disab")

    # dummy codes for disability group (group 1 is the reference)
    impdata$disab2[impdata$disab == 1] <- 0
    impdata$disab3[impdata$disab == 1] <- 0
    impdata$disab2[impdata$disab == 2] <- 1
    impdata$disab3[impdata$disab == 2] <- 0
    impdata$disab2[impdata$disab == 3] <- 0
    impdata$disab3[impdata$disab == 3] <- 1

R Analysis of Blimp Data, Continued

    # split stacked data into separate files
    implist <- split(impdata, impdata$imp)
    implist <- as.mitml.list(implist)

    # multilevel regression with lmer (lme4 package)
    library(lme4)
    model <- with(implist, lmer(probsolv ~ efficacy + disab2 + disab3 + teachexp + txcode +
                                (efficacy | school), REML = TRUE))
    restricted <- with(implist, lmer(probsolv ~ (efficacy | school), REML = TRUE))

    # pooled estimates
    testEstimates(model, var.comp = TRUE, df.com = NULL)

    # wald test
    testModels(model, restricted, method = c("D1"))

SAS Analysis of Blimp Data

    * read data and compute dummy codes;
    data imputations;
      infile '/folders/myfolders/imputations.csv' delimiter = ',';
      input _imputation_ school txcode pctminor teachexp stanmath probsolv efficacy disab;
      disab2 = 0;
      disab3 = 0;
      if disab = 2 then disab2 = 1;
      if disab = 3 then disab3 = 1;
    run;

    * estimate mlm;
    ods _all_ close;
    proc mixed data = imputations noclprint;
      class school;
      model probsolv = efficacy disab2 disab3 teachexp txcode / solution covb;
      random intercept efficacy / subject = school type = un;
      by _imputation_;
      ods output SolutionF = estimates CovB = covb;
    run;
    ods listing;

SAS Analysis of Blimp Data, Continued

    * pool estimates;
    proc mianalyze parms = estimates;
      modeleffects efficacy disab2 disab3 teachexp txcode;
    run;

    * wald test with mult option;
    proc mianalyze parms = estimates mult covb(effectvar = rowcol) = covb;
      modeleffects efficacy disab2 disab3 teachexp txcode;
    run;

SPSS Analysis of Blimp Data

    * read data and compute dummy codes.
    data list free file = '/users/craig/desktop/example/imputations.csv'
      /imputation_ school txcode pctminor teachexp stanmath probsolv efficacy disab.
    compute disab2 = 0.
    compute disab3 = 0.
    if (disab = 2) disab2 = 1.
    if (disab = 3) disab3 = 1.
    exe.

    * split file into separate data sets.
    sort cases by imputation_.
    split file layered by imputation_.

    * analysis and pooling.
    mixed probsolv with efficacy disab2 disab3 teachexp txcode
      /print = solution testcov
      /fixed = intercept efficacy disab2 disab3 teachexp txcode
      /random = intercept efficacy | subject(school) covtype(un).
