Missing Data Analysis with SPSS


1 Missing Data Analysis with SPSS
Meng-Ting Lo
Department of Educational Studies
Quantitative Research, Evaluation and Measurement Program (QREM)
Research Methodology Center (RMC)

2 Outline
Missing data patterns and mechanisms
Traditional techniques: listwise and pairwise deletion; mean substitution; regression and stochastic regression imputation; hot-deck imputation; averaging the available items; last observation carried forward
Maximum likelihood (ML) and multiple imputation (MI)
SPSS with multiple imputation (demonstration and practice)
Practical issues and myths


3 Data and material
High school longitudinal study of 2009, public-use data: an NCES secondary longitudinal study of more than 21,000 9th graders in 944 schools.
Datasets: Hsls09_MissingDataWorkshop_demo, Hsls09_MissingDataWorkshop_demo2_imputed5, Hsls09_MissingDataWorkshop_demo2_IterationHistory, Hsls09_MissingDataWorkshop_practice
SPSS modules: Missing Value Analysis, Multiple Imputation

4 The importance of dealing with missing data
Real datasets are rarely complete.
Traditional techniques rely on strict assumptions about the missing data mechanism, and these assumptions are rarely met in practice.
Handling missing data inappropriately leads to unreliable, biased estimates and incorrect conclusions.
It can also reduce the statistical power of a test to detect a significant effect (e.g., listwise deletion).


5 Missing data patterns
Where is the missing data in your data set? A missing data pattern describes the location of the missing values (the shaded areas in the figure).
In the past, specific missing data handling methods were developed for different missing data patterns. Now, MI and ML work well with any missing data pattern.
Figures from p. 4 in Enders, C. K. (2010). Applied missing data analysis. Guilford Press.

6 Missing data mechanisms (Rubin, 1976)
Missing data mechanisms describe the relationships between measured variables and the probability of missing data, and they essentially function as assumptions for missing data analysis (Enders, 2010, p. 2).
The three mechanisms are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Why are the data missing? Consider possible explanations for the missing data and find evidence to justify the claim.
The missing data mechanism is much more important than the percentage of missing data: the mechanism governs the performance of different analytic techniques, whereas the percentage missing only indicates the scope of the problem.

7 Missing data mechanisms
Race        DV: Reading Achievement    R
Asian       (missing)                  0
Asian       (missing)                  0
Caucasian   (missing)                  0
Asian       (missing)                  0
Asian       (missing)                  0
Caucasian   66                         1
Caucasian   88                         1
Caucasian   95                         1
Caucasian   [values not recoverable]
Asian       86                         1
Asian       56                         1
Caucasian   78                         1
Rubin (1976) introduced the idea that missingness is a binary variable with a probability distribution.
Race: completely observed. DV: missing for some students. R: missing data indicator (in this table, 0 = missing and 1 = observed).
Is the probability of missing data on a variable (R) related to other variables in the dataset? The relationship between the probability of missingness and the other variables in the dataset determines the missing data mechanism.
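The missing data indicator can be derived mechanically from the DV. The following is a minimal sketch, assuming the reading scores from the table above (the one unrecoverable row is left out) and using `None` as a stand-in for a missing cell; the variable names are illustrative only.

```python
# Reading-achievement scores from the example table; None marks a missing value.
reading = [None, None, None, None, None, 66, 88, 95, 86, 56, 78]

# Coding used in the table: R = 0 when the DV is missing, R = 1 when it is observed.
R = [0 if score is None else 1 for score in reading]

# Proportion of cases with an observed DV.
observed_rate = sum(R) / len(R)
print(R, round(observed_rate, 3))
```

Once R exists as an ordinary variable, relating it to other variables (for example, Race) is what distinguishes the mechanisms described next.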

8 Missing not at random (MNAR)
The probability of missing data on a variable Y is related to the values of Y itself, even after controlling for other variables (Enders, 2010, p. 8).
[Example figure]
There is no way to verify that data are MNAR without knowing the actual values of Y. In some situations you may have a sense of the actual values, for example if you are in the field monitoring the data collection process.
MNAR data require other techniques for handling missing data.

9 Missing at random (MAR)
The probability of missing data on a variable Y is related to some other measured variable(s), but not to the values of Y itself (Enders, 2010, p. 6).
[Example figure]
Because we do not know the actual values of Y, MAR is a theoretical judgement that has to be supported by evidence.
ML and MI assume MAR.

10 Missing completely at random (MCAR)
The probability of missing data on a variable Y is unrelated to other measured variables and unrelated to the values of Y itself (Enders, 2010, p. 7).
[Example figure]
The observed data are simply a random sample of the hypothetically complete dataset.
Look for evidence of MCAR. For example, compare the cases with and without missing values on a variable with respect to the other measured variables; the two groups should not differ.

11 Finding evidence for MCAR or MAR: t-test
Perform a series of independent-samples t-tests that compare the group with missing values and the group without missing values on the means of other variables in the dataset (for categorical variables, use a chi-square test).
[Example table: self-efficacy, DV (reading achievement), and missing data indicator R]
Available in the SPSS Missing Values Analysis module.
No significant difference is consistent with MCAR.
A significant difference indicates MAR (which is good).
This is a good way to identify variables that are related to missingness and can therefore be used in MI to provide information for imputing the missing values.
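The comparison described above can be sketched outside SPSS. This is a minimal illustration with made-up self-efficacy and reading scores (not the workshop data); `welch_t` is a hypothetical helper that computes a separate-variance t statistic, which is assumed here to correspond to the kind of test SPSS reports.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Separate-variance (Welch) t statistic and approximate degrees of freedom."""
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
    return t, df

# Hypothetical data: self-efficacy is complete, reading has missing values (None).
self_efficacy = [2.1, 2.4, 1.9, 2.2, 2.0, 3.1, 3.4, 2.9, 3.3, 3.0]
reading = [None, None, None, None, None, 66, 88, 95, 86, 56]

# Split self-efficacy by whether reading is missing (the indicator R).
missing_group = [x for x, y in zip(self_efficacy, reading) if y is None]
observed_group = [x for x, y in zip(self_efficacy, reading) if y is not None]

t, df = welch_t(missing_group, observed_group)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute t here would suggest that self-efficacy is related to missingness on reading, which makes it a candidate for the imputation model.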

12 Testing MCAR: Little's (1988) MCAR test
A multivariate extension of the t-test approach that performs all t-tests simultaneously.
A global test of MCAR, available in the SPSS Missing Values Analysis module under the EM procedure.
Null hypothesis: the data are MCAR.
A significant MCAR test and/or significant t-tests indicate a departure from MCAR, interpreted here as MAR.
Issues: (1) The test does not identify which variables violate MCAR. (2) It has low statistical power (a high risk of Type II error) when only a few variables violate MCAR or when the relationship between missingness and the data is weak.

13 Traditional methods for handling missing data
Listwise deletion
Pairwise deletion
Mean substitution
Regression and stochastic regression imputation
Hot-deck imputation
Averaging the available items
Last observation carried forward

14 Listwise deletion (complete-case analysis): include only cases with complete data
Easy, convenient, and available in all statistical software.
Wastes data and resources.
Reduces sample size and statistical power.
Assumes MCAR; otherwise it produces biased estimates.

15 Listwise deletion (complete-case analysis)
Problems in this example (the data are assumed to be MAR):
1. The remaining cases do not represent the entire sample well.
2. The mean estimate is higher.
3. The variability of the data is reduced.
[Table: mean and variance of GPA, complete data versus listwise deletion; values not recoverable]
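The three problems can be reproduced with a toy example. The sketch below uses hypothetical records, not the slide's GPA table: a test score is missing mostly for low-GPA students (a MAR-style situation), and listwise deletion then removes those students from the GPA summary as well.

```python
from statistics import mean, variance

# Hypothetical records: (gpa, test_score); None marks a missing test score.
records = [(2.0, None), (2.3, None), (2.5, None), (2.8, 61), (3.0, 65),
           (3.2, 70), (3.4, 72), (3.6, 78), (3.8, 80), (4.0, 85)]

all_gpa = [gpa for gpa, _ in records]
# Listwise deletion: keep only cases with no missing value on any variable.
complete_gpa = [gpa for gpa, score in records if score is not None]

print("all cases:      ", round(mean(all_gpa), 2), round(variance(all_gpa), 2))
print("listwise subset:", round(mean(complete_gpa), 2), round(variance(complete_gpa), 2))
```

In this made-up dataset the retained cases have a higher GPA mean and a smaller GPA variance than the full sample, mirroring points 2 and 3.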


16 Pairwise deletion (available-case analysis): analyses (e.g., correlation, regression) are conducted on different subsets of cases
Assumes MCAR.
Correlation:
$$ r = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \, \sigma_Y^2}} $$
Which cases enter the computation?
1. Cases with complete data on both X and Y.
2. Cases having X or Y alone (separate subsamples).
Estimation problem: the result can be r > 1 or r < -1.
Lack of a consistent sample size: because different subsets of cases are used to estimate the parameters, standard errors are difficult to compute.
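An out-of-range correlation is easy to produce. This sketch assumes one particular pairwise scheme, with the covariance taken from the complete pairs and each variance taken from every available value of that variable; the data are invented to make the effect obvious.

```python
from math import sqrt
from statistics import mean, variance

def sample_cov(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data; None marks a missing value.
x = [1, 2, 3, 4, 2.5, 2.5, 2.5, 2.5, None, None, None, None]
y = [1, 2, 3, 4, None, None, None, None, 2.5, 2.5, 2.5, 2.5]

# Covariance: only cases with both X and Y observed.
pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
cov_xy = sample_cov([a for a, _ in pairs], [b for _, b in pairs])

# Variances: every available value of each variable (separate subsamples).
var_x = variance([a for a in x if a is not None])
var_y = variance([b for b in y if b is not None])

r = cov_xy / sqrt(var_x * var_y)
print(round(r, 2))
```

Because the numerator and denominator come from different subsamples, nothing forces the ratio to stay inside the usual bounds for a correlation.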


17 Arithmetic mean imputation (mean substitution): use the mean of the available cases to fill in the missing values
When Y has missing values, replace each missing value with the mean of Y calculated from the cases with observed Y.
Reduces the variability of the data and attenuates correlations.
Severely biases parameter estimates, even under MCAR.
[Figure from Schafer & Graham (2002); axes X and Y]
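The loss of variability is visible even in a tiny example. A minimal sketch with hypothetical scores, where `None` stands for a missing value:

```python
from statistics import mean, variance

y = [52, 60, None, 71, None, 83, 90, None]  # hypothetical scores

observed = [v for v in y if v is not None]
y_mean = mean(observed)

# Mean substitution: every missing value becomes the observed mean.
y_imputed = [y_mean if v is None else v for v in y]

print("variance, observed cases:", round(variance(observed), 1))
print("variance, after imputing:", round(variance(y_imputed), 1))
```

The imputed values add nothing to the sum of squared deviations but do increase the sample size, so the variance estimate shrinks while the mean stays the same.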


18 Regression imputation (conditional mean imputation): use the predicted scores from a regression equation estimated on the complete cases to fill in the missing values
Predicted score: $Y_i^* = \beta_0 + \beta_1 X_i$
Reduces variability and overestimates the correlations between variables and $R^2$, even under MCAR.
[Figure from Schafer & Graham (2002)]

19 Stochastic regression imputation: use the predicted scores from a regression equation estimated on the complete cases, plus a normally distributed error term $z_i \sim N(0, \sigma^2)$, to fill in the missing values
Predicted score: $Y_i^* = \beta_0 + \beta_1 X_i + z_i$
Adding residual terms to the predicted values restores variability to the imputed data and eliminates the biases.
Provides unbiased estimates under MAR, just like ML and MI.
However, it attenuates the standard errors and therefore inflates the Type I error rate.
[Figures from Schafer & Graham (2002)]
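The difference between the two regression-based methods comes down to one added random term. The sketch below is an illustration on hypothetical data, assuming a single predictor and taking sigma to be the residual standard deviation from the complete cases; `fit_line` is a helper written for this example.

```python
import random
from math import sqrt
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; also returns the residual SD."""
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    resid_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, sqrt(resid_ss / (len(xs) - 2))

# Hypothetical data: X is complete, Y has missing values (None).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, None]

complete = [(a, b) for a, b in zip(x, y) if b is not None]
b0, b1, sigma = fit_line([a for a, _ in complete], [b for _, b in complete])

rng = random.Random(2009)  # fixed seed so the random draws are reproducible

# Regression imputation: the imputed values fall exactly on the fitted line.
regression_imputed = [b0 + b1 * a if b is None else b for a, b in zip(x, y)]

# Stochastic regression imputation: fitted value plus a draw from N(0, sigma^2).
stochastic_imputed = [b0 + b1 * a + rng.gauss(0, sigma) if b is None else b
                      for a, b in zip(x, y)]
```

Plotting the two imputed versions against X would show the first set of imputations lying on a straight line and the second scattered around it.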

20 Hot-deck imputation: impute the missing values from similar respondents
Procedure: suppose some respondents did not report their income. Classify respondents into cells (groups) based on demographic information such as age, gender, and marital status, then randomly draw an income value from a similar respondent in the same cell.
Reduces variability to some extent and produces biased correlation estimates and regression coefficients.
[Figure from Schafer & Graham (2002)]
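The procedure can be sketched in a few lines. The respondents, cell definitions, and seed below are all hypothetical, and a real application would also need a rule for cells that contain no donors.

```python
import random

# Hypothetical respondents: (age_group, gender, income); None = income not reported.
respondents = [
    ("18-34", "F", 41000), ("18-34", "F", 38000), ("18-34", "F", None),
    ("35-54", "M", 62000), ("35-54", "M", None), ("35-54", "M", 58000),
]

rng = random.Random(1)  # fixed seed for reproducibility

# Donor pools: the reported incomes within each demographic cell.
donors = {}
for age, gender, income in respondents:
    if income is not None:
        donors.setdefault((age, gender), []).append(income)

# Replace each missing income with a random draw from the same cell.
imputed = [
    (age, gender, income if income is not None else rng.choice(donors[(age, gender)]))
    for age, gender, income in respondents
]
```

Every imputed income is a value that some similar respondent actually reported, which is the defining feature of the method.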

21 Averaging the available items (multiple-item questionnaires)
Researchers typically compute a scale score by summing or averaging the item responses that measure the same construct.
For example, with 5 items measuring well-being, a respondent who answered only 3 of the items would receive a scale score equal to the average of those 3 items. This is also called person mean substitution.
Potential problems: Cronbach's alpha is incorrect, and the variance and correlations may be biased.
Use with caution, especially with a high rate of item nonresponse. ML and MI are better approaches.
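As a small illustration of the well-being example, assuming item responses are stored in a list with `None` for unanswered items (the function name is made up for this sketch):

```python
def person_mean(items):
    """Average of the answered items; None if no item was answered."""
    answered = [v for v in items if v is not None]
    return sum(answered) / len(answered) if answered else None

# A respondent who answered 3 of the 5 well-being items.
well_being = [4, None, 3, 5, None]
print(person_mean(well_being))
```

Note that the score silently rests on fewer items for some respondents than for others, which is one reason the approach needs caution when item nonresponse is high.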

22 Last observation carried forward: longitudinal designs
[Tables: observed data by ID across waves (W1, W2, W3, ...), before and after carrying values forward; values not recoverable]
Replace each missing value with the observation that immediately precedes dropout.
Assumes that scores do not change from the previous measurement.
Likely to produce biased estimates, even when the data are MCAR.
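A minimal sketch of the rule for one participant, assuming the waves are stored in order and `None` marks a missed wave (the function name is illustrative):

```python
def locf(waves):
    """Fill each missing wave with the most recent observed value, if any."""
    filled, last = [], None
    for value in waves:
        if value is not None:
            last = value
        filled.append(last)
    return filled

print(locf([10, 12, None, None]))
```

A wave that is missing before any observation exists stays missing, since there is nothing to carry forward.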

23 Recommended methods for handling missing data
Maximum likelihood (full information maximum likelihood, FIML)
Multiple imputation (MI)

24 Why FIML or multiple imputation (MI)?
Traditional methods each have their own limitations, and some of them make strict assumptions about the missing data mechanism.
FIML and MI provide better, more trustworthy parameter estimates.
They support more appropriate conclusions from your statistical tests.
They add rigor to your study.

25 Full information maximum likelihood (FIML)
Assumes MAR and multivariate normal data.
Implemented in structural equation modeling programs such as Mplus (the default when the outcome is continuous).
In the missing data context, FIML uses all the information in the dataset to estimate the parameters and standard errors directly, handling missing data in one step.
It does not drop any cases with missing values, and it does not produce imputed datasets.
FIML reads in the raw data one case at a time and evaluates the likelihood function one case at a time.

26 Full information maximum likelihood (FIML)
"The computations for a case use the information only from the variables and the corresponding parameters for which the case has complete data" (Enders, 2010, p. 89).
This implies that the computations differ slightly depending on the missing data pattern of each case (the likelihood function is customized to the missing data pattern).
Estimation is iterative: each iteration tries different parameter estimates until it finds the set of parameter values that maximizes the likelihood function (Enders, 2010), i.e., the values that maximize the probability of observing the data and give the model that best fits the data.
ML has converged when the parameter estimates no longer change across successive iterations.
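One common way to write the casewise computation, under the multivariate normality assumption stated above, is the following. The notation is an assumption of this note rather than the slide's: $k_i$ is the number of variables observed for case $i$, and $\mu_i$ and $\Sigma_i$ are the parts of the mean vector and covariance matrix that correspond to those observed variables.

```latex
\log L_i = -\frac{k_i}{2}\log(2\pi) - \frac{1}{2}\log\lvert\Sigma_i\rvert
           - \frac{1}{2}\,(Y_i - \mu_i)^{\top}\,\Sigma_i^{-1}\,(Y_i - \mu_i),
\qquad
\log L = \sum_{i=1}^{N} \log L_i
```

A case with all variables observed uses the full $\mu$ and $\Sigma$, while a case missing some variables contributes through a smaller $\mu_i$ and $\Sigma_i$; this is the sense in which the likelihood is customized to each missing data pattern.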

27 Full information maximum likelihood (FIML)
An iterative process: the program tries the distribution in many possible locations until it finds the set of parameters with which the distribution best fits the data (i.e., gives the highest probability, or likelihood, of observing the data).
[Figure: distribution of reading achievement]

28 Multiple imputation (MI)
Assumes MAR; also described as multiple stochastic regression imputation (an iterative procedure).
Available in Mplus, SAS, Stata, Blimp, SPSS, R, and other software.
Involves three steps:
1. Imputation phase: a dataset with missing data -> imputed datasets 1, 2, ..., m
2. Analysis phase: each imputed dataset is analyzed -> results 1, 2, ..., m
3. Pooling phase: the m sets of results are combined -> pooled (overall) results

29 Multiple imputation: imputation phase
SPSS uses fully conditional specification (FCS), also known as chained equations imputation or multivariate imputation by chained equations (MICE), which is a Markov Chain Monte Carlo algorithm.
It does not rely on the assumption of multivariate normality.
It is flexible in handling different types of variables. Scale variables: linear regression. Categorical variables: logistic regression.
[Example table: ID, Age, Income, Gender]
The imputation model is specified on a variable-by-variable basis. For each variable with missing data, a univariate (single dependent variable) imputation model is fitted using all other available variables in the model as predictors, and missing values are then imputed for the variable being fit (IBM SPSS Missing Values 24).


30 Multiple imputation: imputation phase
The imputation process goes through all variables with missing values iteratively, each time using the new, updated imputed values.
[Figure: cycle through Age, Income, Gender]
This cycle is repeated several times. When the maximum number of iterations is reached (specified by the researcher or left at the default), the imputed values at the final iteration are saved, which creates one imputed dataset.
Requesting 5 imputations with a maximum of 200 iterations means that SPSS runs the MCMC algorithm 5 times and saves the imputed values at the 200th iteration each time.
Generally, 5-10 iterations are sufficient, but it is recommended to be conservative. You may need to increase the number of iterations if the model has not converged (save the iteration history data in SPSS and plot it to assess convergence).
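The cycling idea can be sketched with two incomplete continuous variables. This is a deliberately simplified illustration, not SPSS's implementation: it uses plain linear regression plus normal noise for both variables and skips the step of drawing the regression parameters themselves, which a proper FCS procedure includes. The data, function names, and seeds are hypothetical.

```python
import random
from math import sqrt
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; also returns the residual SD."""
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    sd = sqrt(sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (len(xs) - 2))
    return b0, b1, sd

def impute_once(age, income, iterations, rng):
    """Create one imputed dataset by cycling through both variables."""
    obs_a = [i for i, v in enumerate(age) if v is not None]
    obs_i = [i for i, v in enumerate(income) if v is not None]
    miss_a = [i for i, v in enumerate(age) if v is None]
    miss_i = [i for i, v in enumerate(income) if v is None]
    # Starting values: fill each gap with the observed mean.
    mean_a, mean_i = mean(age[i] for i in obs_a), mean(income[i] for i in obs_i)
    a = [mean_a if v is None else v for v in age]
    inc = [mean_i if v is None else v for v in income]
    for _ in range(iterations):
        # Impute age from income, fitted on the cases where age was observed.
        b0, b1, sd = fit_line([inc[i] for i in obs_a], [a[i] for i in obs_a])
        for i in miss_a:
            a[i] = b0 + b1 * inc[i] + rng.gauss(0, sd)
        # Impute income from the freshly updated age values.
        b0, b1, sd = fit_line([a[i] for i in obs_i], [inc[i] for i in obs_i])
        for i in miss_i:
            inc[i] = b0 + b1 * a[i] + rng.gauss(0, sd)
    return a, inc  # values at the final iteration are kept

# Hypothetical data; None marks a missing value (income in thousands).
age = [23, 35, None, 51, 44, None, 29, 60]
income = [30, 48, 52, None, 61, 39, None, 80]

# Five imputations with 10 iterations each, one independent run per dataset.
datasets = [impute_once(age, income, 10, random.Random(seed)) for seed in range(5)]
```

Each run keeps only the values from its last iteration, which matches the description of saving the imputed values at the maximum iteration.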

31 Multiple imputation: imputation phase
What variables should be included in the imputation model?
(1) At a minimum, include the variables that you are going to use in the subsequent analysis. For example, if you run a regression model that uses gender and SES to predict freshman GPA, then gender, SES, and GPA should all be included in the imputation model.
(2) Include auxiliary variables: "variables that are either correlates of missingness or correlates of an incomplete variable" (Enders, 2010, p. 17). These variables may not be of interest in the study, but they help improve the quality of the imputations and increase the plausibility of MAR. Examples are parents' education level, ACT, SAT, and other variables in the dataset that are correlated with the variables of interest or with their missingness.

32 Multiple imputation: imputation phase
How many imputed datasets are needed?
There is a strong association between statistical power and the number of imputations.
Conventional wisdom is 3-5 imputed datasets; however, research has shown that with only 3 or 5 imputed datasets, power is below its optimal level (Graham et al., 2007).
According to Enders (2011), generating a minimum of 20 imputed datasets seems to be a good rule of thumb for many situations.
If the proportion of missing data is greater than 50%, increase the number of imputations to more than 40 and be thoughtful about the variables included in the imputation model.

33 Multiple imputation: analysis phase
The imputation phase generates m imputed datasets.
In the analysis phase, each imputed dataset is analyzed using the normal analysis procedure.
For example, a researcher who generates 20 datasets and wants to use multiple regression repeats the regression analysis 20 times, once for each dataset.
[Tables: parameter estimates (β) and SEs for the intercept and SES in Dataset 1 and Dataset 2; values not recoverable]

34 Multiple imputation: pooling phase
Pooling the point estimate:
$$ \bar{\theta} = \frac{1}{m} \sum_{t=1}^{m} \hat{\theta}_t $$
m = number of imputed datasets
$\hat{\theta}_t$ = parameter estimate from dataset t
That is, take the average of the parameter estimates across the m datasets.
Pooling the standard errors:
$$ V_T = V_W + V_B + \frac{V_B}{m}, \qquad SE = \sqrt{V_T} $$
$V_T$ = total sampling variance
$V_W$ = within-imputation variance (the mean of the squared SEs across the m datasets)
$V_B$ = between-imputation variance (the variability of the parameter estimate across the m datasets; the additional variance that is due to missing data)
$V_B / m$ = correction factor for a finite number m of imputations
The statistical significance of $\bar{\theta}$ can be calculated in the usual way from the ratio $\bar{\theta} / \sqrt{V_T}$.
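The pooling formulas translate directly into code. The sketch below assumes the between-imputation variance is the sample variance (divisor m - 1) of the estimates; the function name and the three example estimates are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pool(estimates, standard_errors):
    """Pool one parameter across m imputed datasets."""
    m = len(estimates)
    theta_bar = mean(estimates)                     # pooled point estimate
    v_w = mean(se ** 2 for se in standard_errors)   # within-imputation variance
    v_b = variance(estimates)                       # between-imputation variance
    v_t = v_w + v_b + v_b / m                       # total sampling variance
    return theta_bar, sqrt(v_t), v_w, v_b

# Hypothetical regression coefficient estimated in three imputed datasets.
theta_bar, se, v_w, v_b = pool([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(theta_bar, round(se, 3), round(theta_bar / se, 3))
```

The last printed value is the ratio of the pooled estimate to its pooled standard error, which is the quantity used for the significance test.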

35 Using SPSS to deal with missing data

36 The example data
High school longitudinal study of 2009, public-use data: an NCES secondary longitudinal study of more than 21,000 9th graders in 944 schools.
Selected sample: a subsample of 500 students who took math and science courses in 2009.
Selected measures:
9th grade sex (0 = male), race/ethnicity (0 = white), and socioeconomic status
9th and 11th grade math IRT scores
9th grade math interest (3 items; 4-point Likert scale)
9th grade math self-efficacy (4 items; 4-point Likert scale)
Demonstration dataset: Hsls09_MissingDataWorkshop_demo

37 Using SPSS to deal with missing data
Delete cases that have no data on any of the variables.
All missing values need to be represented either as system missing (a blank cell) or as user-defined missing (a value assigned by the researcher, such as 999 or -8888).


38 Using SPSS to deal with missing data
Change all missing values (either system missing or user-defined missing) to a common value:
Transform -> Recode into Same Variables -> move all of the variables into the selection box -> click Old and New Values ->

39 Using SPSS to deal with missing data
Assign missing values for all the variables:
In Variable View -> click a cell in the Missing column and assign -999 as a discrete missing value -> click OK.
Right-click the cell and choose Copy -> select the Missing cells of all numeric variables -> click Paste.

40 Using SPSS to deal with missing data
Define the variables: in Variable View, under the Measure column, assign the measurement level for each variable.


41 Using SPSS to deal with missing data
Analyze the pattern of missing data:
Go to Analyze -> Multiple Imputation -> Analyze Patterns.
Move all variables except the ID into Analyze Across Variables.
Change Minimum percentage missing for variable to be displayed to 0 (so that everything that is missing is shown) -> click OK.

42 Using SPSS to deal with missing data
Only 1.83% of the individual values are missing.
Variables: 9 of the 12 variables contain missing values (green).
Cases: 409 cases have complete data (81.8%; blue); 91 cases have at least one missing value.
Values: 110 of the 6,000 individual values (12 x 500) are missing (1.83%; green).
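The three summaries (variables, cases, values) are simple counts. The sketch below shows them on a tiny hypothetical dataset and then repeats the slide's percentage arithmetic; the function and variable names are illustrative and are not SPSS output fields.

```python
def summarize_missing(rows):
    """Count incomplete variables, complete cases, and missing values (None)."""
    n_cases, n_vars = len(rows), len(rows[0])
    return {
        "incomplete_variables": sum(any(r[j] is None for r in rows) for j in range(n_vars)),
        "complete_cases": sum(all(v is not None for v in r) for r in rows),
        "missing_values": sum(v is None for r in rows for v in r),
        "total_values": n_cases * n_vars,
    }

# Tiny hypothetical dataset: 4 cases, 3 variables.
toy = [(1, 2.5, 3), (0, None, 4), (1, 3.1, None), (0, 2.2, 1)]
print(summarize_missing(toy))

# The same arithmetic applied to the counts reported on the slide.
pct_values_missing = 100 * 110 / (12 * 500)   # 110 missing values out of 6,000
pct_complete_cases = 100 * 409 / 500          # 409 complete cases out of 500
print(round(pct_values_missing, 2), round(pct_complete_cases, 1))
```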

43 Using SPSS to deal with missing data
The output shows the number and percentage of missing values for each variable. Note that the variables are ordered by the amount of missing values (i.e., the percentage missing).
Examine the percentage missing for each variable and make sure that it makes sense given your knowledge of the dataset.

44 Using SPSS to deal with missing data
The pattern here is arbitrary.
Each pattern (row) reflects a group of cases with the same pattern of missing values (15 patterns of missing and nonmissing data).
The variables along the bottom (x-axis) are ordered by the amount of missing values each contains, from least to highest.
The second chart shows the percentage of cases for the 10 most common patterns. Pattern 1 (no missing values, 81%) is the most prevalent pattern. Pattern 10 = missing on MATH11 (10%).
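A pattern table of this kind can be built by grouping cases on which variables are missing. This is a minimal sketch on hypothetical rows; the variable names echo the workshop data, but the values are invented.

```python
from collections import Counter

names = ("SES", "MATH09", "MATH11")
# Hypothetical cases; None marks a missing value.
rows = [(3.1, 40, 52), (2.8, None, 47), (3.5, 44, None), (2.9, 38, 50), (3.0, None, 49)]

# Each case is reduced to a tuple of True/False flags (True = missing).
patterns = Counter(tuple(v is None for v in row) for row in rows)

for pattern, count in patterns.most_common():
    missing = [n for n, is_missing in zip(names, pattern) if is_missing] or ["none"]
    print(f"missing on {', '.join(missing)}: {100 * count / len(rows):.0f}% of cases")
```

Each distinct tuple corresponds to one row of the pattern chart, and the counts give the percentages in the bar chart of the most common patterns.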


45 Using SPSS to deal with missing data
Request Little's MCAR test and independent-samples t-tests for MAR:
Go to Analyze -> Missing Value Analysis -> Descriptive: request Student t-tests for each pair of continuous variables to examine MAR.


46 Using SPSS to deal with missing data
Request Little's MCAR test and separate-variance t-tests:
Go to Analyze -> Missing Value Analysis.
Note: if the SPSS output shows a warning that the EM algorithm failed to converge in 25 iterations, you can increase the maximum number of iterations by clicking the EM button.

47 Using SPSS to deal with missing data
Request Little's MCAR test and separate-variance t-tests:
Scroll down in the SPSS Output window to the EM Means table. The result of Little's MCAR test appears under this table.
The non-significant result (p = .054) fails to reject the null hypothesis, so the data are treated as missing completely at random (MCAR).

48 Examine independent-samples t-tests
A significant t-test indicates that the probability of missingness is a function of the values of another variable. It is an indication of MAR!
These are variables that can be used in the imputation model.

49 Analysis model
Research question: can students' SES and math self-efficacy predict their 11th grade math score?
Dependent variable: MATH11
Independent variables: SES and EFF_total (sum of 4 items)
Auxiliary variables (for imputation): SEX, RACE, MATH09, and the math interest items
Correlation analysis: these variables are correlated with the variables of interest to some extent.
Independent-samples t-tests: some of them are correlated with missingness on the variables of interest.

50 Before imputation, set a random seed
Transform -> Random Number Generators -> select Set Active Generator -> click Mersenne Twister -> select Set Starting Point and Fixed Value -> click OK.

51 Using SPSS to deal with missing data
Conduct the multiple imputation:
Analyze -> Multiple Imputation -> Impute Missing Data Values -> move the variables of interest to the Variables in Model box.

52 Variables ->
For demonstration purposes, 5 imputations are requested: the missing values will be imputed 5 times and stored.
Name the dataset under the Create a new dataset button.


53 Method ->
Because the missing data pattern is arbitrary, select FCS.
Set the maximum number of iterations to 200 (default = 10). Increase the number of iterations if the Markov Chain Monte Carlo algorithm has not converged.
PMM (predictive mean matching) still uses regression, but each imputed value is adjusted to match the nearest actual value in the dataset (taken from observations that have the closest predicted value and no missing value on that variable). If the original variable is bounded by 0 and 40, the imputed values will also be bounded by 0 and 40.
According to Paul Allison, PMM as implemented in SPSS has some drawbacks.
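The matching step is what keeps PMM imputations within the observed range. The sketch below is a simplified illustration, not the SPSS algorithm: it uses one predictor, fits the regression once on the observed cases, and picks at random among the `k` observed cases whose predicted values are closest. The function name, the `k` parameter, and the data are assumptions made for this example.

```python
import random
from statistics import mean

def pmm_impute(x, y, k, rng):
    """Fill missing y (None) with an observed y from one of the k nearest donors."""
    obs = [i for i, v in enumerate(y) if v is not None]
    mx, my = mean(x[i] for i in obs), mean(y[i] for i in obs)
    b1 = sum((x[i] - mx) * (y[i] - my) for i in obs) / sum((x[i] - mx) ** 2 for i in obs)
    b0 = my - b1 * mx
    predicted = [b0 + b1 * v for v in x]  # predicted value for every case
    filled = list(y)
    for i, v in enumerate(y):
        if v is None:
            # Donors: observed cases whose predicted values are closest to case i's.
            donors = sorted(obs, key=lambda j: abs(predicted[j] - predicted[i]))[:k]
            filled[i] = y[rng.choice(donors)]  # borrow the donor's actual value
    return filled

# Hypothetical data: y is bounded by 0 and 40; None marks a missing value.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 9, None, 14, 20, None, 31, 40]
imputed = pmm_impute(x, y, k=3, rng=random.Random(7))
```

Because every imputed value is copied from an observed case, it can never fall outside the range of the observed data.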


54 Constraints ->
Click Scan Data to examine the variable summary.
You can specify the role of a variable during the imputation and constrain the range of the imputed values (min, max, rounding) so that they are plausible.
The rounding column lets you specify the smallest denomination to accept. To obtain integer values, specify 1 as the rounding denomination (6.648 -> 7); to obtain values rounded to the nearest cent, specify 0.01 (6.648 -> 6.65).


55 Constraints ->
If you specify Min and Max, the maximum draws procedure is activated: it attempts to draw values for a case until it finds a set of values that are within the specified ranges.
An error occurs if a set of values within the ranges is not obtained; in that case, increase the maximum draws.
Demonstration: no constraints are placed on the range of the variables.
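The rounding and range constraints can be illustrated with a small redraw loop. This is only a sketch of the general idea for a single value; the order in which SPSS applies rounding and range checks is not stated in the slides, so the order used here (round first, then check the range) is an assumption, as are the function names.

```python
import random

def round_to(value, denomination):
    """Round a value to the nearest multiple of the given denomination."""
    return round(value / denomination) * denomination

def constrained_draw(draw, minimum, maximum, denomination, max_draws):
    """Redraw until a rounded value falls within [minimum, maximum]."""
    for _ in range(max_draws):
        value = round_to(draw(), denomination)
        if minimum <= value <= maximum:
            return value
    raise RuntimeError("no value within the range; consider increasing max_draws")

print(round_to(6.648, 1), round(round_to(6.648, 0.01), 2))

# Hypothetical use: a GPA-like variable constrained to the range 0 to 4.
rng = random.Random(3)
gpa = constrained_draw(lambda: rng.gauss(2.5, 1.5), 0, 4, 0.01, max_draws=100)
```

If the range is very narrow relative to the spread of the draws, the loop may exhaust its attempts, which corresponds to the error that calls for increasing the maximum draws.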

56 Output options
Imputation model: univariate model type, model effects, and number of values imputed.
Descriptive statistics: basic information before and after imputation.
Iteration history: information on convergence performance.

57 Outputs
Hsls09_MissingDataWorkshop_demo2_imputed5

58 Datasets with imputed values are numbered 1 through M, where M is the number of imputations. Select the imputation from the drop-down list in the edit bar in Data View.

59 You can distinguish imputed values from observed values by the cell background color.

60 Create the composite score: Transform -> Compute Variable
Compute the scale score (composite score) for self-efficacy in the stacked dataset. The computation applies to all the imputed datasets.

61 Before the analysis: Data -> Split File
Split the file by imputation number. This invokes the analysis and pooling phases for multiply imputed datasets.

62 Analyze the data as usual
SPSS provides pooled estimates for some analyses but not all.
An icon next to an analysis indicates that SPSS provides a corresponding procedure that accommodates multiply imputed datasets.
Let's perform a multiple regression.

63 SPSS outputs for multiple regression: descriptive statistics

64 SPSS outputs for multiple regression: correlation matrix

65 SPSS outputs for multiple regression: coefficient estimates
[Table: Coefficients by imputation number (original data, each imputation, and pooled), with columns B, Std. Error, Beta, t, and Sig. for (Constant), X1 Socio-economic status composite, and EFF_total; the pooled rows also report Fraction Missing Info., Relative Increase Variance, and Relative Efficiency. Dependent variable: X2 Mathematics IRT-estimated number right score. Values not recoverable.]
Results differ slightly across the imputed datasets.
SPSS provides pooled estimates for the unstandardized regression coefficients!

66 Imputation diagnostics

67 SPSS outputs for multiple regression: coefficient estimates
Fraction missing info: the proportion of the total sampling variance of a parameter estimate that is due to missing data, $(V_B + V_B/m) / V_T$. It is related to the percentage missing for that variable.
For SES: 8.7% of the sampling variance is due to missing data.
It is a measure of the impact of missing data on the parameter estimates.

68 SPSS outputs for multiple regression: coefficient estimates
Relative increase variance: how much the sampling variance is increased (inflated) because of missingness, $(V_B + V_B/m) / V_W$.
For EFF_total: compared with the sampling variance that EFF_total would have with complete data, the estimated sampling variance for EFF_total (with missing data) is 14.1% larger.
Variables with a larger percentage of missing values tend to have a larger relative increase in variance.


69 SPSS outputs for multiple regression: coefficient estimates
Relative efficiency: the efficiency of an estimate based on m imputations relative to one based on an infinite number of imputations, $1 / (1 + F/M)$, where F = fraction missing info and M = number of imputations.
Values close to 1 are more efficient and produce proper SEs (the SEs will not be too large).
A large percentage of missing data needs more imputations to achieve sufficient efficiency for the parameter estimates.
Example: with 5 imputations, the relative efficiency is 98.3% of that achieved with an infinite number of imputations.
SAS documentation for multiple imputation (Horton & Lipsitz, 2001, p. 246)
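The three diagnostics follow from the within- and between-imputation variances. The sketch below uses the simple formulas given on the slides; SPSS may apply additional adjustments, so its reported values can differ slightly. The function name and the example variances are hypothetical, while 0.087 and 5 are the SES fraction of missing information and the number of imputations from the demonstration.

```python
def mi_diagnostics(v_w, v_b, m):
    """Fraction missing info, relative increase in variance, relative efficiency."""
    extra = v_b + v_b / m              # variance attributable to missing data
    v_t = v_w + extra                  # total sampling variance
    fmi = extra / v_t                  # fraction of missing information
    riv = extra / v_w                  # relative increase in variance
    rel_eff = 1 / (1 + fmi / m)        # relative efficiency
    return fmi, riv, rel_eff

# Hypothetical variances for one coefficient across 5 imputations.
fmi, riv, rel_eff = mi_diagnostics(v_w=1.0, v_b=0.1, m=5)

# Relative efficiency for SES in the demonstration: F = 0.087, M = 5.
ses_rel_eff = 1 / (1 + 0.087 / 5)
print(round(ses_rel_eff, 3))  # 0.983
```

Under these definitions the fraction of missing information equals RIV / (1 + RIV), which is why the two columns rise and fall together.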

70 Iteration history
Provides the mean and standard deviation by iteration and imputation for each continuous imputed variable.
Plot these values to examine the convergence of the model.

71 Assessing the performance of imputations
Graphs -> Chart Builder -> select a line chart.

72 Assessing the performance of imputations

73 Assessing the performance of imputations
In the Element Properties, select Value as the statistic to display.

74 Assessing the performance of imputations


75 Mean and standard deviation of the imputed values of SES at each iteration (200) for each of the 5 requested imputations (this can be requested for each continuous imputed variable).
The purpose of this plot is to look for trends or patterns.
When the model has converged, the parameter values bounce around randomly with no trend (here, this phase is reached immediately), and the lines for the different imputations are mixed with each other.

76 Assessing the performance of imputations using trace plots (using Enders's macro, http://www.appliedmissingdata.com/macro-programs.)
The plots of the mean and SD for imputed continuous variables can be requested using Enders's SPSS macro. They give an indication of the performance of the imputations.
For this macro: 1000 iterations with 2 imputed datasets.
The macro provides an additional convergence criterion, the potential scale reduction (PSR), computed every 100 iterations: the MCMC is regarded as converged when the PSR <
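For orientation, a potential scale reduction statistic compares the variability between chains with the variability within chains. The sketch below implements the standard Gelman-Rubin form; it is an assumption that Enders's macro computes something equivalent, and the chains shown are invented.

```python
from math import sqrt
from statistics import mean, variance

def psr(chains):
    """Potential scale reduction for equally long chains of one parameter."""
    n = len(chains[0])
    w = mean(variance(c) for c in chains)        # average within-chain variance
    b = n * variance([mean(c) for c in chains])  # between-chain variance
    pooled = (n - 1) / n * w + b / n             # pooled variance estimate
    return sqrt(pooled / w)

# Hypothetical chains of imputed-value means from 2 imputation runs.
well_mixed = [[1, 2, 3, 4], [4, 3, 2, 1]]        # same level in both chains
separated = [[1, 2, 3, 4], [11, 12, 13, 14]]     # chains stuck at different levels

print(round(psr(well_mixed), 2), round(psr(separated), 2))
```

Values near 1 mean the chains are indistinguishable from one another, while values well above 1 signal that the runs have not yet settled on the same distribution.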

77 Problematic or pathological case of non-convergence: Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software,

78 Healthy case of convergence: Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software,

79 Practice time!

80 The practice data
High school longitudinal study of 2009: public-use data
Selected sample: a subsample of 490 participants who took math and science courses in 2009.
Selected measures:
9th grade sex (0 = male), race/ethnicity (0 = white), and SES
9th and 11th grade math and science GPA
9th grade science utility (3 items; 4-point Likert scale)
9th grade science self-efficacy (4 items; 4-point Likert scale)
Nominal variables: SEX, RACE
Scale variables: SES, MGPA12, SGPA12
Ordinal variables: science utility and self-efficacy items

81 Analysis model
Research question: can students' race, SES, and science self-efficacy predict their 12th grade science GPA?
Dependent variable: SGPA12
Independent variables: Race, SES, and SEFF_total (sum of 4 items)
Auxiliary variables for the imputation model: Sex, MGPA12, and the science utility items
Examine the correlation analysis and the univariate t-tests.

82 Tasks: you can do it!
Change all missing values (either system missing or user-defined missing) to a common value, e.g., 999.
Assign missing values for all the variables in Variable View.
Define the variables: in Variable View, under the Measure column, assign the measurement level for each variable.
Analyze the pattern of missing data and examine the percentage missing (what percentage is missing?).
Request Little's MCAR test (EM) and separate-variance t-tests.
Conduct the multiple imputation: 10 datasets, 100 iterations. Remember to set the minimum and maximum values of science and math GPA to 0 and 4.
Create a composite score for science self-efficacy.
Run a regression model to answer the research question.
Examine the convergence of the model using the iteration history.

83 Practical issues/myths

84 Practical issues/myths
Is imputation making up data?
Not really! The goal of imputation is not to produce individual values and treat them as real data, but to estimate the population parameters and preserve the important characteristics of the data set as a whole (Graham, 2008).
Imputation accounts for the uncertainty associated with missing data, so unbiased estimates can be obtained.

85 Practical issues/myths
Should both independent and dependent variables be included in the imputation model (MI)?
At a minimum, all the variables that you will use in your analysis should be included.
Why? When the DV is not included, the correlations between it and the IVs are assumed to be 0, so excluding it weakens its relationships with the other variables.
Take a liberal approach to variable selection in the imputation phase.
The programs do not distinguish whether a variable is an IV or a DV!

86 Practical issues
Why include auxiliary variables?
Inclusive analysis strategy: ML and MI require MAR, and because there is no test for MAR, we need to find ways to make MAR more plausible.
Schafer and Graham (2002, p. 173): collecting data on the potential causes of missingness may effectively convert an MNAR situation to MAR.
Incorporating a number of auxiliary variables helps increase statistical power and reduce bias in the parameter estimates.
Use as many as you can; the most useful are those with correlations

87 Practical issues
When working with a multiple-item questionnaire, should you impute the individual items or the scale scores?
If feasible, impute the individual items, because this maximizes the information available for creating the imputations and gives more statistical power than imputing scale scores (Enders, 2010).

88 Practical issues
What if my missing data are MNAR?
Use selection modeling and pattern mixture modeling (Chapter 10 in Enders's Applied Missing Data Analysis). These two models deal with the MNAR situation by statistically modeling the missing data mechanism.
Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychological Methods, 16(1), 1.

89 What should I report when I write it up?
Missing data mechanisms
Percentage missing for each variable and the overall percentage missing
Software used for missing data imputation
Imputation method and algorithm
Number of imputed datasets
The variables used in the imputation model

90 References
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Graham, J. W. (2012). Missing data: Analysis and design. Springer.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60,
Pigott, T. D. (2001). A review of methods for missing data. Educational Research and Evaluation, 7(4),
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1),
Puma, M. J., Olsen, R. B., Bell, S. H., & Price, C. (2009). What to do when data are missing in group randomized controlled trials. NCEE National Center for Education Evaluation and Regional Assistance.
IBM SPSS Missing Values 21 & 24 (user manual).
Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software,

91 Recommended websites
UCLA IDRE
SAS:
Stata: 1_new/
Craig Enders's website:
Mplus:
Blimp:

92 Thank you
Don't be afraid of missing data!
