Missing Data Analysis with SPSS


Missing Data Analysis with SPSS Meng-Ting Lo (lo.194@osu.edu) Department of Educational Studies Quantitative Research, Evaluation and Measurement Program (QREM) Research Methodology Center (RMC)

Outline Missing Data Patterns and Mechanisms Traditional Techniques Listwise and pairwise deletion Mean substitution Regression and stochastic regression Hot deck imputation Averaging the available items Last observation carried forward Maximum Likelihood (ML) and Multiple Imputation (MI) SPSS with Multiple Imputation (demonstration and practice) Practical Issues/ Myths 2

Data and Material: High School Longitudinal Study of 2009, public-use data (NCES secondary longitudinal studies; more than 21,000 9th graders in 944 schools). Datasets: Hsls09_MissingDataWorkshop_demo, Hsls09_MissingDataWorkshop_demo2_imputed5, Hsls09_MissingDataWorkshop_demo2_IterationHistory, Hsls09_MissingDataWorkshop_practice. SPSS modules: Missing Value Analysis, Multiple Imputation. 3

The importance of dealing with missing data: You rarely see a dataset that is complete and beautiful, and traditional techniques rely on strict assumptions about the missing data mechanism that are rarely met in the real world. The problem of missing data: treat it inappropriately and you obtain unreliable and biased estimates and draw incorrect conclusions from the results; it also reduces the statistical power of your test to detect a significant effect (e.g., listwise deletion). 4

Missing data patterns: Where are the missing data in your dataset? A missing data pattern describes the location of the missing values (the shaded areas in the figures). In the past, specific missing data handling methods were developed to deal with different missing data patterns; now, MI and ML work well with any missing data pattern. Figures from p. 4 in Enders, C. K. (2010). Applied missing data analysis. Guilford Press. 5

Missing data mechanisms (Donald Rubin, 1976): They describe the relationships between measured variables and the probability of missing data and essentially function as assumptions for missing data analysis (Enders, 2010, p. 2). The mechanisms are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Why are data missing? Consider possible explanations for the missing data and find evidence to justify the claim. The missing data mechanism is much more important than the percentage of missing data: the percentage of missing tells you the scope of the missing data problem, but the mechanism governs the performance of the different analytic techniques. 6

Missing data mechanisms: Introduced by Rubin (1976), missingness is a binary variable that has a probability distribution. Example data (Race: completely observed; DV = reading achievement, missing for some students; R = missing data indicator):

Race        Reading (DV)  R
Asian       .             0 (missing)
Asian       .             0
Caucasian   .             0
Asian       .             0
Asian       .             0
Caucasian   66            1 (observed)
Caucasian   88            1
Caucasian   95            1
Caucasian   100           1
Asian       86            1
Asian       56            1
Caucasian   78            1

Is the probability of missing data on a variable (R) related to other variables in the dataset? The relationship between the probability of missingness and the other variables in the dataset is then used to determine the missing data mechanism. 7

Missing not at random (MNAR): The probability of missing data on a variable Y is related to the values of Y itself, even after controlling for other variables (Enders, 2010, p. 8). Example: the students with the lowest reading achievement are the ones who skip the reading test. There is no way to verify whether data are MNAR without knowing the actual values of Y; in some situations you may have some sense of the actual values if you are in the field monitoring the data collection process. MNAR data require other techniques to handle the missing values. 8

Missing at Random (MAR): The probability of missing data on a variable Y is related to some other measured variable(s), but not to the values of Y itself (Enders, 2010, p. 6). Example: students with low self-efficacy are more likely to skip the reading test, but among students with the same self-efficacy, missingness does not depend on the reading scores themselves. Because we do not know the actual values of Y, MAR cannot be verified directly; we make a theoretical judgment about MAR and support it with evidence. ML and MI assume MAR. 9

Missing Completely at Random (MCAR): The probability of missing data on a variable Y is unrelated to other measured variables and is unrelated to the values of Y itself (Enders, 2010, p. 7). Example: a student misses the reading test because of a random scheduling conflict. The observed data are then simply a random sample of the hypothetically complete dataset. We can look for evidence of MCAR: for example, comparing the cases with and without missing data on a variable with respect to the other measured variables, the two groups should not differ. 10

Finding evidence for MCAR or MAR: t-tests. Perform a series of independent-samples t-tests comparing the group with missing data and the group without missing data on the means of the other variables in the dataset (for categorical variables, use chi-square tests). Example data (Self-efficacy: completely observed; DV = reading achievement; R = missing data indicator):

Self-efficacy  Reading (DV)  R
5              .             0
1              .             0
2              .             0
4              .             0
2              .             0
5              66            1
3              88            1
4              95            1
3              100           1
2              86            1
4              56            1
5              78            1

This is available in the SPSS Missing Values Analysis module. No significant difference suggests MCAR; a significant difference suggests MAR (good). It is also a good way to identify variables that are related to missingness, which can then be used in MI (they provide information for imputing the missing values). 11

Testing MCAR: Little's (1988) MCAR test. A multivariate extension of the t-test approach: it performs all the t-tests simultaneously. A global test of MCAR, available in the SPSS Missing Values Analysis module under the EM procedure. It tests the null hypothesis that the data are MCAR; a significant MCAR test and/or significant t-tests are an indication of MAR (the data are not MCAR). Issues: (1) it does not identify which variables violate MCAR; (2) it has low statistical power (Type II error) when the number of variables that violate MCAR is small or the relationship between missingness and the data is weak. 12

Traditional methods for handling missing data Listwise deletion Pairwise deletion Mean substitution Regression and Stochastic regression Hot deck imputation Averaging available items Last observation carried forward 13

Listwise Deletion (complete-case analysis): include only cases with complete data. Easy, convenient, and available in all statistical software, but it wastes data and resources, reduces the sample size and statistical power, and assumes MCAR (otherwise it produces biased estimates). 14

Listwise Deletion (complete-case analysis). Problems, assuming MAR for this example data: 1. the remaining cases do not represent the entire sample well; 2. the mean estimate is higher; 3. the variability of the data is reduced.

GPA     Complete data   Listwise deletion
Mean    3.19            3.51
Var     0.76            0.67

15

Pairwise Deletion (available-case analysis): analyses (e.g., correlation, regression) are conducted on different subsets of cases. Assumes MCAR. For example, the correlation r = σ_XY / √(σ_X² σ_Y²) can be computed from (1) the cases with complete data on both X and Y (for the covariance) and (2) the cases having X or Y alone (separate subsamples, for the variances). Estimation problem: the resulting r can fall outside the range from -1 to 1. There is also a lack of a consistent sample size: using different subsets of cases to estimate the parameters makes it difficult to compute standard errors. 16

Arithmetic Mean Imputation (mean substitution): using the mean of the available cases to fill in the missing values (Schafer & Graham, 2002). If Y has some missing values, replace each missing value of Y with the mean of Y calculated from the cases without missing data on Y. This reduces the variability of the data and the correlations, and it can severely bias the parameter estimates, even under MCAR. (The example X-Y data table from Schafer & Graham, 2002, is omitted here.) 17

Regression Imputation (conditional mean imputation): using the predicted scores from a regression equation estimated on the complete cases to fill in the missing values; the predicted score is Ŷ_i = β_0 + β_1 X_i (Schafer & Graham, 2002). It reduces variability and overestimates the correlations between variables and R², even under MCAR. 18

Stochastic Regression Imputation: using the predicted scores from a regression equation estimated on the complete cases, plus a normally distributed error term z_i ~ N(0, σ²), to fill in the missing values; the imputed score is Y_i* = β_0 + β_1 X_i + z_i (Schafer & Graham, 2002). Adding residual terms to the predicted values restores the variability of the imputed data and eliminates the biases, providing unbiased estimates under MAR just like ML and MI. But it attenuates the standard errors and inflates the Type I error rate. 19

Hot-Deck Imputation: impute the missing values from similar respondents. Procedure: if some respondents did not report their income, classify respondents into cells (groups) based on demographic information such as age, gender, and marital status, then randomly draw an income value from similar respondents (Schafer & Graham, 2002). It reduces variability to some extent and produces biases in correlation estimates and regression coefficients. 20

Averaging the available items (multiple-item questionnaires): Researchers typically compute a scale score by summing or averaging the item responses that measure the same construct. For example, with 5 items measuring well-being, if a respondent answered 3 items but not all of them, her/his scale score would be the average of those 3 items (person mean substitution). Potential problems: Cronbach's alpha is incorrect, and the variances and correlations may be biased. Use with caution, especially with a high rate of item nonresponse; ML and MI are better approaches. 21
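
In SPSS syntax, this kind of person mean substitution can be sketched with the MEAN.n function, which returns the mean only when at least n of the listed arguments are valid; the well-being item names wb1 to wb5 below are hypothetical placeholders.

* Scale score = mean of the items that were answered, requiring at least 3 valid responses.
COMPUTE wellbeing = MEAN.3(wb1, wb2, wb3, wb4, wb5).
EXECUTE.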

Last observation carried forward (longitudinal designs): replace the missing values with the observation recorded immediately before dropout. This assumes the scores do not change after the previous measurement and is likely to produce biased estimates, even when data are MCAR.

Observed data:
ID  W1  W2  W3  W4
1   50  51   .   .
2   46   .  48  50
3   24  55  56   .

After carrying the last observation forward:
ID  W1  W2  W3  W4
1   50  51  51  51
2   46  46  48  50
3   24  55  56  56

22

Recommended methods for handling missing data Maximum likelihood method (full information maximum likelihood, FIML) Multiple imputation 23

Why FIML or Multiple Imputation (MI)? Traditional methods each have their own limitations, and some of them make strict assumptions about the missing data mechanism. FIML and MI provide better and more trustworthy parameter estimates, let you draw more appropriate conclusions from your statistical tests, and add rigor to your study. 24

Full information maximum likelihood (FIML): Assumes MAR and multivariate normal data. Implemented in structural equation modeling programs such as Mplus (the default) when the outcome is continuous. When used in the missing data context, it uses all the information in the dataset to directly estimate the parameters and standard errors, handling missing data in one step. It does not drop any cases with missing values and does not produce imputed datasets. FIML reads in the raw data one case at a time and maximizes the ML function case by case. 25

Full information maximum likelihood (FIML): The computations for a case use the information only from the variables and the corresponding parameters for which the case has complete data (Enders, 2010, p. 89). This implies that, depending on the missing data pattern for that case, the computations differ slightly (the ML function is customized to each missing data pattern). Estimation involves an iterative process, each time using different estimates of the parameters, until it finds the set of parameter values that maximizes the likelihood function (Enders, 2010), i.e., maximizes the probability of observing the data and finds the model that best fits the data. ML converges when the parameter estimates no longer change across successive iterations. 26

Full information maximum likelihood (FIML): An iterative process: the distribution is tried in all possible locations until the program finds the location where the distribution, with its set of parameters, best fits the data (has the highest probability/likelihood of observing the data). (Figure: a normal curve positioned along a 0-100 reading achievement scale.) 27

Multiple imputation (MI): Assumes MAR; also called multiple stochastic regression imputation (an iterative procedure). Available in Mplus, SAS, Stata, Blimp, SPSS, R, and other packages. It involves three phases: in the imputation phase, a dataset with missing data produces imputed datasets 1 through m; in the analysis phase, each imputed dataset is analyzed, giving results 1 through m; in the pooling phase, the m sets of results are combined into pooled (overall) results. 28

Multiple imputation, imputation phase: SPSS uses fully conditional specification (FCS), also known as chained equations imputation or multivariate imputation by chained equations (MICE), a Markov chain Monte Carlo algorithm. It does not rely on the assumption of multivariate normality and is flexible in handling different types of variables (scale: linear regression; categorical: logistic regression). Example data with missing values scattered across variables:

ID  Age  Income  Gender
1   35   .       0
2   .    5000    1
3   45   10000   0
4   20   .       1
5   18   4500    .

The imputation model is specified on a variable-by-variable basis: for each variable with missing data, a univariate (single dependent variable) imputation model is fitted using all other available variables in the model as predictors, and missing values are then imputed for the variable being fit (IBM SPSS Missing Values 24). 29

Multiple imputation, imputation phase: The imputation process cycles through all variables with missing values (e.g., Age, then Income, then Gender) iteratively, each time using the newly updated imputed values. This process is repeated several times; when the maximum number of iterations is reached (specified by the researcher or by default), the imputed values at the final iteration are saved, creating one imputed dataset. Requesting 5 imputations with a maximum of 200 iterations means SPSS runs the MCMC algorithm 5 times and saves the imputed values at the 200th iteration each time. Generally, 5-10 iterations are sufficient, but it is recommended to be conservative; you may need to increase the number of iterations if the model hasn't converged (save the iteration history data in SPSS and plot it to assess convergence). 30

Multiple imputation, imputation phase: What variables should be included in the imputation model? (1) At a minimum, the variables that you are going to use in the subsequent analysis should be included. For example, if you run a regression model using gender and SES to predict freshman GPA, then gender, SES, and GPA should all be included in the imputation model. (2) Include auxiliary variables: variables that are either correlates of missingness or correlates of an incomplete variable (Enders, 2010, p. 17). These variables may not be of substantive interest, but they help improve the quality of the imputations and increase the plausibility of MAR. For example, other variables in the dataset, such as parents' education level, ACT, and SAT scores, may be correlated with the variables of interest or with their missingness. 31

Multiple imputation, imputation phase: How many imputed datasets are needed? There is a strong association between statistical power and the number of imputations. Conventional wisdom says 3-5 imputed datasets; however, research has shown that with only 3 or 5 imputed datasets the power is below its optimal level (Graham et al., 2007). According to Enders (2011), generating a minimum of 20 imputed datasets seems to be a good rule of thumb for many situations. If the proportion of missing data is greater than 50%, increase the number of imputations to more than 40 and be thoughtful about the variables included in the imputation model. 32

Multiple imputation, analysis phase: The imputation phase generates m imputed datasets. In the analysis phase, the imputed datasets are analyzed using the normal analysis procedure. For example, a researcher who generates 20 datasets and wants to analyze the data with multiple regression repeats the regression analysis 20 times, one analysis for each dataset. Illustration of results from two of the datasets:

            Dataset 1        Dataset 2
Parameter   β      SE        β      SE
Intercept   2.62   3.41      2.18   3.2
SES         1.81   1.6       1      1.9

33

Multiple imputation, pooling phase: Pooling the point estimate: take the average of the parameter estimates across the m datasets, θ̄ = (1/m) Σ θ̂_t, where m = the number of imputed datasets and θ̂_t = the parameter estimate from dataset t. Pooling the standard errors: the total sampling variance is V_T = V_W + V_B + V_B/m, and SE = √V_T, where V_W = the within-imputation variance (the mean of the squared SEs across the m datasets), V_B = the between-imputation variance (the variability of the parameter estimate across the m datasets; the additional variance that is due to the missing data), and V_B/m = a correction factor for a finite number m of imputations. The statistical significance of θ̄ can be calculated in the usual way from the ratio θ̄ / √V_T. 34
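
As a worked illustration of these pooling formulas, take the two SES estimates from the analysis-phase example above (so m = 2, purely for the arithmetic; in practice many more imputations would be used), written as a LaTeX block:

\begin{aligned}
\bar{\theta} &= \tfrac{1}{2}(1.81 + 1.00) = 1.405 \\
V_W &= \tfrac{1}{2}(1.6^2 + 1.9^2) = 3.085 \\
V_B &= \tfrac{1}{2-1}\bigl[(1.81 - 1.405)^2 + (1.00 - 1.405)^2\bigr] \approx 0.328 \\
V_T &= V_W + V_B + V_B/2 \approx 3.085 + 0.328 + 0.164 = 3.577 \\
SE &= \sqrt{V_T} \approx 1.89
\end{aligned}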

Using SPSS to Deal with Missing Data 35

The example data: High School Longitudinal Study of 2009, public-use data (NCES secondary longitudinal studies; more than 21,000 9th graders in 944 schools). Selected sample: a subsample of 500 students who took math and science courses in 2009. Selected measures: 9th grade sex (0=male), race/ethnicity (0=white), socioeconomic status; 9th and 11th grade math IRT scores; 9th grade math interest (3 items; 4-point Likert scale); 9th grade math self-efficacy (4 items; 4-point Likert scale). Demonstration dataset: Hsls09_MissingDataWorkshop_demo 36

Using SPSS to deal with missing data: First, delete cases that have no data on any of the variables (completely empty cases). All missing values need to be displayed either as system missing (a blank cell) or as user-defined missing (a value assigned by the researcher, such as 999 or -8888). 37

Using SPSS to deal with missing data: Change all missing values (either system missing or user-defined missing values) to a common value, -999. Transform -> Recode into Same Variables -> move all of the variables into the selection box -> click Old and New Values -> recode the old missing values to the new value -999. 38
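
The equivalent step in SPSS syntax looks roughly like the sketch below; the variable names (ses, math09, math11, int1 to eff4) are placeholders for whatever your dataset uses, and the sketch assumes the only user-defined missing code to convert is -8888.

* Convert system-missing values and the user-defined missing code -8888 to -999.
RECODE ses math09 math11 int1 int2 int3 eff1 eff2 eff3 eff4 (SYSMIS=-999) (-8888=-999).
EXECUTE.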

Using SPSS to deal with missing data: Assign missing values for all the variables. In Variable View -> click one cell in the Missing column and assign -999 as a discrete missing value -> click OK. Then right-click that cell and choose Copy, select the Missing cells of the remaining numeric variables, and click Paste. 39

Using SPSS to deal with missing data: Define the variables. In Variable View -> under the Measure column -> assign the measurement level (scale, ordinal, or nominal) for each of the variables. 40
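
In syntax, the two preceding steps (declaring -999 as user-missing and setting measurement levels) can be sketched as follows, again with hypothetical variable names:

MISSING VALUES ses math09 math11 int1 int2 int3 eff1 eff2 eff3 eff4 (-999).
VARIABLE LEVEL ses math09 math11 (SCALE)
  /sex race (NOMINAL)
  /int1 int2 int3 eff1 eff2 eff3 eff4 (ORDINAL).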

Using SPSS to deal with missing data: Analyze the pattern of missing data. Go to Analyze -> Multiple Imputation -> Analyze Patterns. Select the variables (excluding the ID) into Analyze Across Variables. For Minimum percentage missing for variable to be displayed, change the value to 0 -> click OK (so you see everything that is missing). 41

Using SPSS to deal with missing data: Only 1.83% of the individual values are missing. Variables: the number of variables that contain missing values is 9 out of 12 (green). Cases: 409 cases (81.8%) have complete data (blue); 91 cases have at least one missing value on some variable. Values: the number of individual values (out of 6,000 = 12 x 500) that are missing is 110 (1.83%) (green). 42

Using SPSS to deal with missing data: The number and percent missing for each variable. Notice that the variables are ordered by the amount of missing values they contain (i.e., the percentage missing). Examine the percentage missing for each variable and make sure that each value makes sense based on your knowledge of this dataset! 43

Using SPSS to deal with missing data: The missing data pattern here is arbitrary. Each pattern (row) reflects a group of cases with the same pattern of missing and nonmissing values (15 patterns in total). The variables along the bottom (x-axis) are ordered from the least to the highest amount of missing values. The chart also shows the percent missing for the 10 most common patterns: pattern 1 = no missing data (81%) is the most prevalent pattern; pattern 10 = missing on MATH11 (10%). 44

Using SPSS to deal with missing data: Request Little's MCAR test and independent-samples t-tests for MAR. Go to Analyze -> Missing Value Analysis -> Descriptives: request the t-tests that compare, for each continuous variable, the cases with and without missing data, to examine MAR. 45

Using SPSS to deal with missing data: Request Little's MCAR test and the Separate Variance t Tests. Go to Analyze -> Missing Value Analysis. A note: if you get a warning message in the SPSS output that the EM algorithm failed to converge in 25 iterations, you can increase the maximum iterations by clicking on the EM button. 46
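
A rough syntax equivalent of this request is sketched below; the variable names are placeholders, the /TTEST and /EM subcommands follow the MVA command documentation, and the iteration limit is raised as suggested above (Little's MCAR test is printed with the EM output):

MVA VARIABLES=ses math09 math11 int1 int2 int3 eff1 eff2 eff3 eff4
  /TTEST PROB PERCENT=5
  /EM (ITERATIONS=50).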

Using SPSS to deal with missing data: Request Little's MCAR test and the Separate Variance t Tests. Scroll down in the SPSS Output window to the EM Means table; under this table you can find the result of Little's MCAR test. The non-significant result at p = .054 indicates the data are consistent with missing completely at random (MCAR). 47

Examine the independent-samples t-tests: A significant t-test indicates that the probability of missingness is a function of the values of another variable. It is an indication of MAR! We therefore have variables that can be used in the imputation model. 48

Analysis model. Research question: Can students' SES and math self-efficacy predict their 11th grade math score? Dependent variable: MATH11. Independent variables: SES and EFF_total (sum of 4 items). Auxiliary variables (for imputation): SEX, RACE, MATH09, and the math interest items. Correlation analysis: these variables are correlated with the variables of interest to some extent. Independent-samples t-tests: some of them are related to the missingness of the variables of interest. 49

Before imputation, set a random seed: Transform -> Random Number Generators -> select Set Active Generator -> click Mersenne Twister -> select Set Starting Point and Fixed Value -> click OK. 50
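
In syntax this corresponds roughly to the SET command below; the seed value 20150101 is an arbitrary placeholder, and any fixed value makes the imputation reproducible.

* Use the Mersenne Twister generator with a fixed starting point.
SET RNG=MT MTINDEX=20150101.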

Using SPSS to deal with missing data Conducting multiple imputation: Analyze-> Multiple Imputation-> Impute Missing Data Values-> Move the variables of interest to the Variables in Model box. 51

Variables tab -> 5 imputations will be used for demonstration purposes; the missing values will be imputed 5 times and stored. Name the new dataset in the field below the Create a new dataset button. 52

Method tab -> Since the missing data pattern is arbitrary, select FCS. Specify the maximum number of iterations = 200 (default = 10); increase the number of iterations if the Markov chain Monte Carlo algorithm hasn't converged. PMM (predictive mean matching): still uses regression, but the imputed values are adjusted to match the nearest actual value in the dataset (taken from observations with similar predicted values and no missing data on that variable). If the original variable is bounded by 0 and 40, the imputed values will also be bounded by 0 and 40. According to Paul Allison, there are some drawbacks to PMM in SPSS: https://statisticalhorizons.com/predictive-mean-matching 53

Constraints tab -> Click on Scan Data to examine the variable summary. You can specify the role of a variable during the imputation and constrain the range of the imputed values (minimum, maximum, rounding) so that they are plausible. To obtain integer values, specify 1 as the rounding denomination (6.648 -> 7); to obtain values rounded to the nearest cent, specify 0.01 (6.648 -> 6.65). The rounding column allows you to specify the smallest denomination to accept. 54

Constraints tab -> If you specify the Min and Max, the maximum-draws procedure is activated: it attempts to draw values for a case until it finds a set of values that are within the specified ranges. Errors occur if a set of values within the ranges cannot be obtained; in that case, increase the maximum draws. Demonstration: no constraints are placed on the range of the variables. 55

Output tab -> Imputation model: the univariate model type, model effects, and number of values imputed for each variable. Descriptive statistics: basic information before and after imputation. Iteration history: information on the convergence performance. 56
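
Pulled together, the whole imputation request corresponds roughly to the MULTIPLE IMPUTATION syntax sketched below. The variable and dataset names are placeholders, and the subcommand spellings are taken from the IBM SPSS Missing Values manual, so check them against your version if the command is rejected.

* Declare output datasets for the stacked imputations and the FCS iteration history.
DATASET DECLARE imputed5.
DATASET DECLARE iterhist.
* FCS imputation: 5 imputations, up to 200 iterations each, with model and descriptive summaries.
MULTIPLE IMPUTATION sex race ses math09 math11 int1 int2 int3 eff1 eff2 eff3 eff4
  /IMPUTE METHOD=FCS NIMPUTATIONS=5 MAXITER=200
  /IMPUTATIONSUMMARIES MODELS DESCRIPTIVES
  /OUTFILE IMPUTATIONS=imputed5 FCSITERATIONS=iterhist.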

Outputs Hsls09_MissingDataWorkshop_demo2_imputed5 57

Datasets with imputed values are numbered 1 through M, where M is the number of imputations. Select the imputation from the drop-down list in the edit bar in Data view. 58

You can distinguish imputed values from observed values by cell background color. 59

Create a composite score: Transform -> Compute Variable. Compute the scale score (composite score) for self-efficacy in the stacked dataset; this applies to all of the imputed datasets. 60
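
A syntax sketch of this step, assuming the four self-efficacy items are named eff1 to eff4 (placeholders):

* Scale score for self-efficacy, computed in the stacked (original + imputed) dataset.
COMPUTE EFF_total = SUM(eff1 TO eff4).
EXECUTE.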

Before the analysis: Data -> Split File. Split the file by the imputation number; this invokes the analysis and pooling phases for the multiply imputed datasets. 61
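
In syntax, assuming SPSS has created the usual Imputation_ indicator variable in the stacked dataset:

SORT CASES BY Imputation_.
SPLIT FILE LAYERED BY Imputation_.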

Analyze the data as usual. SPSS provides pooled estimates for some analyses but not all; analyses marked with the imputation icon indicate that SPSS provides a corresponding procedure to accommodate multiply imputed datasets. Let's perform a multiple regression. 62
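
With the split file active, an ordinary regression command such as the sketch below (placeholder variable names MATH11, SES, EFF_total) is run on each imputed dataset, and for supported procedures SPSS adds the pooled coefficients:

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT MATH11
  /METHOD=ENTER SES EFF_total.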

SPSS outputs for multiple regression-descriptive statistics 63

SPSS outputs for multiple regression- correlation matrix 64

SPSS outputs for multiple regression - coefficient estimates (Coefficients table):

Imputation  Model                               B       Std. Error  Beta   t       Sig.   FMI    RIV    Rel. Eff.
Original    (Constant)                          45.446  3.777              12.031  .000
Original    X1 Socio-economic status composite   8.626  1.072       .356    8.046  .000
Original    EFF_total                             1.879   .315       .264    5.967  .000
Pooled      (Constant)                          44.126  3.734              11.818  .000   .158   .174   .969
Pooled      X1 Socio-economic status composite   9.242  1.019               9.073  .000   .087   .091   .983
Pooled      EFF_total                             1.901   .309               6.146  .000   .130   .141   .975

FMI = Fraction Missing Info.; RIV = Relative Increase Variance; Rel. Eff. = Relative Efficiency.
a. Dependent Variable: X2 Mathematics IRT-estimated number right score
Results differ slightly across the imputed datasets. SPSS provides pooled estimates for the unstandardized regression coefficients! 65

Imputation Diagnostics 66

SPSS outputs for multiple regression - coefficient estimates. Fraction missing info: the proportion of the total sampling variance that is due to missing data, (V_B + V_B/m) / V_T, for a parameter estimate; it is related to the percentage missing for that variable. The value 0.087 for SES means 8.7% of the sampling variance is due to missing data. It is a measure of the impact of missing data on the parameter estimates. 67

SPSS outputs for multiple regression - coefficient estimates. Relative increase variance: how much the sampling variance is increased (inflated) because of the missingness, (V_B + V_B/m) / V_W. The value 0.141 for EFF_total means that, compared to the sampling variance EFF_total would have with complete data, the estimated sampling variance for EFF_total (with missing data) is 14.1% larger. Variables with a larger percentage of missingness tend to have a larger relative increase in variance. 68

SPSS outputs for multiple regression - coefficient estimates. Relative efficiency: an efficiency estimate from m imputations relative to performing an infinite number of imputations, 1 / (1 + F/M), where F = the fraction of missing info and M = the number of imputations. Values close to 1 indicate higher efficiency and proper SEs (the SEs will not be too large). A large percentage of missing data needs more imputations to achieve sufficient efficiency for the parameter estimates. For SES, the SE obtained from an infinite number of imputations would be 98.3% of the SE obtained from 5 imputations. See the SAS documentation for multiple imputation (Horton & Lipsitz, 2001, p. 246). 69
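
As a quick check of this formula against the pooled output above, using F = .087 for SES and M = 5 imputations:

\[
\mathrm{RE} = \frac{1}{1 + F/M} = \frac{1}{1 + 0.087/5} \approx 0.983,
\]

which matches the relative efficiency reported for SES.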

Iteration history: provides the mean and standard deviation, by iteration and imputation, for the continuous imputed variables. Build a plot from it to examine the convergence of the model. 70

Assessing the performance of imputations Graphs > Chart Builder> select line chart 71

Assessing the performance of imputations (Chart Builder dialog; numbered steps shown in the screenshot). 72

Assessing the performance of imputations: in the Element Properties, select Value as the statistic to display. 73

Assessing the performance of imputations (Chart Builder dialog, continued). 74

Mean and standard deviation of the imputed values of SES at each of the 200 iterations for each of the 5 requested imputations (this plot can be requested for each continuous imputed variable). The purpose of the plot is to look for trends or patterns. The model has converged when the parameter values bounce around in a random fashion with no trend (here it reaches this phase immediately) and the lines for the different imputations are well mixed with each other. 75

Assessing the performance of imputations using trace plots (using Enders's macro, http://www.appliedmissingdata.com/macro-programs.html): plots of the mean and SD of the imputed continuous variables can be requested with Enders's SPSS macro, giving an indication of the performance of the imputations. For this macro: 1,000 iterations with 2 imputed datasets. It provides an additional convergence criterion, the potential scale reduction (PSR) factor, computed every 100 iterations: the MCMC algorithm is regarded as having converged when the PSR < 1.05. 76

Problematic or pathological case of non-convergence: Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68. 77

Healthy case of convergence: Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68. 78

Practice time! 79

The practice data: High School Longitudinal Study of 2009, public-use data. Selected sample: a subsample of 490 participants who took math and science courses in 2009. Selected measures: 9th grade sex (0=male), race/ethnicity (0=white), SES; 9th and 11th grade math and science GPA; 9th grade science utility (3 items; 4-point Likert scale); 9th grade science self-efficacy (4 items; 4-point Likert scale). Nominal variables: SEX, RACE. Scale variables: SES, MGPA12, SGPA12. Ordinal variables: the science utility and self-efficacy items. 80

Analysis model. Research question: Can students' race, SES, and science self-efficacy predict their 12th grade science GPA score? Dependent variable: SGPA12. Independent variables: RACE, SES, and SEFF_total (sum of 4 items). Auxiliary variables for the imputation model: SEX, MGPA12, and the science utility items. Examine the correlation analysis and the univariate t-tests. 81

TASKS: YOU CAN DO IT! Change all missing values (either system missing or user-defined missing values) to a common value, e.g., 999. Assign missing values for all the variables in Variable View. Define the variables: in Variable View, under the Measure column, assign the measurement level for each of the variables. Analyze the pattern of missing data and examine the percentage of missing values (what percentage is missing?). Request Little's MCAR test (EM) and the Separate Variance t-tests. Conduct multiple imputation: 10 datasets, 100 iterations; remember to set the maximum and minimum values of science and math GPA to 0 and 4 (a syntax sketch for this step follows below). Create a composite score for science self-efficacy. Run a regression model to answer the research question. Examine the convergence of the model using the iteration history. 82
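
For the imputation step of the practice, a hedged syntax sketch; the science utility and self-efficacy item names (sutil1 to seff4) are placeholders, and the /CONSTRAINTS subcommand bounds the GPA variables at 0 and 4 as the task requires:

DATASET DECLARE practice_imputed.
MULTIPLE IMPUTATION sex race ses mgpa12 sgpa12 sutil1 sutil2 sutil3 seff1 seff2 seff3 seff4
  /IMPUTE METHOD=FCS NIMPUTATIONS=10 MAXITER=100
  /CONSTRAINTS mgpa12 sgpa12 (MIN=0 MAX=4)
  /OUTFILE IMPUTATIONS=practice_imputed.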

Practical Issues/ Myths 83

Practical issues/myths: Is imputation making up the data? Not really! The goal of imputation is not to produce the individual values and treat them as real data, but to estimate the population parameters and preserve important characteristics of the dataset as a whole (Graham, 2008). MI accounts for the uncertainty associated with the missing data; thus, unbiased estimates can be obtained. 84

Practical issues/myths: Should both the independent variables and the dependent variable be included in the imputation model (MI)? At a minimum, all the variables that you will use in your analysis should be included. Why? When the DV is not included, its correlations with the IVs are assumed to be 0, so excluding it attenuates its relationships with the other variables. Take a liberal approach to variable selection in the imputation phase; the imputation program does not distinguish whether a variable is an IV or a DV. 85

Practical issues: Why include auxiliary variables? Inclusive analysis strategy: ML and MI require MAR, and since there is no test for MAR, we need to find ways to increase the likelihood of satisfying it. Schafer and Graham (2002, p. 173): collecting data on the potential causes of missingness may effectively convert an MNAR situation to MAR. Incorporating a number of auxiliary variables helps increase statistical power and reduce bias in the parameter estimates. Use as many as you can; the most useful are those with correlations of about .40 or higher. 86

Practical issues: When working with a multiple-item questionnaire, should you impute the individual items or the scale scores? If feasible, impute the individual items, since this maximizes the information used to create the imputations and gives more statistical power than imputing scale scores (Enders, 2010, pp. 269-270). 87

Practical issues: What if my missing data are MNAR? Use selection models or pattern mixture models (Chapter 10 in Enders's Applied Missing Data Analysis). These two approaches deal with the MNAR situation by statistically modeling the missing data mechanism. See also Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychological Methods, 16(1), 1. 88

What should I report when I write it up? Missing data mechanisms Percentage of missing for each variable & overall percentage of missing Software for missing data imputation Imputation method & algorithm Number of imputed datasets The variables used in the imputation model 89

Reference
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Graham, J. W. (2012). Missing data: Analysis and design. Springer.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.
Pigott, T. D. (2001). A review of methods for missing data. Educational Research and Evaluation, 7(4), 353-383.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40-49.
Puma, M. J., Olsen, R. B., Bell, S. H., & Price, C. (2009). What to do when data are missing in group randomized controlled trials (NCEE 2009-0049). National Center for Education Evaluation and Regional Assistance.
IBM SPSS Missing Values 21 & 24 (user manuals).
Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 1-68.
90

Recommended websites
UCLA IDRE:
SAS: https://stats.idre.ucla.edu/sas/seminars/multipleimputation-in-sas/mi_new_1/
Stata: https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt 1_new/
Craig Enders' website:
Mplus: http://www.appliedmissingdata.com/additionalexamples.html
Blimp: http://www.appliedmissingdata.com/multilevelimputation.html
91

Thank you! Don't be afraid of missing data! 92