Ronald H. Heck
EDEP 606 (F2015): Multivariate Methods, rev. November 16, 2015
The University of Hawaiʻi at Mānoa


In this handout, we will address a number of issues regarding missing data. It is often the case that the weakest point of a study is the quality of the data that can be brought to bear on the research problem. Missing data can be a problem in both cross-sectional and longitudinal analyses, depending upon the extent to which the data are missing and the patterns of missing data that may be present. The more the analyst can find out about why the data are "as they are," the more she or he can develop a case about the patterns of missing data, as well as a rationale about why the pattern may or may not matter. In some modeling situations, there may be considerable missing data. In this handout, we will explore some commonly accepted strategies for dealing with missing data, concentrating primarily on data preparation when there are missing values. Researchers should consider data preparation and analysis as two separate steps.

As a first step in preparing the data for analysis, it is often useful to determine the amount of missing data, as well as possible patterns of missing data that exist (e.g., For which variables do missing values occur? Are there specific patterns of missing data?). It is important to keep in mind that there is no real way to get missing data back (short of actually following up with subjects in a study). In a sense, then, we are always dealing with the problem of missing information to some extent when we use actual data. The quality of our analysis depends on the assumptions we make about the patterns of missing responses present and what is reasonable to conclude about those patterns in relation to the study's design (e.g., experimental or quasi-experimental, survey) and data collection (e.g., cross-sectional, longitudinal). What we do about the missing data therefore becomes a pressing concern.
There are a number of available strategies for dealing with missing data. Some traditional approaches (e.g., listwise or pairwise deletion, mean substitution, simple imputation) lead to biased results in various situations (Enders, 2011a; Peugh & Enders, 2004). It helps to know what the defaults are for the different software programs that you might be using. Handling missing data in an appropriate manner depends first on one's knowledge of the data set and why particular values may be missing. The traditional way of handling missing data was to use listwise deletion (i.e., eliminating any case with at least one missing value), pairwise deletion (e.g., eliminating a case only from those calculations that involve its missing values, as in calculating a correlation), or mean substitution. Generally, these are not considered acceptable solutions because they lead to biased parameter estimation. For example, listwise deletion is only valid when the data are missing completely at random (MCAR), as when selecting a random sample from a population. This is seldom the case using real data. The second step in data preparation is then to select an approach for dealing with missing data. At present, the two acceptable solutions mentioned in the literature are multiple imputation

(MI) of plausible values and full information maximum likelihood estimation (FIML) with the individuals with partial data included in the analysis (Peugh & Enders, 2004). This latter approach is not generally available in SPSS, although when the data set is structured vertically (i.e., where there may be several lines per subject, as when the outcome consists of repeated measures or multiple outcomes), we can retain subjects who are missing one or more pieces of information on the outcomes. Even various regression-based approaches, such as estimating outcomes with dummy-coded missing data flags to determine whether there were differences in outcomes associated with individuals with missing versus complete data, are not optimal approaches for dealing with missing data by themselves (i.e., because they can introduce bias if data are not missing at random). They can, however, be a useful place to begin in determining whether there is likely any systematic bias due to missing data on variables used in a proposed analysis.

The FIML approach does not actually impute missing values. Instead, FIML identifies parameter values that yield the highest log likelihood using the available information (i.e., including the individuals with partial data). The estimation procedure borrows information from the available observed data to estimate model parameters, since the existing sample itself is considered as containing incomplete information about the population (Peugh & Enders, 2004). Therefore, ML estimation is not so much dependent on the amount of data available as on its distributional quality; that is, if there are unknown patterns of missing values present in the observed data, this will likely lead to some bias in estimating population parameters with the observed data. Of course, all other things being equal, having at least a certain amount of data will generally yield more accurate estimates of the population values.
What constitutes the minimum number of subjects necessary, however, will differ depending on the design of the research, the size of the effects one anticipates, and the complexity of the proposed analysis.

Types of Missing Data

In general, there are three main types of missing data [e.g., see Rubin (1976, 1987) or Peugh & Enders (2004) for further discussion]. These include data that are missing completely at random (MCAR), missing at random (MAR), and non-ignorable missing (NIM), which is also referred to as missing not at random (MNAR). For data to be MCAR, strong assumptions must hold. More specifically, the missing data on a given outcome should be unrelated to either a subject's standing on the outcome or to other observed data or unobserved (missing) data in the analysis. Typically, this assumption is only met when the data are missing by design, as in the situation where we draw a random sample of the studied population. The MAR assumption is weaker: the probability of missing data may be related to other observed variables, but not to the unobserved values themselves. For example, suppose we have a sample where 25% are missing data on an outcome such as a math score. It may be the case that students who have missing data on math are distributed relatively evenly across levels of absenteeism. As long as the missing data are spread across the distribution of a continuous covariate such as grade point average (GPA) or across different categories of an ordinal variable such as absenteeism, we can likely conclude the data are MAR.

We would try to develop an argument that the individuals with missing data on math are relatively evenly dispersed across levels of the covariates or levels of the factors in the model. For example, for those individuals with missing data on math, but with complete data on absenteeism, we may be able to conclude the missing data on math are relatively evenly distributed across absenteeism (coded 1 = low, 2 = average, 3 = high). This is summarized in Table 1 below.

Table 1. Absences * missing data on math crosstabulation (250 of the 1,000 math cases are missing)
[Columns: Not missing, Missing, Total. Rows: absences 1.00, 2.00, 3.00, and Total, each with Count and Expected Count. Cell values are not reproduced here.]

The data suggest the observed and expected values for complete versus missing data on math are fairly evenly distributed across categories of absenteeism. In fact, if we conduct a chi-square test on the distribution of complete versus missing data on math across levels of absenteeism, the difference in observed versus expected values is not significant (χ² = 0.642, 2 df, p > .05). In this first case, therefore, we would likely conclude the data are MAR. Keep in mind, individuals could have missing data on absenteeism and math scores simultaneously and still be MAR, as long as the missing data on one variable seem relatively evenly distributed across cases of the other variable. Of course, if there are a considerable number of individuals with missing data on both math and absenteeism, it may be more difficult to draw a conclusion.

Let's suppose now, however, there are individuals in the data set who are also missing data on absenteeism. We will say there are 150 individuals with missing data on absenteeism (along with the 250 who are missing data on math). Some of the individuals will therefore likely be missing data on both math and absenteeism.
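The chi-square comparison of observed and expected counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts in `table` are hypothetical (they are not the cell values of Table 1), and the function names are invented for this sketch. The p-value shortcut `exp(-x/2)` applies only when the test has exactly 2 degrees of freedom, as it does for a 3 x 2 table.

```python
import math

def chi_square_independence(observed):
    """Pearson chi-square statistic for a two-way table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def p_value_df2(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 2 df."""
    return math.exp(-stat / 2.0)

# HYPOTHETICAL counts, not the handout's Table 1.
# Rows: absences low / average / high; columns: math not missing / missing.
table = [[230, 70], [300, 100], [220, 80]]
stat, df, expected = chi_square_independence(table)
print("chi-square:", round(stat, 3), "df:", df, "p:", round(p_value_df2(stat), 3))

# p-value implied by the statistic reported in the text (0.642 with 2 df)
print("p for reported statistic:", round(p_value_df2(0.642), 3))
```

A large p-value, as here, is consistent with missingness on math being spread evenly across absenteeism categories; it does not by itself prove the data are MAR.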
In Table 2, we have individuals with complete data on math (N = 750), but the table suggests that 118 of the individuals with complete data on math have missing data on absenteeism.

Table 2. Math categories * missing on absenteeism crosstabulation (150 of the 1,000 absenteeism cases are missing)
[Columns: Not Missing, Missing, Total. Rows: mathcategories 1.00, 2.00, 3.00, and Total, each with Count and Expected Count. Cell values are not reproduced here.]

In this table, we can see there is a tendency for individuals with missing data on absenteeism to have lower achievement in math (i.e., with categories defined as low = 1, average = 2, and high = 3). More specifically, the expected count for low achievement is about 12 individuals with missing data on absenteeism, but among those with missing data on absenteeism, there are 18 observed individuals in the low achievement category. It turns out the chi-square statistic is not significant for these individuals in the partial data set (χ² = 4.086, 2 df, p = .130). For individuals with complete data on math, we can see there is no systematic bias in terms of missing data on absenteeism, although we noted a slight tendency for students who are missing on absenteeism to be low achieving in math. It is important to note in passing, however, that in this data set there are actually 150 individuals with data missing on absenteeism (rather than 118), so this means that potentially up to 32 individuals with missing data on absenteeism could also be low achieving in math (category 1), if their math score data were not missing. We cannot be sure about this small, unknown group of individuals.

If we now look at all the individuals in the data set (N = 1000), including those with missing data on either absenteeism or math coded as 0, we can examine whether there is a tendency for individuals with missing data on absenteeism to have either missing data on math (which could still be MAR) or to be higher or lower achieving in math (which might then be NIM).
In Table 3, we can see for individuals with missing data on absenteeism (coded 0), they are relatively evenly distributed across all math levels, if we compare observed versus expected counts in each category (including the math category with missing data = 0).

Table 3. Absences * mathcategories crosstabulation
[Columns: mathcategories .00 (missing), 1.00, 2.00, 3.00, and Total. Rows: absences .00 (missing), 1.00, 2.00, 3.00, and Total, each with Count and Expected Count. Cell values are not reproduced here.]

We can see in the first row (absences = 0) a slight tendency for individuals with missing data on absences to be low achieving in math (observed count = 20 versus expected count = 12). However, the observed versus expected counts are relatively similar for other categories of math achievement. On the other hand, for individuals missing data on math, there is no real tendency for them to be high in absenteeism (coded 3). More specifically, the expected count is 70 and the observed count is 68. As noted, if it were the case that the probability of an individual missing data on the math outcome is related to standing on the outcome, even for individuals with the same value on a covariate (e.g., there is more missing low-math data than missing average- and high-math data among students with the same attendance level), then it is likely that the data are NIM. This does not seem to be the case in this example data set. Data that are NIM can produce more bias in model estimation than either of the other missing data situations because the missing data on the absenteeism variable are related to actual values of individual achievement for those subjects who do not take the math test. We would prefer to be able to say the pattern of missing data on math outcomes is relatively similar for students with high, average, and low absenteeism. This would then indicate data that are MAR. Although it is often reasonable to assume that data are MAR, under some circumstances this assumption may not hold (Enders, 2011a).
For example, if we again had 1,000 students who took the math test, with the same 250 having missing test data, but now determined that the majority of those 250 missing individuals also had high absenteeism (say 2/3), then we might have to acknowledge there could be some bias present. This bias might be necessary to note if we knew, for example, there was a statistically significant relationship between high absenteeism and low math achievement among those individuals with complete data. Given the large number of missing individuals with high absenteeism, then it would likely be difficult to argue that the

missing data on the student absenteeism variable would not affect the estimation of students' math test scores in the population. You can see that dealing with missing data often involves being able to make a reasonable argument about why the data are as they are and whether or not this is likely to bias the estimation of the outcomes in the analysis phase. Other techniques have been developed for data that are NIM. For example, Enders (2011a) demonstrates the usefulness of two of the NIM approaches for longitudinal data (i.e., pattern mixture models and selection models) and applies them to a real data set. The pattern mixture model, for example, involves stratifying the sample into subgroups that share the same pattern of missing data and then estimating a separate model for each pattern of missing data.

SPSS is limited in its ability to deal with missing data appropriately. As the default, the program uses listwise deletion of cases with missing data. This means any individual with data missing on any variable will be dropped from the analysis. As an example, an individual with complete data on nine variables but missing data on the tenth variable would be dropped from the analysis. This obviously can result in a tremendous loss of data from individuals who may have largely complete data otherwise and, generally, it will lead to a loss of power to detect effects as well as potentially biased parameter estimates, since the estimates will be based on reduced (and possibly biased) information. Mean substitution treats individuals with missing data as if they were sitting on the grand mean, which may not be very likely. SPSS does provide a number of options for examining missing data in addition to listwise, pairwise, or mean substitution. For example, the Base Statistics program includes a basic routine to replace missing values.
This is, however, a limited routine which provides the following replacement methods: series mean, mean of nearby points, median of nearby points, linear interpolation, and linear trend at point. It is accessed from the program's toolbar, where the user can select TRANSFORM and REPLACE MISSING VALUES. It is not appropriate for use in real analyses, however.

Next, we would make a choice about how to address the missing data in our analyses. I actually favor a process where the analyst attempts to triangulate the results with different approaches which are currently recommended for examining missing data. One possible approach is to do something like the following. First, the analyst can try running the model using listwise deletion (which assumes MCAR). This data set is likely to be considerably smaller than the data set that would include individuals with "partially complete" data (if FIML estimation were available), but it does provide the analyst with a baseline view of the relationships in the data (albeit likely a biased one) for comparing subsequent results. Second, if there is not too much missing data per variable, the listwise results can be compared against a number of complete data sets generated using a multiple imputation program, which can be applied to the existing data under the assumption the data are MAR. Third, if the analyst has access to an SEM program (like Mplus), she or he can also try estimating the model with FIML with the cases with partial data included in the analysis and then compare these results with the other approaches. Also, if the analysis is using a secondary (existing) data set, there may be sample weights

available. In this case, the analyst can (and should) check whether the available sample weights include adjustment for subject nonresponse. If there are such weights, they will be useful in addressing the issue of missing data.

Multiple Imputation

The only viable option, then, in SPSS is multiple imputation (MI). SPSS has a missing data module that can provide multiple imputation for missing data, but it is designed for single-level imputation only (i.e., referred to as design-based imputation). Of course, this is fine if one is not working with multilevel data (i.e., where individuals are clustered in groups such as schools, hospitals, or businesses). The available SPSS approach allows the user to generate and save a number of data sets with random values imputed for missing values. One of the advantages of MI for dealing with missing data is that the program imputes random plausible values for missing cases and the user can produce several data sets, which will provide a normal distribution of imputed values. In order to use MI, however, there should be a relatively large number of individuals in the study (ideally, perhaps at least 150 or 200). Each data set developed is filled in with a different set of plausible replacement values for the missing data. Often, the advice is to create 5-10 imputed data sets, analyze them all, and obtain the average of the estimates and their adjusted standard errors across all the data sets as the final set of results. Note that some researchers suggest imputing more data sets than this.

Patterns of missing data can first be identified and then plausible values can be imputed using a three-step process. The imputed data sets are created first, and each data set is then analyzed separately. Finally, the parameter estimates reported are averaged over the set of imputed data sets, and the corresponding standard errors needed for hypothesis testing are computed using the Rubin formula (Rubin, 1987).
We will learn how to compute the standard errors in the next section of the handout, although SPSS can do that automatically, at least for many of the corresponding SPSS analytic procedures (Enders, 2011b), including multiple regression.

Imputing Missing Values

There are generally three phases of the MI process. First is the imputation phase, where plausible values are imputed into a number of data sets using information from variables the analyst chooses to include in the analysis. This phase is also sometimes referred to as data augmentation. Obtaining the estimates involves an iterative, two-step process. At the first (imputation) step (I-step), missing values are imputed using regression equations, and then a covariance matrix and mean vector are estimated. At the second (posterior) step (P-step), the regression coefficients are randomly perturbed in order to produce a slightly different equation for the next imputation step. This repeats until the covariance matrices from adjacent iterations differ by only a trivial amount [see Peugh and Enders (2004) for further discussion]. The process can be used to create a number of imputed data sets, where each simulates a random draw from the distribution

of plausible values for the missing data (Peugh & Enders, 2004). These can be saved as separate data sets and subsequently analyzed. The second MI phase is the analysis phase. This phase produces separate estimates for each of the data sets created. Third is the pooling phase, where the data sets are brought together and the parameters of the model under consideration are estimated across the data sets. As noted previously, in SPSS, the pooled estimates can be produced following Rubin's (1987) approach for most analytic procedures. Similarly, Mplus can also pool and analyze the data sets, producing one set of final estimates for the separate imputed data sets.

Rubin's (1987) approach involves combining the results from an analysis performed m times (i.e., once for each of m imputed data sets) to obtain a single set of results. From each analysis, one must first calculate and save the parameter estimates and standard errors. Suppose that $\hat{Q}_j$ is an estimated regression coefficient of interest (e.g., for student socioeconomic status, or SES) obtained from imputed data set j (j = 1, 2, ..., m) and $U_j$ is the standard error associated with $\hat{Q}_j$. The overall estimate is the average of the individual estimates:

\[ \bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j \tag{1} \]

The overall standard error for $\bar{Q}$ is composed of two components. First is the within-imputation variance, and second is the between-imputation variance, which is defined as additional sampling error due to the presence of missing data. The within-imputation variance can be defined as the average squared standard error from the imputed data sets:

\[ \bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j^{2} \tag{2} \]

The between-imputation variance is the variance of the estimates across the data sets. This represents the usual formula for sample variance:

\[ B = \frac{1}{m-1} \sum_{j=1}^{m} \left( \hat{Q}_j - \bar{Q} \right)^{2} \tag{3} \]

The total variance is estimated as

\[ T = \bar{U} + \left( 1 + \frac{1}{m} \right) B \tag{4} \]

The overall standard error (SE) is then the square root of T.
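As a rough illustration of Equations 1 through 4, the pooling arithmetic can be sketched in Python. The function name `pool_rubin` and the input values are hypothetical and chosen only so the arithmetic is easy to follow; they are not estimates from this handout's data. The ratio of the pooled estimate to its standard error is also returned, since that ratio is what gets compared against a t distribution.

```python
import math

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets (Rubin, 1987)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors, m >= 2")
    # Eq. 1: overall estimate is the mean of the per-data-set estimates
    q_bar = sum(estimates) / m
    # Eq. 2: within-imputation variance is the mean squared standard error
    within = sum(se ** 2 for se in std_errors) / m
    # Eq. 3: between-imputation variance is the sample variance of the estimates
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    # Eq. 4: total variance inflates the between component by (1 + 1/m)
    total = within + (1 + 1 / m) * between
    se = math.sqrt(total)
    return {"estimate": q_bar, "within": within, "between": between,
            "total": total, "se": se, "t_ratio": q_bar / se}

# HYPOTHETICAL coefficient estimates and standard errors from m = 3 data sets
pooled = pool_rubin([2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
for name, value in pooled.items():
    print(name, round(value, 4))
```

The same formula can be checked against the components reported later in this handout for ZSES: with a within-imputation variance of 1.12, a between-imputation variance of 0.28, and m = 3, the total variance is 1.12 + (4/3)(0.28), or about 1.49, and its square root is about 1.22.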
Significance tests of averaged individual parameters use a t distribution, where $\bar{Q}$ is the averaged estimate and $\sqrt{T}$ is its overall standard error:

\[ t = \frac{\bar{Q}}{\sqrt{T}} \tag{5} \]

Another advantage of the MI approach is that other variables not included in the actual analysis can also be used to supply information about missing data when the assumption that the data are MAR is plausible. The variables used in the imputation phase need not be included in

the estimation of the proposed model. This can be useful when the analyst wants to use as much information as possible from other variables in the data set to impute plausible values, but the variables used may not all be relevant to the focus of the specific analysis.

In Table 4, we have a simple multiple regression model to estimate the effects of gender and socioeconomic status on student reading scores. The data are based on 20 individuals with complete data. We can see that standardized (M = 0, SD = 1) socioeconomic status (ZSES) is statistically significant in explaining reading scores (B = 3.538, p < .05), but gender is not significant (p = 0.938).

Table 4. Unstandardized estimates with complete data.
[Columns: B, Std. Error, T, Sig. Rows: (Intercept), ZSES, Female. Cell values are not reproduced here.]

In Table 5, we can estimate the same model but with missing data on ZSES for six individuals. Now, with listwise deletion in SPSS (i.e., where any individual with missing data is eliminated from the analysis), we will lose 30% of the data. We can see that estimating the model with the remaining 14 individuals will result in a model where ZSES now does not affect reading scores significantly (B = 2.989, p > .05). We can also note the estimated parameter is smaller and, because of the loss of data, the power to detect the effect if it exists in the population is also reduced. We can therefore see that listwise deletion will result in bias unless the data are MCAR. This would only generally be the case, however, if we drew a random sample from a population (i.e., where the individuals not drawn for the sample could be thought of as missing completely at random). Of course, in Table 5 the missing individuals affect all the parameters in the model, not just the one variable with missing data (ZSES).

Table 5. Unstandardized estimates with 30% missing data.
[Columns: B, Std. Error, T, Sig. Rows: (Intercept), ZSES, Female. Cell values are not reproduced here.]

For demonstration purposes, I first estimated the model with the incomplete data using full information maximum likelihood (FIML), so the individuals with partial data are included. As you can see in Table 6, when the model is run using the Mplus software (which has FIML), we can retain all 20 individuals in the data set since the individuals with missing data on SES are included in the analysis. Even though the estimated ZSES parameter value (2.770) is smaller than the

corresponding estimate in Table 5 (2.989), it is statistically significant because of the increased power to detect the effect over the listwise data set. This is because all the individuals could be retained in the analysis.

Table 6. Unstandardized Mplus model results (FIML estimation with partial data*).
[Columns: Estimate, S.E., Est./S.E., Sig. Rows: Read on ZSES, Read on Female, Intercept for Read. Cell values are not reproduced here.]
*Number of observations = 20

Unfortunately, although SPSS has ML estimation available, it will not include individuals with partial data in any of its standard analytic platforms. In general, therefore, the only viable option available in SPSS for dealing with individuals with partial data is to use multiple imputation. As noted, MI will impute plausible values for variables with missing values by borrowing information from similar cases with complete data on the variables with missing values.

As a beginning point, it is often useful to gather information about whether the missing values for a variable of interest are likely MAR. In this case, the missing data are on ZSES (and not the outcome), so we might begin by creating a flag for missing SES data. We can use Transform: Recode into Different Variables to create a variable called Missing. We open the dialog box and place ZSES in it. Then we create a new variable called Missing (click Change to add it to the data set). Then click on Old and New Values. First, click on System Missing, code it as 0, and add it to the dialog box. Then click on All Other Values, code them as 1, and add this to the dialog box. Click Continue and this will create a missing value variable for ZSES. Now we can see if there is any systematic missingness associated with ZSES with respect to the reading outcome. You will recall from the earlier discussion of types of missing data that data can still be MAR if the probability of data being missing on the outcome is related to missing data on a covariate, but not to subjects' standing on the outcome.
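The missing-data flag described above can also be sketched outside SPSS. The Python below is a minimal illustration with hypothetical values (not the handout's 20 cases) and invented function names; it uses the same coding as the recode steps, with 0 for a missing ZSES value and 1 otherwise. Comparing mean reading scores by flag is a simpler stand-in for regressing the outcome on the flag, so it is not identical to a model that also controls for other predictors.

```python
def missing_flag(values):
    """Return 0 where a value is missing (None) and 1 otherwise."""
    return [0 if v is None else 1 for v in values]

def mean_by_flag(outcome, flag):
    """Mean of the outcome within each flag group (0 = missing, 1 = observed)."""
    groups = {}
    for y, f in zip(outcome, flag):
        groups.setdefault(f, []).append(y)
    return {f: sum(ys) / len(ys) for f, ys in groups.items()}

# HYPOTHETICAL data: None marks a missing ZSES value
zses = [0.5, None, -1.2, 0.3, None, 1.1]
read = [52.0, 47.0, 41.0, 50.0, 45.0, 58.0]

flag = missing_flag(zses)
means = mean_by_flag(read, flag)
print("flag:", flag)
print("mean reading, ZSES missing:", means[0])
print("mean reading, ZSES observed:", means[1])
```

A large gap between the two group means would be a warning sign that missingness on the predictor is related to the outcome; a formal test would still be needed before drawing a conclusion.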
In this case, however, there are no missing data on the outcome, so it is easier to examine whether or not missing data on SES are related to the level of reading outcomes. The multiple regression results suggest that there is no relationship between the missing data on ZSES and reading scores. More specifically, we can see that students with missing data on SES did tend to have lower reading scores, but the result was not statistically significant (p > .05). One way the result might be statistically significant would be if more low (or high) SES students were missing. This observed result is therefore consistent with the idea that the data are likely MAR. Note that it might be possible for there to be missing data on SES that are associated with gender; that is, there can be missing values associated with one or more other predictors. The key issue is whether standing on a predictor (or predictors jointly) is associated with standing on the outcome. So, for example, if low SES students are known to be more likely to have low reading scores, and there are more missing data regarding low SES students, then it will be harder to argue that the data are truly MAR.

Table 7. Unstandardized coefficients estimating the relationship between missing values on ZSES and reading levels.
[Columns: B, Std. Error, T, Sig. Rows: (Constant), Female, misszses. Dependent variable: read. Cell values are not reproduced here.]

In this instance, however, since missing values on ZSES do not appear to predict levels of reading in Table 7, we could argue that the pattern of missing data with respect to ZSES is likely MAR with respect to reading scores.

Creating Imputed Data Sets

For demonstration purposes, next we can examine three data sets generated through multiple imputation in SPSS with the assumption that the data are MAR (recall it is suggested to create between 5 and 10 data sets as a minimum). Each of the three imputed data sets is first analyzed separately in Table 8.

Table 8. Unstandardized Estimates for 3 Imputed Data Sets and Averaged Estimates
[Columns: Parameter, Coefficient, SE, T, Sig. Rows: ZSES and Female for Data Set 1, Data Set 2, Data Set 3, and the Imputed Estimates. Cell values are not reproduced here.]

The results across the three data sets indicate that ZSES is statistically significant in explaining reading levels (p < .05). Female is not statistically significant in any of the three data sets (p > .05). You can see the size of the parameter estimates differs considerably (due to the small sample size). Keep in mind that if I imputed the data a second time, I would obtain different estimates, since the program selects random plausible values each time for ZSES based on other available information. The last estimates in the table are the averaged imputed estimates for the three data sets.

Obtaining the Correct Averaged Estimates for SPSS Using Rubin's Method

At the last step, we would provide the results of our analysis, which incorporates our approach for dealing with missing data. Here is the series of steps to use in order to take the estimates from each imputed data set in Table 8 and develop the correct set of averaged imputed estimates with corrected standard errors for the imputed data sets.

1. Find the average of the estimates for ZSES and female in the imputed data sets (Eq. 1).
   For ZSES: the three estimates sum to a total that, divided by 3, gives 2.892.
   For female: the three estimates (one of which is -0.267) sum to a total that, divided by 3, gives 0.118.

2. Obtain the within-imputation variance, the average of the three squared standard errors (Eq. 2).
   For ZSES: 3.36/3 = 1.12
   For female: 23.54/3 = 7.85

3. Obtain the between-imputation variance, the sum of the squared deviations of the three estimates from their average, divided by m - 1 = 2 (Eq. 3).
   For ZSES: 0.57/2 = 0.28
   For female: 4.91/2 = 2.46

4. Obtain the total variance (Eq. 4), where 1 + 1/3 = 1.33.
   For ZSES: 1.12 + (1.33)(0.28) = 1.49
   For female: 7.85 + (1.33)(2.46) = 11.12

5. The standard error is estimated as the square root of the total variance.
   For ZSES: √1.49 = 1.221
   For female: √11.12 = 3.335

6. Finally, we can construct a t-ratio from the ratio of the average estimate to its corrected standard error (Eq. 5).
   For ZSES: 2.892/1.221 = 2.369
   For female: 0.118/3.335 = 0.035

You can see this information in the t-ratio column for imputed estimates in Table 8. You can then estimate the p-value from a table of t-scores (or use a t-score conversion tool obtained online).

We can compare the averaged results obtained using SPSS against the Mplus results in Table 9, which use Rubin's (1987) approach. We can see the estimates of the averaged regression coefficients in Mplus are only slightly different from the SPSS averaged coefficients; for ZSES, for example, the SPSS unstandardized estimate (2.892) differs only slightly from the Mplus estimate. The standard error for ZSES in SPSS is a little larger (1.221) than the Mplus estimate (1.162). In contrast, the calculated standard error for female is slightly smaller (3.335) than the Mplus estimate (3.390). These small differences are likely due to slightly different estimation procedures. When Mplus standard errors for each data set are used and the standard error adjustments for ZSES and female are calculated by hand, they agree with the Mplus output.

Table 9. Mplus unstandardized results (ML estimation)
[Columns: Estimate, S.E., Est./S.E., Sig. Rows: Read on Female, Read on ZSES. Cell values are not reproduced here.]

Producing Pooled Estimates in SPSS

We can compare our estimates of the averaged effects at the bottom of Table 8 against SPSS pooled estimates in Table 10. SPSS places the original data and the imputed data sets in one (stacked) file. A variable called Imputation_ is used to differentiate the original data (0) from the successive imputed data sets (1, 2, ..., n). In this example, we will impute 3 data sets, but generally the recommendation is for probably 10 to 20 data sets (the default is 5).

MULTIPLE IMPUTATION female Zses read
  /IMPUTE METHOD=AUTO NIMPUTATIONS=3 MAXPCTMISSING=NONE
  /MISSINGSUMMARIES NONE
  /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS='C:\Users\COE Staff\Documents\My Files\EDEP 606\Imputed3.sav'.

Following is the output regarding the 3 imputations in SPSS. The default imputation method was used, which develops the imputed data sets based on scanning the measurement scales of the data with missing values (e.g., continuous, ordinal, binary). As shown, ZSES was the only variable among the variables used in the imputation process that had missing data.

Imputation Specifications
  Imputation Method:                                  Automatic
  Number of Imputations:                              3
  Model for Scale Variables:                          Linear Regression
  Interactions Included in Models:                    (none)
  Maximum Percentage of Missing Values:               100.0%
  Maximum Number of Parameters in Imputation Model:   100

Imputation Results
  Imputation Method:                                  Monotone
  Fully Conditional Specification Method Iterations:  n/a
  Dependent Variables
    Imputed:                                          Zses
    Not Imputed (Too Many Missing Values):            (none)
    Not Imputed (No Missing Values):                  female, read
  Imputation Sequence:                                female, read, zses

The results below suggest that there were six missing values and that 18 values were imputed (6 missing cases x 3 imputed data sets).
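To make the stacked-file idea and the count of 18 imputed values concrete, here is a minimal Python sketch of regression-based imputation with random noise. It is a simplified stand-in and not the SPSS algorithm: the data values are hypothetical, ZSES is predicted from read only (SPSS used both female and read), and names such as `impute_once` and `stack_imputations` are assumptions made for illustration.

```python
import random
from statistics import fmean

# Hypothetical cases: (female, read, zses); None marks a missing ZSES value.
CASES = [
    (1, 52.0, 0.3), (0, 47.0, -0.4), (1, 60.0, 1.1),
    (0, 55.0, 0.2), (1, 43.0, -0.9), (0, 58.0, 0.8),
    (1, 50.0, None), (0, 45.0, None), (1, 57.0, None),
    (0, 62.0, None), (1, 41.0, None), (0, 53.0, None),
]
rows = [{"female": f, "read": r, "zses": z} for f, r, z in CASES]


def impute_once(rows, rng):
    """Fill missing zses with a regression prediction from read plus random noise."""
    complete = [r for r in rows if r["zses"] is not None]
    xs = [r["read"] for r in complete]
    ys = [r["zses"] for r in complete]
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    resid_sd = (sse / (len(xs) - 2)) ** 0.5
    filled = []
    for r in rows:
        was_missing = r["zses"] is None
        zses = r["zses"]
        if was_missing:
            # The random draw is what makes each imputed data set differ.
            zses = intercept + slope * r["read"] + rng.gauss(0, resid_sd)
        filled.append(dict(r, zses=zses, was_missing=was_missing))
    return filled


def stack_imputations(rows, m, seed=0):
    """Original data (imputation 0) followed by m imputed copies in one stacked list."""
    rng = random.Random(seed)
    stacked = [dict(r, was_missing=r["zses"] is None, imputation=0) for r in rows]
    for k in range(1, m + 1):
        stacked.extend(dict(r, imputation=k) for r in impute_once(rows, rng))
    return stacked


stacked = stack_imputations(rows, m=3, seed=1)
n_missing = sum(r["zses"] is None for r in rows)
n_imputed = sum(r["was_missing"] for r in stacked if r["imputation"] > 0)
print(n_missing, n_imputed)  # 6 missing values, 18 imputed values
```

The `imputation` key plays the role of the IMPUTATION_ variable: 0 marks the original data and 1 to 3 mark the imputed copies. Changing the seed changes the imputed values, which is why a second imputation run yields different estimates.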

Imputation Models
  Model   Type                Effects        Missing Values   Imputed Values
  Zses    Linear Regression   female, read   6                18

After the multiple imputation is conducted, and before actually analyzing the data using multiple regression, it is necessary to use the SPLIT FILE option in SPSS (DATA: SPLIT FILE). We can split the imputed data sets to examine the original data set with missing data against the imputed data sets and to run a pooled analysis with one set of estimates based on Rubin's (1987) techniques discussed earlier in the chapter. We select Compare Groups. The imputed data have already been sorted. In the SPSS syntax below, we first provide a statement to split the file layered by the imputation (Imputation_) data sets. Afterward, we can conduct the multiple regression analysis.

SPLIT FILE LAYERED BY Imputation_.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT read
  /METHOD=ENTER female Zses.

Examining the Parameter Estimates for 3 Imputed Data Sets in SPSS

The results in Table 10 across the three data sets indicate that ZSES is statistically significant in explaining reading levels (p < .05). Female is not statistically significant in any of the three data sets (p > .05). You can see that the size of the parameter estimates differs considerably (due to the small sample size). Keep in mind that if I imputed the data a second time, I would obtain different estimates, since the program selects random plausible values each time for ZSES based on other available information. The last estimates in the table are the averaged imputed ones for the three data sets. You can compare the pooled output produced by SPSS in the table below to the estimates I calculated by hand in Table 8. The pooled estimates for female and ZSES are quite similar to the ones calculated by hand.
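The logic of splitting the stacked file by Imputation_ and running the same regression in each layer can be sketched as follows. This is a hedged illustration in Python rather than SPSS: the stacked values are hypothetical, the regression uses a single predictor (read on ZSES) to stay short, and `simple_ols` and `fit_by_imputation` are names invented for this example.

```python
from collections import defaultdict

# Hypothetical stacked file: (imputation, zses, read); None marks a missing ZSES value.
STACKED = [
    (0, -1.0, 48.0), (0, 0.0, 50.0), (0, 1.0, 52.0), (0, None, 53.0), (0, 2.0, 54.0),
    (1, -1.0, 48.0), (1, 0.0, 50.0), (1, 1.0, 52.0), (1, 1.5, 53.0), (1, 2.0, 54.0),
    (2, -1.0, 48.0), (2, 0.0, 50.0), (2, 1.0, 52.0), (2, 1.0, 53.0), (2, 2.0, 54.0),
]


def simple_ols(xs, ys):
    """Slope and its standard error for y = a + b*x (one predictor only)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = (sse / (n - 2) / sxx) ** 0.5
    return slope, se


def fit_by_imputation(stacked):
    """Run the same regression separately within each imputation layer."""
    groups = defaultdict(list)
    for imputation, zses, read in stacked:
        if zses is not None:  # listwise deletion applies only to the original data (0)
            groups[imputation].append((zses, read))
    return {
        k: simple_ols([x for x, _ in pairs], [y for _, y in pairs])
        for k, pairs in sorted(groups.items())
    }


results = fit_by_imputation(STACKED)
for k, (slope, se) in results.items():
    print(k, round(slope, 3), round(se, 3))
```

Layer 0 is the listwise analysis of the original data, and layers 1 and 2 are the imputed data sets. The per-layer slopes and standard errors are the inputs that Rubin's rules would then combine into one pooled row.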

Table 9. Intercepts and Unstandardized Coefficients (dependent variable: read). Columns: B, Std. Error, t, Sig. Rows: (Constant), female, and Zses for the original data (imputation_ = 0), for each imputed data set (imputation_ = 1 to 3), and for the pooled estimates.

Below in Table 10 are the estimates of the model R-square statistics for the missing data set and the three imputed data sets. You can see there is variability across the imputed data sets.

Table 10. Model Summary (predictors: (Constant), Zses, female). Columns: R, R Square, Adjusted R Square, Std. Error of the Estimate. Rows: the original data (imputation_ = 0) and each imputed data set.

Finally, we have the overall ANOVA results for each imputed data set in Table 11. The F-tests suggest variability across the separate imputed data sets in how well the overall model accounts for variance in reading scores.

Table 11. ANOVA Results (dependent variable: read; predictors: (Constant), Zses, female). Columns: Sum of Squares, df, Mean Square, F, Sig. Rows: Regression, Residual, and Total for the original data (imputation_ = 0) and each imputed data set.

Missing Data in Vertical Format in SPSS Using Mixed Modeling

As noted previously, at present SPSS does not support FIML estimation when observations are missing, as is found in typical SEM software programs. However, where one can arrange the data vertically (e.g., where a single individual has repeated observations that comprise several rows in the data set), only the particular piece of missing information will be dropped if it is on the dependent variable. If covariates are missing, however, the subject will be listwise deleted, which will likely introduce some bias into the analysis. Besides repeated measures data, arranging the outcome data vertically can also be useful in situations where an analyst may wish to examine several univariate outcomes (e.g., individual results on reading, math, and language tests). If there are considerable missing data on each outcome, treating the outcome as multivariate (i.e., with vertical arrangement of the data at level 1) can retain most of the individuals with partial data, since only cases where data are missing on all three tests will be dropped. It is important to note that keeping participants with partial data is important for justifying the MAR assumption. Where MAR can be supported, this should lead to estimates that are not biased (Hox, 2010).

Here is a simple illustration of how this works with longitudinal data. Table 10 shows 3 repeated reading measures for each of 4 individuals. We can see a different pattern of missing data for each individual. SPSS can handle different patterns of missing data (i.e., missing on the first occasion, the second or third occasion, or multiple occasions) and different amounts of missing data.

Table 10: Vertical Data Format. Columns: Subject, Time, Score.

Some individuals in the table have no missing observations, some have missing data on one occasion, and some have missing data on two occasions. As long as Y is not missing on all occasions, the program will come up with an "estimated" growth over each time interval, as well as an initial status (intercept) estimate, even though the initial data point is missing for subject 4. In Table 11, closer inspection of the data using Missing Values Analysis (MVA) in SPSS shows that individuals have missing data on the outcome. We can also see that of the total data lines (12), only 8 lines are present (so 33% of the data are missing).

Table 11. Univariate Statistics. Columns: N, Mean, Std. Deviation, Missing (Count, Percent). Rows: time, score.

Next, in Table 12, we can see that there is more missing data at the second time interval than at either the first or third interval. Also, we can see that if we only used the 8 lines of data, the

grand mean over occasions would be based on those 8 scores alone. This is not as relevant since we are estimating growth over time, rather than one grand mean, but it makes the point that we are losing one-third of the data.

Table 12. Tabulated Patterns. Columns: Number of Cases; Missing Patterns (a) for time and score; Complete if ... (b); time (d); score (c).
a. Variables are sorted on missing patterns.
b. Number of complete cases if variables missing in that pattern (marked with X) are not used.
c. Means at each unique pattern.
d. Frequency distribution at each unique pattern.

Below in Table 13 is the Model Dimension table, which is part of the SPSS output from estimating this model. It shows that all four subjects are retained in the analysis. This can be important information for analysts concerning how many individuals in the data set are actually being included in the analysis. It is important to be able to accept that the data are MAR, since the maximum likelihood estimates in the MIXED analytic platform in SPSS depend on this assumption in order to be unbiased. This is why being able to include all individuals with partial or complete data is generally important for accepting this assumption as valid. The more individuals who are dropped from the analysis, the harder it is to make the case that the data are indeed MAR.

Table 13. Model Dimension (dependent variable: score)

                               Number of   Covariance   Number of    Subject     Number of
                               Levels      Structure    Parameters   Variables   Subjects
  Fixed Effects      Intercept 1                        1
                     Time      1                        1
  Repeated Effects   Time      3           Diagonal     3            Subject     4
  Total                        5                        5

Importantly, Table 14 summarizes the estimates for students' initial intercept score and their change over each interval of time. The initial status intercept (5.21), which describes the mean when Time = 0, is estimated from the available information, and all four individuals are kept in the analysis, which would not be the case if we estimated the model using only individuals with complete data.
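The difference between keeping partial data in vertical format and keeping only complete cases can be illustrated with a short Python sketch. The scores below are hypothetical; only the missing-data pattern is chosen to match the description in the text (4 subjects, 3 occasions, 8 of 12 data lines present, the most missing data at the second occasion, and the initial data point missing for subject 4). The name `to_long` is an assumption for illustration.

```python
# Hypothetical wide-format scores (one list per subject); None marks a missing occasion.
wide = {
    1: [4.0, 6.0, 8.0],
    2: [5.0, None, 9.0],
    3: [6.0, None, 10.0],
    4: [None, 7.0, None],
}


def to_long(wide):
    """One row per subject per occasion; only rows with a missing score are dropped."""
    return [
        {"subject": subject, "time": time, "score": score}
        for subject, scores in wide.items()
        for time, score in enumerate(scores)  # time is coded 0, 1, 2
        if score is not None
    ]


long_rows = to_long(wide)
total_cells = sum(len(scores) for scores in wide.values())
pct_missing = 100 * (total_cells - len(long_rows)) / total_cells

# Subjects kept when partial data are retained (vertical format).
retained = sorted({row["subject"] for row in long_rows})
# Subjects kept when only complete cases are allowed.
complete_cases = [s for s, scores in wide.items() if None not in scores]

print(len(long_rows), round(pct_missing), retained, complete_cases)
```

In the vertical arrangement all four subjects contribute at least one row, whereas a complete-case analysis would keep only subject 1.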
For example, if we used RM ANOVA, in this small data set that would be only one individual! This makes the point about the importance of retaining partial data. Of course, if the missing data were also on a predictor (such as gender or SES), then we would also have to investigate possible effects of missing data due to the predictors (as well as due to the outcome).

Table 14. Estimates of Fixed Effects (dependent variable: score). Columns: Estimate, Std. Error, df, t, Sig. Rows: Intercept, Time.

Notice there is one more interesting piece of information in this table. SPSS uses adjusted degrees of freedom, which is something like the relative sample size, in estimating hypothesis tests (i.e., regarding the statistical significance of parameters). You can see that the relative sample size is larger in estimating the effect of growth over each interval of time than in estimating the initial status intercept. This is because there are more data points that can be used to estimate the change over time than to estimate the initial status intercept (see Table 12).

Summary

Much of our discussion about missing data suggests that dealing with missing data is not so much about "How much missing data is allowable?" but, rather, about how to develop a process to deal with the missing data. It is incumbent on researchers to be aware of how missing data will affect the analysis. We can definitely improve the quality of our analyses by giving attention to missing data in the preliminary phase of preparing the data for analysis. Even relatively small amounts of missing data on one or more variables can create some bias in the estimated parameters, so it is important to assess what this likely parameter bias might be, and then develop some type of strategy to address the problem (e.g., multiple imputation, ways to retain individuals with partial data, or analyses under various conditions with a comparison of the results).

I have one handout (which I did not include in this chapter) showing that even with small amounts of missing data (i.e., less than 60 missing cases) and over 6,500 individuals in the data set, each of three analytic approaches I compared made use of differing amounts of data. This was not a problem in such a large sample; however, it illustrates the point that different analytic approaches can make use of differing amounts of data. It becomes important to know how many cases are being included in an analysis. In smaller data sets, of course, the problem of missing data could be considerably magnified.

Class Activity

A researcher is interested in examining whether treatment (coded 1) or control group (coded 0) membership is related to knowledge acquisition in math. Students (N = 40) were randomly assigned to treatment or control conditions. They were also assessed in terms of their

prior knowledge (pretest). Unfortunately, there is missing data on both the knowledge posttest and the pretest. The data set used for this activity is ch8missingdataactivity.sav

1. Determine how much missing data there is on the two variables of concern and whether missing data on the posttest tends to be associated with group membership and missing data on the pretest.
2. Impute three data sets. Present the average results across the three data sets with the standard errors for group and pretest adjusted for variance in the imputing process.
3. After you obtain your averaged results, calculate the t ratio and determine whether the variables in the model are statistically significant at the p = .05 level.

You may want to begin by estimating a regression model with the listwise data, just to see where you are starting out. Then you can create dichotomous missing variables for the posttest and for the pretest. Finally, you can examine whether group membership and missing values on the pretest tend to predict missingness on the dependent variable (you can use logistic regression (REGRESSION: Binary logistic) to do this). Remember that the more important part of missing data analysis is whether standing on the independent variable is related to standing on the dependent variable. When the missing data are confined to the predictor, it is a bit easier to check whether missing data on, for example, the pretest is related to lower scores on the posttest. So one place we can start is by examining whether the predictor is related to higher or lower values of the outcome (which in this case we expect for both group and pretest) and then whether it is related to missing data on the outcome in a systematic way for similar standing on the predictor (i.e., a statistically significant relationship).
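As a descriptive first look before the logistic regression, the dichotomous missing variables and their association with group membership can be sketched in Python. The records below are hypothetical and are not the activity data set; the variable names (`miss_pre`, `miss_post`) and the helper functions are assumptions made for illustration.

```python
# Hypothetical records: (group, pretest, posttest); group 1 = treatment, 0 = control.
RECORDS = [
    (0, 12.0, 15.0), (0, None, None), (0, 10.0, None), (0, 14.0, 16.0),
    (1, 11.0, 19.0), (1, 13.0, None), (1, None, 20.0), (1, 15.0, 22.0),
]


def add_missing_flags(records):
    """Attach dichotomous indicators: 1 = value missing, 0 = value observed."""
    return [
        {
            "group": group,
            "pretest": pre,
            "posttest": post,
            "miss_pre": int(pre is None),
            "miss_post": int(post is None),
        }
        for group, pre, post in records
    ]


def proportion_flagged(rows, flag, by):
    """Proportion of cases with flag == 1 within each level of the `by` variable."""
    totals = {}
    for row in rows:
        n, k = totals.get(row[by], (0, 0))
        totals[row[by]] = (n + 1, k + row[flag])
    return {level: k / n for level, (n, k) in sorted(totals.items())}


rows = add_missing_flags(RECORDS)
by_group = proportion_flagged(rows, "miss_post", by="group")
by_pre = proportion_flagged(rows, "miss_post", by="miss_pre")
print(by_group)  # share of missing posttests in control (0) and treatment (1)
print(by_pre)    # share of missing posttests by whether the pretest is missing
```

Clearly unequal proportions across groups would be a signal to follow up with the logistic regression described above, which tests whether group membership and pretest missingness predict missingness on the posttest.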
Recall that for non-ignorable missing (NIM) data, the key is whether the probability of missing on the outcome is related to standing on the outcome, even for individuals with the same value on a covariate. So, for example, if individuals in the control group were responsible for 2/3 of the missing data and we know they have lower scores, it would be harder to argue that the greater missing data in that group is not biasing the results for the overall population estimates. Similarly, if most of the missing data were for low pretest scores, this might affect the overall estimates of learning at the end. In most cases, this rests on mounting an argument about why the data are as they are and whether this likely has a non-ignorable effect on the outcomes.

References

Enders, C. (2011a). Missing not at random models for latent growth curve analysis. Psychological Methods, 16.
Enders, C. K. (2011b). Analysis of missing data. Workshop at BYU, June 2-3.
Hox, J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). NY: Routledge.

Peugh, J. & Enders, C. (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research, 74.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: J. Wiley & Sons.

Example Using Missing Data 1

Ronald H. Heck and Lynn N. Tabata

Creating the Missing Data Variable (Miss)

Here is a data set (achieve subset MANOVAmiss.sav) with the actual missing data on the outcomes.