Longitudinal Modeling With Randomly and Systematically Missing Data: A Simulation of Ad Hoc, Maximum Likelihood, and Multiple Imputation Techniques


DANIEL A. NEWMAN
The Pennsylvania State University

For organizational research on individual change, missing data can greatly reduce longitudinal sample size and potentially bias parameter estimates. Within the structural equation modeling framework, this article compares six missing data techniques (MDTs): listwise deletion, pairwise deletion, stochastic regression imputation, the expectation-maximization (EM) algorithm, full information maximum likelihood (FIML), and multiple imputation (MI). The rationale for each technique is reviewed, followed by Monte Carlo analysis based on a three-wave simulation of organizational commitment and turnover intentions. Parameter estimates and standard errors for each MDT are contrasted with complete-data estimates, under three mechanisms of missingness (completely random, random, and nonrandom) and three levels of missingness (25%, 50%, and 75%; all monotone missing). Results support maximum likelihood and MI approaches, which particularly outperform listwise deletion for parameters involving many recouped cases. Better standard error estimates are derived from FIML and MI techniques. All MDTs perform worse when data are missing nonrandomly.

Multivariate longitudinal analyses hold promise for the study of individual change in organizations (see Chan & Schmitt, 2000, and Garst, Frese, & Molenaar, 2000, for examples). Unfortunately, attrition from multiwave studies can lead to large standard errors in parameter estimates because nonresponse is compounded across waves of data collection to produce small longitudinal sample sizes. Furthermore, nonrandom mechanisms leading to attrition can bias model parameters and engender misspecification and misestimation of the substantive model of change (Chan, 1998; Goodman & Blum, 1996; Muthen, Kaplan, & Hollis, 1987).

Author's Note: This article is based on the author's doctoral minor project. I would like to thank Mike Rovine for showing me Allison's direct estimation technique, and I am indebted to Kevin Murphy for his early input on the design. The article also benefited from conversations with Hock-Peng Sin and John Graham as well as from comments made on earlier versions by David Chan, Chuck Lance, and two anonymous reviewers. All correspondence should be directed to Dan Newman, The Pennsylvania State University, 429 Bruce V. Moore Building, University Park, PA 16802; dan148@psu.edu.

Organizational Research Methods, Vol. 6, No. 3, July 2003. Sage Publications.

The problems of survey nonresponse (i.e., reduction in statistical power and threat of parameter bias) are a particularly salient challenge for longitudinal researchers. Two important factors to consider when selecting a technique for analyzing missing data are the amount of data missing and the specific type (or mechanism) of missingness (Roth, 1994). The greater the percentage of missing data, the more important missing data approaches are for minimizing bias. As the portion of data missing reaches 15% to 20%, the choice of a missing data estimation technique can have substantial implications for the parameter estimates (Raymond & Roberts, 1987; Roth, 1994). The second factor that has been suggested for consideration in the choice of an MDT is the specific type of missingness, commensurate with the mechanism that gave rise to the missing values (Little & Rubin, 1987). The following section reviews several types of incomplete data and explains which are most likely to give rise to biased sample estimates.

Types of Missing Data

Survey data are usually arrayed in a rectangular matrix of variables by cases. If one imagines a matrix in which some of the data are missing, it is instructive to question whether the probability that a specific cell of the data matrix is empty is totally random or, alternatively, whether the probability of missingness depends on the value of the variable that would have occupied that cell or on the values of other variables. Understanding missing data that are not missing completely at random is a good first step in determining the appropriateness of any particular missing data procedure. Little and Rubin (1987) offered an explication of various missing data patterns, which has been helpfully abstracted by Roth (1994) and Schafer and Graham (2002). Little and Rubin (1987) explain that when considering a single variable Y, values of Y can be missing (a) randomly, (b) below some cut value on Y (e.g., those with IQ scores below 85 are not considered for hire), or (c) subject to a form of probabilistic censoring proportional to the value of Y (e.g., the less committed one is to the organization, the greater the probability that one fails to respond to the survey measuring commitment). The extent to which missing data are likely to bias sample estimates of the population mean of Y is negligible when data are missing randomly but not necessarily negligible when data are missing systematically. When considering two or more variables at once, there are even more possible missing data patterns, and nonrandom missingness can bias both variable means and covariance estimates (Rubin, 1976).

The first type of missing data is referred to as missing completely at random (MCAR) and describes a random mechanism for data loss in which the probability that any given datum is not recorded is equal across all respondents, independent of the values on both the variable with incomplete data and all other variables, measured or unmeasured. This type of missingness is unlikely to bias population mean estimates (Little & Rubin, 1987). For a missing-data mechanism to be classified as MCAR, it must be classified as both missing at random (MAR) and observed at random (OAR). Data are MAR if the missingness pattern does not depend on the values of the data that are missing, whereas data are OAR if the missingness pattern does not depend on the values of data that are observed (interestingly, it is impossible to test whether data are MAR, but possible to test whether they are OAR).

For two variables X and Y, data on Y are MAR if the probability of missingness on Y depends on X but not on Y after controlling for X. For instance, if an organization has three levels of hierarchy and individuals in higher levels are more likely to not report their incomes, then income data can still be MAR, as long as all the data within each level of hierarchy have an equal probability of missingness (independent of income). The MAR-OAR distinction is useful for pointing out circumstances that are MAR but not OAR (and therefore not MCAR). The above example with hierarchy and income is such a circumstance, because the probability of income nonresponse is related to observed levels of hierarchy but unrelated to income when hierarchy is controlled. Note also that the circumstance in which nonreporting of income is related to level of hierarchy and also related to level of income within each level of hierarchy is neither MAR nor OAR (alternatively denoted not missing at random, or NMAR) (Jamshidian & Bentler, 1999).

In a simple summary of the three missing data mechanisms, Schafer and Graham (2002) pointed out that MCAR, MAR, and NMAR can be distinguished by delineating the antecedents of the missing data on variable Y. That is, the probability that data are missing on Y can depend on (a) neither X nor Y (MCAR), (b) X but not Y when X is controlled (MAR), or (c) Y itself (NMAR). Last, the missing data mechanism may be related to variables not included in the study (Graham & Donaldson, 1993), may be attributable to the sensitivity of the information elicited by the item or scale itself (as in self-report measures of illegal behavior), or may be related to a combination of two or more variables (Kim & Curry, 1977; Roth, 1994).

The mechanisms accountable for data missingness are of direct relevance to the MDT selected and depend somewhat on the objective of the analysis (Little & Rubin, 1987). For example, MAR is not a problem if one's interest is in studying the conditional distributions of income given a particular level of hierarchy, yet MAR can lead to bias when estimating parameters across levels of hierarchy, such as mean income (due to selective income data missingness from those at higher levels in the hierarchy) or the covariance between income and some other variable (due to loss of variation in measures). For a more in-depth review of these topics, see Little and Rubin (1987), Muthen et al. (1987), and Roth (1994).

On the basis of the missing data conceptualization just reviewed, one would predict that the parameters in a longitudinal latent variable model would be biased by nonrandom attrition. To understand precisely how various analytic treatments can reduce the bias created by nonrandom missingness mechanisms in such complex multivariate models, however, more research is required.
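The three mechanisms are easy to confuse in the abstract, so a small simulation can help. The sketch below is an illustrative Python/NumPy toy, not the SAS code used later in this study: it deletes values of a variable Y completely at random (MCAR), as a function of an observed covariate X only (MAR), and as a function of Y itself (NMAR), and then shows how the complete-case mean of Y behaves under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# X is always observed (e.g., level of hierarchy); Y is the survey variable that can go missing
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.87, size=n)   # corr(x, y) is roughly .5

def delete(values, miss_prob):
    """Set values to NaN wherever a uniform draw falls below that case's missingness probability."""
    out = values.copy()
    out[rng.uniform(size=out.size) < miss_prob] = np.nan
    return out

y_mcar = delete(y, np.full(n, 0.25))                        # MCAR: same 25% chance for everyone
y_mar = delete(y, np.where(x > np.median(x), 0.40, 0.10))   # MAR: depends only on observed x
y_nmar = delete(y, np.where(y < np.median(y), 0.40, 0.10))  # NMAR: depends on y itself

for label, yy in [("MCAR", y_mcar), ("MAR", y_mar), ("NMAR", y_nmar)]:
    observed = yy[~np.isnan(yy)]
    print(f"{label}: {np.isnan(yy).mean():.0%} missing, complete-case mean of Y = {observed.mean():+.3f}")
# The complete-case mean of Y stays near its population value (0) only under MCAR;
# under MAR and NMAR the cases that remain are no longer representative of Y.
```

Note that the MAR deletion here biases the marginal mean of Y even though it satisfies the MAR definition, which is exactly the point made above about estimating mean income across levels of hierarchy.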

MDTs for Latent Variable Models

Ad Hoc Procedures

Listwise deletion. The most common technique for handling missing data problems in latent variable modeling is to analyze only those cases for which data are available on all variables. This technique omits from analysis any cases that are not entirely complete. In practice, this approach can severely reduce the effective sample size (increasing standard errors). Although listwise deletion provides relatively unbiased estimates under MCAR, it can lead to parameter bias when the missingness mechanism is not completely random. For instance, Allison (2002) pointed out that listwise deletion biases regression parameters under MAR when missingness on the predictor (X) is based on the criterion (Y). This study assesses the performance of listwise deletion for latent variable models under MCAR, MAR, and NMAR conditions.

Pairwise deletion. As an alternative to listwise deletion, it is possible to calculate covariances among each pair of variables using all available cases for each pair. This technique has the advantage of including information in the covariance matrix from cases that would otherwise have been discarded under listwise deletion. Pairwise deletion is unbiased in large samples under MCAR but suffers from potentially serious parameter bias when data are only MAR (Allison, 2002). Currently, the most damning problem with using pairwise deletion for latent variable modeling is the lack of any appropriate method for estimating a single sample size to be used for the analyses (Marsh, 1998). When different amounts of data are missing for different variables, the precision of estimation can vary greatly across parameters in the model. Using a single sample size (e.g., the minimum or mean N per correlation) to estimate all parameters will inevitably give inaccurate standard errors for some parameters in the model.

Single imputation. The term imputation refers to a set of techniques that fill in values for the missing data. Imputation is usually carried out as a first step, prior to conducting the statistical analyses as though there were no missing data to begin with. The most commonly implemented forms of imputation are mean substitution (replacing the missing value with the mean value on that variable calculated from all other respondents), mean person substitution (replacing the missing value for an item with the mean value on similar items reported by the same individual), and hot-deck imputation (replacing the missing value with the value reported by another respondent ["donor"], who is either chosen randomly from the sample or selected on the basis of similarity to the recipient in terms of values reported for other variables [i.e., smallest Euclidean distance]). Last, regression imputation is a popular technique, in which the variable with missing data is regressed onto all other variables to produce a regression equation (on the basis of the complete cases). Missing values are then replaced with predicted values from the regression equation. This sort of regression imputation has two problems stemming from the fact that the imputed values perfectly fit a regression line: (a) the variance of the imputed variable is underestimated, and (b) correlations with the imputed variable are overestimated (because the underestimated variance of the imputed variable is in the denominator of the correlation formula). One method used to redress these problems is the addition of a random error term to the imputed values (known as stochastic regression imputation). The random error term is a random normal variate with a mean of zero and a standard deviation equal to the standard error of estimate of the regression equation.

Allison (2002) pointed out that regression parameter estimates based on regression imputation under MCAR are relatively unbiased in large samples (Gourieroux & Monfort, 1981). Unfortunately, all single imputation techniques have the fundamental flaw of underestimating standard errors (because the addition of imputed data to the incomplete data set results in overestimation of the actual sample size).
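The difference between deterministic and stochastic regression imputation amounts to one added term. The sketch below is an illustrative Python/NumPy version (the study itself used SAS PROC REG, as described in the Method section): the imputation regression is fit on the complete cases, and each imputed value is the predicted score plus a random normal residual whose standard deviation equals the standard error of estimate.

```python
import numpy as np

def stochastic_regression_impute(x, y, rng):
    """Fill in missing y values with predicted scores plus a random normal residual."""
    obs = ~np.isnan(y)
    b1, b0 = np.polyfit(x[obs], y[obs], 1)                   # imputation regression from complete cases
    resid_sd = np.std(y[obs] - (b0 + b1 * x[obs]), ddof=2)   # standard error of estimate
    y_filled = y.copy()
    miss = ~obs
    y_filled[miss] = b0 + b1 * x[miss] + rng.normal(scale=resid_sd, size=miss.sum())
    return y_filled

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)
y_mis = y.copy()
y_mis[rng.uniform(size=500) < 0.30] = np.nan                 # 30% MCAR missingness on y

y_filled = stochastic_regression_impute(x, y_mis, rng)
print(round(np.var(y), 3), round(np.var(y_filled), 3))       # variances are now comparable
```

Dropping the rng.normal term reproduces deterministic regression imputation and the two problems noted above (shrunken variance, inflated correlations); even with the random term, treating the filled-in data set as if it were fully observed still overstates the sample size, which is the standard-error problem that multiple imputation is designed to solve.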

Likelihood-Based Estimation Procedures

Aside from the ad hoc MDTs reviewed above, newer approaches have been developed that are based on likelihood functions and are well tailored to latent variable modeling. In a review of latent variable techniques for missing data analysis, Rovine (1994) classified the available methods into two types. The first type includes methods that use sufficient statistics to estimate a complete data matrix, which can then serve as input for a latent variable model (e.g., EM-type algorithms) (Dempster, Laird, & Rubin, 1977). The second type of method for latent variable modeling with missing data is direct parameter estimation, which can be implemented using multigroup structural equation modeling (see Allison, 1987) or full information maximum likelihood (FIML) (Finkbeiner, 1979). In general, maximum likelihood (ML) approaches operate by estimating the set of parameters that maximizes the probability of obtaining the data that were observed. Enders (2001) provided a useful review in which he distinguished the three currently available ML algorithms: (a) the EM algorithm, (b) the multiple-group approach, and (c) FIML. Whereas the ad hoc MDTs reviewed above (i.e., listwise and pairwise deletion, single imputation methods) generally require that missing data be MCAR, ML methods have the advantage of being theoretically unbiased under both MCAR and MAR conditions. The reason is that ML algorithms implicitly account for the dependencies of missingness on other variables in the data set. For example, if the probability that data on variable Y are missing depends on the value of variable X, ML algorithms produce estimates that incorporate the conditional distributions of the missing data on Y given the observed data on X, whereas ad hoc approaches do not (Enders, 2001).

Direct estimation: FIML and multiple-group approaches. The two direct estimation ML approaches (FIML and the multiple-group approach) are essentially alternate versions of the same method, implemented with slightly different algorithms. Although the generic FIML approach has now superseded the multiple-group approach as the direct-estimation method of choice (due to its superior flexibility and recent software availability), procedures for applying the multigroup algorithm became widely available first. The multiple-group approaches are therefore described first here, as an introduction to the basic principles of direct estimation ML. In 1987, two techniques were formally proposed for direct ML estimation using the notion of dividing the sample into subgroups based on missing data patterns (Allison, 1987; Muthen et al., 1987). According to Allison (1987), his ML method of linear modeling is more efficient with data than listwise deletion (because it does not discard data from partially complete cases) and gives more consistent estimates of standard errors than pairwise deletion (because it uses the correct sample sizes for each parameter). Essentially, the multigroup approaches divide the incomplete data set into several subsamples based on different patterns of missingness (e.g., one subsample with complete data, one subsample with data missing on variables X1 and Z1 only, etc.). All unique missingness subgroups are then incorporated into a single multigroup structural equation model, which is estimated simultaneously for all subsamples while appropriate equality constraints are imposed across subsamples (Allison, 1987, p. 73). Multiple-group methods are most useful when the number of subsamples is small and the number of cases in each subsample is large.
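As a concrete illustration of what "dividing the sample into subgroups based on missing data patterns" involves, the short pandas sketch below (illustrative Python, not LISREL syntax; the variable names and subgroup sizes simply mirror the 25% monotone design described later in the Method section) tabulates the distinct missingness patterns that would each become one group in the multigroup specification.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Toy stand-in for the six observed variables (commitment and intentions at three waves)
data = pd.DataFrame(rng.normal(size=(440, 6)),
                    columns=["OC1", "RI1", "OC2", "RI2", "OC3", "RI3"])

# Monotone dropout: 110 cases lose Waves 2 and 3, a further 82 lose Wave 3 only
data.loc[:109, ["OC2", "RI2", "OC3", "RI3"]] = np.nan
data.loc[110:191, ["OC3", "RI3"]] = np.nan

# One row per case: 1 = observed, 0 = missing; identical rows share a missingness pattern
patterns = data.notna().astype(int)
print(patterns.value_counts())
# Each distinct pattern becomes one group in the multigroup model, with equality
# constraints imposed on the substantive parameters across groups.
```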

Although several authors have proposed multiple-group structural equation modeling approaches for estimating missing data subgroups (Lee, 1986; Muthen et al., 1987; Werts, Rock, & Grandy, 1979), Allison's (1987) phantom variable technique was historically one of the more popular and easy to implement. Allison showed how LISREL's multigroup routine (Jöreskog & Sörbom, 1996) could be used to implement Hartley and Hocking's (1971) likelihood function for a multivariate normal distribution with data missing at random. The technique worked by specifying a separate phantom factor on which all variables in the model had a loading of 1 when those variables were present in a data subsample, and a loading of 0 when the variables were missing from a subsample. Because the multigroup ML procedure calculates separate likelihood functions for each missing data subgroup (which are then aggregated and maximized), it has been touted as loosely analogous to pairwise deletion (Enders, 2001) (see McArdle & Hamagami, 1992, for a multigroup application with nonlinear change).

Since the introduction of multiple-group direct estimation approaches, techniques have evolved for executing a more flexible FIML procedure. FIML was introduced by Finkbeiner (1979) and has been included in the popular structural equation modeling software AMOS (Arbuckle, 1995), LISREL (Jöreskog & Sörbom, 1996), and MPLUS (Muthen & Muthen, 1998). Conceptually, FIML is the same as the multigroup approach, except that it begins with individual-level (rather than group-level) likelihood functions (Enders, 2001). As illustrated by Duncan, Duncan, and Li (1998), the χ² test for ML models requires a comparison of likelihood functions between specially constructed unrestricted (H0) and restricted (H1) models (the LISREL FIML routine does this automatically) because there is no appropriate single value for N that applies to the entire model.

EM algorithm. The EM algorithm (Dempster et al., 1977) is a maximum likelihood procedure that produces estimates of the complete-data correlation matrix and means. The algorithm produces these estimates by repeatedly iterating through two steps, called the E-step (for expectation) and the M-step (for maximization). The E-step calculates an expected value for the complete-data likelihood function of the missing data, based on the observed (incomplete) data and the current set of parameter estimates (the very first E-step uses the listwise or pairwise deleted correlation matrix and means, whereas subsequent E-steps use the parameters produced from the previous M-step). The E-step is essentially like conducting a series of regression imputations to produce expected missing values, in which the regressions are based on the current correlation matrix (with some random error terms added) and conditioned on the observed values of other variables. After the E-step gives an expectation for the complete-data likelihood function based on the observed data and current parameters, the M-step maximizes this expectation (i.e., it maximizes the likelihood) to produce a new, updated set of parameters (i.e., a new correlation matrix and means). These new parameters are then combined with the observed (incomplete) data to yield a new expectation for the complete-data likelihood function (the second E-step), which is then maximized to produce an even newer set of updated parameters (the second M-step). The iteration between E-steps and M-steps continues until some convergence criterion is met, at which point the algorithm has produced a final correlation matrix and vector of means. The correlation matrix and means can then be used to estimate a latent variable model.
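The two steps are easiest to see in the smallest possible case: a bivariate normal pair in which X is fully observed and Y is partly missing. The sketch below is an illustrative NumPy implementation of the textbook EM updates for the mean vector and covariance matrix (it is not SAS PROC MI, and it omits the structural model entirely); the E-step fills in expected sufficient statistics for the missing Y values, and the M-step re-estimates the parameters from them.

```python
import numpy as np

def em_bivariate(x, y, tol=1e-6, max_iter=500):
    """EM estimates of the mean vector and covariance matrix when y has missing values."""
    miss = np.isnan(y)
    mu = np.array([x.mean(), y[~miss].mean()])         # start from complete-case values
    cov = np.cov(x[~miss], y[~miss])
    for _ in range(max_iter):
        # E-step: expected y, y**2, and x*y for the missing cases, given x and current parameters
        beta = cov[0, 1] / cov[0, 0]                    # slope of the regression of y on x
        resid_var = cov[1, 1] - beta ** 2 * cov[0, 0]   # conditional variance of y given x
        y_hat = np.where(miss, mu[1] + beta * (x - mu[0]), y)
        y2_hat = np.where(miss, y_hat ** 2 + resid_var, y ** 2)
        xy_hat = x * y_hat
        # M-step: maximum likelihood means and covariance from the expected sufficient statistics
        mu_new = np.array([x.mean(), y_hat.mean()])
        sxy = xy_hat.mean() - mu_new[0] * mu_new[1]
        cov_new = np.array([[np.mean(x ** 2) - mu_new[0] ** 2, sxy],
                            [sxy, y2_hat.mean() - mu_new[1] ** 2]])
        converged = np.max(np.abs(mu_new - mu)) < tol and np.max(np.abs(cov_new - cov)) < tol
        mu, cov = mu_new, cov_new
        if converged:
            break
    return mu, cov

rng = np.random.default_rng(4)
x = rng.normal(size=400)
y = 0.6 * x + rng.normal(scale=0.8, size=400)
y[x > 0.5] = np.nan                                     # MAR: missingness depends only on observed x
print(em_bivariate(x, y))                               # recovers means and covariances despite the dropout
```

Passing the converged correlation matrix and means to a latent variable model as input, rather than fitting the model directly from raw data, is what makes this an indirect procedure, as discussed next.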

Because the EM algorithm only produces correlation and mean parameters that must subsequently serve as input for the structural equation model, this technique is considered an indirect ML procedure, in contrast with the multigroup and FIML approaches, which can estimate latent variable models directly from raw data. Some distinctions can be made between the EM algorithm and the direct estimation ML approaches (i.e., FIML). Although both produce ML estimates, the EM algorithm does not impose the restrictions on the covariance matrix implied by the structural model. Enders (2001) suggested that an advantage of the EM algorithm over direct ML estimation is the ability of the EM algorithm to incorporate variables into the missing data treatment that are not part of the substantive model being tested (i.e., auxiliary variables). To understand this advantage, recall that all of the ML approaches reviewed here provide some protection against parameter bias under MAR (by keeping track of the conditional distributions of the missing data). Unfortunately, the direct estimation ML procedures as generally applied only gain protection from MAR when the variables supposed to produce the missingness (or correlated with the variables containing missingness) are included in the model being tested. By contrast, the EM algorithm can provide ML estimates of the means and correlations based on a large set of (both central and auxiliary) variables that may be suspected to produce missingness, while using only a subset of these variables in the substantive model of interest. An anonymous reviewer of this article suggested that FIML could indeed be used in combination with variables extraneous to the substantive model that are believed to produce missingness. The recommended technique for combining direct ML with auxiliary variables is to specify a model in which the auxiliary variables are allowed to be correlated with all observed exogenous variables and with the error terms for all observed endogenous variables. Recent work has begun to address the issues of incorporating auxiliary variables into FIML approaches (see Collins, Schafer, & Kam, 2001; Graham, in press).

Multiple Imputation (MI)

As mentioned earlier, a fundamental problem with single imputation is the inability to get accurate estimates of standard errors. MI is a procedure by which missing data are imputed several times (e.g., using regression imputation) to produce several different complete-data estimates of the parameters. The parameter estimates from each imputation are then combined to give an overall estimate of the complete-data parameters as well as reasonable estimates of the standard errors. MI has one complexity, however. If each of the imputations used in the MI procedure were based on regression parameters from the observed data, then it would be assumed that these regression imputation parameters are the true population parameters, when in fact they are only sample estimates from a sampling distribution of betas. Therefore, when multiple imputation is implemented, rather than using the sample regression parameters for each imputation, new parameters are drawn randomly for each imputation from a Bayesian posterior distribution of the regression imputation parameters. The difficulty created by the random draws in the MI procedure is that slightly different results are recovered each time the procedure is used, even when the procedure is used twice on the same data set.
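To see what "drawn randomly from a Bayesian posterior distribution" means operationally, the sketch below gives an illustrative NumPy version of the regression-imputation step under a noninformative prior (a simplified stand-in for, not a reproduction of, the SAS PROC MI regression method used later in the Method section): each pass draws a residual variance, then draws regression coefficients around the least-squares estimates, and only then imputes, so that the M completed data sets differ both in their random residuals and in the imputation model itself.

```python
import numpy as np

def mi_regression_impute(x, y, m=10, rng=None):
    """Return m completed copies of y, imputing missing values from posterior parameter draws."""
    if rng is None:
        rng = np.random.default_rng()
    obs, miss = ~np.isnan(y), np.isnan(y)
    X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
    X_mis = np.column_stack([np.ones(miss.sum()), x[miss]])
    beta_hat, *_ = np.linalg.lstsq(X_obs, y[obs], rcond=None)
    resid = y[obs] - X_obs @ beta_hat
    df = obs.sum() - X_obs.shape[1]
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)
    completed = []
    for _ in range(m):
        sigma2 = resid @ resid / rng.chisquare(df)                  # draw the residual variance
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)  # draw coefficients given sigma2
        y_fill = y.copy()
        y_fill[miss] = X_mis @ beta + rng.normal(scale=np.sqrt(sigma2), size=miss.sum())
        completed.append(y_fill)
    return completed
```

Each completed data set is then analyzed as usual and the results are pooled; because every call uses fresh random draws, two runs on the same data yield slightly different (and equally legitimate) pooled estimates, which is the complexity noted above.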

MI assumes that data are MCAR or MAR and requires that data be imputed under a particular model (the multivariate normal model will suffice for most applications, and MI estimates are probably not as sensitive as ML estimates to violations of multivariate normality) (Allison, 2002).

Research Questions and Contribution

In conducting a Monte Carlo analysis of the above MDTs, several important questions will be answered:

Question 1: Which techniques will produce the smallest missing data errors for estimates of three types of structural parameters (cross-lagged, stabilities, and synchronous), based on a three-wave panel model with random monotone missing data?

Question 2: Which techniques will produce the most appropriate standard errors for three types of structural parameters (cross-lagged, stabilities, and synchronous), based on a three-wave panel model with random monotone missing data?

Question 3: Will the missing data errors, standard errors, and the differences in errors between techniques be altered when the amount of missing data varies from 25% to 50% to 75% at each wave?

Question 4: Will the missing data errors, standard errors, and the differences in errors between techniques be altered when the mechanism that produced the missing data changes from a completely random mechanism (MCAR) to a systematic missing data mechanism (MAR and NMAR)?

This study represents the first full-scale simulation of all six missing data approaches. Furthermore, the techniques are applied to an increasingly important design (the three-wave longitudinal design), yet one for which missing data is a typical problem. Third, this research investigates bias and standard errors of structural parameters within the SEM framework, which is a particularly promising analytic mode for longitudinal research (Chan, 1998; Lance, Vandenberg, & Self, 2000). Finally, systematic missing-data mechanisms are tested, which helps to determine how the techniques perform when the assumptions of random missingness are violated.

Method

Design

In the following study, MDTs for analyzing incomplete data are assessed via a Monte Carlo experiment with three factors. This design incorporates six MDTs, three levels of missingness (25%, 50%, 75%), and three missing data mechanisms (MCAR, MAR, NMAR). Four dependent variables are calculated. First, the average error of parameter estimates (or missing data error) is tabulated as the mean absolute difference between estimates derived from complete data and those derived from the respective MDTs. This can be expressed by the equation Average Error = Σ |complete-data estimate − missing-data estimate| / N. Next, the average standard error estimates observed under each technique will be recorded. These are the estimates that in practice form a basis for significance testing and the construction of confidence intervals. In addition to the average observed standard errors, estimates of the true standard errors under each technique will be derived from the Monte Carlo simulation as the standard deviation of parameter estimates across replications. The average of the observed standard errors will then be compared to the Monte Carlo estimates of the true standard errors to show the extent of over- or underestimation of standard errors by each MDT. Last, each technique is assessed on whether the structural equation modeling algorithm failed to converge.
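Both of the first two dependent variables reduce to simple array operations once the replication results are collected. The sketch below is an illustrative NumPy computation on made-up estimates (the array names and values are hypothetical; in the study these quantities come from 100 LISREL runs per condition): the average error is the mean absolute difference from the complete-data estimates, and the Monte Carlo ("true") standard error is the standard deviation of an MDT's estimates across replications.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical estimates of one structural parameter across 100 replications
est_complete = 0.30 + rng.normal(scale=0.04, size=100)   # from the complete data sets
est_mdt = 0.30 + rng.normal(scale=0.07, size=100)        # from the same data sets after an MDT

average_error = np.mean(np.abs(est_complete - est_mdt))  # missing data error
true_se = np.std(est_mdt, ddof=1)                        # Monte Carlo estimate of the true SE

print(f"average error = {average_error:.3f}, Monte Carlo SE = {true_se:.3f}")
# The mean of the standard errors reported by the MDT in each run would then be
# compared against true_se to gauge over- or underestimation.
```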

Procedure

The method includes seven steps, partly mimicking those used by Switzer, Roth, and Switzer (1998) and Roth and Switzer (1995).

1. Generate simulated longitudinal data. For the present study, data simulation required a population matrix for three consecutive waves of organizational commitment and turnover intentions data. Although published individual-level organizational research with three-wave longitudinal designs is sparse, the research reported by Farkas and Tetrick (1989) provided a prime example of the sort of data for which the techniques reviewed above might be useful (see Table 1 for the population data matrix). The three waves of data are spaced roughly 10 months apart and represent a sample of 440 first-term military personnel (Farkas & Tetrick, 1989). Based on the population matrix, 100 sample data sets were created. Each data set held 440 observations of six variables, generated to be multivariate normal by PRELIS2 (Jöreskog & Sörbom, 1993). These 100 samples were used in each cell of the experimental design.

2. Test the theorized multivariate model on each complete data set. A theoretically predicted three-wave panel model of organizational commitment and reenlistment intentions was estimated in this study (see Figure 1). Theoretically, the predicted cross-lagged model parameters suggest that organizational commitment leads to intentions to remain with the organization during an initial 6-to-12-month developmental period during which intentions are labile (i.e., from Time 1 to Time 2) (Porter, Steers, Mowday, & Boulian, 1974), but that after intentions crystallize (from Time 2 to Time 3), reported commitment levels are adjusted to reflect established intentions (Bateman & Strasser, 1984; Farkas & Tetrick, 1989). The stability parameters linking each construct to itself across time are estimated, as are the synchronous ψ parameters, which account for the cumulative influence of variables external to the model (Anderson & Williams, 1992; Brett, Feldman, & Weingart, 1990). As can be seen in Figure 1, the measurement model is a single-indicator model (the same as used by Farkas & Tetrick, 1989), wherein λ parameters were set equal to the square root of each indicator's Cronbach's α reliability, and random error variance was set at (1.00 − α) times the variance of the indicator. After testing the model on each complete sample, LISREL 8.50 estimates were recorded for nine key model parameters (see Figure 1).

3. Delete data randomly. In the 25% missingness condition, 25% of the data was deleted at each wave for each data set. Twenty-five percent or more missingness per wave is typical in longitudinal designs such as this one (see Fullagar & Barling, 1989; Meyer, Bobocel, & Allen, 1991). Participants deleted at the second wave did not return for the third wave. This resulted in a loss of 110 participants in the second wave and 192 participants in the third wave (110 participants missing Waves 2 and 3 plus 82 participants missing Wave 3 only).

[Table 1. Population Correlation Matrix: means, standard deviations, and correlations among organizational commitment and reenlistment intentions at Times 1, 2, and 3. Source: Farkas and Tetrick (1989).]

Consequently, each sample was essentially divided into three subsamples, with each subsample exhibiting a unique missing data pattern. The three subsample patterns of missing data created for each sample were as follows: (a) no missing data (248 participants), (b) missing both the second and third waves of data (110 participants), and (c) missing third-wave data only (82 participants). This overall scheme of missingness, in which participants who leave the sample at one wave do not return at any later wave, has been referred to as the monotone missing data pattern (Marini, Olsen, & Rubin, 1980) and represents a great portion of the data collected using longitudinal designs. Subgroup sample sizes for the 50% and 75% missingness conditions were derived in the same manner. Deletion was conducted using SAS Version 8e software, with variables dropped from cases on the basis of two independent random uniform numbers.

4. Delete data systematically. In the MAR systematic deletion condition, participants were arranged so that the probability of deletion from each wave decreased linearly with the organizational commitment reported at the previous wave. This trend of enhanced likelihood of survey noncompliance among individuals with lower levels of organizational commitment is consistent with previous research (Rogelberg, Luong, Sederburg, & Cristol, 2000). One method for achieving the linear data loss pattern was modeled by Switzer et al. (1998) and involves three steps: (a) observations are rank-ordered by organizational commitment scores from the previous wave, (b) rank scores are linearly transformed to probability-of-deletion scores calibrated to produce 25% deletion per wave in Step (c), and (c) deletion of an observation is determined by comparing the transformed probability value from Step (b) to a random uniform number and deleting cases for which the probability-of-missingness variable exceeds the random uniform variable. As a result of this process, participants in the 25% missingness condition have a linearly increasing chance of being deleted as their organizational commitment scores decrease, with the least committed respondent having a .5 probability of deletion and the most committed respondent having a .0 probability of deletion.

[Figure 1. Three-Wave Panel Model of Organizational Commitment (OC) and Reenlistment Intentions (RIs). Note. Parameters of interest: βs and ψs.]

At its basis, Switzer et al.'s (1998) approach for simulating systematically missing data involves creating a cut score by summing a substantive variable theorized to influence missingness (e.g., organizational commitment) and a random uniform variable. Interestingly, when using this approach, the random variable greatly overwhelms the substantive variable in its relation to the missingness. To demonstrate this, correlations were calculated between the cut score and its two components across all cases in the simulation. The resulting correlation between the cut score and its random component was .95, whereas the correlation between the cut score and its substantive component was only .3. This may explain why Switzer et al. (1998) found little empirical difference between their random and systematic missingness conditions. For the present study, it was desirable to simulate stronger systematic missingness. This was accomplished by constraining the variance of the random uniform component of the cut score to equal the variance of the substantive component of the cut score (in this case, organizational commitment), with the result that the cut score was composed of equal parts random and systematic variance (i.e., the correlation of each component with the cut score was .7). This approach was used to simulate the 25%, 50%, and 75% missingness conditions.

To simulate the MAR condition, missingness on Wave 2 variables was based on Wave 1 commitment, whereas missingness on Wave 3 variables was based on Wave 2 commitment. Some missingness on Wave 3 variables was also based on Wave 1 commitment, because the monotone missing pattern requires that those cases missing in Wave 2 are also missing in Wave 3. For the NMAR mechanism, the probability that a datum is missing on a variable Y must depend on the value of Y itself. To simulate NMAR, missingness on Wave 2 variables depended on Wave 2 commitment, whereas missingness on Wave 3 variables depended on Wave 3 commitment (and on whether the case was already missing in Wave 2). Thus, under the condition labeled NMAR, missingness on Wave 2 commitment scores was strictly NMAR, missingness on Wave 3 commitment scores was partially NMAR, and missingness on Wave 2 and Wave 3 reenlistment intentions was MAR.

5. Test the theorized multivariate model using various MDTs.

Listwise deletion. Each data set was subjected to listwise deletion prior to the estimation of the three-wave panel model. All nine parameter estimates and standard errors of interest were computed in LISREL 8.50 and recorded for each sample.

Pairwise deletion. Each data set was subjected to pairwise deletion using the PROC CORR routine in SAS software Version 8e. The mean sample size per correlation (NPC) was used for each latent variable model, as tentatively recommended by Marsh (1998).

Stochastic regression imputation. Regression imputation was carried out using the PROC REG routine in SAS 8e. The regression parameters for each variable with partially incomplete data were estimated on the basis of all available cases with complete data. Complete cases for each partially incomplete variable were regressed on all complete cases from the previous waves. For example, Wave 3 commitment was regressed on Wave 1 commitment and Wave 1 reenlistment intentions (using all available data), and the resulting regression parameters were then used to impute predicted values of Wave 3 commitment for the missing data subgroup in which data from Waves 2 and 3 were missing. As explained earlier, a random error component was added to each predicted score to more accurately retain the overall variability of the partially imputed variables. This random error component was estimated as the product of a random normal variate and the standard deviation of the regression residuals estimated from the complete data for each replication.

FIML. FIML is a new feature in LISREL 8.50 that operates by simply adding the command mi = . to the data step.

EM algorithm. The EM algorithm was implemented in SAS 8e under the PROC MI routine. The default settings of 200 maximum iterations and a convergence criterion of .0001 failed to produce convergence under the 75% missingness conditions and were adjusted as follows: (a) MCAR, maximum iterations raised to 400; (b) MAR, maximum iterations raised to 1,100 and convergence criterion raised to .001; and (c) NMAR, maximum iterations raised to 600 and convergence criterion raised to .001. Correlation matrices and means created by the EM algorithm were then input to LISREL 8.50, and models were estimated using the complete sample N of 440.

MI. When data are missing in a monotone pattern (e.g., when those respondents in a longitudinal study who leave the sample do not return), the regression method for MI is appropriate. PROC MI in SAS 8e, which is based on the approach described by Schafer (1997), was used to create 10 imputations for each data set. These are stochastic regression imputations based on random draws from a posterior distribution. Once the 10 imputations are completed for each replication, each imputation is analyzed in LISREL 8.50, and the resulting parameters are averaged across imputations to produce final parameter estimates. Estimates of standard errors are given by the equation Standard Error = {(1/M) Σ s_k² + [1 + 1/M][1/(M − 1)] Σ (p_k − p_mean)²}^½, where M is the number of imputations, p_k is the parameter estimate from imputation k, and s_k is the standard error from imputation k (Rubin, 1987, as applied by Allison, 2002).
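Written out, the combining rule is only a few lines. The sketch below pools one parameter across M = 10 imputations with illustrative NumPy code and hypothetical values for p_k and s_k; it implements exactly the expression above (average within-imputation variance plus (1 + 1/M) times the between-imputation variance, square-rooted).

```python
import numpy as np

# Hypothetical estimates (p) and standard errors (s) of one parameter from 10 imputations
p = np.array([0.31, 0.29, 0.33, 0.30, 0.28, 0.32, 0.30, 0.31, 0.29, 0.30])
s = np.full(10, 0.05)
m = p.size

within = np.mean(s ** 2)                          # (1/M) * sum of s_k^2
between = np.sum((p - p.mean()) ** 2) / (m - 1)   # [1/(M - 1)] * sum of (p_k - p_mean)^2
pooled_se = np.sqrt(within + (1 + 1 / m) * between)

print(f"pooled estimate = {p.mean():.3f}, pooled SE = {pooled_se:.4f}")
```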

6. Compare parameter estimates from the various techniques to those made using complete data. The average errors (differences between estimates from complete data and estimates from the various MDTs) were calculated across the 100 studies in each condition.

7. Assess and compare standard error estimates from the various techniques. Two estimates of the standard error were computed for each parameter under each condition. First, the true standard errors were estimated as the standard deviations of parameter estimates across replications in the simulation. Next, these estimates are contrasted with the average standard errors of estimates produced by single samples using each MDT. The latter represent the estimates that a researcher would derive when implementing each technique.

Results and Discussion

Upon evaluation of 100 simulated samples in each of 54 conditions (6 MDTs × 3 levels of missingness × 3 mechanisms of missingness), average errors and standard errors were computed for each parameter (see Tables 3, 4, and 5 and Figures 2a to 2c). From these results, several trends can be observed. To aid in interpretation of these results, an analysis of variance was conducted to evaluate selected main effects and interactions on the average missing data errors of model parameters (see Table 2). No significance tests are provided because statistical power in simulations can always be adjusted arbitrarily by altering the number of replications.

Table 2
Selected Analysis of Variance Results: Average Parameter Error
Source | F or t | Parameters
Percentage missing |  | All parameters
Missing data technique |  | All parameters
Missingness mechanism | 7.78 | All parameters
Missing data technique contrasts
Ad hoc versus ML and MI | 6.35 | All parameters
Listwise versus ML and MI | 5.73 | All parameters
Listwise versus pairwise | 3.82 | All parameters
Pairwise versus ML and MI | 1.04 | All parameters
Regression versus listwise | 0.79 | All parameters
FIML versus EM algorithm | 0.06 | All parameters
ML versus MI | 0.11 | All parameters
Listwise versus ML and MI | 6.95 | Wave 1 and Wave 2 parameters, no Wave 3 (a)
Listwise versus pairwise | 4.45 | Wave 1 and Wave 2 parameters, no Wave 3
Regression versus listwise | 7.67 | Wave 2 and Wave 3 parameters, no Wave 1 (b)
Pairwise versus ML and MI | 1.69 | MAR-sensitive parameters (c)
Missingness mechanisms
MAR versus MCAR | 3.08 | MAR-sensitive parameters (c)
NMAR versus MCAR |  | NMAR-sensitive parameters (d)
NMAR versus MAR | 4.33 | MAR- and NMAR-sensitive parameters (e)
Interactions
Ad hoc × Percentage Missing | 2.13 | All parameters
Listwise versus ML and MI × Percentage Missing | 7.03 | Wave 1 and Wave 2 parameters, no Wave 3
Listwise versus Pairwise × Percentage Missing | 6.57 | Wave 1 and Wave 2 parameters, no Wave 3
Ad hoc × MAR versus MCAR | 1.65 | All parameters
Listwise versus ML and MI × MAR versus MCAR | 2.27 | MAR-sensitive and Wave 1 and Wave 2 parameters, no Wave 3 (f)
Pairwise versus ML and MI × MAR versus MCAR | 0.83 | MAR-sensitive parameters
Note. ML = maximum likelihood; MI = multiple imputation; FIML = full information maximum likelihood; EM = expectation-maximization; MAR = missing at random; MCAR = missing completely at random; NMAR = not missing at random; OC = organizational commitment; RI = reenlistment intentions.
a. Parameters OC1-RI2, RI1-RI2, OC1-OC2, OC1-RI1, and OC2-RI2.
b. Parameters RI2-OC3, RI2-RI3, OC2-RI2, and OC3-RI3.
c. Parameters OC1-RI2, OC1-OC2, and OC2-OC3.
d. Parameters OC1-OC2, OC2-OC3, OC2-RI2, RI2-OC3, and OC3-RI3.
e. Parameters OC1-OC2 and OC2-OC3.
f. Parameters OC1-RI2 and OC1-OC2.

[Table 3. Average Errors (absolute value errors) of the cross-lagged (OC1-RI2, RI2-OC3), stability (RI1-RI2, OC1-OC2, RI2-RI3, OC2-OC3), and synchronous (OC1-RI1, OC2-RI2, OC3-RI3) parameter estimates for each missing data technique (listwise deletion, pairwise deletion, regression imputation, direct ML-FIML, ML-EM algorithm, and multiple imputation), under MCAR, MAR, and NMAR, with 25%, 50%, and 75% missing at each wave. Note. N = 440 cases per complete data set. Tabulated values reflect mean parameter error across successful replications. OC = organizational commitment; RI = reenlistment intentions; MCAR = missing completely at random; MAR = missing at random; NMAR = not missing at random; ML = maximum likelihood; FIML = full information maximum likelihood; EM = expectation-maximization.]

[Table 4. Mean Standard Errors of Estimates (Monte Carlo estimates) for the same parameters, techniques, mechanisms, and levels of missingness, along with complete-data values. Note. N = 440 cases per complete data set. Tabulated values reflect standard deviations of parameter estimates across successful replications.]

Amount of Missing Data

As shown in Table 2, the more data are missing, the higher the average parameter error becomes. Twenty-five percent of data deleted at each wave results in relatively modest error for parameters in this simulation (.048 on average), whereas 50% and 75% missingness result in larger average errors (.082 and .156, respectively; see Table 6). The damaging effects of large amounts of missing data are even more pronounced when implementing ad hoc MDTs (listwise, pairwise, and stochastic regression imputation) than when implementing ML and MI techniques (interaction of percentage missing with ad hoc MDT: F = 2.13). Listwise deletion performs especially poorly when there are more data missing (average errors under listwise deletion are .05, .10, and .23 at 25%, 50%, and 75% missing, respectively). Overall, parameter errors are generally unacceptable at 75% missingness (average errors are above .1 for all parameters involving Wave 3; see Table 3).


More information

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 9. oktober 2012 Multivariate tables Agenda today Age standardization Missing data 1 2 3 4 Age standardization

More information

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Age standardization Incidence and prevalence are strongly agedependent Risks rising

More information

ANNOUNCING THE RELEASE OF LISREL VERSION BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3

ANNOUNCING THE RELEASE OF LISREL VERSION BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3 ANNOUNCING THE RELEASE OF LISREL VERSION 9.1 2 BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3 THREE-LEVEL MULTILEVEL GENERALIZED LINEAR MODELS 3 FOUR

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Missing Data. SPIDA 2012 Part 6 Mixed Models with R: The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca

More information

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background An Introduction to Multiple Imputation and its Application Craig K. Enders University of California - Los Angeles Department of Psychology cenders@psych.ucla.edu Background Work supported by Institute

More information

Latent Class Modeling as a Probabilistic Extension of K-Means Clustering

Latent Class Modeling as a Probabilistic Extension of K-Means Clustering Latent Class Modeling as a Probabilistic Extension of K-Means Clustering Latent Class Cluster Models According to Kaufman and Rousseeuw (1990), cluster analysis is "the classification of similar objects

More information

Missing Data Part 1: Overview, Traditional Methods Page 1

Missing Data Part 1: Overview, Traditional Methods Page 1 Missing Data Part 1: Overview, Traditional Methods Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 17, 2015 This discussion borrows heavily from: Applied

More information

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing Samuel Coolidge, sam.r.coolidge@gmail.com Dan Simon, des480@nyu.edu Dennis Shasha, shasha@cims.nyu.edu Technical Report

More information

Study Guide. Module 1. Key Terms

Study Guide. Module 1. Key Terms Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation

More information

Missing Not at Random Models for Latent Growth Curve Analyses

Missing Not at Random Models for Latent Growth Curve Analyses Psychological Methods 20, Vol. 6, No., 6 20 American Psychological Association 082-989X//$2.00 DOI: 0.037/a0022640 Missing Not at Random Models for Latent Growth Curve Analyses Craig K. Enders Arizona

More information

Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data

Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi 1 Introduction Imputation is the commonly used

More information

1. Estimation equations for strip transect sampling, using notation consistent with that used to

1. Estimation equations for strip transect sampling, using notation consistent with that used to Web-based Supplementary Materials for Line Transect Methods for Plant Surveys by S.T. Buckland, D.L. Borchers, A. Johnston, P.A. Henrys and T.A. Marques Web Appendix A. Introduction In this on-line appendix,

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Handling missing data for indicators, Susanne Rässler 1

Handling missing data for indicators, Susanne Rässler 1 Handling Missing Data for Indicators Susanne Rässler Institute for Employment Research & Federal Employment Agency Nürnberg, Germany First Workshop on Indicators in the Knowledge Economy, Tübingen, 3-4

More information

arxiv: v1 [stat.me] 29 May 2015

arxiv: v1 [stat.me] 29 May 2015 MIMCA: Multiple imputation for categorical variables with multiple correspondence analysis Vincent Audigier 1, François Husson 2 and Julie Josse 2 arxiv:1505.08116v1 [stat.me] 29 May 2015 Applied Mathematics

More information

An Introduction to Growth Curve Analysis using Structural Equation Modeling

An Introduction to Growth Curve Analysis using Structural Equation Modeling An Introduction to Growth Curve Analysis using Structural Equation Modeling James Jaccard New York University 1 Overview Will introduce the basics of growth curve analysis (GCA) and the fundamental questions

More information

PASS Sample Size Software. Randomization Lists

PASS Sample Size Software. Randomization Lists Chapter 880 Introduction This module is used to create a randomization list for assigning subjects to one of up to eight groups or treatments. Six randomization algorithms are available. Four of the algorithms

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

SPSS INSTRUCTION CHAPTER 9

SPSS INSTRUCTION CHAPTER 9 SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can

More information

Improving Imputation Accuracy in Ordinal Data Using Classification

Improving Imputation Accuracy in Ordinal Data Using Classification Improving Imputation Accuracy in Ordinal Data Using Classification Shafiq Alam 1, Gillian Dobbie, and XiaoBin Sun 1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand shafiq.alam@whitireia.ac.nz

More information

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT This chapter provides step by step instructions on how to define and estimate each of the three types of LC models (Cluster, DFactor or Regression) and also

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Latent Curve Models. A Structural Equation Perspective WILEY- INTERSCIENΠKENNETH A. BOLLEN

Latent Curve Models. A Structural Equation Perspective WILEY- INTERSCIENΠKENNETH A. BOLLEN Latent Curve Models A Structural Equation Perspective KENNETH A. BOLLEN University of North Carolina Department of Sociology Chapel Hill, North Carolina PATRICK J. CURRAN University of North Carolina Department

More information

Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex

Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex Keiko I. Powers, Ph.D., J. D. Power and Associates, Westlake Village, CA ABSTRACT Discrete time series

More information

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Introduction Neural networks are flexible nonlinear models that can be used for regression and classification

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

Statistical matching: conditional. independence assumption and auxiliary information

Statistical matching: conditional. independence assumption and auxiliary information Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional

More information

Missing Data in Orthopaedic Research

Missing Data in Orthopaedic Research in Orthopaedic Research Keith D Baldwin, MD, MSPT, MPH, Pamela Ohman-Strickland, PhD Abstract Missing data can be a frustrating problem in orthopaedic research. Many statistical programs employ a list-wise

More information

HANDLING MISSING DATA

HANDLING MISSING DATA GSO international workshop Mathematic, biostatistics and epidemiology of cancer Modeling and simulation of clinical trials Gregory GUERNEC 1, Valerie GARES 1,2 1 UMR1027 INSERM UNIVERSITY OF TOULOUSE III

More information

The Use of Sample Weights in Hot Deck Imputation

The Use of Sample Weights in Hot Deck Imputation Journal of Official Statistics, Vol. 25, No. 1, 2009, pp. 21 36 The Use of Sample Weights in Hot Deck Imputation Rebecca R. Andridge 1 and Roderick J. Little 1 A common strategy for handling item nonresponse

More information

Statistical Matching using Fractional Imputation

Statistical Matching using Fractional Imputation Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

PRI Workshop Introduction to AMOS

PRI Workshop Introduction to AMOS PRI Workshop Introduction to AMOS Krissy Zeiser Pennsylvania State University klz24@pop.psu.edu 2-pm /3/2008 Setting up the Dataset Missing values should be recoded in another program (preferably with

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

Robustness of Centrality Measures for Small-World Networks Containing Systematic Error

Robustness of Centrality Measures for Small-World Networks Containing Systematic Error Robustness of Centrality Measures for Small-World Networks Containing Systematic Error Amanda Lannie Analytical Systems Branch, Air Force Research Laboratory, NY, USA Abstract Social network analysis is

More information

Supplementary Figure 1. Decoding results broken down for different ROIs

Supplementary Figure 1. Decoding results broken down for different ROIs Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Lecture: Simulation. of Manufacturing Systems. Sivakumar AI. Simulation. SMA6304 M2 ---Factory Planning and scheduling. Simulation - A Predictive Tool

Lecture: Simulation. of Manufacturing Systems. Sivakumar AI. Simulation. SMA6304 M2 ---Factory Planning and scheduling. Simulation - A Predictive Tool SMA6304 M2 ---Factory Planning and scheduling Lecture Discrete Event of Manufacturing Systems Simulation Sivakumar AI Lecture: 12 copyright 2002 Sivakumar 1 Simulation Simulation - A Predictive Tool Next

More information

Bayesian Model Averaging over Directed Acyclic Graphs With Implications for Prediction in Structural Equation Modeling

Bayesian Model Averaging over Directed Acyclic Graphs With Implications for Prediction in Structural Equation Modeling ing over Directed Acyclic Graphs With Implications for Prediction in ing David Kaplan Department of Educational Psychology Case April 13th, 2015 University of Nebraska-Lincoln 1 / 41 ing Case This work

More information

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc.

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C is one of many capability metrics that are available. When capability metrics are used, organizations typically provide

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

- 1 - Fig. A5.1 Missing value analysis dialog box

- 1 - Fig. A5.1 Missing value analysis dialog box WEB APPENDIX Sarstedt, M. & Mooi, E. (2019). A concise guide to market research. The process, data, and methods using SPSS (3 rd ed.). Heidelberg: Springer. Missing Value Analysis and Multiple Imputation

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

Multiple imputation using chained equations: Issues and guidance for practice

Multiple imputation using chained equations: Issues and guidance for practice Multiple imputation using chained equations: Issues and guidance for practice Ian R. White, Patrick Royston and Angela M. Wood http://onlinelibrary.wiley.com/doi/10.1002/sim.4067/full By Gabrielle Simoneau

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Kosuke Imai Princeton University Joint work with Graeme Blair October 29, 2010 Blair and Imai (Princeton) List Experiments NJIT (Mathematics) 1 / 26 Motivation

More information

Approaches to Missing Data

Approaches to Missing Data Approaches to Missing Data A Presentation by Russell Barbour, Ph.D. Center for Interdisciplinary Research on AIDS (CIRA) and Eugenia Buta, Ph.D. CIRA and The Yale Center of Analytical Studies (YCAS) April

More information

Small Sample Robust Fit Criteria in Latent Growth Models with Incomplete Data. Dan McNeish & Jeff Harring University of Maryland

Small Sample Robust Fit Criteria in Latent Growth Models with Incomplete Data. Dan McNeish & Jeff Harring University of Maryland Small Sample Robust Fit Criteria in Latent Growth Models with Incomplete Data Dan McNeish & Jeff Harring University of Maryland Growth Models With Small Samples An expanding literature has addressed the

More information

IBM SPSS Missing Values 21

IBM SPSS Missing Values 21 IBM SPSS Missing Values 21 Note: Before using this information and the product it supports, read the general information under Notices on p. 87. This edition applies to IBM SPSS Statistics 21 and to all

More information

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

More information

[/TTEST [PERCENT={5}] [{T }] [{DF } [{PROB }] [{COUNTS }] [{MEANS }]] {n} {NOT} {NODF} {NOPROB}] {NOCOUNTS} {NOMEANS}

[/TTEST [PERCENT={5}] [{T }] [{DF } [{PROB }] [{COUNTS }] [{MEANS }]] {n} {NOT} {NODF} {NOPROB}] {NOCOUNTS} {NOMEANS} MVA MVA [VARIABLES=] {varlist} {ALL } [/CATEGORICAL=varlist] [/MAXCAT={25 ** }] {n } [/ID=varname] Description: [/NOUNIVARIATE] [/TTEST [PERCENT={5}] [{T }] [{DF } [{PROB }] [{COUNTS }] [{MEANS }]] {n}

More information

Dynamic Thresholding for Image Analysis

Dynamic Thresholding for Image Analysis Dynamic Thresholding for Image Analysis Statistical Consulting Report for Edward Chan Clean Energy Research Center University of British Columbia by Libo Lu Department of Statistics University of British

More information

An Experiment in Visual Clustering Using Star Glyph Displays

An Experiment in Visual Clustering Using Star Glyph Displays An Experiment in Visual Clustering Using Star Glyph Displays by Hanna Kazhamiaka A Research Paper presented to the University of Waterloo in partial fulfillment of the requirements for the degree of Master

More information

SAS Graphics Macros for Latent Class Analysis Users Guide

SAS Graphics Macros for Latent Class Analysis Users Guide SAS Graphics Macros for Latent Class Analysis Users Guide Version 2.0.1 John Dziak The Methodology Center Stephanie Lanza The Methodology Center Copyright 2015, Penn State. All rights reserved. Please

More information

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

ANNUAL REPORT OF HAIL STUDIES NEIL G, TOWERY AND RAND I OLSON. Report of Research Conducted. 15 May May For. The Country Companies

ANNUAL REPORT OF HAIL STUDIES NEIL G, TOWERY AND RAND I OLSON. Report of Research Conducted. 15 May May For. The Country Companies ISWS CR 182 Loan c.l ANNUAL REPORT OF HAIL STUDIES BY NEIL G, TOWERY AND RAND I OLSON Report of Research Conducted 15 May 1976-14 May 1977 For The Country Companies May 1977 ANNUAL REPORT OF HAIL STUDIES

More information

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data Ludmila Himmelspach 1 Daniel Hommers 1 Stefan Conrad 1 1 Institute of Computer Science,

More information

Faculty of Sciences. Holger Cevallos Valdiviezo

Faculty of Sciences. Holger Cevallos Valdiviezo Faculty of Sciences Handling of missing data in the predictor variables when using Tree-based techniques for training and generating predictions Holger Cevallos Valdiviezo Master dissertation submitted

More information

MAT 110 WORKSHOP. Updated Fall 2018

MAT 110 WORKSHOP. Updated Fall 2018 MAT 110 WORKSHOP Updated Fall 2018 UNIT 3: STATISTICS Introduction Choosing a Sample Simple Random Sample: a set of individuals from the population chosen in a way that every individual has an equal chance

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

Chapter 7: Dual Modeling in the Presence of Constant Variance

Chapter 7: Dual Modeling in the Presence of Constant Variance Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to

More information