Longitudinal Modeling With Randomly and Systematically Missing Data: A Simulation of Ad Hoc, Maximum Likelihood, and Multiple Imputation Techniques


DANIEL A. NEWMAN
The Pennsylvania State University

For organizational research on individual change, missing data can greatly reduce longitudinal sample size and potentially bias parameter estimates. Within the structural equation modeling framework, this article compares six missing data techniques (MDTs): listwise deletion, pairwise deletion, stochastic regression imputation, the expectation-maximization (EM) algorithm, full information maximum likelihood (FIML), and multiple imputation (MI). The rationale for each technique is reviewed, followed by Monte Carlo analysis based on a three-wave simulation of organizational commitment and turnover intentions. Parameter estimates and standard errors for each MDT are contrasted with complete-data estimates, under three mechanisms of missingness (completely random, random, and nonrandom) and three levels of missingness (25%, 50%, and 75%; all monotone missing). Results support maximum likelihood and MI approaches, which particularly outperform listwise deletion for parameters involving many recouped cases. Better standard error estimates are derived from FIML and MI techniques. All MDTs perform worse when data are missing nonrandomly.

Multivariate longitudinal analyses hold promise for the study of individual change in organizations (see Chan & Schmitt, 2000, and Garst, Frese, & Molenaar, 2000, for examples). Unfortunately, attrition from multiwave studies can lead to large standard errors in parameter estimates because nonresponse is compounded across waves of data collection to produce small longitudinal sample sizes. Furthermore, nonrandom mechanisms leading to attrition can bias model parameters and engender misspecification and misestimation of the substantive model of change (Chan, 1998; Goodman & Blum, 1996; Muthen, Kaplan, & Hollis, 1987).

Author's Note: This article is based on the author's doctoral minor project. I would like to thank Mike Rovine for showing me Allison's direct estimation technique, and I am indebted to Kevin Murphy for his early input on the design. The article also benefited from conversations with Hock-Peng Sin and John Graham as well as from comments made on earlier versions by David Chan, Chuck Lance, and two anonymous reviewers. All correspondence should be directed to Dan Newman, The Pennsylvania State University, 429 Bruce V. Moore Building, University Park, PA 16802; dan148@psu.edu.

Organizational Research Methods, Vol. 6, No. 3, July 2003. Sage Publications.

The problems of survey nonresponse (i.e., reduction in statistical power and threat of parameter bias) are a particularly salient challenge for longitudinal researchers. Two important factors to consider when selecting a technique for analyzing missing data are the amount of data missing and the specific type (or mechanism) of missingness (Roth, 1994). The greater the percentage of missing data, the more important missing data approaches are for minimizing bias. As the portion of data missing reaches 15% to 20%, the choice of a missing data estimation technique can have substantial implications for the parameter estimates (Raymond & Roberts, 1987; Roth, 1994). The second factor that has been suggested for consideration in the choice of an MDT is the specific type of missingness, commensurate with the mechanism that gave rise to the missing values (Little & Rubin, 1987). The following section reviews several types of incomplete data and explains which are most likely to give rise to biased sample estimates.

Types of Missing Data

Survey data are usually arrayed in a rectangular matrix of variables by cases. If one imagines a matrix in which some of the data are missing, it is instructive to question whether the probability that a specific cell of the data matrix is empty is totally random or, alternatively, whether the probability of missingness depends on the value of the variable that would have occupied that cell or on the values of other variables. Understanding missing data that are not missing completely at random is a good first step in determining the appropriateness of any particular missing data procedure. Little and Rubin (1987) offered an explication of various missing data patterns, which has been helpfully abstracted by Roth (1994) and Schafer and Graham (2002). Little and Rubin (1987) explain that when considering a single variable Y, values of Y can be missing (a) randomly, (b) below some cut value on Y (e.g., those with IQ scores below 85 are not considered for hire), or (c) subject to a form of probabilistic censoring proportional to the value of Y (e.g., the less committed one is to the organization, the greater the probability that one fails to respond to the survey measuring commitment). The extent to which missing data are likely to bias sample estimates of the population mean of Y is negligible when data are missing randomly but not necessarily negligible when data are missing systematically. When considering two or more variables at once, there are even more possible missing data patterns, and nonrandom missingness can bias both variable means and covariance estimates (Rubin, 1976).

The first type of missing data is referred to as missing completely at random (MCAR) and describes a random mechanism for data loss in which the probability that any given datum is not recorded is equal across all respondents, independent of the values on both the variable with incomplete data and all other variables, measured or unmeasured. This type of missingness is unlikely to bias population mean estimates (Little & Rubin, 1987). For a missing-data mechanism to be classified as MCAR, it must be classified as both missing at random (MAR) and observed at random (OAR). Data are MAR if the missingness pattern does not depend on the values of the data that are missing, whereas data are OAR if the missingness pattern does not depend on the values of data that are observed (interestingly, it is impossible to test whether data are MAR, but possible to test whether they are OAR).

For two variables X and Y, data on Y are MAR if the probability of missingness on Y depends on X but not on Y after controlling for X. For instance, if an organization has three levels of hierarchy and individuals in higher levels are more likely to not report their incomes, then income data can still be MAR, as long as all the data within each level of hierarchy have an equal probability of missingness (independent of income). The MAR-OAR distinction is useful for pointing out circumstances that are MAR but not OAR (and therefore not MCAR). The above example with hierarchy and income is such a circumstance, because the probability of income nonresponse is related to observed levels of hierarchy but unrelated to income when hierarchy is controlled. Note also that the circumstance in which nonreporting of income is related to level of hierarchy and also related to level of income within each level of hierarchy is neither MAR nor OAR (alternatively denoted not missing at random, or NMAR) (Jamshidian & Bentler, 1999).

In a simple summary of the three missing data mechanisms, Schafer and Graham (2002) pointed out that MCAR, MAR, and NMAR can be distinguished by delineating the antecedents of the missing data on variable Y. That is, the probability that data are missing on Y can depend on (a) neither X nor Y (MCAR), (b) X but not Y when X is controlled (MAR), or (c) Y itself (NMAR). Last, the missing data mechanism may be related to variables not included in the study (Graham & Donaldson, 1993), may be attributable to the sensitivity of the information elicited by the item or scale itself (as in self-report measures of illegal behavior), or may be related to a combination of two or more variables (Kim & Curry, 1977; Roth, 1994).

The mechanisms accountable for data missingness are of direct relevance to the MDT selected and depend somewhat on the objective of the analysis (Little & Rubin, 1987). For example, MAR is not a problem if one's interest is in studying the conditional distributions of income given a particular level of hierarchy, yet MAR can lead to bias when estimating parameters across levels of hierarchy, such as mean income (due to selective income data missingness from those at higher levels in the hierarchy) or the covariance between income and some other variable (due to loss of variation in measures). For a more in-depth review of these topics, see Little and Rubin (1987), Muthen et al. (1987), and Roth (1994).

On the basis of the missing data conceptualization just reviewed, one would predict that the parameters in a longitudinal latent variable model would be biased by nonrandom attrition. To understand precisely how various analytic treatments can reduce the bias created by nonrandom missingness mechanisms in such complex multivariate models, however, more research is required.
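The three mechanisms are easy to confuse in the abstract, so a small simulation can help. The sketch below is an illustrative Python/NumPy toy, not the SAS code used later in this study: it deletes values of a variable Y completely at random (MCAR), as a function of an observed covariate X only (MAR), and as a function of Y itself (NMAR), and then shows how the complete-case mean of Y behaves under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# X is always observed (e.g., level of hierarchy); Y is the survey variable that can go missing
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.87, size=n)   # corr(x, y) is roughly .5

def delete(values, miss_prob):
    """Set values to NaN wherever a uniform draw falls below that case's missingness probability."""
    out = values.copy()
    out[rng.uniform(size=out.size) < miss_prob] = np.nan
    return out

y_mcar = delete(y, np.full(n, 0.25))                        # MCAR: same 25% chance for everyone
y_mar = delete(y, np.where(x > np.median(x), 0.40, 0.10))   # MAR: depends only on observed x
y_nmar = delete(y, np.where(y < np.median(y), 0.40, 0.10))  # NMAR: depends on y itself

for label, yy in [("MCAR", y_mcar), ("MAR", y_mar), ("NMAR", y_nmar)]:
    observed = yy[~np.isnan(yy)]
    print(f"{label}: {np.isnan(yy).mean():.0%} missing, complete-case mean of Y = {observed.mean():+.3f}")
# The complete-case mean of Y stays near its population value (0) only under MCAR;
# under MAR and NMAR the cases that remain are no longer representative of Y.
```

Note that the MAR deletion here biases the marginal mean of Y even though it satisfies the MAR definition, which is exactly the point made above about estimating mean income across levels of hierarchy.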

MDTs for Latent Variable Models

Ad Hoc Procedures

Listwise deletion. The most common technique for handling missing data problems in latent variable modeling is to analyze only those cases for which data are available on all variables. This technique omits from analysis any cases that are not entirely complete. In practice, this approach can severely reduce the effective sample size (increasing standard errors). Although listwise deletion provides relatively unbiased estimates under MCAR, it can lead to parameter bias when the missingness mechanism is not completely random. For instance, Allison (2002) pointed out that listwise deletion biases regression parameters under MAR when missingness on the predictor (X) is based on the criterion (Y). This study assesses the performance of listwise deletion for latent variable models under MCAR, MAR, and NMAR conditions.

Pairwise deletion. As an alternative to listwise deletion, it is possible to calculate covariances among each pair of variables using all available cases for each pair. This technique has the advantage of including information in the covariance matrix from cases that would otherwise have been discarded under listwise deletion. Pairwise deletion is unbiased in large samples under MCAR but suffers from potentially serious parameter bias when data are only MAR (Allison, 2002). Currently, the most damning problem with using pairwise deletion for latent variable modeling is the lack of any appropriate method for estimating a single sample size to be used for the analyses (Marsh, 1998). When different amounts of data are missing for different variables, the precision of estimation can vary greatly across parameters in the model. Using a single sample size (e.g., the minimum or mean N per correlation) to estimate all parameters will inevitably give inaccurate standard errors for some parameters in the model.

Single imputation. The term imputation refers to a set of techniques that fill in values for the missing data. Imputation is usually carried out as a first step, prior to conducting the statistical analyses as though there were no missing data to begin with. The most commonly implemented forms of imputation are mean substitution (replacing the missing value with the mean value on that variable calculated from all other respondents), mean person substitution (replacing the missing value for an item with the mean value on similar items reported by the same individual), and hot-deck imputation (replacing the missing value with the value reported by another respondent ["donor"], who is either chosen randomly from the sample or selected on the basis of similarity to the recipient in terms of values reported for other variables [i.e., smallest Euclidean distance]). Last, regression imputation is a popular technique, in which the variable with missing data is regressed onto all other variables to produce a regression equation (on the basis of the complete cases). Missing values are then replaced with predicted values from the regression equation. This sort of regression imputation has two problems stemming from the fact that the imputed values perfectly fit a regression line: (a) the variance of the imputed variable is underestimated, and (b) correlations with the imputed variable are overestimated (because the underestimated variance of the imputed variable is in the denominator of the correlation formula). One method used to redress these problems is the addition of a random error term to the imputed values (known as stochastic regression imputation). The random error term is a random normal variate with a mean of zero and a standard deviation equal to the standard error of estimate of the regression equation.

Allison (2002) pointed out that regression parameter estimates based on regression imputation under MCAR are relatively unbiased in large samples (Gourieroux & Monfort, 1981). Unfortunately, all single imputation techniques have the fundamental flaw of underestimating standard errors (because the addition of imputed data to the incomplete data set results in overestimation of the actual sample size).
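The difference between deterministic and stochastic regression imputation amounts to one added term. The sketch below is an illustrative Python/NumPy version (the study itself used SAS PROC REG, as described in the Method section): the imputation regression is fit on the complete cases, and each imputed value is the predicted score plus a random normal residual whose standard deviation equals the standard error of estimate.

```python
import numpy as np

def stochastic_regression_impute(x, y, rng):
    """Fill in missing y values with predicted scores plus a random normal residual."""
    obs = ~np.isnan(y)
    b1, b0 = np.polyfit(x[obs], y[obs], 1)                   # imputation regression from complete cases
    resid_sd = np.std(y[obs] - (b0 + b1 * x[obs]), ddof=2)   # standard error of estimate
    y_filled = y.copy()
    miss = ~obs
    y_filled[miss] = b0 + b1 * x[miss] + rng.normal(scale=resid_sd, size=miss.sum())
    return y_filled

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)
y_mis = y.copy()
y_mis[rng.uniform(size=500) < 0.30] = np.nan                 # 30% MCAR missingness on y

y_filled = stochastic_regression_impute(x, y_mis, rng)
print(round(np.var(y), 3), round(np.var(y_filled), 3))       # variances are now comparable
```

Dropping the rng.normal term reproduces deterministic regression imputation and the two problems noted above (shrunken variance, inflated correlations); even with the random term, treating the filled-in data set as if it were fully observed still overstates the sample size, which is the standard-error problem that multiple imputation is designed to solve.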

Likelihood-Based Estimation Procedures

Aside from the ad hoc MDTs reviewed above, newer approaches have been developed that are based on likelihood functions and are well tailored to latent variable modeling. In a review of latent variable techniques for missing data analysis, Rovine (1994) classified the available methods into two types. The first type includes methods that use sufficient statistics to estimate a complete data matrix, which can then serve as input for a latent variable model (e.g., EM-type algorithms) (Dempster, Laird, & Rubin, 1977). The second type of method for latent variable modeling with missing data is direct parameter estimation, which can be implemented using multigroup structural equation modeling (see Allison, 1987) or full information maximum likelihood (FIML) (Finkbeiner, 1979). In general, maximum likelihood (ML) approaches operate by estimating the set of parameters that maximizes the probability of obtaining the data that were observed. Enders (2001) provided a useful review in which he distinguished the three currently available ML algorithms: (a) the EM algorithm, (b) the multiple-group approach, and (c) FIML. Whereas the ad hoc MDTs reviewed above (i.e., listwise and pairwise deletion, single imputation methods) generally require that missing data be MCAR, ML methods have the advantage of being theoretically unbiased under both MCAR and MAR conditions. The reason is that ML algorithms implicitly account for the dependencies of missingness on other variables in the data set. For example, if the probability that data on variable Y are missing depends on the value of variable X, ML algorithms produce estimates that incorporate the conditional distributions of the missing data on Y given the observed data on X, whereas ad hoc approaches do not (Enders, 2001).

Direct estimation: FIML and multiple-group approaches. The two direct estimation ML approaches (FIML and the multiple-group approach) are essentially alternate versions of the same method, implemented with slightly different algorithms. Although the generic FIML approach has now superseded the multiple-group approach as the direct-estimation method of choice (due to its superior flexibility and recent software availability), procedures for applying the multigroup algorithm became widely available first. The multiple-group approaches are therefore described first here, as an introduction to the basic principles of direct estimation ML. In 1987, two techniques were formally proposed for direct ML estimation using the notion of dividing the sample into subgroups based on missing data patterns (Allison, 1987; Muthen et al., 1987). According to Allison (1987), his ML method of linear modeling is more efficient with data than listwise deletion (because it does not discard data from partially complete cases) and gives more consistent estimates of standard errors than pairwise deletion (because it uses the correct sample sizes for each parameter). Essentially, the multigroup approaches divide the incomplete data set into several subsamples based on different patterns of missingness (e.g., one subsample with complete data, one subsample with data missing on variables X1 and Z1 only, etc.). All unique missingness subgroups are then incorporated into a single multigroup structural equation model, which is estimated simultaneously for all subsamples while appropriate equality constraints are imposed across subsamples (Allison, 1987, p. 73). Multiple-group methods are most useful when the number of subsamples is small and the number of cases in each subsample is large.
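As a concrete illustration of what "dividing the sample into subgroups based on missing data patterns" involves, the short pandas sketch below (illustrative Python, not LISREL syntax; the variable names and subgroup sizes simply mirror the 25% monotone design described later in the Method section) tabulates the distinct missingness patterns that would each become one group in the multigroup specification.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Toy stand-in for the six observed variables (commitment and intentions at three waves)
data = pd.DataFrame(rng.normal(size=(440, 6)),
                    columns=["OC1", "RI1", "OC2", "RI2", "OC3", "RI3"])

# Monotone dropout: 110 cases lose Waves 2 and 3, a further 82 lose Wave 3 only
data.loc[:109, ["OC2", "RI2", "OC3", "RI3"]] = np.nan
data.loc[110:191, ["OC3", "RI3"]] = np.nan

# One row per case: 1 = observed, 0 = missing; identical rows share a missingness pattern
patterns = data.notna().astype(int)
print(patterns.value_counts())
# Each distinct pattern becomes one group in the multigroup model, with equality
# constraints imposed on the substantive parameters across groups.
```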

Although several authors have proposed multiple-group structural equation modeling approaches for estimating missing data subgroups (Lee, 1986; Muthen et al., 1987; Werts, Rock, & Grandy, 1979), Allison's (1987) phantom variable technique was historically one of the more popular and easy to implement. Allison showed how LISREL's multigroup routine (Jöreskog & Sörbom, 1996) could be used to implement Hartley and Hocking's (1971) likelihood function for a multivariate normal distribution with data missing at random. The technique worked by specifying a separate phantom factor on which all variables in the model had a loading of 1 when those variables were present in a data subsample, and a loading of 0 when the variables were missing from a subsample. Because the multigroup ML procedure calculates separate likelihood functions for each missing data subgroup (which are then aggregated and maximized), it has been touted as loosely analogous to pairwise deletion (Enders, 2001) (see McArdle & Hamagami, 1992, for a multigroup application with nonlinear change).

Since the introduction of multiple-group direct estimation approaches, techniques have evolved for executing a more flexible FIML procedure. FIML was introduced by Finkbeiner (1979) and has been included in the popular structural equation modeling software AMOS (Arbuckle, 1995), LISREL (Jöreskog & Sörbom, 1996), and MPLUS (Muthen & Muthen, 1998). Conceptually, FIML is the same as the multigroup approach, except that it begins with individual-level (rather than group-level) likelihood functions (Enders, 2001). As illustrated by Duncan, Duncan, and Li (1998), the χ² test for ML models requires a comparison of likelihood functions between specially constructed unrestricted (H0) and restricted (H1) models (the LISREL FIML routine does this automatically) because there is no appropriate single value for N that applies to the entire model.

EM algorithm. The EM algorithm (Dempster et al., 1977) is a maximum likelihood procedure that produces estimates of the complete-data correlation matrix and means. The algorithm produces these estimates by repeatedly iterating through two steps, called the E-step (for expectation) and the M-step (for maximization). The E-step calculates an expected value for the complete-data likelihood function of the missing data, based on the observed (incomplete) data and the current set of parameter estimates (the very first E-step uses the listwise or pairwise deleted correlation matrix and means, whereas subsequent E-steps use the parameters produced from the previous M-step). The E-step is essentially like conducting a series of regression imputations to produce expected missing values, in which the regressions are based on the current correlation matrix (with some random error terms added) and conditioned on the observed values of other variables. After the E-step gives an expectation for the complete-data likelihood function based on the observed data and current parameters, the M-step maximizes this expectation (i.e., it maximizes the likelihood) to produce a new, updated set of parameters (i.e., a new correlation matrix and means). These new parameters are then combined with the observed (incomplete) data to yield a new expectation for the complete-data likelihood function (the second E-step), which is then maximized to produce an even newer set of updated parameters (the second M-step). The iteration between E-steps and M-steps continues until some convergence criterion is met, at which point the algorithm has produced a final correlation matrix and vector of means. The correlation matrix and means can then be used to estimate a latent variable model.
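The two steps are easiest to see in the smallest possible case: a bivariate normal pair in which X is fully observed and Y is partly missing. The sketch below is an illustrative NumPy implementation of the textbook EM updates for the mean vector and covariance matrix (it is not SAS PROC MI, and it omits the structural model entirely); the E-step fills in expected sufficient statistics for the missing Y values, and the M-step re-estimates the parameters from them.

```python
import numpy as np

def em_bivariate(x, y, tol=1e-6, max_iter=500):
    """EM estimates of the mean vector and covariance matrix when y has missing values."""
    miss = np.isnan(y)
    mu = np.array([x.mean(), y[~miss].mean()])         # start from complete-case values
    cov = np.cov(x[~miss], y[~miss])
    for _ in range(max_iter):
        # E-step: expected y, y**2, and x*y for the missing cases, given x and current parameters
        beta = cov[0, 1] / cov[0, 0]                    # slope of the regression of y on x
        resid_var = cov[1, 1] - beta ** 2 * cov[0, 0]   # conditional variance of y given x
        y_hat = np.where(miss, mu[1] + beta * (x - mu[0]), y)
        y2_hat = np.where(miss, y_hat ** 2 + resid_var, y ** 2)
        xy_hat = x * y_hat
        # M-step: maximum likelihood means and covariance from the expected sufficient statistics
        mu_new = np.array([x.mean(), y_hat.mean()])
        sxy = xy_hat.mean() - mu_new[0] * mu_new[1]
        cov_new = np.array([[np.mean(x ** 2) - mu_new[0] ** 2, sxy],
                            [sxy, y2_hat.mean() - mu_new[1] ** 2]])
        converged = np.max(np.abs(mu_new - mu)) < tol and np.max(np.abs(cov_new - cov)) < tol
        mu, cov = mu_new, cov_new
        if converged:
            break
    return mu, cov

rng = np.random.default_rng(4)
x = rng.normal(size=400)
y = 0.6 * x + rng.normal(scale=0.8, size=400)
y[x > 0.5] = np.nan                                     # MAR: missingness depends only on observed x
print(em_bivariate(x, y))                               # recovers means and covariances despite the dropout
```

Passing the converged correlation matrix and means to a latent variable model as input, rather than fitting the model directly from raw data, is what makes this an indirect procedure, as discussed next.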

Because the EM algorithm only produces correlation and mean parameters that must subsequently serve as input for the structural equation model, this technique is considered an indirect ML procedure, in contrast with the multigroup and FIML approaches, which can estimate latent variable models directly from raw data. Some distinctions can be made between the EM algorithm and the direct estimation ML approaches (i.e., FIML). Although both produce ML estimates, the EM algorithm does not impose the restrictions on the covariance matrix implied by the structural model. Enders (2001) suggested that an advantage of the EM algorithm over direct ML estimation is the ability of the EM algorithm to incorporate variables into the missing data treatment that are not part of the substantive model being tested (i.e., auxiliary variables). To understand this advantage, recall that all of the ML approaches reviewed here provide some protection against parameter bias under MAR (by keeping track of the conditional distributions of the missing data). Unfortunately, the direct estimation ML procedures as generally applied only gain protection from MAR when the variables supposed to produce the missingness (or correlated with the variables containing missingness) are included in the model being tested. By contrast, the EM algorithm can provide ML estimates of the means and correlations based on a large set of (both central and auxiliary) variables that may be suspected to produce missingness, while using only a subset of these variables in the substantive model of interest. An anonymous reviewer of this article suggested that FIML could indeed be used in combination with variables extraneous to the substantive model that are believed to produce missingness. The recommended technique for combining direct ML with auxiliary variables is to specify a model in which the auxiliary variables are allowed to be correlated with all observed exogenous variables and with the error terms for all observed endogenous variables. Recent work has begun to address the issues of incorporating auxiliary variables into FIML approaches (see Collins, Schafer, & Kam, 2001; Graham, in press).

Multiple Imputation (MI)

As mentioned earlier, a fundamental problem with single imputation is the inability to get accurate estimates of standard errors. MI is a procedure by which missing data are imputed several times (e.g., using regression imputation) to produce several different complete-data estimates of the parameters. The parameter estimates from each imputation are then combined to give an overall estimate of the complete-data parameters as well as reasonable estimates of the standard errors. MI has one complexity, however. If each of the imputations used in the MI procedure were based on regression parameters from the observed data, then it would be assumed that these regression imputation parameters are the true population parameters, when in fact they are only sample estimates from a sampling distribution of betas. Therefore, when multiple imputation is implemented, rather than using the sample regression parameters for each imputation, new parameters are drawn randomly for each imputation from a Bayesian posterior distribution of the regression imputation parameters. The difficulty created by the random draws in the MI procedure is that slightly different results are recovered each time the procedure is used, even when the procedure is used twice on the same data set.
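To see what "drawn randomly from a Bayesian posterior distribution" means operationally, the sketch below gives an illustrative NumPy version of the regression-imputation step under a noninformative prior (a simplified stand-in for, not a reproduction of, the SAS PROC MI regression method used later in the Method section): each pass draws a residual variance, then draws regression coefficients around the least-squares estimates, and only then imputes, so that the M completed data sets differ both in their random residuals and in the imputation model itself.

```python
import numpy as np

def mi_regression_impute(x, y, m=10, rng=None):
    """Return m completed copies of y, imputing missing values from posterior parameter draws."""
    if rng is None:
        rng = np.random.default_rng()
    obs, miss = ~np.isnan(y), np.isnan(y)
    X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
    X_mis = np.column_stack([np.ones(miss.sum()), x[miss]])
    beta_hat, *_ = np.linalg.lstsq(X_obs, y[obs], rcond=None)
    resid = y[obs] - X_obs @ beta_hat
    df = obs.sum() - X_obs.shape[1]
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)
    completed = []
    for _ in range(m):
        sigma2 = resid @ resid / rng.chisquare(df)                  # draw the residual variance
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)  # draw coefficients given sigma2
        y_fill = y.copy()
        y_fill[miss] = X_mis @ beta + rng.normal(scale=np.sqrt(sigma2), size=miss.sum())
        completed.append(y_fill)
    return completed
```

Each completed data set is then analyzed as usual and the results are pooled; because every call uses fresh random draws, two runs on the same data yield slightly different (and equally legitimate) pooled estimates, which is the complexity noted above.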

MI assumes that data are MCAR or MAR and requires that data be imputed under a particular model (the multivariate normal model will suffice for most applications, and MI estimates are probably not as sensitive as ML estimates to violations of multivariate normality) (Allison, 2002).

Research Questions and Contribution

In conducting a Monte Carlo analysis of the above MDTs, several important questions will be answered:

Question 1: Which techniques will produce the smallest missing data errors for estimates of three types of structural parameters (cross-lagged, stabilities, and synchronous), based on a three-wave panel model with random monotone missing data?

Question 2: Which techniques will produce the most appropriate standard errors for three types of structural parameters (cross-lagged, stabilities, and synchronous), based on a three-wave panel model with random monotone missing data?

Question 3: Will the missing data errors, standard errors, and the differences in errors between techniques be altered when the amount of missing data varies from 25% to 50% to 75% at each wave?

Question 4: Will the missing data errors, standard errors, and the differences in errors between techniques be altered when the mechanism that produced the missing data changes from a completely random mechanism (MCAR) to a systematic missing data mechanism (MAR and NMAR)?

This study represents the first full-scale simulation of all six missing data approaches. Furthermore, the techniques are applied to an increasingly important design (the three-wave longitudinal design), yet one for which missing data is a typical problem. Third, this research investigates bias and standard errors of structural parameters within the SEM framework, which is a particularly promising analytic mode for longitudinal research (Chan, 1998; Lance, Vandenberg, & Self, 2000). Finally, systematic missing-data mechanisms are tested, which helps to determine how the techniques perform when the assumptions of random missingness are violated.

Method

Design

In the following study, MDTs for analyzing incomplete data are assessed via a Monte Carlo experiment with three factors. This design incorporates six MDTs, three levels of missingness (25%, 50%, 75%), and three missing data mechanisms (MCAR, MAR, NMAR). Four dependent variables are calculated. First, the average error of parameter estimates (or missing data error) is tabulated as the mean absolute difference between estimates derived from complete data and those derived from the respective MDTs. This can be expressed by the equation Average Error = Σ |complete-data estimate − missing-data estimate| / N. Next, the average standard error estimates observed under each technique will be recorded. These are the estimates that in practice form a basis for significance testing and the construction of confidence intervals. In addition to the average observed standard errors, estimates of the true standard errors under each technique will be derived from the Monte Carlo simulation as the standard deviation of parameter estimates across replications. The average of the observed standard errors will then be compared to the Monte Carlo estimates of the true standard errors to show the extent of over- or underestimation of standard errors by each MDT. Last, each technique is assessed on whether the structural equation modeling algorithm failed to converge.
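Both of the first two dependent variables reduce to simple array operations once the replication results are collected. The sketch below is an illustrative NumPy computation on made-up estimates (the array names and values are hypothetical; in the study these quantities come from 100 LISREL runs per condition): the average error is the mean absolute difference from the complete-data estimates, and the Monte Carlo ("true") standard error is the standard deviation of an MDT's estimates across replications.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical estimates of one structural parameter across 100 replications
est_complete = 0.30 + rng.normal(scale=0.04, size=100)   # from the complete data sets
est_mdt = 0.30 + rng.normal(scale=0.07, size=100)        # from the same data sets after an MDT

average_error = np.mean(np.abs(est_complete - est_mdt))  # missing data error
true_se = np.std(est_mdt, ddof=1)                        # Monte Carlo estimate of the true SE

print(f"average error = {average_error:.3f}, Monte Carlo SE = {true_se:.3f}")
# The mean of the standard errors reported by the MDT in each run would then be
# compared against true_se to gauge over- or underestimation.
```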

Procedure

The method includes seven steps, partly mimicking those used by Switzer, Roth, and Switzer (1998) and Roth and Switzer (1995).

1. Generate simulated longitudinal data. For the present study, data simulation required a population matrix for three consecutive waves of organizational commitment and turnover intentions data. Although published individual-level organizational research with three-wave longitudinal designs is sparse, the research reported by Farkas and Tetrick (1989) provided a prime example of the sort of data for which the techniques reviewed above might be useful (see Table 1 for the population data matrix). The three waves of data are spaced roughly 10 months apart and represent a sample of 440 first-term military personnel (Farkas & Tetrick, 1989). Based on the population matrix, 100 sample data sets were created. Each data set held 440 observations of six variables, generated to be multivariate normal by PRELIS2 (Jöreskog & Sörbom, 1993). These 100 samples were used in each cell of the experimental design.

2. Test the theorized multivariate model on each complete data set. A theoretically predicted three-wave panel model of organizational commitment and reenlistment intentions was estimated in this study (see Figure 1). Theoretically, the predicted cross-lagged model parameters suggest that organizational commitment leads to intentions to remain with the organization during an initial 6-to-12-month developmental period during which intentions are labile (i.e., from Time 1 to Time 2) (Porter, Steers, Mowday, & Boulian, 1974), but that after intentions crystallize (from Time 2 to Time 3), reported commitment levels are adjusted to reflect established intentions (Bateman & Strasser, 1984; Farkas & Tetrick, 1989). The stability parameters linking each construct to itself across time are estimated, as are the synchronous ψ parameters, which account for the cumulative influence of variables external to the model (Anderson & Williams, 1992; Brett, Feldman, & Weingart, 1990). As can be seen in Figure 1, the measurement model is a single-indicator model (the same as used by Farkas & Tetrick, 1989), wherein λ parameters were set equal to the square root of each indicator's Cronbach's α reliability, and random error variance was set at (1.00 − α) times the variance of the indicator. After testing the model on each complete sample, LISREL 8.50 estimates were recorded for nine key model parameters (see Figure 1).

3. Delete data randomly. In the 25% missingness condition, 25% of the data was deleted at each wave for each data set. Twenty-five percent or more missingness per wave is typical in longitudinal designs such as this one (see Fullagar & Barling, 1989; Meyer, Bobocel, & Allen, 1991). Participants deleted at the second wave did not return for the third wave. This resulted in a loss of 110 participants in the second wave and 192 participants in the third wave (110 participants missing Waves 2 and 3 plus 82 participants missing Wave 3 only).

[Table 1. Population Correlation Matrix: means, standard deviations, and correlations among organizational commitment and reenlistment intentions at Times 1, 2, and 3. Source: Farkas and Tetrick (1989).]

Consequently, each sample was essentially divided into three subsamples, with each subsample exhibiting a unique missing data pattern. The three subsample patterns of missing data created for each sample were as follows: (a) no missing data (248 participants), (b) missing both the second and third waves of data (110 participants), and (c) missing third-wave data only (82 participants). This overall scheme of missingness, in which participants who leave the sample at one wave do not return at any later wave, has been referred to as the monotone missing data pattern (Marini, Olsen, & Rubin, 1980) and represents a great portion of the data collected using longitudinal designs. Subgroup sample sizes for the 50% and 75% missingness conditions were derived in the same manner. Deletion was conducted using SAS Version 8e software, with variables dropped from cases on the basis of two independent random uniform numbers.

4. Delete data systematically. In the MAR systematic deletion condition, participants were arranged so that the probability of deletion from each wave decreased linearly with the organizational commitment reported at the previous wave. This trend of enhanced likelihood of survey noncompliance among individuals with lower levels of organizational commitment is consistent with previous research (Rogelberg, Luong, Sederburg, & Cristol, 2000). One method for achieving the linear data loss pattern was modeled by Switzer et al. (1998) and involves three steps: (a) observations are rank-ordered by organizational commitment scores from the previous wave, (b) rank scores are linearly transformed to probability-of-deletion scores calibrated to produce 25% deletion per wave in Step (c), and (c) deletion of an observation is determined by comparing the transformed probability value from Step (b) to a random uniform number and deleting cases for which the probability-of-missingness variable exceeds the random uniform variable. As a result of this process, participants in the 25% missingness condition have a linearly increasing chance of being deleted as their organizational commitment scores decrease, with the least committed respondent having a .5 probability of deletion and the most committed respondent having a .0 probability of deletion.

[Figure 1. Three-Wave Panel Model of Organizational Commitment (OC) and Reenlistment Intentions (RIs). Note. Parameters of interest: βs and ψs.]

At its basis, Switzer et al.'s (1998) approach for simulating systematically missing data involves creating a cut score by summing a substantive variable theorized to influence missingness (e.g., organizational commitment) and a random uniform variable. Interestingly, when using this approach, the random variable greatly overwhelms the substantive variable in its relation to the missingness. To demonstrate this, correlations were calculated between the cut score and its two components across all cases in the simulation. The resulting correlation between the cut score and its random component was .95, whereas the correlation between the cut score and its substantive component was only .3. This may explain why Switzer et al. (1998) found little empirical difference between their random and systematic missingness conditions. For the present study, it was desirable to simulate stronger systematic missingness. This was accomplished by constraining the variance of the random uniform component of the cut score to equal the variance of the substantive component of the cut score (in this case, organizational commitment), with the result that the cut score was composed of equal parts random and systematic variance (i.e., the correlation of each component with the cut score was .7). This approach was used to simulate the 25%, 50%, and 75% missingness conditions.

To simulate the MAR condition, missingness on Wave 2 variables was based on Wave 1 commitment, whereas missingness on Wave 3 variables was based on Wave 2 commitment. Some missingness on Wave 3 variables was also based on Wave 1 commitment, because the monotone missing pattern requires that those cases missing in Wave 2 are also missing in Wave 3. For the NMAR mechanism, the probability that a datum is missing on a variable Y must depend on the value of Y itself. To simulate NMAR, missingness on Wave 2 variables depended on Wave 2 commitment, whereas missingness on Wave 3 variables depended on Wave 3 commitment (and on whether the case was already missing in Wave 2). Thus, under the condition labeled NMAR, missingness on Wave 2 commitment scores was strictly NMAR, missingness on Wave 3 commitment scores was partially NMAR, and missingness on Wave 2 and Wave 3 reenlistment intentions was MAR.

5. Test the theorized multivariate model using various MDTs.

Listwise deletion. Each data set was subjected to listwise deletion prior to the estimation of the three-wave panel model. All nine parameter estimates and standard errors of interest were computed in LISREL 8.50 and recorded for each sample.

Pairwise deletion. Each data set was subjected to pairwise deletion using the PROC CORR routine in SAS software Version 8e. The mean sample size per correlation (NPC) was used for each latent variable model, as tentatively recommended by Marsh (1998).

Stochastic regression imputation. Regression imputation was carried out using the PROC REG routine in SAS 8e. The regression parameters for each variable with partially incomplete data were estimated on the basis of all available cases with complete data. Complete cases for each partially incomplete variable were regressed on all complete cases from the previous waves. For example, Wave 3 commitment was regressed on Wave 1 commitment and Wave 1 reenlistment intentions (using all available data), and the resulting regression parameters were then used to impute predicted values of Wave 3 commitment for the missing data subgroup in which data from Waves 2 and 3 were missing. As explained earlier, a random error component was added to each predicted score to more accurately retain the overall variability of the partially imputed variables. This random error component was estimated as the product of a random normal variate and the standard deviation of the regression residuals estimated from the complete data for each replication.

FIML. FIML is a new feature in LISREL 8.50 that operates by simply adding the command mi = . to the data step.

EM algorithm. The EM algorithm was implemented in SAS 8e under the PROC MI routine. The default settings of 200 maximum iterations and a convergence criterion of .0001 failed to produce convergence under the 75% missingness conditions and were adjusted as follows: (a) MCAR, maximum iterations raised to 400; (b) MAR, maximum iterations raised to 1,100 and convergence criterion raised to .001; and (c) NMAR, maximum iterations raised to 600 and convergence criterion raised to .001. Correlation matrices and means created by the EM algorithm were then input to LISREL 8.50, and models were estimated using the complete sample N of 440.

MI. When data are missing in a monotone pattern (e.g., when those respondents in a longitudinal study who leave the sample do not return), the regression method for MI is appropriate. PROC MI in SAS 8e, which is based on the approach described by Schafer (1997), was used to create 10 imputations for each data set. These are stochastic regression imputations based on random draws from a posterior distribution. Once the 10 imputations are completed for each replication, each imputation is analyzed in LISREL 8.50, and the resulting parameters are averaged across imputations to produce final parameter estimates. Estimates of standard errors are given by the equation Standard Error = {(1/M) Σ s_k² + [1 + 1/M][1/(M − 1)] Σ (p_k − p_mean)²}^½, where M is the number of imputations, p_k is the parameter estimate from imputation k, and s_k is the standard error from imputation k (Rubin, 1987, as applied by Allison, 2002).
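Written out, the combining rule is only a few lines. The sketch below pools one parameter across M = 10 imputations with illustrative NumPy code and hypothetical values for p_k and s_k; it implements exactly the expression above (average within-imputation variance plus (1 + 1/M) times the between-imputation variance, square-rooted).

```python
import numpy as np

# Hypothetical estimates (p) and standard errors (s) of one parameter from 10 imputations
p = np.array([0.31, 0.29, 0.33, 0.30, 0.28, 0.32, 0.30, 0.31, 0.29, 0.30])
s = np.full(10, 0.05)
m = p.size

within = np.mean(s ** 2)                          # (1/M) * sum of s_k^2
between = np.sum((p - p.mean()) ** 2) / (m - 1)   # [1/(M - 1)] * sum of (p_k - p_mean)^2
pooled_se = np.sqrt(within + (1 + 1 / m) * between)

print(f"pooled estimate = {p.mean():.3f}, pooled SE = {pooled_se:.4f}")
```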

6. Compare parameter estimates from the various techniques to those made using complete data. The average errors (differences between estimates from complete data and estimates from the various MDTs) were calculated across the 100 studies in each condition.

7. Assess and compare standard error estimates from the various techniques. Two estimates of the standard error were computed for each parameter under each condition. First, the true standard errors were estimated as the standard deviations of parameter estimates across replications in the simulation. Next, these estimates are contrasted with the average standard errors of estimates produced by single samples using each MDT. The latter represent the estimates that a researcher would derive when implementing each technique.

Results and Discussion

Upon evaluation of 100 simulated samples in each of 54 conditions (6 MDTs × 3 levels of missingness × 3 mechanisms of missingness), average errors and standard errors were computed for each parameter (see Tables 3, 4, and 5 and Figures 2a to 2c). From these results, several trends can be observed. To aid in interpretation of these results, an analysis of variance was conducted to evaluate selected main effects and interactions on the average missing data errors of model parameters (see Table 2). No significance tests are provided because statistical power in simulations can always be adjusted arbitrarily by altering the number of replications.

Table 2
Selected Analysis of Variance Results: Average Parameter Error
Source | F or t | Parameters
Percentage missing |  | All parameters
Missing data technique |  | All parameters
Missingness mechanism | 7.78 | All parameters
Missing data technique contrasts
Ad hoc versus ML and MI | 6.35 | All parameters
Listwise versus ML and MI | 5.73 | All parameters
Listwise versus pairwise | 3.82 | All parameters
Pairwise versus ML and MI | 1.04 | All parameters
Regression versus listwise | 0.79 | All parameters
FIML versus EM algorithm | 0.06 | All parameters
ML versus MI | 0.11 | All parameters
Listwise versus ML and MI | 6.95 | Wave 1 and Wave 2 parameters, no Wave 3 (a)
Listwise versus pairwise | 4.45 | Wave 1 and Wave 2 parameters, no Wave 3
Regression versus listwise | 7.67 | Wave 2 and Wave 3 parameters, no Wave 1 (b)
Pairwise versus ML and MI | 1.69 | MAR-sensitive parameters (c)
Missingness mechanisms
MAR versus MCAR | 3.08 | MAR-sensitive parameters (c)
NMAR versus MCAR |  | NMAR-sensitive parameters (d)
NMAR versus MAR | 4.33 | MAR- and NMAR-sensitive parameters (e)
Interactions
Ad hoc × Percentage Missing | 2.13 | All parameters
Listwise versus ML and MI × Percentage Missing | 7.03 | Wave 1 and Wave 2 parameters, no Wave 3
Listwise versus Pairwise × Percentage Missing | 6.57 | Wave 1 and Wave 2 parameters, no Wave 3
Ad hoc × MAR versus MCAR | 1.65 | All parameters
Listwise versus ML and MI × MAR versus MCAR | 2.27 | MAR-sensitive and Wave 1 and Wave 2 parameters, no Wave 3 (f)
Pairwise versus ML and MI × MAR versus MCAR | 0.83 | MAR-sensitive parameters
Note. ML = maximum likelihood; MI = multiple imputation; FIML = full information maximum likelihood; EM = expectation-maximization; MAR = missing at random; MCAR = missing completely at random; NMAR = not missing at random; OC = organizational commitment; RI = reenlistment intentions.
a. Parameters OC1-RI2, RI1-RI2, OC1-OC2, OC1-RI1, and OC2-RI2.
b. Parameters RI2-OC3, RI2-RI3, OC2-RI2, and OC3-RI3.
c. Parameters OC1-RI2, OC1-OC2, and OC2-OC3.
d. Parameters OC1-OC2, OC2-OC3, OC2-RI2, RI2-OC3, and OC3-RI3.
e. Parameters OC1-OC2 and OC2-OC3.
f. Parameters OC1-RI2 and OC1-OC2.

[Table 3. Average Errors (absolute value errors) of the cross-lagged (OC1-RI2, RI2-OC3), stability (RI1-RI2, OC1-OC2, RI2-RI3, OC2-OC3), and synchronous (OC1-RI1, OC2-RI2, OC3-RI3) parameter estimates for each missing data technique (listwise deletion, pairwise deletion, regression imputation, direct ML-FIML, ML-EM algorithm, and multiple imputation), under MCAR, MAR, and NMAR, with 25%, 50%, and 75% missing at each wave. Note. N = 440 cases per complete data set. Tabulated values reflect mean parameter error across successful replications. OC = organizational commitment; RI = reenlistment intentions; MCAR = missing completely at random; MAR = missing at random; NMAR = not missing at random; ML = maximum likelihood; FIML = full information maximum likelihood; EM = expectation-maximization.]

[Table 4. Mean Standard Errors of Estimates (Monte Carlo estimates) for the same parameters, techniques, mechanisms, and levels of missingness, along with complete-data values. Note. N = 440 cases per complete data set. Tabulated values reflect standard deviations of parameter estimates across successful replications.]

Amount of Missing Data

As shown in Table 2, the more data are missing, the higher the average parameter error becomes. Twenty-five percent of data deleted at each wave results in relatively modest error for parameters in this simulation (.048 on average), whereas 50% and 75% missingness result in larger average errors (.082 and .156, respectively; see Table 6). The damaging effects of large amounts of missing data are even more pronounced when implementing ad hoc MDTs (listwise, pairwise, and stochastic regression imputation) than when implementing ML and MI techniques (interaction of percentage missing with ad hoc MDT: F = 2.13). Listwise deletion performs especially poorly when there are more data missing (average errors under listwise deletion are .05, .10, and .23 at 25%, 50%, and 75% missing, respectively). Overall, parameter errors are generally unacceptable at 75% missingness (average errors are above .1 for all parameters involving Wave 3; see Table 3).


More information

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 9. oktober 2012 Multivariate tables Agenda today Age standardization Missing data 1 2 3 4 Age standardization

More information

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Age standardization Incidence and prevalence are strongly agedependent Risks rising

More information

ANNOUNCING THE RELEASE OF LISREL VERSION BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3

ANNOUNCING THE RELEASE OF LISREL VERSION BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3 ANNOUNCING THE RELEASE OF LISREL VERSION 9.1 2 BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3 THREE-LEVEL MULTILEVEL GENERALIZED LINEAR MODELS 3 FOUR

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Missing Data. SPIDA 2012 Part 6 Mixed Models with R: The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca

More information

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background An Introduction to Multiple Imputation and its Application Craig K. Enders University of California - Los Angeles Department of Psychology cenders@psych.ucla.edu Background Work supported by Institute

More information

Latent Class Modeling as a Probabilistic Extension of K-Means Clustering

Latent Class Modeling as a Probabilistic Extension of K-Means Clustering Latent Class Modeling as a Probabilistic Extension of K-Means Clustering Latent Class Cluster Models According to Kaufman and Rousseeuw (1990), cluster analysis is "the classification of similar objects

More information

Missing Data Part 1: Overview, Traditional Methods Page 1

Missing Data Part 1: Overview, Traditional Methods Page 1 Missing Data Part 1: Overview, Traditional Methods Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 17, 2015 This discussion borrows heavily from: Applied

More information

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing Samuel Coolidge, sam.r.coolidge@gmail.com Dan Simon, des480@nyu.edu Dennis Shasha, shasha@cims.nyu.edu Technical Report

More information

Study Guide. Module 1. Key Terms

Study Guide. Module 1. Key Terms Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation

More information

Missing Not at Random Models for Latent Growth Curve Analyses

Missing Not at Random Models for Latent Growth Curve Analyses Psychological Methods 20, Vol. 6, No., 6 20 American Psychological Association 082-989X//$2.00 DOI: 0.037/a0022640 Missing Not at Random Models for Latent Growth Curve Analyses Craig K. Enders Arizona

More information

Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data

Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi 1 Introduction Imputation is the commonly used

More information

1. Estimation equations for strip transect sampling, using notation consistent with that used to

1. Estimation equations for strip transect sampling, using notation consistent with that used to Web-based Supplementary Materials for Line Transect Methods for Plant Surveys by S.T. Buckland, D.L. Borchers, A. Johnston, P.A. Henrys and T.A. Marques Web Appendix A. Introduction In this on-line appendix,

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Handling missing data for indicators, Susanne Rässler 1

Handling missing data for indicators, Susanne Rässler 1 Handling Missing Data for Indicators Susanne Rässler Institute for Employment Research & Federal Employment Agency Nürnberg, Germany First Workshop on Indicators in the Knowledge Economy, Tübingen, 3-4

More information

arxiv: v1 [stat.me] 29 May 2015

arxiv: v1 [stat.me] 29 May 2015 MIMCA: Multiple imputation for categorical variables with multiple correspondence analysis Vincent Audigier 1, François Husson 2 and Julie Josse 2 arxiv:1505.08116v1 [stat.me] 29 May 2015 Applied Mathematics

More information

An Introduction to Growth Curve Analysis using Structural Equation Modeling

An Introduction to Growth Curve Analysis using Structural Equation Modeling An Introduction to Growth Curve Analysis using Structural Equation Modeling James Jaccard New York University 1 Overview Will introduce the basics of growth curve analysis (GCA) and the fundamental questions

More information

PASS Sample Size Software. Randomization Lists

PASS Sample Size Software. Randomization Lists Chapter 880 Introduction This module is used to create a randomization list for assigning subjects to one of up to eight groups or treatments. Six randomization algorithms are available. Four of the algorithms

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

SPSS INSTRUCTION CHAPTER 9

SPSS INSTRUCTION CHAPTER 9 SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can

More information

Improving Imputation Accuracy in Ordinal Data Using Classification

Improving Imputation Accuracy in Ordinal Data Using Classification Improving Imputation Accuracy in Ordinal Data Using Classification Shafiq Alam 1, Gillian Dobbie, and XiaoBin Sun 1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand shafiq.alam@whitireia.ac.nz

More information

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT This chapter provides step by step instructions on how to define and estimate each of the three types of LC models (Cluster, DFactor or Regression) and also

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Latent Curve Models. A Structural Equation Perspective WILEY- INTERSCIENΠKENNETH A. BOLLEN

Latent Curve Models. A Structural Equation Perspective WILEY- INTERSCIENΠKENNETH A. BOLLEN Latent Curve Models A Structural Equation Perspective KENNETH A. BOLLEN University of North Carolina Department of Sociology Chapel Hill, North Carolina PATRICK J. CURRAN University of North Carolina Department

More information

Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex

Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex Keiko I. Powers, Ph.D., J. D. Power and Associates, Westlake Village, CA ABSTRACT Discrete time series

More information

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Introduction Neural networks are flexible nonlinear models that can be used for regression and classification

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

Statistical matching: conditional. independence assumption and auxiliary information

Statistical matching: conditional. independence assumption and auxiliary information Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional

More information

Missing Data in Orthopaedic Research

Missing Data in Orthopaedic Research in Orthopaedic Research Keith D Baldwin, MD, MSPT, MPH, Pamela Ohman-Strickland, PhD Abstract Missing data can be a frustrating problem in orthopaedic research. Many statistical programs employ a list-wise

More information

HANDLING MISSING DATA

HANDLING MISSING DATA GSO international workshop Mathematic, biostatistics and epidemiology of cancer Modeling and simulation of clinical trials Gregory GUERNEC 1, Valerie GARES 1,2 1 UMR1027 INSERM UNIVERSITY OF TOULOUSE III

More information

The Use of Sample Weights in Hot Deck Imputation

The Use of Sample Weights in Hot Deck Imputation Journal of Official Statistics, Vol. 25, No. 1, 2009, pp. 21 36 The Use of Sample Weights in Hot Deck Imputation Rebecca R. Andridge 1 and Roderick J. Little 1 A common strategy for handling item nonresponse

More information

Statistical Matching using Fractional Imputation

Statistical Matching using Fractional Imputation Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

PRI Workshop Introduction to AMOS

PRI Workshop Introduction to AMOS PRI Workshop Introduction to AMOS Krissy Zeiser Pennsylvania State University klz24@pop.psu.edu 2-pm /3/2008 Setting up the Dataset Missing values should be recoded in another program (preferably with

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

Robustness of Centrality Measures for Small-World Networks Containing Systematic Error

Robustness of Centrality Measures for Small-World Networks Containing Systematic Error Robustness of Centrality Measures for Small-World Networks Containing Systematic Error Amanda Lannie Analytical Systems Branch, Air Force Research Laboratory, NY, USA Abstract Social network analysis is

More information

Supplementary Figure 1. Decoding results broken down for different ROIs

Supplementary Figure 1. Decoding results broken down for different ROIs Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Lecture: Simulation. of Manufacturing Systems. Sivakumar AI. Simulation. SMA6304 M2 ---Factory Planning and scheduling. Simulation - A Predictive Tool

Lecture: Simulation. of Manufacturing Systems. Sivakumar AI. Simulation. SMA6304 M2 ---Factory Planning and scheduling. Simulation - A Predictive Tool SMA6304 M2 ---Factory Planning and scheduling Lecture Discrete Event of Manufacturing Systems Simulation Sivakumar AI Lecture: 12 copyright 2002 Sivakumar 1 Simulation Simulation - A Predictive Tool Next

More information

Bayesian Model Averaging over Directed Acyclic Graphs With Implications for Prediction in Structural Equation Modeling

Bayesian Model Averaging over Directed Acyclic Graphs With Implications for Prediction in Structural Equation Modeling ing over Directed Acyclic Graphs With Implications for Prediction in ing David Kaplan Department of Educational Psychology Case April 13th, 2015 University of Nebraska-Lincoln 1 / 41 ing Case This work

More information

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc.

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C is one of many capability metrics that are available. When capability metrics are used, organizations typically provide

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

- 1 - Fig. A5.1 Missing value analysis dialog box

- 1 - Fig. A5.1 Missing value analysis dialog box WEB APPENDIX Sarstedt, M. & Mooi, E. (2019). A concise guide to market research. The process, data, and methods using SPSS (3 rd ed.). Heidelberg: Springer. Missing Value Analysis and Multiple Imputation

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

Multiple imputation using chained equations: Issues and guidance for practice

Multiple imputation using chained equations: Issues and guidance for practice Multiple imputation using chained equations: Issues and guidance for practice Ian R. White, Patrick Royston and Angela M. Wood http://onlinelibrary.wiley.com/doi/10.1002/sim.4067/full By Gabrielle Simoneau

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Kosuke Imai Princeton University Joint work with Graeme Blair October 29, 2010 Blair and Imai (Princeton) List Experiments NJIT (Mathematics) 1 / 26 Motivation

More information

Approaches to Missing Data

Approaches to Missing Data Approaches to Missing Data A Presentation by Russell Barbour, Ph.D. Center for Interdisciplinary Research on AIDS (CIRA) and Eugenia Buta, Ph.D. CIRA and The Yale Center of Analytical Studies (YCAS) April

More information

Small Sample Robust Fit Criteria in Latent Growth Models with Incomplete Data. Dan McNeish & Jeff Harring University of Maryland

Small Sample Robust Fit Criteria in Latent Growth Models with Incomplete Data. Dan McNeish & Jeff Harring University of Maryland Small Sample Robust Fit Criteria in Latent Growth Models with Incomplete Data Dan McNeish & Jeff Harring University of Maryland Growth Models With Small Samples An expanding literature has addressed the

More information

IBM SPSS Missing Values 21

IBM SPSS Missing Values 21 IBM SPSS Missing Values 21 Note: Before using this information and the product it supports, read the general information under Notices on p. 87. This edition applies to IBM SPSS Statistics 21 and to all

More information

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

More information

[/TTEST [PERCENT={5}] [{T }] [{DF } [{PROB }] [{COUNTS }] [{MEANS }]] {n} {NOT} {NODF} {NOPROB}] {NOCOUNTS} {NOMEANS}

[/TTEST [PERCENT={5}] [{T }] [{DF } [{PROB }] [{COUNTS }] [{MEANS }]] {n} {NOT} {NODF} {NOPROB}] {NOCOUNTS} {NOMEANS} MVA MVA [VARIABLES=] {varlist} {ALL } [/CATEGORICAL=varlist] [/MAXCAT={25 ** }] {n } [/ID=varname] Description: [/NOUNIVARIATE] [/TTEST [PERCENT={5}] [{T }] [{DF } [{PROB }] [{COUNTS }] [{MEANS }]] {n}

More information

Dynamic Thresholding for Image Analysis

Dynamic Thresholding for Image Analysis Dynamic Thresholding for Image Analysis Statistical Consulting Report for Edward Chan Clean Energy Research Center University of British Columbia by Libo Lu Department of Statistics University of British

More information

An Experiment in Visual Clustering Using Star Glyph Displays

An Experiment in Visual Clustering Using Star Glyph Displays An Experiment in Visual Clustering Using Star Glyph Displays by Hanna Kazhamiaka A Research Paper presented to the University of Waterloo in partial fulfillment of the requirements for the degree of Master

More information

SAS Graphics Macros for Latent Class Analysis Users Guide

SAS Graphics Macros for Latent Class Analysis Users Guide SAS Graphics Macros for Latent Class Analysis Users Guide Version 2.0.1 John Dziak The Methodology Center Stephanie Lanza The Methodology Center Copyright 2015, Penn State. All rights reserved. Please

More information

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

ANNUAL REPORT OF HAIL STUDIES NEIL G, TOWERY AND RAND I OLSON. Report of Research Conducted. 15 May May For. The Country Companies

ANNUAL REPORT OF HAIL STUDIES NEIL G, TOWERY AND RAND I OLSON. Report of Research Conducted. 15 May May For. The Country Companies ISWS CR 182 Loan c.l ANNUAL REPORT OF HAIL STUDIES BY NEIL G, TOWERY AND RAND I OLSON Report of Research Conducted 15 May 1976-14 May 1977 For The Country Companies May 1977 ANNUAL REPORT OF HAIL STUDIES

More information

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data Ludmila Himmelspach 1 Daniel Hommers 1 Stefan Conrad 1 1 Institute of Computer Science,

More information

Faculty of Sciences. Holger Cevallos Valdiviezo

Faculty of Sciences. Holger Cevallos Valdiviezo Faculty of Sciences Handling of missing data in the predictor variables when using Tree-based techniques for training and generating predictions Holger Cevallos Valdiviezo Master dissertation submitted

More information

MAT 110 WORKSHOP. Updated Fall 2018

MAT 110 WORKSHOP. Updated Fall 2018 MAT 110 WORKSHOP Updated Fall 2018 UNIT 3: STATISTICS Introduction Choosing a Sample Simple Random Sample: a set of individuals from the population chosen in a way that every individual has an equal chance

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

Chapter 7: Dual Modeling in the Presence of Constant Variance

Chapter 7: Dual Modeling in the Presence of Constant Variance Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to

More information