Ronald H. Heck
EDEP 606 (F2015): Multivariate Methods, rev. November 16, 2015
The University of Hawaiʻi at Mānoa


In this handout, we will address a number of issues regarding missing data. It is often the case that the weakest point of a study is the quality of the data that can be brought to bear on the research problem. Missing data can be a problem in both cross-sectional and longitudinal analyses, depending upon the extent to which the data are missing and the patterns of missing data that may be present. The more the analyst can find out about why the data are "as they are," the more she or he can develop a case about the patterns of missing data, as well as a rationale about why the pattern may or may not matter. In some modeling situations, there may be considerable missing data. In this handout, we will explore some commonly accepted strategies for dealing with missing data, concentrating primarily on data preparation when there are missing values. Researchers should consider data preparation and analysis as two separate steps.

As a first step in preparing the data for analysis, it is often useful to determine the amount of missing data, as well as possible patterns of missing data that exist (e.g., For which variables do missing values occur? Are there specific patterns of missing data?). It is important to keep in mind that there is no real way to get missing data back (short of actually following up with subjects in a study). In a sense, then, we are always dealing with the problem of missing information to some extent when we use actual data. The quality of our analysis depends on the assumptions we make about the patterns of missing responses present and what is reasonable to conclude about those patterns in relation to the study's design (e.g., experimental or quasi-experimental, survey) and data collection (e.g., cross-sectional, longitudinal). What we do about the missing data therefore becomes a pressing concern.
There are a number of available strategies for dealing with missing data. Handling missing data in an appropriate manner depends first on one's knowledge of the data set and why particular values may be missing. It also helps to know what the defaults are for the different software programs that you might be using. The traditional way of handling missing data was to use listwise deletion (i.e., eliminating any case with at least one missing value), pairwise deletion (e.g., dropping a case only from those calculations, such as a particular correlation, that involve a variable on which it has missing data), mean substitution, or simple imputation. Generally, these are not considered acceptable solutions because they lead to biased parameter estimates in various situations (Enders, 2011a; Peugh & Enders, 2004). For example, listwise deletion is only valid when the data are missing completely at random (MCAR), as when selecting a random sample from a population. This is seldom the case using real data. The second step in data preparation is then to select an approach for dealing with missing data. At present, the two acceptable solutions mentioned in the literature are multiple imputation

(MI) of plausible values and full information maximum likelihood (FIML) estimation, in which the individuals with partial data are included in the analysis (Peugh & Enders, 2004). This latter approach is not generally available in SPSS, although when the data set is structured vertically (i.e., where there may be several lines per subject, as when the outcome consists of repeated measures or multiple outcomes), we can retain subjects who are missing one or more pieces of information on the outcomes. Even various regression-based approaches, such as estimating outcomes with dummy-coded missing data flags to determine whether there were differences in outcomes associated with individuals with missing versus complete data, are not optimal approaches for dealing with missing data by themselves (i.e., because they can introduce bias if data are not missing at random). They can, however, be a useful place to begin in determining whether there is likely any systematic bias due to missing data on variables used in a proposed analysis.

The FIML approach does not actually impute missing values. Instead, FIML identifies parameter values that yield the highest log likelihood using the available information (i.e., including the individuals with partial data). The estimation procedure borrows information from the available observed data to estimate model parameters, since the existing sample itself is considered as containing incomplete information about the population (Peugh & Enders, 2004). Therefore, ML estimation is not so much dependent on the amount of data available as on its distributional quality; that is, if there are unknown patterns of missing values present in the observed data, this will likely lead to some bias in estimating population parameters with the observed data. Of course, all other things being equal, having at least a certain amount of data will generally yield more accurate estimates of the population values.
The minimum number of subjects necessary, however, will differ depending on the design of the research, the size of the effects one anticipates, and the complexity of the proposed analysis.

Types of Missing Data

In general, there are three main types of missing data [e.g., see Rubin (1976, 1987) or Peugh & Enders (2004) for further discussion]. These include data that are missing completely at random (MCAR), missing at random (MAR), and non-ignorable missing (NIM), which is also referred to as missing not at random (MNAR). For data to be MCAR, strong assumptions must hold. More specifically, the missing data on a given outcome should be unrelated to a subject's standing on the outcome and to other observed or unobserved (missing) data in the analysis. Typically, this assumption is only met when the data are missing by design, as in the situation where we draw a random sample of the studied population.

For example, suppose we have a sample where 25% are missing data on an outcome such as a math score. It may be the case that students who have missing data on math are distributed relatively evenly across levels of absenteeism. As long as the missing data are spread across the distribution of a continuous covariate such as grade point average (GPA) or across the different categories of an ordinal covariate such as absenteeism, we can likely conclude the data are MAR.
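As a rough illustration of how the three types differ, the sketch below writes each one as a rule for the probability that a math score is missing. It follows the standard definitions (Rubin, 1976); the function names, the cut score of 50, and all probabilities other than the 25% base rate are hypothetical values chosen only for illustration.

```python
# Hypothetical illustration of the three missing-data mechanisms.
# Only the 0.25 base rate echoes the handout's example; all other numbers are made up.

def p_missing_mcar(math_score, absenteeism):
    """MCAR: the chance of a missing math score ignores both variables."""
    return 0.25

def p_missing_mar(math_score, absenteeism):
    """MAR: missingness depends only on an observed covariate (absenteeism coded 1-3)."""
    return {1: 0.15, 2: 0.25, 3: 0.35}[absenteeism]

def p_missing_nim(math_score, absenteeism):
    """NIM/MNAR: missingness depends on the (possibly unobserved) math score itself."""
    return 0.40 if math_score < 50 else 0.10

# Compare a low and a high scorer who share the same absenteeism level.
for name, rule in [("MCAR", p_missing_mcar), ("MAR", p_missing_mar), ("NIM", p_missing_nim)]:
    print(name, "low vs. high math at average absenteeism:", rule(40, 2), rule(80, 2))
```

Only under the NIM rule do two students with the same absenteeism level differ in their chance of a missing math score, which is the situation that is hardest to defend in an analysis.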

We would try to develop an argument that the individuals with missing data on math are relatively evenly dispersed across levels of the covariates or levels of the factors in the model. For example, for those individuals with missing data on math, but with complete data on absenteeism, we may be able to conclude that the missing data on math are relatively evenly distributed across absenteeism (coded 1 = low, 2 = average, 3 = high). This is summarized in Table 1 below.

Table 1. Absences * missing data on math crosstabulation (250 of the 1,000 math cases are missing). Columns: Not missing, Missing, Total. Rows: absences (1.00 to 3.00) and Total, each with Count and Expected Count. (Cell values were not preserved in this transcription.)

The data suggest the observed and expected values for complete versus missing data on math are pretty evenly distributed across categories of absenteeism. In fact, if we conduct a chi-square test on the distribution of complete versus missing data on math across levels of absenteeism, the difference in observed versus expected values is not significant (χ² = 0.642, df = 2, p > .05). In this first case, therefore, we would likely conclude the data are MAR. Keep in mind, individuals could have missing data on absenteeism and math scores simultaneously and still be MAR, as long as the missing data on one variable seem relatively evenly distributed across cases of the other variable. Of course, if there are a considerable number of individuals with missing data on both math and absenteeism, it may be more difficult to draw a conclusion.

Let's suppose now, however, that there are individuals in the data set who are also missing data on absenteeism. We will say there are 150 individuals with missing data on absenteeism, along with the 250 who are missing data on math. Some of the individuals will therefore likely be missing data on both math and absenteeism.
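The comparison of observed and expected counts described for Table 1 can be sketched outside SPSS as follows. This is a minimal illustration rather than the handout's procedure: the function names are hypothetical, and the counts are made up (they are not the Table 1 values) and chosen only so that 250 of 1,000 math scores are missing. For a 3 x 2 table the test has 2 degrees of freedom, and the upper-tail p-value of a chi-square with 2 df has the closed form exp(-χ²/2).

```python
import math

def chi_square_independence(table):
    """Pearson chi-square for a rows x columns table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

def p_value_df2(chi2):
    """Upper-tail p-value for a chi-square with exactly 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)

# Hypothetical counts (NOT the handout's Table 1).
# Rows: absenteeism low / average / high. Columns: math not missing / missing.
hypothetical = [[230, 70], [300, 100], [220, 80]]
chi2, df = chi_square_independence(hypothetical)
print("chi-square:", round(chi2, 3), "df:", df, "p:", round(p_value_df2(chi2), 3))

# p-value implied by the statistic reported in the text (0.642 with 2 df).
print("p for 0.642:", round(p_value_df2(0.642), 3))
```

Applied to the reported statistic of 0.642, the closed form gives a p-value of about .73, which is consistent with the reported p > .05.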
In Table 2, we have individuals with complete data on math (N = 750), but the table suggests that 118 of the individuals with complete data on math have missing data on absenteeism.

Table 2. Math categories * missing on absenteeism crosstabulation (150 of the 1,000 absenteeism cases are missing). Columns: Not missing, Missing, Total. Rows: mathcategories (1.00 to 3.00) and Total, each with Count and Expected Count. (Cell values were not preserved in this transcription.)

In this table, we can see there is a tendency for individuals with missing data on absenteeism to have lower achievement in math (i.e., with categories defined as low = 1, average = 2, and high = 3). More specifically, the expected count for low achievement is about 12 individuals with missing data on absenteeism, but among those with missing data on absenteeism, there are 18 observed individuals in the low achievement category. It turns out the chi-square statistic is not significant for these individuals in the partial data set (χ² = 4.086, df = 2, p = .130). For individuals with complete data on math, we can see there is no systematic bias in terms of missing data on absenteeism, although we noted a slight tendency for students who are missing on absenteeism to be low achieving in math. It is important to note in passing, however, that in this data set there are actually 150 individuals with data missing on absenteeism (rather than 118), so this means that potentially up to 32 individuals with missing data on absenteeism could also be low achieving in math (category 1), if their math score data were not missing. We cannot be sure about this small, unknown group of individuals.

If we now look at all the individuals in the data set (N = 1,000), including those with missing data on either absenteeism or math coded as 0, we can examine whether there is a tendency for individuals with missing data on absenteeism to have either missing data on math (which could still be MAR) or to be higher or lower achieving in math (which might then be NIM).
In Table 3, we can see that individuals with missing data on absenteeism (coded 0) are relatively evenly distributed across all math levels when we compare observed versus expected counts in each category (including the math category with missing data = 0).

Table 3. Absences * mathcategories crosstabulation. Columns: mathcategories (.00 = missing, 1.00 to 3.00) and Total. Rows: absences (.00 = missing, 1.00 to 3.00) and Total, each with Count and Expected Count. (Cell values were not preserved in this transcription.)

We can see in the first row (absences = 0) a slight tendency for individuals with missing data on absences to be low achieving in math (observed count = 20 versus expected count = 12). However, the observed versus expected counts are relatively similar for other categories of math achievement. On the other hand, for individuals missing data on math, there is no real tendency for them to be high in absenteeism (coded 3). More specifically, the expected count is 70 and the observed count is 68.

As noted, if it were the case that the probability of an individual missing data on the math outcome is related to standing on the outcome, even for individuals with the same value on a covariate (e.g., there is more missing low-math data than missing average- and high-math data among students with the same attendance level), then it is likely that the data are NIM. This does not seem to be the case in this example data set. Data that are NIM can produce more bias on model estimation than either of the other missing data situations because the missing data on the absenteeism variable are related to actual values of individual achievement for those subjects who do not take the math test. We would prefer to be able to say the pattern of missing data on math outcomes is relatively similar for students with high, average, and low absenteeism. This would then indicate data that are MAR. Although it is often reasonable to assume that data are MAR, under some circumstances this assumption may not hold (Enders, 2011a).
For example, if we again had 1,000 students who took the math test, with the same 250 having missing test data, but now determined that the majority of those 250 missing individuals also had high absenteeism (say 2/3), then we might have to acknowledge there could be some bias present. This bias might be necessary to note if we knew, for example, there was a statistically significant relationship between high absenteeism and low math achievement among those individuals with complete data. Given the large number of missing individuals with high absenteeism, then it would likely be difficult to argue that the

missing data on the student absenteeism variable would not affect the estimation of students' math test scores in the population. You can see that dealing with missing data often involves being able to make a reasonable argument about why the data are as they are and whether or not this is likely to bias the estimation of the outcomes in the analysis phase.

Other techniques have been developed for data that are NIM. For example, Enders (2011a) demonstrates the usefulness of two of the NIM approaches for longitudinal data (i.e., pattern mixture models and selection models) using a real data set. The pattern mixture model, for example, involves stratifying the sample into subgroups that share the same pattern of missing data and then estimating a separate model for each pattern of missing data.

SPSS is limited in its ability to deal with missing data appropriately. As the default, the program uses listwise deletion of cases with missing data. This means any individual with data missing on any variable will be dropped from the analysis. As an example, an individual with complete data on nine variables but missing data on the tenth variable would be dropped from the analysis. This obviously can result in a tremendous loss of data from individuals who may have largely complete data otherwise and, generally, it will lead to a loss of power to detect effects as well as potentially biased parameter estimates, since the estimates will be based on reduced (and possibly biased) information. Mean substitution treats individuals with missing data as if they were sitting on the grand mean, which may not be very likely. SPSS does provide a number of options for examining missing data in addition to listwise deletion, pairwise deletion, or mean substitution. For example, the Base Statistics program includes a basic routine to replace missing values.
This is, however, a limited routine which provides the following replacement methods: series mean, mean of nearby points, median of nearby points, linear interpolation, and linear trend at point. It is accessed from the program's toolbar, where the user can select TRANSFORM and REPLACE MISSING VALUES. It will not be appropriate to use in real research situations, however.

Next, we would make a choice about how to address the missing data in our analyses. I actually favor a process where the analyst attempts to triangulate the results with different approaches which are currently recommended for examining missing data. One possible approach is to do something like the following. First, the analyst can try running the model using listwise deletion (which assumes MCAR). This data set is likely to be considerably smaller than the data set that would include individuals with "partially complete" data (if FIML estimation were available), but it does provide the analyst with a baseline view of the relationships in the data (albeit likely a biased one) for comparing subsequent results. Second, if there is not too much missing data per variable, the listwise results can be compared against a number of complete data sets generated using a multiple imputation program, which can be applied to the existing data under the assumption the data are MAR. Third, if the analyst has access to an SEM program (like Mplus), she or he can also try estimating the model with FIML, with the cases with partial data included in the analysis, and then compare these results with the other approaches. Also, if the analysis is using a secondary (existing) data set, there may be sample weights

available. In this case, the analyst can (and should) check whether the available sample weights include adjustment for subject nonresponse. If there are such weights, they will be useful in addressing the issue of missing data.

Multiple Imputation

The only viable option, then, in SPSS is multiple imputation (MI). SPSS has a missing data module that can provide multiple imputation for missing data, but it is designed for single-level imputation only (i.e., referred to as design-based imputation). Of course, this is fine if one is not working with multilevel data (i.e., where individuals are clustered in groups such as schools, hospitals, or businesses). The available SPSS approach allows the user to generate and save a number of data sets with random values imputed for missing values. One of the advantages of MI for dealing with missing data is that the program implements random plausible values for missing cases and the user can produce several data sets, which will provide a normal distribution of imputed values. In order to use MI, however, there should be a relatively large number of individuals in the study (ideally, perhaps at least 150 or 200). Each data set developed is filled in with a different set of plausible replacement values for the missing data. Often, the advice is to create 5-10 imputed data sets, analyze them all, and obtain the average of the estimates and their adjusted standard errors across all the data sets as the final set of results. Note that some researchers suggest imputing more data sets than this.

Patterns of missing data can first be identified and then plausible values can be imputed using a three-step process. The multiple imputation data sets can be used for subsequent analysis of each data set. Finally, the parameter estimates reported are averaged over the set of imputed data sets, and the corresponding standard errors needed for hypothesis testing are computed using the Rubin formula (Rubin, 1987).
We will learn how to compute the standard errors in the next section of the handout, although SPSS can do that automatically, at least for many of the corresponding SPSS analytic procedures (Enders, 2011b), including multiple regression.

Imputing Missing Values

There are generally three phases of the MI process. First is the imputation phase, where plausible values are imputed into a number of data sets using information from variables the analyst chooses to include in the analysis. This phase is also sometimes referred to as data augmentation. Obtaining the estimates involves an iterative, two-step process. At the first (imputation) step (I-step), missing values are imputed using regression equations, and then a covariance matrix and mean vector are estimated. At the second (posterior) step (P-step), the regression coefficients are randomly perturbed in order to produce a slightly different equation for the next imputation step. This repeats until the covariance matrices from adjacent iterations differ by only a trivial amount [see Peugh and Enders (2004) for further discussion]. The process can be used to create a number of imputed data sets, where each simulates a random draw from the distribution

of plausible values for the missing data (Peugh & Enders, 2004). These can be saved as separate data sets and subsequently analyzed. The second MI phase is the analysis phase. This phase produces separate estimates for each of the data sets created. Third is the pooling phase, where the data sets are brought together and the parameters of the model under consideration are estimated across the data sets. As noted previously, in SPSS, the pooled estimates can be produced following Rubin's (1987) approach for most analytic procedures. Similarly, Mplus can also pool and analyze the data sets, producing one set of final estimates for the separate imputed data sets.

Rubin's (1987) approach involves combining the results from a data imputation analysis performed m times (i.e., once for each of m imputed data sets) to obtain a single set of results. From each analysis, one must first calculate and save the parameter estimates and standard errors. Suppose that $\hat{Q}_j$ is an estimated regression coefficient of interest (e.g., student socioeconomic status or SES) obtained from imputed data set j (j = 1, 2, ..., m) and $U_j$ is the standard error associated with $\hat{Q}_j$. The overall estimate is the average of the individual estimates:

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j \tag{1}$$

The overall standard error for $\bar{Q}$ is composed of two components. First is the within-imputation variance, and second is the between-imputation variance, which is defined as additional sampling error due to the presence of missing data. The within-imputation variance can be defined as the average squared standard error from the imputed data sets:

$$\bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j^{2} \tag{2}$$

The between-imputation variance is the variance of the estimates across the data sets:

$$B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^{2} \tag{3}$$

This represents the usual formula for sample variance. The total variance is estimated as

$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B \tag{4}$$

The overall standard error (SE) is then the square root of T.
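The four pooling quantities just described can be sketched in a few lines outside SPSS. This is a minimal illustration under stated assumptions: the function name `pool_rubin` and the input values are hypothetical, and the standard errors are squared inside the function because the within-imputation variance is defined above as the average squared standard error.

```python
import math

def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed data sets using Rubin's (1987) rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # Eq. 1: averaged estimate
    u_bar = sum(se ** 2 for se in standard_errors) / m      # Eq. 2: within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # Eq. 3: between-imputation variance
    t_var = u_bar + (1 + 1 / m) * b                         # Eq. 4: total variance
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": t_var, "se": math.sqrt(t_var)}

# Hypothetical coefficient estimates and standard errors from m = 3 imputed data sets.
pooled = pool_rubin([2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
print(pooled)
```

With these made-up inputs the within- and between-imputation variances are both 1, so the total variance is 1 + (4/3)(1), which shows how disagreement among the imputed data sets inflates the pooled standard error beyond the average of the individual standard errors.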
Significance tests of averaged individual parameters (θ) use a t distribution:

$$t = \frac{\bar{Q}}{\sqrt{T}} \tag{5}$$

where $\bar{Q}$ is the averaged estimate of the parameter and $\sqrt{T}$ is its overall standard error.

Another advantage of the MI approach is that other variables not included in the actual analysis can also be used to supply information about missing data when the assumption that the data are MAR is plausible. The variables used in the imputation phase need not be included in

the estimation of the proposed model. This can be useful when the analyst wants to use as much information as possible from other variables in the data set to impute plausible values, but the variables used may not all be relevant to the focus of the specific analysis.

In Table 4, we have a simple multiple regression model to estimate the effects of gender and socioeconomic status on student reading scores. The data are based on 20 individuals with complete data. We can see that standardized (M = 0, SD = 1) socioeconomic status (ZSES) is statistically significant in explaining reading scores (B = 3.538, p < .05), but gender is not significant (p = 0.938).

Table 4. Unstandardized estimates with complete data. Columns: B, Std. Error, t, Sig. Rows: (Intercept), ZSES, Female. (Numeric entries were not preserved in this transcription.)

In Table 5, we can estimate the same model but with missing data on ZSES for six individuals. Now, with listwise deletion in SPSS (i.e., where any individual with missing data is eliminated from the analysis), we will lose 30% of the data. We can see that estimating the model with the remaining 14 individuals will result in a model where ZSES now does not affect reading scores significantly (B = 2.989, p > .05). We can also note the estimated parameter is smaller and, because of the loss of data, the power to detect the effect if it exists in the population is also reduced. We can therefore see that listwise deletion will result in bias unless the data are MCAR. This would only generally be the case, however, if we drew a random sample from a population (i.e., where the individuals not drawn for the sample could be thought of as missing completely at random). Of course, in Table 5 the missing individuals affect all the parameters in the model, not just the one variable with missing data (ZSES).

Table 5. Unstandardized estimates with 30% missing data. (Numeric entries were not preserved in this transcription.)
B   Std. Error   t   Sig.
(Intercept)
ZSES
Female

For demonstration purposes, I first estimated the incomplete data using full information maximum likelihood (FIML), so the individuals with partial data are included. As you can see in Table 6, when the model is run using the Mplus software (which has FIML), we can retain all 20 individuals in the data set since the individuals with missing data on SES are included in the analysis. Even though the estimated ZSES parameter value (2.770) is smaller than the

corresponding estimate in Table 5 (2.989), it is statistically significant because of the increased power to detect the effect over the listwise data set. This is because all the individuals could be retained in the analysis.

Table 6. Unstandardized Mplus model results (FIML estimation with partial data*). Columns: Estimate, S.E., Est./S.E., Sig. Rows: Read on ZSES, Read on Female, Intercept of Read. *Number of observations = 20. (Numeric entries were not preserved in this transcription.)

Unfortunately, although SPSS has ML estimation available, it will not include individuals with partial data in any of its standard analytic platforms. In general, therefore, the only viable option available in SPSS for dealing with individuals with partial data is to use multiple imputation. As noted, MI will impute plausible values for variables with missing values by borrowing information from similar cases with complete data on the variables with missing values.

As a beginning point, it is often useful to gather information about whether the missing values for a variable of interest are likely MAR. In this case, the missing data are on ZSES (and not the outcome), so we might begin by creating a flag for missing SES data. We can use Transform: Recode into Different Variables to create a variable called Missing. We open the dialog box and place ZSES in it. Then we create a new variable called Missing (click Change to add it to the data set). Then click on Old and New Values. First, click on System Missing, code it as 0, and add it to the dialog box. Then click on All Other Values, code them as 1, and add this to the dialog box. Click Continue and this will create a missing value variable for ZSES.

Now we can see if there is any systematic missingness associated with ZSES with respect to the reading outcome. You will recall from the earlier discussion of types of missing data that data can still be MAR if the probability of data being missing on the outcome is related to missing data on a covariate, but not to subjects' standing on the outcome.
In this case, however, there are no missing data on the outcome, so it is easier to examine whether or not missing data on SES are related to level of reading outcomes. The multiple regression results suggest that there is no relationship between the missing data on ZSES and reading scores. More specifically, we can see that students with missing data on SES did tend to have lower reading scores, but the result was not statistically significant (p >

.05). One way the result might be statistically significant would be if more low (or high) SES students were missing. This observed result is therefore consistent with the idea that the data are likely MAR. Note that it might be possible for there to be missing data on SES that is associated with gender; that is, there can be missing values associated with one or more other predictors. The key issue is whether standing on a predictor (or predictors jointly) is associated with standing on the outcome. So, for example, if low SES students are known to be more likely to have low reading scores, and there is more missing data regarding low SES students, then it will be harder to argue that the data are truly MAR.

Table 7. Unstandardized coefficients estimating the relationship between missing values on ZSES and reading levels. Columns: B, Std. Error, t, Sig. Rows: (Constant), Female, misszses. Dependent variable: read. (Numeric entries were not preserved in this transcription.)

In this instance, however, since missing values on ZSES do not appear to predict levels of reading in Table 7, we could argue that the pattern of missing data with respect to ZSES is likely MAR with respect to reading scores.

Creating Imputed Data Sets

For demonstration purposes, next we can examine three data sets generated through multiple imputation in SPSS with the assumption that the data are MAR (recall it is suggested to create between 5 and 10 data sets as a minimum). Each of the three imputed data sets is first analyzed separately in Table 8.

Table 8. Unstandardized estimates for 3 imputed data sets and averaged estimates. Columns: Parameter, Coefficient, SE, t, Sig. (Numeric entries were not preserved in this transcription.)
Data Set 1: ZSES, Female
Data Set 2: ZSES, Female

Data Set 3: ZSES, Female
Imputed Estimates: ZSES, Female

The results across the three data sets indicate that ZSES is statistically significant in explaining reading levels (p < .05). Female is not statistically significant in any of the three data sets (p > .05). You can see the sizes of the parameter estimates differ considerably (due to the small sample size). Keep in mind that if I imputed the data a second time, I would obtain different estimates, since the program selects random plausible values each time for ZSES based on other available information. The last estimates in the table are the averaged imputed estimates for the three data sets.

Obtaining the Correct Averaged Estimates for SPSS Using Rubin's Method

At the last step, we would provide the results of our analysis, which incorporates our approach for dealing with missing data. Here is the series of steps to use in order to take the estimates from each imputed data set in Table 8 and develop the correct set of averaged imputed estimates with corrected standard errors for the imputed data sets.

Find the average of the estimates for ZSES and female in the imputed data sets (Eq. 1).
   For ZSES: (sum of the three ZSES estimates)/3 = 2.892
   For female: (sum of the three female estimates, one of which is -.267)/3 = 0.118

Obtain the within-imputation variance (Eq. 2).
   For ZSES: (sum of the three squared standard errors)/3 = 3.36/3 = 1.12
   For female: 23.54/3 = 7.85

Obtain the between-imputation variance (Eq. 3).
   For ZSES: (sum of the three squared deviations from the average estimate)/2 = 0.57/2 = 0.28
   For female: 4.91/2 = 2.46

Obtain the total variance (Eq. 4), where 1 + 1/3 = 1.33.
   For ZSES: 1.12 + (1.33)(0.28) = 1.49
   For female: 7.85 + (1.33)(2.46)

= 11.12

The standard error is estimated as the square root of the total variance.
   For ZSES: √1.49 = 1.221
   For female: √11.12 = 3.335

Finally, we can construct a t-ratio from the ratio of the average estimate to its corrected standard error (Eq. 5).
   For ZSES: 2.892/1.221 = 2.369
   For female: 0.118/3.335 = 0.035

You can see this information in the t-ratio column for imputed estimates in Table 8. You can then estimate the p-value from a table of t-scores (or use a t-score conversion tool obtained online).

We can compare the averaged results obtained using SPSS against the Mplus results in Table 9, which use Rubin's (1987) approach. We can see the estimates of the averaged regression coefficients in Mplus are only slightly different from the SPSS averaged coefficients. For example, the SPSS unstandardized estimate for ZSES (2.892) differs only slightly from the Mplus unstandardized estimate. The standard error for ZSES in SPSS is a little larger (1.221) than the Mplus estimate (1.162). In contrast, the calculated standard error for female is slightly smaller (3.335) than the Mplus estimate (3.390). These small differences are likely due to slightly different estimation procedures. When Mplus standard errors for each data set are used and the standard error adjustments for ZSES and female are calculated by hand, they agree with the Mplus output.

Table 9. Mplus unstandardized results (ML estimation). Columns: Estimate, S.E., Est./S.E., Sig. Rows: Read on Female, Read on ZSES. (Numeric entries were not preserved in this transcription.)

Producing Pooled Estimates in SPSS

We can compare our estimates of the averaged effects at the bottom of Table 8 against SPSS pooled estimates in Table 10. SPSS places the original data and the imputed data sets in one (stacked) file. A variable called Imputation_ is used to differentiate the original data (0) from the successive imputed data sets (1, 2, ..., n). In this example, we will impute 3 data sets, but generally the recommendation is for probably 10 to 20 data sets (the default is 5).

The imputation was requested with the following syntax:

    MULTIPLE IMPUTATION female Zses read
      /IMPUTE METHOD=AUTO NIMPUTATIONS=3 MAXPCTMISSING=NONE
      /MISSINGSUMMARIES NONE
      /IMPUTATIONSUMMARIES MODELS
      /OUTFILE IMPUTATIONS='C:\Users\COE Staff\Documents\My Files\EDEP 606\Imputed3.sav'.

Following is the output regarding the 3 imputations in SPSS. The default imputation method was used, which develops the imputed data sets based on scanning the measurement scales of the data with missing values (e.g., continuous, ordinal, binary). As shown, ZSES was the only variable among the variables used in the imputation process that had missing data.

Imputation Specifications
    Imputation Method: Automatic
    Number of Imputations: 3
    Model for Scale Variables: Linear Regression
    Interactions Included in Models: (none)
    Maximum Percentage of Missing Values: 100.0%
    Maximum Number of Parameters in Imputation Model: 100

Imputation Results
    Imputation Method: Monotone
    Fully Conditional Specification Method Iterations: n/a
    Dependent Variables
        Imputed: Zses
        Not Imputed (Too Many Missing Values):
        Not Imputed (No Missing Values): female, read
    Imputation Sequence: female, read, Zses

The results below suggest that there were six missing values and that 18 values were imputed (6 missing cases x 3 imputed data sets).

Imputation Models
    Variable   Model Type          Model Effects   Missing Values   Imputed Values
    Zses       Linear Regression   female, read    6                18

After the multiple imputation is conducted, and before actually analyzing the data using multiple regression, it is necessary to use the SPLIT FILE option in SPSS (DATA: SPLIT FILE). Splitting the file by imputation lets us examine the original data set with missing data against the imputed data sets and run a pooled analysis with one set of estimates based on Rubin's (1987) techniques discussed earlier in the chapter. We select Compare Groups. The imputed data have already been sorted. In the SPSS syntax below, we first provide a statement to split the file layered by the imputation variable (Imputation_). Afterward, we can conduct the multiple regression analysis.

    SPLIT FILE LAYERED BY Imputation_.
    REGRESSION
      /MISSING LISTWISE
      /STATISTICS COEFF OUTS R ANOVA
      /CRITERIA=PIN(.05) POUT(.10)
      /NOORIGIN
      /DEPENDENT read
      /METHOD=ENTER female Zses.

Examining the Parameter Estimates for 3 Imputed Data Sets in SPSS

The results in Table 10 across the three data sets indicate that ZSES is statistically significant in explaining reading levels (p < .05). Female is not statistically significant in any of the three data sets (p > .05). The parameter estimates differ considerably across the data sets, which reflects the small sample size. Keep in mind that if I imputed the data a second time, I would obtain different estimates, since the program selects random plausible values for ZSES each time based on the other available information. The last estimates in the table are the averaged imputed ones for the three data sets. You can compare the pooled output produced by SPSS in the table below to the estimates I calculated by hand in Table 8. The pooled estimates for female and ZSES are quite similar to the ones calculated by hand.
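The hand calculation referred to above can also be checked with a few lines of code. The following is a minimal Python sketch (Python is not used elsewhere in this handout, and the names `total_variance` and `handout_values` are illustrative only). It applies the total-variance, standard-error, and t-ratio steps to the rounded averages and variances reported in the hand calculation. Because the handout rounds at each step, the output should be expected to agree with the hand results only to about two decimal places.

```python
import math

M = 3  # number of imputed data sets in the example


def total_variance(within, between, m=M):
    """Rubin's total variance: within + (1 + 1/m) * between."""
    return within + (1 + 1 / m) * between


# Rounded values taken from the hand calculation in this handout:
# (average estimate, within-imputation variance, between-imputation variance)
handout_values = {
    "ZSES": (2.892, 1.12, 0.28),
    "female": (0.118, 7.85, 2.46),
}

pooled = {}
for name, (estimate, within, between) in handout_values.items():
    variance = total_variance(within, between)
    se = math.sqrt(variance)  # corrected standard error
    pooled[name] = {"variance": variance, "se": se, "t": estimate / se}

for name, stats in pooled.items():
    print(f"{name}: total variance = {stats['variance']:.2f}, "
          f"SE = {stats['se']:.3f}, t = {stats['t']:.3f}")
```

With m = 3, the multiplier is 1 + 1/3, which the hand calculation rounds to 1.33. The sketch keeps the unrounded multiplier, which is one reason the third decimal place can differ slightly.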

Table 9. Intercepts and Unstandardized Coefficients (a)
Columns: Imputation_, Model, B, Std. Error, t, Sig.
Rows: (Constant), female, and Zses for Imputation_ = 0 (original data), 1, 2, 3, and Pooled
[Cell values not preserved.]
a. Dependent Variable: read

Below in Table 10 are the estimates of the model R-square statistics for the data set with missing values and the three imputed data sets. You can see there is variability across the imputed data sets.

Table 10. Model Summary
Columns: Imputation_, Model, R, R Square, Adjusted R Square, Std. Error of the Estimate
[Cell values not preserved.]
a. Predictors: (Constant), Zses, female

Finally, we have the overall ANOVA results for each imputed data set in Table 11. The F-tests suggest variability across the separate imputed data sets in how well the overall model accounts for significant variance in reading scores.

Table 11. ANOVA Results (a)
Columns: Imputation_, Model, Sum of Squares, df, Mean Square, F, Sig.
Rows: Regression (b), Residual, and Total for Imputation_ = 0, 1, 2, and 3
[Cell values not preserved.]
a. Dependent Variable: read
b. Predictors: (Constant), Zses, female

Missing Data in Vertical Format in SPSS Using Mixed Modeling

As noted previously, at present SPSS does not support FIML estimation in situations where there may be observations missing, as is found in typical SEM software programs. However, where one can arrange the data vertically (e.g., where a single individual has repeated observations that occupy several rows in the data set), only the particular row with missing information will be dropped if the missing value is on the dependent variable. If covariates are missing, however, the subject will be listwise deleted, which will likely introduce some bias into the analysis. Besides repeated measures data, arranging the outcome data vertically can also be useful in situations where an analyst may wish to examine several univariate outcomes (e.g., individual results on reading, math, and language tests). If there are considerable missing data on each outcome, treating the outcomes as multivariate (i.e., with vertical arrangement of the data at level 1) can result in keeping most of the individuals with partial data, since only cases where data are missing on all three tests will be dropped. Keeping participants with partial data is important for justifying the MAR assumption. Where MAR can be supported, this should lead to estimates that are not biased (Hox, 2010).

Here is a simple illustration of how this works with longitudinal data. In Table 10, there are 3 repeated reading measures per individual and 4 individuals. We can see a different pattern of missing data for each individual. SPSS can handle different patterns of missing data (i.e., missing on the first occasion, the second or third occasion, or several occasions) and different amounts of missing data.

Table 10: Vertical Data Format
Columns: Subject, Time, Score
[Cell values not preserved.]

Some individuals in the table have no missing observations, some have missing data on one occasion, and some have missing data on two occasions. As long as Y is not missing on all occasions, the program will come up with an "estimated" growth over each time interval, as well as an initial status (intercept) estimate, even though the initial data point is missing for subject 4. In Table 11, closer inspection of the data using Missing Values Analysis (MVA) in SPSS confirms that individuals have missing data on the outcome. We can also see that of the 12 total data lines, only 8 are present (so 33% of the data are missing).

Table 11. Univariate Statistics
Columns: N, Mean, Std. Deviation, Missing (Count, Percent)
Rows: time, score
[Cell values not preserved.]

Next, in Table 12, we can see that there is more missing data at the second time interval than at either the first or third interval. We can also see that if we used only the 8 lines of data, the grand mean over occasions would be [value not preserved]. This is not as relevant since we are estimating growth over time, rather than one grand mean, but it makes the point that we are losing one-third of the data.

Table 12. Tabulated Patterns
Columns: Number of Cases, Missing Patterns (a) (time, score), Complete if ... (b), score (c), time (d)
[Cell values not preserved.]
a. Variables are sorted on missing patterns.
b. Number of complete cases if variables missing in that pattern (marked with X) are not used.
c. Means at each unique pattern
d. Frequency distribution at each unique pattern

Below in Table 13 is the Model Dimension table, which is part of the SPSS output from estimating this model. It shows that all four subjects are retained in the analysis. This can be important information for analysts concerning how many individuals in the data set are actually being included in the analysis. It is important to be able to accept that the data are MAR, since the maximum likelihood estimates in the MIXED procedure in SPSS depend on this assumption in order to be unbiased. This is why being able to include all individuals with partial or complete data is generally important for accepting this assumption as valid. The more individuals who are dropped from the analysis, the harder it is to make the case that the data are indeed MAR.

Table 13. Model Dimension (a)
                               Number of   Covariance   Number of    Subject     Number of
                               Levels      Structure    Parameters   Variables   Subjects
    Fixed Effects     Intercept   1                        1
                      Time        1                        1
    Repeated Effects  Time        3         Diagonal       3          Subject     4
    Total                         5                        5
a. Dependent Variable: score.

Importantly, in Table 14, the estimates for students' initial intercept score and their change over each interval of time are summarized. The initial status intercept (5.21), which describes the mean when Time = 0, is estimated from the available information, and all four individuals are kept in the analysis, rather than only those with complete data.
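To make the difference between retaining partial data and listwise deletion concrete, here is a small Python sketch. The scores are hypothetical (the values from the vertical data table are not reproduced here), and the variable names are illustrative only. The missing-data pattern was chosen to match the description in the text: 4 subjects, 3 occasions, 12 data lines with 8 present, more missing data at the second occasion, and subject 4 missing the first occasion.

```python
# Hypothetical long-format ("vertical") data: one row per subject per occasion.
# Each row is (subject, time, score); None marks a missing score.
rows = [
    (1, 0, 4.0), (1, 1, 6.0), (1, 2, 8.0),    # complete
    (2, 0, 5.0), (2, 1, None), (2, 2, 9.0),   # missing on one occasion
    (3, 0, 6.0), (3, 1, None), (3, 2, None),  # missing on two occasions
    (4, 0, None), (4, 1, 7.0), (4, 2, 9.0),   # missing the first occasion
]

observed = [row for row in rows if row[2] is not None]
pct_missing = 100 * (len(rows) - len(observed)) / len(rows)

# Vertical arrangement: a subject stays in the analysis as long as
# at least one score is observed.
retained_vertical = {subject for subject, _, _ in observed}

# Listwise deletion (as in a wide-format repeated measures analysis):
# a subject stays in only if every occasion is observed.
n_occasions = len({time for _, time, _ in rows})
scores_per_subject = {}
for subject, _, _ in observed:
    scores_per_subject[subject] = scores_per_subject.get(subject, 0) + 1
retained_listwise = {
    subject for subject, n in scores_per_subject.items() if n == n_occasions
}

print(f"data lines present: {len(observed)} of {len(rows)}")
print(f"percent missing: {pct_missing:.0f}%")
print(f"subjects retained (vertical): {sorted(retained_vertical)}")
print(f"subjects retained (listwise): {sorted(retained_listwise)}")
```

The sketch only counts which rows and subjects would be available. It does not fit the growth model itself, which in this handout is done with the MIXED procedure in SPSS.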
If we used RM ANOVA instead, only one individual in this small data set would be retained! This makes the point about the importance of retaining partial data. Of course, if the missing data were also on a predictor (say, gender or SES), then we would also have to investigate possible effects of missing data due to the predictors (as well as due to the outcome).

Table 14. Estimates of Fixed Effects (a)
Columns: Parameter, Estimate, Std. Error, df, t, Sig.
Rows: Intercept, Time
[Cell values not preserved, apart from the intercept estimate quoted in the text.]
a. Dependent Variable: score.

Notice there is one more interesting piece of information in this table. SPSS uses adjusted degrees of freedom, which is something like the relative sample size, in conducting hypothesis tests (i.e., regarding the statistical significance of parameters). You can see that the relative sample size is larger in estimating the effect of growth over each interval of time than in estimating the initial status intercept. This is because there are more data points that can be used to estimate the change over time than to estimate the initial status intercept (see Table 12).

Summary

Much of our discussion suggests that dealing with missing data is not so much about "How much missing data is allowable?" but, rather, about how to develop a process to deal with the missing data. It is incumbent on researchers to be aware of how missing data will affect the analysis. We can definitely improve the quality of our analyses by giving attention to missing data in the preliminary phase of preparing the data for analysis. Even relatively small amounts of missing data on one or more variables can create some bias in the estimated parameters, so it is important to assess what this likely parameter bias might be, and then develop some type of strategy to address the problem (e.g., multiple imputation, ways to retain individuals with partial data, analyses under various conditions with a comparison of the results).
I have one handout (which I did not include in this chapter) showing that even with small amounts of missing data (i.e., fewer than 60 missing cases) and over 6,500 individuals in the data set, each of three analytic approaches I compared made use of differing amounts of data. This was not a problem in such a large sample; however, it illustrates the point that different analytic approaches can make use of differing amounts of data. It becomes important to know how many cases are being included in an analysis. In smaller data sets, of course, the problem of missing data could be considerably magnified.

Class Activity

A researcher is interested in examining whether treatment (coded 1) or control group (coded 0) membership is related to knowledge acquisition in math. Students (N = 40) were randomly assigned to treatment or control conditions. They were also assessed in terms of their prior knowledge (pretest). Unfortunately, there is missing data on both the knowledge posttest and the pretest. The data set used for this activity is ch8missingdataactivity.sav

1. Determine how much missing data there is on the two variables of concern and whether missing data on the posttest tends to be associated with group membership and missing data on the pretest.

2. Impute three data sets. Present the average results across the three data sets with the standard errors for group and pretest adjusted for variance in the imputing process.

3. After you obtain your averaged results, calculate the t-ratio and determine whether the variables in the model are statistically significant at the p = .05 level.

You may want to begin by estimating a regression model with the listwise data, just to see where you are starting out. Then you can create dichotomous missing variables for the posttest and for the pretest. Finally, you can examine whether group membership and missing values on the pretest tend to predict missingness on the dependent variable (you can use logistic regression (REGRESSION: Binary logistic) to do this).

Remember that the more important part of missing data analysis is whether standing on the independent variable is related to standing on the dependent variable. When the missing data is confined to the predictor, it is a bit easier to check whether missing data on, for example, the pretest is related to lower scores on the posttest. So one place we can start is by examining whether the predictor is related to higher or lower values of the outcome (which in this case we expect for both group and pretest) and then whether it is related to missing data on the outcome in a systematic way for similar standing on the predictor (i.e., a statistically significant relationship).
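As a rough illustration of the dichotomous missing variables described above, here is a short Python sketch. The eight records are hypothetical and are not the data in ch8missingdataactivity.sav. The names `miss_pre`, `miss_post`, and `missing_rate` are illustrative only. Instead of the logistic regression suggested for the activity, the sketch simply compares the proportion of missing posttest scores across groups, which is a cruder first look at the same question.

```python
# Hypothetical records: group (1 = treatment, 0 = control), pretest, posttest.
# None marks a missing value.
records = [
    {"group": 1, "pretest": 12.0, "posttest": 18.0},
    {"group": 1, "pretest": 15.0, "posttest": 21.0},
    {"group": 1, "pretest": None, "posttest": 17.0},
    {"group": 1, "pretest": 10.0, "posttest": None},
    {"group": 0, "pretest": 11.0, "posttest": 13.0},
    {"group": 0, "pretest": 9.0, "posttest": None},
    {"group": 0, "pretest": None, "posttest": None},
    {"group": 0, "pretest": 14.0, "posttest": 15.0},
]


def add_missing_flags(record):
    """Return a copy of the record with 0/1 missing-data indicators."""
    flagged = dict(record)
    flagged["miss_pre"] = int(record["pretest"] is None)
    flagged["miss_post"] = int(record["posttest"] is None)
    return flagged


def missing_rate(rows, flag, by, level):
    """Proportion of rows with flag == 1 among rows where rows[by] == level."""
    subset = [row for row in rows if row[by] == level]
    return sum(row[flag] for row in subset) / len(subset)


flagged = [add_missing_flags(record) for record in records]

for level in (0, 1):
    rate = missing_rate(flagged, "miss_post", "group", level)
    print(f"group = {level}: proportion missing on posttest = {rate:.2f}")

for level in (0, 1):
    rate = missing_rate(flagged, "miss_post", "miss_pre", level)
    print(f"miss_pre = {level}: proportion missing on posttest = {rate:.2f}")
```

A difference in these proportions is only descriptive. Whether it is statistically significant would still need a test such as the logistic regression mentioned in the activity.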
Recall that for non-ignorable missing (NIM) data the key is whether the probability of missing on the outcome is related to standing on the outcome, even for individuals with the same value on a covariate. So, for example, if individuals in the control group were responsible for 2/3 of the missing data and we know they have lower scores, it would be harder to argue that the greater missing data in that group is not biasing the results for the overall population estimates. Similarly, if most of the missing data were for low pretest scores, this might affect the overall estimates of the learning at the end. In most cases, this rests on mounting an argument about why the data are as they are and whether this likely has a non-ignorable effect on the outcomes.

References

Enders, C. (2011a). Missing not at random models for latent growth curve analysis. Psychological Methods, 16.

Enders, C. K. (2011b). Analysis of missing data. Workshop at BYU, June 2-3.

Hox, J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). NY: Routledge.

Peugh, J. & Enders, C. (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research, 74.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: J. Wiley & Sons.


More information

Descriptives. Graph. [DataSet1] C:\Documents and Settings\BuroK\Desktop\Prestige.sav

Descriptives. Graph. [DataSet1] C:\Documents and Settings\BuroK\Desktop\Prestige.sav GET FILE='C:\Documents and Settings\BuroK\Desktop\Prestige.sav'. DESCRIPTIVES VARIABLES=prestige education income women /STATISTICS=MEAN STDDEV MIN MAX. Descriptives Input Missing Value Handling Resources

More information

Chapter 1. Using the Cluster Analysis. Background Information

Chapter 1. Using the Cluster Analysis. Background Information Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,

More information

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis Applied Mathematical Sciences, Vol. 5, 2011, no. 57, 2807-2818 Simulation Study: Introduction of Imputation Methods for Missing Data in Longitudinal Analysis Michikazu Nakai Innovation Center for Medical

More information

Regression. Notes. Page 1 25-JAN :21:57. Output Created Comments

Regression. Notes. Page 1 25-JAN :21:57. Output Created Comments /STATISTICS COEFF OUTS CI(95) R ANOVA /CRITERIA=PIN(.05) POUT(.10) /DEPENDENT Favorability /METHOD=ENTER zcontemp ZAnxious6 zallcontact. Regression Notes Output Created Comments Input Missing Value Handling

More information

Introduction. About this Document. What is SPSS. ohow to get SPSS. oopening Data

Introduction. About this Document. What is SPSS. ohow to get SPSS. oopening Data Introduction About this Document This manual was written by members of the Statistical Consulting Program as an introduction to SPSS 12.0. It is designed to assist new users in familiarizing themselves

More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics PASW Complex Samples 17.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use

More information

Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

Simulation of Imputation Effects Under Different Assumptions. Danny Rithy Simulation of Imputation Effects Under Different Assumptions Danny Rithy ABSTRACT Missing data is something that we cannot always prevent. Data can be missing due to subjects' refusing to answer a sensitive

More information

PASW Missing Values 18

PASW Missing Values 18 i PASW Missing Values 18 For more information about SPSS Inc. software products, please visit our Web site at http://www.spss.com or contact SPSS Inc. 233 South Wacker Drive, 11th Floor Chicago, IL 60606-6412

More information

Missing Data Part 1: Overview, Traditional Methods Page 1

Missing Data Part 1: Overview, Traditional Methods Page 1 Missing Data Part 1: Overview, Traditional Methods Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 17, 2015 This discussion borrows heavily from: Applied

More information

7.4 Tutorial #4: Profiling LC Segments Using the CHAID Option

7.4 Tutorial #4: Profiling LC Segments Using the CHAID Option 7.4 Tutorial #4: Profiling LC Segments Using the CHAID Option DemoData = gss82.sav After an LC model is estimated, it is often desirable to describe (profile) the resulting latent classes in terms of demographic

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

HANDLING MISSING DATA

HANDLING MISSING DATA GSO international workshop Mathematic, biostatistics and epidemiology of cancer Modeling and simulation of clinical trials Gregory GUERNEC 1, Valerie GARES 1,2 1 UMR1027 INSERM UNIVERSITY OF TOULOUSE III

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

Statistical Good Practice Guidelines. 1. Introduction. Contents. SSC home Using Excel for Statistics - Tips and Warnings

Statistical Good Practice Guidelines. 1. Introduction. Contents. SSC home Using Excel for Statistics - Tips and Warnings Statistical Good Practice Guidelines SSC home Using Excel for Statistics - Tips and Warnings On-line version 2 - March 2001 This is one in a series of guides for research and support staff involved in

More information

Statistics: Normal Distribution, Sampling, Function Fitting & Regression Analysis (Grade 12) *

Statistics: Normal Distribution, Sampling, Function Fitting & Regression Analysis (Grade 12) * OpenStax-CNX module: m39305 1 Statistics: Normal Distribution, Sampling, Function Fitting & Regression Analysis (Grade 12) * Free High School Science Texts Project This work is produced by OpenStax-CNX

More information

Product Catalog. AcaStat. Software

Product Catalog. AcaStat. Software Product Catalog AcaStat Software AcaStat AcaStat is an inexpensive and easy-to-use data analysis tool. Easily create data files or import data from spreadsheets or delimited text files. Run crosstabulations,

More information

Source df SS MS F A a-1 [A] [T] SS A. / MS S/A S/A (a)(n-1) [AS] [A] SS S/A. / MS BxS/A A x B (a-1)(b-1) [AB] [A] [B] + [T] SS AxB

Source df SS MS F A a-1 [A] [T] SS A. / MS S/A S/A (a)(n-1) [AS] [A] SS S/A. / MS BxS/A A x B (a-1)(b-1) [AB] [A] [B] + [T] SS AxB Keppel, G. Design and Analysis: Chapter 17: The Mixed Two-Factor Within-Subjects Design: The Overall Analysis and the Analysis of Main Effects and Simple Effects Keppel describes an Ax(BxS) design, which

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

Lab #9: ANOVA and TUKEY tests

Lab #9: ANOVA and TUKEY tests Lab #9: ANOVA and TUKEY tests Objectives: 1. Column manipulation in SAS 2. Analysis of variance 3. Tukey test 4. Least Significant Difference test 5. Analysis of variance with PROC GLM 6. Levene test for

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Multiple imputation using chained equations: Issues and guidance for practice

Multiple imputation using chained equations: Issues and guidance for practice Multiple imputation using chained equations: Issues and guidance for practice Ian R. White, Patrick Royston and Angela M. Wood http://onlinelibrary.wiley.com/doi/10.1002/sim.4067/full By Gabrielle Simoneau

More information

Study Guide. Module 1. Key Terms

Study Guide. Module 1. Key Terms Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation

More information

Panel Data 4: Fixed Effects vs Random Effects Models

Panel Data 4: Fixed Effects vs Random Effects Models Panel Data 4: Fixed Effects vs Random Effects Models Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised April 4, 2017 These notes borrow very heavily, sometimes verbatim,

More information

Missing Not at Random Models for Latent Growth Curve Analyses

Missing Not at Random Models for Latent Growth Curve Analyses Psychological Methods 20, Vol. 6, No., 6 20 American Psychological Association 082-989X//$2.00 DOI: 0.037/a0022640 Missing Not at Random Models for Latent Growth Curve Analyses Craig K. Enders Arizona

More information

Analysis of Complex Survey Data with SAS

Analysis of Complex Survey Data with SAS ABSTRACT Analysis of Complex Survey Data with SAS Christine R. Wells, Ph.D., UCLA, Los Angeles, CA The differences between data collected via a complex sampling design and data collected via other methods

More information

ASSOCIATION BETWEEN VARIABLES: CROSSTABULATIONS

ASSOCIATION BETWEEN VARIABLES: CROSSTABULATIONS POLI 300 Handouts #10 Fall 2006 ASSOCIATION BETWEEN VARIABLES: CROSSTABULATIONS Suppose we want to do research on the following bivariate hypothesis: the more interested people are in politics, the more

More information

STATISTICS (STAT) Statistics (STAT) 1

STATISTICS (STAT) Statistics (STAT) 1 Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

More information

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques SEVENTH EDITION and EXPANDED SEVENTH EDITION Slide - Chapter Statistics. Sampling Techniques Statistics Statistics is the art and science of gathering, analyzing, and making inferences from numerical information

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

The following procedures and commands, are covered in this part: Command Purpose Page

The following procedures and commands, are covered in this part: Command Purpose Page Some Procedures in SPSS Part (2) This handout describes some further procedures in SPSS, following on from Part (1). Because some of the procedures covered are complex, with many sub-commands, the descriptions

More information

Descriptive Statistics, Standard Deviation and Standard Error

Descriptive Statistics, Standard Deviation and Standard Error AP Biology Calculations: Descriptive Statistics, Standard Deviation and Standard Error SBI4UP The Scientific Method & Experimental Design Scientific method is used to explore observations and answer questions.

More information

For our example, we will look at the following factors and factor levels.

For our example, we will look at the following factors and factor levels. In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball

More information

CHAPTER 18 OUTPUT, SAVEDATA, AND PLOT COMMANDS

CHAPTER 18 OUTPUT, SAVEDATA, AND PLOT COMMANDS OUTPUT, SAVEDATA, And PLOT Commands CHAPTER 18 OUTPUT, SAVEDATA, AND PLOT COMMANDS THE OUTPUT COMMAND OUTPUT: In this chapter, the OUTPUT, SAVEDATA, and PLOT commands are discussed. The OUTPUT command

More information

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017 Machine Learning in the Wild Dealing with Messy Data Rajmonda S. Caceres SDS 293 Smith College October 30, 2017 Analytical Chain: From Data to Actions Data Collection Data Cleaning/ Preparation Analysis

More information

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 9. oktober 2012 Multivariate tables Agenda today Age standardization Missing data 1 2 3 4 Age standardization

More information

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Age standardization Incidence and prevalence are strongly agedependent Risks rising

More information

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999.

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999. 2 Schafer, J. L., Graham, J. W.: (2002). Missing Data: Our View of the State of the Art. Psychological methods, 2002, Vol 7, No 2, 47 77 Rosner, B. (2005) Fundamentals of Biostatistics, 6th ed, Wiley.

More information

The linear mixed model: modeling hierarchical and longitudinal data

The linear mixed model: modeling hierarchical and longitudinal data The linear mixed model: modeling hierarchical and longitudinal data Analysis of Experimental Data AED The linear mixed model: modeling hierarchical and longitudinal data 1 of 44 Contents 1 Modeling Hierarchical

More information

Multidimensional Latent Regression

Multidimensional Latent Regression Multidimensional Latent Regression Ray Adams and Margaret Wu, 29 August 2010 In tutorial seven, we illustrated how ConQuest can be used to fit multidimensional item response models; and in tutorial five,

More information

Chapter 15 Mixed Models. Chapter Table of Contents. Introduction Split Plot Experiment Clustered Data References...

Chapter 15 Mixed Models. Chapter Table of Contents. Introduction Split Plot Experiment Clustered Data References... Chapter 15 Mixed Models Chapter Table of Contents Introduction...309 Split Plot Experiment...311 Clustered Data...320 References...326 308 Chapter 15. Mixed Models Chapter 15 Mixed Models Introduction

More information

Teaching students quantitative methods using resources from the British Birth Cohorts

Teaching students quantitative methods using resources from the British Birth Cohorts Centre for Longitudinal Studies, Institute of Education Teaching students quantitative methods using resources from the British Birth Cohorts Assessment of Cognitive Development through Childhood CognitiveExercises.doc:

More information

Condence Intervals about a Single Parameter:

Condence Intervals about a Single Parameter: Chapter 9 Condence Intervals about a Single Parameter: 9.1 About a Population Mean, known Denition 9.1.1 A point estimate of a parameter is the value of a statistic that estimates the value of the parameter.

More information

LISREL 10.1 RELEASE NOTES 2 1 BACKGROUND 2 2 MULTIPLE GROUP ANALYSES USING A SINGLE DATA FILE 2

LISREL 10.1 RELEASE NOTES 2 1 BACKGROUND 2 2 MULTIPLE GROUP ANALYSES USING A SINGLE DATA FILE 2 LISREL 10.1 RELEASE NOTES 2 1 BACKGROUND 2 2 MULTIPLE GROUP ANALYSES USING A SINGLE DATA FILE 2 3 MODELS FOR GROUPED- AND DISCRETE-TIME SURVIVAL DATA 5 4 MODELS FOR ORDINAL OUTCOMES AND THE PROPORTIONAL

More information

THE ANALYSIS OF CONTINUOUS DATA FROM MULTIPLE GROUPS

THE ANALYSIS OF CONTINUOUS DATA FROM MULTIPLE GROUPS THE ANALYSIS OF CONTINUOUS DATA FROM MULTIPLE GROUPS 1. Introduction In practice, many multivariate data sets are observations from several groups. Examples of these groups are genders, languages, political

More information

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT This chapter provides step by step instructions on how to define and estimate each of the three types of LC models (Cluster, DFactor or Regression) and also

More information