Missing Data Part II: Multiple Imputation & Maximum Likelihood


Richard Williams, University of Notre Dame
Last revised February 12, 2017

Warning: I teach about Multiple Imputation with some trepidation. You should know what it is and at least have reading competency with it. However, I have seen people try incredibly complicated imputation models before they have a lot of other basics down. For many/most purposes, at least for the work typically done in this class, listwise deletion is fine and MI adds little. Some people say to not even consider MI unless at least 15% or 20% of your data are missing. For your own papers, if you use it at all, MI should probably be one of the last things you do, rather than the first. And, if you do want to seriously use it, you should do a lot more reading than is in these notes. Some additional online sources (as of January 27, 2017) for information on MI are [links not preserved] (This is especially good)

I. Advanced methods: Maximum Likelihood Estimation and Multiple Imputation.

Allison concludes that, of the conventional methods listed in Part I, listwise deletion often works the best. However, he argues that, under certain conditions, Maximum Likelihood Methods and Multiple Imputation Methods can work better.

As Newman (2003, p. 334) notes, "MI [multiple imputation] is a procedure by which missing data are imputed several times (e.g. using regression imputation) to produce several different complete-data estimates of the parameters. The parameter estimates from each imputation are then combined to give an overall estimate of the complete-data parameters as well as reasonable estimates of the standard errors." Maximum Likelihood (ML) approaches operate by estimating a set of parameters that maximize the probability of getting the data that was observed (Newman, p. 332).

Allison argues that, while Maximum Likelihood techniques may be superior when they are available, either the theory or the software needed to estimate them is often lacking.
Therefore this handout will primarily focus on multiple imputation. However, if you are primarily interested in linear regression models, you may prefer ML to MI. Appendix D briefly discusses ML.

In a 2000 Sociological Methods and Research paper entitled "Multiple Imputation for Missing Data: A Cautionary Tale", Allison summarizes the basic rationale for multiple imputation:

Multiple imputation (MI) appears to be one of the most attractive methods for general-purpose handling of missing data in multivariate analysis. The basic idea, first proposed by Rubin (1977) and elaborated in his (1987) book, is quite simple:

1. Impute missing values using an appropriate model that incorporates random variation.
2. Do this M times producing M complete data sets.
3. Perform the desired analysis on each data set using standard complete-data methods.
4. Average the values of the parameter estimates across the M samples to produce a single point estimate.
5. Calculate the standard errors by (a) averaging the squared standard errors of the M estimates, (b) calculating the variance of the M parameter estimates across samples, and (c) combining the two quantities using a simple formula.
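Steps 4 and 5 can be written out in a few lines. The sketch below uses the usual form of Rubin's rules, in which the total variance is the within-imputation variance plus (1 + 1/M) times the between-imputation variance. Allison's summary only says "a simple formula", so treat the exact expression as my assumption about what he means. The function name pool_rubin and the numbers are hypothetical; this is not Stata's code, and it ignores the degrees-of-freedom adjustments that mi estimate reports.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool M complete-data estimates of a single parameter (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need M >= 2 estimates with matching standard errors")
    pooled = mean(estimates)                      # step 4: average the M point estimates
    within = mean(se ** 2 for se in std_errors)   # step 5(a): average squared standard error
    between = variance(estimates)                 # step 5(b): variance of the M estimates (divisor M - 1)
    total = within + (1 + 1 / m) * between        # step 5(c): assumed combining formula
    return pooled, math.sqrt(total)


# Hypothetical coefficient estimates and standard errors from M = 5 imputed data sets
estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
std_errors = [0.05, 0.05, 0.05, 0.05, 0.05]

pooled_est, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled_est:.4f}, pooled SE = {pooled_se:.4f}")
```

Note that the pooled standard error comes out larger than the 0.05 reported within each data set. That extra amount is the between-imputation variability, i.e. the uncertainty due to not knowing the missing values.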

Allison adds that multiple imputation has several desirable features:

• Introducing appropriate random error into the imputation process makes it possible to get approximately unbiased estimates of all parameters. No deterministic imputation method can do this in general settings.
• Repeated imputation allows one to get good estimates of the standard errors. Single imputation methods don't allow for the additional error introduced by imputation (without specialized software of very limited generality).

With regards to the assumptions needed for MI, Allison says that

• First, the data must be missing at random (MAR), meaning that the probability of missing data on a particular variable Y can depend on other observed variables, but not on Y itself (controlling for the other observed variables).
  o Example: Data are MAR if the probability of missing income depends on marital status, but within each marital status, the probability of missing income does not depend on income; e.g. single people may be more likely to be missing data on income, but low income single people are no more likely to be missing income than are high income single people.
• Second, the model used to generate the imputed values must be correct in some sense.
• Third, the model used for the analysis must match up, in some sense, with the model used in the imputation.

The problem is that it's easy to violate these conditions in practice. There are often strong reasons to suspect that the data are not MAR. Unfortunately, not much can be done about this. While it's possible to formulate and estimate models for data that are not MAR, such models are complex, untestable, and require specialized software. Hence, any general-purpose method will necessarily invoke the MAR assumption.

We now show some of the ways Stata can handle multiple imputation problems.

II. Using Stata 11 or higher for Multiple Imputation for One Variable

This example is adapted from pages 1-14 of the Stata 12 Multiple Imputation Manual (which I highly recommend reading) and also quotes directly from the Stata 12 online help. If you have Stata 11 or higher the entire manual is available as a PDF file. This is a simple example and there are other commands and different ways to do multiple imputation, so you should do a lot more reading if you want to use MI yourself.

NOTE: This example focuses on using regress to impute missing values for a single continuous variable. Appendix A shows other examples, such as logit and mlogit for categorical variables. It also shows how to use Predictive Mean Matching (PMM), a sometimes attractive alternative to regress for continuous variables with missing data. Appendix B shows how to do multiple imputation when more than one variable has missing data. Appendix C shows roughly how multiple imputation works its magic. Appendix D discusses Full Information Maximum Likelihood, which is a great alternative to MI in those situations where it works.

The file mheart0.dta is a fictional data set with 154 cases, 22 of which are missing data on bmi (Body Mass Index). The dependent variable for this example is attack, coded 0 if the subject did not have a heart attack and 1 if he or she did.

. version
. * Imputation for a single continuous variable using regress
. webuse mheart0, clear
(Fictional heart attack data; bmi missing)
. sum
[summarize output for attack, smokes, age, bmi, female, hsgrad, marstatus, alcohol and hightar; the numeric columns (Obs, Mean, Std. Dev., Min, Max) were not preserved]

. mi set mlong

[From the Stata 12 online help:] mi set is used to set a regular Stata dataset to be an mi dataset. An mi set dataset has the following attributes:

• The data are recorded in a style: wide, mlong, flong, or flongsep.
• Variables are registered as imputed, passive, or regular, or they are left unregistered.
• In addition to m=0, the data with missing values, the data include M>=0 imputations of the imputed variables.

For this example, the Stata 12 Manual says we choose to use the data in the marginal long style (mlong) because it is a memory-efficient style. Type help mi styles for more details.

. mi register imputed bmi
(22 m=0 obs. now marked as incomplete)
. mi register regular attack smokes age hsgrad female

• An imputed variable is a variable that has missing values and for which you have or will have imputations. All variables whose missing values are to be filled in must be registered as imputed variables.
• A passive variable (not used in this example) is a variable that is a function of imputed variables (e.g. an interaction effect) or of other passive variables. A passive variable will have missing values in m=0 (the original data set) and varying values for observations in m>0 (the imputed data sets).
• A regular variable is a variable that is neither imputed nor passive and that has the same values, whether missing or not, in all m; registering regular variables is optional but recommended.

In the above, we are telling Stata that the values of bmi will be imputed while the values of the other variables will not be.

. mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2232)

Univariate imputation            Imputations = 20
Linear regression                      added = 20
Imputed: m=1 through m=20            updated = 0

[Observations per m for bmi (complete, incomplete, imputed, total): values not preserved]
(complete + incomplete = total; imputed is the minimum across m of the number of filled-in observations.)

The mi impute command fills in missing values (.) of a single variable or of multiple variables using the specified method. In this case, the use of regress means use a linear regression for a continuous variable; i.e. bmi is being regressed on attack, smokes, age, hsgrad and female. The Stata 12 manual includes guidelines for choosing variables to include in the imputation model. One of the most common/important recommendations is that the analytic model and the imputation model should be congenial, i.e. the imputation model should include the same variables (including the dependent variable) that are in the analytic model; otherwise relationships with the variables that have been omitted will be biased toward 0.

Other methods include logit, ologit and mlogit, e.g. you would use logit if you had a binary variable you wanted to impute values for.

The add option specifies the number of imputations, in this case 20. (Stata recommends using at least 20 although it is not unusual to see as few as 5.) The rseed option sets the random number seed, which makes results reproducible (different seeds will produce different imputed data sets).

Case 8 is the first case with missing data on bmi, so let's see what happens to it after imputation:

. list bmi attack smokes age hsgrad female _mi_id _mi_miss _mi_m if _mi_id ==
[listing of bmi, attack, smokes, age, hsgrad, female, _mi_id, _mi_miss and _mi_m for this case: values not preserved]

bmi is missing in the original unimputed data set (_mi_m = 0). For each of the 20 imputed data sets, a different value has been imputed for bmi. The imputation of multiple plausible values will let the estimation procedure take into account the fact that the true value is unknown and hence uncertain.

The Stata 12 Manual recommends checking to see whether the imputations appear reasonable. In this case we do so by running the mi xeq command, which executes command(s) on individual imputations. Specifically, we run the summarize command on the original data set (m = 0) and on the (arbitrarily chosen) first and last imputed data sets. The means and standard deviations for bmi are all similar and seem reasonable in this case:

. mi xeq : summarize bmi
[summarize bmi output for the m=0, m=1 and m=20 data (Obs, Mean, Std. Dev., Min, Max): values not preserved]

The mi estimate command does estimation using multiple imputations. The desired analysis is done on each imputed data set and the results are then combined into a single multiple-imputation result (the dots option just tells Stata to print a dot after each estimation; it helps you track progress and an X gets printed out if there is a problem doing one of the estimations):

. mi estimate, dots: logit attack smokes age bmi hsgrad female

Imputations (20): done

Multiple-imputation estimates      Imputations   = 20
Logistic regression                Number of obs = 154
DF adjustment: Large sample
Model F test: Equal FMI            F( 5, ) = 3.74
Within VCE type: OIM

[Average RVI, Largest FMI, DF, Prob > F and the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for smokes, age, bmi, hsgrad, female and _cons: values not preserved]

Note that you don't always get the same information as you do with non-imputed data sets (e.g. Pseudo R2), partly because these things don't always make sense with imputed data or because it is not clear how to compute them.

Compare this to the results when we only analyze the original unimputed data:

. mi xeq 0: logit attack smokes age bmi hsgrad female, nolog

m=0 data:
-> logit attack smokes age bmi hsgrad female, nolog

Logistic regression                Number of obs = 132

[LR chi2(5), Prob > chi2, Log likelihood, Pseudo R2 and the coefficient table (Coef., Std. Err., z, P>|z|, 95% Conf. Interval) for smokes, age, bmi, hsgrad, female and _cons: values not preserved]

The most striking difference is that the effect of age is statistically significant in the imputed data, whereas it wasn't in the original data set.

III. Estimating adjusted predictions and marginal effects after using multiple imputation.

The margins command does not work after using mi estimate. Daniel Klein's user-written mimrgns (available from SSC) does. While mimrgns can be extremely helpful, it is also a "use at your own risk" sort of routine. Be sure to carefully read the help file first, which warns that "There might be good reasons why margins does not work after mi estimate." The help further warns that, if you also use marginsplot, the DF and confidence intervals may be a little off (which may be a reason for not including the CIs when using marginsplot). In an email to me, Klein further warned that he was not sure whether predicted probabilities at fixed values qualify for pooling according to Rubin's rules.

Having said all that, mimrgns may be as good as it gets for now if you want to use both multiple imputation and adjusted predictions/marginal effects. The help file includes links that explain the approach mimrgns uses.

Here is an example (thanks to both Christopher Quiroz and Daniel Klein for helping come up with this). On the mimrgns command, note the use of predict(pr) to get predicted probabilities (otherwise you would get log odds), and the cmdmargins option, which is needed if you also want to use marginsplot.

. * Use mimrgns -- but with caution
. mi estimate, dots: logit attack i.smokes age bmi i.hsgrad i.female

Imputations (20): done

Multiple-imputation estimates      Imputations   = 20
Logistic regression                Number of obs = 154
DF adjustment: Large sample
Model F test: Equal FMI            F( 5, ) = 3.74
Within VCE type: OIM

[Average RVI, Largest FMI, DF (only the truncated fragments "avg = 115," and "max = 287," survive), Prob > F and the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for 1.smokes, age, bmi, hsgrad, female and _cons: values not preserved]

. mimrgns smokes, at (age = (20 (10) 90)) predict(pr) cmdmargins vsquish

Imputations (20): done

Multiple-imputation estimates      Imputations   = 20
Predictive margins                 Number of obs = 154
DF adjustment: Large sample        DF: max = 2.11e+07
Within VCE type: Delta-method

[Average RVI, Largest FMI and the remaining DF values: not preserved]

Expression : Pr(attack), predict(pr)
1._at : age = 20
2._at : age = 30
3._at : age = 40
4._at : age = 50
5._at : age = 60
6._at : age = 70
7._at : age = 80
8._at : age = 90

[Margins table (Margin, Std. Err., t, P>|t|, 95% Conf. Interval) by _at#smokes: values not preserved]

. marginsplot, noci scheme(sj) name(mimrgnsplot)
Variables that uniquely identify margins: age smokes

[Figure: Predictive Margins of smokes. Vertical axis: Pr(Attack); horizontal axis: Age, in years; separate lines for smokes=0 and smokes=1]

IV. Already existing MI data sets.

If you are lucky, somebody else may have already done the imputation for you (although it is possible that you might do even better since you know what variables are in your analytic models); and if you are super-lucky, the MI data will already be in Stata format. If not, you'll have to convert it to Stata yourself. The mi import command may be useful for this purpose. Once the data are in Stata format, the mi describe command can be used to provide a detailed report. Using the above data,

. mi describe

  Style:  mlong
  Obs.:   complete     132
          incomplete    22  (M = 20 imputations)
          total        154
  Vars.:  imputed:  1; bmi(22)
          passive:  0
          regular:  5; attack smokes age hsgrad female
          system:   3; _mi_m _mi_id _mi_miss
          (there are 3 unregistered variables; marstatus alcohol hightar)

V. Other comments on multiple imputation

Imputation is pretty easy when only one variable has missing data. It can get more complicated in the more typical case when several variables have missing data. Again, this handout is just a brief introduction; read the manual and some related articles if you want to use multiple imputation in your own analyses.

Random number generator. Stata's random number generator has changed across versions, so even if you do specify rseed you may not get identical results, e.g. some results I got using Stata 11 were not the same as results I got using Stata 12. Using version control should keep things consistent. For more, see help version and, possibly (for Stata 14+), help set rng.

Soft versus hard missing data codes. Stata has soft missing codes (coded as .) and hard missing codes (.a, .b, .c, ..., .z). The former are eligible for imputation, the latter are not. This distinction can be useful when variables should not be imputed, e.g. Number of times pregnant is not applicable for men; either code it as zero or leave it as missing. Depending on the nature of the variable, you may need to change some soft codes to hard or hard codes to soft.
Otherwise you may fail to impute values when you should or else impute values when you shouldn't. As stated before, you need to understand why data are missing.

Multiple imputation on the dependent variable. Multiple imputation on the independent variables can be good because it lets you use the non-missing information on the other independent variables. Multiple imputation of the dependent variable, however, tends to gain you little or nothing. (One possible exception is when you have auxiliary variables that are strongly correlated with the dependent variable, e.g. r = .5 or greater, such as the same variable measured at different points in time.) Of course, the dependent variable in one part of the analysis may be an independent variable in a different part, so you may go ahead and do the imputation on the variable anyway.

Other programs for multiple imputation. User-written programs like ice and mim can also be used for imputation and estimation. I think Stata 12 largely eliminates the need for those programs. But even if you have Stata 12, the articles that have been written about these programs may be helpful to you in understanding how the ICE method works. Note: In a 2017 Statalist discussion, some people claimed that ice worked better in some situations, e.g. when mlogit was being used as the imputation method (but they also expressed concern that they weren't sure ice was giving correct results in these situations). If you are having trouble with mi impute, you may wish to look at [link not preserved]

Passive imputation versus just another variable (JAV) approach. Passive imputation is somewhat controversial. With passive imputation, you would, for example, impute values for x1 and x2, and then multiply those values together to create the interaction term x1x2. The alternative is to multiply x1 * x2 before imputation, and then impute values for the resulting x1x2 interaction term, i.e. the just another variable (JAV) approach. Perhaps surprisingly, some people (including Paul Allison) claim that the JAV approach is superior. The issue was discussed on Stata List in February [year not preserved]. If interested, see [links not preserved]

In the latter message, Paul Allison says "In multiple imputation, interactions should be imputed as though they are additional variables, not constructed by multiplying imputed values. The same is true if you have x and x^2 in a model. The x^2 term should be imputed just like any other variable, not constructed by squaring the imputed values of x. While this principle may seem counterintuitive, it is easily demonstrated by simulation that the more "natural" way to do it produces biased estimates."
For more good discussion of JAV vs Passive Imputation, as well as several other issues, see

White, Ian R., Royston, Patrick, Wood, Angela M. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine. [year and pages not preserved]

NOTE: There is at least one exception. Suppose you are trying to compute a scale that is the sum of several items. In an email to me, Allison said "It's better, when possible, to impute at the item level rather than the scale level. Otherwise you lose a lot of data. This is one case where JAV doesn't apply."

Appendix A: More Examples of Multiple Imputation for a Single Variable

These examples (and much of the text) are pretty much copied straight from the Stata 12 or 13 Multiple Imputation Manual. Read the manual for more details. Further, multiple methods can be used if you specify mi impute chained (see Appendix B). Read the manual if you want to get into other methods or more complicated imputations. I will either go over these quickly or not at all in class.

PMM: Predictive Mean Matching. PMM is an alternative to regress when imputing values for continuous variables. It may be preferable to linear regression when the normality of the variable is suspect (which is likely the case with BMI).

The basic idea is that you again use regression methods to come up with an estimate of the missing value for variable X. However, rather than use that estimate, you identify one or more neighbors who have similar estimated values. (Note that it is the estimated value for the neighbor, not the neighbor's observed value.) The observed value of the nearest neighbor (or the randomly chosen nearest neighbor) is then used for the imputed value for the case with missing data on X.

So, for example, suppose that case 8 is missing on X, and that the complete case whose estimated value of X is closest to the estimate for case 8 has an estimated value of 18.73 and an observed value of 20. Twenty will be used as the imputed value of X for case 8. (If the nearest neighbor was a big outlier, e.g. an estimated value close to that of case 8 but an observed value of 50, you would still use the observed value of 50 as the imputed value.) Or, if you have specified, say, 5 nearest neighbors, one of them will be chosen at random and their observed value on X will be used as the imputed value for case 8.

In other words, the method identifies neighbors who have complete data and whose estimated values on X are close to the estimated value for the person with incomplete data. One of these neighbors is chosen as a donor, and the donor's observed value on the variable replaces the recipient's missing value.

You have to choose how many neighbors are to be used. If you only choose 1, your MI estimates may be highly variable from one imputation to the next. Including too many neighbors may bias your point estimates. In other words there is a tradeoff between biased estimators and estimators that have larger standard errors. The Stata Manual seems to use 1, 3 or 5 neighbors in its examples.

Here is an example from the manual. It uses the same data we used in our earlier example but uses PMM instead of regress to impute values for BMI.

. webuse mheart0, clear
(Fictional heart attack data; bmi missing)
. mi set mlong
. mi register imputed bmi
(22 m=0 obs. now marked as incomplete)
. mi impute pmm bmi attack smokes age hsgrad female, add(20) knn(5) rseed(2232)

Univariate imputation            Imputations = 20
Predictive mean matching               added = 20
Imputed: m=1 through m=20            updated = 0

[Nearest neighbors and Observations per m for bmi (Complete, Incomplete, Imputed, Total): values not preserved]
(complete + incomplete = total; imputed is the minimum across m of the number of filled-in observations.)
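To make the donor idea concrete, here is a minimal sketch of a single PMM imputation pass. It is a simplification under stated assumptions and not what mi impute pmm actually runs: it uses one predictor only, and it skips the step in which full implementations also draw the regression coefficients at random for each imputation. The function names (fit_line, pmm_impute) and the toy age/bmi values are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, knn=5, rng=None):
    """Return a copy of ys with each missing value (None) replaced by a donor's observed value."""
    rng = rng or random.Random()
    observed = [i for i, y in enumerate(ys) if y is not None]
    intercept, slope = fit_line([xs[i] for i in observed], [ys[i] for i in observed])
    predicted = [intercept + slope * x for x in xs]   # estimated values for every case
    completed = list(ys)
    for i, y in enumerate(ys):
        if y is None:
            # Neighbors are the complete cases whose ESTIMATED values are closest
            nearest = sorted(observed, key=lambda j: abs(predicted[j] - predicted[i]))[:knn]
            donor = rng.choice(nearest)
            completed[i] = ys[donor]                  # the donor's OBSERVED value is imputed
    return completed


# Hypothetical toy data: bmi is missing for the third and sixth cases
age = [25, 30, 35, 40, 45, 50, 55, 60]
bmi = [21.0, 22.5, None, 24.0, 26.5, None, 27.0, 29.5]

completed = pmm_impute(age, bmi, knn=3, rng=random.Random(2232))
print(completed)
```

Because every imputed value is copied from a donor, the imputations can never fall outside the range of the observed data, which is part of the appeal of PMM when normality is doubtful.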

As the Stata Manual explains, "By default, mi impute pmm uses one nearest neighbor to draw from. That is, it replaces missing values with an observed value whose linear prediction is the closest to that of the missing value. Using only one nearest neighbor may result in high variability of the MI estimates. You can increase the number of nearest neighbors from which the imputed value is drawn by specifying the knn() option." In the example above I told Stata to select a donor from the 5 nearest neighbors. If you look at the imputed values, you may even be able to figure out who the donor was (e.g. if the imputed value for case 8 is 20 and case 47 is the only case with an observed value of 20, then case 47 must be the donor).

. mi estimate: logit attack smokes age bmi hsgrad female

Multiple-imputation estimates      Imputations   = 20
Logistic regression                Number of obs = 154
DF adjustment: Large sample
Model F test: Equal FMI            F( 5, ) = 3.63
Within VCE type: OIM

[Average RVI, Largest FMI, DF, Prob > F and the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for smokes, age, bmi, hsgrad, female and _cons: values not preserved]

While PMM may be superior to regress in some cases, it barely matters here. Recall that this is what we got earlier when we used regress to impute the values of BMI:

Multiple-imputation estimates      Imputations   = 20
Logistic regression                Number of obs = 154
DF adjustment: Large sample
Model F test: Equal FMI            F( 5, ) = 3.74
Within VCE type: OIM

[Average RVI, Largest FMI, DF, Prob > F and the coefficient table for smokes, age, bmi, hsgrad, female and _cons: values not preserved]

I suppose if you were really worried about whether pmm or regress was most appropriate, you could try both and see if it makes much difference.

Logit. Logit imputation is used when the variable with missing data has only two possible values, 0 and 1. In this example, hsgrad (coded 1 if high school graduate, 0 otherwise) has the missing data.

. webuse mheart2, clear
(Fictional heart attack data; hsgrad missing)
. mi set mlong
. * This will show us how much missing data, and the ranges of observed values
. mi misstable summarize
[misstable output for hsgrad (Obs=., Obs>., Obs<., Unique values, Min, Max): values not preserved]
. mi register imputed hsgrad
(18 m=0 obs. now marked as incomplete)
. mi impute logit hsgrad attack smokes age bmi female, add(10) rseed(2232)

Univariate imputation            Imputations = 10
Logistic regression                    added = 10
Imputed: m=1 through m=10

[updated and Observations per m for hsgrad (Complete, Incomplete, Imputed, Total): values not preserved]
(complete + incomplete = total; imputed is the minimum across m of the number of filled-in observations.)

. * Estimates before imputation
. mi xeq 0: logit attack smokes age bmi female hsgrad, nolog

m=0 data:
-> logit attack smokes age bmi female hsgrad

Logistic regression                Number of obs = 136

[LR chi2(5), Prob > chi2, Log likelihood, Pseudo R2 and the coefficient table (Coef., Std. Err., z, P>|z|, 95% Conf. Interval) for smokes, age, bmi, female, hsgrad and _cons: values not preserved]

. * Estimates after imputation
. mi estimate: logit attack smokes age bmi female hsgrad

Multiple-imputation estimates      Imputations   = 10
Logistic regression                Number of obs = 154
DF adjustment: Large sample        DF: avg = 7.02e+07, max = 2.75e+08
Model F test: Equal FMI            F( 5, ) = 3.85
Within VCE type: OIM

[Average RVI, Largest FMI, DF min, Prob > F and the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for smokes, age, bmi, female, hsgrad and _cons: values not preserved]

mlogit. Multinomial logit can be used when a variable is nominal and has more than 2 categories. Marital Status (1 = single, 2 = married, 3 = divorced) is the missing data victim this time.

. webuse mheart3, clear
(Fictional heart attack data; marstatus missing)
. mi set mlong
. mi misstable summarize
[misstable output for marstatus (Obs=., Obs>., Obs<., Unique values, Min, Max): values not preserved]
. mi register imputed marstatus
(7 m=0 obs. now marked as incomplete)
. mi impute mlogit marstatus attack smokes age bmi female hsgrad, add(20) rseed(2232)

Univariate imputation            Imputations = 20
Multinomial logistic regression        added = 20
Imputed: m=1 through m=20

[updated and Observations per m for marstatus (Complete, Incomplete, Imputed, Total): values not preserved]
(complete + incomplete = total; imputed is the minimum across m of the number of filled-in observations.)

. * Estimates before imputation
. mi xeq 0: logit attack smokes age bmi female hsgrad i.marstatus

m=0 data:
-> logit attack smokes age bmi female hsgrad i.marstatus

Logistic regression                Number of obs = 147

[Iteration log, LR chi2(7), Prob > chi2, Log likelihood, Pseudo R2 and the coefficient table (Coef., Std. Err., z, P>|z|, 95% Conf. Interval) for smokes, age, bmi, female, hsgrad, marstatus and _cons: values not preserved]

. * Estimates after imputation
. mi estimate: logit attack smokes age bmi female hsgrad i.marstatus

Multiple-imputation estimates      Imputations   = 20
Logistic regression                Number of obs = 154
DF adjustment: Large sample
Model F test: Equal FMI            F( 7, ) = 3.14
Within VCE type: OIM

[Average RVI, Largest FMI, DF, Prob > F and the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for smokes, age, bmi, female, hsgrad, marstatus and _cons: values not preserved]

ologit. The ordered logistic regression imputation method can be used to fill in missing values of an ordinal variable (e.g. the variable is coded high, medium, low; Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree; Poor, Fair, Good, Excellent).

. use [file location not preserved] clear
(Fictional heart attack data; alcohol missing)
. tabulate alcohol, missing
[tabulation of "Alcohol consumption: none, <2 drinks/day, >=2 drinks/day" with rows Do not drink, Less than 3 drinks/day, Three or more drinks/day and Total; Freq., Percent and Cum. values not preserved]
. mi set mlong
. mi register imputed alcohol
(9 m=0 obs. now marked as incomplete)
. mi impute ologit alcohol attack smokes age bmi female hsgrad, ///
>     add(10) rseed(2232)

Univariate imputation            Imputations = 10
Ordered logistic regression            added = 10
Imputed: m=1 through m=10

[updated and Observations per m for alcohol (Complete, Incomplete, Imputed, Total): values not preserved]
(complete + incomplete = total; imputed is the minimum across m of the number of filled-in observations.)

. mi estimate: logit attack smokes age bmi female hsgrad i.alcohol

Multiple-imputation estimates      Imputations   = 10
Logistic regression                Number of obs = 154
DF adjustment: Large sample        DF: avg = 3.73e+07, max = 1.35e+08
Model F test: Equal FMI            F( 7, ) = 2.79
Within VCE type: OIM

[Average RVI, Largest FMI, DF min, Prob > F and the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for smokes, age, bmi, female, hsgrad, alcohol (Less than 3 drinks/day, Three or more drinks/day) and _cons: values not preserved]

Poisson & Nbreg: Count variables. mi impute poisson fills in missing values of a count variable using a Poisson regression imputation method. mi impute nbreg fills in missing values of an overdispersed count variable using a negative binomial regression imputation method (and will usually be better than the Poisson method). This won't mean a lot to you unless/until you have some background in categorical data analysis. For now, we will just briefly note the following:

• Variables that count the # of times something happens are common in the Social Sciences.
  o Hausman looked at effect of R & D expenditures on # of patents received by US companies
  o Grogger examined deterrent effects of capital punishment on daily homicides
  o King examined effect of # of alliances on the # of nations at war
  o Long looked at # of publications of scientists
• Count variables are often treated as though they are continuous and the linear regression model is applied; but this can result in inefficient, inconsistent and biased estimates.
• Fortunately, there are many models that deal explicitly with count outcomes. These include the Poisson and the (usually superior) Negative Binomial Regression method.

These examples illustrate another feature you can use when imputing: conditional imputation. In this example, most men are coded zero for number of pregnancies they have had. But 7 men, and 3 women, have missing values on the pregnancy variable. Imputing values for men would be a bit silly, as we can be cautiously optimistic that the true value for men on # of pregnancies is zero. With the conditional imputation procedure used below, the 7 men with missing values get assigned zero while the value for # of pregnancies is imputed for the 3 women with missing values.

. *Poisson
. use [file location not preserved] clear
(Fictional heart attack data; npreg missing)
. misstable summarize
[misstable output for npreg (Obs=., Obs>., Obs<., Unique values, Min, Max): values not preserved]
. tab2 female npreg, missing
-> tabulation of female by npreg
[cross-tabulation of Gender (Male, Female, Total) by Number of pregnancies: values not preserved]
. mi set mlong
. mi register imputed npreg
(10 m=0 obs. now marked as incomplete)

. mi impute poisson npreg attack smokes age bmi hsgrad, ///
>         add(20) conditional(if female==1) rseed(2232)

Univariate imputation                       Imputations =       20
Poisson regression                                added =       20
Imputed: m=1 through m=20                       updated =        0

Conditional imputation:
  npreg: incomplete out-of-sample obs. replaced with value 0

             |              Observations per m
             |----------------------------------------------
    Variable |   Complete   Incomplete   Imputed |     Total
-------------+-----------------------------------+----------
       npreg |                                   |
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

. mi estimate: logit attack smokes age bmi female hsgrad npreg

Multiple-imputation estimates                   Imputations       =         20
Logistic regression                             Number of obs     =        154
                                                Average RVI       =
                                                Largest FMI       =
DF adjustment:   Large sample                   DF:     min       =
                                                        avg       =   6.31e+09
                                                        max       =   2.15e+10
Model F test:       Equal FMI                   F(   6,        )  =       3.20
Within VCE type:          OIM                   Prob > F          =

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
         age |
         bmi |
      female |
      hsgrad |
       npreg |
       _cons |
------------------------------------------------------------------------------

. *nbreg
. use clear
(Fictional heart attack data; npreg missing)

. mi set mlong

. mi register imputed npreg
(10 m=0 obs. now marked as incomplete)

. mi impute nbreg npreg attack smokes age bmi hsgrad, ///
>         add(20) conditional(if female==1) rseed(2232)

Univariate imputation                       Imputations =       20
Negative binomial regression                      added =       20
Imputed: m=1 through m=20                       updated =        0

Dispersion: mean

Conditional imputation:
  npreg: incomplete out-of-sample obs. replaced with value 0

             |              Observations per m
             |----------------------------------------------
    Variable |   Complete   Incomplete   Imputed |     Total
-------------+-----------------------------------+----------
       npreg |                                   |
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

. mi estimate: logit attack smokes age bmi female hsgrad npreg

Multiple-imputation estimates                   Imputations       =         20
Logistic regression                             Number of obs     =        154
                                                Average RVI       =
                                                Largest FMI       =
DF adjustment:   Large sample                   DF:     min       =
                                                        avg       =   2.26e+10
                                                        max       =   9.44e+10
Model F test:       Equal FMI                   F(   6,        )  =       3.23
Within VCE type:          OIM                   Prob > F          =

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
         age |
         bmi |
      female |
      hsgrad |
       npreg |
       _cons |
------------------------------------------------------------------------------
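The practical difference between the two imputation models used above lies in the variance each assumes for the count. A brief sketch, with notation assumed here rather than taken from these notes: \mu_i is the conditional mean of the count for case i and \alpha \ge 0 is the overdispersion parameter. The negative binomial form shown is the mean-dispersion parameterization, which is what the "Dispersion: mean" line in the nbreg output refers to.

```latex
\begin{align*}
\text{Both models:} \quad & E(y_i \mid x_i) = \mu_i = \exp(x_i\beta) \\
\text{Poisson:} \quad & \operatorname{Var}(y_i \mid x_i) = \mu_i \\
\text{Negative binomial:} \quad & \operatorname{Var}(y_i \mid x_i) = \mu_i\,(1 + \alpha\,\mu_i) = \mu_i + \alpha\,\mu_i^{2}
\end{align*}
```

Setting \alpha = 0 reduces the negative binomial variance to the Poisson variance, which is why the negative binomial method is the safer choice when counts are more spread out than a Poisson model allows.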

Appendix B: Using Stata 12+ for Multiple Imputation for Multiple Variables

Stata 12 introduced several new procedures and commands for multiple imputation. Among these is the mi impute chained command, which supports multivariate imputation using chained equations (ICE). ICE uses iterative procedures to impute missing values when more than one variable is missing. These variables can be of different types, e.g. they might be binary, ordinal or continuous. Variables can have an arbitrary missing-data pattern. mi impute chained has numerous options, and Stata warns that you should do checks to make sure the imputation is working correctly. I am just going to give a simple example adapted from the Stata Manual; you should read the whole manual and/or related literature if you want to do a more detailed analysis of your own.

NOTE: Other commands for imputing multiple variables include mi impute monotone and mi impute mvn. While these can be as good as (or even better than) mi impute chained, the assumptions required to use these commands are often violated. mi impute mvn may be good if all your imputed variables happen to be continuous, e.g. you don't need to impute any dichotomies, but in practice you will often have mixed types of variables to impute.

First, we retrieve another version of the fictitious heart attack data, in which some data are missing for bmi and age.

. webuse mheart8s0, clear
(Fictional heart attack data; bmi and age missing; arbitrary pattern)

. mi describe

  Style:  mlong
          last mi update 25mar :00:38, 122 days ago

  Obs.:   complete          118
          incomplete         36  (M = 0 imputations)
          ---------------------
          total             154

  Vars.:  imputed:  2; bmi(28) age(12)
          passive:  0
          regular:  4; attack smokes female hsgrad
          system:   3; _mi_m _mi_id _mi_miss
         (there are no unregistered variables)

The above shows that the data have previously been mi set in mlong format. bmi and age have previously been specified as variables whose missing values are to be imputed. bmi has 28 missing cases; age has 12. M = 0 means that no imputed data sets have been computed yet, i.e. you just have the original data.

. mi misstable patterns, frequency

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Frequency |  1  2
  ------------+-------------
          118 |  1  1
              |
           24 |  1  0
            8 |  0  1
            4 |  0  0
  ------------+-------------
          154 |

  Variables are  (1) age  (2) bmi

In the above table, a value of 1 indicates not missing and 0 indicates missing. So, we see that there are 118 cases with non-missing values on both age and bmi. Another 24 cases are missing bmi but not age, 8 cases are missing age but not bmi, and 4 cases have missing data on both age and bmi.

Next we impute missing values using the mi impute chained command.

. mi impute chained (regress) bmi age = attack smokes hsgrad female, add(20) rseed(2232)

Conditional models:
               age: regress age bmi attack smokes hsgrad female
               bmi: regress bmi age attack smokes hsgrad female

Performing chained iterations ...

Multivariate imputation                     Imputations =       20
Chained equations                                 added =       20
Imputed: m=1 through m=20                       updated =        0

Initialization: monotone                     Iterations =      200
                                                burn-in =       10

               bmi: linear regression
               age: linear regression

             |              Observations per m
             |----------------------------------------------
    Variable |   Complete   Incomplete   Imputed |     Total
-------------+-----------------------------------+----------
         bmi |                                   |
         age |                                   |
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

The (regress) option on the command told Stata that both bmi and age were continuous and that OLS regression should be used for imputation. If, instead, the two variables were dichotomies, we would have specified (logit) instead. (We could also have mixed different types in the same command: we could have used (logit), (regress), and (ologit) for different variables if that was appropriate. See the help for mi impute chained for more complicated examples where different methods are mixed.) As before, the add option told Stata to create 20 imputed data sets, and the rseed option was used so we can exactly reproduce our results later.

The conditional models show us that age was regressed on every variable (both from the left and right hand side) except itself. The same is true for bmi. This is the default behavior, i.e. all variables except the one being imputed are included in the prediction equation. This will work well in many situations, but there are numerous options for changing this behavior if you need more flexibility.

Having done the imputation, we can proceed as before. To get the unimputed results,

. mi xeq 0: logit attack smokes age bmi hsgrad female, nolog

m=0 data:
-> logit attack smokes age bmi hsgrad female, nolog

Logistic regression                             Number of obs     =        118
                                                LR chi2(5)        =
                                                Prob > chi2       =
Log likelihood =                                Pseudo R2         =

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
         age |
         bmi |
      hsgrad |
      female |
       _cons |
------------------------------------------------------------------------------

The analysis is limited to the 118 cases that had complete data, i.e. we have lost almost a quarter of the sample (36 of 154 cases) because of missing data. With multiple imputation, the results are

. mi estimate: logit attack smokes age bmi hsgrad female

Multiple-imputation estimates                   Imputations       =         20
Logistic regression                             Number of obs     =        154
                                                Average RVI       =
                                                Largest FMI       =
DF adjustment:   Large sample                   DF:     min       =
                                                        avg       =
                                                        max       =
Model F test:       Equal FMI                   F(   5,        )  =       3.46
Within VCE type:          OIM                   Prob > F          =

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
         age |
         bmi |
      hsgrad |
      female |
       _cons |
------------------------------------------------------------------------------

In this particular example, the coefficients and standard errors for the two imputed variables, age and bmi, change little. The other independent variables show modest changes.
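The example above used (regress) for both imputed variables. To illustrate how different imputation methods can be mixed in one mi impute chained command, here is a minimal sketch on simulated data. Everything in it is hypothetical and chosen only for illustration: the variable names (z, income, owner, health), the sample size, the missingness rates and the seeds are not from the heart attack data or the Stata Manual.

```stata
* Toy data: all variables below are made up for illustration
clear
set seed 12345
set obs 300
generate z      = rnormal()                              // fully observed predictor
generate income = 2 + 0.5*z + rnormal()                  // continuous
generate owner  = runiform() < invlogit(0.8*z)           // binary, coded 0/1
generate latent = z + rnormal()
generate health = 1 + (latent > -0.5) + (latent > 0.5)   // ordinal, coded 1-3
drop latent

* Set some values to missing, independently for each variable
replace income = . if runiform() < 0.15
replace owner  = . if runiform() < 0.10
replace health = . if runiform() < 0.10

* Declare the mi structure and register the variables
mi set mlong
mi register imputed income owner health
mi register regular z

* One imputation method per variable: OLS, logit and ordered logit
mi impute chained (regress) income (logit) owner (ologit) health = z, ///
    add(5) rseed(2232)

* Confirm that five imputations were added
mi describe
```

Each variable on the left of the equals sign is imputed with the method named in the parentheses that precede it, and by default each conditional model uses all the other variables as predictors, just as in the bmi and age example.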

Missing Data Part II: Multiple Imputation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 24, 2015

Missing Data Part II: Multiple Imputation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 24, 2015 Missing Data Part II: Multiple Imputation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 24, 2015 Warning: I teach about Multiple Imputation with some trepidation.

More information

Multiple-imputation analysis using Stata s mi command

Multiple-imputation analysis using Stata s mi command Multiple-imputation analysis using Stata s mi command Yulia Marchenko Senior Statistician StataCorp LP 2009 UK Stata Users Group Meeting Yulia Marchenko (StataCorp) Multiple-imputation analysis using mi

More information

Panel Data 4: Fixed Effects vs Random Effects Models

Panel Data 4: Fixed Effects vs Random Effects Models Panel Data 4: Fixed Effects vs Random Effects Models Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised April 4, 2017 These notes borrow very heavily, sometimes verbatim,

More information

Multiple imputation using chained equations: Issues and guidance for practice

Multiple imputation using chained equations: Issues and guidance for practice Multiple imputation using chained equations: Issues and guidance for practice Ian R. White, Patrick Royston and Angela M. Wood http://onlinelibrary.wiley.com/doi/10.1002/sim.4067/full By Gabrielle Simoneau

More information

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors (Section 5.4) What? Consequences of homoskedasticity Implication for computing standard errors What do these two terms

More information

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information

Compute MI estimates of coefficients by fitting estimation command to mi data

Compute MI estimates of coefficients by fitting estimation command to mi data Title mi estimate Estimation using multiple imputations Syntax Compute MI estimates of coefficients by fitting estimation command to mi data mi estimate [, options ] : estimation command... Compute MI

More information

Missing Data Part 1: Overview, Traditional Methods Page 1

Missing Data Part 1: Overview, Traditional Methods Page 1 Missing Data Part 1: Overview, Traditional Methods Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 17, 2015 This discussion borrows heavily from: Applied

More information

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa Ronald H. Heck 1 In this handout, we will address a number of issues regarding missing data. It is often the case that the weakest point of a study is the quality of the data that can be brought to bear

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

Week 4: Simple Linear Regression II

Week 4: Simple Linear Regression II Week 4: Simple Linear Regression II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Algebraic properties

More information

Preparing for Data Analysis

Preparing for Data Analysis Preparing for Data Analysis Prof. Andrew Stokes March 27, 2018 Managing your data Entering the data into a database Reading the data into a statistical computing package Checking the data for errors and

More information

Dr. Barbara Morgan Quantitative Methods

Dr. Barbara Morgan Quantitative Methods Dr. Barbara Morgan Quantitative Methods 195.650 Basic Stata This is a brief guide to using the most basic operations in Stata. Stata also has an on-line tutorial. At the initial prompt type tutorial. In

More information

Missing Data Techniques

Missing Data Techniques Missing Data Techniques Paul Philippe Pare Department of Sociology, UWO Centre for Population, Aging, and Health, UWO London Criminometrics (www.crimino.biz) 1 Introduction Missing data is a common problem

More information

Section 2.3: Simple Linear Regression: Predictions and Inference

Section 2.3: Simple Linear Regression: Predictions and Inference Section 2.3: Simple Linear Regression: Predictions and Inference Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.4 1 Simple

More information

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU.

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU. SOS3003 Applied data analysis for social science Lecture note 04-2009 Erling Berge Department of sociology and political science NTNU Erling Berge 2009 1 Missing data Literature Allison, Paul D 2002 Missing

More information

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data Spatial Patterns We will examine methods that are used to analyze patterns in two sorts of spatial data: Point Pattern Analysis - These methods concern themselves with the location information associated

More information

Creating a data file and entering data

Creating a data file and entering data 4 Creating a data file and entering data There are a number of stages in the process of setting up a data file and analysing the data. The flow chart shown on the next page outlines the main steps that

More information

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression. Dr. G. Bharadwaja Kumar VIT Chennai Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

More information

Missing Data Missing Data Methods in ML Multiple Imputation

Missing Data Missing Data Methods in ML Multiple Imputation Missing Data Missing Data Methods in ML Multiple Imputation PRE 905: Multivariate Analysis Lecture 11: April 22, 2014 PRE 905: Lecture 11 Missing Data Methods Today s Lecture The basics of missing data:

More information

Week 4: Simple Linear Regression III

Week 4: Simple Linear Regression III Week 4: Simple Linear Regression III Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Goodness of

More information

Cross-validation. Cross-validation is a resampling method.

Cross-validation. Cross-validation is a resampling method. Cross-validation Cross-validation is a resampling method. It refits a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model. For example,

More information

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Age standardization Incidence and prevalence are strongly agedependent Risks rising

More information

. predict mod1. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education)

. predict mod1. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education) DUMMY VARIABLES AND INTERACTIONS Let's start with an example in which we are interested in discrimination in income. We have a dataset that includes information for about 16 people on their income, their

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Bivariate (Simple) Regression Analysis

Bivariate (Simple) Regression Analysis Revised July 2018 Bivariate (Simple) Regression Analysis This set of notes shows how to use Stata to estimate a simple (two-variable) regression equation. It assumes that you have set Stata up on your

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 9. oktober 2012 Multivariate tables Agenda today Age standardization Missing data 1 2 3 4 Age standardization

More information

An Algorithm for Creating Models for Imputation Using the MICE Approach:

An Algorithm for Creating Models for Imputation Using the MICE Approach: An Algorithm for Creating Models for Imputation Using the MICE Approach: An application in Stata Rose Anne rosem@ats.ucla.edu Statistical Consulting Group Academic Technology Services University of California,

More information

Week 9: Modeling II. Marcelo Coca Perraillon. Health Services Research Methods I HSMP University of Colorado Anschutz Medical Campus

Week 9: Modeling II. Marcelo Coca Perraillon. Health Services Research Methods I HSMP University of Colorado Anschutz Medical Campus Week 9: Modeling II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Taking the log Retransformation

More information

Preparing for Data Analysis

Preparing for Data Analysis Preparing for Data Analysis Prof. Andrew Stokes March 21, 2017 Managing your data Entering the data into a database Reading the data into a statistical computing package Checking the data for errors and

More information

(Refer Slide Time 3:31)

(Refer Slide Time 3:31) Digital Circuits and Systems Prof. S. Srinivasan Department of Electrical Engineering Indian Institute of Technology Madras Lecture - 5 Logic Simplification In the last lecture we talked about logic functions

More information

Modelling Proportions and Count Data

Modelling Proportions and Count Data Modelling Proportions and Count Data Rick White May 4, 2016 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:

More information

Averages and Variation

Averages and Variation Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus

More information

Chapters 5-6: Statistical Inference Methods

Chapters 5-6: Statistical Inference Methods Chapters 5-6: Statistical Inference Methods Chapter 5: Estimation (of population parameters) Ex. Based on GSS data, we re 95% confident that the population mean of the variable LONELY (no. of days in past

More information

Modelling Proportions and Count Data

Modelling Proportions and Count Data Modelling Proportions and Count Data Rick White May 5, 2015 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:

More information

Notes on Simulations in SAS Studio

Notes on Simulations in SAS Studio Notes on Simulations in SAS Studio If you are not careful about simulations in SAS Studio, you can run into problems. In particular, SAS Studio has a limited amount of memory that you can use to write

More information

Chapter 3 Analyzing Normal Quantitative Data

Chapter 3 Analyzing Normal Quantitative Data Chapter 3 Analyzing Normal Quantitative Data Introduction: In chapters 1 and 2, we focused on analyzing categorical data and exploring relationships between categorical data sets. We will now be doing

More information

Exercise: Graphing and Least Squares Fitting in Quattro Pro

Exercise: Graphing and Least Squares Fitting in Quattro Pro Chapter 5 Exercise: Graphing and Least Squares Fitting in Quattro Pro 5.1 Purpose The purpose of this experiment is to become familiar with using Quattro Pro to produce graphs and analyze graphical data.

More information

Cross-validation and the Bootstrap

Cross-validation and the Bootstrap Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set,

More information

May 24, Emil Coman 1 Yinghui Duan 2 Daren Anderson 3

May 24, Emil Coman 1 Yinghui Duan 2 Daren Anderson 3 Assessing Health Disparities in Intensive Longitudinal Data: Gender Differences in Granger Causality Between Primary Care Provider and Emergency Room Usage, Assessed with Medicaid Insurance Claims May

More information

Programming and Post-Estimation

Programming and Post-Estimation Programming and Post-Estimation Bootstrapping Monte Carlo Post-Estimation Simulation (Clarify) Extending Clarify to Other Models Censored Probit Example What is Bootstrapping? A computer-simulated nonparametric

More information

(Refer Slide Time: 06:01)

(Refer Slide Time: 06:01) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 28 Applications of DFS Today we are going to be talking about

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT

TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT PRIMER FOR ACS OUTCOMES RESEARCH COURSE: TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT STEP 1: Install STATA statistical software. STEP 2: Read through this primer and complete the

More information

Title. Description. Quick start. stata.com. misstable Tabulate missing values

Title. Description. Quick start. stata.com. misstable Tabulate missing values Title stata.com misstable Tabulate missing values Description Quick start Menu Syntax Options Remarks and examples Stored results Also see Description misstable makes tables that help you understand the

More information

An introduction to plotting data

An introduction to plotting data An introduction to plotting data Eric D. Black California Institute of Technology February 25, 2014 1 Introduction Plotting data is one of the essential skills every scientist must have. We use it on a

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

8. MINITAB COMMANDS WEEK-BY-WEEK

8. MINITAB COMMANDS WEEK-BY-WEEK 8. MINITAB COMMANDS WEEK-BY-WEEK In this section of the Study Guide, we give brief information about the Minitab commands that are needed to apply the statistical methods in each week s study. They are

More information

Week 10: Heteroskedasticity II

Week 10: Heteroskedasticity II Week 10: Heteroskedasticity II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Dealing with heteroskedasticy

More information

Week 5: Multiple Linear Regression II

Week 5: Multiple Linear Regression II Week 5: Multiple Linear Regression II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Adjusted R

More information

Handling Your Data in SPSS. Columns, and Labels, and Values... Oh My! The Structure of SPSS. You should think about SPSS as having three major parts.

Handling Your Data in SPSS. Columns, and Labels, and Values... Oh My! The Structure of SPSS. You should think about SPSS as having three major parts. Handling Your Data in SPSS Columns, and Labels, and Values... Oh My! You might think that simple intuition will guide you to a useful organization of your data. If you follow that path, you might find

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics SPSS Complex Samples 15.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

Data corruption, correction and imputation methods.

Data corruption, correction and imputation methods. Data corruption, correction and imputation methods. Yerevan 8.2 12.2 2016 Enrico Tucci Istat Outline Data collection methods Duplicated records Data corruption Data correction and imputation Data validation

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.
