AMELIA II: A Program for Missing Data

Amelia II is an R package that performs multiple imputation to deal with missing data, as an alternative to deletion methods such as pairwise and listwise deletion. In multiple imputation, values are imputed for each missing cell in your data set, and several completed data sets are created. In these completed data sets, the observed values stay the same, but the missing values are filled in with imputations from the EMB (expectation-maximization with bootstrapping) algorithm. After imputation, you conduct your statistical analyses on each completed data set and then combine the results.

For this example, the world95.sav file from the PASW 17.0 sample data sets is used. There are 11 variables: nine continuous and two categorical. Missing values appear as blank cells in the data.

You can download the Amelia II package from http://gking.harvard.edu/amelia/ (please make sure you have installed R first [http://cran.r-project.org/]). You can also open the AmeliaView interface with the following commands in R:

> library(Amelia)
> AmeliaView()
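To see why imputation is preferred to listwise deletion, it can help to count how many rows deletion would discard. The sketch below uses a small made-up data frame (not the world95 data) and base R only; the column names are borrowed from the example for readability.

```r
# Made-up illustration data; NA marks a missing cell.
demo <- data.frame(
  urban    = c(54, NA, 80, 30, 61),
  literacy = c(98, 35, NA, 70, 88),
  climate  = c(3, 5, 8, NA, 3)
)

n_total      <- nrow(demo)
n_complete   <- sum(complete.cases(demo))  # rows listwise deletion would keep
cell_missing <- mean(is.na(demo))          # share of all cells that are missing

cat("Rows in data:           ", n_total, "\n")
cat("Rows kept by deletion:  ", n_complete, "\n")
cat("Share of missing cells: ", cell_missing, "\n")
```

Here a modest share of missing cells removes more than half of the rows, which is the loss of information that multiple imputation is meant to avoid.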

Step 1. Use Amelia II to impute data for missing values.

I. After the package is loaded and AmeliaView() is run, AmeliaView opens in a new window.

II. Select Import CSV. Locate your file, saved as a comma-separated values (.csv) file, and click Open.

III. The data is now loaded into the program. This view provides descriptive statistics (Min, Max, Mean, SD) for your variables, as well as the number of missing data points per variable (Missing). For example, the urban variable is missing one value out of 109. The following options are available:

Transformation. Use this option to classify variable measurement type and to transform variables (e.g., logistic or square root), if necessary.

Lag. Use this option for time series data; lags are variables that take the value of another variable in the previous time period.

Lead. Use this option for time series data; leads take the value of another variable in the next time period.

Bounds. Use this option to place restrictions on the range of the imputed values.
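The per-variable summary that AmeliaView displays (Min, Max, Mean, SD, Missing) can be approximated in base R. This is a minimal sketch, not AmeliaView's own code; the helper name summarize_vars and the toy data frame are hypothetical.

```r
# Summarize each numeric column: range, mean, SD, and number of missing cells.
summarize_vars <- function(data) {
  num <- data[sapply(data, is.numeric)]   # skip ID and other text columns
  data.frame(
    variable = names(num),
    min      = sapply(num, min,  na.rm = TRUE),
    max      = sapply(num, max,  na.rm = TRUE),
    mean     = sapply(num, mean, na.rm = TRUE),
    sd       = sapply(num, sd,   na.rm = TRUE),
    missing  = sapply(num, function(x) sum(is.na(x))),
    row.names = NULL
  )
}

# Toy data standing in for the imported CSV file
demo <- data.frame(
  country  = c("A", "B", "C", "D"),
  urban    = c(54, NA, 80, 30),
  literacy = c(98, 35, NA, NA)
)

tab <- summarize_vars(demo)
print(tab)
```

With the real file, the same function could be applied to the result of read.csv() on the exported .csv.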

IV. Transformation. The Amelia package recognized the country variable as an ID variable and classified it as such. To classify the two categorical variables (region and climate) as nominal, right-click the row of the region variable and select Nominal.

Repeat the same steps for the climate variable.

V. Bounds. Since four of the continuous variables (urban, literacy, lit_male, and lit_female) are percentages, bounds need to be added to restrict their imputed values to the range 0 to 100. The climate variable also needs to be restricted to its available values of 1 to 9 (the region variable has no missing data, so it needs no bound). Right-click the urban variable row and select Add or Edit Bounds. The Add or Edit Bounds box appears for you to enter the minimum and maximum values; type 0 for the minimum and 100 for the maximum. Select OK. Repeat the same steps for the remaining variables that need bounds.

VI. Select Options from the top menu. Select Output File Options. The Output Options box appears; by default, the names of the imputed datasets end in imp and 5 imputed datasets are created. Select OK.
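Steps IV through VI can also be scripted with the amelia() function instead of the AmeliaView menus. The sketch below is one possible arrangement, under several assumptions: the file name world95.csv and the output stem are hypothetical, the column names are taken from the text and may differ in your file, and make_bounds is a helper written for this example. amelia() takes bounds as a three-column matrix of column number, lower bound, and upper bound. The sketch bounds only the four percentage variables and leaves climate to the nominal setting; that is a choice made here, not a step from the walkthrough.

```r
# Build the bounds matrix: one row per bounded variable,
# holding (column number, lower bound, upper bound).
make_bounds <- function(data, limits) {
  stopifnot(all(names(limits) %in% names(data)))
  rows <- lapply(names(limits), function(v) {
    c(match(v, names(data)), limits[[v]])
  })
  do.call(rbind, rows)
}

limits <- list(urban = c(0, 100), literacy = c(0, 100),
               lit_male = c(0, 100), lit_female = c(0, 100))

# Toy data with the assumed column layout, used only to show the matrix
demo <- data.frame(country = c("A", "B"), region = c(1, 2), climate = c(3, NA),
                   urban = c(54, NA), literacy = c(98, 35),
                   lit_male = c(99, 40), lit_female = c(97, NA))
bds <- make_bounds(demo, limits)
print(bds)

# The imputation runs only if Amelia is installed and the CSV file exists.
csv_path <- "world95.csv"                      # assumed file name
if (requireNamespace("Amelia", quietly = TRUE) && file.exists(csv_path)) {
  world95 <- read.csv(csv_path)
  a.out <- Amelia::amelia(world95, m = 5,      # 5 imputed datasets
                          idvars = "country",  # ID variable, not imputed
                          noms = c("region", "climate"),
                          bounds = make_bounds(world95, limits))
  # Writes one CSV per imputation: world95-imp1.csv, world95-imp2.csv, ...
  Amelia::write.amelia(a.out, file.stem = "world95-imp")
}
```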

VII. Select Impute! Imputation is complete when "Successful Imputation." appears at the bottom right of the screen.

VIII. Select Output Log. The output log gives the chain length of each imputation. For example, Imputation 4's chain length was 133.

IX. The 5 imputed datasets are saved in the same location as the original file.

X. Below is an example of one complete dataset from the imputed files.
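A completed file can also be checked from R against the original data: every observed value should be unchanged and no cell should be missing. The following is a minimal sketch with a hypothetical helper, check_completed, and toy data; with real output you would pass the result of read.csv() on the original file and on one of the imputed files.

```r
# TRUE when every observed cell of `original` is unchanged in `completed`
# and `completed` has no missing cells left.
check_completed <- function(original, completed) {
  stopifnot(identical(names(original), names(completed)),
            nrow(original) == nrow(completed))
  cols_ok <- mapply(function(o, f) {
    obs <- !is.na(o)                       # cells that were observed
    all(o[obs] == f[obs]) && !anyNA(f)
  }, original, completed)
  all(cols_ok)
}

# Toy stand-ins for the original file and one imputed file
original  <- data.frame(country = c("A", "B", "C"), urban = c(54, NA, 80))
completed <- original
completed$urban[2] <- 61.3                 # made-up imputed value

print(check_completed(original, completed))
```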

Step 2. Pool parameter estimates and standard errors.

I. Run statistical analyses (e.g., multiple regression, canonical correlation, etc.) on the 5 imputed datasets.

II. Compute the mean of the parameter estimates of the 5 imputed datasets. For example, a multiple regression was conducted using the 5 imputed datasets, so there are 5 beta estimates for the literacy variable. The mean of the 5 beta estimates is 3.41.

Imputation   b      SE     Variance (SE^2)
1            3.23   0.15   0.022
2            3.43   0.15   0.023
3            3.35   0.11   0.011
4            3.55   0.15   0.023
5            3.47   0.15   0.022
Pooled       3.41

III. To pool the standard error estimates, you need to compute the within-imputation variance and the between-imputation variance.

a. The within-imputation variance is the average of the squared standard errors across the m analyses,

$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} \hat{U}_i$$

where $\hat{U}_i$ is the variance estimate from the i-th imputed data set, and m is the number of imputations. First, sum the variance estimates (0.101), then multiply by one fifth. The within-imputation variance is 0.020.

b. The between-imputation variance is the variability of the m parameter estimates around the mean estimate,

$$B = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}_i - \bar{Q} \right)^2$$

where m is the number of imputations, $\hat{Q}_i$ is the beta estimate from the i-th imputed data set, and $\bar{Q}$ is the mean parameter estimate. First, find the deviation scores for the beta estimates, square them, and then sum the squared deviations (approximately 0.060). Next, divide that value by m - 1 = 4. The between-imputation variance is 0.015.

c. Use the following equation to compute the total variance:

$$T = \bar{U} + \left( 1 + \frac{1}{m} \right) B$$

For the example, the total variance is 0.020 + 1.2 × 0.015 = 0.038. Therefore, the multiple imputation standard error is $\sqrt{0.038} \approx 0.195$.

Imputation   b      SE      Variance (SE^2)
1            3.23   0.15    0.022
2            3.43   0.15    0.023
3            3.35   0.11    0.011
4            3.55   0.15    0.023
5            3.47   0.15    0.022
Pooled       3.41   0.195

IV. Repeat the steps for all of the parameter estimates and standard errors in your model.
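The pooling arithmetic in Step 2 can be sketched in base R. The function name pool_rubin is hypothetical; it uses the m - 1 denominator for the between-imputation variance, as in Rubin's rules, and takes the estimates and variances from the literacy table as input. Amelia also provides a mi.meld() function for combining estimates and standard errors, which may be more convenient when pooling many parameters at once.

```r
# Pool m point estimates and their variances with Rubin's rules.
# q: parameter estimates, one per imputed data set
# u: squared standard errors (variances), in the same order as q
pool_rubin <- function(q, u) {
  stopifnot(length(q) == length(u), length(q) > 1)
  m     <- length(q)
  q_bar <- mean(q)                       # pooled point estimate
  u_bar <- mean(u)                       # within-imputation variance
  b     <- sum((q - q_bar)^2) / (m - 1)  # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b       # total variance
  list(estimate = q_bar, within = u_bar, between = b,
       total = t_var, se = sqrt(t_var))
}

# Values for the literacy coefficient, copied from the table
b_hat <- c(3.23, 3.43, 3.35, 3.55, 3.47)
se_sq <- c(0.022, 0.023, 0.011, 0.023, 0.022)

pooled <- pool_rubin(b_hat, se_sq)
print(round(unlist(pooled), 3))
```

Because the inputs are the rounded table values, the results can differ slightly from a calculation on unrounded estimates.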