Tools for Imputing Missing Data


Taylor Lewis, University of Maryland, College Park, MD

ABSTRACT

Missing data frequently pose a problem to applied researchers and statisticians. Although a common approach is to simply ignore the missing data and analyze only the fully observed portion, an alternative is to impute, or fill in, the missing data, which can often prove advantageous. This paper begins by discussing patterns of missing data as well as the assumptions behind techniques to compensate for them. Much of the paper focuses on tools conveniently built into PROC MI, which allows one to conduct multiple imputation as a way to incorporate the additional uncertainty inherent when imputing missing data. Of course, PROC MI can still be used for single imputation. The paper will also illustrate the %IMPUTE module of IVEware, a powerful (free) SAS add-on developed by researchers at the Institute for Social Research at the University of Michigan, which is particularly helpful for tackling multivariate missingness.

INTRODUCTION

Lewis (2012) discussed the dilemma of missing data within the realm of applied survey research and distinguished between two broad types of missingness: unit nonresponse and item nonresponse. Unit nonresponse refers to the situation in which all key outcome variables are missing; that is, the sample unit fails to respond to the survey request. On the other hand, item nonresponse refers to the situation in which some, but not all, outcome variables are missing. The sample unit may have refused or been unable to answer certain items, or perhaps one or more items were unintentionally skipped. These two missing data situations are juxtaposed in Figure 1.

Figure 1. Illustration of Unit Nonresponse versus Item Nonresponse.
The typical remedy for unit nonresponse is to reweight the responding cases such that they better reflect known distributions of the sample (or population) with respect to a set of auxiliary variables, denoted X in Figure 1 above. Lewis (2012) provided some of the basic theory behind those weighting techniques and demonstrated SAS syntax to implement them. The current paper is a continuation of that paper, shifting focus toward techniques that mitigate item nonresponse by imputing, or filling in, the missing data. These methods exploit the relationship between X and the outcome variable(s) for the observed cases to derive plausible values for the outcome variables of missing cases. That is, it is assumed X is fully observed for the entire data set, respondents and nonrespondents alike. In addition, certain underlying assumptions about the missingness mechanism must hold. A brief taxonomy of these assumptions is discussed next.

The stochastic view of survey nonresponse posits that each sample unit possesses a fixed (but unknown) probability of responding to a survey request. Following the terminology of Rosenbaum and Rubin (1983), this is often called a response propensity and denoted $\rho_i$. Bethlehem (1988) showed that in a simple random sample of size n from a sample frame of N population units, the expected bias of $\hat{y}_r$, the mean using only responding sample units, relative to $\bar{y}$, the full population mean, can be expressed as $\text{bias}(\hat{y}_r) = \frac{1}{(N-1)\bar{\rho}} \sum_{i=1}^{N} (\rho_i - \bar{\rho})(y_i - \bar{y})$, where $\bar{\rho}$ denotes the average

response propensity across all population units. Thus, the bias is proportional to the population covariance between the propensities and the outcome variable. If we adopt this perspective about survey nonresponse, the three distinct missing data assumptions defined by Little and Rubin (2002) are useful for considering how harmful the nonresponse is and whether any potential biases can be eliminated. The first assumes data are missing completely at random (MCAR), which implies $\rho_i = \bar{\rho}$ for all units. Since the $\rho_i$s do not vary, they are necessarily uncorrelated with any outcome variable(s). This poses the least harmful situation, as the responding cases can be thought of as a completely random subsample. There would be no expected bias using $\hat{y}_r$ without making any adjustments, although there would likely be a loss in precision. The second assumption is that the data are missing at random (MAR), which is to say the $\rho_i$s vary only with regard to the sample units' vector of auxiliary variables. Units with comparable $X_i$s share comparable $\rho_i$s, and there is no additional dependency between the likelihood of item nonresponse and any outcome variable. This is the situation generally assumed by the imputation methods demonstrated in this paper as well as the weighting methods demonstrated in Lewis (2012). The first and second assumptions are collectively referred to as ignorable missingness mechanisms by Little and Rubin (2002). This sometimes confuses analysts, because in actuality the missingness is ignorable only after you properly adjust for it. The third assumption, data that are not missing at random (NMAR), is the most perilous. This is categorized by Little and Rubin (2002) as a non-ignorable missingness mechanism and implies there is a dependency between the $\rho_i$s and the outcome variable beyond what can be accounted for by X. For example, suppose a mail survey is aimed at measuring the proportion of the electorate that voted in the most recent presidential election.
If people who did not vote are less inclined to respond to the survey request across all auxiliary variables on the sample frame (e.g., race/ethnicity, age, neighborhood), it is doubtful that an imputation approach using those variables would be able to completely eliminate the bias. Rather sophisticated techniques are required to handle the NMAR situation (cf. Ch. 6 of Rubin, 1987; Andridge and Little, 2011); they are beyond the scope of this paper. There are critics who believe imputation is fabricating data and that only the observed portion should be analyzed. As Brick and Kalton (1996) point out, however, forgoing any kind of adjustment is tantamount to assuming data are MCAR, which is far more questionable than the MAR assumption implicit in most imputation approaches! Although any one of the three classifications outlined by Little and Rubin (2002) is rarely possible to verify (or refute), the plausibility of the MAR assumption increases with a larger number of auxiliary variables.

AN EXAMPLE SURVEY

To motivate the application of these techniques, assume an employee satisfaction survey was conducted on a sample of individuals who work in a large organization. From a large personnel database serving as the sample frame, a simple random sample of n = 8,583 was drawn, and these employees were sent a personalized link via email to a Web-based survey instrument containing a variety of attitudinal questions and a few demographics. Weekly reminder emails were sent to nonrespondents, but after a few weeks the survey closed with r = 4,558 completes, corresponding to a response rate of 4,558 / 8,583 = 53.1%. For the m = n - r = 8,583 - 4,558 = 4,025 employees who never responded to the survey, unit nonresponse and item nonresponse can be treated equivalently in terms of the compensation method of choice.
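With a 53.1% response rate, any correlation between response propensity and an outcome leaves bias that an adjustment must remove, and Bethlehem's approximation above makes this concrete. The following Python simulation is only illustrative; the propensities and outcome values are hypothetical, not data from the example survey:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: outcome y (1 = voted) correlated with the
# response propensity rho (voters respond more often than nonvoters)
N = 100_000
y = rng.binomial(1, 0.6, size=N).astype(float)
rho = np.where(y == 1, 0.6, 0.4)

# Realized respondents
resp = rng.random(N) < rho

# Bethlehem's approximation: bias(ybar_r) ~ cov(rho, y) / rho_bar
rho_bar = rho.mean()
approx_bias = ((rho - rho_bar) * (y - y.mean())).sum() / ((N - 1) * rho_bar)

observed_bias = y[resp].mean() - y.mean()
print(round(approx_bias, 3), round(observed_bias, 3))
```

The two quantities agree closely (both near 0.09 in this setup), confirming that the respondent-only mean overstates the population proportion whenever propensity and outcome covary.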
Whereas Lewis (2012) demonstrated approaches to reweight the 4,558 respondents to better reflect demographic distributions of the entire organization (i.e., the finite population of interest), this paper discusses imputation techniques to fill in the missing data and render a completed data set for the entire sample of 8,583 originally planned. Fortuitously, a set of auxiliary variables maintained in the personnel database is known for the entire sample and can thus be used to facilitate the imputation process. The following four will be utilized during illustrations in this paper:

GENDER: M/F indicator of employee gender
SUPERVISOR: 0/1 indicator of whether the employee is a supervisor
AGE: age of employee at time of survey
MINORITY: Y/N indicator of minority status

Suppose the data set SAMPLE contains these four variables for all 8,583 sampled employees as well as the survey outcome variables where observed. A few example outcome variables to be considered:

LOS: a continuous variable housing the employee's length of service with the organization

Q1_FULL: respondent's selection on a five-point Likert scale (i.e., ranging from "Completely Agree" to "Completely Disagree") to the question "I like the kind of work I do."
Q1: a dichotomized version of Q1_FULL, a 0/1 indicator of whether an employee responded positively (i.e., answered "Completely Agree" or "Agree") to the question "I like the kind of work I do."

IMPUTATION CONCEPTS AND MODEL SPECIFICATIONS

The purpose of this section is to outline the various ways an imputation model can be specified and to illustrate the actual imputation process with the help of a few simple examples. A naïve approach frequently employed to handle item nonresponse is to fill in the m missing cases with the overall mean of the r observed cases. This is also known as unconditional mean imputation and is an example of a deterministic imputation model. This approach has no impact on certain descriptive statistics such as the sample mean (the sample means prior to and after imputation are equivalent), but it generally leads to an unjust reduction in variance. To see why, consider the variance approximation formula for the unadjusted sample mean of the observed cases $\hat{y}_r$: $\widehat{\text{var}}(\hat{y}_r) = \frac{1}{r(r-1)} \sum_{i=1}^{r} (y_i - \hat{y}_r)^2$. Substituting $\hat{y}_r$ for the m missing cases has no effect on the summation term. Even though the summation now runs to n instead of r, the m values added each contribute a squared deviation of 0; however, note that the denominator terms both increase, from r and (r - 1) to n and (n - 1), respectively, which results in a decreased variance approximation. This seems ill-advised considering no new information has been introduced. A somewhat preferred alternative, at least from a variance perspective, is to employ a stochastic imputation model in which we start with the mean or expected value and then add on some kind of residual.
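The variance reduction from unconditional mean imputation can be demonstrated numerically. The following Python sketch applies the variance formula above before and after filling the m missing cases with the observed mean; the observed values are simulated stand-ins, not the survey's actual LOS data:

```python
import numpy as np

rng = np.random.default_rng(7)

# r observed values; m missing cases filled with the observed mean
r, m = 4558, 4025
y_obs = rng.normal(10, 4, size=r)   # stand-in for observed LOS values

def var_of_mean(y):
    # var(ybar) = sum (y_i - ybar)^2 / (n (n - 1))
    n = len(y)
    return ((y - y.mean()) ** 2).sum() / (n * (n - 1))

y_imp = np.concatenate([y_obs, np.full(m, y_obs.mean())])

# The sample mean is unchanged, but the naive variance estimate shrinks
print(var_of_mean(y_obs), var_of_mean(y_imp))
```

Since the summation term is unchanged, the ratio of the two variance estimates is exactly r(r - 1) / [n(n - 1)], roughly 0.28 here, even though no new information was introduced.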
To further compare and contrast concepts of deterministic versus stochastic imputation models, consider the task of imputing LOS for the 4,025 nonrespondents using the auxiliary variable AGE. Figure 2 portrays the relationship between these two variables for a subset of the 4,558 observed cases. A straightforward approach might be to suppose a simple linear regression model holds, such as $LOS_i = \beta_0 + \beta_1 AGE_i + \varepsilon_i$, where the $\varepsilon_i$s are assumed normally distributed with mean 0 and some constant variance $\sigma^2$. The dashed line in Figure 2 depicts this model. The idea would be to estimate the parameters of this model using the observed data and plug in AGE for the ith nonrespondent to derive a plausible value of LOS to impute. Consider a nonrespondent who is 40 years old. A deterministic imputation model approach might stipulate the missing LOS value be extracted directly from what lies on the regression line, an example of a conditional mean imputation model. On the other hand, a stochastic imputation model approach might start at the regression line and then add a random residual in proportion to the estimated mean squared error (MSE) of the fitted model. If we denote the MSE $\hat{\sigma}^2$, this process is operationalized by assigning $\varepsilon_i = z_i \hat{\sigma}$, where $z_i$ represents a random normal deviate drawn independently for the ith nonrespondent.
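The deterministic/stochastic distinction can be sketched in a few lines of Python. The AGE/LOS values below are simulated purely for illustration; the fitted coefficients are not those of the paper's survey:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated observed cases: a rough stand-in for the AGE/LOS relationship
age_obs = rng.uniform(25, 65, size=500)
los_obs = -5 + 0.5 * age_obs + rng.normal(0, 3, size=500)

# Fit LOS_i = b0 + b1 * AGE_i + e_i on the observed cases
X = np.column_stack([np.ones_like(age_obs), age_obs])
beta, rss, *_ = np.linalg.lstsq(X, los_obs, rcond=None)
mse = rss[0] / (len(los_obs) - 2)   # estimated sigma^2

age_miss = np.array([40.0, 55.0])   # hypothetical nonrespondents' ages

# Deterministic (conditional mean) imputation: the value on the regression line
det = beta[0] + beta[1] * age_miss

# Stochastic imputation: add z_i * sigma_hat to the regression prediction
sto = det + rng.standard_normal(len(age_miss)) * np.sqrt(mse)
print(det, sto)
```

Repeated stochastic draws scatter around the regression line with spread governed by the MSE, whereas the deterministic version always returns the same fitted value for a given age.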

Figure 2. Sample of Observed Cases' Relationship between Employee Age and Length of Service in the Hypothetical Personnel Satisfaction Survey.

The stochastic imputation model approach discussed above is an example of an explicit, or parametric, imputation model. In contrast, an example of an implicit, or nonparametric, imputation model would be to place respondents and nonrespondents into cells based on some sensible age categorization (e.g., < 40, 40 to 60, and > 60) and impute instances of missing LOS by randomly selecting an observed case's value within the same cell. This technique is often referred to as hot-deck imputation and is one of the earliest techniques utilized in practice; see Andridge and Little (2010) for an excellent review. The next logical question posed by analysts is "How should we form cells?" It turns out guidance discussed in Lewis (2012) for the adjustment cell and propensity stratification methods carries over here. Glancing back at the bias formula attributable to Bethlehem (1988), the goal is to create cells in which either the $\rho_i$s or the $y_i$s (or both) are similar. However the cells are formed, it is important to bear in mind that the missing data assumption when applying either explicit or implicit models is the same: data are assumed MCAR given the covariates X. The only subtlety is that X can be perceived as a series of cell membership indicator variables in the implicit model formulation. Neither form is superior; both have advantages and disadvantages. One potential downside of applying an explicit model such as the one above is that the imputed value returned could be nonsensical. For example, we could get a negative LOS value, which is not possible in the observed data. This is less of a concern in the cell-based approach, which necessarily imputes values actually observed. A key advantage of the explicit model approach, however, is that it can accommodate a large number of covariates.
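To fix ideas, the cell-based hot deck just described can be sketched in a few lines of Python; the donor (age, LOS) pairs here are hypothetical values, not the survey data:

```python
import random

random.seed(2023)

# Hypothetical observed (age, los) donors and nonrespondents' ages
donors = [(32, 6.0), (38, 9.5), (45, 18.0), (52, 24.0), (61, 30.0), (67, 35.5)]
missing_ages = [41, 63]

def cell(age):
    # Sensible age categorization: < 40, 40 to 60, > 60
    if age < 40:
        return 0
    return 1 if age <= 60 else 2

# Build a donor pool of observed LOS values per cell
pools = {}
for age, los in donors:
    pools.setdefault(cell(age), []).append(los)

# Hot deck: draw a donor's observed LOS at random from the same cell
imputed = [random.choice(pools[cell(a)]) for a in missing_ages]
print(imputed)
```

By construction, every imputed value is one actually observed in the donor's cell, which is why nonsensical values (such as a negative LOS) cannot arise here.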
By comparison, defining cells based on the joint distribution of a large number of covariates tends to present a logistical problem, as small or empty cells seem inevitable. As Schenker and Taylor (1996, p. 430) note, it is undesirable to draw from too small a pool of observed values or to utilize the same observed values for imputation too frequently. Before continuing, it should be mentioned that there are techniques bridging the gap between strictly explicit and strictly implicit modeling approaches. These are typically referred to as partially parametric or semi-parametric techniques. Examples are the approaches discussed in Heitjan and Little (1991) and Schenker and Taylor (1996), which call for fitting an explicit regression model but affixing real residuals in some randomized fashion from neighboring observations in the raw data. The ideas discussed above translate to categorical variables, although the visualization and modeling approaches differ somewhat. Figure 3 shows the observed relationship between employee gender and the proportion of respondents reacting positively to the statement "I like the kind of work I do," computed simply by finding the mean of the Q1 indicator variable. We can infer from the observed data that males are less likely to answer on the positive end of the scale than females, so this auxiliary variable is at least somewhat useful for recapturing a portion of the information lost to the missingness (if we are comfortable assuming data are MCAR conditional on gender). In lieu of a linear regression approach, however, fitting a logistic regression model (Hosmer and Lemeshow, 2000) would be more appropriate in this setting.
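With a single binary covariate such as GENDER, the fitted logistic model's predicted probability is simply the observed proportion of positive responses within each gender, so stochastic imputation amounts to a Bernoulli draw at that probability. A minimal Python sketch with made-up proportions (not the survey's actual figures):

```python
import random

random.seed(8)

# Hypothetical observed positive-response proportions by gender; with a
# saturated logistic model these equal the predicted probabilities
p_hat = {"F": 0.78, "M": 0.71}

nonrespondents = ["F", "M", "M", "F", "M"]

# Stochastic imputation of the 0/1 item: Bernoulli draw at p_hat[gender]
imputed_q1 = [1 if random.random() < p_hat[g] else 0 for g in nonrespondents]
print(imputed_q1)
```

Drawing 0/1 values rather than imputing the predicted probability itself preserves the variable's dichotomous scale in the completed data set.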

Figure 3. Observed Cases' Relationship between Employee Gender and Proportion Reacting Positively to the Statement "I Like the Kind of Work I Do" (Q1).

Although stochastic imputation models tend to result in a less pronounced underestimation of variance, there is still a residual component of uncertainty ignored by treating the imputed values in the completed data set as truth. While more than one method has been proposed to account for this uncertainty (e.g., Rao and Shao, 1992; Efron, 1994; Fay, 1996; Kim and Fuller, 2002), arguably the most popular is multiple imputation (MI) (Rubin, 1987). Indeed, PROC MI has been developed to perform multiple imputation under a variety of imputation models. Syntax examples will be given in the next section. The notion of multiple imputation is to fill in the missing data independently M times (M >= 2), rendering M completed data sets. Figure 4 visualizes the process. If we denote the mth completed data set estimate $\hat{\theta}_m$, the overall MI estimate is found by simply averaging the M completed data set estimates, $\hat{\theta}_M = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m$. The overall MI variance is found by adding together the average of the M completed data set estimated variances, $\bar{U}_M = \frac{1}{M} \sum_{m=1}^{M} \widehat{\text{var}}(\hat{\theta}_m)$, which can be thought of as a measure of within-imputation variability, and a measure of between-imputation variability, $B_M = \frac{1}{M-1} \sum_{m=1}^{M} (\hat{\theta}_m - \hat{\theta}_M)^2$. In words, the between-imputation term represents the variance of the M completed data set estimates themselves. In total, the MI variance is approximated by $\widehat{\text{var}}(\hat{\theta}_M) = \bar{U}_M + \left(1 + \frac{1}{M}\right) B_M$. (The term $1 + \frac{1}{M}$ represents a finite imputation correction factor that tends to 1 as $M \rightarrow \infty$.) General theta notation is used to emphasize the fact that these formulas apply regardless of the estimator at hand.
As we will see in specific examples presented later in the paper, the basic strategy is to independently compute the estimator(s) and associated measure(s) of variability from each of the M completed data sets, then supply these figures to PROC MIANALYZE to carry out the formulas outlined above.
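Rubin's combining rules are simple enough to express directly. The following Python sketch mirrors the arithmetic PROC MIANALYZE performs; the five completed-data-set estimates and variances are made-up numbers for illustration:

```python
# Rubin's rules for combining M completed-data-set estimates
def mi_combine(estimates, variances):
    M = len(estimates)
    theta = sum(estimates) / M                              # overall MI estimate
    U = sum(variances) / M                                  # within-imputation variance
    B = sum((t - theta) ** 2 for t in estimates) / (M - 1)  # between-imputation variance
    total_var = U + (1 + 1 / M) * B
    return theta, total_var

# Five hypothetical completed-data-set means and their estimated variances
est, var = mi_combine([10.2, 10.5, 10.1, 10.4, 10.3], [0.04] * 5)
print(est, var)
```

Here the overall estimate is 10.3 and the total variance is 0.04 + 1.2 * 0.025 = 0.07, illustrating how the between-imputation spread inflates the naive within-imputation variance.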

Figure 4. Visualization of the Multiple Imputation Process.

A tenet of the MI process advocated by Rubin is that one must account for the imputation model uncertainty when deriving imputations. Failing to do so is "improper" in his terminology. This is a complicating factor not well understood by many analysts, in this author's humble opinion, yet it is an essential component of the process. A few simple examples are given next to help illustrate the concept. Suppose that in our sample of n = 8,583 employees, in which r = 4,558 responded and m = 4,025 did not respond, we are willing to assume data are MCAR, but we still sought to compensate for the missingness by imputing the missing values in a completely random fashion. We have already seen how filling in the mean leads to an imprudent variance reduction for the sample mean. It is worthwhile to consider two MI strategies that, in expectation, do not result in a comparable variance reduction. In other words, for large M, the overall MI variance equals the variance of the observed cases. The first to be considered is a nonparametric, single-cell hot-deck routine. We begin by selecting r cases with replacement from the set of r observed values; we might denote this set r*. From r*, we select m values with replacement and use these to impute the m missing cases. We then repeat this two-stage process independently M times to create M completed data sets. Rubin and Schenker (1986) refer to this technique as the approximate Bayesian bootstrap (ABB). The second approach is a parametric version of the first. Suppose that the sample mean of interest was actually a proportion, the mean of the Q1 0/1 indicator variable. From the observed data, it is straightforward to compute this proportion $\hat{p}_r$ and its variance $\widehat{\text{var}}(\hat{p}_r) = \frac{\hat{p}_r(1 - \hat{p}_r)}{r - 1}$. The first step for deriving values of Q1 for the nonrespondents is to account for the uncertainty in $\hat{p}_r$ by drawing $\hat{p}_r^* = \hat{p}_r + z_i \sqrt{\widehat{\text{var}}(\hat{p}_r)}$, where $z_i$ is a random normal deviate. From here, we draw $r_i$, a uniformly distributed random variable between 0 and 1, and impute a 1 for Q1 if $r_i \le \hat{p}_r^*$ and 0 otherwise. This process is conducted independently M times. Again, the pleasing property of either technique is that the expected MI variance estimate matches the variance estimate from only the observed cases. Although for brevity purposes we will not do so here, this can be verified via simulation. The first step in either approach is critical, however: the step accounting for the uncertainty inherent in the imputation model. Without it, some degree of variance underestimation will likely remain.
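The two-stage ABB draw is easy to express in Python. The observed 0/1 values below are toy data, not the survey's Q1 responses:

```python
import random

random.seed(11)

def abb_impute(observed, m):
    # Approximate Bayesian bootstrap (Rubin and Schenker, 1986):
    # 1) resample the r observed values with replacement -> donor pool r*
    # 2) draw the m imputations with replacement from r*
    pool = random.choices(observed, k=len(observed))
    return random.choices(pool, k=m)

observed_q1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # toy observed 0/1 responses

# M = 3 independent completed data sets, each filling 4 missing values
completed = [observed_q1 + abb_impute(observed_q1, 4) for _ in range(3)]
print(completed)
```

The first-stage resampling is what makes the procedure "proper": it injects the uncertainty about the observed distribution itself, so the between-imputation variance does not collapse as M grows.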

TOOLS FOR HANDLING UNIVARIATE MISSINGNESS

We begin the discussion regarding the specific imputation tools available within SAS by addressing univariate missingness, the scenario in which one or more completely observed variables are used to impute missing values of a single variable. It turns out univariate missingness is a special case of the more general monotone missingness pattern, which will be defined more formally in the next section when it is contrasted with an arbitrary missingness pattern (see Figure 5). But at this point, suffice it to say this explains why the handiest built-in imputation algorithms for this setting reside within the MONOTONE statement of PROC MI. Suppose we wanted to multiply impute the missing values of LOS in the data set SAMPLE M = 5 times with a model drawing upon all four auxiliary variables: GENDER, SUPERVISOR, AGE, and MINORITY. The PROC MI syntax in the example code below accomplishes this task. Since LOS is a continuous variable, we might opt for the linear regression approach implemented by the REG option of the MONOTONE statement. Although there is output generated in the listing, the key output is the M = 5 completed data sets, which are stored on top of one another in the data set we named SAMPLE_MI_5. In general, we can expect the output data set to consist of M times the number of observations in the input data set, where M is specified by the NIMPUTE= option in the PROC statement. Note that the automatically appended numeric variable _IMPUTATION_, taking on values 1, 2, ..., M, can be used to extract the mth completed data set. In the event you wish to conduct only single imputation, you can either extract one particular completed data set at random or specify NIMPUTE=1. Lastly, it is always advisable to specify a random number in the SEED= option to ensure results can be replicated if the PROC MI step is resubmitted at a later time.
Categorical variables must be listed in the CLASS statement and must also appear in the VAR statement alongside continuous variables. Since the variable SUPERVISOR is stored numerically as a 0/1 indicator, it does not need to appear in the CLASS statement. In general, the variables in the VAR statement must be listed in increasing order according to their rates of missingness, but in the present case of univariate missingness, we need only ensure LOS is listed last. Our explicit model is specified in typical MODEL statement syntax structure in parentheses after the REG option in the MONOTONE statement. Note that you can augment the list of explanatory variables to the right of the equals sign with syntax such as VAR1*VAR2 to include an interaction term for the effects of VAR1 and VAR2, or VAR1|VAR2|VAR3 to include all main effects and interaction terms among VAR1, VAR2, and VAR3.

proc mi data=sample out=sample_mi_5 nimpute=5 seed=943222;
  class gender minority;
  var gender minority age supervisor los;
  monotone reg (los=age supervisor gender minority / details);
run;

The DETAILS option within parentheses in the syntax above requests the imputation model coefficients (betas) for the observed portion of the data to be output, as well as the perturbed values drawn independently for each imputation. The output below shows these values through the first three imputations. The similarity or dissimilarity of each vector of coefficients from one imputation to the next is a function of the precision of the imputation model fitted from the observed data. The perturbation process of model coefficients in this setting is substantially more complicated than the simple examples discussed in the previous section and will not be discussed here. The reader seeking a more in-depth discussion is referred to the PROC MI documentation.
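The flavor of the perturbation can be conveyed with a deliberately simplified sketch: draw a fresh coefficient vector from an approximate sampling distribution before deriving each imputation. (PROC MI's actual draws come from a Bayesian posterior and also perturb the residual variance; the coefficients and standard errors below are hypothetical numbers, and this Python fragment is purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(12)

# Hypothetical fitted coefficients and standard errors from the
# observed-data regression (illustrative values only)
beta_hat = np.array([1.50, 0.48, 2.10])
se = np.array([0.40, 0.01, 0.30])

M = 5
# One independently perturbed coefficient vector per imputation
perturbed = [beta_hat + rng.standard_normal(3) * se for _ in range(M)]
for b in perturbed:
    print(np.round(b, 3))
```

Coefficients with small standard errors barely move from imputation to imputation, while imprecisely estimated ones vary noticeably, which is exactly the pattern the DETAILS output displays.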
The reason for highlighting this process is to reiterate that accounting for the uncertainty in the imputation model is a critical stage in the process of deriving imputed values.

Output 1. Partial Output from an Example PROC MI Run with the DETAILS Option Specified from a Linear Regression Imputation Model for a Continuous Outcome Variable.

Recall that a potential downside to utilizing a parametric modeling approach such as the one here is the risk of a nonsensical imputed value, such as a negative value for LOS. The PREDMM option in the MONOTONE statement can be employed to implement the semi-parametric techniques discussed previously and more formally described in Heitjan and Little (1991) and Schenker and Taylor (1996). The syntax is virtually equivalent, at least in terms of how we specify the model, but recall that a real residual is drawn from a pool of observed residuals neighboring the predicted value. See the documentation for more details. Examples of the syntax necessary to actually analyze each of the M completed data sets contained within SAMPLE_MI_5 and unify the results in a PROC MIANALYZE step are deferred to a later section of the paper. The direction of the paper at this point is to demonstrate univariate imputation syntax examples for variables measured on alternative scales, namely dichotomous, ordinal, and nominal, and eventually segue into multivariate missingness examples.

We next consider dichotomous variables. The LOGISTIC option in the MONOTONE statement of PROC MI is equipped to impute variables of this type. As the name suggests, it does so by fitting a logistic regression model to the observed data. The syntax below imputes the Q1 0/1 indicator variable exploiting the same four auxiliary variables used previously. As before, since we have specified NIMPUTE=5, the output data set SAMPLE_MI_5 contains five times the number of observations appearing in the input data set. Immediately following the syntax is a portion of the output generated by the DETAILS option. Note that these still represent the fully observed data and imputation-specific perturbed model coefficients, except they correspond to logistic regression model coefficients rather than linear regression model coefficients as in Output 1.

proc mi data=sample nimpute=5 seed=7726 out=sample_mi_5;
  class gender minority Q1;
  var gender minority age supervisor Q1;
  monotone logistic (Q1=gender minority age supervisor / details);
run;

Output 2.
Partial Output from an Example PROC MI Run with the DETAILS Option Specified from a Logistic Regression Imputation Model for a Dichotomous Outcome Variable.

The LOGISTIC option can actually be implemented for variables with three or more distinct values, but only if you are willing to assume an ordinal scale. This is because an ordinal logistic regression model is the only option to be fitted. At the time of this writing, the multinomial version, which would be a more suitable avenue for nominally scaled variables, is not available. The ordinal assumption may not be implausible for the variable Q1_FULL, the precollapsed version of Q1 ranging from 1 to 5, the respondent's selection on a five-point Likert scale. The only technique in PROC MI developed specifically for nominal variables is the DISCRIM option in the MONOTONE statement, which implements the discriminant method of multiple imputation as described in Brand (1999). This approach is restrictive, however, because it requires multivariate normality in the predictor variables, effectively eliminating the possibility of categorical predictor variables. An alternative is the PROPENSITY option in the MONOTONE statement, which uses the covariates specified to estimate propensities of missingness for all cases in the input data set. The propensities are then stratified according to their magnitudes, rendering cells within which the ABB routine described earlier is performed to derive the multiple imputations. This is reminiscent of the propensity stratification reweighting approach discussed in Lewis (2012). By default, 5 strata are created, but this can be modified using the NGROUPS= option after the slash within parentheses during the propensity model specification step. Although there are no scale requirements for this form of (hot-deck) imputation procedure, at the time of this writing, only numeric variables are accommodated.
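The PROPENSITY option's cell-based logic can be sketched as follows. In this Python fragment the propensities are random stand-ins for the scores one would estimate from a logistic model of missingness on the covariates, and the category labels are hypothetical:

```python
import random

random.seed(5)

# Assume missingness propensities "p" have already been estimated for
# all n cases; "y" is the item to impute (None marks a missing value)
records = [{"p": random.random(),
            "y": random.choice(["A", "B", "C", None])} for _ in range(200)]

NGROUPS = 5   # default number of propensity strata
records.sort(key=lambda rec: rec["p"])
size = len(records) // NGROUPS

for g in range(NGROUPS):
    cell = records[g * size:((g + 1) * size if g < NGROUPS - 1 else None)]
    donors = [rec["y"] for rec in cell if rec["y"] is not None]
    pool = random.choices(donors, k=len(donors))   # ABB stage 1: resample donors
    for rec in cell:
        if rec["y"] is None:
            rec["y"] = random.choice(pool)         # ABB stage 2: draw imputation
    # (stages as described in the ABB discussion earlier)

print(sum(1 for rec in records if rec["y"] is None))   # -> 0
```

Stratifying on the propensity score before applying the ABB keeps donors and recipients comparable with respect to their likelihood of missingness, paralleling the propensity stratification weighting adjustment.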
Hopefully, this will be rectified in a future release of SAS, but in the meantime the syntax outlined below serves as a general work-around. Without going into too much descriptive detail, the strategy is to create a numeric shadow variable (Q1_FULL_NUM) for the PROC MI step that consists of one unique integer for each unique category of the underlying nominal variable (Q1_FULL). This is accomplished with the help of a few user-defined formats. At the conclusion of the imputation process, the numeric version is back-transformed to the original categorical version.

* create data set of distinct, non-missing values of the given categorical variable;

proc sql;
  create table toformat1 as
  select distinct Q1_full as start
  from sample
  where not missing(Q1_full);
quit;

* create a format to substitute a sequential integer in place of the specific values;
data toformat1;
  set toformat1;
  label=_n_;
  type='c';
  fmtname='$char_num';
run;

proc format cntlin=toformat1;
run;

* create a placeholder numeric variable to utilize in the PROC MI step;
data sample;
  set sample;
  * initialize the shadow variable as numeric;
  Q1_full_num=0;
  Q1_full_num=put(Q1_full,$char_num.);
run;

proc mi data=sample nimpute=5 seed= out=sample_mi_5;
  class gender minority;
  var gender minority age supervisor Q1_full_num;
  monotone propensity (Q1_full_num=gender minority age supervisor);
run;

* swap the start/label values in the toformat data set;
data toformat2;
  set toformat1 (rename=(start=label label=start));
  fmtname='num_char';
  type='n';
run;

proc format cntlin=toformat2;
run;

* convert integer placeholders in output data set back to their respective values;
data sample_mi_5;
  set sample_mi_5;
  if Q1_full=' ' then Q1_full=put(Q1_full_num,num_char.);
  drop Q1_full_num;
run;

Another option for imputing nominal variables is to utilize IVEware, demonstrated in the next section. IVEware fits and applies a sequence of binary logistic regression models in a manner that is functionally equivalent to a multinomial logistic regression approach.

TOOLS FOR HANDLING MULTIVARIATE MISSINGNESS

Examples up to this point have dealt exclusively with univariate missingness problems, but a given data set may have two or more variables plagued by missingness. We next consider techniques to mitigate this situation. Multiple MONOTONE statements can be specified in a single PROC MI step only when the data set exhibits a monotone missingness pattern. Figure 5 offers a side-by-side comparison of this particular pattern with the alternative, an arbitrary pattern of missingness.

Figure 5. Monotone versus Arbitrary Patterns of Missingness.

A data set is said to have a monotone pattern of missingness if its rows and columns can be rearranged such that, as we move left to right, the variables are ordered in terms of their rates of missingness; that is, each subsequent variable contains an equal or greater number of cases plagued by missing data. For the present case, in which LOS and Q1 are either both observed or both missing for all observations in the data set SAMPLE, a monotone missingness pattern holds. The example PROC MI step below illustrates the approach of specifying multiple MONOTONE statements. The first step is to fill in LOS via a linear regression model, and the second is to fill in Q1 via a logistic regression model. Although no output is shown, these are conducted in tandem, with the end result being that the data set SAMPLE_MI_5 consists of M = 5 completed data sets with no missing values for LOS and Q1.

proc mi data=sample nimpute=5 seed=87332 out=sample_mi_5;
  class gender minority Q1;
  var age supervisor gender minority Q1 los;
  monotone reg (los=gender minority age supervisor);
  monotone logistic (Q1=gender minority age supervisor);
run;

When data are afflicted by a "Swiss cheese" missingness pattern such as the example data set appearing on the right-hand side of Figure 5, specifying multiple MONOTONE statements will produce an error message in the log. The EM and MCMC statements in PROC MI were designed to handle arbitrary patterns of missingness, but only when data are multivariate normal. Again, this precludes the use of categorical variables, which is a non-trivial barrier in applied survey research.
As such, no examples will be shown, but the reader interested in learning more about these statements and the underlying algorithms they implement (the expectation-maximization (EM) and Markov chain Monte Carlo (MCMC) algorithms, respectively) is encouraged to consult the documentation. A remedy that is occasionally feasible is to conduct single imputation for the purpose of achieving a monotone missingness pattern, at which point multiple MONOTONE statements can be specified; for an example, see p. 9 of Berglund (2010). For brevity, we will not illustrate that approach here. Instead, we will demonstrate how to employ the %IMPUTE module of IVEware, a set of free SAS-callable macros developed by researchers at the University of Michigan. The %IMPUTE module is an extraordinarily flexible tool for imputation, and is capable of handling either monotone or arbitrary patterns of missingness via the sequential regression approach detailed in Raghunathan et al. (2001). The process begins by imputing the variable with the least amount of missingness using only the fully observed portion of the data, then proceeds to impute the variable with the second smallest amount of missingness using the observed portion of the data as well as any imputed values from the first step. The sequence continues until all missing values have been imputed. To build interdependence amongst the chained sequence of imputed values, the algorithm then cycles back through all variables, re-imputing all values that were originally missing. After iterating through this process 10 times (a default number that can be modified), a completed data set is released. The entire process is begun anew to generate each of the M completed data sets independently.
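The cyclic flow just described can be sketched in a few lines. The following is a deliberately simplified Python illustration, assuming all variables are continuous; it uses deterministic least-squares predictions where a proper implementation (IVEware included) would draw from the fitted model, and the function name and structure are this sketch's own, not IVEware's.

```python
# Simplified sketch of sequential regression ("chained equations") imputation:
# impute variables in order of increasing missingness, then cycle back through,
# re-imputing each originally missing value using all other variables.
import numpy as np

def sequential_impute(X, n_cycles=10):
    """X: 2-D array with np.nan marking missing values. Returns a completed copy."""
    X = np.array(X, dtype=float, copy=True)
    miss = np.isnan(X)
    # initial fill: column means of the observed portion
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    # visit columns from least to most missing, cycling n_cycles times
    order = np.argsort(miss.sum(axis=0))
    for _ in range(n_cycles):
        for j in order:
            if not miss[:, j].any():
                continue
            others = [k for k in range(X.shape[1]) if k != j]
            A = np.column_stack([np.ones(X.shape[0]), X[:, others]])
            # fit on the originally observed rows of column j, predict the rest
            obs = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta
    return X
```

On data where one column is an exact linear function of the others, the missing cells are recovered exactly; a proper implementation would add a random residual draw at each prediction so that the M completed data sets differ.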
IVEware offers five model formulations corresponding to five possible variable types: (1) linear regression models for continuous variables; (2) logistic regression models for binary variables; (3) a sequence of binary logistic regression models for nominal variables (also applicable for ordinal variables); (4) Poisson regression models for count variables; and (5) a binary logistic regression model coupled with a linear regression model for semi-continuous variables, or variables taking on either 0 or a positive value on a continuous scale (e.g., a measure of smoking

duration in years, which is 0 for individuals who indicate they have never smoked on a regular basis). One rather peculiar restriction is that IVEware will not work within the enhanced editor window of PC SAS. Rather, IVEware code must be run from within the program editor window, which can be opened by selecting the Program Editor option in the View menu of the PC SAS session. As discussed in the installation guide, there is an option to implement IVEware during the SAS session without explicitly running any executable files (although a standalone version exists). This is the avenue to be demonstrated presently. To get up and running, simply store the contents of the downloadable ZIP file in a directory and point SAS to it via a straightforward OPTIONS statement. For example, assuming the files have been stored in the local drive C:\SRCLIB\, the following line of code is all that is necessary:

options set = SRCLIB "C:\Srclib" sasautos = ('!SRCLIB' sasautos) mautosource;

The shell syntax below is an annotated description of the various statements available within the %IMPUTE module of IVEware. This is not a comprehensive listing, and some of the statements are optional. Scrutinizing the syntax offers some insight into why this is termed the %IMPUTE module as opposed to a macro. To execute a previously compiled macro, we generally only concern ourselves with assigning values to the macro parameters in parentheses. There are three macro parameters defined below (which, in my applied work, I have never found occasion to change), but there are also distinct statements ending with semicolons.
%IMPUTE (NAME=TEST,DIR=.,SETUP=NEW);
DATAIN      /* input data set */;
DATAOUT     /* output data set */ ALL; /* specify ALL after naming the output data set to have all M completed data sets stacked on top of one another; the default is just the first completed data set */
TRANSFER    /* variables to be transferred to the output data set, not used during the imputation process */;
CONTINUOUS  /* list of continuous variables, if the DEFAULT statement is used */;
CATEGORICAL /* list of categorical variables, binomial or multinomial */;
COUNT       /* list of count variables to be imputed via Poisson regression */;
MIXED       /* list of semi-continuous variables */;
DEFAULT     /* optional statement to assign a default variable type */;
BOUNDS      /* variable name followed by imputed value condition or range in parentheses */;
RESTRICT    /* variable name followed by condition for imputation to occur at all in parentheses */;
MULTIPLES   /* statement to assign M */;
SEED        /* way to assign a random number seed to replicate the analysis */;
PRINT       /* a few options for the types/amount of output generated to the listing */;
RUN;

The DATAIN and DATAOUT statements operate like the DATA= and OUT= options of the PROC MI statement. By default, IVEware only outputs the first completed data set, even if the MULTIPLES statement calls for M > 1. The ALL option in the DATAOUT statement overrides this default. As with PROC MI, the M completed data sets are stacked on top of one another with an identifier variable automatically appended. While PROC MI calls this variable _IMPUTATION_, IVEware calls it _MULT_. IVEware treats all variables in the input data set as potential predictor variables for the imputation process. If there are certain variables that would be inappropriate to use, list them in the TRANSFER statement. A prime example is a unique respondent identification variable.
In general, the utilization of all non-TRANSFER-statement variables in the imputation process brings up what could be perceived as a limitation of the tool: it does not offer the capability to tailor a subset of predictor variables for each outcome variable's imputation model. By comparison, this capability is offered in PROC MI, at least when the missingness pattern is monotone. There is a school of thought in the missing data literature, however, that takes the stance that one should include as many predictor variables as possible (e.g., Rubin, 1996). A simulation study by Reiter et al. (2006) concluded that maintaining random noise variables in an imputation model merely increased variability somewhat relative to a model that excluded those terms, which the authors felt was a small price to pay in exchange for the increased plausibility of the MAR assumption resulting from the larger covariate set. The purpose of the next series of statements outlined in the shell syntax above is to assign the respective variable types of which the input data set is composed. Each non-continuous variable must be listed in the CATEGORICAL (dichotomous or nominal), COUNT, or MIXED (semi-continuous) statement, unless the DEFAULT statement is used to define another default variable type. For instance, if most variables in the input data set are categorical, you can specify DEFAULT CATEGORICAL as its own statement and then assign the scale of the non-categorical variables amongst the other three statements (including CONTINUOUS) to shorten the amount of syntax required.

The BOUNDS and RESTRICT statements are optional, but frequently come in handy. The general syntax is to specify a variable name followed by some condition(s) in parentheses. For example, specifying BOUNDS LOS(>0) ensures only positive values of LOS will be imputed. Multiple conditions can be separated by commas. For example, specifying BOUNDS LOS(>0,<AGE-18) also guards against an imputed LOS suggesting the employee began his/her tenure before the age of 18. The RESTRICT statement can be used to impute only for those cases meeting the condition(s) specified within parentheses. An example might be imputing a missing value for the variable INCOME only for respondents who are (or get imputed as being) employed. If we suppose the condition of being currently employed is flagged by the character variable EMPLOYED equaling 'Y', we might include the statement RESTRICT INCOME(EMPLOYED='Y'). The code given below illustrates how to impute LOS and Q1 using IVEware. This author's preference is to completely enumerate the variables being passed to the %IMPUTE module with the help of a commented KEEP statement in a preliminary DATA step, as it helps ensure they are all handled properly. Because continuous variables are the default type, we do not need to list AGE and LOS explicitly. The four categorical variables at hand are listed in the CATEGORICAL statement, and the unique respondent identifier EMPID is assigned as a TRANSFER statement variable, since it would not make sense to use it as a continuous covariate in any imputation model. The MULTIPLES statement requests that M = 5 completed data sets be output to the data set SAMPLE_FROMIVE.
data sample_toive;
   set sample;
   keep EMPID                          /* ID variable to re-merge IVEware output data set */
        AGE GENDER MINORITY SUPERVISOR /* fully observed covariates */
        Q1 LOS                         /* partially observed variables requiring imputation */
   ;
%IMPUTE (NAME=TEST,DIR=.,SETUP=NEW);
DATAIN sample_toive;
DATAOUT sample_fromive ALL;
TRANSFER EMPID;
CATEGORICAL GENDER MINORITY SUPERVISOR Q1;
BOUNDS LOS(>0);
MULTIPLES 5;
RUN;

INFERENCES FROM MULTIPLY IMPUTED DATA

While we have seen numerous syntax examples performing multiple imputation (M = 5), we have yet to see code consolidating results from the completed data sets into a single point estimate and measure of variability. The present section focuses on this task. As was expressed formulaically earlier, the average of the M point estimates serves as the overall MI estimate, while the overall MI variance is the sum of (1) the average of the M completed data set variances plus (2) a term reflecting the variability of the M estimates themselves. A nice property of the technique that has only helped bolster its appeal is that these formulas are the same regardless of the quantity being estimated (e.g., mean, total, quantile). We can even make use of the SAS/STAT procedure PROC MIANALYZE to carry out the computations. The only subtlety we need to concern ourselves with is whether the quantity is a descriptive statistic or a multivariate statistic (i.e., whether the measure of variability is expressed as a scalar or a matrix). In either case, however, the general process is to capture the estimate and measure(s) of variability independently from each of the M completed data sets and supply a summarized data set housing the results to PROC MIANALYZE. Let us first consider a simple example estimating the mean of LOS after multiple imputation. Without loss of generality, suppose we are using the IVEware-generated concatenated data set SAMPLE_FROMIVE created in the previous section, comprised of M x n = 5 x 8,583 = 42,915 distinct observations.
Since we are analyzing survey data, it is appropriate to use one of the SAS/STAT procedures prefixed by SURVEY; here, PROC SURVEYMEANS in lieu of PROC MEANS (Lewis, 2010). The entire SAMPLE_FROMIVE data set is fed to PROC SURVEYMEANS, with independent estimates obtained by specifying _MULT_ in the BY statement. A preliminary PROC SORT step is conducted to ensure the data set is oriented properly by this variable. The ODS OUTPUT statement stores the (M = 5) estimates and standard errors, among a few other default statistics, in a data set named STATS. Figure 6 is a screen shot of this summary data set.

* sort the concatenated data by the completed data set indicator _MULT_;
proc sort data=sample_fromive;
   by _MULT_;

* compute and store the M estimated means and standard errors of LOS;
ods output statistics=stats;

proc surveymeans data=sample_fromive mean stderr;
   by _MULT_;
   var LOS;

Figure 6. Screen Shot of STATS Data Set.

The next step is to feed the STATS data set to PROC MIANALYZE and point it to the variables maintaining the estimates and standard errors, respectively, via the MODELEFFECTS and STDERR statements. Conveniently, PROC SURVEYMEANS names them MEAN and STDERR. Note that we feed the summarized measures of uncertainty to PROC MIANALYZE in terms of standard errors (the square roots of the variances) in this instance, yet this is not always the case, as we will see with the next example. PROC MIANALYZE assumes the M estimates are stacked vertically; hence, it ascertains M from the number of rows in the input data set.

* get the overall MI estimate and standard error;
proc mianalyze data=stats;
   modeleffects mean;
   stderr stderr;

[The MIANALYZE Procedure listing: a Model Information table (Data Set WORK.STATS; Number of Imputations 5); a Variance Information table (between, within, and total variance; relative increase in variance; fraction missing information; relative efficiency); and a Parameter Estimates table (estimate, standard error, and 95% confidence limits for the mean).]

Output 3. Partial Output from PROC MIANALYZE for the Mean of LOS Based on the M = 5 Completed Data Sets Generated by IVEware.

The overall MI mean and standard error can be located in the Parameter Estimates portion of the PROC MIANALYZE output. Note that the value reported under the Estimate heading is the arithmetic mean of the five estimated means summarized in the STATS data set, whereas the quantity under the Std Error heading is a bit larger than the respective completed data set standard errors. This reflects the incorporation of the between-imputation variability component. While there are various additional quantities output to the listing by default, one particularly useful diagnostic of the imputation process is the fraction of missing information (FMI) (see Section 3.3 of Rubin, 1987). The FMI is defined as the between-imputation variance component divided by the total MI variance, and can be thought of as the portion of the variance attributable to multiple imputation. In fact, a small FMI could be grounds for justifying single imputation over multiple imputation. In an uninformative model, such as an intercept-only regression model without covariates or the single-cell hot-deck routine discussed above during the exposition of the concept of proper multiple imputation, we would expect the FMI to more or less equal the item nonresponse rate. To the extent that the FMI is smaller than the nonresponse rate, it is evidence the covariates employed in the imputation model serve to recapture a portion of the missing data uncertainty. In this particular analysis, the FMI is approximately 46%, which is not far from the item nonresponse rate of 47%, suggesting the four covariates used for imputing LOS have poor explanatory power. The FMI is estimate-specific, however, so a high or low value for one estimate may not carry over to all other estimates of interest. The only other PROC MIANALYZE wrinkle worthy of demonstration arises when multivariate quantities are estimated independently on each completed data set, since the set-up differs somewhat.
Suppose instead of the mean of LOS we were interested in modeling Q1 as a function of the respondent's gender and length of service. Since Q1 is a dichotomous outcome, logistic regression is the preferred modeling tool. The syntax below utilizes PROC SURVEYLOGISTIC, again the SURVEY companion procedure to PROC LOGISTIC, both of which are available in SAS/STAT, to fit this model. The BY statement is used comparably to what was shown before. The ODS OUTPUT statement stores the model parameters and covariance matrices for each of the M = 5 completed data sets in data sets named BETAS and COV_MATRIX, respectively. A little renaming is done via data set options, because in the multivariate form PROC MIANALYZE will be looking for the more familiar variable _IMPUTATION_ instead of the _MULT_ output by IVEware. Screen shots of these two summary data sets appear after the example code.

ods output ParameterEstimates=betas (rename=(_mult_=_imputation_))
           covb=cov_matrix (rename=(_mult_=_imputation_));
proc surveylogistic data=sample_fromive;
   by _MULT_;
   model Q1(event='1') = gender LOS / covb;

Figure 7. Screen Shots of Data Sets BETAS and COV_MATRIX.

Under the structure in Figure 7, there is one data set housing the M distinct sets of model parameters and a separate data set housing the respective covariance matrices. Note the subtle differences relative to the summarized data set STATS from Figure 6. Aside from the obvious fact that estimates and measures of variability are housed in separate data sets, the COV_MATRIX data set houses variances in lieu of standard errors. The code below illustrates the slightly altered PROC MIANALYZE syntax. The output generated is very similar to that generated above, so for brevity it is not reproduced here. Again, the objective was to illustrate the modified PROC MIANALYZE syntax this kind of analysis necessitates.

proc mianalyze parms=betas covb=cov_matrix;
   modeleffects intercept gender LOS;
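The combining arithmetic PROC MIANALYZE carries out in both of the preceding examples is straightforward enough to verify by hand. The Python sketch below implements Rubin's combining rules for a scalar estimate using hypothetical inputs; the function and variable names are illustrative, not anything PROC MIANALYZE exposes.

```python
# Rubin's combining rules: the overall MI point estimate is the average of the
# M completed-data-set estimates, and the total variance is the average
# within-imputation variance plus a (1 + 1/M)-inflated between-imputation variance.

def rubin_combine(estimates, variances):
    """estimates, variances: per-completed-data-set point estimates and variances."""
    M = len(estimates)
    qbar = sum(estimates) / M                              # overall MI estimate
    W = sum(variances) / M                                 # within-imputation variance
    B = sum((q - qbar) ** 2 for q in estimates) / (M - 1)  # between-imputation variance
    T = W + (1 + 1 / M) * B                                # total MI variance
    # large-sample fraction of missing information; PROC MIANALYZE applies an
    # additional degrees-of-freedom adjustment, so its printed FMI can differ slightly
    fmi = (1 + 1 / M) * B / T
    return qbar, T, fmi

# hypothetical estimates and variances from M = 5 completed data sets
qbar, T, fmi = rubin_combine([10.0, 10.4, 9.8, 10.2, 10.1],
                             [0.25, 0.24, 0.26, 0.25, 0.25])
```

For the logistic regression example, the same rules yield per-parameter standard errors when applied element-wise, with each within-imputation variance taken from the diagonal of the corresponding covariance matrix.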

CONCLUSION

This paper began by defining some of the terminology pertinent to missing data within the realm of applied survey research, and then briefly outlined the various models that can be formulated to combat the missingness via the technique of imputing, or filling in, the missing values. These methods typically operate under the MAR assumption defined by Little and Rubin (2002), and necessitate that at least some fully observed information be known for the entire sample. This auxiliary information was denoted X. The general notion of imputation is to model the relationship between X and the survey outcome for the observed cases and employ it to derive values for the missing cases. PROC MI has a variety of tools for when the missingness pattern is monotone. For arbitrary patterns of missingness, particularly when categorical variables are involved, IVEware offers a little more flexibility. Both tools have been designed to multiply impute the missing values, but even if single imputation is the objective, one can retain only one of the M completed data sets generated. For those who do perform multiple imputation, this paper demonstrated a few examples of PROC MIANALYZE, a procedure that can be used to combine the M estimates and measures of variability computed independently on each completed data set into a single estimate and measure of variability using the rules defined by Rubin (1987).

REFERENCES

Andridge, R., and Little, R. (2010). "A Review of Hot Deck Imputation for Survey Non-response," International Statistical Review, 78, pp.
Andridge, R., and Little, R. (2011). "Proxy Pattern-Mixture Analysis for Survey Nonresponse," Journal of Official Statistics, 27, pp.
Berglund, P. (2010). "An Introduction to Multiple Imputation of Complex Sample Data using SAS v9.2," Paper presented at the SAS Global Forum, Seattle, WA, April. Available online at:
Bethlehem, J. (1988). "Reduction of Nonresponse Bias Through Regression Estimation."
Journal of Official Statistics, 4, pp.
Brick, J.M., and Kalton, G. (1996). "Handling Missing Data in Survey Research," Statistical Methods in Medical Research, 5, pp.
Efron, B. (1994). "Missing Data, Imputation, and the Bootstrap," Journal of the American Statistical Association, 89, pp.
Fay, R. (1996). "When Are Inferences from Multiple Imputation Valid?" Proceedings of the Joint Statistical Meetings of the American Statistical Association.
Heitjan, F., and Little, R. (1991). "Multiple Imputation for the Fatal Accident Reporting System," Applied Statistics, 40, pp.
Hosmer, D., and Lemeshow, S. (2000). Applied Logistic Regression. 2nd Edition. New York, NY: Wiley.
Kim, J.K., and Fuller, W. (2004). "Fractional Hot Deck Imputation," Biometrika, 91, pp.
Lewis, T. (2010). "Principles of Proper Inferences from Complex Survey Data," Paper presented at the SAS Global Forum, Seattle, WA, April. Available online at:
Lewis, T. (2012). "Weighting Adjustment Methods for Nonresponse in Surveys," Invited paper presented at the Western Users of SAS Software (WUSS) Conference, Long Beach, CA, September 5-7. Available online at:
Little, R., and Rubin, D. (2002). Statistical Analysis with Missing Data. 2nd Edition. New York, NY: Wiley.
Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solenberger, P. (2001). "A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models," Survey Methodology, 27, pp.
Rao, J.N.K., and Shao, J. (1992). "Jackknife Variance Estimation with Survey Data Under Hot Deck Imputation," Biometrika, 79, pp.
Rosenbaum, P., and Rubin, D. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, pp.
Rubin, D., and Schenker, N. (1986). "Multiple Imputation for Interval Estimation from Simple Random Samples with


More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics SPSS Complex Samples 15.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES

MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES UNIVERSITY OF GLASGOW MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES by KHUNESWARI GOPAL PILLAY A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in

More information

Handbook of Statistical Modeling for the Social and Behavioral Sciences

Handbook of Statistical Modeling for the Social and Behavioral Sciences Handbook of Statistical Modeling for the Social and Behavioral Sciences Edited by Gerhard Arminger Bergische Universität Wuppertal Wuppertal, Germany Clifford С. Clogg Late of Pennsylvania State University

More information

Missing Data Techniques

Missing Data Techniques Missing Data Techniques Paul Philippe Pare Department of Sociology, UWO Centre for Population, Aging, and Health, UWO London Criminometrics (www.crimino.biz) 1 Introduction Missing data is a common problem

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS Examples: Missing Data Modeling And Bayesian Analysis CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS Mplus provides estimation of models with missing data using both frequentist and Bayesian

More information

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999.

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999. 2 Schafer, J. L., Graham, J. W.: (2002). Missing Data: Our View of the State of the Art. Psychological methods, 2002, Vol 7, No 2, 47 77 Rosner, B. (2005) Fundamentals of Biostatistics, 6th ed, Wiley.

More information

Missing Data Analysis with SPSS

Missing Data Analysis with SPSS Missing Data Analysis with SPSS Meng-Ting Lo (lo.194@osu.edu) Department of Educational Studies Quantitative Research, Evaluation and Measurement Program (QREM) Research Methodology Center (RMC) Outline

More information

Weighting and estimation for the EU-SILC rotational design

Weighting and estimation for the EU-SILC rotational design Weighting and estimation for the EUSILC rotational design JeanMarc Museux 1 (Provisional version) 1. THE EUSILC INSTRUMENT 1.1. Introduction In order to meet both the crosssectional and longitudinal requirements,

More information

NORM software review: handling missing values with multiple imputation methods 1

NORM software review: handling missing values with multiple imputation methods 1 METHODOLOGY UPDATE I Gusti Ngurah Darmawan NORM software review: handling missing values with multiple imputation methods 1 Evaluation studies often lack sophistication in their statistical analyses, particularly

More information

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU.

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU. SOS3003 Applied data analysis for social science Lecture note 04-2009 Erling Berge Department of sociology and political science NTNU Erling Berge 2009 1 Missing data Literature Allison, Paul D 2002 Missing

More information

Missing Data in Orthopaedic Research

Missing Data in Orthopaedic Research in Orthopaedic Research Keith D Baldwin, MD, MSPT, MPH, Pamela Ohman-Strickland, PhD Abstract Missing data can be a frustrating problem in orthopaedic research. Many statistical programs employ a list-wise

More information

Multiple Imputation with Mplus

Multiple Imputation with Mplus Multiple Imputation with Mplus Tihomir Asparouhov and Bengt Muthén Version 2 September 29, 2010 1 1 Introduction Conducting multiple imputation (MI) can sometimes be quite intricate. In this note we provide

More information

Multiple imputation using chained equations: Issues and guidance for practice

Multiple imputation using chained equations: Issues and guidance for practice Multiple imputation using chained equations: Issues and guidance for practice Ian R. White, Patrick Royston and Angela M. Wood http://onlinelibrary.wiley.com/doi/10.1002/sim.4067/full By Gabrielle Simoneau

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

Handling Data with Three Types of Missing Values:

Handling Data with Three Types of Missing Values: Handling Data with Three Types of Missing Values: A Simulation Study Jennifer Boyko Advisor: Ofer Harel Department of Statistics University of Connecticut Storrs, CT May 21, 2013 Jennifer Boyko Handling

More information

Show how the LG-Syntax can be generated from a GUI model. Modify the LG-Equations to specify a different LC regression model

Show how the LG-Syntax can be generated from a GUI model. Modify the LG-Equations to specify a different LC regression model Tutorial #S1: Getting Started with LG-Syntax DemoData = 'conjoint.sav' This tutorial introduces the use of the LG-Syntax module, an add-on to the Advanced version of Latent GOLD. In this tutorial we utilize

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Enterprise Miner Tutorial Notes 2 1

Enterprise Miner Tutorial Notes 2 1 Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Creating a data file and entering data

Creating a data file and entering data 4 Creating a data file and entering data There are a number of stages in the process of setting up a data file and analysing the data. The flow chart shown on the next page outlines the main steps that

More information

STAT10010 Introductory Statistics Lab 2

STAT10010 Introductory Statistics Lab 2 STAT10010 Introductory Statistics Lab 2 1. Aims of Lab 2 By the end of this lab you will be able to: i. Recognize the type of recorded data. ii. iii. iv. Construct summaries of recorded variables. Calculate

More information

STATISTICS (STAT) Statistics (STAT) 1

STATISTICS (STAT) Statistics (STAT) 1 Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

More information

Chapter 6: Examples 6.A Introduction

Chapter 6: Examples 6.A Introduction Chapter 6: Examples 6.A Introduction In Chapter 4, several approaches to the dual model regression problem were described and Chapter 5 provided expressions enabling one to compute the MSE of the mean

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression Lecture Simple Regression, An Overview, and Simple Linear Regression Learning Objectives In this set of lectures we will develop a framework for simple linear, logistic, and Cox Proportional Hazards Regression

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects Patralekha Bhattacharya Thinkalytics The PDLREG procedure in SAS is used to fit a finite distributed lagged model to time series data

More information

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models Bengt Muth en University of California, Los Angeles Tihomir Asparouhov Muth en & Muth en Mplus

More information

Handling missing data for indicators, Susanne Rässler 1

Handling missing data for indicators, Susanne Rässler 1 Handling Missing Data for Indicators Susanne Rässler Institute for Employment Research & Federal Employment Agency Nürnberg, Germany First Workshop on Indicators in the Knowledge Economy, Tübingen, 3-4

More information

Cross-validation and the Bootstrap

Cross-validation and the Bootstrap Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set,

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008

HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008 HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008 HILDA Standard Errors: A Users Guide Clinton Hayes The HILDA Project was initiated, and is funded, by the Australian Government Department of

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation Unit 5 SIMULATION THEORY Lesson 39 Learning objective: To learn random number generation. Methods of simulation. Monte Carlo method of simulation You ve already read basics of simulation now I will be

More information

A Fast Multivariate Nearest Neighbour Imputation Algorithm

A Fast Multivariate Nearest Neighbour Imputation Algorithm A Fast Multivariate Nearest Neighbour Imputation Algorithm Norman Solomon, Giles Oatley and Ken McGarry Abstract Imputation of missing data is important in many areas, such as reducing non-response bias

More information

Missing Data. Where did it go?

Missing Data. Where did it go? Missing Data Where did it go? 1 Learning Objectives High-level discussion of some techniques Identify type of missingness Single vs Multiple Imputation My favourite technique 2 Problem Uh data are missing

More information

Notes on Simulations in SAS Studio

Notes on Simulations in SAS Studio Notes on Simulations in SAS Studio If you are not careful about simulations in SAS Studio, you can run into problems. In particular, SAS Studio has a limited amount of memory that you can use to write

More information

Lecture 26: Missing data

Lecture 26: Missing data Lecture 26: Missing data Reading: ESL 9.6 STATS 202: Data mining and analysis December 1, 2017 1 / 10 Missing data is everywhere Survey data: nonresponse. 2 / 10 Missing data is everywhere Survey data:

More information

An introduction to SPSS

An introduction to SPSS An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

More information

UNIT 4. Research Methods in Business

UNIT 4. Research Methods in Business UNIT 4 Preparing Data for Analysis:- After data are obtained through questionnaires, interviews, observation or through secondary sources, they need to be edited. The blank responses, if any have to be

More information

A Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models

A Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models A Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models Eva de Jong, Nino Mushkudiani and Barry Schouten ASD workshop, November 6-8, 2017 Outline Bayesian

More information

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS Analysis of Complex Sample Survey Data Using the SURVEY PROCEDURES and Macro Coding Patricia A. Berglund, Institute For Social Research-University of Michigan, Ann Arbor, Michigan ABSTRACT The paper presents

More information

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research Liping Huang, Center for Home Care Policy and Research, Visiting Nurse Service of New York, NY, NY ABSTRACT The

More information

SAS Graphics Macros for Latent Class Analysis Users Guide

SAS Graphics Macros for Latent Class Analysis Users Guide SAS Graphics Macros for Latent Class Analysis Users Guide Version 2.0.1 John Dziak The Methodology Center Stephanie Lanza The Methodology Center Copyright 2015, Penn State. All rights reserved. Please

More information

Contents of SAS Programming Techniques

Contents of SAS Programming Techniques Contents of SAS Programming Techniques Chapter 1 About SAS 1.1 Introduction 1.1.1 SAS modules 1.1.2 SAS module classification 1.1.3 SAS features 1.1.4 Three levels of SAS techniques 1.1.5 Chapter goal

More information

Data corruption, correction and imputation methods.

Data corruption, correction and imputation methods. Data corruption, correction and imputation methods. Yerevan 8.2 12.2 2016 Enrico Tucci Istat Outline Data collection methods Duplicated records Data corruption Data correction and imputation Data validation

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis Applied Mathematical Sciences, Vol. 5, 2011, no. 57, 2807-2818 Simulation Study: Introduction of Imputation Methods for Missing Data in Longitudinal Analysis Michikazu Nakai Innovation Center for Medical

More information

Opening Windows into the Black Box

Opening Windows into the Black Box Opening Windows into the Black Box Yu-Sung Su, Andrew Gelman, Jennifer Hill and Masanao Yajima Columbia University, Columbia University, New York University and University of California at Los Angels July

More information

Performance of Sequential Imputation Method in Multilevel Applications

Performance of Sequential Imputation Method in Multilevel Applications Section on Survey Research Methods JSM 9 Performance of Sequential Imputation Method in Multilevel Applications Enxu Zhao, Recai M. Yucel New York State Department of Health, 8 N. Pearl St., Albany, NY

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help Get help interpreting a table

Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help Get help interpreting a table Q Cheat Sheets What to do when you cannot figure out how to use Q What to do when the data looks wrong Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help

More information

Telephone Survey Response: Effects of Cell Phones in Landline Households

Telephone Survey Response: Effects of Cell Phones in Landline Households Telephone Survey Response: Effects of Cell Phones in Landline Households Dennis Lambries* ¹, Michael Link², Robert Oldendick 1 ¹University of South Carolina, ²Centers for Disease Control and Prevention

More information

Handling Your Data in SPSS. Columns, and Labels, and Values... Oh My! The Structure of SPSS. You should think about SPSS as having three major parts.

Handling Your Data in SPSS. Columns, and Labels, and Values... Oh My! The Structure of SPSS. You should think about SPSS as having three major parts. Handling Your Data in SPSS Columns, and Labels, and Values... Oh My! You might think that simple intuition will guide you to a useful organization of your data. If you follow that path, you might find

More information

A User Manual for the Multivariate MLE Tool. Before running the main multivariate program saved in the SAS file Part2-Main.sas,

A User Manual for the Multivariate MLE Tool. Before running the main multivariate program saved in the SAS file Part2-Main.sas, A User Manual for the Multivariate MLE Tool Before running the main multivariate program saved in the SAS file Part-Main.sas, the user must first compile the macros defined in the SAS file Part-Macros.sas

More information

SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure

SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure This document is an individual chapter from SAS/STAT 14.3 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute

More information

COPULA MODELS FOR BIG DATA USING DATA SHUFFLING

COPULA MODELS FOR BIG DATA USING DATA SHUFFLING COPULA MODELS FOR BIG DATA USING DATA SHUFFLING Krish Muralidhar, Rathindra Sarathy Department of Marketing & Supply Chain Management, Price College of Business, University of Oklahoma, Norman OK 73019

More information

SAS Enterprise Miner : Tutorials and Examples

SAS Enterprise Miner : Tutorials and Examples SAS Enterprise Miner : Tutorials and Examples SAS Documentation February 13, 2018 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS Enterprise Miner : Tutorials

More information