Tools for Imputing Missing Data


Taylor Lewis, University of Maryland, College Park, MD

ABSTRACT

Missing data frequently pose a problem to applied researchers and statisticians. Although a common approach is to simply ignore the missing data and analyze only the fully observed portion, an alternative is to impute, or fill in, the missing data, which can often prove advantageous. This paper begins by discussing patterns of missing data as well as the assumptions behind techniques to compensate for them. Much of the paper focuses on tools conveniently built into PROC MI, which allows one to conduct multiple imputation as a way to incorporate the additional uncertainty inherent when imputing missing data. Of course, PROC MI can still be used for single imputation. The paper will also illustrate the %IMPUTE module of IVEware, a powerful (free) SAS add-on developed by researchers at the Institute for Social Research at the University of Michigan, which is particularly helpful for tackling multivariate missingness.

INTRODUCTION

Lewis (2012) discussed the dilemma of missing data within the realm of applied survey research and distinguished between two broad types of missingness: unit nonresponse and item nonresponse. Unit nonresponse refers to the situation in which all key outcome variables are missing; that is, the sample unit fails to respond to the survey request. On the other hand, item nonresponse refers to the situation in which some, but not all, outcome variables are missing. The sample unit may have refused or been unable to answer certain items, or perhaps one or more items were unintentionally skipped. These two missing data situations are juxtaposed in Figure 1.

Figure 1. Illustration of Unit Nonresponse versus Item Nonresponse.
The typical remedy for unit nonresponse is to reweight the responding cases such that they better reflect known distributions of the sample (or population) with respect to a set of auxiliary variables, denoted X in Figure 1 above. Lewis (2012) provided some of the basic theory behind those weighting techniques and demonstrated SAS syntax to implement them. The current paper is a continuation of that paper, shifting focus toward techniques that mitigate item nonresponse by imputing, or filling in, the missing data. These methods exploit the relationship between X and the outcome variable(s) for the observed cases to derive plausible values for the outcome variables of missing cases. That is, it is assumed X is fully observed for the entire data set, respondents and nonrespondents alike. In addition, certain underlying assumptions about the missingness mechanism must hold. A brief taxonomy of these assumptions is discussed next.

The stochastic view of survey nonresponse posits that each sample unit possesses a fixed (but unknown) probability of responding to a survey request. Following the terminology of Rosenbaum and Rubin (1983), this is often called a response propensity and denoted $\rho_i$. Bethlehem (1988) showed that in a simple random sample of size n from a sample frame of N population units, the expected bias of $\hat{y}_r$, the mean using only responding sample units, relative to $\bar{y}$, the full population mean, can be expressed as $\text{bias}(\hat{y}_r) = \frac{1}{(N-1)\bar{\rho}} \sum_{i=1}^{N} (\rho_i - \bar{\rho})(y_i - \bar{y})$, where $\bar{\rho}$ denotes the average

response propensity across all population units. Thus, the bias is proportional to the population covariance between the propensities and the outcome variable. If we adopt this perspective about survey nonresponse, the three distinct missing data assumptions defined by Little and Rubin (2002) are useful for considering how harmful the nonresponse is and whether any potential biases can be eliminated. The first assumes data are missing completely at random (MCAR), which implies $\rho_i = \bar{\rho}$ for all units. Since the $\rho_i$s do not vary, they are necessarily uncorrelated with any outcome variable(s). This poses the least harmful situation, as the responding cases can be thought of as a completely random subsample. There would be no expected bias using $\hat{y}_r$ without making any adjustments, although there would likely be a loss in precision. The second assumption is that the data are missing at random (MAR), which is to say the $\rho_i$s vary only with regard to the sample units' vector of auxiliary variables. Units with comparable $X_i$s share comparable $\rho_i$s, and there is no additional dependency between the likelihood of item nonresponse and any outcome variable. This is the situation generally assumed by the imputation methods demonstrated in this paper as well as the weighting methods demonstrated in Lewis (2012). The first and second assumptions are collectively referred to as ignorable missingness mechanisms by Little and Rubin (2002). This sometimes confuses analysts, because in actuality the missingness is ignorable only after you properly adjust for it. The third assumption, data that are not missing at random (NMAR), is the most perilous. This is categorized by Little and Rubin (2002) as a non-ignorable missingness mechanism and implies there is a dependency between the $\rho_i$s and the outcome variable beyond what can be accounted for by X. For example, suppose a mail survey is aimed at measuring the proportion of the electorate that voted in the most recent presidential election.
If people who did not vote are less inclined to respond to the survey request across all auxiliary variables on the sample frame (e.g., race/ethnicity, age, neighborhood), it is doubtful that an imputation approach using those variables would be able to completely eliminate the bias. Rather sophisticated techniques are required to handle the NMAR situation (cf. Ch. 6 of Rubin, 1987; Andridge and Little, 2011); they are beyond the scope of this paper. There are critics who believe imputation is fabricating data and that only the observed portion should be analyzed. As Brick and Kalton (1996) point out, however, forgoing any kind of adjustment is tantamount to assuming data are MCAR, which is far more questionable than the MAR assumption implicit in most imputation approaches! Although any one of the three classifications outlined by Little and Rubin (2002) is rarely possible to verify (or refute), the plausibility of the MAR assumption increases with a larger number of auxiliary variables.

AN EXAMPLE SURVEY

To motivate the application of these techniques, assume an employee satisfaction survey was conducted on a sample of individuals who work in a large organization. From a large personnel database serving as the sample frame, a simple random sample of n = 8,583 was drawn, and these employees were sent a personalized link via email to a Web-based survey instrument containing a variety of attitudinal questions and a few demographics. Weekly reminder emails were sent to nonrespondents, but after a few weeks the survey closed with r = 4,558 completes, corresponding to a response rate of 4,558 / 8,583 = 53.1%. For the m = n - r = 8,583 - 4,558 = 4,025 employees who never responded to the survey, unit nonresponse and item nonresponse can be treated equivalently in terms of the compensation method of choice.
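With a 53.1% response rate, any correlation between response propensity and an outcome leaves bias that an adjustment must remove, and Bethlehem's approximation above makes this concrete. The following Python simulation is only illustrative; the propensities and outcome values are hypothetical, not data from the example survey:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: outcome y (1 = voted) correlated with the
# response propensity rho (voters respond more often than nonvoters)
N = 100_000
y = rng.binomial(1, 0.6, size=N).astype(float)
rho = np.where(y == 1, 0.6, 0.4)

# Realized respondents
resp = rng.random(N) < rho

# Bethlehem's approximation: bias(ybar_r) ~ cov(rho, y) / rho_bar
rho_bar = rho.mean()
approx_bias = ((rho - rho_bar) * (y - y.mean())).sum() / ((N - 1) * rho_bar)

observed_bias = y[resp].mean() - y.mean()
print(round(approx_bias, 3), round(observed_bias, 3))
```

The two quantities agree closely (both near 0.09 in this setup), confirming that the respondent-only mean overstates the population proportion whenever propensity and outcome covary.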
Whereas Lewis (2012) demonstrated approaches to reweight the 4,558 respondents to better reflect demographic distributions of the entire organization (i.e., the finite population of interest), this paper discusses imputation techniques to fill in the missing data and render a completed data set for the entire sample of 8,583 originally planned. Fortuitously, a set of auxiliary variables maintained in the personnel database is known for the entire sample and can thus be used to facilitate the imputation process. The following four will be utilized during illustrations in this paper:

GENDER: M/F indicator of employee gender
SUPERVISOR: 0/1 indicator of whether the employee is a supervisor
AGE: age of employee at time of survey
MINORITY: Y/N indicator of minority status

Suppose the data set SAMPLE contains these four variables for all 8,583 sampled employees as well as the survey outcome variables where observed. A few example outcome variables to be considered:

LOS: a continuous variable housing the employee's length of service with the organization

Q1_FULL: respondent's selection on a five-point Likert scale (i.e., ranging from "Completely Agree" to "Completely Disagree") to the question "I like the kind of work I do."
Q1: a dichotomized version of Q1_FULL, a 0/1 indicator of whether an employee responded positively (i.e., answered "Completely Agree" or "Agree") to the question "I like the kind of work I do."

IMPUTATION CONCEPTS AND MODEL SPECIFICATIONS

The purpose of this section is to outline the various ways an imputation model can be specified and to illustrate the actual imputation process with the help of a few simple examples. A naïve approach frequently employed to handle item nonresponse is to fill in the m missing cases with the overall mean of the r observed cases. This is also known as unconditional mean imputation and is an example of a deterministic imputation model. This approach has no impact on certain descriptive statistics such as the sample mean (the sample means prior to and after imputation are equivalent), but it generally leads to an unjust reduction in variance. To see why, consider the variance approximation formula for the unadjusted sample mean of the observed cases $\hat{y}_r$: $\widehat{\text{var}}(\hat{y}_r) = \frac{1}{r(r-1)} \sum_{i=1}^{r} (y_i - \hat{y}_r)^2$. Substituting $\hat{y}_r$ for the m missing cases has no effect on the summation term. Even though the summation now runs to n instead of r, the m values added each contribute a squared deviation of 0; however, note that the denominator terms both increase, from r and (r - 1) to n and (n - 1), respectively, which results in a decreased variance approximation. This seems ill-advised considering no new information has been introduced. A somewhat preferred alternative, at least from a variance perspective, is to employ a stochastic imputation model in which we start with the mean or expected value and then add on some kind of residual.
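The variance reduction from unconditional mean imputation can be demonstrated numerically. The following Python sketch applies the variance formula above before and after filling the m missing cases with the observed mean; the observed values are simulated stand-ins, not the survey's actual LOS data:

```python
import numpy as np

rng = np.random.default_rng(7)

# r observed values; m missing cases filled with the observed mean
r, m = 4558, 4025
y_obs = rng.normal(10, 4, size=r)   # stand-in for observed LOS values

def var_of_mean(y):
    # var(ybar) = sum (y_i - ybar)^2 / (n (n - 1))
    n = len(y)
    return ((y - y.mean()) ** 2).sum() / (n * (n - 1))

y_imp = np.concatenate([y_obs, np.full(m, y_obs.mean())])

# The sample mean is unchanged, but the naive variance estimate shrinks
print(var_of_mean(y_obs), var_of_mean(y_imp))
```

Since the summation term is unchanged, the ratio of the two variance estimates is exactly r(r - 1) / [n(n - 1)], roughly 0.28 here, even though no new information was introduced.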
To further compare and contrast concepts of deterministic versus stochastic imputation models, consider the task of imputing LOS for the 4,025 nonrespondents using the auxiliary variable AGE. Figure 2 portrays the relationship between these two variables for a subset of the 4,558 observed cases. A straightforward approach might be to suppose a simple linear regression model holds, such as $LOS_i = \beta_0 + \beta_1 AGE_i + \varepsilon_i$, where the $\varepsilon_i$s are assumed normally distributed with mean 0 and some constant variance $\sigma^2$. The dashed line in Figure 2 depicts this model. The idea would be to estimate the parameters of this model using the observed data and plug in AGE for the ith nonrespondent to derive a plausible value of LOS to impute. Consider a nonrespondent who is 40 years old. A deterministic imputation model approach might stipulate the missing LOS value be extracted directly from what lies on the regression line, an example of a conditional mean imputation model. On the other hand, a stochastic imputation model approach might start at the regression line and then add a random residual in proportion to the estimated mean squared error (MSE) of the fitted model. If we denote the MSE $\hat{\sigma}^2$, this process is operationalized by assigning $\varepsilon_i = z_i \hat{\sigma}$, where $z_i$ represents a random normal deviate drawn independently for the ith nonrespondent.
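The deterministic/stochastic distinction can be sketched in a few lines of Python. The AGE/LOS values below are simulated purely for illustration; the fitted coefficients are not those of the paper's survey:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated observed cases: a rough stand-in for the AGE/LOS relationship
age_obs = rng.uniform(25, 65, size=500)
los_obs = -5 + 0.5 * age_obs + rng.normal(0, 3, size=500)

# Fit LOS_i = b0 + b1 * AGE_i + e_i on the observed cases
X = np.column_stack([np.ones_like(age_obs), age_obs])
beta, rss, *_ = np.linalg.lstsq(X, los_obs, rcond=None)
mse = rss[0] / (len(los_obs) - 2)   # estimated sigma^2

age_miss = np.array([40.0, 55.0])   # hypothetical nonrespondents' ages

# Deterministic (conditional mean) imputation: the value on the regression line
det = beta[0] + beta[1] * age_miss

# Stochastic imputation: add z_i * sigma_hat to the regression prediction
sto = det + rng.standard_normal(len(age_miss)) * np.sqrt(mse)
print(det, sto)
```

Repeated stochastic draws scatter around the regression line with spread governed by the MSE, whereas the deterministic version always returns the same fitted value for a given age.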

Figure 2. Sample of Observed Cases' Relationship between Employee Age and Length of Service in the Hypothetical Personnel Satisfaction Survey.

The stochastic imputation model approach discussed above is an example of an explicit, or parametric, imputation model. In contrast, an example of an implicit, or nonparametric, imputation model would be to place respondents and nonrespondents into cells based on some sensible age categorization (e.g., < 40, 40 to 60, and > 60) and impute instances of missing LOS by randomly selecting an observed case's value within the same cell. This technique is often referred to as hot-deck imputation and is one of the earliest techniques utilized in practice; see Andridge and Little (2010) for an excellent review. The next logical question posed by analysts is "How should we form cells?" It turns out guidance discussed in Lewis (2012) for the adjustment cell and propensity stratification methods carries over here. Glancing back at the bias formula attributable to Bethlehem (1988), the goal is to create cells in which either the $\rho_i$s or the $y_i$s (or both) are similar. However the cells are formed, it is important to bear in mind that the missing data assumption when applying either explicit or implicit models is the same: data are assumed MCAR given the covariates X. The only subtlety is that X can be perceived as a series of cell membership indicator variables in the implicit model formulation. Neither form is superior; both have advantages and disadvantages. One potential downside of applying an explicit model such as the one above is that the imputed value returned could be nonsensical. For example, we could get a negative LOS value, which is not possible in the observed data. This is less of a concern in the cell-based approach, which necessarily imputes values actually observed. A key advantage of the explicit model approach, however, is that it can accommodate a large number of covariates.
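To fix ideas, the cell-based hot deck just described can be sketched in a few lines of Python; the donor (age, LOS) pairs here are hypothetical values, not the survey data:

```python
import random

random.seed(2023)

# Hypothetical observed (age, los) donors and nonrespondents' ages
donors = [(32, 6.0), (38, 9.5), (45, 18.0), (52, 24.0), (61, 30.0), (67, 35.5)]
missing_ages = [41, 63]

def cell(age):
    # Sensible age categorization: < 40, 40 to 60, > 60
    if age < 40:
        return 0
    return 1 if age <= 60 else 2

# Build a donor pool of observed LOS values per cell
pools = {}
for age, los in donors:
    pools.setdefault(cell(age), []).append(los)

# Hot deck: draw a donor's observed LOS at random from the same cell
imputed = [random.choice(pools[cell(a)]) for a in missing_ages]
print(imputed)
```

By construction, every imputed value is one actually observed in the donor's cell, which is why nonsensical values (such as a negative LOS) cannot arise here.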
By comparison, defining cells based on the joint distribution of a large number of covariates tends to present a logistical problem, as small or empty cells seem inevitable. As Schenker and Taylor (1996, p. 430) note, it is undesirable to draw from too small a pool of observed values or to utilize the same observed values for imputation too frequently. Before continuing, it should be mentioned that there are techniques bridging the gap between strictly explicit and strictly implicit modeling approaches. These are typically referred to as partially parametric or semi-parametric techniques. Examples are the approaches discussed in Heitjan and Little (1991) and Schenker and Taylor (1996), which call for fitting an explicit regression model but affixing real residuals in some randomized fashion from neighboring observations in the raw data. The ideas discussed above translate to categorical variables, although the visualization and modeling approaches differ somewhat. Figure 3 shows the observed relationship between employee gender and the proportion of respondents reacting positively to the statement "I like the kind of work I do," computed simply by finding the mean of the Q1 indicator variable. We can infer from the observed data that males are less likely to answer on the positive end of the scale than females, so this auxiliary variable is at least somewhat useful for recapturing a portion of the information lost to the missingness (if we are comfortable assuming data are MCAR conditional on gender). In lieu of a linear regression approach, however, fitting a logistic regression model (Hosmer and Lemeshow, 2000) would be more appropriate in this setting.
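With a single binary covariate such as GENDER, the fitted logistic model's predicted probability is simply the observed proportion of positive responses within each gender, so stochastic imputation amounts to a Bernoulli draw at that probability. A minimal Python sketch with made-up proportions (not the survey's actual figures):

```python
import random

random.seed(8)

# Hypothetical observed positive-response proportions by gender; with a
# saturated logistic model these equal the predicted probabilities
p_hat = {"F": 0.78, "M": 0.71}

nonrespondents = ["F", "M", "M", "F", "M"]

# Stochastic imputation of the 0/1 item: Bernoulli draw at p_hat[gender]
imputed_q1 = [1 if random.random() < p_hat[g] else 0 for g in nonrespondents]
print(imputed_q1)
```

Drawing 0/1 values rather than imputing the predicted probability itself preserves the variable's dichotomous scale in the completed data set.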

Figure 3. Observed Cases' Relationship between Employee Gender and Proportion Reacting Positively to the Statement "I Like the Kind of Work I Do" (Q1).

Although stochastic imputation models tend to result in a less pronounced underestimation of variance, there is still a residual component of uncertainty ignored by treating the imputed values in the completed data set as truth. While more than one method has been proposed to account for this uncertainty (e.g., Rao and Shao, 1992; Efron, 1994; Fay, 1996; Kim and Fuller, 2002), arguably the most popular is multiple imputation (MI) (Rubin, 1987). Indeed, PROC MI has been developed to perform multiple imputation under a variety of imputation models. Syntax examples will be given in the next section. The notion of multiple imputation is to fill in the missing data independently M times (M >= 2), rendering M completed data sets. Figure 4 visualizes the process. If we denote the mth completed data set estimate $\hat{\theta}_m$, the overall MI estimate is found by simply averaging the M completed data set estimates, $\hat{\theta}_M = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m$. The overall MI variance is found by adding together the average of the M completed data set estimated variances, $\bar{U}_M = \frac{1}{M} \sum_{m=1}^{M} \widehat{\text{var}}(\hat{\theta}_m)$, which can be thought of as a measure of within-imputation variability, and a measure of between-imputation variability, $B_M = \frac{1}{M-1} \sum_{m=1}^{M} (\hat{\theta}_m - \hat{\theta}_M)^2$. In words, the between-imputation term represents the variance of the M completed data set estimates themselves. In total, the MI variance is approximated by $\widehat{\text{var}}(\hat{\theta}_M) = \bar{U}_M + \left(1 + \frac{1}{M}\right) B_M$. (The term $1 + \frac{1}{M}$ represents a finite imputation correction factor that tends to 1 as $M \rightarrow \infty$.) General theta notation is used to emphasize the fact that these formulas apply regardless of the estimator at hand.
As we will see in specific examples presented later in the paper, the basic strategy is to independently compute the estimator(s) and associated measure(s) of variability from each of the M completed data sets, then supply these figures to PROC MIANALYZE to carry out the formulas outlined above.
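Rubin's combining rules are simple enough to express directly. The following Python sketch mirrors the arithmetic PROC MIANALYZE performs; the five completed-data-set estimates and variances are made-up numbers for illustration:

```python
# Rubin's rules for combining M completed-data-set estimates
def mi_combine(estimates, variances):
    M = len(estimates)
    theta = sum(estimates) / M                              # overall MI estimate
    U = sum(variances) / M                                  # within-imputation variance
    B = sum((t - theta) ** 2 for t in estimates) / (M - 1)  # between-imputation variance
    total_var = U + (1 + 1 / M) * B
    return theta, total_var

# Five hypothetical completed-data-set means and their estimated variances
est, var = mi_combine([10.2, 10.5, 10.1, 10.4, 10.3], [0.04] * 5)
print(est, var)
```

Here the overall estimate is 10.3 and the total variance is 0.04 + 1.2 * 0.025 = 0.07, illustrating how the between-imputation spread inflates the naive within-imputation variance.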

Figure 4. Visualization of the Multiple Imputation Process.

A tenet of the MI process advocated by Rubin is that one must account for the imputation model uncertainty when deriving imputations. Failing to do so is "improper" in his terminology. This is a complicating factor not well understood by many analysts, in this author's humble opinion, yet it is an essential component of the process. A few simple examples are given next to help illustrate the concept. Suppose that in our sample of n = 8,583 employees, in which r = 4,558 responded and m = 4,025 did not respond, we are willing to assume data are MCAR, but we still sought to compensate for the missingness by imputing the missing values in a completely random fashion. We have already seen how filling in the mean leads to an imprudent variance reduction for the sample mean. It is worthwhile to consider two MI strategies that, in expectation, do not result in a comparable variance reduction. In other words, for large M, the overall MI variance equals the variance of the observed cases. The first to be considered is a nonparametric, single-cell hot-deck routine. We begin by selecting r cases with replacement from the set of r observed values; we might denote this set r*. From r*, we select m values with replacement and use these to impute the m missing cases. We then repeat this two-stage process independently M times to create M completed data sets. Rubin and Schenker (1986) refer to this technique as the approximate Bayesian bootstrap (ABB). The second approach is a parametric version of the first. Suppose that the sample mean of interest was actually a proportion, the mean of the Q1 0/1 indicator variable. From the observed data, it is straightforward to compute this proportion $\hat{p}_r$ and its variance $\widehat{\text{var}}(\hat{p}_r) = \frac{\hat{p}_r(1 - \hat{p}_r)}{r - 1}$. The first step for deriving values of Q1 for the nonrespondents is to account for the uncertainty in $\hat{p}_r$ by drawing $\hat{p}_r^* = \hat{p}_r + z_i \sqrt{\widehat{\text{var}}(\hat{p}_r)}$, where $z_i$ is a random normal deviate. From here, we draw $r_i$, a uniformly distributed random variable between 0 and 1, and impute a 1 for Q1 if $r_i \le \hat{p}_r^*$ and 0 otherwise. This process is conducted independently M times. Again, the pleasing property of either technique is that the expected MI variance estimate matches the variance estimate from only the observed cases. Although for brevity purposes we will not do so here, this can be verified via simulation. The first step in either approach is critical, however: the step accounting for the uncertainty inherent in the imputation model. Without it, some degree of variance underestimation will likely remain.
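The two-stage ABB draw is easy to express in Python. The observed 0/1 values below are toy data, not the survey's Q1 responses:

```python
import random

random.seed(11)

def abb_impute(observed, m):
    # Approximate Bayesian bootstrap (Rubin and Schenker, 1986):
    # 1) resample the r observed values with replacement -> donor pool r*
    # 2) draw the m imputations with replacement from r*
    pool = random.choices(observed, k=len(observed))
    return random.choices(pool, k=m)

observed_q1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # toy observed 0/1 responses

# M = 3 independent completed data sets, each filling 4 missing values
completed = [observed_q1 + abb_impute(observed_q1, 4) for _ in range(3)]
print(completed)
```

The first-stage resampling is what makes the procedure "proper": it injects the uncertainty about the observed distribution itself, so the between-imputation variance does not collapse as M grows.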

TOOLS FOR HANDLING UNIVARIATE MISSINGNESS

We begin the discussion regarding the specific imputation tools available within SAS by addressing univariate missingness, the scenario in which one or more completely observed variables are used to impute missing values of a single variable. It turns out univariate missingness is a special case of the more general monotone missingness pattern, which will be defined more formally in the next section when it is contrasted with an arbitrary missingness pattern (see Figure 5). But at this point, suffice it to say this explains why the handiest built-in imputation algorithms for this setting reside within the MONOTONE statement of PROC MI. Suppose we wanted to multiply impute the missing values of LOS in the data set SAMPLE M = 5 times with a model drawing upon all four auxiliary variables: GENDER, SUPERVISOR, AGE, and MINORITY. The PROC MI syntax in the example code below accomplishes this task. Since LOS is a continuous variable, we might opt for the linear regression approach implemented by the REG option of the MONOTONE statement. Although there is output generated in the listing, the key output is the M = 5 completed data sets, which are stored on top of one another in the data set we named SAMPLE_MI_5. In general, we can expect the output data set to consist of M times the number of observations in the input data set, where M is specified by the NIMPUTE= option in the PROC statement. Note that the automatically appended numeric variable _IMPUTATION_, taking on values 1, 2, ..., M, can be used to extract the mth completed data set. In the event you wish to conduct only single imputation, you can either extract one particular completed data set at random or specify NIMPUTE=1. Lastly, it is always advisable to specify a random number in the SEED= option to ensure results can be replicated if the PROC MI step is resubmitted at a later time.
Categorical variables must be listed in the CLASS statement and must also appear in the VAR statement alongside continuous variables. Since the variable SUPERVISOR is stored numerically as a 0/1 indicator, it does not need to appear in the CLASS statement. In general, the variables in the VAR statement must be listed in increasing order according to their rates of missingness, but in the present case of univariate missingness, we need only ensure LOS is listed last. Our explicit model is specified in typical MODEL statement syntax structure in parentheses after the REG option in the MONOTONE statement. Note that you can augment the list of explanatory variables to the right of the equals sign with syntax such as VAR1*VAR2 to include an interaction term for the effects of VAR1 and VAR2, or VAR1|VAR2|VAR3 to include all main effects and interaction terms among VAR1, VAR2, and VAR3.

proc mi data=sample out=sample_mi_5 nimpute=5 seed=943222;
  class gender minority;
  var gender minority age supervisor los;
  monotone reg (los=age supervisor gender minority / details);
run;

The DETAILS option within parentheses in the syntax above requests the imputation model coefficients (betas) for the observed portion of the data to be output, as well as the perturbed values drawn independently for each imputation. The output below shows these values through the first three imputations. The similarity or dissimilarity of each vector of coefficients from one imputation to the next is a function of the precision of the imputation model fitted from the observed data. The perturbation process of model coefficients in this setting is substantially more complicated than the simple examples discussed in the previous section and will not be discussed here. The reader seeking a more in-depth discussion is referred to the PROC MI documentation.
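The flavor of the perturbation can be conveyed with a deliberately simplified sketch: draw a fresh coefficient vector from an approximate sampling distribution before deriving each imputation. (PROC MI's actual draws come from a Bayesian posterior and also perturb the residual variance; the coefficients and standard errors below are hypothetical numbers, and this Python fragment is purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(12)

# Hypothetical fitted coefficients and standard errors from the
# observed-data regression (illustrative values only)
beta_hat = np.array([1.50, 0.48, 2.10])
se = np.array([0.40, 0.01, 0.30])

M = 5
# One independently perturbed coefficient vector per imputation
perturbed = [beta_hat + rng.standard_normal(3) * se for _ in range(M)]
for b in perturbed:
    print(np.round(b, 3))
```

Coefficients with small standard errors barely move from imputation to imputation, while imprecisely estimated ones vary noticeably, which is exactly the pattern the DETAILS output displays.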
The reason for highlighting this process is to reiterate that accounting for the uncertainty in the imputation model is a critical stage in the process of deriving imputed values.

Output 1. Partial Output from an Example PROC MI Run with the DETAILS Option Specified from a Linear Regression Imputation Model for a Continuous Outcome Variable.

Recall that a potential downside to utilizing a parametric modeling approach such as the one here is the risk of a nonsensical imputed value, such as a negative value for LOS. The PREDMM option in the MONOTONE statement can be employed to implement the semi-parametric techniques discussed previously and more formally described in Heitjan and Little (1991) and Schenker and Taylor (1996). The syntax is virtually equivalent, at least in terms of how we specify the model, but recall that a real residual is drawn from a pool of observed residuals neighboring the predicted value. See the documentation for more details. Examples of the syntax necessary to actually analyze each of the M completed data sets contained within SAMPLE_MI_5 and unify the results in a PROC MIANALYZE step are deferred to a later section of the paper. The direction of the paper at this point is to demonstrate univariate imputation syntax examples for variables measured on alternative scales, namely dichotomous, ordinal, and nominal, and eventually segue into multivariate missingness examples.

We next consider dichotomous variables. The LOGISTIC option in the MONOTONE statement of PROC MI is equipped to impute variables of this type. As the name suggests, it does so by fitting a logistic regression model to the observed data. The syntax below imputes the Q1 0/1 indicator variable exploiting the same four auxiliary variables used previously. As before, since we have specified NIMPUTE=5, the output data set SAMPLE_MI_5 contains five times the number of observations appearing in the input data set. Immediately following the syntax is a portion of the output generated by the DETAILS option. Note that these still represent the fully observed data and imputation-specific perturbed model coefficients, except they correspond to logistic regression model coefficients rather than linear regression model coefficients as in Output 1.

proc mi data=sample nimpute=5 seed=7726 out=sample_mi_5;
  class gender minority Q1;
  var gender minority age supervisor Q1;
  monotone logistic (Q1=gender minority age supervisor / details);
run;

Output 2.
Partial Output from an Example PROC MI Run with the DETAILS Option Specified from a Logistic Regression Imputation Model for a Dichotomous Outcome Variable.

The LOGISTIC option can actually be implemented for variables with three or more distinct values, but only if you are willing to assume an ordinal scale. This is because an ordinal logistic regression model is the only option to be fitted. At the time of this writing, the multinomial version, which would be a more suitable avenue for nominally scaled variables, is not available. The ordinal assumption may not be implausible for the variable Q1_FULL, the precollapsed version of Q1 ranging from 1 to 5, the respondent's selection on a five-point Likert scale. The only technique in PROC MI developed specifically for nominal variables is the DISCRIM option in the MONOTONE statement, which implements the discriminant method of multiple imputation as described in Brand (1999). This approach is restrictive, however, because it requires multivariate normality in the predictor variables, effectively eliminating the possibility of categorical predictor variables. An alternative is the PROPENSITY option in the MONOTONE statement, which uses the covariates specified to estimate propensities of missingness for all cases in the input data set. The propensities are then stratified according to their magnitudes, rendering cells within which the ABB routine described earlier is performed to derive the multiple imputations. This is reminiscent of the propensity stratification reweighting approach discussed in Lewis (2012). By default, 5 strata are created, but this can be modified using the NGROUPS= option after the slash within parentheses during the propensity model specification step. Although there are no scale requirements for this form of (hot-deck) imputation procedure, at the time of this writing, only numeric variables are accommodated.
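The PROPENSITY option's cell-based logic can be sketched as follows. In this Python fragment the propensities are random stand-ins for the scores one would estimate from a logistic model of missingness on the covariates, and the category labels are hypothetical:

```python
import random

random.seed(5)

# Assume missingness propensities "p" have already been estimated for
# all n cases; "y" is the item to impute (None marks a missing value)
records = [{"p": random.random(),
            "y": random.choice(["A", "B", "C", None])} for _ in range(200)]

NGROUPS = 5   # default number of propensity strata
records.sort(key=lambda rec: rec["p"])
size = len(records) // NGROUPS

for g in range(NGROUPS):
    cell = records[g * size:((g + 1) * size if g < NGROUPS - 1 else None)]
    donors = [rec["y"] for rec in cell if rec["y"] is not None]
    pool = random.choices(donors, k=len(donors))   # ABB stage 1: resample donors
    for rec in cell:
        if rec["y"] is None:
            rec["y"] = random.choice(pool)         # ABB stage 2: draw imputation
    # (stages as described in the ABB discussion earlier)

print(sum(1 for rec in records if rec["y"] is None))   # -> 0
```

Stratifying on the propensity score before applying the ABB keeps donors and recipients comparable with respect to their likelihood of missingness, paralleling the propensity stratification weighting adjustment.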
Hopefully, this will be rectified in a future release of SAS, but in the meantime the syntax outlined below serves as a general work-around. Without going into too much descriptive detail, the strategy is to create a numeric shadow variable (Q1_FULL_NUM) for the PROC MI step that consists of one unique integer for each unique category of the underlying nominal variable (Q1_FULL). This is accomplished with the help of a few user-defined formats. At the conclusion of the imputation process, the numeric version is back-transformed to the original categorical version.

* create data set of distinct, non-missing values of the given categorical variable;

proc sql;
  create table toformat1 as
  select distinct Q1_full as start
  from sample
  where not missing(Q1_full);
quit;

* create a format to substitute a sequential integer in place of the specific values;
data toformat1;
  set toformat1;
  label=_n_;
  type='c';
  fmtname='$char_num';
run;

proc format cntlin=toformat1;
run;

* create a placeholder numeric variable to utilize in the PROC MI step;
data sample;
  set sample;
  * initialize the shadow variable as numeric;
  Q1_full_num=0;
  Q1_full_num=put(Q1_full,$char_num.);
run;

proc mi data=sample nimpute=5 seed= out=sample_mi_5;
  class gender minority;
  var gender minority age supervisor Q1_full_num;
  monotone propensity (Q1_full_num=gender minority age supervisor);
run;

* swap the start/label values in the toformat data set;
data toformat2;
  set toformat1 (rename=(start=label label=start));
  fmtname='num_char';
  type='n';
run;

proc format cntlin=toformat2;
run;

* convert integer placeholders in output data set back to their respective values;
data sample_mi_5;
  set sample_mi_5;
  if Q1_full=' ' then Q1_full=put(Q1_full_num,num_char.);
  drop Q1_full_num;
run;

Another option for imputing nominal variables is to utilize IVEware, demonstrated in the next section. IVEware fits and applies a sequence of binary logistic regression models in a manner that is functionally equivalent to a multinomial logistic regression approach.

TOOLS FOR HANDLING MULTIVARIATE MISSINGNESS

Examples up to this point have dealt exclusively with univariate missingness problems, but a given data set may have two or more variables plagued by missingness. We next consider techniques to mitigate this situation. Multiple MONOTONE statements can be specified in a single PROC MI step only when the data set exhibits a monotone missingness pattern. Figure 5 offers a side-by-side comparison of this particular pattern with the alternative, an arbitrary pattern of missingness.

Figure 5. Monotone versus Arbitrary Patterns of Missingness.

A data set is said to have a monotone pattern of missingness if its rows and columns can be rearranged such that, as we move left to right, the variables are ordered in terms of their rates of missingness; that is, each subsequent variable contains an equal or greater number of cases plagued by missing data. For the present case, in which LOS and Q1 are either both observed or both missing for all observations in the data set SAMPLE, a monotone missingness pattern holds. The example PROC MI step below illustrates the approach of specifying multiple MONOTONE statements. The first step is to fill in LOS via a linear regression model, and the second is to fill in Q1 via a logistic regression model. Although no output is shown, these are conducted in tandem, with the end result being that the data set SAMPLE_MI_5 consists of M = 5 completed data sets with no missing values for LOS and Q1.

proc mi data=sample nimpute=5 seed=87332 out=sample_mi_5;
  class gender minority Q1;
  var age supervisor gender minority Q1 los;
  monotone reg (los=gender minority age supervisor);
  monotone logistic (Q1=gender minority age supervisor);
run;

When data are afflicted by a "Swiss cheese" missingness pattern such as the example data set appearing on the right-hand side of Figure 5, specifying multiple MONOTONE statements will produce an error message in the log. The EM and MCMC statements in PROC MI were designed to handle arbitrary patterns of missingness, but only when data are multivariate normal. Again, this precludes the use of categorical variables, which is a non-trivial barrier in applied survey research.
As such, no examples will be shown, but the reader interested in learning more about these statements and the underlying algorithms they implement (the expectation-maximization (EM) and Markov chain Monte Carlo (MCMC) algorithms, respectively) is encouraged to consult the documentation. A remedy that is occasionally feasible is to conduct single imputation for the purpose of achieving a monotone missingness pattern, at which point multiple MONOTONE statements can be specified; for an example, see p. 9 of Berglund (2010). For brevity, we will not illustrate that approach here. Instead, we will demonstrate how to employ the %IMPUTE module of IVEware, a set of free SAS-callable macros developed by researchers at the University of Michigan. The %IMPUTE module is an extraordinarily flexible tool for imputation, and is capable of handling either monotone or arbitrary patterns of missingness via the sequential regression approach detailed in Raghunathan et al. (2001). The process begins by imputing the variable with the least amount of missingness using only the fully observed portion of the data, then proceeds to impute the variable with the second smallest amount of missingness using the observed portion of the data as well as any imputed values from the first step. The sequence continues until all missing values have been imputed. To build interdependence amongst the chained sequence of imputed values, the algorithm then cycles back through all variables, re-imputing all values that were originally missing. After iterating through this process 10 times (a default number that can be modified), a completed data set is released. The entire process is begun anew to generate each of the M completed data sets independently.
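The cyclic flow just described can be sketched in a few lines. The following is a deliberately simplified Python illustration, assuming all variables are continuous; it uses deterministic least-squares predictions where a proper implementation (IVEware included) would draw from the fitted model, and the function name and structure are this sketch's own, not IVEware's.

```python
# Simplified sketch of sequential regression ("chained equations") imputation:
# impute variables in order of increasing missingness, then cycle back through,
# re-imputing each originally missing value using all other variables.
import numpy as np

def sequential_impute(X, n_cycles=10):
    """X: 2-D array with np.nan marking missing values. Returns a completed copy."""
    X = np.array(X, dtype=float, copy=True)
    miss = np.isnan(X)
    # initial fill: column means of the observed portion
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    # visit columns from least to most missing, cycling n_cycles times
    order = np.argsort(miss.sum(axis=0))
    for _ in range(n_cycles):
        for j in order:
            if not miss[:, j].any():
                continue
            others = [k for k in range(X.shape[1]) if k != j]
            A = np.column_stack([np.ones(X.shape[0]), X[:, others]])
            # fit on the originally observed rows of column j, predict the rest
            obs = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta
    return X
```

On data where one column is an exact linear function of the others, the missing cells are recovered exactly; a proper implementation would add a random residual draw at each prediction so that the M completed data sets differ.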
IVEware offers five model formulations corresponding to five possible variable types: (1) linear regression models for continuous variables; (2) logistic regression models for binary variables; (3) a sequence of binary logistic regression models for nominal variables (also applicable for ordinal variables); (4) Poisson regression models for count variables; and (5) a binary logistic regression model coupled with a linear regression model for semi-continuous variables, or variables taking on either 0 or a positive value on a continuous scale (e.g., a measure of smoking

duration in years, which is 0 for individuals who indicate they have never smoked on a regular basis). One rather peculiar restriction is that IVEware will not work within the enhanced editor window of PC SAS. Rather, IVEware code must be run from within the program editor window, which can be opened by selecting the Program Editor option in the View menu of the PC SAS session. As discussed in the installation guide, there is an option to implement IVEware during the SAS session without explicitly running any executable files (although a standalone version exists). This is the avenue to be demonstrated presently. To get up and running, simply store the contents of the downloadable ZIP file in a directory and point SAS to it via a straightforward OPTIONS statement. For example, assuming the files have been stored in the local drive C:\SRCLIB\, the following line of code is all that is necessary:

options set = SRCLIB "C:\Srclib" sasautos = ('!SRCLIB' sasautos) mautosource;

The shell syntax below is an annotated description of the various statements available within the %IMPUTE module of IVEware. This is not a comprehensive listing, and some of the statements are optional. Scrutinizing the syntax offers some insight into why this is termed the %IMPUTE module as opposed to a macro. To execute a previously compiled macro, we generally only concern ourselves with assigning values to the macro parameters in parentheses. There are three macro parameters defined below (which, in my applied work, I have never found occasion to change), but there are also distinct statements ending with semicolons.
%IMPUTE (NAME=TEST,DIR=.,SETUP=NEW);
DATAIN      /* input data set */;
DATAOUT     /* output data set */ ALL; /* specify ALL after naming the output data set to have all M completed data sets stacked on top of one another; the default is just the first completed data set */
TRANSFER    /* variables to be transferred to the output data set, not used during the imputation process */;
CONTINUOUS  /* list of continuous variables, if the DEFAULT statement is used */;
CATEGORICAL /* list of categorical variables, binomial or multinomial */;
COUNT       /* list of count variables to be imputed via Poisson regression */;
MIXED       /* list of semi-continuous variables */;
DEFAULT     /* optional statement to assign a default variable type */;
BOUNDS      /* variable name followed by imputed value condition or range in parentheses */;
RESTRICT    /* variable name followed by condition for imputation to occur at all in parentheses */;
MULTIPLES   /* statement to assign M */;
SEED        /* way to assign a random number seed to replicate the analysis */;
PRINT       /* a few options for the types/amount of output generated to the listing */;
RUN;

The DATAIN and DATAOUT statements operate like the DATA= and OUT= options of the PROC MI statement. By default, IVEware only outputs the first completed data set, even if the MULTIPLES statement calls for M > 1. The ALL option in the DATAOUT statement overrides this default. As with PROC MI, the M completed data sets are stacked on top of one another with an identifier variable automatically appended. While PROC MI calls this variable _IMPUTATION_, IVEware calls it _MULT_. IVEware treats all variables in the input data set as potential predictor variables for the imputation process. If there are certain variables that would be inappropriate to use, list them in the TRANSFER statement. A prime example is a unique respondent identification variable.
In general, the utilization of all non-TRANSFER-statement variables in the imputation process brings up what could be perceived as a limitation of the tool: it does not offer the capability to tailor a subset of predictor variables for each outcome variable's imputation model. By comparison, this capability is offered in PROC MI, at least when the missingness pattern is monotone. There is a school of thought in the missing data literature, however, that takes the stance that one should include as many predictor variables as possible (e.g., Rubin, 1996). A simulation study by Reiter et al. (2006) concluded that maintaining random noise variables in an imputation model merely increased variability somewhat relative to a model that excluded those terms, which the authors felt was a small price to pay in exchange for the increased plausibility of the MAR assumption resulting from the larger covariate set. The purpose of the next series of statements outlined in the shell syntax above is to assign the respective variable types of which the input data set is composed. Each non-continuous variable must be listed in the CATEGORICAL (dichotomous or nominal), COUNT, or MIXED (semi-continuous) statement, unless the DEFAULT statement is used to define another default variable type. For instance, if most variables in the input data set are categorical, you can specify DEFAULT CATEGORICAL as its own statement and then assign the scale of the non-categorical variables amongst the other three statements (including CONTINUOUS) to shorten the amount of syntax required.

The BOUNDS and RESTRICT statements are optional, but frequently come in handy. The general syntax is to specify a variable name followed by some condition(s) in parentheses. For example, specifying BOUNDS LOS(>0) ensures only positive values of LOS will be imputed. Multiple conditions can be separated by commas. For example, specifying BOUNDS LOS(>0,<AGE-18) also guards against an imputed LOS suggesting the employee began his/her tenure before the age of 18. The RESTRICT statement can be used to impute only for those cases meeting the condition(s) specified within parentheses. An example might be imputing a missing value for the variable INCOME only for respondents who are (or get imputed as being) employed. If we suppose the condition of being currently employed is flagged by the character variable EMPLOYED equaling 'Y', we might include the statement RESTRICT INCOME(EMPLOYED='Y'). The code given below illustrates how to impute LOS and Q1 using IVEware. This author's preference is to completely enumerate the variables being passed to the %IMPUTE module with the help of a commented KEEP statement in a preliminary DATA step, as it helps ensure they are all handled properly. Because continuous variables are the default type, we do not need to list AGE and LOS explicitly. The four categorical variables at hand are listed in the CATEGORICAL statement, and the unique respondent identifier EMPID is assigned as a TRANSFER statement variable, since it would not make sense to use it as a continuous covariate in any imputation model. The MULTIPLES statement requests that M = 5 completed data sets be output to the data set SAMPLE_FROMIVE.
data sample_toive;
   set sample;
   keep EMPID                          /* ID variable to re-merge IVEware output data set */
        AGE GENDER MINORITY SUPERVISOR /* fully observed covariates */
        Q1 LOS                         /* partially observed variables requiring imputation */
   ;
%IMPUTE (NAME=TEST,DIR=.,SETUP=NEW);
DATAIN sample_toive;
DATAOUT sample_fromive ALL;
TRANSFER EMPID;
CATEGORICAL GENDER MINORITY SUPERVISOR Q1;
BOUNDS LOS(>0);
MULTIPLES 5;
RUN;

INFERENCES FROM MULTIPLY IMPUTED DATA

While we have seen numerous syntax examples performing multiple imputation (M = 5), we have yet to see code consolidating results from the completed data sets into a single point estimate and measure of variability. The present section focuses on this task. As was expressed formulaically earlier, the average of the M point estimates serves as the overall MI estimate, while the overall MI variance is the sum of (1) the average of the M completed data set variances plus (2) a term reflecting the variability of the M estimates themselves. A nice property of the technique that has only helped bolster its appeal is that these formulas are the same regardless of the quantity being estimated (e.g., mean, total, quantile). We can even make use of the SAS/STAT procedure PROC MIANALYZE to carry out the computations. The only subtlety we need to concern ourselves with is whether the quantity is a descriptive statistic or a multivariate statistic (i.e., whether the measure of variability is expressed as a scalar or a matrix). In either case, however, the general process is to capture the estimate and measure(s) of variability independently from each of the M completed data sets and supply a summarized data set housing the results to PROC MIANALYZE. Let us first consider a simple example estimating the mean of LOS after multiple imputation. Without loss of generality, suppose we are using the IVEware-generated concatenated data set SAMPLE_FROMIVE created in the previous section, comprised of M x n = 5 x 8,583 = 42,915 distinct observations.
Since we are analyzing survey data, it is appropriate to use one of the SAS/STAT procedures prefixed by SURVEY; here, PROC SURVEYMEANS in lieu of PROC MEANS (Lewis, 2010). The entire SAMPLE_FROMIVE data set is fed to PROC SURVEYMEANS, with independent estimates obtained by specifying _MULT_ in the BY statement. A preliminary PROC SORT step is conducted to ensure the data set is oriented properly by this variable. The ODS OUTPUT statement stores the (M = 5) estimates and standard errors, among a few other default statistics, in a data set named STATS. Figure 6 is a screen shot of this summary data set.

* sort the concatenated data by the completed data set indicator _MULT_;
proc sort data=sample_fromive;
   by _MULT_;

* compute and store the M estimated means and standard errors of LOS;
ods output statistics=stats;

proc surveymeans data=sample_fromive mean stderr;
   by _MULT_;
   var LOS;

Figure 6. Screen Shot of STATS Data Set.

The next step is to feed the STATS data set to PROC MIANALYZE and point it to the variables maintaining the estimates and standard errors, respectively, via the MODELEFFECTS and STDERR statements. Conveniently, PROC SURVEYMEANS names them MEAN and STDERR. Note that we feed the summarized measures of uncertainty to PROC MIANALYZE in terms of standard errors (the square roots of the variances) in this instance, yet this is not always the case, as we will see with the next example. PROC MIANALYZE assumes the M estimates are stacked vertically; hence, it ascertains M from the number of rows in the input data set.

* get the overall MI estimate and standard error;
proc mianalyze data=stats;
   modeleffects mean;
   stderr stderr;

[The MIANALYZE Procedure listing: a Model Information table (Data Set WORK.STATS; Number of Imputations 5); a Variance Information table (between, within, and total variance; relative increase in variance; fraction missing information; relative efficiency); and a Parameter Estimates table (estimate, standard error, and 95% confidence limits for the mean).]

Output 3. Partial Output from PROC MIANALYZE for the Mean of LOS Based on the M = 5 Completed Data Sets Generated by IVEware.

The overall MI mean and standard error can be located in the Parameter Estimates portion of the PROC MIANALYZE output. Note that the value reported under the Estimate heading is the arithmetic mean of the five estimated means summarized in the STATS data set, whereas the quantity under the Std Error heading is a bit larger than the respective completed data set standard errors. This reflects the incorporation of the between-imputation variability component. While there are various additional quantities output to the listing by default, one particularly useful diagnostic of the imputation process is the fraction of missing information (FMI) (see Section 3.3 of Rubin, 1987). The FMI is defined as the between-imputation variance component divided by the total MI variance, and can be thought of as the portion of the variance attributable to multiple imputation. In fact, a small FMI could be grounds for justifying single imputation over multiple imputation. In an uninformative model, such as an intercept-only regression model without covariates or the single-cell hot-deck routine discussed above during the exposition of the concept of proper multiple imputation, we would expect the FMI to more or less equal the item nonresponse rate. To the extent that the FMI is smaller than the nonresponse rate, it is evidence the covariates employed in the imputation model serve to recapture a portion of the missing data uncertainty. In this particular analysis, the FMI is approximately 46%, which is not far from the item nonresponse rate of 47%, suggesting the four covariates used for imputing LOS have poor explanatory power. The FMI is estimate-specific, however, so a high or low value for one estimate may not carry over to all other estimates of interest. The only other PROC MIANALYZE wrinkle worthy of demonstration arises when multivariate quantities are estimated independently on each completed data set, since the set-up differs somewhat.
Suppose instead of the mean of LOS we were interested in modeling Q1 as a function of the respondent's gender and length of service. Since Q1 is a dichotomous outcome, logistic regression is the preferred modeling tool. The syntax below utilizes PROC SURVEYLOGISTIC, again the SURVEY companion procedure to PROC LOGISTIC, both of which are available in SAS/STAT, to fit this model. The BY statement is used comparably to what was shown before. The ODS OUTPUT statement stores the model parameters and covariance matrices for each of the M = 5 completed data sets in data sets named BETAS and COV_MATRIX, respectively. A little renaming is done via data set options, because in the multivariate form PROC MIANALYZE will be looking for the more familiar variable _IMPUTATION_ instead of the _MULT_ output by IVEware. Screen shots of these two summary data sets appear after the example code.

ods output ParameterEstimates=betas (rename=(_mult_=_imputation_))
           covb=cov_matrix (rename=(_mult_=_imputation_));
proc surveylogistic data=sample_fromive;
   by _MULT_;
   model Q1(event='1') = gender LOS / covb;

Figure 7. Screen Shots of Data Sets BETAS and COV_MATRIX.

Under the structure in Figure 7, there is one data set housing the M distinct sets of model parameters and a separate data set housing the respective covariance matrices. Note the subtle differences relative to the summarized data set STATS from Figure 6. Aside from the obvious fact that estimates and measures of variability are housed in separate data sets, the COV_MATRIX data set houses variances in lieu of standard errors. The code below illustrates the slightly altered PROC MIANALYZE syntax. The output generated is very similar to that generated above, so for brevity it is not reproduced here. Again, the objective was to illustrate the modified PROC MIANALYZE syntax this kind of analysis necessitates.

proc mianalyze parms=betas covb=cov_matrix;
   modeleffects intercept gender LOS;
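The combining arithmetic PROC MIANALYZE carries out in both of the preceding examples is straightforward enough to verify by hand. The Python sketch below implements Rubin's combining rules for a scalar estimate using hypothetical inputs; the function and variable names are illustrative, not anything PROC MIANALYZE exposes.

```python
# Rubin's combining rules: the overall MI point estimate is the average of the
# M completed-data-set estimates, and the total variance is the average
# within-imputation variance plus a (1 + 1/M)-inflated between-imputation variance.

def rubin_combine(estimates, variances):
    """estimates, variances: per-completed-data-set point estimates and variances."""
    M = len(estimates)
    qbar = sum(estimates) / M                              # overall MI estimate
    W = sum(variances) / M                                 # within-imputation variance
    B = sum((q - qbar) ** 2 for q in estimates) / (M - 1)  # between-imputation variance
    T = W + (1 + 1 / M) * B                                # total MI variance
    # large-sample fraction of missing information; PROC MIANALYZE applies an
    # additional degrees-of-freedom adjustment, so its printed FMI can differ slightly
    fmi = (1 + 1 / M) * B / T
    return qbar, T, fmi

# hypothetical estimates and variances from M = 5 completed data sets
qbar, T, fmi = rubin_combine([10.0, 10.4, 9.8, 10.2, 10.1],
                             [0.25, 0.24, 0.26, 0.25, 0.25])
```

For the logistic regression example, the same rules yield per-parameter standard errors when applied element-wise, with each within-imputation variance taken from the diagonal of the corresponding covariance matrix.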

CONCLUSION

This paper began by defining some of the terminology pertinent to missing data within the realm of applied survey research, and then briefly outlined the various models that can be formulated to combat the missingness via the technique of imputing, or filling in, the missing values. These methods typically operate under the MAR assumption defined by Little and Rubin (2002), and necessitate that at least some fully observed information be known for the entire sample. This auxiliary information was denoted X. The general notion of imputation is to model the relationship between X and the survey outcome for the observed cases and employ it to derive values for the missing cases. PROC MI has a variety of tools for when the missingness pattern is monotone. For arbitrary patterns of missingness, particularly when categorical variables are involved, IVEware offers a little more flexibility. Both tools have been designed to multiply impute the missing values, but even if single imputation is the objective, one can retain only one of the M completed data sets generated. For those who do perform multiple imputation, this paper demonstrated a few examples of PROC MIANALYZE, a procedure that can be used to combine the M estimates and measures of variability computed independently on each completed data set into a single estimate and measure of variability using the rules defined by Rubin (1987).

REFERENCES

Andridge, R., and Little, R. (2010). "A Review of Hot Deck Imputation for Survey Non-response," International Statistical Review, 78, pp.
Andridge, R., and Little, R. (2011). "Proxy Pattern-Mixture Analysis for Survey Nonresponse," Journal of Official Statistics, 27, pp.
Berglund, P. (2010). "An Introduction to Multiple Imputation of Complex Sample Data using SAS v9.2," Paper presented at the SAS Global Forum, Seattle, WA, April. Available online at:
Bethlehem, J. (1988). "Reduction of Nonresponse Bias Through Regression Estimation."
Journal of Official Statistics, 4, pp.
Brick, J.M., and Kalton, G. (1996). "Handling Missing Data in Survey Research," Statistical Methods in Medical Research, 5, pp.
Efron, B. (1994). "Missing Data, Imputation, and the Bootstrap," Journal of the American Statistical Association, 89, pp.
Fay, R. (1996). "When Are Inferences from Multiple Imputation Valid?" Proceedings of the Joint Statistical Meetings of the American Statistical Association.
Heitjan, F., and Little, R. (1991). "Multiple Imputation for the Fatal Accident Reporting System," Applied Statistics, 40, pp.
Hosmer, D., and Lemeshow, S. (2000). Applied Logistic Regression. 2nd Edition. New York, NY: Wiley.
Kim, J.K., and Fuller, W. (2004). "Fractional Hot Deck Imputation," Biometrika, 91, pp.
Lewis, T. (2010). "Principles of Proper Inferences from Complex Survey Data," Paper presented at the SAS Global Forum, Seattle, WA, April. Available online at:
Lewis, T. (2012). "Weighting Adjustment Methods for Nonresponse in Surveys," Invited paper presented at the Western Users of SAS Software (WUSS) Conference, Long Beach, CA, September 5-7. Available online at:
Little, R., and Rubin, D. (2002). Statistical Analysis with Missing Data. 2nd Edition. New York, NY: Wiley.
Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solenberger, P. (2001). "A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models," Survey Methodology, 27, pp.
Rao, J.N.K., and Shao, J. (1992). "Jackknife Variance Estimation with Survey Data Under Hot Deck Imputation," Biometrika, 79, pp.
Rosenbaum, P., and Rubin, D. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, pp.
Rubin, D., and Schenker, N. (1986). "Multiple Imputation for Interval Estimation from Simple Random Samples with


More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics SPSS Complex Samples 15.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES

MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES UNIVERSITY OF GLASGOW MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES by KHUNESWARI GOPAL PILLAY A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in

More information

Handbook of Statistical Modeling for the Social and Behavioral Sciences

Handbook of Statistical Modeling for the Social and Behavioral Sciences Handbook of Statistical Modeling for the Social and Behavioral Sciences Edited by Gerhard Arminger Bergische Universität Wuppertal Wuppertal, Germany Clifford С. Clogg Late of Pennsylvania State University

More information

Missing Data Techniques

Missing Data Techniques Missing Data Techniques Paul Philippe Pare Department of Sociology, UWO Centre for Population, Aging, and Health, UWO London Criminometrics (www.crimino.biz) 1 Introduction Missing data is a common problem

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS Examples: Missing Data Modeling And Bayesian Analysis CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS Mplus provides estimation of models with missing data using both frequentist and Bayesian

More information

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999.

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999. 2 Schafer, J. L., Graham, J. W.: (2002). Missing Data: Our View of the State of the Art. Psychological methods, 2002, Vol 7, No 2, 47 77 Rosner, B. (2005) Fundamentals of Biostatistics, 6th ed, Wiley.

More information

Missing Data Analysis with SPSS

Missing Data Analysis with SPSS Missing Data Analysis with SPSS Meng-Ting Lo (lo.194@osu.edu) Department of Educational Studies Quantitative Research, Evaluation and Measurement Program (QREM) Research Methodology Center (RMC) Outline

More information

Weighting and estimation for the EU-SILC rotational design

Weighting and estimation for the EU-SILC rotational design Weighting and estimation for the EUSILC rotational design JeanMarc Museux 1 (Provisional version) 1. THE EUSILC INSTRUMENT 1.1. Introduction In order to meet both the crosssectional and longitudinal requirements,

More information

NORM software review: handling missing values with multiple imputation methods 1

NORM software review: handling missing values with multiple imputation methods 1 METHODOLOGY UPDATE I Gusti Ngurah Darmawan NORM software review: handling missing values with multiple imputation methods 1 Evaluation studies often lack sophistication in their statistical analyses, particularly

More information

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU.

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU. SOS3003 Applied data analysis for social science Lecture note 04-2009 Erling Berge Department of sociology and political science NTNU Erling Berge 2009 1 Missing data Literature Allison, Paul D 2002 Missing

More information

Missing Data in Orthopaedic Research

Missing Data in Orthopaedic Research in Orthopaedic Research Keith D Baldwin, MD, MSPT, MPH, Pamela Ohman-Strickland, PhD Abstract Missing data can be a frustrating problem in orthopaedic research. Many statistical programs employ a list-wise

More information

Multiple Imputation with Mplus

Multiple Imputation with Mplus Multiple Imputation with Mplus Tihomir Asparouhov and Bengt Muthén Version 2 September 29, 2010 1 1 Introduction Conducting multiple imputation (MI) can sometimes be quite intricate. In this note we provide

More information

Multiple imputation using chained equations: Issues and guidance for practice

Multiple imputation using chained equations: Issues and guidance for practice Multiple imputation using chained equations: Issues and guidance for practice Ian R. White, Patrick Royston and Angela M. Wood http://onlinelibrary.wiley.com/doi/10.1002/sim.4067/full By Gabrielle Simoneau

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

Handling Data with Three Types of Missing Values:

Handling Data with Three Types of Missing Values: Handling Data with Three Types of Missing Values: A Simulation Study Jennifer Boyko Advisor: Ofer Harel Department of Statistics University of Connecticut Storrs, CT May 21, 2013 Jennifer Boyko Handling

More information

Show how the LG-Syntax can be generated from a GUI model. Modify the LG-Equations to specify a different LC regression model

Show how the LG-Syntax can be generated from a GUI model. Modify the LG-Equations to specify a different LC regression model Tutorial #S1: Getting Started with LG-Syntax DemoData = 'conjoint.sav' This tutorial introduces the use of the LG-Syntax module, an add-on to the Advanced version of Latent GOLD. In this tutorial we utilize

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Enterprise Miner Tutorial Notes 2 1

Enterprise Miner Tutorial Notes 2 1 Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Creating a data file and entering data

Creating a data file and entering data 4 Creating a data file and entering data There are a number of stages in the process of setting up a data file and analysing the data. The flow chart shown on the next page outlines the main steps that

More information

STAT10010 Introductory Statistics Lab 2

STAT10010 Introductory Statistics Lab 2 STAT10010 Introductory Statistics Lab 2 1. Aims of Lab 2 By the end of this lab you will be able to: i. Recognize the type of recorded data. ii. iii. iv. Construct summaries of recorded variables. Calculate

More information

STATISTICS (STAT) Statistics (STAT) 1

STATISTICS (STAT) Statistics (STAT) 1 Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

More information

Chapter 6: Examples 6.A Introduction

Chapter 6: Examples 6.A Introduction Chapter 6: Examples 6.A Introduction In Chapter 4, several approaches to the dual model regression problem were described and Chapter 5 provided expressions enabling one to compute the MSE of the mean

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression Lecture Simple Regression, An Overview, and Simple Linear Regression Learning Objectives In this set of lectures we will develop a framework for simple linear, logistic, and Cox Proportional Hazards Regression

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects Patralekha Bhattacharya Thinkalytics The PDLREG procedure in SAS is used to fit a finite distributed lagged model to time series data

More information

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models Bengt Muth en University of California, Los Angeles Tihomir Asparouhov Muth en & Muth en Mplus

More information

Handling missing data for indicators, Susanne Rässler 1

Handling missing data for indicators, Susanne Rässler 1 Handling Missing Data for Indicators Susanne Rässler Institute for Employment Research & Federal Employment Agency Nürnberg, Germany First Workshop on Indicators in the Knowledge Economy, Tübingen, 3-4

More information

Cross-validation and the Bootstrap

Cross-validation and the Bootstrap Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set,

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008

HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008 HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008 HILDA Standard Errors: A Users Guide Clinton Hayes The HILDA Project was initiated, and is funded, by the Australian Government Department of

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation Unit 5 SIMULATION THEORY Lesson 39 Learning objective: To learn random number generation. Methods of simulation. Monte Carlo method of simulation You ve already read basics of simulation now I will be

More information

A Fast Multivariate Nearest Neighbour Imputation Algorithm

A Fast Multivariate Nearest Neighbour Imputation Algorithm A Fast Multivariate Nearest Neighbour Imputation Algorithm Norman Solomon, Giles Oatley and Ken McGarry Abstract Imputation of missing data is important in many areas, such as reducing non-response bias

More information

Missing Data. Where did it go?

Missing Data. Where did it go? Missing Data Where did it go? 1 Learning Objectives High-level discussion of some techniques Identify type of missingness Single vs Multiple Imputation My favourite technique 2 Problem Uh data are missing

More information

Notes on Simulations in SAS Studio

Notes on Simulations in SAS Studio Notes on Simulations in SAS Studio If you are not careful about simulations in SAS Studio, you can run into problems. In particular, SAS Studio has a limited amount of memory that you can use to write

More information

Lecture 26: Missing data

Lecture 26: Missing data Lecture 26: Missing data Reading: ESL 9.6 STATS 202: Data mining and analysis December 1, 2017 1 / 10 Missing data is everywhere Survey data: nonresponse. 2 / 10 Missing data is everywhere Survey data:

More information

An introduction to SPSS

An introduction to SPSS An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

More information

UNIT 4. Research Methods in Business

UNIT 4. Research Methods in Business UNIT 4 Preparing Data for Analysis:- After data are obtained through questionnaires, interviews, observation or through secondary sources, they need to be edited. The blank responses, if any have to be

More information

A Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models

A Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models A Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models Eva de Jong, Nino Mushkudiani and Barry Schouten ASD workshop, November 6-8, 2017 Outline Bayesian

More information

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS Analysis of Complex Sample Survey Data Using the SURVEY PROCEDURES and Macro Coding Patricia A. Berglund, Institute For Social Research-University of Michigan, Ann Arbor, Michigan ABSTRACT The paper presents

More information

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research Liping Huang, Center for Home Care Policy and Research, Visiting Nurse Service of New York, NY, NY ABSTRACT The

More information

SAS Graphics Macros for Latent Class Analysis Users Guide

SAS Graphics Macros for Latent Class Analysis Users Guide SAS Graphics Macros for Latent Class Analysis Users Guide Version 2.0.1 John Dziak The Methodology Center Stephanie Lanza The Methodology Center Copyright 2015, Penn State. All rights reserved. Please

More information

Contents of SAS Programming Techniques

Contents of SAS Programming Techniques Contents of SAS Programming Techniques Chapter 1 About SAS 1.1 Introduction 1.1.1 SAS modules 1.1.2 SAS module classification 1.1.3 SAS features 1.1.4 Three levels of SAS techniques 1.1.5 Chapter goal

More information

Data corruption, correction and imputation methods.

Data corruption, correction and imputation methods. Data corruption, correction and imputation methods. Yerevan 8.2 12.2 2016 Enrico Tucci Istat Outline Data collection methods Duplicated records Data corruption Data correction and imputation Data validation

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis Applied Mathematical Sciences, Vol. 5, 2011, no. 57, 2807-2818 Simulation Study: Introduction of Imputation Methods for Missing Data in Longitudinal Analysis Michikazu Nakai Innovation Center for Medical

More information

Opening Windows into the Black Box

Opening Windows into the Black Box Opening Windows into the Black Box Yu-Sung Su, Andrew Gelman, Jennifer Hill and Masanao Yajima Columbia University, Columbia University, New York University and University of California at Los Angels July

More information

Performance of Sequential Imputation Method in Multilevel Applications

Performance of Sequential Imputation Method in Multilevel Applications Section on Survey Research Methods JSM 9 Performance of Sequential Imputation Method in Multilevel Applications Enxu Zhao, Recai M. Yucel New York State Department of Health, 8 N. Pearl St., Albany, NY

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help Get help interpreting a table

Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help Get help interpreting a table Q Cheat Sheets What to do when you cannot figure out how to use Q What to do when the data looks wrong Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help

More information

Telephone Survey Response: Effects of Cell Phones in Landline Households

Telephone Survey Response: Effects of Cell Phones in Landline Households Telephone Survey Response: Effects of Cell Phones in Landline Households Dennis Lambries* ¹, Michael Link², Robert Oldendick 1 ¹University of South Carolina, ²Centers for Disease Control and Prevention

More information

Handling Your Data in SPSS. Columns, and Labels, and Values... Oh My! The Structure of SPSS. You should think about SPSS as having three major parts.

Handling Your Data in SPSS. Columns, and Labels, and Values... Oh My! The Structure of SPSS. You should think about SPSS as having three major parts. Handling Your Data in SPSS Columns, and Labels, and Values... Oh My! You might think that simple intuition will guide you to a useful organization of your data. If you follow that path, you might find

More information

A User Manual for the Multivariate MLE Tool. Before running the main multivariate program saved in the SAS file Part2-Main.sas,

A User Manual for the Multivariate MLE Tool. Before running the main multivariate program saved in the SAS file Part2-Main.sas, A User Manual for the Multivariate MLE Tool Before running the main multivariate program saved in the SAS file Part-Main.sas, the user must first compile the macros defined in the SAS file Part-Macros.sas

More information

SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure

SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure This document is an individual chapter from SAS/STAT 14.3 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute

More information

COPULA MODELS FOR BIG DATA USING DATA SHUFFLING

COPULA MODELS FOR BIG DATA USING DATA SHUFFLING COPULA MODELS FOR BIG DATA USING DATA SHUFFLING Krish Muralidhar, Rathindra Sarathy Department of Marketing & Supply Chain Management, Price College of Business, University of Oklahoma, Norman OK 73019

More information

SAS Enterprise Miner : Tutorials and Examples

SAS Enterprise Miner : Tutorials and Examples SAS Enterprise Miner : Tutorials and Examples SAS Documentation February 13, 2018 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS Enterprise Miner : Tutorials

More information