Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data


Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi

1 Introduction

Imputation is the approach commonly used to compensate for missing or invalid values in sample surveys (Kalton and Kasprzyk, 1986). Imputation consists in replacing unavailable or unacceptable values with predicted values. In Official Statistics, imputation is preferred to other treatments (like analysis of complete data, analysis of available data, weighting adjustment) for several reasons: in sample surveys it is usually desirable to produce complete and consistent sets of microdata, thus allowing the application of standard complete-data methods for subsequent data analyses; furthermore, the results obtained from different analyses are consistent with one another, unlike the results of analyses from an incomplete data set; finally, imputation allows the use of the same survey weight for all items, unlike the weighting adjustment method.

In this paper we focus on hot deck imputation (Little and Rubin, 2002). In hot deck methods the missing or invalid data for a (partially) nonresponding unit are replaced by the observed values taken from a properly chosen "similar" sampling unit (donor). The accuracy of imputations can be improved by first forming imputation classes, and then performing imputations separately within each of them. In random hot deck the donor is randomly chosen among the responding units in the imputation cell. In nearest-neighbour donor imputation the donor is the most similar respondent in the imputation class with respect to some covariates. The popularity of hot deck methods is due to their attractive features.
First of all, hot deck imputation can handle variables that are difficult to treat by explicit modelling, hence this method is expected to be more robust against departures from model assumptions than methods based on parametric models, like ratio and regression imputation. Furthermore, since observed values are used for imputation, no synthetic values are imputed (as opposed to other methods like mean, ratio or regression imputation, which may produce nonsensical values).

The main problem with imputation is that it introduces an extra component of variability that must be considered in estimation. When imputed values are treated as if they were observed, the variance estimates do not include the variability component due to imputation. As a result, the precision of estimates is overstated, and subsequent statistical analyses can be misleading (e.g., confidence intervals have lower than nominal coverage).

The approaches proposed in the literature to obtain valid variance estimators for parameter estimates in presence of imputed data include model-assisted techniques (Särndal, 1992), multiple imputation (Rubin, 1987) and resampling methods (Rao and Shao, 1992). Among nonparametric resampling techniques, the jackknife method is a very popular one. A major advantage of the jackknife is that the variance of complicated estimators (e.g. nonlinear statistics) can be calculated in a relatively easy way without the theoretical derivation of variance formulas required by other approaches. The naive jackknife, which uses the standard jackknife formulas and treats imputed values as if they were observed, underestimates the true variance. For imputed data, different jackknife estimators have been proposed for variance estimation under different imputation models, for different target parameters and sampling designs. Among others, Chen and Shao (2001) proposed jackknife variance estimators that are asymptotically unbiased and consistent for the sample mean for stratified multistage surveys under nearest-neighbour hot deck imputation; Saigo and Sitter (2005) proposed consistent reimputation jackknife variance estimators for sample totals and means for stratified multistage samples under ratio and regression imputation.

In general, the jackknife is a computationally intensive technique, and its application may be prohibitive, especially in the case of large scale surveys. In order to overcome these limitations, Kott (2001) proposed, in the case of complete data, a revised version of the jackknife procedure that is computationally feasible. The first version of this method, suitable when there is a large number of sampled units per stratum, is the delete-a-group jackknife (DAGJK). It was extended by Kott to deal with situations where the number of sampled units per stratum is less than the number of jackknife replicates (EDAGJK).

In this paper we propose an adjusted version of DAGJK (AD-DAGJK) and EDAGJK (AD-EDAGJK) to deal with variance estimation in presence of imputed data. In particular, we focus on random hot deck imputation. The adjustment is analogous to that proposed by Rao and Shao (1992) for jackknife variance estimation under random hot deck imputation.
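As an illustration of the random hot deck mechanism described in this introduction, the following is a minimal sketch rather than the procedure actually used in the study. The function name `random_hot_deck`, the use of `None` to mark a missing item, and the toy data are assumptions made for the example only.

```python
import random

def random_hot_deck(values, cells, rng):
    """Replace None entries with a value drawn at random (with replacement)
    from the respondents of the same imputation cell."""
    # Collect the observed values (potential donors) of each cell.
    donors = {}
    for y, g in zip(values, cells):
        if y is not None:
            donors.setdefault(g, []).append(y)
    imputed = []
    for y, g in zip(values, cells):
        if y is None:
            if g not in donors:
                raise ValueError(f"cell {g!r} has no respondents to donate")
            y = rng.choice(donors[g])  # random donor within the same cell
        imputed.append(y)
    return imputed

# Toy data (hypothetical): a 0/1 item observed in two imputation cells.
rng = random.Random(0)
values = [1, 0, None, 1, None, 1]
cells = ["a", "a", "a", "b", "b", "b"]
completed = random_hot_deck(values, cells, rng)
```

Because only observed values are donated, every imputed value is one that actually occurs among the respondents of the same cell.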
In the paper, the variance estimators for AD-DAGJK and AD-EDAGJK are evaluated in comparison to the adjusted jackknife by Rao and Shao (AJ) and to the standard variance estimation formula that treats imputed data as if they were observed, thus ignoring the extra component of variability due to imputation. The comparison is based on a Monte Carlo experiment consisting in drawing 500 samples from the 1991 Population Census according to the Italian Labour Force Survey sampling design. For all the items of interest, a controlled percentage of item nonresponses is simulated using a Missing at Random (MAR) nonresponse mechanism. Missing values are imputed using a stratified random hot deck. The evaluation is based on the analysis of the relative bias, the relative root mean squared error, the variance, the confidence interval coverage and the confidence interval length (95%).

The paper is structured as follows. In Sections 2 and 3 the resampling variance estimation methods are described. Section 4 contains the description of the simulation study, while results are illustrated in Section 5. Conclusions and future work are reported in Section 6.

2 The adjusted jackknife under random hot deck imputation

Let $A$ be the full sample and let us denote by $A_R$ and $A_M$ the subsets of respondents and nonrespondents respectively. Let us assume that a sample $A$ of size $n$ is obtained through a stratified sample design with strata $h = 1, \dots, L$ and $n_h$ observations in the $h$th stratum. The sampling weights are denoted by $w_k$ for $k = 1, \dots, n$. We treat the missing items via random hot deck (HD) imputation within classes, i.e., for each missing item we randomly select an observed value within the same class (imputation cell). The imputation cells should be defined so that the missing mechanism can be considered completely at random (MCAR) within each imputation cell. In the following, the hot deck cells are indexed by $g = 1, \dots, G$, $n_{Rg}$ is the number of respondents ($A_{Rg}$) in the $g$th cell, and the imputed values are denoted by $y_k^*$ for $k \in A_M$.

Rao and Shao (1992) proposed an adjusted jackknife (AJ) variance estimator under HD imputation that is consistent assuming a uniform response mechanism within each imputation cell. For a stratified multistage sampling design with ignorable finite population correction factor, the AJ is

$$\mathrm{Var}(\hat{Y}_I) = \sum_{h=1}^{L} \frac{n_h - 1}{n_h} \sum_{j=1}^{n_h} \left( \hat{Y}_{Ih}^{(j)} - \hat{Y}_I \right)^2 \tag{1}$$

where

$$\hat{Y}_I = \sum_{h=1}^{L} \left( \sum_{k \in A_{Rh}} w_k y_k + \sum_{k \in A_{Mh}} w_k y_k^* \right) \tag{2}$$

is the estimate of the total computed on observed and imputed data. The term $\hat{Y}_{Ih}^{(j)}$ in equation (1) is the estimate of $\hat{Y}$ when the $j$th unit of stratum $h$ is omitted (in a two stage sampling design, we refer to the PSU $j$ of stratum $h$):

$$\hat{Y}_{Ih}^{(j)} = \sum_{g=1}^{G} \left[ \sum_{k \in A_{Rg}} w_k^{(j)} y_k + \sum_{k \in A_{Mg}} w_k^{(j)} \left( y_k^* + \hat{y}_{Rg}^{(j)} - \bar{y}_{Rg} \right) \right] \tag{3}$$

where $w_k^{(j)}$ is the sampling weight for unit $k$ adjusted to account for the omission of unit $j$, $\hat{y}_{Rg}^{(j)} = \sum_{k \in A_{Rg}} w_k^{(j)} y_k / \sum_{k \in A_{Rg}} w_k^{(j)}$ and $\bar{y}_{Rg} = \sum_{k \in A_{Rg}} w_k y_k / \sum_{k \in A_{Rg}} w_k$.

3 Delete-A-Group Jackknife and the Extended method

The use of the jackknife in survey statistics has many advantages, such as: no need for model assumptions, straightforward variance computation for nonlinear statistics and easy calculation of domain estimates. Furthermore, unit nonresponses are easily dealt with and, by means of AJ, item nonresponses can be dealt with as well. Nevertheless, the jackknife is computer intensive, and its application is generally not feasible for large scale surveys. In this section we describe the delete-a-group jackknife (DAGJK) and the extended DAGJK (EDAGJK) methods. These techniques are computationally less intensive than the classical jackknife and can also be applied to large scale surveys.

3.1 Variance estimation in case of complete data

DAGJK and EDAGJK (Kott, 2001) belong to the group of strategies aiming at reducing the number of jackknife replications, while maintaining adequate precision of variance estimates. These methods are based on the following jackknife procedure:

1. Primary Sampling Units (PSUs) are randomly ordered in each stratum;
2. the PSUs are systematically allocated into $R$ groups;
3. for each unit, $R$ different sampling weights (replicate sampling weights) are computed as follows:

for DAGJK

$$w_k^{(r)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU in } h \text{ belongs to group } r \\ 0, & \text{when the PSU of } k \text{ is in group } r \\ \left[ n_h / (n_h - n_h^{(r)}) \right] w_k, & \text{otherwise} \end{cases} \tag{4}$$

where $n_h^{(r)}$ is the number of PSUs of stratum $h$ that belong to group $r$;

for EDAGJK

$$w_k^{(r)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU in } h \text{ belongs to group } r \\ w_k \left[ 1 - (n_h - 1) Z \right], & \text{when the PSU of } k \text{ is in group } r \\ w_k (1 + Z), & \text{otherwise} \end{cases} \tag{5}$$

where $Z^2 = R / \left[ (R - 1) \, n_h (n_h - 1) \right]$.

Given the replicate weights expressed by formula (4) for DAGJK or by formula (5) for EDAGJK, the variance estimator is

$$\mathrm{Var}(\hat{Y}) = \frac{R - 1}{R} \sum_{r=1}^{R} \left( \hat{Y}^{(r)} - \hat{Y} \right)^2 \tag{6}$$

where $\hat{Y}^{(r)} = \sum_{k \in s} w_k^{(r)} y_k$.

One important characteristic of DAGJK and EDAGJK is that the variance estimates improve when the number of random groups increases. A key aspect that distinguishes the two methods is their behaviour with respect to the number of sampled PSUs in the strata. In particular, DAGJK provides upward biased estimates when this number is small, while EDAGJK is expected to fix this problem, as shown in Kott (2001).

3.2 DAGJK and EDAGJK variance estimation with Rao and Shao adjustment for imputation

In order to take into account item nonresponse in variance estimation, we propose an adjusted version of DAGJK (AD-DAGJK) and of EDAGJK (AD-EDAGJK) based on the Rao and Shao adjustment for HD imputation. The proposed variance estimator, which combines the estimator in formula (6) with the jackknife adjustments described in formulas (2) and (3), is

$$\mathrm{Var}(\hat{Y}_I) = \frac{R - 1}{R} \sum_{r=1}^{R} \left( \hat{Y}_I^{(r)} - \hat{Y}_I \right)^2 \tag{7}$$

where, analogously to formula (3), $\hat{Y}_I^{(r)}$ is defined as

$$\hat{Y}_I^{(r)} = \sum_{g=1}^{G} \left[ \sum_{i \in A_{Rg}} w_i^{(r)} y_i + \sum_{i \in A_{Mg}} w_i^{(r)} \left( y_i^* + \hat{y}_{Rg}^{(r)} - \bar{y}_{Rg} \right) \right] \tag{8}$$

where $w_i^{(r)}$ are the replicate weights computed according to formula (4) or (5), $\hat{y}_{Rg}^{(r)} = \sum_{i \in A_{Rg}} w_i^{(r)} y_i / \sum_{i \in A_{Rg}} w_i^{(r)}$, and $\hat{Y}_I$ is already defined in formula (2).

In the case of incomplete data and HD imputation, the theoretical properties of AD-DAGJK and AD-EDAGJK remain to be investigated. In the next section, we empirically study the properties of these methods.
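The replicate weights and the adjusted variance estimator of this section can be sketched in a few lines. This is an illustrative reading of the formulas, not the authors' implementation: the dictionary keys (`stratum`, `psu`, `group`, `w`, `y`, `resp`, `cell`), the function names and the numbering of groups from 0 to R-1 are all assumptions, and the sketch simply applies no shift when a cell keeps no respondent with positive replicate weight.

```python
import math
from collections import defaultdict

def replicate_weight(w, n_h, n_hr, in_group, R, extended):
    """Replicate weight of one unit for group r.

    n_h: PSUs sampled in the unit's stratum; n_hr: how many of them fall in
    group r; in_group: True when the unit's own PSU is in group r.
    """
    if n_hr == 0:                      # no PSU of the stratum is in group r
        return w
    if extended:                       # EDAGJK weights
        z = math.sqrt(R / ((R - 1) * n_h * (n_h - 1)))
        return w * (1 - (n_h - 1) * z) if in_group else w * (1 + z)
    return 0.0 if in_group else w * n_h / (n_h - n_hr)   # DAGJK weights

def adjusted_total(units, rep_w, full_w):
    """Replicate total in which imputed values are shifted by the difference
    between the replicate and the full-sample respondent mean of their cell."""
    num_r, den_r = defaultdict(float), defaultdict(float)
    num_f, den_f = defaultdict(float), defaultdict(float)
    for u, wr, wf in zip(units, rep_w, full_w):
        if u["resp"]:
            g = u["cell"]
            num_r[g] += wr * u["y"]
            den_r[g] += wr
            num_f[g] += wf * u["y"]
            den_f[g] += wf
    total = 0.0
    for u, wr in zip(units, rep_w):
        g, shift = u["cell"], 0.0
        # Sketch-only guard: no shift if the cell has no weighted respondents.
        if not u["resp"] and den_r[g] > 0:
            shift = num_r[g] / den_r[g] - num_f[g] / den_f[g]
        total += wr * (u["y"] + shift)
    return total

def adjusted_variance(units, R, extended=True):
    """(R - 1) / R times the sum of squared deviations of the R adjusted
    replicate totals from the full-sample total."""
    psu_group = {(u["stratum"], u["psu"]): u["group"] for u in units}
    n_h, n_hr = defaultdict(int), defaultdict(int)
    for (h, _), r in psu_group.items():
        n_h[h] += 1
        n_hr[h, r] += 1
    full_w = [u["w"] for u in units]
    y_full = adjusted_total(units, full_w, full_w)  # shift is zero here
    ss = 0.0
    for r in range(R):
        rep_w = [replicate_weight(u["w"], n_h[u["stratum"]],
                                  n_hr[u["stratum"], r], u["group"] == r,
                                  R, extended) for u in units]
        ss += (adjusted_total(units, rep_w, full_w) - y_full) ** 2
    return (R - 1) / R * ss

# Toy check (hypothetical data): one stratum, four single-unit PSUs, R = 4.
units = [{"stratum": 1, "psu": j, "group": j, "w": 1.0, "y": float(j + 1),
          "resp": True, "cell": "g1"} for j in range(4)]
v_dagjk = adjusted_variance(units, R=4, extended=False)
v_edagjk = adjusted_variance(units, R=4, extended=True)
```

In this toy case every group contains exactly one PSU and nothing is imputed, so both variants reduce to the ordinary delete-one jackknife variance of the total.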

4 Description of a Monte Carlo simulation study on real survey data

The AD-DAGJK and AD-EDAGJK are evaluated by means of a Monte Carlo simulation study based on real survey data. They are compared with the standard Horvitz-Thompson variance estimation (which does not take the imputation process into account) and with the AJ estimation under HD imputation. In the experiment, 500 samples are selected from the 1991 Italian Population Census data. The target population of the experiment is the geographical region Lazio, excluding the province of Rome (1,372,572 units). The samples are drawn according to the Italian Labour Force sampling design, which is summarized in the following steps:

- the municipalities of each province are ordered by population size and strata of municipalities with population size equal to a given threshold are formed. Strata with only one municipality are referred to as self-representing (S-R). There are 7 S-R strata;
- in each S-R stratum a sample of households (PSUs) of size proportional to the population is selected (stratified cluster design);
- in the non S-R strata (NS-R) a pps sample of municipalities (PSUs) of size 2 is drawn, and a sample of households is selected (two stage stratified design). There are 18 NS-R strata;
- the number of sampled PSUs (municipalities and households) is 552.

We are interested in estimating the total number of employed and unemployed persons. Missing items are introduced in the variable employment (employed/not employed) according to a missing at random (MAR) mechanism. The nonresponse probabilities depend on two observed variables: $X_1$ (levels: 1, 2, 3, 4), referring to the household's type; $X_2$, an indicator variable of whether the unit belongs to an S-R or an NS-R stratum. The nonresponse probabilities are reported in Table 1. Missing items are imputed by means of HD within imputation cells corresponding to the classes used for generating nonresponse, see Table 1.
Table 1: Missing rates for the simulated nonresponse mechanism

         X1 = 1   X1 = 2   X1 = 3   X1 = 4
NS-R     10%      20%      30%      40%
S-R      40%      30%      20%      10%

It is worthwhile to note that jackknife methods are nearly unbiased when the number of sampled PSUs in each stratum is large. Kott (1998, 2001) suggests that each stratum should have at least 5 PSUs. Therefore, the AD-DAGJK and AJ methods do not seem to be appropriate with this sampling design, while the AD-EDAGJK is expected to overcome this problem. An additional problem is that a non-negligible PSU sampling fraction can cause an upward bias of the variance estimates. We are in this critical situation since, as shown in Table 2, 5 NS-R strata have a sampling fraction greater than 40%.
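The nonresponse mechanism of Table 1 can be sketched as follows. The record layout (`stratum_type`, `x1`, `employment`), the use of `None` for a missing item and the function name are assumptions made for illustration; only the eight probabilities come from the table.

```python
import random

# Nonresponse probabilities from Table 1, keyed by (stratum type, X1 level).
MISSING_RATE = {
    ("NS-R", 1): 0.10, ("NS-R", 2): 0.20, ("NS-R", 3): 0.30, ("NS-R", 4): 0.40,
    ("S-R", 1): 0.40, ("S-R", 2): 0.30, ("S-R", 3): 0.20, ("S-R", 4): 0.10,
}

def simulate_nonresponse(records, rng):
    """Return copies of the records in which 'employment' is set to None
    with the probability attached to the record's (stratum type, X1) class."""
    out = []
    for rec in records:
        rec = dict(rec)  # leave the input records untouched
        if rng.random() < MISSING_RATE[rec["stratum_type"], rec["x1"]]:
            rec["employment"] = None
        out.append(rec)
    return out

# Toy population (hypothetical): 10,000 units in the class with a 40% rate.
rng = random.Random(1)
records = [{"stratum_type": "S-R", "x1": 1, "employment": 1}
           for _ in range(10000)]
incomplete = simulate_nonresponse(records, rng)
observed_rate = sum(r["employment"] is None for r in incomplete) / len(incomplete)
```

Since the probability depends only on fully observed variables, the mechanism is MAR overall and completely at random within each of the eight classes, which is why the same classes can serve as imputation cells.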

Table 2: Frequency of NS-R strata by PSU sampling fraction

             < 20%    20%-40%    40%-60%    > 60%    Total
Frequency

5 Results

The first step of the analysis tests the effect of the number of random groups on the variance estimates obtained via EDAGJK and AD-EDAGJK. Two main quality indicators are considered: the Relative Bias (RB) and the Relative Root Mean Squared Error (RRMSE). The two indicators are computed under two cases: in absence of nonresponse (case A), and with nonresponse adjusted by HD imputation (case B). In both cases, we need a true reference variance for the evaluation.

In case A, this is computed by the following steps:

- draw 5000 samples according to the sample design previously described;
- for each sample compute the Horvitz-Thompson estimate;
- compute the variance of the 5000 estimates.

In case B:

- draw 5000 samples according to the sample design previously described;
- in each sample, simulate nonresponse according to the previously described mechanism;
- impute each incomplete sample through HD;
- compute the Horvitz-Thompson estimate for each imputed sample;
- compute the variance of the 5000 estimates.

In both cases, we denote the reference variance by $\mathrm{Var}(\tilde{Y})$. The RB indicator is expressed by

$$RB = \frac{1}{500} \sum_{c=1}^{500} \frac{[\mathrm{Var}(\tilde{Y})]_{(c)} - \mathrm{Var}(\tilde{Y})}{\mathrm{Var}(\tilde{Y})} \tag{9}$$

where $[\mathrm{Var}(\tilde{Y})]_{(c)}$ is the EDAGJK or AD-EDAGJK variance estimator, in case A and B respectively, obtained in the $c$th Monte Carlo iteration. We remark that, in case A, $\tilde{Y}$ indicates $\hat{Y}$, while in case B $\tilde{Y}$ is equal to $\hat{Y}_I$. The RRMSE is given by

$$RRMSE = \sqrt{ \frac{1}{500} \sum_{c=1}^{500} \left\{ \frac{[\mathrm{Var}(\tilde{Y})]_{(c)} - \mathrm{Var}(\tilde{Y})}{\mathrm{Var}(\tilde{Y})} \right\}^2 } \tag{10}$$
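The two indicators can be computed from the Monte Carlo variance estimates as in the sketch below. It assumes the usual definitions: RB is the mean relative deviation from the reference variance, and RRMSE is the square root of the mean squared relative deviation. The function names and the four toy estimates are hypothetical.

```python
import math

def relative_bias(estimates, reference):
    """Mean relative deviation of the variance estimates from the reference."""
    return sum((v - reference) / reference for v in estimates) / len(estimates)

def relative_rmse(estimates, reference):
    """Square root of the mean squared relative deviation."""
    mean_sq = sum(((v - reference) / reference) ** 2
                  for v in estimates) / len(estimates)
    return math.sqrt(mean_sq)

# Toy values (hypothetical): four variance estimates around a reference of 100.
estimates = [90.0, 110.0, 120.0, 80.0]
reference = 100.0
rb = relative_bias(estimates, reference)      # deviations cancel out
rrmse = relative_rmse(estimates, reference)   # sqrt(0.025)
```

The toy case shows why both indicators are reported: the estimates are unbiased on average (RB is zero) yet scattered, which only the RRMSE reveals.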

Table 3 shows that in both cases A and B, the number of random groups has a slight influence on the RB values. The minimum RB is attained with 5 random groups. In general, the imputation process increases the bias of the variance estimates. The RRMSE depends more strongly than the RB on the number of random groups: it decreases when the number of random groups increases. 30 random groups seem a good compromise between obtaining good quality variance estimates and limiting the computational effort. In the following, the comparative analysis is carried out using 30 random groups.

Table 3: RB and RRMSE of EDAGJK and AD-EDAGJK variance estimators

       Without missing data      With imputed data
R      RB        RRMSE           RB        RRMSE

Figure 1 depicts the boxplots of the variance estimates obtained through EDAGJK, DAGJK, the standard Horvitz-Thompson variance estimator (STANDARD), and the jackknife method in case A. The dashed line represents the true reference variance. EDAGJK and STANDARD are nearly unbiased estimators. Nevertheless, the STANDARD method yields slightly more efficient variance estimates than EDAGJK. The poor performance of the DAGJK and the jackknife may be caused by the sampling design. This design has many strata with only two PSUs (NS-R strata) and, as described in Kott (2001), when the stratum sample size is smaller than the number of random groups the DAGJK has an upward bias. This is also true for the jackknife method, which may be viewed as a particular case of DAGJK where the number of random groups equals the overall number of PSUs. On the other hand, the greater number of random groups in the jackknife produces a more efficient estimate than the DAGJK method.

Figure 2 shows the boxplots of the variance estimates obtained through AD-EDAGJK, AD-DAGJK, STANDARD, and AJ in case B. We may still observe that the proposed AD-EDAGJK method is nearly unbiased, while the STANDARD method, which ignores the adjustment for imputation, produces downward biased estimates.
DAGJK and jackknife show an upward bias. Hence, the results for case B indicate that AD-EDAGJK outperforms the other methods under study.

The last step of the analysis compares the methods in terms of the coverage rate of the 95% confidence interval (Table 4). In case A, Table 4 shows that the coverage rates of the confidence intervals are generally close to the 95% level. The less biased estimators, EDAGJK and STANDARD, have coverage rates around 91%, while the coverage rates greater than 97% for DAGJK and jackknife are due to the larger length of the confidence interval. In case B, the STANDARD method attains a coverage rate of 88%, very far from the nominal one. The proposed AD-EDAGJK produces the coverage rate (92.5%) closest to the nominal level, with a quite small relative length of the confidence interval.
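Coverage and interval length can be computed from the Monte Carlo output as sketched below. The paper does not spell out how the intervals or the relative length are defined, so two assumptions are made here: the interval is the normal-approximation interval, estimate plus or minus 1.96 standard errors, and the relative length is the interval length divided by the true total. The function name and toy numbers are hypothetical.

```python
import math

def coverage_and_relative_length(estimates, variances, true_total, z=1.96):
    """Share of intervals containing the true total, and mean interval length
    relative to the true total (both definitions are assumptions)."""
    hits, rel_len = 0, 0.0
    for est, var in zip(estimates, variances):
        half = z * math.sqrt(var)                 # half-width of the interval
        hits += est - half <= true_total <= est + half
        rel_len += 2 * half / true_total
    n = len(estimates)
    return hits / n, rel_len / n

# Toy values (hypothetical): three point estimates with their variance estimates.
coverage, rel_length = coverage_and_relative_length(
    [100.0, 104.0, 90.0], [4.0, 1.0, 9.0], true_total=100.0)
```

Under these definitions a downward biased variance estimator shortens the intervals and lowers the coverage, while an upward biased one does the opposite, which matches the pattern discussed for Table 4.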

Figure 1: Boxplots of the variance estimation methods without missing values

Figure 2: Boxplots of the variance estimation methods with imputed data

Table 4: 95% Confidence interval (CI) coverage and CI relative length (RL) by method

                      Without missing data         With imputed data
METHODS               95% CI COVERAGE   CI RL      95% CI COVERAGE   CI RL
EDAGJK/AD-EDAGJK
DAGJK/AD-DAGJK
STANDARD
JACKKNIFE/AJ

6 Conclusion

Imputation is commonly used to compensate for missing or invalid values in sample surveys, and variance estimation that does not take imputation into account can produce misleading measures of the quality of the estimates. This problem is becoming a pressing issue in Official Statistics. In the case of large scale surveys, a suitable method of variance estimation must also be computationally feasible. The proposed approach, based on EDAGJK with the Rao and Shao adjustment, balances quality with computational cost: in the Labour Force Survey Monte Carlo experiment it shows good properties in terms of precision of the variance estimates while remaining computationally feasible. Furthermore, the empirical results suggest that the approach is suitable for the complex designs generally used in Official Statistics. Future analyses may be developed in the following directions:

- investigation of theoretical properties;
- study of the effect of the introduction of a finite population correction factor;
- application of the method with calibration estimators.

References

Chen, J., Shao, J. (2000). Nearest Neighbor Imputation for Survey Data. Journal of Official Statistics, 16.
Chen, J., Shao, J. (2001). Jackknife Variance Estimation for Nearest-Neighbour Imputation. Journal of the American Statistical Association, 96.
Kalton, G., Kasprzyk, D. (1986). The Treatment of Missing Survey Data. Survey Methodology, 12.
Kott, P. S. (2001). The delete-a-group jackknife. Journal of Official Statistics, 17.
Lee, H., Rancourt, E., Särndal, C.-E. (2002). Variance Estimation from Survey Data under Single Value Imputation, in Groves, R. et al. (eds.), Survey Nonresponse, J. Wiley and Sons, New York.
Little, R.J.A., Rubin, D.B. (2002). Statistical Analysis with Missing Data. Wiley, New York.
Rancourt, E., Särndal, C.-E., Lee, H. (1994). Estimation of the variance in presence of Nearest Neighbour Imputation. Proceedings of the Section on Survey Research Methods, American Statistical Association.
Rao, J.N.K., Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 4.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, New York.
Saigo, H., Sitter, R.R. (2005). Jackknife variance estimator with reimputation for randomly imputed survey data. Statistics and Probability Letters, 73.
Särndal, C.-E., Swensson, B., Wretman, J. (1992). Model Assisted Survey Sampling. Springer-Verlag, New York.
