Statistical Matching of Two Surveys with a Common Subset

Marco Ballin, Marcello D'Orazio, Marco Di Zio, Mauro Scanu, Nicola Torelli

Statistical Matching of Two Surveys with a Common Subset

Working Paper n.

Statistical Matching of Two Surveys with a Common Subset

Marco Ballin, Marcello D'Orazio, Marco Di Zio, Mauro Scanu, Nicola Torelli

ISTAT, Italian National Statistical Institute, Rome, Italy
ballin@istat.it, madorazi@istat.it, dizio@istat.it, scanu@istat.it
Università di Trieste
nicolat@econ.univ.trieste.it

Abstract

Statistical matching techniques aim at combining the information available in two distinct datasets. The data are often collected in two independent surveys, and it is usually assumed that the records in the two datasets refer to different units. When a non-negligible number of units is included in both surveys, one obtains valuable information that can be used in combining data from the two sources. This is, for instance, the case when the two datasets contain data collected in enterprise surveys. In this paper, statistical matching approaches for the case where complete data are available on a common subset of units are presented and discussed. The case of survey data collected in two agricultural surveys whose designs are negatively coordinated is considered.

Key words: file concatenation, uncertainty, missing data, negative coordination, agricultural enterprise surveys.

1 Introduction

Statistical matching techniques aim at combining the information available in two distinct datasets. The two datasets, A and B, often contain data collected in two independent sample surveys of size n_A and n_B respectively, such that (i) the two samples contain distinct units; (ii) the two samples contain information on some variables X (common variables), while other variables are observed in only one of the two samples, say Y in A and Z in B. The common variables X can, for instance, be used to create synthetic records containing information on the joint distribution of (X, Y, Z), but the properties of the resulting synthetic archive need careful examination, and one often has to rely upon restrictive assumptions such as the conditional independence assumption (CIA) of Y and Z given X.
Where large-scale national surveys are concerned, it is often reasonable to assume that the probability that the same unit is included in both samples is zero. Nonetheless, when a subset of units is included in both surveys and data on the joint distribution of (X, Y, Z) are collected for them, this information should be taken into account in the statistical matching procedure: data on the common subset can provide valuable auxiliary information to alleviate the CIA. In principle, if it were known in advance that statistical matching will be used to combine data from the two surveys, it could be convenient to design a small supplementary survey to collect data on (X, Y, Z) for some units of the common population, as suggested by Renssen (1998); see also Singh et al. (1993). More generally, when data are collected according to complex survey designs, it is likely that a small set of units is included in both samples. This is, for instance, true for enterprise surveys: in both surveys, the sample designs assign high inclusion probability (usually close or equal to 1) to large enterprises. It is worth noting that, in the case of enterprise surveys, the common subset may not be representative of the target population, since it might

1 Marco Ballin is researcher, Istat, via Ravà 150, Roma, Italy (ballin@istat.it); Marcello D'Orazio is researcher, Istat, via Cesare Balbo 16, Roma, Italy (madorazi@istat.it); Marco Di Zio is researcher, Istat, via Cesare Balbo 16, Roma, Italy (dizio@istat.it); Mauro Scanu is researcher, Istat, via Cesare Balbo 16, Roma, Italy (scanu@istat.it); Nicola Torelli is professor, Università di Trieste, piazzale Europa 1, Trieste (nicolat@econ.univ.trieste.it).

include only large enterprises. In both cases it is not clear how this extra information can be appropriately exploited in a statistical matching procedure. In this paper, strategies for statistical matching of data collected in two complex sample surveys with a common subset are presented. Since data from two complex surveys are combined, the proposed strategies should also take into account the appropriate use of survey weights. As already recalled, the use of data from a common subset was considered by Renssen (1998). His approach explicitly takes into account the availability of data collected in a small supplementary survey specifically designed to observe (X, Y, Z), and is based on calibration of the survey weights in order to exploit this information. This procedure, which relies upon a sequence of calibration steps, cannot easily be generalized to cover the more complex situations that emerge when the common subset is not obtained from a specifically designed supplementary survey. A second approach, useful for statistical matching of data from two complex surveys with a common subset, is file concatenation, as proposed by Rubin (1986). This procedure has rarely been applied, since the evaluation of the sampling weights of the concatenated file can be very difficult (especially when the two sample designs explicitly admit that the probability that some units enter both samples is not negligible). Strategies to overcome this problem are considered in this paper, and an application of file concatenation for matching data from two surveys with a common subset is proposed. A third approach is the analysis of uncertainty, where properties of the unobserved (Y, Z) distribution (in terms of intervals of plausible values) are inferred from the marginal and conditional distributions actually estimable from surveys A and B (see D'Orazio et al. (2006a), ch. 4).
The paper is motivated by the application of statistical matching techniques to data collected in two important Italian surveys on agricultural enterprises: the Farm Structure Survey (hereafter FSS) and the Farm Accountancy Data Network survey (FADN). In this case an extreme situation emerges, since the common subset comprises almost exclusively large enterprises. This happens because the two survey designs are negatively coordinated in order to reduce respondent burden. The organization of the paper is as follows. Section 2 describes the statistical matching strategies recalled above and proposes a feasible procedure to properly take into account the data collected on the common subset. Section 3 introduces the motivating problem and data: the two surveys, FSS and FADN, as well as their designs, are presented. Some results from the application of file concatenation and from the evaluation of uncertainty, taking into account the common subset, are in Section 4. Section 5 contains some final comments.

2 The Methods

The statistical matching problem in its basic form can be considered as an inferential problem with partial information. It is usually assumed that the two samples to be matched, denoted as A and B, do not overlap on the observed units; in this case the observed common set of variables X is the only available information for drawing inferences on the relationship between two other sets of variables, Y and Z, observed in A and B respectively and never jointly observed (Figure 1 (a)). This framework allows pointwise estimation of parameters on X, Y|X and Z|X, while anything related to the distribution of (Y, Z) or of (Y, Z|X) can be estimated pointwise only under some simplifying assumptions, which are untestable for the data at hand. Usually the conditional independence assumption (CIA) is adopted, i.e. Y and Z are considered independent given the common set of observed variables X.
If there is not enough evidence to justify such an assumption, different procedures proposed in the statistical matching literature can be used. One valuable source of information is an additional file C which includes observations on all the variables of interest X, Y and Z. Sometimes this file is given by a separate sample which is representative of the population under study, but it can also be obtained as the intersection of the two sample surveys A and B, as in Figure 1 (b). The statistical matching problem becomes more difficult when the samples at hand have been

Figure 1: The case of two samples A and B (a) with empty intersection and (b) with a non-empty intersection C. White spaces correspond to missing values.

drawn according to complex survey designs. In this case the problem of the treatment and harmonization of survey weights must be tackled. To date, there are only a few procedures (file concatenation, incomplete two-way stratification and synthetic two-way stratification) that have been defined, or can be properly adapted, in order to take complex survey designs into account in a statistical matching problem. Note that these procedures can be used also when C is not available or is not of adequate quality. Whatever procedure is actually used to get results on the (Y, Z) distribution, one often has to rely upon assumptions (hopefully less restrictive ones when auxiliary information in the form of the subset C is available). Since many of these assumptions cannot always be tested with the available data, it can be useful to assess the uncertainty of statistical matching.

File Concatenation

The original proposal of file concatenation in Rubin (1986) consisted in modifying the sample weights of the two surveys A and B in order to get a unique sample given by the union of A and B (A ∪ B) with survey weights representative of the population of interest. The basic idea is that the new sampling weights can be derived under the simplifying assumption that the probability of including a unit in both samples is negligible, so that the concatenated file is as in Fig. 1 (a). This is generally true for two independent sample surveys, provided that the two sample designs do not assign to some units an inclusion probability close or equal to 1. Under this assumption, the inclusion probability π_i of a record i in A ∪ B is simply:

π_i = π_{i,A} + π_{i,B},   (1)

where π_{i,A} and π_{i,B} are the inclusion probabilities of unit i in A and B, respectively.
A first consideration comes from the review in Moriarity and Scheuren (2003), p. 71, who noted: "The notion of file concatenation is appealing. However on a close examination it seems to have limited applicability." The inclusion probabilities (1) require knowledge of the inclusion probability of the records in A under the survey design of B, as well as the inclusion probability of the records in B under the survey design of A. It is worth noting that the design variables of a survey may not be available in other surveys, and for this reason the approach proposed by Rubin has seldom been applied. Nevertheless, statistical agencies usually have complete knowledge of the sample designs as well as of the sampling frames the samples are drawn from, so in principle file concatenation can be applied. A second consideration is that many survey designs (mainly for business surveys) admit units with inclusion probabilities near 1, so that the concatenated file is as in Fig. 1 (b). Hence, the actual inclusion probability of a unit in A ∪ B (where units in A ∩ B are taken only once) is:

π_i = π_{i,A} + π_{i,B} − π_{i,A∩B},   (2)

where π_{i,A∩B} is the probability that i is included in both samples. It is often not easy to evaluate π_{i,A∩B} when these terms cannot be assumed equal to 0. If the sampling frame

including both the sampling design variables used for selecting A and those used for selecting B is available, it is possible to approximate the inclusion probabilities π_{i,A∩B} in (2) following an approach based on Monte Carlo simulation, as suggested in Fattorini (2006):

1. draw T independent samples A^(t) and B^(t), t = 1, ..., T, from the population according to the two survey designs, respectively;
2. compute the proportion of samples that include unit i in both A^(t) and B^(t).

Once the new inclusion probabilities π_i for the units in the concatenated sample A ∪ B have been computed, this sample can be treated as a single survey data file containing (massive) missing values. Estimation of any parameter of (X, Y, Z) can then be performed by a method that deals with partially observed data sets, as in Little and Rubin (2002).

Incomplete and Synthetic Two-way Stratification

A different approach is outlined in Renssen (1998). This approach does not need information beyond the survey weights attached to each unit in A and B under the two survey designs. It consists in analyzing A and B separately (as well as an additional survey sample C, if it exists) after these samples have been harmonized in terms of the common statistical information on X, Y, and Z. Harmonization is performed by means of calibration procedures. When only A and B are available, calibration adjusts the weights in A and B so that the estimates of the distribution of X computed on A and on B coincide. If a file C is also available, two different procedures can be used. Incomplete two-way stratification adjusts the survey weights in C so that the estimates of the marginal distributions of Y and Z computed on C equal those computed on A and B, respectively. The adjusted file C is then used for the estimation of any parameter of interest on (Y, Z).
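The two Monte Carlo steps of Fattorini's approximation described above can be sketched in code. The following is a minimal Python sketch under simplifying assumptions (stratified simple random sampling without replacement in both designs, independent draws); the function names and the toy designs are illustrative, not the paper's code, and a real application would reproduce the actual FSS and FADN designs:

```python
import numpy as np

def draw_stratified(strata, sizes, rng):
    # strata: dict stratum label -> list of unit ids; sizes: per-stratum sample sizes.
    sample = set()
    for s, units in strata.items():
        sample.update(rng.choice(units, size=sizes[s], replace=False).tolist())
    return sample

def joint_inclusion_probs(strata_a, sizes_a, strata_b, sizes_b, units, T=2000, seed=0):
    # Step 1: draw T independent pairs of samples from the two designs.
    # Step 2: count, for each unit, how often it falls in both samples.
    rng = np.random.default_rng(seed)
    counts = {i: 0 for i in units}
    for _ in range(T):
        joint = draw_stratified(strata_a, sizes_a, rng) & draw_stratified(strata_b, sizes_b, rng)
        for i in joint:
            counts[i] += 1
    # smoothed empirical probability (the +1 terms keep it strictly positive)
    return {i: (c + 1) / (T + 1) for i, c in counts.items()}
```

A unit in a take-all stratum of both designs gets an estimated joint inclusion probability of essentially 1, exactly the situation of Fig. 1 (b); the estimated π̂_{i,A∩B} can then be plugged into (2).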
Synthetic two-way stratification consists in estimating parameters on (Y, Z) as a sum of two components: (i) the first component is an estimate based on the CIA using only A and B; (ii) the second component is an estimate of the residual of the estimate in step (i) when a model different from the CIA holds; this residual is estimated on C.

Uncertainty in Statistical Matching

When C is not available, the only conclusions that can be drawn in the statistical matching problem are intervals that include all the parameters compatible with the (X, Y) parameters (estimated on A) and the (X, Z) parameters (estimated on B). In other words, the uncertainty due to the lack of joint information on X, Y, and Z should be assessed. The statistical matching literature focuses mainly on the case where (X, Y, Z) follows a trivariate normal distribution, as in Kadane (1978), Rubin (1986), Moriarity and Scheuren (2001), Moriarity and Scheuren (2003), and Rässler (2002). The categorical case has been treated in D'Orazio et al. (2006b); see also D'Orazio et al. (2006a), ch. 4. They also show how it is possible to reduce this interval by introducing suitable constraints on the unobserved variables. Uncertainty can be estimated in the following way. Estimate the marginal distribution of the common variables X: on A ∪ B in the case of file concatenation, or either on A or on B in the approach by Renssen (1998). Estimate the conditional distributions of Y|X and Z|X from the original data sets A and B, respectively. This is straightforward for the approach in Renssen (1998), while in the concatenated file the survey weights should be appropriately modified for the presence of missing data in A ∪ B.

All the distributions admitting the estimated marginal distribution of X and the estimated conditional distributions of Y|X and Z|X are compatible with the available sample information A ∪ B, and they describe the uncertainty of statistical matching for the data sets at hand. In this paper we discuss the case where X, Y and Z are categorical. Let θ_{hjk} be the probability of the cell (X = h, Y = j, Z = k), h = 1, ..., H, j = 1, ..., J, k = 1, ..., K. According to the Fréchet bounds, when the marginal distribution of X (θ_{h··}) and the conditional distributions of Y and Z given X (θ_{j|h} and θ_{k|h}, respectively) are known (estimated), the joint probability for Y and Z (θ_{·jk}) lies in the interval with extremes:

(θ^L_{jk}, θ^U_{jk}) = ( Σ_h θ_{h··} max{0; θ_{j|h} + θ_{k|h} − 1},  Σ_h θ_{h··} min{θ_{j|h}; θ_{k|h}} ).   (3)

Note that, if X were not used, the Fréchet bounds would be:

max{0; θ_{·j·} + θ_{··k} − 1} ≤ θ_{·jk} ≤ min{θ_{·j·}; θ_{··k}}.   (4)

It is easy to prove by Jensen's inequality that the interval (3) is narrower than (4). This can be considered an effect of the exploitation of the common variables X, and it suggests a measure of how informative these variables are in the matching process. The interval (3) can be modified if there are different sets of units characterized by different levels of information. Assume the population of interest can be split into D subpopulations. Different levels of information can arise from different sets of common variables X in the different subpopulations (possibly attached through administrative archives or other sources of information). Sometimes it may happen that for some subpopulations there is no uncertainty on (Y, Z). In this case, uncertainty is defined as:

(θ^L_{jk}, θ^U_{jk}) = Σ_{d=1}^{D} φ_d (θ^L_{jk|d}, θ^U_{jk|d}),   (5)

where φ_d > 0 is the relative frequency of units in the d-th subpopulation (Σ_d φ_d = 1) and (θ^L_{jk|d}, θ^U_{jk|d}) are the bounds (computed as in (3)) of the uncertainty intervals for the probability of the cell (Y = j, Z = k) in subpopulation d.
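The bounds (3) and (4) are simple to compute once the marginal and conditional distributions have been estimated. A minimal Python sketch (illustrative names; in practice the inputs would be survey-weighted estimates):

```python
import numpy as np

def frechet_bounds_given_x(theta_h, theta_j_h, theta_k_h):
    """Equation (3): bounds on theta_{.jk} from the marginal of X and the
    conditionals of Y|X and Z|X.
    theta_h: (H,); theta_j_h: (H, J), row h is P(Y=j | X=h); theta_k_h: (H, K)."""
    tj = theta_j_h[:, :, None]      # (H, J, 1)
    tk = theta_k_h[:, None, :]      # (H, 1, K)
    w = theta_h[:, None, None]      # (H, 1, 1)
    low = (w * np.maximum(0.0, tj + tk - 1.0)).sum(axis=0)
    up = (w * np.minimum(tj, tk)).sum(axis=0)
    return low, up                  # each (J, K)

def frechet_bounds_no_x(theta_j, theta_k):
    """Equation (4): the unconditional Fréchet bounds."""
    tj = theta_j[:, None]
    tk = theta_k[None, :]
    return np.maximum(0.0, tj + tk - 1.0), np.minimum(tj, tk)
```

By the Jensen argument recalled in the text, the intervals from `frechet_bounds_given_x` are always contained in those from `frechet_bounds_no_x` computed on the implied marginals, which gives a quick check of how informative the X variables are.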
Discussion

A comparison of the accuracy of file concatenation, incomplete two-way stratification and synthetic two-way stratification has not yet been carried out. Nevertheless, some conclusions can be drawn. When the evaluation of uncertainty is the goal, adequate estimates of the X, Y|X and Z|X distributions are needed. File concatenation can be expected to be more efficient than the other methods at least for the distribution of X, given that all the observations in both samples are used. The approaches by Renssen use a method (calibration of survey weights) that may require a great computational effort. In fact, these methods require the specification of additional constraints compared to standard calibration (Deville and Särndal, 1992). These additional constraints can cause the calibration procedure to fail to converge, a problem which becomes more evident as the number of variables increases. Finally, file concatenation and the approaches suggested by Renssen differ in the kind of additional information C. Renssen supposes that C is a third sample, representative of the population of interest. By contrast, file concatenation does not consider a third sample survey (otherwise weights different from those in (2) should be considered). File concatenation naturally uses the intersection of the two samples as the joint information. This joint information is embedded

in a large sample A ∪ B with missing observations. This file should be treated with appropriate methods that tackle the problem of partially observed data. In the next sections, an example of statistical matching applied to two agricultural surveys is illustrated; it involves two overlapping sample surveys, and for this reason file concatenation is used.

3 Statistical Matching of Two Agricultural Samples

3.1 The Datasets

The FSS (henceforth A) and the FADN (henceforth B) investigate the farms that in a given year (2003 in our application) have certain characteristics in terms of Utilized Agricultural Area (UAA) and/or a certain proportion of production for sale and/or a certain size of the production unit. The population consists of approximately 1.8 million farms. Survey A is carried out every two years. Its main objective is to investigate the principal phenomena such as crops, livestock, machinery and equipment, labour force, and the characteristics of the holder's family. Sampling units are selected according to a stratified random sampling design. The strata are designed according to location (region or province), UAA, Livestock Size Unit (LSU), Economic Size Unit (ESU) and typology of the agricultural holdings. A take-all stratum (all the units in the take-all stratum are selected into the sample) contains the largest farms. The total sample size is n_A = . Survey B collects data on the economic structure and results of the farms, such as costs, added value, employment, labour cost, household income, etc. The sample is selected according to a stratified random sampling design. The strata are defined according to region or province code, typology classification (first digit), ESU classes, and working-day classes. A take-all stratum contains the largest farms in terms of ESU. The sample consists of n_B = units. The selection of the units in the two surveys is negatively coordinated in order to reduce the response burden. Nevertheless, it is not possible to avoid that the largest farms are included in both surveys.
The overlap between the two surveys resulted in n_{A∩B} = farms. Among these, farms belong to the take-all stratum of A. Table 1 summarizes the details concerning the sample sizes.

Table 1: Sample sizes

                   OUT FSS sample   IN FSS sample   Total
OUT FADN sample
IN FADN sample
Total

3.2 Deriving Sampling Weights of the Concatenated File

The concatenation of A and B gives a unique sample of size n_{A∪B} = n_A + n_B − n_{A∩B} = . As described in Section 2, the probability that unit i belongs to both A and B can be estimated by applying the following steps:

1. draw T pairs of independent samples (A^(t), B^(t)), t = 1, ..., T, from the population of farms according to the two survey sampling designs;
2. estimate the probabilities π_{i,A∩B} through the empirical inclusion probabilities

   π̂_{i,A∩B} = ( Σ_{t=1}^{T} I^(t)_{A∩B}(i) + 1 ) / ( T + 1 ),   i = 1, 2, ..., n_{A∪B},

where I^(t)_{A∩B}(i) is 1 when unit i is included in both A^(t) and B^(t). The procedure consisted of M = iterations. Although M is not large, the procedure seems quite efficient, since the differences between the empirical probabilities computed at the 2000th iteration and at M = were not significant. More details on how these probabilities are computed can be found in Ballin et al. (2009). The inclusion probabilities π_{A∩B} have been estimated by considering the union of the two theoretical samples but, in practice, unit nonresponse in the two surveys must be considered. Denoting by m the number of responding units in the samples at hand, the concatenated file consists of m_{A∪B} = farms, obtained by joining the responding farms in the two surveys (m_A = and m_B = ) and counting just once the 833 farms that responded to both surveys. In order to deal with unit nonresponse, the estimated inclusion probabilities have been corrected, as in the original surveys, by a calibration based on auxiliary information coming from administrative archives. From now on, the weights w_{A∪B} in the concatenated file are the inverses of the joint inclusion probabilities calibrated to known population totals.

4 Some Results

In order to illustrate how the concatenated file can be used to explore the relationship between Y and Z, attention is limited to a few variables. More precisely, the variables considered here are the Number of Sale Channels (NoSC) (Y), observed only in A, and the Earnings Before Interest, Taxes, Depreciation and Amortization (EBITDA) (Z), available only for the farms in B. The following common variables are considered: UAA, ESU and LSU. All the variables are actually continuous, with the exception of NoSC (the following types of sale channels have been considered: (i) direct sale to consumers; (ii) sale with contractual bonds to industrial enterprises or to business companies; (iii) sale without contractual bonds).
The distribution of the continuous variables is very skewed and, moreover, for a large number of units the observed value is zero. For the sake of simplicity, all the continuous variables have been categorized using the classes adopted when publishing official results (see Table 2). Some alternative strategies for using the concatenated file, trying to take into account the auxiliary information given by the common subset, are considered:

1. as a first step the CIA is assumed: it makes no explicit use of the auxiliary information;
2. a second possibility is to consider the concatenated file and to estimate the joint distribution (Y, Z) using statistical methods for inference with missing data;
3. since in this example the survey designs are negatively coordinated, one can split the sample into two strata, big and small farms, where only for the former a common subset is available;
4. finally, also with the aim of giving a framework to judge the results from the previous steps, the evaluation of uncertainty is considered, taking into account the two distinct strata introduced above.

In this setting, the objective of the inference is the contingency table of NoSC vs. EBITDA. This joint distribution can easily be estimated under the CIA:

θ̂_{·jk} = Σ_h θ̂_{h··} θ̂_{j|h} θ̂_{k|h}.

In the file concatenation approach, a cell probability can be estimated by simply dividing the sum of the final survey weights of the units belonging to it by the estimated population size, i.e.

Table 2: Variables considered in FSS and FADN

Variable                         No. of categories   Categories
X1 = UAA (hectares)              5                   [0, 2.5), [2.5, 5), [5, 10), [10, 25), ≥ 25
X2 = ESU (thousands of Euro)     5                   [0, 4), [4, 8), [8, 40), [40, 100), ≥ 100
X3 = LSU                         2                   0, ≥ 1
Y = NoSC                         4                   0, 1, 2, 3
Z = EBITDA (thousands of Euro)   6                   ≤ 0, (0, 1], (1, 5], (5, 10], (10, 25], > 25

θ̂_{h··} = Σ_{i∈A∪B} w_i I_h(x_i) / Σ_{i∈A∪B} w_i,

where I_h(x_i) is equal to 1 when x_i = h and zero otherwise. Results obtained under the CIA are shown in Table 3. The CIA is a strong assumption, and it may not be a valid one in this context. Moreover, it does not exploit the information available in A ∩ B. In order to exploit this information, the table of Y vs. Z is estimated from the whole concatenated data file directly, by applying the EM algorithm (see Schafer, 1997). As in Patterson et al. (2002), the EM algorithm is applied substituting the unweighted cell counts with the weighted cell counts. The results are reported in Table 3 in the rows labelled "EM all". The final estimates are obtained with the version of the EM algorithm in the package cat (written by Schafer and ported to R by Harding and Tusell, 2009) developed for the R environment (R Development Core Team, 2009).

Table 3: Estimates of (Y, Z) under the three different approaches

           Z = Classes of EBITDA (thousand of Euro)
Y    Method      0   (0, 1]   (1, 5]   (5, 10]   (10, 25]   > 25   Tot.
0    CIA
     EM all
     EM+rake
1    CIA
     EM all
     EM+rake
2    CIA
     EM all
     EM+rake
3    CIA
     EM all
     EM+rake
Tot.

Note that different starting points of the EM algorithm, without the use of file C, would converge to different estimates: all those on the likelihood ridge (see D'Orazio et al. (2006a)). In this case, file C allows the EM algorithm to end up with a unique solution, the one in Table 3. This result, however, is not very different from the one obtained under the CIA. This could be an effect of the small size of C (833 farms) compared to the size of the whole concatenated data file ( farms).
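The two estimators just compared, the CIA estimate and an EM estimate that exploits the overlap, can be illustrated with a small sketch. The paper's computations used the R package cat on weighted counts; the following is a simplified Python analogue for a single two-way table, where the overlap C provides jointly classified (possibly weighted) counts and the rest of A and B provide the margins observed on Y only and on Z only. The function names and the toy allocation scheme are illustrative, not the paper's code:

```python
import numpy as np

def cia_table(theta_h, theta_j_h, theta_k_h):
    """CIA estimate: theta_{.jk} = sum_h theta_{h..} * theta_{j|h} * theta_{k|h}."""
    return np.einsum('h,hj,hk->jk', theta_h, theta_j_h, theta_k_h)

def em_two_way(n_both, n_y_only, n_z_only, n_iter=1000):
    """EM for a J x K probability table from (possibly weighted) counts:
    n_both (J, K): overlap C, with Y and Z jointly observed;
    n_y_only (J,): units with only Y observed; n_z_only (K,): only Z observed.
    E-step: spread the partially classified counts over the missing dimension
    using the current conditionals. M-step: re-normalize the completed table."""
    J, K = n_both.shape
    theta = np.full((J, K), 1.0 / (J * K))
    total = n_both.sum() + n_y_only.sum() + n_z_only.sum()
    for _ in range(n_iter):
        p_k_given_j = theta / theta.sum(axis=1, keepdims=True)
        p_j_given_k = theta / theta.sum(axis=0, keepdims=True)
        completed = (n_both
                     + n_y_only[:, None] * p_k_given_j
                     + n_z_only[None, :] * p_j_given_k)
        theta = completed / total
    return theta
```

As in the paper, when the overlap counts are small relative to the margins, the EM solution stays close to an independence-like (CIA) table, while a larger overlap pulls the estimate towards the association observed in C.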
Moreover, the use of the weights in the initial estimation step emphasizes the fact that the farms in C have weights equal to or slightly greater than 1 (units in the take-all strata) while, on the other hand, the remaining farms have weights much greater than 1. Since the results obtained with the EM algorithm are not far from those under the CIA, another possible use of the information in A ∩ B is investigated. The whole concatenated data file A ∪ B is split into two non-overlapping domains: small farms and large farms. The large

farms consist of the responding farms belonging to the take-all strata of the A and B surveys. This subset also includes the farms observed in both surveys (A ∩ B). All the remaining farms are considered small. The sizes of the two domains are reported in Table 4. In this context, an assumption can be that the structure of the association between Y and Z observed on the large farms holds also for the small ones.

Table 4: Number of records (m) and estimated size (N̂) of the domains

Farms    m    N̂
large
small
Tot.

In order to replicate on the small farms the structure of association observed on the large farms: (i) the EM algorithm is applied to the large farms and the corresponding table of Y vs. Z is estimated; (ii) this table is raked with respect to the margins of Y and Z estimated from the small farms. The association between Y and Z is quite well preserved: Kendall's τ_c is equal to for the large farms (step (i)), to for the small farms after raking (step (ii)) and, finally, to when small and large farms are jointly considered. The final estimates of the cells of the (Y, Z) table for the whole population are shown in the rows labelled "EM+rake" in Table 3. The new estimates are quite different from those obtained under the CIA or with the EM applied to the whole concatenated data file. Some marked differences can be detected for the cells with Y = 0, 1 and Z = (0, 1], (1, 5]. In order to understand how far apart the (Y, Z) distributions estimated under the three approaches are, the total variation distance is computed:

Δ̂ = (1/2) Σ_{jk} | θ̂^(1)_{·jk} − θ̂^(2)_{·jk} |.

Table 5: Total variation distance (Δ̂ × 100) among the tables of Y vs. Z

                      Δ̂ × 100
CIA vs EM all         1.18
CIA vs EM+rake
EM all vs EM+rake

The estimation procedure EM+rake provides quite different results from those obtained under the other two methods.
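Step (ii) above and the distance Δ̂ can be sketched as follows. Raking is ordinary iterative proportional fitting: it preserves the association (odds ratios) of the source table while matching the target margins, which is exactly the assumption of transferring the large-farm association to the small farms. This is an illustrative Python sketch, not the paper's code:

```python
import numpy as np

def rake(table, row_margin, col_margin, n_iter=200):
    """Iterative proportional fitting: alternately rescale rows and columns of
    `table` until its margins match the targets; odds ratios are preserved."""
    t = np.asarray(table, float).copy()
    for _ in range(n_iter):
        t *= (row_margin / t.sum(axis=1))[:, None]
        t *= (col_margin / t.sum(axis=0))[None, :]
    return t

def total_variation(p, q):
    """Delta-hat = 0.5 * sum_jk |p_jk - q_jk|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
```

For a 2 x 2 table the invariance is easy to check: the odds ratio of the raked table equals that of the source table, while its margins equal the targets.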
As far as association is concerned, as expected, EM+rake provides different results from those obtained under the CIA or EM all, as shown in Table 6.

Table 6: Association between Y and Z

          τ̂_c
CIA
EM all
EM+rake

If the CIA is a strong assumption that can hardly be expected to hold when studying the relationship between Y = NoSC and Z = EBITDA, it is also true that assuming that the small farms have the same degree of association as the large ones is a strong, untestable hypothesis, given the differences in the characteristics of the two types of farms. In the absence of assumptions on the structure of the association between Y = NoSC and Z = EBITDA, the remaining approach consists in evaluating the uncertainty in the estimation of the cell probabilities. In order to evaluate this uncertainty for the (Y, Z) table, the Fréchet bounds

are computed according to equations (3) and (4). The results are in Table 7. As expected, the bounds according to equation (4) are wider than those given by equation (3), but only some cells show a marked reduction of the interval width. This is especially true for the largest classes of Z and Y, as a result of the fact that the matching variables X are those used for the selection of the take-all strata, i.e. for the largest farms.

Table 7: Fréchet bounds for cell probabilities (× 100) according to equation (3). Results under equation (4) are shown in brackets

                 Z = Classes of EBITDA (thousand of Euro)
Y   bound    0             (0, 1]        (1, 5]        (5, 10]       (10, 25]      > 25
0   lower    0.00 (0.00)   3.27 (0.00)   0.01 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
    upper    (13.78)       (32.87)       (24.86)       6.70 (9.63)   5.64 (10.88)  2.59 (8.58)
1   lower    0.00 (0.00)   0.00 (0.00)   0.55 (0.00)   0.00 (0.00)   0.39 (0.00)   1.10 (0.00)
    upper    (13.78)       (32.87)       (24.86)       9.63 (9.63)   (10.88)       7.51 (8.58)
2   lower    0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.57 (0.00)
    upper    5.94 (12.26)  5.72 (12.26)  8.55 (12.26)  7.95 (9.63)   9.02 (10.88)  6.42 (8.58)
3   lower    0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.01 (0.00)
    upper    1.17 (1.56)   0.88 (1.56)   1.24 (1.56)   1.31 (1.56)   1.47 (1.56)   1.56 (1.56)

The estimated bounds according to equation (3) reported in Table 7 show high uncertainty for all the cells. In particular, the uncertainty in absolute terms appears to be high for the cells with Y = 0, 1 and Z = 0, (0, 1], (1, 5]. Note that this formula for estimating the Fréchet bounds does not take into account the joint information available for the subset of farms observed in both A and B. This information can be exploited by separating the subset A ∩ B from the other farms and using formula (5) with D = 2. A ∩ B allows the direct estimation of the (Y, Z) table, i.e. the interval collapses to a single value for each cell. By contrast, for the farms that are not in A ∩ B, only the Fréchet bounds of the cell probabilities can be computed.
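Combining the directly estimated table on A ∩ B with the Fréchet bounds on the remaining farms, as in formula (5) with D = 2, amounts to a φ-weighted mixture of per-domain bounds. A minimal sketch (illustrative names):

```python
import numpy as np

def combine_domain_bounds(phi, bounds):
    """Equation (5): population-level bounds as the phi-weighted mixture of
    per-domain bounds.
    phi: (D,) relative domain sizes, summing to 1;
    bounds: list of D (lower, upper) pairs of (J, K) arrays. For a domain where
    (Y, Z) is directly estimable (such as the overlap), pass the point estimate
    as both lower and upper, so its interval has zero width."""
    phi = np.asarray(phi, float)
    low = sum(p * l for p, (l, u) in zip(phi, bounds))
    up = sum(p * u for p, (l, u) in zip(phi, bounds))
    return low, up
```

As observed in the text, when the zero-width domain carries a small φ (such as the 833 overlap farms here), the combined bounds remain close to the bounds of the dominant domain.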
The two results can then be combined as in (5). These new bounds are not reported here, because they are similar to those in Table 7. Again, this result is expected, given the small size of the subset A ∩ B compared to the rest of the file.

5 Final Comments

Statistical matching can help give insight into the relationships between variables that have not been jointly observed. When the full set of variables is observed, at least for a small subset of units, some of the often unverifiable assumptions about the joint distribution of the variables can be weakened. The value of the information collected in the joint subset strongly depends on the way these data are collected. In principle, one should adopt a specific survey design targeted at the estimation of some very specific quantities or relationships, such as those related to the validity of the CIA. Note that the CIA implies a structure of relationships between variables that cannot be tested by observing only a small sample; nonetheless, the information in the common sample should be properly used. In this paper some possible strategies for taking into account the information in a small common sample have been discussed. It is assumed that the sample is already available, and a practical situation where this actually happens (enterprise surveys with complex survey designs) is presented. The design of sampling schemes for selecting the common subset when statistical matching is planned in advance goes beyond the scope of this paper and has not been considered; this is a topic that certainly deserves a more specific and detailed analysis. The analyses presented in the paper, though derived from a single real example, show that, even if a common subset of units is observed, statistical matching still remains a problem of dealing with missing values for a large majority of the units.
The set of reasonable assumptions can be made more precise and less restrictive when auxiliary information is available, but uncertainty will still be present. For this reason the approach based on the evaluation of uncertainty seems very useful, especially when adopted jointly with, and not as an alternative to, other approaches. The

approach based on uncertainty has only recently been seriously considered as a viable solution for statistical matching. Its application opens a number of new issues that are not considered here and are left for the future. The most important is certainly measuring the statistical properties of uncertainty measures. Uncertainty has also been studied in econometrics under the heading of partially identified models (see Manski, 2003), and some results on confidence regions for uncertainty bounds have been obtained, e.g. Imbens and Manski (2004). It is worth noting that the case of statistical matching would be even more complicated, since complex survey designs are involved. Note also that uncertainty analysis can be even more informative if other auxiliary information is available: a notable example is given by structural zeros or inequality constraints when estimating the entries of a two-way table (as discussed in D'Orazio et al., 2006a, 2006b). Statistical matching, like any piece of applied statistics, is more than a collection of tools and technical solutions to be applied following specific guidelines: it is a practice which requires a deep understanding of the data, of the way they are collected, and of subject-related issues. More an art than a science.

Acknowledgments

References

Ballin, M., Di Zio, M., D'Orazio, M., Scanu, M., Torelli, N. (2009), File Concatenation of Survey Data: a Computer Intensive Approach to Sampling Weights Estimation, forthcoming in Rivista di Statistica Ufficiale.
D'Orazio, M., Di Zio, M., Scanu, M. (2006a), Statistical Matching: Theory and Practice, Wiley, New York.
D'Orazio, M., Di Zio, M., Scanu, M. (2006b), Statistical matching for categorical data: displaying uncertainty and using logical constraints, Journal of Official Statistics, 22.
Deville, J.C., Särndal, C.E. (1992), Calibration estimators in survey sampling, Journal of the American Statistical Association, 87.
Fattorini, L. (2006), Applying the Horvitz-Thompson criterion in complex designs: a computer-intensive perspective for estimating inclusion probabilities, Biometrika, 93.
Harding, T., Tusell, F. (2009), cat: Analysis of categorical-variable datasets with missing values, R package version , ported to R by Ted Harding and Fernando Tusell; original by Joseph L. Schafer.
Imbens, G., Manski, C. (2004), Confidence Intervals for Partially Identified Parameters, Econometrica, 72.
Kadane, J.B. (1978), Some Statistical Problems in Merging Data Files, in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C. (reprinted in 2001, Journal of Official Statistics, 17).
Little, R.J.A., Rubin, D.B. (2002), Statistical Analysis with Missing Data, Second Edition, New York: Wiley.
Manski, C.F. (2003), Partial Identification of Probability Distributions, Springer-Verlag, New York.
Moriarity, C., Scheuren, F. (2001), Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure, Journal of Official Statistics, 17.

Moriarity, C., Scheuren, F. (2003), A Note on Rubin's Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation, Journal of Business & Economic Statistics, 21(1).
Patterson, B.H., Dayton, C.M., Graubard, B.I. (2002), Latent Class Analysis of Complex Sample Survey Data: Application to Dietary Data, Journal of the American Statistical Association, 97.
R Development Core Team (2009), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
Rässler, S. (2002), Statistical Matching: a Frequentist Theory, Practical Applications and Alternative Bayesian Approaches, Lecture Notes in Statistics, New York: Springer-Verlag.
Renssen, R.H. (1998), Use of Statistical Matching Techniques in Calibration Estimation, Survey Methodology, 24.
Rubin, D.B. (1986), Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, 4.
Schafer, J.L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman and Hall/CRC Press.
Singh, A.C., Mantel, H., Kinack, M., Rowe, G. (1993), Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption, Survey Methodology, 19.


More information

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Week 02 Module 06 Lecture - 14 Merge Sort: Analysis So, we have seen how to use a divide and conquer strategy, we

More information

User s guide to R functions for PPS sampling

User s guide to R functions for PPS sampling User s guide to R functions for PPS sampling 1 Introduction The pps package consists of several functions for selecting a sample from a finite population in such a way that the probability that a unit

More information

Transitivity and Triads

Transitivity and Triads 1 / 32 Tom A.B. Snijders University of Oxford May 14, 2012 2 / 32 Outline 1 Local Structure Transitivity 2 3 / 32 Local Structure in Social Networks From the standpoint of structural individualism, one

More information

ECLT 5810 Evaluation of Classification Quality

ECLT 5810 Evaluation of Classification Quality ECLT 5810 Evaluation of Classification Quality Reference: Data Mining Practical Machine Learning Tools and Techniques, by I. Witten, E. Frank, and M. Hall, Morgan Kaufmann Testing and Error Error rate:

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

Small area estimation by model calibration and "hybrid" calibration. Risto Lehtonen, University of Helsinki Ari Veijanen, Statistics Finland

Small area estimation by model calibration and hybrid calibration. Risto Lehtonen, University of Helsinki Ari Veijanen, Statistics Finland Small area estimation by model calibration and "hybrid" calibration Risto Lehtonen, University of Helsinki Ari Veijanen, Statistics Finland NTTS Conference, Brussels, 10-12 March 2015 Lehtonen R. and Veijanen

More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

NORM software review: handling missing values with multiple imputation methods 1

NORM software review: handling missing values with multiple imputation methods 1 METHODOLOGY UPDATE I Gusti Ngurah Darmawan NORM software review: handling missing values with multiple imputation methods 1 Evaluation studies often lack sophistication in their statistical analyses, particularly

More information

Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices

Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices Int J Adv Manuf Technol (2003) 21:249 256 Ownership and Copyright 2003 Springer-Verlag London Limited Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices J.-P. Chen 1

More information

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY ALGEBRAIC METHODS IN LOGIC AND IN COMPUTER SCIENCE BANACH CENTER PUBLICATIONS, VOLUME 28 INSTITUTE OF MATHEMATICS POLISH ACADEMY OF SCIENCES WARSZAWA 1993 ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

The Near Greedy Algorithm for Views Selection in Data Warehouses and Its Performance Guarantees

The Near Greedy Algorithm for Views Selection in Data Warehouses and Its Performance Guarantees The Near Greedy Algorithm for Views Selection in Data Warehouses and Its Performance Guarantees Omar H. Karam Faculty of Informatics and Computer Science, The British University in Egypt and Faculty of

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information

Programs for MDE Modeling and Conditional Distribution Calculation

Programs for MDE Modeling and Conditional Distribution Calculation Programs for MDE Modeling and Conditional Distribution Calculation Sahyun Hong and Clayton V. Deutsch Improved numerical reservoir models are constructed when all available diverse data sources are accounted

More information