Statistical Matching of Two Surveys with a Common Subset

Marco Ballin, Marcello D'Orazio, Marco Di Zio, Mauro Scanu, Nicola Torelli

Statistical Matching of Two Surveys with a Common Subset

Working Paper n.

Statistical Matching of Two Surveys with a Common Subset

Marco Ballin, Marcello D'Orazio, Marco Di Zio, Mauro Scanu, Nicola Torelli

ISTAT, Italian National Statistical Institute, Rome, Italy
ballin@istat.it, madorazi@istat.it, dizio@istat.it, scanu@istat.it
Università di Trieste
nicolat@econ.univ.trieste.it

Abstract

Statistical matching techniques aim at combining the information available in two distinct datasets. The data are often collected in two independent surveys, and it is usually assumed that the records in the two datasets refer to different units. When a non-negligible number of units is included in both surveys, one obtains valuable information that can be used in combining data from the two sources. This is, for instance, the case when the two datasets contain data collected in enterprise surveys. In this paper, statistical matching approaches for the case where complete data are available on a common subset of units are presented and discussed. The case of survey data collected in two agricultural surveys whose designs are negatively coordinated is considered.

Key words: file concatenation, uncertainty, missing data, negative coordination, agricultural enterprise surveys.

1 Introduction

Statistical matching techniques aim at combining the information available in two distinct datasets. The two datasets, A and B, often contain data collected in two independent sample surveys of size n_A and n_B respectively, such that (i) the two samples contain distinct units; (ii) the two samples contain information on some variables X (common variables), while other variables are observed in only one of the two samples, say Y in A and Z in B. The common variables X can, for instance, be used to create synthetic records containing information on the joint distribution of (X, Y, Z), but the properties of the resulting synthetic archive need careful examination, and one often has to rely upon restrictive assumptions such as the conditional independence assumption (CIA) of Y and Z given X.
Where large-scale national surveys are concerned, it is often reasonable to assume that the probability that the same unit is included in both samples is zero. Nonetheless, when a subset of units is included in both surveys and data on the joint distribution of (X, Y, Z) are collected for them, this information should be taken into account in the statistical matching procedure: data on the common subset can provide valuable auxiliary information to alleviate the CIA. In principle, if it were known in advance that statistical matching will be used to combine data from the two surveys, it could be convenient to design a small supplementary survey to collect data on (X, Y, Z) for some units of the common population, as suggested by Renssen (1998); see also Singh et al. (1993). More generally, when data are collected according to complex survey designs, it is likely that a small set of units is included in both samples. This is, for instance, true for enterprise surveys: in both surveys, the sample designs assign high inclusion probability (usually close or equal to 1) to large enterprises. It is worth noting that, in the case of enterprise surveys, the common subset may not be representative of the target population, since it might

1 Marco Ballin is researcher, Istat, via Ravà 150, Roma, Italy (ballin@istat.it); Marcello D'Orazio is researcher, Istat, via Cesare Balbo 16, Roma, Italy (madorazi@istat.it); Marco Di Zio is researcher, Istat, via Cesare Balbo 16, Roma, Italy (dizio@istat.it); Mauro Scanu is researcher, Istat, via Cesare Balbo 16, Roma, Italy (scanu@istat.it); Nicola Torelli is professor, Università di Trieste, piazzale Europa 1, Trieste (nicolat@econ.univ.trieste.it).

include only large enterprises. In both cases it is not clear how this extra information can be appropriately exploited in a statistical matching procedure. In this paper, strategies for statistical matching of data collected in two complex sample surveys with a common subset are presented. Since data from two complex surveys are combined, the proposed strategies should also take into account the appropriate use of survey weights. As already recalled, the use of data from a common subset was considered by Renssen (1998). His approach explicitly takes into account the availability of data collected in a small supplementary survey specifically designed to observe (X, Y, Z), and is based on calibration of the survey weights in order to exploit this information. This procedure, which relies upon a sequence of calibration steps, cannot easily be generalized to cover the more complex situations that emerge when the common subset is not obtained from a specifically designed supplementary survey. A second approach, useful for statistical matching of data from two complex surveys with a common subset, is file concatenation, as proposed by Rubin (1986). This procedure has rarely been applied, since the evaluation of the sampling weights of the concatenated file can be very difficult (especially when the two sample designs explicitly admit that the probability that some units enter both samples is not negligible). Strategies to overcome this problem are considered in this paper, and an application of file concatenation for matching data from two surveys with a common subset is proposed. A third approach is the analysis of uncertainty, where properties of the unobserved (Y, Z) distribution (in terms of intervals of plausible values) are inferred from the marginal and conditional distributions actually estimable from surveys A and B (see D'Orazio et al. (2006a), ch. 4).
The paper is motivated by the application of statistical matching techniques to data collected in two important Italian surveys on agricultural enterprises: the Farm Structure Survey (hereafter FSS) and the Farm Accountancy Data Network survey (FADN). In this case an extreme situation emerges, since the common subset comprises almost exclusively large enterprises. This happens because the two survey designs are negatively coordinated in order to reduce respondent burden. The organization of the paper is as follows. Section 2 describes the statistical matching strategies recalled above and proposes a feasible procedure to properly take into account the data collected on the common subset. Section 3 introduces the motivating problem and data: the two surveys, FSS and FADN, as well as their designs, are presented. Some results from the application of file concatenation and from the evaluation of uncertainty, taking into account the common subset, are in Section 4. Section 5 contains some final comments.

2 The Methods

The statistical matching problem in its basic form can be considered as an inferential problem with partial information. It is usually assumed that the two samples to be matched, denoted as A and B, do not overlap on the observed units; in this case the observed common set of variables X is the only available information for drawing inferences on the relationship between two other sets of variables, Y and Z, observed in A and B respectively and never jointly observed (Figure 1 (a)). This framework allows pointwise estimation of parameters on X, Y|X and Z|X, while anything related to the distribution of (Y, Z) or of (Y, Z|X) can be estimated pointwise only under some simplifying assumptions, which are untestable for the data at hand. Usually the conditional independence assumption (CIA) is adopted, i.e. Y and Z are considered independent given the common set of observed variables X.
If there is not enough evidence to justify such an assumption, different procedures proposed in the statistical matching literature can be used. One valuable source of information is an additional file C which includes observations on all the variables of interest X, Y and Z. Sometimes this file is given by a separate sample which is representative of the population under study, but it can also be obtained as the intersection of the two sample surveys A and B, as in Figure 1 (b). The statistical matching problem becomes more difficult when the samples at hand have been

Figure 1: The case of two samples A and B (a) with empty intersection and (b) with a non-empty intersection C. White spaces correspond to missing values.

drawn according to complex survey designs. In this case the problem of the treatment and harmonization of survey weights must be tackled. To date, there are only a few procedures (file concatenation, incomplete two-way stratification and synthetic two-way stratification) that have been defined, or can be properly adapted, in order to take complex survey designs into account in a statistical matching problem. Note that these procedures can be used also when C is not available or is not of adequate quality. Whatever procedure is actually used to get results on the (Y, Z) distribution, one often has to rely upon assumptions (hopefully less restrictive ones when auxiliary information in the form of the subset C is available). Since many of these assumptions cannot always be tested with the available data, it can be useful to assess the uncertainty of statistical matching.

File Concatenation

The original proposal of file concatenation in Rubin (1986) consisted in modifying the sample weights of the two surveys A and B in order to get a unique sample given by the union of A and B (A ∪ B) with survey weights representative of the population of interest. The basic idea is that the new sampling weights can be derived under the simplifying assumption that the probability of including a unit in both samples is negligible, so that the concatenated file is as in Fig. 1 (a). This is generally true for two independent sample surveys, provided that the two sample designs do not assign to some units an inclusion probability close or equal to 1. Under this assumption, the inclusion probability π_i of a record i in A ∪ B is simply:

π_i = π_{i,A} + π_{i,B},   (1)

where π_{i,A} and π_{i,B} are the inclusion probabilities of unit i in A and B, respectively.
A first consideration comes from the review in Moriarity and Scheuren (2003), p. 71, who noted: "The notion of file concatenation is appealing. However on a close examination it seems to have limited applicability." The inclusion probabilities (1) require knowledge of the inclusion probability of the records in A under the survey design of B, as well as the inclusion probability of the records in B under the survey design of A. It is worth noting that the design variables of a survey may not be available in other surveys, and for this reason the approach proposed by Rubin has seldom been applied. Nevertheless, statistical agencies usually have complete knowledge of the sample designs as well as of the sampling frames the samples are drawn from, so in principle file concatenation can be applied. A second consideration is that many survey designs (mainly for business surveys) admit units with inclusion probabilities near 1, so that the concatenated file is as in Fig. 1 (b). Hence, the actual inclusion probability of a unit in A ∪ B (where units in A ∩ B are taken only once) is:

π_i = π_{i,A} + π_{i,B} − π_{i,A∩B},   (2)

where π_{i,A∩B} is the probability that i is included in both samples. It is often not easy to evaluate π_{i,A∩B} when these terms cannot be assumed equal to 0. If the sampling frame

including both the sampling design variables used for selecting A and those used for selecting B is available, it is possible to approximate the inclusion probabilities π_{i,A∩B} in (2) following an approach based on Monte Carlo simulation, as suggested in Fattorini (2006):

1. draw T independent samples A^(t) and B^(t), t = 1, ..., T, from the population according to the two survey designs, respectively;
2. compute the proportion of samples that include unit i in both A^(t) and B^(t).

Once the new inclusion probabilities π_i for the units in the concatenated sample A ∪ B have been computed, this sample can be treated as a single survey data file containing (massive) missing values. Estimation of any parameter of (X, Y, Z) can then be performed by a method that deals with partially observed data sets, as in Little and Rubin (2002).

Incomplete and Synthetic Two-way Stratification

A different approach is outlined in Renssen (1998). This approach does not need information beyond the survey weights attached to each unit in A and B under the two survey designs. It consists in analyzing A and B separately (as well as an additional survey sample C, if it exists) after these samples have been harmonized in terms of the common statistical information on X, Y, and Z. Harmonization is performed by means of calibration procedures. When only A and B are available, calibration adjusts the weights in A and B so that the estimates of the distribution of X computed on A and on B coincide. If a file C is also available, two different procedures can be used. Incomplete two-way stratification adjusts the survey weights in C so that the estimates of the marginal distributions of Y and Z computed on C equal those computed on A and B, respectively. The adjusted file C is then used for the estimation of any parameter of interest on (Y, Z).
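The two Monte Carlo steps of Fattorini's approximation described above can be sketched in code. The following is a minimal Python sketch under simplifying assumptions (stratified simple random sampling without replacement in both designs, independent draws); the function names and the toy designs are illustrative, not the paper's code, and a real application would reproduce the actual FSS and FADN designs:

```python
import numpy as np

def draw_stratified(strata, sizes, rng):
    # strata: dict stratum label -> list of unit ids; sizes: per-stratum sample sizes.
    sample = set()
    for s, units in strata.items():
        sample.update(rng.choice(units, size=sizes[s], replace=False).tolist())
    return sample

def joint_inclusion_probs(strata_a, sizes_a, strata_b, sizes_b, units, T=2000, seed=0):
    # Step 1: draw T independent pairs of samples from the two designs.
    # Step 2: count, for each unit, how often it falls in both samples.
    rng = np.random.default_rng(seed)
    counts = {i: 0 for i in units}
    for _ in range(T):
        joint = draw_stratified(strata_a, sizes_a, rng) & draw_stratified(strata_b, sizes_b, rng)
        for i in joint:
            counts[i] += 1
    # smoothed empirical probability (the +1 terms keep it strictly positive)
    return {i: (c + 1) / (T + 1) for i, c in counts.items()}
```

A unit in a take-all stratum of both designs gets an estimated joint inclusion probability of essentially 1, exactly the situation of Fig. 1 (b); the estimated π̂_{i,A∩B} can then be plugged into (2).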
Synthetic two-way stratification consists in estimating parameters on (Y, Z) as a sum of two components: (i) the first component is an estimate based on the CIA using only A and B; (ii) the second component is an estimate of the residual of the estimate in step (i) when a model different from the CIA holds; this residual is estimated on C.

Uncertainty in Statistical Matching

When C is not available, the only conclusions that can be drawn in the statistical matching problem are intervals that include all the parameters compatible with the (X, Y) parameters (estimated on A) and the (X, Z) parameters (estimated on B). In other words, the uncertainty due to the lack of joint information on X, Y, and Z should be assessed. The statistical matching literature focuses mainly on the case where (X, Y, Z) follows a trivariate normal distribution, as in Kadane (1978), Rubin (1986), Moriarity and Scheuren (2001), Moriarity and Scheuren (2003), and Rässler (2002). The categorical case has been treated in D'Orazio et al. (2006b); see also D'Orazio et al. (2006a), ch. 4. They also show how it is possible to reduce this interval by introducing suitable constraints on the unobserved variables. Uncertainty can be estimated in the following way. Estimate the marginal distribution of the common variables X: on A ∪ B in the case of file concatenation, or either on A or on B in the approach by Renssen (1998). Estimate the conditional distributions of Y|X and Z|X from the original data sets A and B, respectively. This is straightforward for the approach in Renssen (1998), while in the concatenated file the survey weights should be appropriately modified for the presence of missing data in A ∪ B.

All the distributions admitting the estimated marginal distribution of X and the estimated conditional distributions of Y|X and Z|X are compatible with the available sample information A ∪ B, and they describe the uncertainty of statistical matching for the data sets at hand. In this paper we discuss the case where X, Y and Z are categorical. Let θ_{hjk} be the probability of the cell (X = h, Y = j, Z = k), h = 1, ..., H, j = 1, ..., J, k = 1, ..., K. According to the Fréchet bounds, when the marginal distribution of X (θ_{h··}) and the conditional distributions of Y and Z given X (θ_{j|h} and θ_{k|h}, respectively) are known (estimated), the joint probability for Y and Z (θ_{·jk}) lies in the interval with extremes:

(θ^L_{jk}, θ^U_{jk}) = ( Σ_h θ_{h··} max{0; θ_{j|h} + θ_{k|h} − 1},  Σ_h θ_{h··} min{θ_{j|h}; θ_{k|h}} ).   (3)

Note that, if X were not used, the Fréchet bounds would be:

max{0; θ_{·j·} + θ_{··k} − 1} ≤ θ_{·jk} ≤ min{θ_{·j·}; θ_{··k}}.   (4)

It is easy to prove by Jensen's inequality that the interval (3) is narrower than (4). This can be considered an effect of the exploitation of the common variables X, and it suggests a measure of how informative these variables are in the matching process. The interval (3) can be modified if there are different sets of units characterized by different levels of information. Assume the population of interest can be split into D subpopulations. Different levels of information can arise from different sets of common variables X in the different subpopulations (possibly attached through administrative archives or other sources of information). Sometimes it may happen that for some subpopulations there is no uncertainty on (Y, Z). In this case, uncertainty is defined as:

(θ^L_{jk}, θ^U_{jk}) = Σ_{d=1}^{D} φ_d (θ^L_{jk|d}, θ^U_{jk|d}),   (5)

where φ_d > 0 is the relative frequency of units in the d-th subpopulation (Σ_d φ_d = 1) and (θ^L_{jk|d}, θ^U_{jk|d}) are the bounds (computed as in (3)) of the uncertainty intervals for the probability of the cell (Y = j, Z = k) in subpopulation d.
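The bounds (3) and (4) are simple to compute once the marginal and conditional distributions have been estimated. A minimal Python sketch (illustrative names; in practice the inputs would be survey-weighted estimates):

```python
import numpy as np

def frechet_bounds_given_x(theta_h, theta_j_h, theta_k_h):
    """Equation (3): bounds on theta_{.jk} from the marginal of X and the
    conditionals of Y|X and Z|X.
    theta_h: (H,); theta_j_h: (H, J), row h is P(Y=j | X=h); theta_k_h: (H, K)."""
    tj = theta_j_h[:, :, None]      # (H, J, 1)
    tk = theta_k_h[:, None, :]      # (H, 1, K)
    w = theta_h[:, None, None]      # (H, 1, 1)
    low = (w * np.maximum(0.0, tj + tk - 1.0)).sum(axis=0)
    up = (w * np.minimum(tj, tk)).sum(axis=0)
    return low, up                  # each (J, K)

def frechet_bounds_no_x(theta_j, theta_k):
    """Equation (4): the unconditional Fréchet bounds."""
    tj = theta_j[:, None]
    tk = theta_k[None, :]
    return np.maximum(0.0, tj + tk - 1.0), np.minimum(tj, tk)
```

By the Jensen argument recalled in the text, the intervals from `frechet_bounds_given_x` are always contained in those from `frechet_bounds_no_x` computed on the implied marginals, which gives a quick check of how informative the X variables are.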
Discussion

A comparison of the accuracy of file concatenation, incomplete two-way stratification and synthetic two-way stratification has not yet been carried out. Nevertheless, some conclusions can be drawn. When the evaluation of uncertainty is the goal, adequate estimates of the X, Y|X and Z|X distributions are needed. File concatenation can be expected to be more efficient than the other methods at least for the distribution of X, given that all the observations in both samples are used. The approaches by Renssen use a method (calibration of survey weights) that may require a great computational effort. In fact, these methods require the specification of additional constraints compared to standard calibration (Deville and Särndal, 1992). These additional constraints can cause the calibration procedure to fail to converge, a problem which becomes more evident as the number of variables increases. Finally, file concatenation and the approaches suggested by Renssen differ in the kind of additional information C. Renssen supposes that C is a third sample, representative of the population of interest. By contrast, file concatenation does not consider a third sample survey (otherwise weights different from those in (2) should be considered). File concatenation naturally uses the intersection of the two samples as the joint information. This joint information is embedded

in a large sample A ∪ B with missing observations. This file should be treated with appropriate methods that tackle the problem of partially observed data. In the next sections, an example of statistical matching applied to two agricultural surveys is illustrated; it involves two overlapping sample surveys, and for this reason file concatenation is used.

3 Statistical Matching of Two Agricultural Samples

3.1 The Datasets

The FSS (henceforth A) and the FADN (henceforth B) investigate the farms that in a given year (2003 in our application) have certain characteristics in terms of Utilized Agricultural Area (UAA) and/or a certain proportion of production for sale and/or a certain size of the production unit. The population consists of approximately 1.8 million farms. Survey A is carried out every two years. Its main objective is to investigate the principal phenomena such as crops, livestock, machinery and equipment, labour force, and the characteristics of the holder's family. Sampling units are selected according to a stratified random sampling design. The strata are designed according to location (region or province), UAA, Livestock Size Unit (LSU), Economic Size Unit (ESU) and typology of the agricultural holdings. A take-all stratum (all the units in the take-all stratum are selected into the sample) contains the largest farms. The total sample size is n_A = . Survey B collects data on the economic structure and results of the farms, such as costs, added value, employment, labour cost, household income, etc. The sample is selected according to a stratified random sampling design. The strata are defined according to region or province code, typology classification (first digit), ESU classes, and working-day classes. A take-all stratum contains the largest farms in terms of ESU. The sample consists of n_B = units. The selection of the units in the two surveys is negatively coordinated in order to reduce the response burden. Nevertheless, it is not possible to avoid that the largest farms are included in both surveys.
The overlap between the two surveys resulted in n_{A∩B} = farms. Among these, farms belong to the take-all stratum of A. Table 1 summarizes the details concerning the sample sizes.

Table 1: Sample sizes

                   OUT FSS sample   IN FSS sample   Total
OUT FADN sample
IN FADN sample
Total

3.2 Deriving Sampling Weights of the Concatenated File

The concatenation of A and B gives a unique sample of size n_{A∪B} = n_A + n_B − n_{A∩B} = . As described in Section 2, the probability that unit i belongs to both A and B can be estimated by applying the following steps:

1. draw T pairs of independent samples (A^(t), B^(t)), t = 1, ..., T, from the population of farms according to the two survey sampling designs;
2. estimate the probabilities π_{i,A∩B} through the empirical inclusion probabilities

   π̂_{i,A∩B} = ( Σ_{t=1}^{T} I^(t)_{A∩B}(i) + 1 ) / ( T + 1 ),   i = 1, 2, ..., n_{A∪B},

where I^(t)_{A∩B}(i) is 1 when unit i is included in both A^(t) and B^(t). The procedure consisted of M = iterations. Although M is not large, the procedure seems quite efficient, since the differences between the empirical probabilities computed at the 2000th iteration and at M = were not significant. More details on how these probabilities are computed can be found in Ballin et al. (2009). The inclusion probabilities π_{A∩B} have been estimated by considering the union of the two theoretical samples but, in practice, unit nonresponse in the two surveys must be considered. Denoting by m the number of responding units in the samples at hand, the concatenated file consists of m_{A∪B} = farms, obtained by joining the responding farms in the two surveys (m_A = and m_B = ) and counting just once the 833 farms that responded to both surveys. In order to deal with unit nonresponse, the estimated inclusion probabilities have been corrected, as in the original surveys, by a calibration based on auxiliary information coming from administrative archives. From now on, the weights w_{A∪B} in the concatenated file are the inverses of the joint inclusion probabilities calibrated to known population totals.

4 Some Results

In order to illustrate how the concatenated file can be used to explore the relationship between Y and Z, attention is limited to a few variables. More precisely, the variables considered here are the Number of Sale Channels (NoSC) (Y), observed only in A, and the Earnings Before Interest, Taxes, Depreciation and Amortization (EBITDA) (Z), available only for the farms in B. The following common variables are considered: UAA, ESU and LSU. All the variables are actually continuous, with the exception of NoSC (the following types of sale channels have been considered: (i) direct sale to consumers; (ii) sale with contractual bonds to industrial enterprises or to business companies; (iii) sale without contractual bonds).
The distribution of the continuous variables is very skewed and, moreover, for a large number of units the observed value is zero. For the sake of simplicity, all the continuous variables have been categorized using the classes adopted when publishing official results (see Table 2). Some alternative strategies for using the concatenated file, trying to take into account the auxiliary information given by the common subset, are considered:

1. as a first step the CIA is assumed: it makes no explicit use of the auxiliary information;
2. a second possibility is to consider the concatenated file and to estimate the joint distribution (Y, Z) using statistical methods for inference with missing data;
3. since in this example the survey designs are negatively coordinated, one can split the sample into two strata, big and small farms, where only for the former a common subset is available;
4. finally, also with the aim of giving a framework to judge the results from the previous steps, the evaluation of uncertainty is considered, taking into account the two distinct strata introduced above.

In this setting, the objective of the inference is the contingency table of NoSC vs. EBITDA. This joint distribution can easily be estimated under the CIA:

θ̂_{·jk} = Σ_h θ̂_{h··} θ̂_{j|h} θ̂_{k|h}.

In the file concatenation approach, a cell probability can be estimated by simply dividing the sum of the final survey weights of the units belonging to it by the estimated population size, i.e.

Table 2: Variables considered in FSS and FADN

Variable                         No. of categories   Categories
X1 = UAA (hectares)              5                   [0, 2.5), [2.5, 5), [5, 10), [10, 25), ≥ 25
X2 = ESU (thousands of Euro)     5                   [0, 4), [4, 8), [8, 40), [40, 100), ≥ 100
X3 = LSU                         2                   0, ≥ 1
Y = NoSC                         4                   0, 1, 2, 3
Z = EBITDA (thousands of Euro)   6                   ≤ 0, (0, 1], (1, 5], (5, 10], (10, 25], > 25

θ̂_{h··} = Σ_{i∈A∪B} w_i I_h(x_i) / Σ_{i∈A∪B} w_i,

where I_h(x_i) is equal to 1 when x_i = h and zero otherwise. Results obtained under the CIA are shown in Table 3. The CIA is a strong assumption, and it may not be a valid one in this context. Moreover, it does not exploit the information available in A ∩ B. In order to exploit this information, the table of Y vs. Z is estimated from the whole concatenated data file directly, by applying the EM algorithm (see Schafer, 1997). As in Patterson et al. (2002), the EM algorithm is applied substituting the unweighted cell counts with the weighted cell counts. The results are reported in Table 3 in the rows labelled "EM all". The final estimates are obtained with the version of the EM algorithm in the package cat (written by Schafer and ported to R by Harding and Tusell, 2009) developed for the R environment (R Development Core Team, 2009).

Table 3: Estimates of (Y, Z) under the three different approaches

           Z = Classes of EBITDA (thousand of Euro)
Y    Method      0   (0, 1]   (1, 5]   (5, 10]   (10, 25]   > 25   Tot.
0    CIA
     EM all
     EM+rake
1    CIA
     EM all
     EM+rake
2    CIA
     EM all
     EM+rake
3    CIA
     EM all
     EM+rake
Tot.

Note that different starting points of the EM algorithm, without the use of file C, would converge to different estimates: all those on the likelihood ridge (see D'Orazio et al. (2006a)). In this case, file C allows the EM algorithm to end up with a unique solution, the one in Table 3. This result, however, is not very different from the one obtained under the CIA. This could be an effect of the small size of C (833 farms) compared to the size of the whole concatenated data file ( farms).
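The two estimators just compared, the CIA estimate and an EM estimate that exploits the overlap, can be illustrated with a small sketch. The paper's computations used the R package cat on weighted counts; the following is a simplified Python analogue for a single two-way table, where the overlap C provides jointly classified (possibly weighted) counts and the rest of A and B provide the margins observed on Y only and on Z only. The function names and the toy allocation scheme are illustrative, not the paper's code:

```python
import numpy as np

def cia_table(theta_h, theta_j_h, theta_k_h):
    """CIA estimate: theta_{.jk} = sum_h theta_{h..} * theta_{j|h} * theta_{k|h}."""
    return np.einsum('h,hj,hk->jk', theta_h, theta_j_h, theta_k_h)

def em_two_way(n_both, n_y_only, n_z_only, n_iter=1000):
    """EM for a J x K probability table from (possibly weighted) counts:
    n_both (J, K): overlap C, with Y and Z jointly observed;
    n_y_only (J,): units with only Y observed; n_z_only (K,): only Z observed.
    E-step: spread the partially classified counts over the missing dimension
    using the current conditionals. M-step: re-normalize the completed table."""
    J, K = n_both.shape
    theta = np.full((J, K), 1.0 / (J * K))
    total = n_both.sum() + n_y_only.sum() + n_z_only.sum()
    for _ in range(n_iter):
        p_k_given_j = theta / theta.sum(axis=1, keepdims=True)
        p_j_given_k = theta / theta.sum(axis=0, keepdims=True)
        completed = (n_both
                     + n_y_only[:, None] * p_k_given_j
                     + n_z_only[None, :] * p_j_given_k)
        theta = completed / total
    return theta
```

As in the paper, when the overlap counts are small relative to the margins, the EM solution stays close to an independence-like (CIA) table, while a larger overlap pulls the estimate towards the association observed in C.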
Moreover, the use of the weights in the initial estimation step emphasizes the fact that the farms in C have weights equal to or slightly greater than 1 (units in the take-all strata) while, on the other hand, the remaining farms have weights much greater than 1. Since the results obtained with the EM algorithm are not far from those under the CIA, another possible use of the information in A ∩ B is investigated. The whole concatenated data file A ∪ B is split into two non-overlapping domains: small farms and large farms. The large

farms consist of the responding farms belonging to the take-all strata of the A and B surveys. This subset also includes the farms observed in both surveys (A ∩ B). All the remaining farms are considered small. The sizes of the two domains are reported in Table 4. In this context, an assumption can be that the structure of the association between Y and Z observed on the large farms holds also for the small ones.

Table 4: Number of records (m) and estimated size (N̂) of the domains

Farms    m    N̂
large
small
Tot.

In order to replicate on the small farms the structure of association observed on the large farms: (i) the EM algorithm is applied to the large farms and the corresponding table of Y vs. Z is estimated; (ii) this table is raked with respect to the margins of Y and Z estimated from the small farms. The association between Y and Z is quite well preserved: Kendall's τ_c is equal to for the large farms (step (i)), to for the small farms after raking (step (ii)) and, finally, to when small and large farms are jointly considered. The final estimates of the cells of the (Y, Z) table for the whole population are shown in the rows labelled "EM+rake" in Table 3. The new estimates are quite different from those obtained under the CIA or with the EM applied to the whole concatenated data file. Some marked differences can be detected for the cells with Y = 0, 1 and Z = (0, 1], (1, 5]. In order to understand how far apart the (Y, Z) distributions estimated under the three approaches are, the total variation distance is computed:

Δ̂ = (1/2) Σ_{jk} | θ̂^(1)_{·jk} − θ̂^(2)_{·jk} |.

Table 5: Total variation distance (Δ̂ × 100) among the tables of Y vs. Z

                      Δ̂ × 100
CIA vs EM all         1.18
CIA vs EM+rake
EM all vs EM+rake

The estimation procedure EM+rake provides quite different results from those obtained under the other two methods.
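Step (ii) above and the distance Δ̂ can be sketched as follows. Raking is ordinary iterative proportional fitting: it preserves the association (odds ratios) of the source table while matching the target margins, which is exactly the assumption of transferring the large-farm association to the small farms. This is an illustrative Python sketch, not the paper's code:

```python
import numpy as np

def rake(table, row_margin, col_margin, n_iter=200):
    """Iterative proportional fitting: alternately rescale rows and columns of
    `table` until its margins match the targets; odds ratios are preserved."""
    t = np.asarray(table, float).copy()
    for _ in range(n_iter):
        t *= (row_margin / t.sum(axis=1))[:, None]
        t *= (col_margin / t.sum(axis=0))[None, :]
    return t

def total_variation(p, q):
    """Delta-hat = 0.5 * sum_jk |p_jk - q_jk|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
```

For a 2 x 2 table the invariance is easy to check: the odds ratio of the raked table equals that of the source table, while its margins equal the targets.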
As far as association is concerned, as expected, EM+rake provides different results from those obtained under the CIA or EM all, as shown in Table 6.

Table 6: Association between Y and Z

          τ̂_c
CIA
EM all
EM+rake

If the CIA is a strong assumption that can hardly be expected to hold when studying the relationship between Y = NoSC and Z = EBITDA, it is also true that assuming that the small farms have the same degree of association as the large ones is a strong, untestable hypothesis, given the differences in the characteristics of the two types of farms. In the absence of assumptions on the structure of the association between Y = NoSC and Z = EBITDA, the remaining approach consists in evaluating the uncertainty in the estimation of the cell probabilities. In order to evaluate this uncertainty for the (Y, Z) table, the Fréchet bounds

are computed according to equations (3) and (4). The results are in Table 7. As expected, the bounds according to equation (4) are wider than those given by equation (3), but only some cells show a marked reduction of the interval width. This is especially true for the largest classes of Z and Y, as a result of the fact that the matching variables X are those used for the selection of the take-all strata, i.e. for the largest farms.

Table 7: Fréchet bounds for cell probabilities (× 100) according to equation (3). Results under equation (4) are shown in brackets

                 Z = Classes of EBITDA (thousand of Euro)
Y   bound    0             (0, 1]        (1, 5]        (5, 10]       (10, 25]      > 25
0   lower    0.00 (0.00)   3.27 (0.00)   0.01 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
    upper    (13.78)       (32.87)       (24.86)       6.70 (9.63)   5.64 (10.88)  2.59 (8.58)
1   lower    0.00 (0.00)   0.00 (0.00)   0.55 (0.00)   0.00 (0.00)   0.39 (0.00)   1.10 (0.00)
    upper    (13.78)       (32.87)       (24.86)       9.63 (9.63)   (10.88)       7.51 (8.58)
2   lower    0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.57 (0.00)
    upper    5.94 (12.26)  5.72 (12.26)  8.55 (12.26)  7.95 (9.63)   9.02 (10.88)  6.42 (8.58)
3   lower    0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.01 (0.00)
    upper    1.17 (1.56)   0.88 (1.56)   1.24 (1.56)   1.31 (1.56)   1.47 (1.56)   1.56 (1.56)

The estimated bounds according to equation (3) reported in Table 7 show high uncertainty for all the cells. In particular, the uncertainty in absolute terms appears to be high for the cells with Y = 0, 1 and Z = 0, (0, 1], (1, 5]. Note that this formula for estimating the Fréchet bounds does not take into account the joint information available for the subset of farms observed in both A and B. This information can be exploited by separating the subset A ∩ B from the other farms and using formula (5) with D = 2. A ∩ B allows the direct estimation of the (Y, Z) table, i.e. the interval collapses to a single value for each cell. By contrast, for the farms that are not in A ∩ B, only the Fréchet bounds of the cell probabilities can be computed.
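Combining the directly estimated table on A ∩ B with the Fréchet bounds on the remaining farms, as in formula (5) with D = 2, amounts to a φ-weighted mixture of per-domain bounds. A minimal sketch (illustrative names):

```python
import numpy as np

def combine_domain_bounds(phi, bounds):
    """Equation (5): population-level bounds as the phi-weighted mixture of
    per-domain bounds.
    phi: (D,) relative domain sizes, summing to 1;
    bounds: list of D (lower, upper) pairs of (J, K) arrays. For a domain where
    (Y, Z) is directly estimable (such as the overlap), pass the point estimate
    as both lower and upper, so its interval has zero width."""
    phi = np.asarray(phi, float)
    low = sum(p * l for p, (l, u) in zip(phi, bounds))
    up = sum(p * u for p, (l, u) in zip(phi, bounds))
    return low, up
```

As observed in the text, when the zero-width domain carries a small φ (such as the 833 overlap farms here), the combined bounds remain close to the bounds of the dominant domain.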
The two results can then be combined as in (5). These new bounds are not reported here, because they are similar to those in Table 7. Again, this result is expected, given the small size of the subset A ∩ B compared to the rest of the file.

5 Final Comments

Statistical matching can help give insight into the relationships between variables that have not been jointly observed. When the full set of variables is observed, at least for a small subset of units, some of the often unverifiable assumptions about the joint distribution of the variables can be weakened. The value of the information collected in the joint subset strongly depends on the way these data are collected. In principle, one should adopt a specific survey design targeted at the estimation of some very specific quantities or relationships, such as those related to the validity of the CIA. Note that the CIA implies a structure of relationships between variables that cannot be tested by observing only a small sample; nonetheless, the information in the common sample should be properly used. In this paper some possible strategies for taking into account the information in a small common sample have been discussed. It is assumed that the sample is already available, and a practical situation where this actually happens (enterprise surveys with complex survey designs) is presented. The design of sampling schemes for selecting the common subset when statistical matching is planned in advance goes beyond the scope of this paper and has not been considered; this is a topic that certainly deserves a more specific and detailed analysis. The analyses presented in the paper, though derived from a single real example, show that, even if a common subset of units is observed, statistical matching still remains a problem of dealing with missing values for a large majority of the units.
The set of reasonable assumptions can be made more precise and less restrictive when auxiliary information is available, but uncertainty will still be present. For this reason the approach based on the evaluation of uncertainty seems very useful, especially when adopted jointly with, and not as an alternative to, other approaches. The

approach based on uncertainty has only recently been seriously considered as a viable solution for statistical matching. Its application opens a number of new issues that are not considered here and are left for the future. The most important is certainly measuring the statistical properties of uncertainty measures. Uncertainty has also been studied in econometrics under the heading of partially identified models (see Manski, 2003), and some results on confidence regions for uncertainty bounds have been obtained, e.g. Imbens and Manski (2004). It is worth noting that the case of statistical matching would be even more complicated, since complex survey designs are involved. Note also that uncertainty analysis can be even more informative if other auxiliary information is available: a notable example is given by structural zeros or inequality constraints when estimating the entries of a two-way table (as discussed in D'Orazio et al., 2006a, 2006b). Statistical matching, like any piece of applied statistics, is more than a collection of tools and technical solutions to be applied following specific guidelines: it is a practice which requires a deep understanding of the data, of the way they are collected, and of subject-related issues. More an art than a science.

Acknowledgments

References

Ballin, M., Di Zio, M., D'Orazio, M., Scanu, M., Torelli, N. (2009), File Concatenation of Survey Data: a Computer Intensive Approach to Sampling Weights Estimation, forthcoming in Rivista di Statistica Ufficiale.
D'Orazio, M., Di Zio, M., Scanu, M. (2006a), Statistical Matching: Theory and Practice, Wiley, New York.
D'Orazio, M., Di Zio, M., Scanu, M. (2006b), Statistical matching for categorical data: displaying uncertainty and using logical constraints, Journal of Official Statistics, 22.
Deville, J.C., Särndal, C.E. (1992), Calibration estimators in survey sampling, Journal of the American Statistical Association, 87.
Fattorini, L. (2006), Applying the Horvitz-Thompson criterion in complex designs: a computer-intensive perspective for estimating inclusion probabilities, Biometrika, 93.
Harding, T., Tusell, F. (2009), cat: Analysis of categorical-variable datasets with missing values, R package version , ported to R by Ted Harding and Fernando Tusell; original by Joseph L. Schafer.
Imbens, G., Manski, C. (2004), Confidence Intervals for Partially Identified Parameters, Econometrica, 72.
Kadane, J.B. (1978), Some Statistical Problems in Merging Data Files, in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C. (reprinted in 2001, Journal of Official Statistics, 17).
Little, R.J.A., Rubin, D.B. (2002), Statistical Analysis with Missing Data, Second Edition, New York: Wiley.
Manski, C.F. (2003), Partial Identification of Probability Distributions, Springer-Verlag, New York.
Moriarity, C., Scheuren, F. (2001), Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure, Journal of Official Statistics, 17.

Moriarity, C., Scheuren, F. (2003), A Note on Rubin's Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation, Journal of Business & Economic Statistics, 21(1).
Patterson, B.H., Dayton, C.M., Graubard, B.I. (2002), Latent Class Analysis of Complex Sample Survey Data: Application to Dietary Data, Journal of the American Statistical Association, 97.
R Development Core Team (2009), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
Rässler, S. (2002), Statistical Matching: a Frequentist Theory, Practical Applications and Alternative Bayesian Approaches, Lecture Notes in Statistics, New York: Springer-Verlag.
Renssen, R.H. (1998), Use of Statistical Matching Techniques in Calibration Estimation, Survey Methodology, 24.
Rubin, D.B. (1986), Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, 4.
Schafer, J.L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman and Hall/CRC Press.
Singh, A.C., Mantel, H., Kinack, M., Rowe, G. (1993), Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption, Survey Methodology, 19.


More information

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Week 02 Module 06 Lecture - 14 Merge Sort: Analysis So, we have seen how to use a divide and conquer strategy, we

More information

User s guide to R functions for PPS sampling

User s guide to R functions for PPS sampling User s guide to R functions for PPS sampling 1 Introduction The pps package consists of several functions for selecting a sample from a finite population in such a way that the probability that a unit

More information

Transitivity and Triads

Transitivity and Triads 1 / 32 Tom A.B. Snijders University of Oxford May 14, 2012 2 / 32 Outline 1 Local Structure Transitivity 2 3 / 32 Local Structure in Social Networks From the standpoint of structural individualism, one

More information

ECLT 5810 Evaluation of Classification Quality

ECLT 5810 Evaluation of Classification Quality ECLT 5810 Evaluation of Classification Quality Reference: Data Mining Practical Machine Learning Tools and Techniques, by I. Witten, E. Frank, and M. Hall, Morgan Kaufmann Testing and Error Error rate:

More information

Bootstrap and multiple imputation under missing data in AR(1) models

Bootstrap and multiple imputation under missing data in AR(1) models EUROPEAN ACADEMIC RESEARCH Vol. VI, Issue 7/ October 2018 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.4546 (UIF) DRJI Value: 5.9 (B+) Bootstrap and multiple imputation under missing ELJONA MILO

More information

Small area estimation by model calibration and "hybrid" calibration. Risto Lehtonen, University of Helsinki Ari Veijanen, Statistics Finland

Small area estimation by model calibration and hybrid calibration. Risto Lehtonen, University of Helsinki Ari Veijanen, Statistics Finland Small area estimation by model calibration and "hybrid" calibration Risto Lehtonen, University of Helsinki Ari Veijanen, Statistics Finland NTTS Conference, Brussels, 10-12 March 2015 Lehtonen R. and Veijanen

More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

NORM software review: handling missing values with multiple imputation methods 1

NORM software review: handling missing values with multiple imputation methods 1 METHODOLOGY UPDATE I Gusti Ngurah Darmawan NORM software review: handling missing values with multiple imputation methods 1 Evaluation studies often lack sophistication in their statistical analyses, particularly

More information

Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices

Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices Int J Adv Manuf Technol (2003) 21:249 256 Ownership and Copyright 2003 Springer-Verlag London Limited Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices J.-P. Chen 1

More information

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY ALGEBRAIC METHODS IN LOGIC AND IN COMPUTER SCIENCE BANACH CENTER PUBLICATIONS, VOLUME 28 INSTITUTE OF MATHEMATICS POLISH ACADEMY OF SCIENCES WARSZAWA 1993 ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

The Near Greedy Algorithm for Views Selection in Data Warehouses and Its Performance Guarantees

The Near Greedy Algorithm for Views Selection in Data Warehouses and Its Performance Guarantees The Near Greedy Algorithm for Views Selection in Data Warehouses and Its Performance Guarantees Omar H. Karam Faculty of Informatics and Computer Science, The British University in Egypt and Faculty of

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information

Programs for MDE Modeling and Conditional Distribution Calculation

Programs for MDE Modeling and Conditional Distribution Calculation Programs for MDE Modeling and Conditional Distribution Calculation Sahyun Hong and Clayton V. Deutsch Improved numerical reservoir models are constructed when all available diverse data sources are accounted

More information