Statistical Matching using Fractional Imputation

1 Statistical Matching using Fractional Imputation
Jae-Kwang Kim, Iowa State University
Joint work with Emily Berg and Taesung Park

2 Outline
1. Introduction
2. Classical Approaches
3. Proposed method
4. Application: Measurement error models
5. Simulation Study
6. Conclusion
Kim (ISU) Matching 2 / 35

3 Introduction: Motivation
Combine information from several surveys.
Example: two surveys
1. Survey A: observe $X$ and $Y_1$
2. Survey B: observe $X$ and $Y_2$
We want to create a data file with $X$, $Y_1$, $Y_2$.
If the Survey B sample is a subset of the Survey A sample, then we may use a record linkage technique to obtain the $Y_1$ value for the Survey B sample.
What if the two samples are independent?

4 Introduction
Table: A simple data structure for matching (o = observed)
            X    Y_1   Y_2
Sample A    o    o
Sample B    o          o

5 Introduction
Table: Data after statistical matching
            X    Y_1   Y_2
Sample A    o    o     o
Sample B    o    o     o
Statistical matching is also called data fusion or data combination.

6 Introduction: Example 1
Split questionnaire design
Split the original sample into two groups.
In group 1, ask $(x, y_1)$.
In group 2, ask $(x, y_2)$.
Often used to reduce the response burden (and improve the quality of the survey responses).

7 Introduction: Example 2
Combining two surveys
Survey A: health-related survey
Survey B: socio-economic survey
$x$: demographic variable, $y_1$: health status variable, $y_2$: socio-economic variable
We are interested in fitting a regression of $y_1$ (e.g. obesity) on $x$ and $y_2$ using the two surveys.
The two samples should be obtained from the same finite population.

9 Introduction: Idea
We want to create $Y_1$ for each element in sample B by finding a statistical twin in sample A.
This is often based on the assumption that $Y_1$ and $Y_2$ are conditionally independent given $X$. That is,
$Y_1 \perp Y_2 \mid X$.
Under the CI (Conditional Independence) assumption, we have
$f(y_1 \mid x, y_2) = f(y_1 \mid x)$
and the statistical twin is determined solely by how close the two elements are in terms of their $x$ values.

10 Introduction: Remark
Under the assumption that $(X, Y_1, Y_2)$ are multivariate normal, the CI assumption means that
$\sigma_{12} = \sigma_{1x}\sigma_{2x}/\sigma_{xx}$ and $\rho_{12} = \rho_{1x}\rho_{2x}$.
That is, $\sigma_{12}$ is determined from the other parameters, rather than estimated from the realized samples.
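
The identity in this remark follows from the standard formula for the conditional covariance of a multivariate normal vector. A short derivation for scalar $X$ is sketched here; the symbol $\sigma_{12 \cdot x}$ for the conditional covariance is notation introduced only for this sketch.

```latex
\begin{align*}
\sigma_{12 \cdot x} &= \mathrm{Cov}(Y_1, Y_2 \mid X)
   = \sigma_{12} - \frac{\sigma_{1x}\sigma_{2x}}{\sigma_{xx}}, \\
\sigma_{12 \cdot x} = 0
  &\iff \sigma_{12} = \frac{\sigma_{1x}\sigma_{2x}}{\sigma_{xx}} \\
  &\iff \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}
        = \frac{\sigma_{1x}}{\sqrt{\sigma_{11}\sigma_{xx}}}
          \cdot \frac{\sigma_{2x}}{\sqrt{\sigma_{22}\sigma_{xx}}}
   \iff \rho_{12} = \rho_{1x}\rho_{2x}.
\end{align*}
```

Under joint normality, zero conditional covariance is equivalent to conditional independence, which is why the CI assumption fixes $\sigma_{12}$ without any joint observation of $(Y_1, Y_2)$.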

11 Existing Methods: Methods under CI assumption
Synthetic data imputation:
1. Estimate $f(y_1 \mid x)$ from sample A, denoted by $\hat{f}_a(y_1 \mid x)$.
2. For each element in sample B, use the $x_i$ value to create imputed value(s) from $\hat{f}_a(y_1 \mid x)$.
Matching (two-step method): instead of using the synthetic values directly for imputation, the synthetic values are used to identify the statistical twins in sample A. The $y_1$ value of the identified twin in sample A is used as the imputed value.

12 Existing Methods: Some popular methods under CI assumption
Parametric approach: often based on a parametric model or regression model,
$\hat{y}_{1i} = \hat\beta_0 + \hat\beta_1 x_i$.
Nonparametric approach:
Random hot deck
Rank hot deck
Distance hot deck
Reference: D'Orazio, Di Zio, and Scanu (2006). Statistical Matching: Theory and Practice, Wiley.
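
As an illustration of the distance hot deck idea, here is a minimal sketch that picks, for each Sample B element, the Sample A donor with the closest $x$ value and copies its $y_1$. The function name `distance_hot_deck`, the scalar $x$, and the toy data are assumptions made for this example, not part of the slides.

```python
def distance_hot_deck(sample_a, x_b):
    """For each x in x_b, return the y1 of the nearest donor in sample_a.

    sample_a: list of (x, y1) pairs observed in Sample A (donors).
    x_b:      list of x values observed in Sample B (recipients).
    """
    imputed = []
    for x in x_b:
        # The statistical twin is the donor with the smallest |x - x_donor|.
        _, donor_y1 = min(sample_a, key=lambda rec: abs(rec[0] - x))
        imputed.append(donor_y1)
    return imputed


# Hypothetical toy data.
sample_a = [(1.0, 10.0), (2.0, 20.0), (4.0, 40.0)]
x_b = [1.2, 3.5, 2.1]
print(distance_hot_deck(sample_a, x_b))  # [10.0, 40.0, 20.0]
```

Because the twin is chosen using $x$ alone, this kind of matching is only justified when the CI assumption holds.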

14 New Approach: Motivation
With data matched under the CI assumption, the regression of $Y_1$ on $X$ and $Y_2$ will give an insignificant regression coefficient on $Y_2$. That is, the p-value for $\hat\beta_2$ will be large in
$\hat{y}_1 = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 y_2$.
The CI assumption is often unrealistic! For example:
1. $X$ is often a demographic variable
2. $Y_1$ is a social-behavior (or public health) variable
3. $Y_2$ is an economic variable (e.g. household income)
In this case, we may have $\mathrm{Corr}(Y_1, Y_2 \mid X) \neq 0$.

15 New Approach: Alternative interpretation
We can view the problem as an omitted variable regression problem:
$y_1 = \beta_0^{(1)} + \beta_1^{(1)} x + \beta_2^{(1)} z + e_1$
$y_2 = \beta_0^{(2)} + \beta_1^{(2)} x + \beta_2^{(2)} z + e_2$
where $z$, $e_1$, $e_2$ are never observed.
$e_1$ and $e_2$ are independent.
$z$ is an unobservable confounding factor that explains $\mathrm{Cov}(y_1, y_2 \mid x) \neq 0$.
Thus, if we fit a regression of $(y_1, y_2)$ on $x$, the error terms are still correlated.
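
The residual correlation can be made explicit with a one-line calculation. The sketch assumes, in addition to what the slide states, that $e_1$ and $e_2$ are also independent of $z$ given $x$.

```latex
\begin{align*}
\mathrm{Cov}(y_1, y_2 \mid x)
  &= \mathrm{Cov}\bigl(\beta_2^{(1)} z + e_1,\; \beta_2^{(2)} z + e_2 \mid x\bigr) \\
  &= \beta_2^{(1)} \beta_2^{(2)} \, \mathrm{Var}(z \mid x).
\end{align*}
```

Under these assumptions the conditional covariance is nonzero whenever both coefficients on $z$ are nonzero and $\mathrm{Var}(z \mid x) > 0$, so the CI assumption fails.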

16 New Approach: Instrumental variable
Under the CI assumption, imputed values are generated from $f(y_1 \mid x)$, which completely ignores the observed information of $y_2$.
Let's try to generate imputed values from $f(y_1 \mid x, y_2)$.
However, we cannot estimate the parameters in $f(y_1 \mid x, y_2)$ directly, because $y_1$ and $y_2$ are never observed together.
Use an instrumental variable assumption for identification of the models.

17 New Approach: Idea
Decompose $X = (X_1, X_2)$ such that
(i) $f(y_2 \mid x_1, x_2, y_1) = f(y_2 \mid x_1, y_1)$
(ii) $f(y_1 \mid x_1, x_2 = a) \neq f(y_1 \mid x_1, x_2 = b)$ for some $a \neq b$.
$X_2$ is often called an instrumental variable (IV) for $Y_2$.

18 New Approach: Proposed method
Under the IV assumption,
$f(y_1 \mid x, y_2) \propto f(y_1 \mid x) f(y_2 \mid x_1, y_1)$.
The second term can be ignored under the CI assumption.
The second term incorporates the observed information of $y_2$ in Sample B.
The EM algorithm can be used to perform parameter estimation and prediction simultaneously.
The E-step can be computationally heavy (Markov Chain Monte Carlo).
Metropolis-Hastings algorithm:
1. Generate a candidate $y_1$ from $\hat{f}_a(y_1 \mid x)$.
2. Accept the candidate $y_1$ if $f(y_2 \mid x_1, y_1; \hat\theta)$ is large at the current parameter value $\hat\theta$.
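
The two Metropolis-Hastings steps can be sketched as an independence sampler: candidates come from $\hat{f}_a(y_1 \mid x)$, so the acceptance probability reduces to the ratio of $f(y_2 \mid x_1, y_1; \hat\theta)$ at the candidate and at the current value. The normal linear models, the function names, and the parameter tuples below are assumptions made for illustration only.

```python
import math
import random


def normal_pdf(v, mean, sd):
    """Density of N(mean, sd^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mh_sample_y1(x, x1, y2, alpha_hat, theta_hat, n_iter=5000, seed=0):
    """Draw y1 from f(y1 | x, y2), proportional to f_a(y1 | x) * f(y2 | x1, y1).

    Assumed models (hypothetical):
      y1 | x      ~ N(a0 + a1*x, s1^2),           alpha_hat = (a0, a1, s1)
      y2 | x1, y1 ~ N(t0 + t1*x1 + t2*y1, s2^2),  theta_hat = (t0, t1, t2, s2)
    """
    rng = random.Random(seed)
    a0, a1, s1 = alpha_hat
    t0, t1, t2, s2 = theta_hat

    def lik(y1):
        return normal_pdf(y2, t0 + t1 * x1 + t2 * y1, s2)

    current = rng.gauss(a0 + a1 * x, s1)
    chain = []
    for _ in range(n_iter):
        # Step 1: candidate from the estimated f_a(y1 | x).
        candidate = rng.gauss(a0 + a1 * x, s1)
        # Step 2: accept with probability min(1, likelihood ratio).
        cur_lik = lik(current)
        accept_prob = 1.0 if cur_lik == 0.0 else min(1.0, lik(candidate) / cur_lik)
        if rng.random() < accept_prob:
            current = candidate
        chain.append(current)
    return chain


chain = mh_sample_y1(x=0.0, x1=0.0, y2=2.0,
                     alpha_hat=(0.0, 1.0, 1.0), theta_hat=(0.0, 0.0, 1.0, 1.0))
kept = chain[500:]  # discard an initial burn-in stretch
post_mean = sum(kept) / len(kept)
print(round(post_mean, 2))
```

In this toy setting the exact conditional distribution of $y_1$ is $N(1, 0.5)$, so the chain average should be close to 1. Running such a chain for every element of Sample B inside each EM iteration is what makes the E-step expensive.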

19 New Approach: Proposed method
Parametric fractional imputation (PFI) of Kim (2011) is an alternative computational tool that does not involve MCMC computation but still implements the EM algorithm with an intractable E-step.
PFI uses importance sampling: when the target distribution is
$f(y_1 \mid x, y_2) \propto f(y_1 \mid x) f(y_2 \mid x_1, y_1)$,
first generate $m$ values of $y_1$ from $f(y_1 \mid x)$ and then use a normalized version of $f(y_2 \mid x_1, y_1)$ as the weight assigned to each generated value.
Solve the weighted score equation to update the parameters in the M-step.

20 New Approach: Proposed method (parametric fractional imputation)
1. For each $i \in B$, generate $m$ imputed values of $y_1$, denoted by $y_{1i}^{(1)}, \ldots, y_{1i}^{(m)}$, from $\hat{f}_a(y_1 \mid x_i)$.
2. Let $\hat\theta_t$ be the current parameter value of $\theta$ in $f(y_2 \mid x_1, y_1)$. For the $j$-th imputed value $y_{1i}^{(j)}$, assign the fractional weight
$w_{ij} \propto f(y_{2i} \mid x_{1i}, y_{1i}^{(j)}; \hat\theta_t)$, where $\sum_{j=1}^{m} w_{ij} = 1$.
3. Solve the fractionally imputed score equation for $\theta$,
$\sum_{i \in B} w_{ib} \sum_{j=1}^{m} w_{ij} S(\theta; x_{1i}, y_{1i}^{(j)}, y_{2i}) = 0$,
to obtain the update $\hat\theta_{t+1}$, where $S(\theta; x_1, y_1, y_2) = \partial \log f(y_2 \mid x_1, y_1; \theta)/\partial\theta$.
4. Go to step 2 and continue until convergence.
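
The four steps can be sketched in a deliberately simplified setting. The sketch assumes a scalar $x$ that acts entirely as the instrument (so $x_1$ is empty), normal linear models $y_1 \mid x \sim N(a_0 + a_1 x, s_1^2)$ and $y_2 \mid y_1 \sim N(t_0 + t_1 y_1, s_2^2)$, unit weights $w_{ib} = 1$, and a fixed number of iterations in place of a convergence check. All function and variable names are hypothetical.

```python
import math
import random


def ols(xs, ys):
    """Least squares fit of ys on xs: returns (intercept, slope, residual sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b0, b1, math.sqrt(rss / (n - 2))


def pfi(sample_a, sample_b, m=30, n_iter=60, seed=1):
    """sample_a: list of (x, y1); sample_b: list of (x, y2)."""
    rng = random.Random(seed)
    # Estimate f_a(y1 | x) from Sample A.
    a0, a1, s1 = ols([x for x, _ in sample_a], [y1 for _, y1 in sample_a])
    # Step 1: m imputed y1 values per unit of B, generated once.
    draws = [[rng.gauss(a0 + a1 * x, s1) for _ in range(m)] for x, _ in sample_b]
    y2s = [y2 for _, y2 in sample_b]
    n = len(y2s)
    mean_y2 = sum(y2s) / n
    # Starting values for theta = (t0, t1, s2sq).
    t0, t1 = mean_y2, 0.0
    s2sq = sum((y2 - mean_y2) ** 2 for y2 in y2s) / n
    weights = []
    for _ in range(n_iter):  # Step 4: repeat steps 2 and 3.
        # Step 2: weights proportional to f(y2 | y1; theta), normalized per unit.
        weights = []
        for y2, ys in zip(y2s, draws):
            dens = [math.exp(-0.5 * (y2 - t0 - t1 * y1) ** 2 / s2sq) for y1 in ys]
            total = sum(dens)
            weights.append([d / total for d in dens] if total > 0 else [1.0 / m] * m)
        # Step 3: for this normal model the weighted score equation is
        # weighted least squares of y2 on the imputed y1 values.
        rows = [(w, y1, y2) for y2, ys, ws in zip(y2s, draws, weights)
                for y1, w in zip(ys, ws)]
        mean_y1 = sum(w * y1 for w, y1, _ in rows) / n  # total weight is n
        sxx = sum(w * (y1 - mean_y1) ** 2 for w, y1, _ in rows)
        sxy = sum(w * (y1 - mean_y1) * (y2 - mean_y2) for w, y1, y2 in rows)
        t1 = sxy / sxx
        t0 = mean_y2 - t1 * mean_y1
        s2sq = sum(w * (y2 - t0 - t1 * y1) ** 2 for w, y1, y2 in rows) / n
    return (t0, t1, math.sqrt(s2sq)), weights


# Hypothetical data: y1 = x + e1 in both samples, y2 = 1 + 2*y1 + e2 in Sample B.
rng = random.Random(0)
sample_a, sample_b = [], []
for _ in range(300):
    x = rng.gauss(0.0, 1.0)
    sample_a.append((x, x + rng.gauss(0.0, 1.0)))
for _ in range(300):
    x = rng.gauss(0.0, 1.0)
    y1 = x + rng.gauss(0.0, 1.0)  # never observed in Sample B
    sample_b.append((x, 1.0 + 2.0 * y1 + rng.gauss(0.0, 1.0)))

theta_hat, weights = pfi(sample_a, sample_b)
print(theta_hat)
```

The imputed values are generated only once; each iteration changes the fractional weights, not the draws, which is what lets PFI avoid MCMC inside the E-step.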

21 Remark
Fractional imputation can be understood as a tool for computing a Monte Carlo approximation of the conditional expectation given the observation.
A fractionally imputed data file can be used to estimate many different parameters. That is, if a parameter $\eta$ is defined as a solution to $E\{U(\eta; x, y_1, y_2)\} = 0$, then a consistent estimator of $\eta$ can be obtained as the solution to
$\sum_{i \in B} w_{ib} \sum_{j=1}^{m} w_{ij} U(\eta; x_i, y_{1i}^{(j)}, y_{2i}) = 0$.
Note that the above estimating equation is a Monte Carlo approximation to the following estimating equation:
$\sum_{i \in B} w_{ib} E\{U(\eta; x_i, Y_{1i}, y_{2i}) \mid x_i, y_{2i}\} = 0$.
For variance estimation, a linearization method can be used (skipped here).
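
As a small worked case of this remark, take $U(\eta; x, y_1, y_2) = y_1 - \eta$, so that $\eta$ is the mean of $y_1$. The estimating equation then has a closed-form solution, sketched below; the function name `fi_mean` and the toy numbers are assumptions for illustration.

```python
def fi_mean(values, frac_weights, unit_weights):
    """Solve sum_i w_ib sum_j w_ij (y1_ij - eta) = 0 for eta.

    values:       imputed y1 values, one list of length m per unit in B
    frac_weights: fractional weights w_ij, each inner list summing to 1
    unit_weights: weights w_ib, one per unit in B
    """
    numerator = sum(
        wb * sum(w * v for w, v in zip(ws, vs))
        for wb, ws, vs in zip(unit_weights, frac_weights, values)
    )
    return numerator / sum(unit_weights)


# Hypothetical fractionally imputed file with two units and m = 2.
values = [[1.0, 3.0], [2.0, 6.0]]
frac_weights = [[0.5, 0.5], [0.25, 0.75]]
unit_weights = [1.0, 1.0]
print(fi_mean(values, frac_weights, unit_weights))  # 3.5
```

The inner weighted sums (2.0 and 5.0 here) are the Monte Carlo approximations of $E(Y_{1i} \mid x_i, y_{2i})$ for each unit.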

23 Application to Measurement error models
We are interested in estimating $\theta$ in $f(y \mid x; \theta)$.
Instead of observing $x$, we observe $z$, which can be highly correlated with $x$.
Thus, $z$ is an instrumental variable for $x$:
$f(y \mid x, z) = f(y \mid x)$
and
$f(y \mid z = a) \neq f(y \mid z = b)$ for $a \neq b$.
In addition to the original sample, we have a separate calibration sample that observes $(x_i, z_i)$.

24 Example: Measurement error model
Table: External calibration study (o = observed)
            Z    X    Y
Sample A    o    o
Sample B    o         o
Table: Internal calibration study (o = observed)
Sample                      Z    X    Y
Validation subsample        o    o    o
Non-validation subsample    o         o

25 Remark
Internal calibration study: two-phase sampling structure
Phase one: observe $(z, y)$.
Phase two: validation subsample, observe $x$ in addition to $(z, y)$.
Imputation approach for two-phase sampling:
Estimate $f(x \mid z, y)$ from the second-phase sample.
For the elements in the phase one sample, generate $x$ from $\hat{f}(x \mid z, y)$.
For the external calibration study, we use the proposed statistical matching technique under the assumption that $f(y \mid x, z) = f(y \mid x)$.

26 Proposed method: Idea
In sample B, $x$ is a latent variable (a variable that is always missing).
The goal is to generate $x$ in Sample B from
$f(x_i \mid z_i, y_i) \propto f(x_i \mid z_i) f(y_i \mid x_i, z_i) = f(x_i \mid z_i) f(y_i \mid x_i)$.
Obtain a consistent estimator $\hat{f}_a(x \mid z)$ from sample A.
We may use a Monte Carlo EM algorithm:
E-step: Generate $x_i^{(1)}, \ldots, x_i^{(m)}$ from
$f(x_i \mid z_i, y_i; \hat\theta^{(t)}) \propto \hat{f}_a(x_i \mid z_i) f(y_i \mid x_i; \hat\theta^{(t)})$.
M-step: Solve the imputed score equation for $\theta$.

27 Fractional imputation for EM algorithm
The above E-step may be computationally challenging (it often relies on an MCMC method).
Parametric fractional imputation can be used for easy computation.
E-step:
1. Generate $x_i^{(1)}, \ldots, x_i^{(m)}$ from $\hat{f}_a(x_i \mid z_i)$ for each $i \in B$.
2. Compute the fractional weights associated with $x_i^{(j)}$ by
$w_{ij} \propto f(y_i \mid x_i^{(j)}; \hat\theta^{(t)})$ and $\sum_{j} w_{ij} = 1$.
M-step: Solve the weighted score equation for $\theta$.
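
Step 2 of the E-step can be sketched for a binary outcome. The sketch assumes a logistic model, $\mathrm{logit}\, P(y = 1 \mid x) = \gamma_0 + \gamma_x x$; the function names and the toy draws are hypothetical.

```python
import math


def expit(t):
    """Inverse logit function."""
    return 1.0 / (1.0 + math.exp(-t))


def fractional_weights(y, x_draws, gamma0, gamma_x):
    """Weights w_ij proportional to f(y_i | x_i^(j); theta), normalized to sum to 1."""
    dens = []
    for x in x_draws:
        p = expit(gamma0 + gamma_x * x)
        # Bernoulli likelihood of the observed y at this imputed x.
        dens.append(p if y == 1 else 1.0 - p)
    total = sum(dens)
    return [d / total for d in dens]


# Hypothetical unit with y = 1 and three imputed x values from f_a(x | z).
w = fractional_weights(y=1, x_draws=[-1.0, 0.0, 1.0], gamma0=0.0, gamma_x=1.0)
print([round(v, 3) for v in w])
```

With a positive slope and $y = 1$, larger imputed $x$ values receive larger weights, which is how the observed outcome informs the latent covariate.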

29 Simulation Setup
Measurement error model setup:
$y_i \sim \mathrm{Bernoulli}(p_i)$
$\mathrm{logit}(p_i) = \gamma_0 + \gamma_x x_i$
$z_i = \beta_0 + \beta_1 x_i + u_i$
$u_i \sim N(0, \sigma^2 |x_i|^{2\alpha})$ and $x_i \sim N(\mu_x, \sigma_x^2)$.
We observe $(x_i, z_i)$, $i = 1, \ldots, n_A$, in sample A. In sample B, instead of observing $(x_i, y_i)$, we observe $(z_i, y_i)$.
For the simulation, $n_A = n_B = 800$, $\gamma_0 = 1$, $\gamma_x = 1$, $\beta_0 = 0$, $\beta_1 = 1$, $\sigma^2 = 0.25$, $\alpha = 0.4$, $\mu_x = 0$, and $\sigma_x^2 = 1$.
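
One way to generate data from this setup is sketched below, using the parameter values as printed on the slide. The measurement error variance is taken to be $\sigma^2 |x_i|^{2\alpha}$, with the absolute value being an assumption needed for the power to be defined when $x_i < 0$; the function name `simulate` and the seed are also hypothetical.

```python
import math
import random


def simulate(n_a=800, n_b=800, gamma0=1.0, gamma_x=1.0, beta0=0.0, beta1=1.0,
             sigma2=0.25, alpha=0.4, mu_x=0.0, sigma2_x=1.0, seed=2013):
    """Return (sample_a, sample_b): sample_a holds (x, z), sample_b holds (z, y)."""
    rng = random.Random(seed)

    def draw_unit():
        x = rng.gauss(mu_x, math.sqrt(sigma2_x))
        # Heteroscedastic measurement error: Var(u | x) = sigma2 * |x|^(2*alpha).
        sd_u = math.sqrt(sigma2) * abs(x) ** alpha
        z = beta0 + beta1 * x + rng.gauss(0.0, sd_u)
        p = 1.0 / (1.0 + math.exp(-(gamma0 + gamma_x * x)))
        y = 1 if rng.random() < p else 0
        return x, z, y

    sample_a = []
    for _ in range(n_a):
        x, z, _y = draw_unit()
        sample_a.append((x, z))  # calibration sample: x and z observed
    sample_b = []
    for _ in range(n_b):
        _x, z, y = draw_unit()
        sample_b.append((z, y))  # main sample: x is never observed
    return sample_a, sample_b


sample_a, sample_b = simulate()
print(len(sample_a), len(sample_b))  # 800 800
```

The two samples are drawn independently from the same model, which matches the external calibration structure in which no unit has $x$ and $y$ observed together.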

30 Methods
1. Parametric fractional imputation (PFI)
2. Hot deck fractional imputation (HDFI)
3. Naive: naive estimator obtained from the logistic regression of $y_i$ on $z_i$ for $i \in B$.
4. Bayes: proposed by Guo and Little (2011). Gibbs sampling is implemented with JAGS. We used 1000 iterations of a single chain for inference, after discarding the first 500 for burn-in. We specify diffuse proper prior distributions for the Bayes estimators. Letting $\theta_1 = (\log(\sigma_x^2), \log(\sigma^2), \mu_x, \beta_0, \beta_1, \gamma_0, \gamma_x)$, we assume a priori that $\theta_1 \sim N(0, 10^6 I_7)$, where $I_7$ is a $7 \times 7$ identity matrix. The prior distribution for the power $\alpha$ is uniform on the interval $[-5, 5]$.
5. Weighted regression calibration (WRC): regression calibration method incorporating the unequal variance in the measurement error model (also considered in Guo and Little, 2011).

31 Simulation result
Table: Monte Carlo (MC) biases, variances, and mean squared errors (MSE) of point estimators of $\gamma_x$
Method    MC Bias    MC Variance    MC MSE
PFI
HDFI
Naive
Bayes
WRC
(The numeric entries of this table are missing from the transcript.)

33 Concluding Remark
Statistical matching is a tool for survey data integration.
The current practice of statistical matching is based on the conditional independence assumption, which may not be realistic in practice.
A new approach based on an instrumental variable is proposed.
The proposed method provides statistically valid regression coefficients for the matched data even when the CI assumption does not hold.
Variance estimation is possible (not covered here).
The method is directly applicable to measurement error model problems and split questionnaire design problems.

34 Future research
Semi-parametric inference by making $\hat{f}_a(y_1 \mid x)$ nonparametric in
$f(y_1 \mid x, y_2) \propto f(y_1 \mid x) f(y_2 \mid x_1, y_1)$.
Application to causal inference: estimation of the average treatment effect from observational studies when we cannot observe the counterfactual outcomes.
Combination of two data sources: one from a probability sample and the other from a non-probability sample.

35 The end
