Statistical matching: conditional independence assumption and auxiliary information

Training Course: Record Linkage and Statistical Matching
Mauro Scanu, Istat, scanu [at] istat.it

Outline
The conditional independence model (CIA)
Parametric macro methods: the normal case, maximum likelihood
Parametric micro methods: an overview
Nonparametric macro methods: an overview
Nonparametric micro methods: random hot deck, conditional random hot deck, rank hot deck, distance hot deck, constrained hot deck
Auxiliary information

Statistical matching
Let us assume that data are collected in two sample surveys, A and B, of sizes $n_A$ and $n_B$, drawn from the same population. Some variables X are observed in both samples. Variables Y are observed only in survey A. Variables Z are observed only in survey B. The goal is inference on (X,Y,Z), or at least on the bivariate (Y,Z).

Statistical matching Goal: estimation of parameters describing (Y,Z) or (X,Y,Z)

A first identifiable model
Let us restrict the class of models F for (X,Y,Z) to the following set:
$f(x,y,z) = f_{Y|X}(y \mid x)\, f_{Z|X}(z \mid x)\, f_X(x)$,
where $f_{Y|X}$ is the conditional density of Y given X, $f_{Z|X}$ is the conditional density of Z given X, and $f_X$ is the marginal density of X.
Consequence 1: under this class of distributions, Y and Z are conditionally independent given X (the conditional independence assumption, CIA).
Consequence 2: this model is identifiable for $A \cup B$.
Note: this is not the only identifiable model!
Help: this model can be useful in many different cases (use of proxy variables and uncertainty)!

The different matching contexts
Matching contexts are classified by approach (parametric or nonparametric) and by output (macro or micro).
Let's tackle this problem in a familiar context for inferential statistics: data are drawn according to a probability law that follows a parametric model, and the objective is macro. In the following we will mainly consider two distributions: the normal and the multinomial.

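Parametric macro methods
In a parametric model, each probability law that can generate our sample data is described by a finite number of parameters. Under the CIA, given the sample $A \cup B$, the likelihood function becomes:
$L(\theta \mid A \cup B) = \prod_{a=1}^{n_A} f_X(x_a; \theta_X)\, f_{Y|X}(y_a \mid x_a; \theta_{Y|X}) \prod_{b=1}^{n_B} f_X(x_b; \theta_X)\, f_{Z|X}(z_b \mid x_b; \theta_{Z|X})$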

Parametric macro methods
Parameter estimation becomes straightforward:
Use the sample $A \cup B$ for estimating $\theta_X$
Use A for estimating $\theta_{Y|X}$
Use B for estimating $\theta_{Z|X}$

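Parametric macro methods: the normal case
Let (X,Y,Z) be a three-variate normal r.v. with parameters: the mean vector $\mu = (\mu_X, \mu_Y, \mu_Z)$ and the covariance matrix $\Sigma$ with elements $\sigma_{XX}, \sigma_{XY}, \sigma_{XZ}, \sigma_{YY}, \sigma_{YZ}, \sigma_{ZZ}$.
Under the CIA, the parameter $\sigma_{YZ}$ is superfluous, because it is determined by the others: $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ}/\sigma_{XX}$.
For the statistical matching problem, it is convenient to consider the equivalent distribution defined by this parameterization: $\theta_X$, $\theta_{Y|X}$, $\theta_{Z|X}$.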

Parametric macro methods: the normal case Estimates for the re-parameterization

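Parametric macro methods: the normal case
For the estimates of the parameters of the marginal distribution of X, the whole sample $A \cup B$ can be used:
$\hat{\mu}_X = \frac{1}{n_A + n_B}\left(\sum_{a=1}^{n_A} x_a + \sum_{b=1}^{n_B} x_b\right)$, $\hat{\sigma}_{XX} = \frac{1}{n_A + n_B}\left(\sum_{a=1}^{n_A} (x_a - \hat{\mu}_X)^2 + \sum_{b=1}^{n_B} (x_b - \hat{\mu}_X)^2\right)$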

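Parametric macro methods: the normal case
For the estimates of the parameters of the distribution of Y given X (regression intercept $\alpha_Y$, slope $\beta_{YX}$ and residual variance $\sigma_{Y|X}$), only sample A can be used:
$\hat{\beta}_{YX} = s^A_{XY}/s^A_{XX}$, $\hat{\alpha}_Y = \bar{y}_A - \hat{\beta}_{YX}\,\bar{x}_A$, $\hat{\sigma}_{Y|X} = s^A_{YY} - \hat{\beta}_{YX}^2\, s^A_{XX}$,
where $\bar{x}_A$, $\bar{y}_A$, $s^A_{XX}$, $s^A_{XY}$ and $s^A_{YY}$ are the sample means, variances and covariance computed on A.
Hence, the marginal parameters for Y are:
$\hat{\mu}_Y = \hat{\alpha}_Y + \hat{\beta}_{YX}\,\hat{\mu}_X$, $\hat{\sigma}_{YY} = \hat{\sigma}_{Y|X} + \hat{\beta}_{YX}^2\,\hat{\sigma}_{XX}$,
where $\hat{\mu}_X$ and $\hat{\sigma}_{XX}$ are the estimates of the mean and variance of X computed on $A \cup B$.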

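Parametric macro methods: the normal case
For the estimates of the parameters of the distribution of Z given X (regression intercept $\alpha_Z$, slope $\beta_{ZX}$ and residual variance $\sigma_{Z|X}$), only sample B can be used:
$\hat{\beta}_{ZX} = s^B_{XZ}/s^B_{XX}$, $\hat{\alpha}_Z = \bar{z}_B - \hat{\beta}_{ZX}\,\bar{x}_B$, $\hat{\sigma}_{Z|X} = s^B_{ZZ} - \hat{\beta}_{ZX}^2\, s^B_{XX}$,
where $\bar{x}_B$, $\bar{z}_B$, $s^B_{XX}$, $s^B_{XZ}$ and $s^B_{ZZ}$ are the sample means, variances and covariance computed on B.
Hence, the marginal parameters for Z are:
$\hat{\mu}_Z = \hat{\alpha}_Z + \hat{\beta}_{ZX}\,\hat{\mu}_X$, $\hat{\sigma}_{ZZ} = \hat{\sigma}_{Z|X} + \hat{\beta}_{ZX}^2\,\hat{\sigma}_{XX}$,
where $\hat{\mu}_X$ and $\hat{\sigma}_{XX}$ are the estimates of the mean and variance of X computed on $A \cup B$.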
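The three estimation steps for the normal case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: X, Y and Z are each univariate, variances are ML estimates (divisor n), and the function names `regression_ml` and `ml_normal_cia` are hypothetical, not taken from the course material.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def regression_ml(x, w):
    """ML regression of w on x: (slope, intercept, residual variance)."""
    mx, mw = mean(x), mean(w)
    s_xx = mean((xi - mx) ** 2 for xi in x)
    s_xw = mean((xi - mx) * (wi - mw) for xi, wi in zip(x, w))
    s_ww = mean((wi - mw) ** 2 for wi in w)
    beta = s_xw / s_xx
    return beta, mw - beta * mx, s_ww - beta * s_xw

def ml_normal_cia(xa, ya, xb, zb):
    """ML estimates for a trivariate normal (X, Y, Z) under the CIA.

    (xa, ya) are observed in sample A, (xb, zb) in sample B.
    """
    # Step 1: marginal of X from the whole sample A u B.
    x_all = list(xa) + list(xb)
    mu_x = mean(x_all)
    s_xx = mean((x - mu_x) ** 2 for x in x_all)
    # Step 2: Y given X from A only.  Step 3: Z given X from B only.
    b_yx, a_y, s_y_x = regression_ml(xa, ya)
    b_zx, a_z, s_z_x = regression_ml(xb, zb)
    return {
        "mu": (mu_x, a_y + b_yx * mu_x, a_z + b_zx * mu_x),
        "s_xx": s_xx,
        "s_yy": s_y_x + b_yx ** 2 * s_xx,
        "s_zz": s_z_x + b_zx ** 2 * s_xx,
        "s_xy": b_yx * s_xx,
        "s_xz": b_zx * s_xx,
        # Under the CIA the covariance of Y and Z follows from the rest.
        "s_yz": b_yx * b_zx * s_xx,
    }

est = ml_normal_cia([1, 2, 3], [2, 4, 6], [3, 4, 5], [4, 5, 6])
print(est["mu"], est["s_yz"])
```

Note that the estimated mean of Y is not the sample average of Y in A: it is shifted by the regression on X using the X values of both samples.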

Comment: why maximum likelihood estimation?
What happens if, instead of the previous maximum likelihood parameter estimation, we consider a direct estimation from the data set where the corresponding variable(s) are observed?
For instance, let's consider the direct estimation of the mean of Y on A (i.e. the sample average $\bar{y}_A$ of Y in A) instead of using $\hat{\mu}_Y$ (a kind of regression estimate in double sampling). For large samples,
$\mathrm{Var}(\hat{\mu}_Y) \approx \mathrm{Var}(\bar{y}_A)\left[1 - \rho_{XY}^2\, \frac{n_B}{n_A + n_B}\right]$,
where $\rho_{XY}$ is the correlation coefficient between X and Y. The gain in efficiency of the maximum likelihood estimator grows as the size of sample B increases and as X and Y become more highly correlated.

Comment: why maximum likelihood estimation?
When each parameter is estimated separately on the part of $A \cup B$ that is complete for the corresponding r.v., the estimates may be incoherent. For instance, the estimated variance and covariance matrix for (X, Y) may fail to be positive semidefinite! This does not happen when all the parameters are estimated simultaneously by maximum likelihood.

Example

Example
Under the CIA the maximum likelihood estimates of the parameters are: From the previous estimates we get:

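Parametric macro methods: the multinomial case
Let (X,Y,Z) be a multinomial r.v. with parameters $\theta = \{\theta_{ijk}\}$, $i = 1, \dots, I$, $j = 1, \dots, J$, $k = 1, \dots, K$, where $\theta$ is a vector of parameters with the following characteristics: $\theta_{ijk} \ge 0$ and $\sum_{i,j,k} \theta_{ijk} = 1$.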

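Parametric macro methods: the multinomial case
Adopting the same re-parameterization of the joint distribution, under the CIA the parameters of interest are: $\theta_{i..}$ (marginal distribution of X), $\theta_{j|i}$ (distribution of Y given X) and $\theta_{k|i}$ (distribution of Z given X).
In this context, the parameters of the joint distribution are computed according to the following formula: $\theta_{ijk} = \theta_{i..}\, \theta_{j|i}\, \theta_{k|i}$.
When the interest is only in the pairwise distribution (Y,Z): $\theta_{.jk} = \sum_{i=1}^{I} \theta_{i..}\, \theta_{j|i}\, \theta_{k|i}$.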

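Parametric macro methods: the multinomial case
Given the sample $A \cup B$, the maximum likelihood estimator is
$\hat{\theta}_{i..} = \frac{n^A_{i..} + n^B_{i..}}{n_A + n_B}$, $\hat{\theta}_{j|i} = \frac{n^A_{ij.}}{n^A_{i..}}$, $\hat{\theta}_{k|i} = \frac{n^B_{i.k}}{n^B_{i..}}$,
where $n^A_{ij.}$ and $n^B_{i.k}$ are the observed frequencies in A and B, and $n^A_{i..}$, $n^B_{i..}$ are the corresponding marginal frequencies of X.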
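As an illustration of the multinomial case, the sketch below estimates the marginal of X on both samples, the conditional of Y given X on A and the conditional of Z given X on B, and then multiplies them to obtain the joint distribution under the CIA. It assumes every category of X appears in both samples; the function names and the toy data are hypothetical.

```python
from collections import Counter

def ml_multinomial_cia(sample_a, sample_b):
    """sample_a: list of (x, y) pairs; sample_b: list of (x, z) pairs.

    Returns a dict mapping (x, y, z) to the estimated joint probability.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    nx_a = Counter(x for x, _ in sample_a)   # frequencies of X in A
    nx_b = Counter(x for x, _ in sample_b)   # frequencies of X in B
    n_xy = Counter(sample_a)                 # frequencies of (X, Y) in A
    n_xz = Counter(sample_b)                 # frequencies of (X, Z) in B
    theta = {}
    for (x, y), c_xy in n_xy.items():
        for (x_b, z), c_xz in n_xz.items():
            if x_b != x:
                continue
            p_x = (nx_a[x] + nx_b[x]) / (n_a + n_b)   # from A u B
            p_y_given_x = c_xy / nx_a[x]              # from A only
            p_z_given_x = c_xz / nx_b[x]              # from B only
            theta[(x, y, z)] = p_x * p_y_given_x * p_z_given_x
    return theta

def marginal_yz(theta):
    """Sum the joint estimates over X to get the (Y, Z) distribution."""
    out = Counter()
    for (x, y, z), p in theta.items():
        out[(y, z)] += p
    return dict(out)

a = [(1, "a"), (1, "b"), (2, "a"), (2, "a")]
b = [(1, "u"), (1, "u"), (2, "u"), (2, "v")]
theta = ml_multinomial_cia(a, b)
print(theta)
print(marginal_yz(theta))
```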

Example
Let's consider the following two samples A and B, where I=2, J=2, K=3.

Example The maximum likelihood estimates of the parameters are:

Example The maximum likelihood estimates of the parameters of the joint distribution are:

Parametric macro methods: conclusions
The CIA model is identifiable (i.e. it has a unique estimate) for the data set $A \cup B$.
The application of the maximum likelihood estimator is very easy.
Even if the problem is characterized by missing data, it can be split into three «complete data» subproblems, one for each parameter of the re-parameterization.
Other estimation methods can be statistically consistent but give incoherent estimates.

Selected references
Anderson T W (1957) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing, JASA, 52, 200-203
Anderson T W (1984) An Introduction to Multivariate Statistical Analysis, Wiley
Rubin D B (1974) Characterizing the Estimation of Parameters in Incomplete-Data Problems, JASA, 69, 467-474
D'Orazio M, Di Zio M, Scanu M (2006) Statistical matching for categorical data: displaying uncertainty and using logical constraints, JOS, 22, 137-157
Moriarity C, Scheuren F (2001) Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure, JOS, 17, 407-422

Parametric micro methods
Context: parametric approach, micro output.
We are still in the familiar context of parametric data models, but now the objective is micro, i.e. we are interested in a complete data set where (X, Y, Z) are jointly available. This is the context where imputation methods are usually used!

Objective and context Objective: to create a complete data set for (X, Y, Z) Context: partially observed data set

Parametric micro methods: rationale Method: imputation of missing values. In a parametric context: 1. Estimate the distribution parameters 2. Take a (not necessarily random) value from the estimated distribution
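The two steps of the rationale can be sketched as follows, assuming (for illustration only) a normal linear regression of Z on a single X estimated on B, with A as the file to complete. With `draw=False` the imputed value is the conditional mean, a non-random choice; with `draw=True` a normal residual is added. The names `fit_regression` and `impute_z` are hypothetical.

```python
import random

def fit_regression(x, z):
    """Step 1: ML estimates (slope, intercept, residual variance) of Z on X."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    s_xx = sum((xi - mx) ** 2 for xi in x) / n
    s_xz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / n
    s_zz = sum((zi - mz) ** 2 for zi in z) / n
    beta = s_xz / s_xx
    # Guard against a tiny negative value caused by rounding.
    resid_var = max(s_zz - beta * s_xz, 0.0)
    return beta, mz - beta * mx, resid_var

def impute_z(xa, xb, zb, draw=False, rng=None):
    """Step 2: take a value from the estimated distribution of Z given X."""
    rng = rng or random.Random()
    beta, alpha, resid_var = fit_regression(xb, zb)
    imputed = []
    for x in xa:
        value = alpha + beta * x              # conditional mean
        if draw:
            value += rng.gauss(0.0, resid_var ** 0.5)   # random residual
        imputed.append(value)
    return imputed

print(impute_z([1, 2], [3, 4, 5], [4, 5, 6]))
```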

Selected references
Little R J A, Rubin D B (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley
Rubin D B (1986) Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, 4, 87-94
Kadane J B (1978) Some Statistical Problems in Merging Data Files, in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C., 159-179. Republished in Journal of Official Statistics, 17, 423-433
Marella D, Scanu M, Conti P L (2008) On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78, 1593-1600
Conti P L, Marella D, Scanu M (2008) Evaluation of matching noise for imputation techniques based on the local linear regression estimator, Computational Statistics and Data Analysis, 53, 354-365

Non parametric macro methods
Context: nonparametric approach, macro output.
The family of distributions to which the distribution of (X, Y, Z) belongs cannot be represented by a finite number of parameters.
Although the statistical literature on nonparametrics is huge, this is by far the most neglected situation in statistical matching! We nevertheless review it in order to link macro and micro methods, as for the parametric case.

Non parametric macro methods: rationale
Although usually neglected in the statistical matching literature, these methodologies can be developed similarly to the parametric macro case. Two situations have been mainly studied:
1. X categorical, Y and Z ordered or numerical: estimation of the empirical cumulative distribution
2. X, Y and Z numerical: estimation of the nonparametric regression function
As a matter of fact, the first approach will be helpful for the random generation of imputations in the corresponding micro methods, the second for conditional mean matching and, whenever possible, for adding a random residual.

Selected references
Paass G (1985) Statistical record linkage methodology, state of the art and future prospects, in Bulletin of the International Statistical Institute, Proceedings of the 45th Session, volume LI, Book 2
Marella D, Scanu M, Conti P L (2008) On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78, 1593-1600
Conti P L, Marella D, Scanu M (2008) Evaluation of matching noise for imputation techniques based on the local linear regression estimator, Computational Statistics and Data Analysis, 53, 354-365

Non parametric micro methods
Context: nonparametric approach, micro output.
Those who applied these methods seldom assumed anything about the distribution of (X, Y, Z).
Each micro method has a macro counterpart, i.e. an implicit representation of the distribution of (X, Y, Z).
The problem is: to be aware of it or not?

Non parametric micro methods
The nonparametric micro matching methods consist of essentially three imputation procedures:
1. Random hot deck
2. Rank hot deck
3. Distance hot deck
As already seen in the parametric case, each of these methods corresponds to a specific nonparametric macro approach to the estimation of the distribution $f(x, y, z)$ or of one of its characteristic values.
In general, these methods do not organize the two data sets A and B as a unique sample $A \cup B$.

Non parametric micro methods
The idea is to consider one file as the recipient and the other as the donor.
A is the recipient file: these are the data to impute.
B is the donor file: these are the data to use for imputation.

Example
In order to define the different hot deck methods, let's consider an example. Let A and B be the following:
A: $n_A = 6$, observed variables: Gender, Age, Income
B: $n_B = 10$, observed variables: Gender, Age, Expenditures
A = recipient, B = donor
Common variables: X = ($X_1$ = Gender, $X_2$ = Age)
Y = (Income)
Z = (Expenditures)

Example

Random hot deck: the method
1. Draw one value at random from B and assign it to the first record to impute in A.
2. Follow the same procedure for all $a \in A$.
Example: in general we have $n_B^{n_A} = 10^6$ possible different ways to impute A.
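A minimal sketch of random hot deck, assuming the donor values of Z are held in a plain list and that draws are made with replacement (the function name and the toy expenditure values are hypothetical):

```python
import random

def random_hot_deck(n_recipients, donor_values, rng=None):
    """Assign to each recipient a donor value drawn at random from B."""
    rng = rng or random.Random()
    return [rng.choice(donor_values) for _ in range(n_recipients)]

# Toy donor file B with n_B = 10 values of Z, recipient file A with n_A = 6.
expenditures_b = [120, 80, 95, 140, 60, 75, 110, 130, 90, 100]
imputed = random_hot_deck(6, expenditures_b, rng=random.Random(42))
print(imputed)

# Each of the 6 recipients can receive any of the 10 donors.
print(len(expenditures_b) ** 6)
```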

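Conditional random hot deck: the method
1. Fix a conditioning variable, e.g. $X_1$.
2. For the first record a=1, draw one value at random from the subset of units in B that share its value of $X_1$ (here $X_{1b}$ = F).
3. Follow the same procedure for all $a \in A$.
Example: with $m_A = 4$ recipients and $m_B = 6$ donors in one class of $X_1$, and the remaining $n_A - m_A = 2$ recipients and $n_B - m_B = 4$ donors in the other class, the choices in the two classes are independent, so the number of different ways to impute A is $m_B^{m_A} \cdot (n_B - m_B)^{n_A - m_A} = 6^4 \cdot 4^2 = 20736$.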
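The conditional version only changes the pool of donors: each recipient draws from the donors in its own class. The sketch below assumes donors are given as (class, value) pairs and that every class present in A has at least one donor in B; names and data are hypothetical.

```python
import random
from collections import defaultdict

def conditional_random_hot_deck(recipient_classes, donors, rng=None):
    """recipient_classes: class of X1 for each record in A.
    donors: list of (class of X1, value of Z) pairs from B.
    """
    rng = rng or random.Random()
    pools = defaultdict(list)
    for x1, z in donors:
        pools[x1].append(z)          # group donor values by class
    # Draw, with replacement, inside the recipient's own class.
    return [rng.choice(pools[x1]) for x1 in recipient_classes]

donors_b = [("F", 10), ("F", 20), ("M", 30)]
print(conditional_random_hot_deck(["F", "M", "F"], donors_b, rng=random.Random(0)))
```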

Comments
1. Random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z.
2. Conditional random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z | X.
3. It is possible to eliminate the already drawn value from the set of possible donors (constrained procedure); however, the preservation of the observed distribution of Z or Z | X in B is then jeopardized.

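Rank hot deck
Let's assume that $n_B = k\, n_A$, with k an integer.
Compute the empirical cumulative distribution functions of X in A and in B:
$\hat{F}^A_X(x) = \frac{1}{n_A} \sum_{a=1}^{n_A} I(x_a \le x)$, $\hat{F}^B_X(x) = \frac{1}{n_B} \sum_{b=1}^{n_B} I(x_b \le x)$
To each $a \in A$ assign the $b^* \in B$ chosen so that
$\left|\hat{F}^A_X(x_a) - \hat{F}^B_X(x_{b^*})\right| = \min_{b \in B} \left|\hat{F}^A_X(x_a) - \hat{F}^B_X(x_b)\right|$
In other words, this method matches records whose quantiles of X are similar in A and B respectively.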

Rank hot deck
Rank the two samples A and B according to $X_1$

Rank hot deck
These are the values of the empirical cumulative distribution function of $X_1$ in A and B respectively

Rank hot deck
This is the result. In this example, there is only one way to impute a value.
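Rank hot deck can be sketched as follows: compute the empirical cumulative distribution function of X in each file, then give each recipient the donor whose value of that function is closest. The sketch assumes a single numeric X, donors given as (x, z) pairs, and ties resolved by taking the first donor in file order; names and data are hypothetical.

```python
def ecdf(values):
    """Empirical cumulative distribution function of a list of numbers."""
    n = len(values)
    return lambda x: sum(v <= x for v in values) / n

def rank_hot_deck(xa, donors):
    """xa: values of X in A; donors: list of (x, z) pairs from B."""
    f_a = ecdf(xa)
    f_b = ecdf([x for x, _ in donors])
    imputed = []
    for x in xa:
        target = f_a(x)
        # Donor whose ECDF value in B is closest to the recipient's in A.
        _, z = min(donors, key=lambda d: abs(f_b(d[0]) - target))
        imputed.append(z)
    return imputed

# X is measured on different scales in A and B: only the ranks matter.
print(rank_hot_deck([10, 20], [(1, "p"), (2, "q"), (3, "r"), (4, "s")]))
```

In this toy run the recipient with the smaller X sits at the 0.5 quantile in A and receives the donor at the 0.5 quantile in B, while the larger one receives the donor with the highest X.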

Distance hot deck
To each $a \in A$ assign the $b^* \in B$ that is nearest according to the common variables: $d(x_a, x_{b^*}) = \min_{b \in B} d(x_a, x_b)$.
This method depends on the distance function. Different choices are available.
If X is numeric, it is possible to choose the Manhattan distance. Other options include the Euclidean distance.
If X is multivariate, the available distances include the Mahalanobis and Canberra distances.
If X is categorical and unordered, it is possible to consider imputation classes (i.e. the distance is «equality»).

Distance hot deck - example
Let's consider $X_2$ as the variable used to compute distances (i.e. we choose as donors those records in B whose age is most similar to that of the record in A). If several donors are at the same minimum distance, choose one of them at random. The overall distance between donors and recipients is
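A minimal sketch of (unconstrained) distance hot deck on a single numeric variable such as age, using the absolute difference as distance and a random choice among equally near donors. The function name and the toy ages and expenditures are hypothetical.

```python
import random

def distance_hot_deck(xa, donors, rng=None):
    """xa: ages in A; donors: list of (age, z) pairs from B.

    Returns the imputed values and the overall donor-recipient distance.
    """
    rng = rng or random.Random()
    imputed, total = [], 0
    for x in xa:
        d_min = min(abs(x - x_b) for x_b, _ in donors)
        # All donors at the minimum distance; pick one at random on ties.
        nearest = [z for x_b, z in donors if abs(x - x_b) == d_min]
        imputed.append(rng.choice(nearest))
        total += d_min
    return imputed, total

print(distance_hot_deck([30, 52], [(25, 100), (33, 200), (50, 300)]))
```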

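Constrained distance hot deck
In the former procedure, a donor can be chosen more than once if it is the nearest to more than one record in A. In order to avoid reusing the same information more than once, the following constrained procedure has been defined:
a. Minimize $\sum_{a=1}^{n_A} \sum_{b=1}^{n_B} d_{ab}\, w_{ab}$
b. Under the constraints $\sum_{b=1}^{n_B} w_{ab} = 1$ for every a, $\sum_{a=1}^{n_A} w_{ab} \le 1$ for every b, $w_{ab} \in \{0, 1\}$,
where $d_{ab}$ is the distance between recipient a and donor b, and $w_{ab} = 1$ if donor b is assigned to recipient a and 0 otherwise.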

Constrained distance hot deck: advantages and disadvantages
Constrained matching preserves the marginal distribution of the variable to impute (Z).
Constrained distance hot deck generally yields a larger overall distance between donors and recipients than unconstrained distance hot deck.

Constrained distance hot deck: example The overall donor recipient distance is
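For very small files the constrained problem can be solved by brute force, which makes the trade-off easy to see. The sketch below assumes $n_B \ge n_A$, a single numeric matching variable and the absolute difference as distance; it enumerates every assignment that uses each donor at most once. It is an illustration only: for files of realistic size one would use a dedicated assignment solver (for example `scipy.optimize.linear_sum_assignment`). Names and data are hypothetical.

```python
from itertools import permutations

def constrained_distance_hot_deck(xa, donors):
    """xa: values of X in A; donors: list of (x, z) pairs from B.

    Returns the imputed values and the minimum overall distance when
    each donor can be used at most once.
    """
    best_total, best_assignment = None, None
    # Every ordered choice of len(xa) distinct donors is one assignment.
    for assignment in permutations(range(len(donors)), len(xa)):
        total = sum(abs(x - donors[b][0]) for x, b in zip(xa, assignment))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, assignment
    return [donors[b][1] for b in best_assignment], best_total

donors_b = [(30, 100), (40, 200), (29, 300)]
print(constrained_distance_hot_deck([30, 31], donors_b))
```

In this toy example the unconstrained method would give both recipients the donor aged 30, with overall distance 1; forbidding reuse raises the overall distance to 2.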

Constrained distance hot deck: comments
Distance hot deck is equivalent to the estimation of a nonparametric regression function via the k-nearest-neighbour (kNN) method with k=1.
These methods always produce live data (values actually observed) as imputations.
Sometimes, parametric and nonparametric procedures are applied together: mixed methods.
Example:
Regression step: impute intermediate values in A and B.
Matching step: use a distance hot deck, selecting the donor b* with the shortest distance.
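One possible reading of the mixed method is sketched below, under assumptions that go beyond the slide: a linear regression of Z on a single X is estimated on B, the intermediate value for each record of A is its predicted Z, and the donor is the record of B whose observed Z is closest to that intermediate value. The final imputation is therefore always a live value. Function names and data are hypothetical.

```python
def fit_line(x, z):
    """Least-squares slope and intercept of z on x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    s_xx = sum((xi - mx) ** 2 for xi in x)
    s_xz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    beta = s_xz / s_xx
    return beta, mz - beta * mx

def mixed_matching(xa, xb, zb):
    # Regression step: intermediate (predicted) value of Z for each a in A.
    beta, alpha = fit_line(xb, zb)
    intermediate = [alpha + beta * x for x in xa]
    # Matching step: donate the observed z in B nearest to that value.
    return [min(zb, key=lambda z: abs(z - z_tilde)) for z_tilde in intermediate]

print(mixed_matching([1.1, 2.9], [1, 2, 3], [2.1, 3.9, 6.2]))
```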

Selected references
Kadane J B (1978) Some Statistical Problems in Merging Data Files, in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C., 159-179. Republished in Journal of Official Statistics, 17, 423-433
Little R J A, Rubin D B (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley
Okner B A (1972) Constructing a new data base from existing microdata sets: the 1966 merge file, Annals of Economic and Social Measurement, 1, 325-342
Rodgers W L (1984) An Evaluation of Statistical Matching, Journal of Business and Economic Statistics, 2, 91-102
Rubin D B (1986) Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, 4, 87-94
Sims C A (1972) Comments on Okner, Annals of Economic and Social Measurement, 1, 343-345
Singh A C, Mantel H, Kinack M, Rowe G (1993) Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption, Survey Methodology, 19, 59-79
Marella D, Scanu M, Conti P L (2008) On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78, 1593-1600
Conti P L, Marella D, Scanu M (2008) Evaluation of matching noise for imputation techniques based on the local linear regression estimator, Computational Statistics and Data Analysis, 53, 354-365

Auxiliary information
In order to avoid the CIA, two different kinds of auxiliary information have usually been considered:
1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed
2) a plausible value of the inestimable parameters of either (Y, Z | X) or (Y, Z)
These additional sources of information may originate from: an outdated statistical investigation; an administrative register; a supplemental (even small) ad hoc survey; proxy variables for (Y, Z).
Always pay attention to their accuracy!

Auxiliary information on parameters
Auxiliary information on parameters can be in terms of:
information about $\theta_{YZ|X}$
information about $\theta_{YZ}$
This kind of information restricts the parameter space $\Theta$ to a subspace $\Theta^*$, where $\Theta^*$ contains all the parameters $\theta \in \Theta$ compatible with the auxiliary information.
NOTE: most of the time the unconstrained maximum likelihood estimate is not compatible with this information. This leads to estimates of the estimable parameters that differ from those obtained under the CIA.

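Example: auxiliary information on $\rho_{YZ} = \rho^*_{YZ}$
Let us suppose that the other two correlations, $\rho_{XY}$ and $\rho_{XZ}$, are given, and let $\rho$ denote the resulting correlation matrix of (X, Y, Z). The value $\rho^*_{YZ} = 0.7$ is compatible, since $\det(\rho) = 0.096 > 0$, while $\rho^*_{YZ} = 0.9$ is not compatible, since $\det(\rho) = -0.008 < 0$.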
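The compatibility check can be reproduced numerically. The correlations of X with Y and of X with Z used below (0.9 and 0.6) are an assumption: they are the values backed out from the two determinants quoted in this example (0.096 and -0.008), not values stated in the text. The function names are hypothetical.

```python
def corr_det(r_xy, r_xz, r_yz):
    """Determinant of the 3x3 correlation matrix of (X, Y, Z)."""
    return 1 + 2 * r_xy * r_xz * r_yz - r_xy ** 2 - r_xz ** 2 - r_yz ** 2

def compatible_range(r_xy, r_xz):
    """Values of r_yz that keep the correlation matrix positive semidefinite."""
    centre = r_xy * r_xz   # the value implied by the CIA
    half_width = ((1 - r_xy ** 2) * (1 - r_xz ** 2)) ** 0.5
    return centre - half_width, centre + half_width

R_XY, R_XZ = 0.9, 0.6   # assumed values, see the note above
for r_yz in (0.7, 0.9):
    det = corr_det(R_XY, R_XZ, r_yz)
    print(r_yz, round(det, 3), "compatible" if det >= 0 else "not compatible")
print(compatible_range(R_XY, R_XZ))
```

Under these assumed inputs the admissible interval for the correlation between Y and Z is centred on the CIA value 0.54, and 0.9 falls just outside it.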

Use of a third file C, complete on X,Y,Z
In a parametric macro approach: use the EM algorithm on the union of A, B and C.
In a parametric micro approach: use conditional mean matching.
In a nonparametric micro approach: use hot deck (first impute A with records from C, then impute live B values on A using the imputed records).
Parametric and nonparametric methods can also be mixed (e.g. impute A records using a conditional mean matching that makes use of an additional file C, then impute live B values).

Selected references
Rässler S (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer
Moriarity C, Scheuren F (2001) Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure, Journal of Official Statistics, 17, 407-422
Moriarity C, Scheuren F (2003) A Note on Rubin's Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation, Journal of Business and Economic Statistics, 21, 65-73
Moriarity C, Scheuren F (2004) Regression based statistical matching: recent developments, Proceedings of the Section on Survey Research Methods, American Statistical Association
D'Orazio M, Di Zio M, Scanu M (2006) Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints, Journal of Official Statistics, 22, 137-157
Singh A C, Mantel H, Kinack M, Rowe G (1993) Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption, Survey Methodology, 19, 59-79