Statistical matching: conditional independence assumption and auxiliary information


Training course: Record Linkage and Statistical Matching. Mauro Scanu, Istat (scanu [at] istat.it)

Outline
- The conditional independence assumption (CIA)
- Parametric macro methods
  - The normal case
  - Maximum likelihood
- Parametric micro methods: an overview
- Nonparametric macro methods: an overview
- Nonparametric micro methods
  - Random hot deck
  - Conditional random hot deck
  - Rank hot deck
  - Distance hot deck
  - Constrained hot deck
- Auxiliary information

Statistical matching
Let us assume that data are collected in two sample surveys, A and B, of sizes $n_A$ and $n_B$, drawn from the same population.
- Some variables X are observed in both samples.
- Variables Y are observed only in survey A.
- Variables Z are observed only in survey B.
The goal is inference on (X, Y, Z), or at least on the bivariate distribution of (Y, Z).

Statistical matching Goal: estimation of parameters describing (Y,Z) or (X,Y,Z)

A first identifiable model
Let us restrict the class of models F for (X, Y, Z) to the following set:
$$f(x, y, z) = f_{Y|X}(y \mid x)\, f_{Z|X}(z \mid x)\, f_X(x),$$
where $f_{Y|X}$ is the conditional density of Y given X, $f_{Z|X}$ is the conditional density of Z given X, and $f_X$ is the marginal density of X.
Consequence 1: this class of distributions for (X, Y, Z) is characterized by the conditional independence of Y and Z given X (the conditional independence assumption, CIA).
Consequence 2: this model is identifiable for $A \cup B$.
Note: this is not the only identifiable model!
Help: this model can be useful in many different cases (use of proxy variables and uncertainty)!

The different matching contexts
Matching contexts are classified by output (macro or micro) and by approach (parametric or nonparametric).
Let's tackle this problem in a familiar context for inferential statistics: data are drawn according to a probability law that follows a parametric model, and the objective is macro. In the following we will mainly consider two distributions: the normal and the multinomial.

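Parametric macro methods
In a parametric model, each probability law that can generate our sample data can be described by a finite number of parameters. Under the CIA, given the sample $A \cup B$, the likelihood function becomes:
$$L(\theta \mid A \cup B) = \prod_{a=1}^{n_A} f_{Y|X}(y_a \mid x_a; \theta_{Y|X}) \prod_{b=1}^{n_B} f_{Z|X}(z_b \mid x_b; \theta_{Z|X}) \prod_{a=1}^{n_A} f_X(x_a; \theta_X) \prod_{b=1}^{n_B} f_X(x_b; \theta_X).$$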

Parametric macro methods
Parameter estimation becomes straightforward:
- Use the sample $A \cup B$ for estimating $\theta_X$.
- Use A for estimating $\theta_{Y|X}$.
- Use B for estimating $\theta_{Z|X}$.

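Parametric macro methods: the normal case
Let (X, Y, Z) be a three-variate normal r.v. with parameters
$$\mu = (\mu_X, \mu_Y, \mu_Z)^\top, \qquad \Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} & \sigma_{XZ} \\ \sigma_{XY} & \sigma_Y^2 & \sigma_{YZ} \\ \sigma_{XZ} & \sigma_{YZ} & \sigma_Z^2 \end{pmatrix}.$$
Under the CIA, the parameter $\sigma_{YZ}$ is superfluous, because it is determined by the other parameters: $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ}/\sigma_X^2$. For the statistical matching problem, it is convenient to consider the equivalent distribution defined by the parameterization $\theta_X$, $\theta_{Y|X}$, $\theta_{Z|X}$.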

Parametric macro methods: the normal case Estimates for the re-parameterization

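Parametric macro methods: the normal case
For the estimates of the parameters of the marginal distribution of X, the whole sample $A \cup B$ can be used:
$$\hat\mu_X = \frac{1}{n_A + n_B}\Big(\sum_{a=1}^{n_A} x_a + \sum_{b=1}^{n_B} x_b\Big), \qquad \hat\sigma_X^2 = \frac{1}{n_A + n_B}\Big(\sum_{a=1}^{n_A} (x_a - \hat\mu_X)^2 + \sum_{b=1}^{n_B} (x_b - \hat\mu_X)^2\Big).$$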

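Parametric macro methods: the normal case
For the estimates of the parameters of the distribution of Y given X, only sample A can be used: the regression of Y on X in A gives the slope $\hat\beta_{YX}$, the intercept $\hat\alpha_Y = \bar y_A - \hat\beta_{YX}\bar x_A$ and the residual variance $\hat\sigma^2_{Y|X}$. Hence, with $\hat\mu_X$ and $\hat\sigma_X^2$ estimated on $A \cup B$, the marginal parameters for Y are:
$$\hat\mu_Y = \hat\alpha_Y + \hat\beta_{YX}\hat\mu_X, \qquad \hat\sigma_Y^2 = \hat\sigma^2_{Y|X} + \hat\beta_{YX}^2\hat\sigma_X^2, \qquad \hat\sigma_{XY} = \hat\beta_{YX}\hat\sigma_X^2.$$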

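Parametric macro methods: the normal case
For the estimates of the parameters of the distribution of Z given X, only sample B can be used: the regression of Z on X in B gives the slope $\hat\beta_{ZX}$, the intercept $\hat\alpha_Z = \bar z_B - \hat\beta_{ZX}\bar x_B$ and the residual variance $\hat\sigma^2_{Z|X}$. Hence, with $\hat\mu_X$ and $\hat\sigma_X^2$ estimated on $A \cup B$, the marginal parameters for Z are:
$$\hat\mu_Z = \hat\alpha_Z + \hat\beta_{ZX}\hat\mu_X, \qquad \hat\sigma_Z^2 = \hat\sigma^2_{Z|X} + \hat\beta_{ZX}^2\hat\sigma_X^2, \qquad \hat\sigma_{XZ} = \hat\beta_{ZX}\hat\sigma_X^2.$$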
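The three estimation steps for the normal case can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: X, Y and Z are univariate, the function names `_regress` and `ml_normal_cia` are hypothetical, and the toy data are invented for the illustration.

```python
from statistics import fmean


def _regress(x, w):
    """ML simple regression of w on x: (intercept, slope, residual variance)."""
    mx, mw = fmean(x), fmean(w)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxw = sum((xi - mx) * (wi - mw) for xi, wi in zip(x, w))
    beta = sxw / sxx
    alpha = mw - beta * mx
    # ML residual variance divides by n, not n - 2
    resid_var = fmean([(wi - alpha - beta * xi) ** 2 for xi, wi in zip(x, w)])
    return alpha, beta, resid_var


def ml_normal_cia(xa, ya, xb, zb):
    """ML estimates for a trivariate normal (X, Y, Z) under the CIA.

    xa, ya: X and Y as observed in file A; xb, zb: X and Z as observed in file B.
    """
    # Step 1: marginal of X from the pooled sample A ∪ B
    x_all = list(xa) + list(xb)
    mu_x = fmean(x_all)
    var_x = fmean([(x - mu_x) ** 2 for x in x_all])
    # Step 2: Y given X from A only; Step 3: Z given X from B only
    a_y, b_y, s2_y = _regress(xa, ya)
    a_z, b_z, s2_z = _regress(xb, zb)
    return {
        "mu_x": mu_x, "var_x": var_x,
        "mu_y": a_y + b_y * mu_x, "var_y": s2_y + b_y ** 2 * var_x,
        "mu_z": a_z + b_z * mu_x, "var_z": s2_z + b_z ** 2 * var_x,
        "cov_xy": b_y * var_x, "cov_xz": b_z * var_x,
        # Under the CIA the Y-Z covariance follows from the other parameters
        "cov_yz": b_y * b_z * var_x,
    }


# Invented toy data: y = 2x in A and z = x / 2 in B
est = ml_normal_cia([1, 2, 3, 4], [2, 4, 6, 8], [2, 4, 6], [1, 2, 3])
```

Note that the mean of Y is not the plain average of Y in A: it is shifted by the slope times the difference between the pooled mean of X and the mean of X in A.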

Comment: why maximum likelihood estimation?
What happens if, instead of the previous maximum likelihood estimation, we estimate each parameter directly from the data set where the corresponding variable(s) are observed? For instance, let's consider estimating the mean of Y directly on A (i.e. with the sample average $\bar y_A$) instead of using the maximum likelihood estimate $\hat\mu_Y$ (a kind of regression estimate in double sampling). For large samples, the two variances compare approximately as
$$\mathrm{Var}(\hat\mu_Y) \approx \frac{\sigma_Y^2}{n_A}\Big(1 - \frac{n_B}{n_A + n_B}\,\rho_{XY}^2\Big) \le \frac{\sigma_Y^2}{n_A} = \mathrm{Var}(\bar y_A),$$
where $\rho_{XY}$ is the correlation coefficient between X and Y. The maximum likelihood estimator becomes much more efficient as the size of sample B increases and as X and Y are more highly correlated.

Comment: why maximum likelihood estimation?
When each parameter is estimated separately on the part of $A \cup B$ that is complete for the corresponding r.v., the estimates might not be coherent. For instance, the estimated variance-covariance matrix of (X, Y) can fail to be positive semidefinite! This cannot happen when all the parameters are estimated simultaneously by maximum likelihood.

Example

Example Under the CIA the maximum likelihood estimates of the parameters are: From the previous estimates we get:

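Parametric macro methods: the multinomial case
Let (X, Y, Z) be a multinomial r.v. with parameters
$$\theta = \{\theta_{ijk}\}, \qquad i = 1, \dots, I, \quad j = 1, \dots, J, \quad k = 1, \dots, K,$$
where $\theta$ is a vector of parameters with the following characteristics: $\theta_{ijk} \ge 0$ and $\sum_{i,j,k} \theta_{ijk} = 1$.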

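Parametric macro methods: the multinomial case
Adopting the same re-parameterization of the joint distribution, under the CIA the parameters of interest are $\theta_{i..}$ (the marginal distribution of X), $\theta_{j|i}$ (Y given X) and $\theta_{k|i}$ (Z given X). In this context, the parameters of the joint distribution are computed according to the formula
$$\theta_{ijk} = \theta_{i..}\,\theta_{j|i}\,\theta_{k|i}.$$
When the interest is only in the pairwise distribution of (Y, Z):
$$\theta_{.jk} = \sum_{i=1}^{I} \theta_{i..}\,\theta_{j|i}\,\theta_{k|i}.$$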

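Parametric macro methods: the multinomial case
Given the sample $A \cup B$, the maximum likelihood estimator is
$$\hat\theta_{i..} = \frac{n^A_{i..} + n^B_{i..}}{n_A + n_B}, \qquad \hat\theta_{j|i} = \frac{n^A_{ij.}}{n^A_{i..}}, \qquad \hat\theta_{k|i} = \frac{n^B_{i.k}}{n^B_{i..}},$$
where $n^A$ and $n^B$ denote the observed cell counts in A and B respectively.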
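For categorical data the maximum likelihood estimates under the CIA reduce to relative frequencies, which can be sketched as follows. The function name `ml_multinomial_cia` and the toy records are illustrative assumptions; the sketch also assumes that every category of X appears in both files.

```python
from collections import Counter


def ml_multinomial_cia(a_records, b_records):
    """ML estimates under the CIA for categorical (X, Y, Z).

    a_records: list of (x, y) tuples from file A
    b_records: list of (x, z) tuples from file B
    Assumes every category of X occurs in both A and B.
    """
    n_a, n_b = len(a_records), len(b_records)
    a_x = Counter(x for x, _ in a_records)   # counts of X in A
    b_x = Counter(x for x, _ in b_records)   # counts of X in B
    a_xy = Counter(a_records)                # counts of (X, Y) in A
    b_xz = Counter(b_records)                # counts of (X, Z) in B

    # X marginal from the pooled sample, conditionals from one file each
    theta_x = {i: (a_x[i] + b_x[i]) / (n_a + n_b) for i in set(a_x) | set(b_x)}
    theta_y_x = {(i, j): c / a_x[i] for (i, j), c in a_xy.items()}
    theta_z_x = {(i, k): c / b_x[i] for (i, k), c in b_xz.items()}

    # Joint distribution implied by conditional independence of Y and Z given X
    joint = {
        (i, j, k): theta_x[i] * p_j * p_k
        for (i, j), p_j in theta_y_x.items()
        for (i2, k), p_k in theta_z_x.items()
        if i2 == i
    }
    return theta_x, theta_y_x, theta_z_x, joint


# Invented toy records, not the samples of the course example
a = [(1, 1), (1, 2), (2, 1), (2, 1)]
b = [(1, 1), (1, 3), (2, 2), (2, 2)]
theta_x, theta_y_x, theta_z_x, joint = ml_multinomial_cia(a, b)
```

Summing `joint` over the X index gives the estimated pairwise distribution of (Y, Z).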

Example Let's consider the following two samples A and B, where I = 2, J = 2, K = 3.

Example The maximum likelihood estimates of the parameters are:

Example The maximum likelihood estimates of the parameters of the joint distribution are:

Parametric macro methods: conclusions
- The CIA model is identifiable (i.e. it admits a unique estimate) for the data set $A \cup B$.
- The maximum likelihood estimator is very easy to apply.
- Even though the problem is characterized by missing data, it can be split into three «complete data» subproblems, one for each parameter of the re-parameterization.
- Other estimation methods can be statistically consistent, but incoherent.


Parametric micro methods
We are still in the familiar context of parametric data models, but now the objective is micro, i.e. we are interested in a complete data set where (X, Y, Z) are jointly available. This is the context where imputation methods are usually used!

Objective and context
Objective: to create a complete data set for (X, Y, Z).
Context: a partially observed data set.

Parametric micro methods: rationale
Method: imputation of missing values. In a parametric context:
1. Estimate the distribution parameters.
2. Take a (not necessarily random) value from the estimated distribution.
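The two steps above can be sketched for the normal case, where Z is imputed in the recipient file A from a regression estimated on the donor file B. The function name `impute_z_normal` and its parameters are hypothetical; with `add_residual=False` the imputed value is the conditional mean (a non-random choice), otherwise a normal residual is added to make it a random draw.

```python
import random
from statistics import fmean


def impute_z_normal(xa, xb, zb, add_residual=False, rng=None):
    """Impute Z for the records of A from the regression of Z on X fitted in B."""
    rng = rng or random.Random()
    # Step 1: estimate the parameters of Z given X on file B
    mx, mz = fmean(xb), fmean(zb)
    sxx = sum((x - mx) ** 2 for x in xb)
    beta = sum((x - mx) * (z - mz) for x, z in zip(xb, zb)) / sxx
    alpha = mz - beta * mx
    sigma = fmean([(z - alpha - beta * x) ** 2 for x, z in zip(xb, zb)]) ** 0.5
    # Step 2: take a value from the estimated conditional distribution
    imputed = []
    for x in xa:
        value = alpha + beta * x              # conditional mean
        if add_residual:
            value += rng.gauss(0.0, sigma)    # random draw around the mean
        imputed.append(value)
    return imputed


# Invented toy data: z = x / 2 in B, so A's records get half their x value
z_for_a = impute_z_normal([1, 3], [2, 4, 6], [1, 2, 3])
```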


Nonparametric macro methods
The family of distributions to which the distribution of (X, Y, Z) belongs cannot be represented by a finite number of parameters. Although the statistical literature on nonparametrics is huge, this is by far the most neglected situation in statistical matching! We review it anyway, in order to link macro and micro methods as in the parametric case.

Nonparametric macro methods: rationale
These methods are usually neglected in the statistical matching literature; nevertheless, the methodology can be developed along the same lines as the parametric macro case. Two situations have mainly been studied:
1. X categorical, Y and Z ordered or numerical: estimation of the empirical cumulative distribution function.
2. X, Y and Z numerical: estimation of the nonparametric regression function.
The first approach is helpful for the random generation of imputations in the corresponding micro methods; the second for conditional mean matching, adding a random residual whenever possible.


Nonparametric micro methods
Those who applied these methods seldom assumed anything about the distribution of (X, Y, Z). Yet each micro method has a macro counterpart, i.e. an implicit representation of how the distribution of (X, Y, Z) is estimated. The question is whether one is aware of it or not.

Nonparametric micro methods
The nonparametric micro matching methods essentially consist of three imputation procedures:
1. Random hot deck
2. Rank hot deck
3. Distance hot deck
As already seen in the parametric case, each of these methods corresponds to a specific nonparametric macro estimate of the distribution $f(x, y, z)$ or of one of its characteristic values. In general, these methods do not treat the two data sets A and B as a single sample $A \cup B$.

Nonparametric micro methods
The idea is to treat one file as the recipient and the other as the donor:
- A is the recipient file: these are the data to impute.
- B is the donor file: these are the data to use for imputation.

Example
In order to define the different hot deck methods, let's consider an example. Let A and B be as follows:
- A: $n_A$ = 6, observed variables: Gender, Age, Income
- B: $n_B$ = 10, observed variables: Gender, Age, Expenditures
A is the recipient and B is the donor. The common variables are X = ($X_1$ = Gender, $X_2$ = Age), with Y = Income and Z = Expenditures.

Example

Random hot deck: the method
1. Draw one value at random from B and assign it to the first record to impute in A.
2. Follow the same procedure for all $a \in A$.
Example: in general we have $n_B^{n_A} = 10^6$ possible different ways to impute A.
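A minimal sketch of random hot deck follows. The function name `random_hot_deck` and the expenditure values are invented for illustration; donors are drawn with replacement, which is why the number of possible completions is a power.

```python
import random


def random_hot_deck(n_recipients, donor_values, rng=None):
    """Assign to each recipient a donor value drawn at random with replacement."""
    rng = rng or random.Random()
    return [rng.choice(donor_values) for _ in range(n_recipients)]


# Hypothetical expenditures observed on the 10 donor records of B
expenditures_b = [12, 15, 9, 20, 18, 11, 14, 16, 10, 13]
imputed = random_hot_deck(6, expenditures_b, rng=random.Random(42))

# Each of the 6 recipients can receive any of the 10 donors
n_ways = len(expenditures_b) ** 6
```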

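Conditional random hot deck: the method
1. Fix a conditioning variable, e.g. $X_1$.
2. For the first record a = 1, draw one value at random from the subset of units b in B that share its value of $X_1$ (here $x_{1b}$ = F).
3. Follow the same procedure for all $a \in A$.
Example: let $m_A$ = 4 and $m_B$ = 6 be the numbers of records of one gender in A and B. Donors are drawn independently within each gender class, so the numbers of choices multiply, and the number of different completed data sets we can get is $m_B^{m_A} \times (n_B - m_B)^{n_A - m_A} = 6^4 \times 4^2 = 20736$.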

Comments
1. Random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z.
2. Conditional random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z given X.
3. It is possible to remove each drawn value from the set of possible donors (constrained procedure); however, this jeopardizes the preservation of the observed distribution of Z, or of Z given X, in B.
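Conditional random hot deck only changes the donor pool: each recipient draws from the donors in its own class. The sketch below uses hypothetical names and invented gender and expenditure values; it assumes every class present in A has at least one donor in B.

```python
import random
from collections import defaultdict


def conditional_random_hot_deck(a_classes, b_classes, b_values, rng=None):
    """Draw, for each recipient, a donor value from the same class of X."""
    rng = rng or random.Random()
    pools = defaultdict(list)
    for cls, value in zip(b_classes, b_values):
        pools[cls].append(value)        # donor values grouped by class
    # rng.choice raises IndexError if a class of A has no donors in B
    return [rng.choice(pools[cls]) for cls in a_classes]


# Hypothetical data: gender of the recipients, gender and expenditures of donors
gender_a = ["F", "F", "M", "F", "M", "F"]
gender_b = ["F", "M", "F", "F", "M", "F", "M", "F", "M", "F"]
expend_b = [12, 15, 9, 20, 18, 11, 14, 16, 10, 13]
imputed = conditional_random_hot_deck(gender_a, gender_b, expend_b,
                                      rng=random.Random(7))
```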

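Rank hot deck
Let's assume that $n_B = k\,n_A$, with k integer. Compute the empirical cumulative distribution functions of X in the two files:
$$\hat F_X^A(x) = \frac{1}{n_A}\sum_{a=1}^{n_A} I(x_a \le x), \qquad \hat F_X^B(x) = \frac{1}{n_B}\sum_{b=1}^{n_B} I(x_b \le x).$$
To each $a \in A$ assign the donor $b^* \in B$ chosen so that
$$\big|\hat F_X^A(x_a) - \hat F_X^B(x_{b^*})\big| = \min_{b \in B} \big|\hat F_X^A(x_a) - \hat F_X^B(x_b)\big|.$$
In other words, this method matches records whose quantiles of X are similar in A and B respectively.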

Rank hot deck: example
Rank the two samples A and B according to $X_1$ and compute the values of the empirical cumulative distribution function of $X_1$ in A and B respectively. In this example, there is only one way to impute a value.
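Rank hot deck can be sketched by comparing empirical cumulative distribution functions. The helper names `ecdf` and `rank_hot_deck` and the toy values are assumptions made for illustration; ties between equally close donors are resolved here by taking the first one, which is one possible choice among several.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def rank_hot_deck(xa, xb, zb):
    """Impute Z in A from the donor in B with the closest ECDF value of X."""
    f_a, f_b = ecdf(xa), ecdf(xb)
    imputed = []
    for x in xa:
        target = f_a(x)
        # Donor whose relative rank in B is closest to the recipient's rank in A
        best = min(range(len(xb)), key=lambda j: abs(target - f_b(xb[j])))
        imputed.append(zb[best])
    return imputed


# Invented toy data with n_B = 2 * n_A
imputed = rank_hot_deck([20, 40], [10, 30, 50, 70], [1, 2, 3, 4])
```

In this toy case the recipient with x = 20 sits at the median of A and receives the value of the donor at the median of B, even though another donor is equally close in age.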

Distance hot deck
To each $a \in A$ assign the donor $b^* \in B$ that is nearest according to the common variables: $d(x_a, x_{b^*}) = \min_{b \in B} d(x_a, x_b)$. This method depends on the distance function, and different choices are available.
- If X is numeric, it is possible to choose the Manhattan distance, $d(x_a, x_b) = \sum_p |x_{ap} - x_{bp}|$; other options include the Euclidean distance.
- If X is multivariate, the available distances include the Mahalanobis and Canberra distances.
- If X is categorical and unordered, it is possible to consider imputation classes (i.e. the distance is «equality»).

Distance hot deck - example
Let's consider $X_2$ as the variable used to compute distances (i.e. we choose as donors those records in B whose age is the most similar to the one in A). If more than one donor is at the same minimum distance, choose one of them at random. The overall distance between donors and recipients is the sum of the individual distances.
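A minimal sketch of unconstrained distance hot deck on a single numeric matching variable is given below. The function name `distance_hot_deck` and the ages and expenditures are invented; the absolute difference plays the role of the Manhattan distance in one dimension.

```python
import random


def distance_hot_deck(xa, xb, zb, rng=None):
    """Impute Z in A from the nearest donor in B; ties are broken at random.

    Returns the imputed values and the overall donor-recipient distance.
    """
    rng = rng or random.Random()
    imputed, total = [], 0
    for x in xa:
        distances = [abs(x - x_donor) for x_donor in xb]
        d_min = min(distances)
        nearest = [j for j, d in enumerate(distances) if d == d_min]
        j = rng.choice(nearest)   # a donor may be reused by several recipients
        imputed.append(zb[j])
        total += d_min
    return imputed, total


# Hypothetical ages of recipients and donors, with donor expenditures
imputed, total = distance_hot_deck([25, 41], [20, 27, 40, 60], [10, 20, 30, 40])
```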

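Constrained distance hot deck
In the former procedure, a donor can be chosen more than once if it is the nearest to more than one record in A. In order to avoid reusing the same information more than once, the following constrained procedure has been defined. Let $d_{ab}$ be the distance between recipient a and donor b, and let $w_{ab} = 1$ if b is assigned to a and 0 otherwise (with $n_B \ge n_A$).
a. Minimize $\sum_{a=1}^{n_A} \sum_{b=1}^{n_B} d_{ab}\, w_{ab}$
b. Under the constraints $\sum_{b=1}^{n_B} w_{ab} = 1$ for every a, $\sum_{a=1}^{n_A} w_{ab} \le 1$ for every b, and $w_{ab} \in \{0, 1\}$.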

Constrained distance hot deck: advantages and disadvantages
- Advantage: constrained matching makes it possible to preserve the marginal distribution of the variable to impute (Z).
- Disadvantage: constrained distance hot deck is characterized by a larger overall distance between donors and recipients than unconstrained distance hot deck.
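The constrained procedure is an assignment problem. For very small files it can be illustrated by brute force, as in the sketch below; the function name and the toy data are assumptions, and a real application would use a dedicated assignment or linear programming solver instead of enumerating all assignments.

```python
from itertools import permutations


def constrained_distance_hot_deck(xa, xb, zb):
    """Assign a distinct donor to each recipient, minimising the total distance.

    Brute-force search over all assignments: only feasible for tiny examples.
    Requires at least as many donors as recipients.
    """
    best_cost, best_donors = None, None
    for donors in permutations(range(len(xb)), len(xa)):
        cost = sum(abs(x - xb[j]) for x, j in zip(xa, donors))
        if best_cost is None or cost < best_cost:
            best_cost, best_donors = cost, donors
    return [zb[j] for j in best_donors], best_cost


# Hypothetical ages: both recipients are closest to the donor aged 27
imputed, total = constrained_distance_hot_deck([25, 26], [20, 27, 40], [10, 20, 30])
```

In this toy case the unconstrained method would use the donor aged 27 twice, with overall distance 3; forbidding reuse raises the overall distance to 6, which illustrates the trade-off stated above.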

Constrained distance hot deck: example The overall donor recipient distance is

Constrained distance hot deck: comments
- Distance hot deck is equivalent to estimating a nonparametric regression function via the k-nearest-neighbour (kNN) method with k = 1.
- These methods always produce live data (i.e. actually observed values) as imputations.
- Sometimes parametric and nonparametric procedures are applied together: mixed methods. Example:
  - Regression step: impute intermediate values in A and B.
  - Matching step: use a distance hot deck, selecting the donor $b^*$ with the shortest distance.


Auxiliary information
In order to avoid the CIA, two different kinds of auxiliary information have usually been considered:
1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed;
2) a plausible value of the inestimable parameters of either (Y, Z | X) or (Y, Z).
These additional sources of information may originate from: an outdated statistical investigation; an administrative register; a supplemental (even small) ad hoc survey; proxy variables for (Y, Z).
Always pay attention to their accuracy!

Auxiliary information on parameters
Auxiliary information on parameters can be expressed as:
- information about $\theta_{YZ|X}$;
- information about $\theta_{YZ}$.
This kind of information restricts the parameter space $\Theta$ to a subspace $\Theta^*$, which contains all the parameters $\theta \in \Theta$ compatible with the auxiliary information.
NOTE: most of the time the unconstrained maximum likelihood estimate is not compatible with this information; this leads to estimates of the estimable parameters that differ from those obtained under the CIA.

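Example: auxiliary information on $\rho_{YZ} = \rho^*_{YZ}$
Let us suppose that the correlations $\rho_{XY}$ and $\rho_{XZ}$ are given. The value $\rho^*_{YZ} = 0.7$ is compatible with them, since the determinant of the resulting correlation matrix is positive, $\det(\rho) = 0.096$, while $\rho^*_{YZ} = 0.9$ is not compatible, since the determinant is negative, $\det(\rho) = -0.008$.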
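The compatibility check can be reproduced numerically. In the sketch below the values 0.9 and 0.6 for the correlations of X with Y and of X with Z are an assumption: they are not stated in the text, but they reproduce the two determinants quoted above. The function names are hypothetical.

```python
def corr_det(r_xy, r_xz, r_yz):
    """Determinant of the 3x3 correlation matrix of (X, Y, Z)."""
    return 1 + 2 * r_xy * r_xz * r_yz - r_xy ** 2 - r_xz ** 2 - r_yz ** 2


def compatible_range(r_xy, r_xz):
    """Interval of r_yz values giving a non-negative determinant."""
    half_width = ((1 - r_xy ** 2) * (1 - r_xz ** 2)) ** 0.5
    centre = r_xy * r_xz          # value of r_yz implied by the CIA
    return centre - half_width, centre + half_width


# Assumed correlations with X (chosen to match the determinants in the example)
R_XY, R_XZ = 0.9, 0.6

det_07 = corr_det(R_XY, R_XZ, 0.7)   # positive: 0.7 is compatible
det_09 = corr_det(R_XY, R_XZ, 0.9)   # negative: 0.9 is not compatible
low, high = compatible_range(R_XY, R_XZ)
```

Under these assumed values the CIA corresponds to the midpoint of the admissible interval, 0.9 × 0.6 = 0.54, and any auxiliary value outside the interval is rejected.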

Use of a third file C, complete on X, Y, Z
- In a parametric macro approach: use the EM algorithm on the union of A, B and C.
- In a parametric micro approach: use conditional mean matching.
- In a nonparametric micro approach: use hot deck (first impute A with records from C, then impute live B values on A using the imputed records).
- Parametric and nonparametric methods can also be mixed (e.g. impute A records by a conditional mean matching that makes use of the additional file C, then impute live B values).
