Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Similar documents
Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Introduction to blocking techniques and traditional record linkage

Statistical Analysis of List Experiments

Privacy Preserving Probabilistic Record Linkage

Package fastlink. February 1, 2018

An Ensemble Approach for Record Matching in Data Linkage

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German.

Overview of Record Linkage for Name Matching

Data Linkages - Effect of Data Quality on Linkage Outcomes

The Growing Gap between Landline and Dual Frame Election Polls

dtalink Faster probabilistic record linking and deduplication methods in Stata for large data files Keith Kranker

Cleanup and Statistical Analysis of Sets of National Files

Overview of Record Linkage Techniques

Record Linkage using Probabilistic Methods and Data Mining Techniques

Statistical Matching using Fractional Imputation

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology

Missing data analysis. University College London, 2015

Missing Data and Imputation

Supplementary text S6 Comparison studies on simulated data

Regression Discontinuity Designs

Record Linkage for the American Opportunity Study: Formal Framework and Research Agenda

RELAIS: A Record Linkage Toolkit Training course on record linkage Monica Scannapieco Istat

SAMPLE Treatment Perceptions Survey (TPS) Report. XXXXXX County, N=239. (Not Real Data) All Substance Use Treatment Programs Surveyed.

Missing Data: What Are You Missing?

Exploiting Data for Risk Evaluation. Niel Hens Center for Statistics Hasselt University

Uncertainties: Representation and Propagation & Line Extraction from Range data

SPSS syntax for data set definition

Semi-supervised Learning

Collective Entity Resolution in Relational Data

Missing Data. Where did it go?

Online Supplement to Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data

Survey Questions and Methodology

Survey Questions and Methodology

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

Missing Data Techniques

10-701/15-781, Fall 2006, Final

An imputation approach for analyzing mixed-mode surveys

Proceedings of the Eighth International Conference on Information Quality (ICIQ-03)

Clinton Leads Trump in Michigan by 10% (Clinton 49% - Trump 39%)

Collaborative Filtering using a Spreading Activation Approach

Statistical matching: conditional. independence assumption and auxiliary information

Introduction Entity Match Service. Step-by-Step Description

An Introduction to Growth Curve Analysis using Structural Equation Modeling

GALLUP NEWS SERVICE GALLUP POLL SOCIAL SERIES: VALUES AND BELIEFS

Week 4: Simple Linear Regression II

Missing Data Analysis with SPSS

Design-Based Estimation with Record-Linked Administrative Files and a Clerical Review Sample

Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

STAT10010 Introductory Statistics Lab 2

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Telephone Survey Response: Effects of Cell Phones in Landline Households

Mixture Models and the EM Algorithm

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov

- 1 - Fig. A5.1 Missing value analysis dialog box

ECLT 5810 Clustering

Missing data analysis: - A study of complete case analysis, single imputation and multiple imputation. Filip Lindhfors and Farhana Morko

FIELD RESEARCH CORPORATION

Multiple Model Estimation : The EM Algorithm & Applications

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

Colorado Results. For 10/3/ /4/2012. Contact: Doug Kaplan,

Unsupervised Sentiment Analysis Using Item Response Theory Models

Handling Data with Three Types of Missing Values:

L2 VoterMapping - Display Options

[Programming Assignment] (1)

Bayesian Model Averaging over Directed Acyclic Graphs With Implications for Prediction in Structural Equation Modeling

Differential Privacy. Seminar: Robust Data Mining Techniques. Thomas Edlich. July 16, 2017

Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

Grouping methods for ongoing record linkage

Linking Patients in PDMP Data

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules

ANNOUNCING THE RELEASE OF LISREL VERSION BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3

emarketer US Social Network Usage StatPack

Inferring Regulatory Networks by Combining Perturbation Screens and Steady State Gene Expression Profiles

Expectation Maximization (EM) and Gaussian Mixture Models

Multiple Model Estimation : The EM Algorithm & Applications

Machine Learning: An Applied Econometric Approach Online Appendix

Graphs and Tables of the Results

Missing Data Analysis for the Employee Dataset

ECLT 5810 Clustering

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS

Data linkages in PEDSnet

Quantifying Search Bias: Investigating Sources of Bias for Political Searches in Social Media

Chuck Cartledge, PhD. 23 September 2017

08 An Introduction to Dense Continuous Robotic Mapping

Amelia multiple imputation in R

STA 4273H: Statistical Machine Learning

Note Set 4: Finite Mixture Models and the EM Algorithm

Variational Methods for Discrete-Data Latent Gaussian Models

A STUDY ON SMART PHONE USAGE AMONG YOUNGSTERS AT AGE GROUP (15-29)

Non-Linearity of Scorecard Log-Odds

Clustering Large Credit Client Data Sets for Classification with SVM

Correlation Motif Vignette

Lecture 26: Missing data

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

User Manual. Last updated 1/19/2012

ƒ % ƒ % ƒ %

A Deterministic Global Optimization Method for Variational Inference

Transcription:

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Ted Enamorado Benjamin Fifield Kosuke Imai Princeton University Talk at Seoul National University Fifth Asian Political Methodology Meeting January 11, 2018 Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 1 / 18

Motivation In any given project, social scientists often rely on multiple data sets Cutting-edge empirical research often merges large-scale administrative records with other types of data We can easily merge data sets if there is a common unique identifier e.g. Use the merge function in R or Stata How should we merge data sets if no unique identifier exists? must use variables: names, birthdays, addresses, etc. Variables often have measurement error and missing values cannot use exact matching What if we have millions of records? cannot merge by hand Merging data sets is an uncertain process quantify uncertainty and error rates Solution: Probabilistic Model Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 2 / 18

Data Merging Can be Consequential Turnout validation for the American National Election Survey 2012 Election: self-reported turnout (78%) actual turnout (59%) Ansolabehere and Hersh (2012, Political Analysis): electronic validation of survey responses with commercial records provides a far more accurate picture of the American electorate than survey responses alone. Berent, Krosnick, and Lupia (2016, Public Opinion Quarterly): Matching errors... drive down validated turnout estimates. As a result,... the apparent accuracy [of validated turnout estimates] is likely an illusion. Challenge: Find 2500 survey respondents in 160 million registered voters (less than 0.001%) finding needles in a haystack Problem: match registered voter, non-match non-voter Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 3 / 18

Probabilistic Model of Record Linkage Many social scientists use deterministic methods: match similar observations (e.g., Ansolabehere and Hersh, 2016; Berent, Krosnick, and Lupia, 2016) proprietary methods (e.g., Catalist) Problems: 1 not robust to measurement error and missing data 2 no principled way of deciding how similar is similar enough 3 lack of transparency Probabilistic model of record linkage: originally proposed by Fellegi and Sunter (1969, JASA) enables the control of error rates Problems: 1 current implementations do not scale 2 missing data treated in ad-hoc ways 3 does not incorporate auxiliary information Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 4 / 18

The Fellegi-Sunter Model Two data sets: A and B with N A and N B observations K variables in common We need to compare all N A N B pairs Agreement vector for a pair (i, j): γ(i, j) 0 different 1 γ k (i, j) =. similar L k 2 L k 1 identical Latent variable: M i,j = { 0 non-match 1 match Missingness indicator: δ k (i, j) = 1 if γ k (i, j) is missing Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 5 / 18

How to Construct Agreement Patterns Jaro-Winkler distance with default thresholds for string variables Name Address First Middle Last House Street Data set A 1 James V Smith 780 Devereux St. 2 John NA Martin 780 Devereux St. Data set B 1 Michael F Martinez 4 16th St. 2 James NA Smith 780 Dvereuux St. Agreement patterns A.1 B.1 0 0 0 0 0 A.1 B.2 2 NA 2 2 1 A.2 B.1 0 NA 1 0 0 A.2 B.2 0 NA 0 2 1 Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 6 / 18

Independence assumptions for computational efficiency: 1 Independence across pairs 2 Independence across variables: γ k (i, j) γ k (i, j) M ij 3 Missing at random: δ k (i, j) γ k (i, j) M ij Nonparametric mixture model: N A N B 1 λ m (1 λ) 1 m i=1 j=1 m=0 K k=1 ( Lk 1 l=0 π 1{γ k(i,j)=l} kml where λ = P(M ij = 1) is the proportion of true matches and π kml = Pr(γ k (i, j) = l M ij = m) ) 1 δk (i,j) Fast implementation of the EM algorithm (R package fastlink) EM algorithm produces the posterior matching probability ξ ij Deduping to enforce one-to-one matching 1 Choose the pairs with ξ ij > c for a threshold c 2 Use Jaro s linear sum assignment algorithm to choose the best matches Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 7 / 18

Controlling Error Rates 1 False negative rate (FNR): #true matches not found # true matches in the data = P(M ij = 1 unmatched)p(unmatched) P(M ij = 1) 2 False discovery rate (FDR): # false matches found # matches found = P(M ij = 0 matched) We can compute FDR and FNR for any given posterior matching probability threshold c Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 8 / 18

Simulation Studies 2006 voter files from California (female only; 8 million records) Validation data: records with no missing data (340k records) Linkage fields: first name, middle name, last name, date of birth, address (house number and street name), and zip code 2 scenarios: 1 Unequal size: 1:100, 10:100, and 50:100, larger data 100k records 2 Equal size (100k records each): 20%, 50%, and 80% matched 3 missing data mechanisms: 1 Missing completely at random (MCAR) 2 Missing at random (MAR) 3 Missing not at random (MNAR) 3 levels of missingness: 5%, 10%, 15% Noise is added to first name, last name, and address Results below are with 10% missingness and no noise Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 9 / 18

Error Rates and Estimation Error for Turnout 80% Overlap 50% Overlap 20% Overlap False Negative Rate 0 0.25 0.5 0.75 1 fastlink partial match (ADGN) exact match MCAR MAR MNAR MCAR MAR MNAR MCAR MAR MNAR Absolute Estimation Error (percentage point) 0 5 10 15 fastlink partial match (ADGN) exact match MCAR MAR MNAR MCAR MAR MNAR MCAR MAR MNAR Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 10 / 18

Runtime Comparisons Equal size Unequal Size Time elapsed (minutes) 0 50 100 150 Record Linkage (Python) RecordLinkage (R) fastlink (R) Time elapsed (minutes) 0 5 10 15 20 Record Linkage (Python) RecordLinkage (R) fastlink (R) 1 5 10 15 20 25 30 Dataset size (thousands) 1 5 10 15 20 25 30 35 40 Largest dataset size (thousands) No blocking, single core (parallelization possible with fastlink) Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 11 / 18

Application: Merging Survey with Administrative Record Hill and Huber (2017, Political Behavior) study differences between donors and non-donors among CCES (2012) respondents CCES respondents are matched with DIME donors (2010, 2012) Use of a proprietary method, treating non-matches as non-donors Donation amount coarsened and small noise added 4,432 (8.1%) matched out of 54,535 CCES respondents We asked YouGov to apply fastlink for merging the two data sets We signed the NDA form no coarsening, no noise Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 12 / 18

Merging Process DIME: 5 million unique contributors CCES: 51,184 respondents (YouGov panel only) Exact matching: 0.33% match rate Blocking: 102 blocks using state and gender Linkage fields: first name, middle name, last name, address (house number, street name), zip code Took 1 hour using a dual-core laptop Examples from the output of one block: Name Address First Middle Last Street House Zip Posterior agree agree agree agree agree agree 1.00 similar NA Agree similar agree agree 0.93 agree NA Agree disagree disagree NA 0.01 Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 13 / 18

Merge Results Number of matches Overlap fastlink and proprietary method False discovery rate (FDR; %) False negative rate (FNR; %) Threshold 0.75 0.85 0.95 Proprietary All 4945 4794 4573 4534 Female 2198 2156 2067 2210 Male 2747 2638 2506 2324 All 3958 3935 3880 Female 1878 1867 1845 Male 2080 2068 2035 All 1.24 0.65 0.21 Female 0.91 0.52 0.14 Male 1.49 0.75 0.27 All 15.25 17.35 20.81 Female 5.34 6.79 10.29 Male 21.84 24.37 27.81 Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 14 / 18

Correlations with Self-reports and Matching Probabilities DIME self report 0 100 1K 100K 10K 0 100 1K 100K 10K Corr = 0.73 N = 3767 Common matches fastlink only matches Proprietary only matches DIME self report 0 100 1K 100K 10K 0 100 1K 100K 10K Corr = 0.58 N = 747 DIME self report 0 100 1K 100K 10K 0 100 1K 100K 10K Corr = 0.46 N = 533 Probability Density N = 3935 0 0.25 0.5 0.75 1 0 5 10 15 20 25 Common matches fastlink only matches Proprietary only matches Probability Density 0 0.25 0.5 0.75 1 0 5 10 15 20 25 N = 859 Probability Density 0 0.25 0.5 0.75 1 0 5 10 15 20 25 N = 649 Donations Posterior probability of match Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 15 / 18

Post-merge Analysis 1 Merged variable as the outcome Assumption: No omitted variable for merge Zi X i (δ, γ) Posterior mean of merged variable: ζ i = N B j=1 ξ ijz j / N B j=1 ξ ij Regression: E(Z i X) = E{E(Z i γ, δ, X i ) X i } = E(ζ i X i ) 2 Merged variable as a predictor Linear regression: Y i = α + βz i + η X i + ɛ i Additional assumption: Y i (δ, γ) Z, X Weighted regression: E(Y i γ, δ, X i ) = α + βe(zi γ, δ, X i ) + η X i + E(ɛ i γ, δ, X i ) = α + βζ i + η X i Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 16 / 18

Predicting Ideology using Contribution Status Hill and Huber regresses ideology score ( 1 to 1) on the indicator variable for being a donor (merging indicator), turnout, and demographic variables We use the weighted regression approach Republicans Democrats Original fastlink Original fastlink Contributor dummy 0.080 0.046 0.180 0.165 (0.016) (0.015) (0.008) (0.009) 2012 General vote 0.095 0.094 0.060 0.060 (0.013) (0.013) (0.010) (0.010) 2012 Primary vote 0.094 0.096 0.019 0.024 (0.009) (0.009) (0.009) 0.008) Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 17 / 18

Concluding Remarks Merging data sets is critical part of social science research merging can be difficult when no unique identifier exists large data sets make merging even more challenging yet merging can be consequential Merging should be part of replication archive We offer a fast, principled, and scalable merging method that can incorporate auxiliary information Open-source software fastlink available at CRAN Ongoing research: 1 validating self-reported turnout 2 merging multiple administrative records over time 3 privacy-preserving record linkage Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 18 / 18