Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Similar documents
Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

2018 NSP Student Leader Contact Form

Figure 1 Map of US Coast Guard Districts... 2 Figure 2 CGD Zip File Size... 3 Figure 3 NOAA Zip File Size By State...

DEPARTMENT OF HOUSING AND URBAN DEVELOPMENT. [Docket No. FR-6090-N-01]

Telecommunications and Internet Access By Schools & School Districts

Panelists. Patrick Michael. Darryl M. Bloodworth. Michael J. Zylstra. James C. Green

A New Method of Using Polytomous Independent Variables with Many Levels for the Binary Outcome of Big Data Analysis

THE LINEAR PROBABILITY MODEL: USING LEAST SQUARES TO ESTIMATE A REGRESSION EQUATION WITH A DICHOTOMOUS DEPENDENT VARIABLE

MAKING MONEY FROM YOUR UN-USED CALLS. Connecting People Already on the Phone with Political Polls and Research Surveys. Scott Richards CEO

The Lincoln National Life Insurance Company Universal Life Portfolio

B.2 Measures of Central Tendency and Dispersion

CostQuest Associates, Inc.

Distracted Driving- A Review of Relevant Research and Latest Findings

Ocean Express Procedure: Quote and Bind Renewal Cargo

Fall 2007, Final Exam, Data Structures and Algorithms

Accommodating Broadband Infrastructure on Highway Rights-of-Way. Broadband Technology Opportunities Program (BTOP)

CSE 781 Data Base Management Systems, Summer 09 ORACLE PROJECT

Geographic Accuracy of Cell Phone RDD Sample Selected by Area Code versus Wire Center

Silicosis Prevalence Among Medicare Beneficiaries,

IT Modernization in State Government Drivers, Challenges and Successes. Bo Reese State Chief Information Officer, Oklahoma NASCIO President

State IT in Tough Times: Strategies and Trends for Cost Control and Efficiency

Department of Business and Information Technology College of Applied Science and Technology The University of Akron

MERGING DATAFRAMES WITH PANDAS. Appending & concatenating Series

Name: Business Name: Business Address: Street Address. Business Address: City ST Zip Code. Home Address: Street Address

Post Graduation Survey Results 2015 College of Engineering Information Networking Institute INFORMATION NETWORKING Master of Science

NSA s Centers of Academic Excellence in Cyber Security

Presented on July 24, 2018

CIS 467/602-01: Data Visualization

DSC 201: Data Analysis & Visualization

Moonv6 Update NANOG 34

Sideseadmed (IRT0040) loeng 4/2012. Avo

Charter EZPort User Guide

2018 Supply Cheat Sheet MA/PDP/MAPD

Amy Schick NHTSA, Occupant Protection Division April 7, 2011

Tina Ladabouche. GenCyber Program Manager

Presentation Outline. Effective Survey Sampling of Rare Subgroups Probability-Based Sampling Using Split-Frames with Listed Households

Global Forum 2007 Venice

Team Members. When viewing this job aid electronically, click within the Contents to advance to desired page. Introduction... 2

2015 DISTRACTED DRIVING ENFORCEMENT APRIL 10-15, 2015

Best Practices in Rapid Deployment of PI Infrastructure and Integration with OEM Supplied SCADA Systems

AASHTO s National Transportation Product Evaluation Program

State HIE Strategic and Operational Plan Emerging Models. February 16, 2011

Contact Center Compliance Webinar Bringing you the ANSWERS you need about compliance in your call center.

The Outlook for U.S. Manufacturing

DTFH61-13-C Addressing Challenges for Automation in Highway Construction

What Did You Learn? Key Terms. Key Concepts. 68 Chapter P Prerequisites

Homework Assignment #5

A Capabilities Presentation

PulseNet Updates: Transitioning to WGS for Reference Testing and Surveillance

Prizm. manufactured by. White s Electronics, Inc Pleasant Valley Road Sweet Home, OR USA. Visit our site on the World Wide Web

ACCESS PROCESS FOR CENTRAL OFFICE ACCESS

Introduction to blocking techniques and traditional record linkage

2013 Product Catalog. Quality, affordable tax preparation solutions for professionals Preparer s 1040 Bundle... $579

Configuring Oracle GoldenGate OGG 11gR2 local integrated capture and using OGG for mapping and transformations

Data Visualization (CIS/DSC 468)

Touch Input. CSE 510 Christian Holz Microsoft Research February 11, 2016

Jurisdictional Guidelines for Accepting a UCC Record Presented for Filing 2010 Amendments & the 2011 IACA Forms

On All Forms. Financing Statement (Form UCC1) Statutory, MARS or Other Regulatory Authority to Deviate

Appriss Health Information Solutions. NARxCheck/Gateway Overview June 2016 Provided by Clay Rogers

Package fastlink. February 1, 2018

Strengthening connections today, while building for tomorrow. Wireless broadband, small cells and 5G

STATE DATA BREACH NOTIFICATION LAWS OVERVIEW OF REQUIREMENTS FOR RESPONDING TO A DATA BREACH UPDATED JUNE 2017

Presentation to NANC. January 22, 2003

Steve Stark Sales Executive Newcastle

Expanding Transmission Capacity: Options and Implications. What role does renewable energy play in driving transmission expansion?

Privacy Preserving Probabilistic Record Linkage

Real World Algorithms: A Beginners Guide Errata to the First Printing

Medium voltage Marketing contacts

How to Make an Impressive Map of the United States with SAS/Graph for Beginners Sharon Avrunin-Becker, Westat, Rockville, MD

Selling Compellent Hardware: Controllers, Drives, Switches and HBAs Chad Thibodeau

Statistical Analysis of List Experiments

Elevation Data Acquisition Update

CMPE 180A Data Structures and Algorithms in C++ Spring 2018

OPERATOR CERTIFICATION

Panasonic Certification Training January 21-25, 2019

Be sure to fax AGA a copy of every application you submitted direct along with the fax confirmation page to ensure you receive your commission!

Non-Revenue Water Management as Effective Utility Management. Steve Cavanaugh, P.E.

2012 Tax Year Media Guide

An Ensemble Approach for Record Matching in Data Linkage

National Continuity Programs

Today s Lecture. Factors & Sampling. Quick Review of Last Week s Computational Concepts. Numbers we Understand. 1. A little bit about Factors

ASR Contact and Escalation Lists

1/29/2009. Disparity arises at every decision point in the system. Pamela Oliver Department of Sociology University of Wisconsin - Madison

MOVR. Mobile Overview Report January March The first step in a great mobile experience

Distracted Driving U.S. Federal Initiatives. Amy Schick, MS, CHES Office of Impaired Driving and Occupant Protection

Azure App Service. Jorge D. Wong Technical Account Manager Microsoft Azure. Ingram Micro Inc.

Landline and Cell Phone Response Measures in Behavioral Risk Factor Surveillance System

Pro look sports football guide

The Altice Business. Carrier Advantage

2010 Tax Year Media Guide

Valerie Robinson,

Includes all of the following a value of over $695:

Data Linkages - Effect of Data Quality on Linkage Outcomes

2013 Certification Program Accomplishments

ARE WE HEADED INTO A RECESSION?

Welcome to the Comcast Business SmartOffice Video Monitoring Solution

VitalSigns Company List

Recovery Auditor for CMS

Paralleling Topologies and Uptime Institute Tier Ratings

Transcription:

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Ted Enamorado Benjamin Fifield Kosuke Imai Princeton Harvard Talk at the Tech Science Seminar IQSS, Harvard University October 30, 2018 Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 1 / 24

Motivation In any given project, social scientists often rely on multiple data sets Cutting-edge empirical research often merges large-scale administrative records with other types of data We can easily merge data sets if there is a common unique identifier e.g. Use the merge function in R or Stata How should we merge data sets if no unique identifier exists? must use variables: names, birthdays, addresses, etc. Variables often have measurement error and missing values cannot use exact matching What if we have millions of records? cannot merge by hand Merging data sets is an uncertain process quantify uncertainty and error rates Solution: Probabilistic Model Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 2 / 24

Data Merging Can be Consequential Turnout validation for the American National Election Survey 2012 Election: self-reported turnout (78%) actual turnout (59%) Ansolabehere and Hersh (2012, Political Analysis): electronic validation of survey responses with commercial records provides a far more accurate picture of the American electorate than survey responses alone. Berent, Krosnick, and Lupia (2016, Public Opinion Quarterly): Matching errors... drive down validated turnout estimates. As a result,... the apparent accuracy [of validated turnout estimates] is likely an illusion. Challenge: Find 2500 survey respondents in 160 million registered voters (less than 0.001%) finding needles in a haystack Problem: match registered voter, non-match non-voter Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 3 / 24

Probabilistic Model of Record Linkage Many social scientists use deterministic methods: match similar observations (e.g., Ansolabehere and Hersh, 2016; Berent, Krosnick, and Lupia, 2016) proprietary methods (e.g., Catalist) Problems: 1 not robust to measurement error and missing data 2 no principled way of deciding how similar is similar enough 3 lack of transparency Probabilistic model of record linkage: originally proposed by Fellegi and Sunter (1969, JASA) enables the control of error rates Problems: 1 current implementations do not scale 2 missing data treated in ad-hoc ways 3 does not incorporate auxiliary information Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 4 / 24

The Fellegi-Sunter Model Two data sets: A and B with N A and N B observations K variables in common We need to compare all N A N B pairs Agreement vector for a pair (i, j): γ(i, j) 0 different 1 γ k (i, j) =. similar L k 2 L k 1 identical Latent variable: M i,j = { 0 non-match 1 match Missingness indicator: δ k (i, j) = 1 if γ k (i, j) is missing Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 5 / 24

How to Construct Agreement Patterns Jaro-Winkler distance with default thresholds for string variables Name Address First Middle Last House Street Data set A 1 James V Smith 780 Devereux St. 2 John NA Martin 780 Devereux St. Data set B 1 Michael F Martinez 4 16th St. 2 James NA Smith 780 Dvereuux St. Agreement patterns A.1 B.1 0 0 0 0 0 A.1 B.2 2 NA 2 2 1 A.2 B.1 0 NA 1 0 0 A.2 B.2 0 NA 0 2 1 Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 6 / 24

Independence assumptions for computational efficiency: 1 Independence across pairs 2 Independence across variables: γ k (i, j) γ k (i, j) M ij 3 Missing at random: δ k (i, j) γ k (i, j) M ij Nonparametric mixture model: N A N B 1 λ m (1 λ) 1 m i=1 j=1 m=0 K k=1 ( Lk 1 l=0 π 1{γ k(i,j)=l} kml where λ = P(M ij = 1) is the proportion of true matches and π kml = Pr(γ k (i, j) = l M ij = m) ) 1 δk (i,j) Fast implementation of the EM algorithm (R package fastlink) EM algorithm produces the posterior matching probability ξ ij Deduping to enforce one-to-one matching 1 Choose the pairs with ξ ij > c for a threshold c 2 Use Jaro s linear sum assignment algorithm to choose the best matches Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 7 / 24

Controlling Error Rates 1 False negative rate (FNR): #true matches not found # true matches in the data = P(M ij = 1 unmatched)p(unmatched) P(M ij = 1) 2 False discovery rate (FDR): # false matches found # matches found = P(M ij = 0 matched) We can compute FDR and FNR for any given posterior matching probability threshold c Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 8 / 24

Computational Improvements via Hashing Sufficient statistics for the EM algorithm: number of pairs with each observed agreement pattern H k maps each pair of records (keys) in linkage field k to a corresponding agreement pattern (hash value): H = and h (i,j) k h K (1,1) k h (1,2) k... h (1,N 2) k H k where H k =...... k=1 h (N 1,1) k h (N 1,2) k... h (N 1,N 2 ) k = 1 {γ k (i, j) > 0} 2 γ k(i,j)+(k 1) L k H k is a sparse matrix, and so is H With sparse matrix, lookup time is O(T ) where T is the number of unique patterns observed T K k=1 L k Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 9 / 24

Simulation Studies 2006 voter files from California (female only; 8 million records) Validation data: records with no missing data (340k records) Linkage fields: first name, middle name, last name, date of birth, address (house number and street name), and zip code 2 scenarios: 1 Unequal size: 1:100, 10:100, and 50:100, larger data 100k records 2 Equal size (100k records each): 20%, 50%, and 80% matched 3 missing data mechanisms: 1 Missing completely at random (MCAR) 2 Missing at random (MAR) 3 Missing not at random (MNAR) 3 levels of missingness: 5%, 10%, 15% Noise is added to first name, last name, and address Results below are with 10% missingness and no noise Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 10 / 24

Error Rates and Estimation Error for Turnout 80% Overlap 50% Overlap 20% Overlap False Negative Rate 0 0.25 0.5 0.75 1 fastlink partial match (ADGN) exact match MCAR MAR MNAR MCAR MAR MNAR MCAR MAR MNAR Absolute Estimation Error (percentage point) 0 5 10 15 fastlink partial match (ADGN) exact match MCAR MAR MNAR MCAR MAR MNAR MCAR MAR MNAR Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 11 / 24

Accuracy of Estimated Error Rates 80% Overlap 50% Overlap 20% Overlap False Discovery Rate True value 0.00 0.03 0.06 0.09 0.12 fastlink Missing as disagreement True value 0.00 0.03 0.06 0.09 0.12 True value 0.00 0.03 0.06 0.09 0.12 no blocking blocking 0.00 0.03 0.06 0.09 0.12 Estimated value 0.00 0.03 0.06 0.09 0.12 Estimated value 0.00 0.03 0.06 0.09 0.12 Estimated value False Negative Rate True value 0.0 0.1 0.2 0.3 0.4 fastlink Missing as disagreement True value 0.0 0.1 0.2 0.3 0.4 True value 0.0 0.1 0.2 0.3 0.4 blocking no blocking 0.0 0.1 0.2 0.3 0.4 Estimated value 0.0 0.1 0.2 0.3 0.4 Estimated value 0.0 0.1 0.2 0.3 0.4 Estimated value Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 12 / 24

Runtime Comparisons 1500 RecordLinkage (Python) RecordLinkage (R) Running Time Comparison Time Elapsed (Minutes) 1000 500 0 fastlink (Serial) fastlink (Parallel, 8 cores) 0 100 200 300 Dataset Size (Thousands) No blocking, single core (parallelization possible with fastlink) Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 13 / 24

Application ❶: Merging Survey with Administrative Record Hill and Huber (2017, Political Behavior) study differences between donors and non-donors among CCES (2012) respondents CCES respondents are matched with DIME donors (2010, 2012) Use of a proprietary method, treating non-matches as non-donors Donation amount coarsened and small noise added 4,432 (8.1%) matched out of 54,535 CCES respondents We asked YouGov to apply fastlink for merging the two data sets We signed the NDA form no coarsening, no noise Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 14 / 24

Merging Process DIME: 5 million unique contributors CCES: 51,184 respondents (YouGov panel only) Exact matching: 0.33% match rate Blocking: 102 blocks using state and gender Linkage fields: first name, middle name, last name, address (house number, street name), zip code Took 1 hour using a dual-core laptop Examples from the output of one block: Name Address First Middle Last Street House Zip Posterior agree agree agree agree agree agree 1.00 similar NA Agree similar agree agree 0.93 agree NA Agree disagree disagree NA 0.01 Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 15 / 24

Merge Results Number of matches Overlap fastlink and proprietary method False discovery rate (FDR; %) False negative rate (FNR; %) Threshold 0.75 0.85 0.95 Proprietary All 4945 4794 4573 4534 Female 2198 2156 2067 2210 Male 2747 2638 2506 2324 All 3958 3935 3880 Female 1878 1867 1845 Male 2080 2068 2035 All 1.24 0.65 0.21 Female 0.91 0.52 0.14 Male 1.49 0.75 0.27 All 15.25 17.35 20.81 Female 5.34 6.79 10.29 Male 21.84 24.37 27.81 Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 16 / 24

Correlations with Self-reports and Matching Probabilities DIME self report 0 100 1K 100K 10K 0 100 1K 100K 10K Corr = 0.73 N = 3767 Common matches fastlink only matches Proprietary only matches DIME self report 0 100 1K 100K 10K 0 100 1K 100K 10K Corr = 0.58 N = 747 DIME self report 0 100 1K 100K 10K 0 100 1K 100K 10K Corr = 0.46 N = 533 Probability Density N = 3935 0 0.25 0.5 0.75 1 0 5 10 15 20 25 Common matches fastlink only matches Proprietary only matches Probability Density 0 0.25 0.5 0.75 1 0 5 10 15 20 25 N = 859 Probability Density 0 0.25 0.5 0.75 1 0 5 10 15 20 25 N = 649 Donations Posterior probability of match Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 17 / 24

Post-merge Analysis 1 Merged variable as the outcome Assumption: No omitted variable for merge Zi X i (δ, γ) Posterior mean of merged variable: ζ i = N B j=1 ξ ijz j / N B j=1 ξ ij Regression: E(Z i X) = E{E(Z i γ, δ, X i ) X i } = E(ζ i X i ) 2 Merged variable as a predictor Linear regression: Y i = α + βz i + η X i + ɛ i Additional assumption: Y i (δ, γ) Z, X Weighted regression: E(Y i γ, δ, X i ) = α + βe(zi γ, δ, X i ) + η X i + E(ɛ i γ, δ, X i ) = α + βζ i + η X i Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 18 / 24

Predicting Ideology using Contribution Status Hill and Huber regresses ideology score ( 1 to 1) on the indicator variable for being a donor (merging indicator), turnout, and demographic variables We use the weighted regression approach Republicans Democrats Original fastlink Original fastlink Contributor dummy 0.080 0.046 0.180 0.165 (0.016) (0.015) (0.008) (0.009) 2012 General vote 0.095 0.094 0.060 0.060 (0.013) (0.013) (0.010) (0.010) 2012 Primary vote 0.094 0.096 0.019 0.024 (0.009) (0.009) (0.009) 0.008) Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 19 / 24

Application ❷: Merging National Voter Files We merged two national voter files (2015 and 2016) with more than 140 million voters each! Almost all merging is done within each state But, some people move across states! 7.5 million cross-state movers between 2014 and 2015 IRS Statistics of Income Migration Data 9.2% of residents moved to new address in same state 1.6% moved to a new state Popular move: New York Florida, followed by California Texas Linkage fields: first name, middle name, last name, date/year/month of birth, gender, house number (within-state only), street name (within-state only), date of registration (within-state only) Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 20 / 24

Incorporating Auxiliary Information on Migration Five-step process for across-state merge: 1 Within-state estimation on random sample of each state 2 Apply to full state to find non-movers and within-state movers 3 Subset out successful matches 4 Cross-state estimation on random sample to find cross-state movers 5 Apply estimates to each cross-state pair Use of prior distribution 1 Within-state merge: P(M ij = 1) non-movers + in-state movers N A N B P(γ address (i, j) = 0 M ij = 1) 2 Across-state merge: in-state movers in-state movers + non-movers P(M ij = 1) outflow from state A to state B N A N B Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 21 / 24

Merge Results Match count (millions) Match rate (%) False discovery rate (FDR; %) False negative rate (FNR; %) fastlink 0.75 0.85 0.95 Exact All 138.74 132.58 129.99 91.62 Within-state 127.38 127.12 126.80 91.36 Across-state 11.37 5.47 3.19 0.27 All 99.32 95.62 93.93 66.24 Within-state 92.06 91.87 91.66 66.05 Across-state 7.26 3.75 2.27 0.19 All 1.17 0.24 0.05 Within-state 0.08 0.04 0.01 Across-state 1.09 0.21 0.04 All 2.34 2.53 2.70 Within-state 1.81 1.95 2.10 Across-state 0.53 0.58 0.60 Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 22 / 24

Movers Found Origin State WY WV WA WI VT VA UT SD TN TX SC OR PARI OK OH NM NV NY NH NJ ND NE NC MO MS MT MN ME MI MD MA KY LA KS IN ID IL GA HI IA DE FL CO CT CA AR AZ AK AL Match Rates for Cross State Movers AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY Destination State Recover intra-northeast migration (NH MA, RI MA, DE PA) Recover out-migration to Florida (from CT, NJ, VA, NH, RI) Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 23 / 24

Concluding Remarks Merging data sets is critical part of social science research merging can be difficult when no unique identifier exists large data sets make merging even more challenging yet merging can be consequential Merging should be part of replication archive We offer a fast, principled, and scalable merging method that can incorporate auxiliary information Open-source software fastlink available at CRAN Used for validating self-reporeted turnout in ANES Ongoing research: 1 merging multiple administrative records over time 2 privacy-preserving record linkage Enamorado, Fifield, and Imai (PU/HU) Merging Large Data Sets Harvard (October 30, 2018) 24 / 24