Data Linkages - Effect of Data Quality on Linkage Outcomes


Anders Alexandersson
July 27, 2016

Introduction

Data linkage synonyms: record linkage, record matching, re-identification, entity heterogeneity, and merge/purge.

Aim: Determine the true match status of each comparison pair. A pair is a match if both records belong to the same individual and a non-match if they belong to different individuals.

Use linkage criteria to assign a link status to each comparison pair. A pair is a link if the records are classified as belonging to the same individual and a non-link if they are classified as belonging to different individuals.

The Problem

Ideally, all matches are classified as links, and all non-matches are classified as non-links. This presentation demonstrates how data quality affects linkage outcomes.

There are two types of possible errors:
- Type 1: False matches = linked non-matches ("false positives")
- Type 2: Missed matches = non-linked matches ("false negatives")

The Table of Confusion

The four outcomes can be displayed in a 2×2 "table of confusion" or "error matrix":

Figure 1: Table of Confusion
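To make the four outcomes concrete, here is a minimal Python sketch that tallies them from pairs of (true match status, assigned link status). The function name `confusion_counts` and the example pairs are hypothetical illustrations, not part of the original slides.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count TP, FP, FN, TN from (is_match, is_link) boolean tuples."""
    labels = {
        (True, True): "TP",    # match classified as link
        (False, True): "FP",   # non-match classified as link (false match)
        (True, False): "FN",   # match classified as non-link (missed match)
        (False, False): "TN",  # non-match classified as non-link
    }
    counts = Counter(labels[(bool(m), bool(l))] for m, l in pairs)
    return {key: counts.get(key, 0) for key in ("TP", "FP", "FN", "TN")}

# Hypothetical comparison pairs: (true match status, assigned link status)
example_pairs = [(True, True), (True, True), (True, False),
                 (False, True), (False, False), (False, False)]
table = confusion_counts(example_pairs)

# Rows are true match status, columns are assigned link status.
print("            Link  Non-link")
print(f"Match       {table['TP']:>4}  {table['FN']:>8}")
print(f"Non-match   {table['FP']:>4}  {table['TN']:>8}")
```

In practice the true match status is rarely known for every pair, which is why match-status data are so valuable for evaluation.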

Linkage Quality Measures

Match status errors:
- True positive rate (TPR), matching rate, sensitivity, power = TP / Matches
- False negative rate (FNR), miss rate, beta (alpha in R) error = FN / Matches
- False positive rate (FPR), false match rate, alpha (beta in R) error = FP / Non-matches
- True negative rate (TNR) or specificity = TN / Non-matches

Linkage errors:
- Positive predictive value (PPV) or precision = TP / Links
- False discovery rate, false match rate (again!) = FP / Links
- False omission rate = FN / Non-links
- Negative predictive value (NPV) = TN / Non-links

Record pairs quality measures:
- Accuracy = (TP + TN) / Record pairs
- Prevalence = Matches / Record pairs
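The rate definitions above can be computed directly from the four cell counts. The following Python sketch is only an illustration: the function name `linkage_measures` and the example counts are assumptions, and it assumes every denominator is nonzero.

```python
def linkage_measures(tp, fp, fn, tn):
    """Compute linkage quality measures from the four confusion-table counts.

    Assumes matches, non-matches, links, and non-links are all nonzero.
    """
    matches = tp + fn        # pairs that truly belong to the same individual
    non_matches = fp + tn    # pairs that truly belong to different individuals
    links = tp + fp          # pairs classified as the same individual
    non_links = fn + tn      # pairs classified as different individuals
    record_pairs = matches + non_matches
    return {
        "TPR": tp / matches,
        "FNR": fn / matches,
        "FPR": fp / non_matches,
        "TNR": tn / non_matches,
        "PPV": tp / links,
        "FDR": fp / links,
        "FOR": fn / non_links,
        "NPV": tn / non_links,
        "accuracy": (tp + tn) / record_pairs,
    }

# Hypothetical counts for 1,000 comparison pairs
measures = linkage_measures(tp=90, fp=5, fn=10, tn=895)
for name, value in measures.items():
    print(f"{name:>8}: {value:.4f}")
```

Note that the two "false match rate" definitions use different denominators (non-matches versus links), so they generally give different values for the same linkage.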

The Solution: Probabilistic Record Linkage

Probabilistic record linkage classifies comparison pairs using probabilities rather than fixed rules. This improves on traditional, simple rule-based, deterministic record linkage. The standard reference is Fellegi and Sunter (1969).

- FPR = FP / Non-matches = u-probability
- TPR = TP / Matches = m-probability

In practice, the process involves three key steps:
1. Preprocessing
2. Linking
3. Clerical review
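As a rough illustration of how m- and u-probabilities are used, the sketch below computes the usual Fellegi-Sunter agreement and disagreement weights for each field and sums them for one comparison pair. It assumes the common per-field reading (m = probability that a field agrees among matches, u = probability that it agrees among non-matches). The field names and probability values are hypothetical, and this is not the EM estimation that the RecordLinkage package performs.

```python
import math

def field_weights(m, u):
    """Return (agreement weight, disagreement weight) for one field.

    m: probability the field agrees given the pair is a match
    u: probability the field agrees given the pair is a non-match
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

def pair_weight(agreement, m_probs, u_probs):
    """Sum field weights for one pair; agreement maps field -> True/False."""
    total = 0.0
    for field, agrees in agreement.items():
        agree_w, disagree_w = field_weights(m_probs[field], u_probs[field])
        total += agree_w if agrees else disagree_w
    return total

# Hypothetical probabilities, chosen only for illustration
m_probs = {"ssn": 0.95, "sex": 0.98}
u_probs = {"ssn": 0.001, "sex": 0.5}

both_agree = pair_weight({"ssn": True, "sex": True}, m_probs, u_probs)
ssn_differs = pair_weight({"ssn": False, "sex": True}, m_probs, u_probs)
print(f"both agree:  {both_agree:.2f}")
print(f"ssn differs: {ssn_differs:.2f}")
```

A highly discriminating identifier such as SSN has a very small u-probability, so agreement on it contributes a much larger weight than agreement on sex.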

Step 1: Preprocessing

Typically, preprocessing consists of two substeps:
1. Parse a field (variable, column) into the relevant subcomponents.
2. Standardize common character strings.

Several data linkage programs have no preprocessing features. Examples are BigMatch and the R package RecordLinkage. For preprocessing, any good statistical software will work. We use the NYSIIS phonetic code to handle spelling mistakes in names.
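The two substeps can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format ("Last, First Middle Suffix"), the lookup tables, and the function names are hypothetical, and the sketch does not implement NYSIIS.

```python
import re

# Hypothetical lookup tables; a real project would use a fuller list.
STANDARD_FORMS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}
SUFFIXES = {"JR", "SR", "II", "III"}

def parse_name(raw):
    """Substep 1: parse a 'Last, First Middle Suffix' field into components."""
    cleaned = re.sub(r"[^A-Za-z,\s]", "", raw).upper()
    last, _, rest = cleaned.partition(",")
    tokens = rest.split()
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else ""
    first = tokens[0] if tokens else ""
    middle_initial = tokens[1][0] if len(tokens) > 1 else ""
    return {"last": last.strip(), "first": first,
            "mi": middle_initial, "suffix": suffix}

def standardize_address(raw):
    """Substep 2: uppercase, drop punctuation, and standardize common strings."""
    tokens = re.sub(r"[^A-Za-z0-9\s]", "", raw).upper().split()
    return " ".join(STANDARD_FORMS.get(token, token) for token in tokens)

print(parse_name("Smith, John A. Jr"))
print(standardize_address("123 Main Street, Apt. 4"))
```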

Example code

Here is example code in Stata. The original data are in R.

    . R: load("rldata500.rda")
    . R: load.data(rldata500)
    . decode fname_c1, gen(fname_c1s)
    . nysiis fname_c1s, gen(nysf)
    . list fname_c1 nysf in 1/3

         fname_c1   nysf
      1. CARSTEN    carstan
      2. GERD       gad
      3. ROBERT     rabad

Step 2: Linking

At FCDS, we use the user-written R package RecordLinkage for the linking. We used to use the software AutoMatch.

Example code in R:

    rpairs <- compare.linkage(rmort1, rpatient1, blockfld = c("ssn", "sex"),
                              strcmp = 4:7,
                              exclude = c("pid", "address", "st", "county", "zip", "mi"))
    rpairs$pairs[c(1:5), ]                # list obs 1-5, comparison pattern only
    rpairs <- emWeights(rpairs)           # calculate EM weights
    summary(rpairs)                       # show weight distribution (### pairs)
    tail(getPairs(rpairs, 40, 30))        # review obs to determine thresholds
    result <- emClassify(rpairs, 40, 30)  # classification
    summary(result)
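The classification step with two thresholds can be pictured with a short Python sketch. It is a simplified stand-in, not the RecordLinkage implementation: the function name `classify` is hypothetical, the default thresholds of 40 and 30 are taken from the example call above, and inclusive lower boundaries are an assumption.

```python
def classify(weight, upper=40, lower=30):
    """Assign a link status from a pair's weight and two thresholds.

    Assumes boundaries are inclusive on the lower side of each class.
    """
    if weight >= upper:
        return "link"
    if weight >= lower:
        return "possible link"  # candidates for clerical review
    return "non-link"

# Hypothetical pair weights
weights = [52.1, 40.0, 35.7, 29.9, -3.4]
for w in weights:
    print(f"{w:>6.1f} -> {classify(w)}")
```

Pairs that fall between the two thresholds are the ones a reviewer would typically inspect by hand.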

Example output in R

Figure 2: Linkage result in R

Step 3: Clerical Review

At FCDS, we use the user-written Stata command clrevmatch for the clerical review.

Example code in Stata:

    clrevmatch using cler_reviewed_14jul2016, idm(mort_id) idu(pat_id) ///
        varm(pass mort_id id1 fname_1 lname_1 ssn_1 dob_1 sex_1 race_1) ///
        varu(pass pat_id id2 fname_2 lname_2 ssn_2 dob_2 sex_2 race_2) ///
        clrev_result(crev) clrev_label(0 "not match" 1 "match") ///
        clrev_note(crnote) ///
        rlscoremin(30) rlscoremax(45) reclinkscore(weight) ///
        nobssave(1) replace saveold

Data Linkage Requirements

1. At a minimum, the following information is required to link records with FCDS: First Name, Last Name, Sex, and Date of Birth and/or Social Security Number.
2. Additional information such as Middle Initial, Alias Name, Maiden Name, Race, Street Address, City, State, Zip Code, and Birthplace improves linkage outcomes.
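A simple pre-submission check of the minimum requirement might look like the Python sketch below. The field names (`first_name`, `last_name`, `sex`, `dob`, `ssn`) and the function name are hypothetical and do not reflect any FCDS file layout.

```python
# Hypothetical field names for the minimum identifiers
REQUIRED = ("first_name", "last_name", "sex")
ONE_OF = ("dob", "ssn")  # date of birth and/or Social Security Number

def meets_minimum(record):
    """Return True if a record carries the minimum identifiers for linkage."""
    def present(field):
        value = record.get(field)
        return value is not None and str(value).strip() != ""
    return (all(present(field) for field in REQUIRED)
            and any(present(field) for field in ONE_OF))

# Hypothetical records
complete = {"first_name": "JANE", "last_name": "DOE", "sex": "F", "dob": "1970-01-31"}
incomplete = {"first_name": "JANE", "last_name": "DOE", "sex": "F", "ssn": " "}
print(meets_minimum(complete), meets_minimum(incomplete))
```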

Conclusion

Data quality is central to data linkage outcomes!
1. Quality of identifiers: Most important.
2. Quality of linkage methods: Probabilistic linkage is recommended but has limitations.
3. Quality of evaluation: Keeping a clerical review note is better than usual practice. Match-status data would be best.

Future work:
1. Improve the existing code template. For example, Stata users can use more efficient code with the command Rcall than with rsource.
2. Learn more R to better understand the package RecordLinkage. For example, it is possible but very challenging to create match-status data. R users can use Stata code with the package RStata.
3. Stay on top of methods. Examples are machine learning and literate programming.
4. Stay on top of software developments. For instance, a new version of LinkPlus is expected this year.
