Overview of Record Linkage Techniques


Debra Daniela Lindsey

LinkageWiz Software

Record linkage, or data matching, is the process of identifying records that relate to the same entity (e.g. patient, customer, household) in one or more data sets that do not share a unique database key.

1.1 Deterministic Record Linkage

The simplest type of linkage involves exact matches on unique identifiers, or on combinations of fields that uniquely identify given individuals. This type of linkage is known as Deterministic Record Linkage. All identifiers must agree for a link to be made. It works well for unique identifiers such as Medical Record Number, Social Security Number or Driver's License Number. However, it performs poorly for non-unique identifiers such as name and date of birth: names are frequently misspelt, nicknames or aliases are used, and dates of birth are often estimated.
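As a minimal sketch of deterministic linkage, the Python example below links two small record lists only when every chosen field agrees exactly. The record layout, the field names (`mrn`, `name`, `dob`) and the function `deterministic_link` are hypothetical and serve only as illustration.

```python
def deterministic_link(records_a, records_b, key_fields):
    """Link records whose values agree exactly on every field in key_fields."""
    # Index the second data set by the exact key so each lookup is cheap.
    index = {}
    for rec in records_b:
        key = tuple(rec[f] for f in key_fields)
        index.setdefault(key, []).append(rec)

    links = []
    for rec in records_a:
        key = tuple(rec[f] for f in key_fields)
        for match in index.get(key, []):
            links.append((rec["id"], match["id"]))
    return links


# Hypothetical sample data: same people, but names are recorded differently.
hospital = [
    {"id": "A1", "mrn": "100234", "name": "Catherine Smith", "dob": "1970-03-12"},
    {"id": "A2", "mrn": "100777", "name": "John Brown", "dob": "1985-11-02"},
]
clinic = [
    {"id": "B1", "mrn": "100234", "name": "Kate Smith", "dob": "1970-03-12"},
    {"id": "B2", "mrn": "100777", "name": "Jon Brown", "dob": "1985-11-02"},
]

links_by_mrn = deterministic_link(hospital, clinic, ["mrn"])
links_by_name_dob = deterministic_link(hospital, clinic, ["name", "dob"])
print(links_by_mrn)
print(links_by_name_dob)
```

In this sample the unique identifier links both people, while the name and date of birth combination links neither, because a nickname and a spelling variation break the exact match.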

1.2 Fuzzy Matching

Another method is Fuzzy Matching: partial matches are permitted, and matches are usually determined according to a number of subjective rules created by the user. For example, it might be determined that a pair of records should be linked if the first initial, family name and date of birth agree, or if the first name, family name and address agree but the day, month or year of birth disagrees. While this method appears relatively simple, a large number of rules and exceptions may need to be specified to maximize the accuracy of the linkage process. Alternatively, a simple scoring system is sometimes used.

1.3 Probabilistic Record Linkage

A very popular method is Probabilistic Record Linkage. Unlike deterministic record linkage, typographical differences and other errors do not preclude candidate record pairs from being matched. Probabilities derived from a large reference dataset of known linkages and non-linkages are used to derive Linkage Weights for each field. Separate weights are derived for field agreements, disagreements and missing values. The linkage weights are higher for variables with more specificity (such as family name), and lower for variables with less specificity (such as sex).

The weights are calculated from the logarithm of the frequency ratio of the field being examined, where:

Weight = LOG2(Frequency of agreement in LINKED pairs / Frequency of agreement in UNLINKED pairs)

Example Calculation - Family Name: If the family name agrees in 90% of linked pairs and only 1% of unlinked pairs, then:

Field Agreement Weight = LOG2(90/1) = 6.49

Example Calculation - Sex: If sex agrees in 95% of linked pairs and 50% of unlinked pairs, then:

Field Agreement Weight = LOG2(95/50) = 0.93

During the linkage process, the agreement or disagreement weight for each field is added to derive a combined score that reflects how likely it is that the records refer to the same entity.
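The weight calculation above can be sketched in Python as follows. The agreement weight follows the formula given in the text. The disagreement weight is an assumption: it uses the complementary frequencies (how often the field disagrees in linked and unlinked pairs), which is a common convention but is not spelled out in this overview. Treating missing values as contributing zero is likewise a simplification, and the names `FIELD_FREQS` and `linkage_score` are hypothetical.

```python
import math


def agreement_weight(m, u):
    """log2 of (agreement frequency in linked pairs / in unlinked pairs)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Assumed form: log2 of the ratio of the DISagreement frequencies."""
    return math.log2((1 - m) / (1 - u))


# (m, u) = agreement frequency in linked pairs, in unlinked pairs.
# The values are the two worked examples from the text.
FIELD_FREQS = {
    "family_name": (0.90, 0.01),
    "sex": (0.95, 0.50),
}


def linkage_score(rec_a, rec_b, field_freqs=FIELD_FREQS):
    """Sum the per-field weights for one candidate record pair."""
    score = 0.0
    for field, (m, u) in field_freqs.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # simplification: a missing value contributes nothing
        if a == b:
            score += agreement_weight(m, u)
        else:
            score += disagreement_weight(m, u)
    return score


pair_score = linkage_score(
    {"family_name": "SMITH", "sex": "F"},
    {"family_name": "SMITH", "sex": "F"},
)
print(round(agreement_weight(0.90, 0.01), 2))  # family name agreement weight
print(round(agreement_weight(0.95, 0.50), 2))  # sex agreement weight
print(round(pair_score, 2))
```

Agreement on family name moves the score far more than agreement on sex, which matches the point that more specific fields carry higher weights.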
There is usually a threshold above which a pair is considered a match (True Linkage), and another threshold below which it is considered not to be a match (Non-Linkage). Between the two thresholds a pair is considered to be a Potential Linkage and may require manual review by a clerical officer. There are no precise rules for determining these thresholds, as they are affected by a range of factors, including data quality, the characteristics of the population being studied and many others. A histogram of the Linkage Scores is often useful for displaying patterns in the data, which can then be used to estimate the thresholds.

The following graph indicates the relationship between scores, linkage thresholds and linkage errors:

[Graph: linkage scores, linkage thresholds and linkage errors]

There are large numbers of record pairs with lower scores (non-linkages), and smaller numbers with higher scores, indicating true linkages. In this example, there appears to be a natural break in the curve at scores higher than 21-23, with a mid-range around 17-19, and a steep increase in the curve below 17. This would tend to indicate that true linkages have a score of 21 or higher, whilst potential linkages might fall in the range between these two values and non-linkages have a score of less than 17.

As illustrated in the graph, probabilistic record linkage almost always includes a number of False Positives (records that have been linked but do not belong to the same individual) as well as False Negatives (records that have not been linked but really do belong to the same individual). Increasing the True Linkage threshold will reduce the number of false positives, but at the expense of more false negatives, and vice versa. When undertaking record linkage in a medical setting it is important to set a reasonably high threshold so that results from different patients are not inadvertently combined, whereas for a police investigation the threshold might be lowered to maximize the chance of detecting a specific criminal.
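The two-threshold decision and the trade-off between error types can be illustrated with a short Python sketch. The thresholds 17 and 21 come from the example in the text. The scored pairs are invented for illustration, and the sketch assumes that only pairs at or above the True Linkage threshold are linked automatically.

```python
def classify(score, lower=17.0, upper=21.0):
    """Assign a linkage score to one of the three outcome classes."""
    if score >= upper:
        return "True Linkage"
    if score < lower:
        return "Non-Linkage"
    return "Potential Linkage"


def error_counts(scored_pairs, upper):
    """Count (false positives, false negatives) for a True Linkage threshold.

    scored_pairs holds (score, is_real_match) tuples with known truth.
    """
    false_pos = sum(1 for score, is_match in scored_pairs
                    if score >= upper and not is_match)
    false_neg = sum(1 for score, is_match in scored_pairs
                    if score < upper and is_match)
    return false_pos, false_neg


# Hypothetical scored pairs: (linkage score, whether the pair truly matches).
SCORED_PAIRS = [
    (25.0, True), (22.0, True), (21.5, False), (19.0, True),
    (18.0, False), (12.0, False), (9.0, False),
]

for upper in (18.0, 21.0, 23.0):
    print(upper, error_counts(SCORED_PAIRS, upper))
```

With this sample data, raising the threshold from 18 to 23 removes the false positives but adds false negatives, which is the trade-off described above.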

It is not possible to eliminate errors entirely by adjusting the thresholds, as reducing one type of error results in an increase in the other type.

1.3.1 Clerical Review

Clerical review describes the process by which potential linkages are manually reviewed. Additional information may be sought from the original custodian to confirm a possible link, or, if several records for an individual have already been linked, it may be possible to confirm the link using information from the other records. Depending upon the type of data it may also be important to scan for false positives such as twins or triplets. These cases usually have a high linkage score and only the first or middle initials differ; all other demographic fields are usually in agreement.

1.3.2 Blocking

Blocking is the process of stratifying the linkage process to reduce the number of comparisons that must be undertaken. If blocking were not used, every record would need to be compared with every other record in the data set. The number of comparisons would then grow with the square of the number of records, severely degrading system performance. The table below indicates the number of comparisons that would be required for three different sized database tables:

Number of Records    Number of Comparisons
1,000                1,000,000
10,000               100,000,000
1,000,000            1,000,000,000,000

Blocking thus subdivides data into a set of mutually exclusive subsets (blocks) under the assumption that no matches occur across different blocks. Blocks are typically based upon fields such as family name, date of birth or business name. As blocking fields may be subject to typographical and spelling errors, they are usually standardized by applying a phonetic coding system such as NYSIIS or by applying a data standardization scheme.

In practice some matches may actually occur across blocks. For example, records for a woman who has changed her family name might not be linked in a block based on family name, but would be linked if a subsequent blocking pass on date of birth was run. For this reason multiple blocking variables are often used, maximizing the likelihood that a linkage missed by one pass will be detected by a subsequent blocking pass.
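The effect of blocking, and of running more than one blocking pass, can be sketched in Python as below. The records and the function `candidate_pairs` are hypothetical. For brevity the family-name block uses the upper-cased raw value, whereas in practice a phonetic code would normally be used as the blocking key.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Return the record-id pairs that share a block under block_key."""
    blocks = defaultdict(list)
    for rec in records:
        key = block_key(rec)
        if key:  # records with an empty key are not placed in any block
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical data: records 1 and 2 are the same woman before and after
# a change of family name.
records = [
    {"id": 1, "family_name": "Taylor", "dob": "1980-05-17"},
    {"id": 2, "family_name": "Nguyen", "dob": "1980-05-17"},
    {"id": 3, "family_name": "Taylor", "dob": "1962-01-30"},
    {"id": 4, "family_name": "Brown", "dob": "1975-09-09"},
]

pass_1 = candidate_pairs(records, lambda r: r["family_name"].upper())
pass_2 = candidate_pairs(records, lambda r: r["dob"])
all_candidates = pass_1 | pass_2
unblocked_count = len(records) * (len(records) - 1) // 2

print(sorted(pass_1))
print(sorted(pass_2))
print(len(all_candidates), "of", unblocked_count, "possible pairs compared")
```

The family-name pass never compares records 1 and 2, but the date-of-birth pass does, so the union of the two passes recovers the pair while still comparing fewer pairs than an unblocked run.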

1.3.3 Phonetic Algorithms

A phonetic algorithm is an algorithm for indexing words by their pronunciation.

Soundex, the best-known algorithm, was developed to provide a manual filing code for USA Census documents in the early 1900s. Soundex codes are four-character strings composed of one letter followed by three numbers. For example, Johnson → J525

New York State Identification and Intelligence System, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System. The result is a string that can be pronounced by the reader without decoding. Unlike the Soundex algorithm, relative vowel positioning is maintained. For example, Johnson → JASAN

Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It is more accurate than Soundex because it uses a larger set of rules for English pronunciation. For example, Johnson → JNSN

The main use of phonetic algorithms is during blocking, to ensure that records with common spelling variations are included in the subset of records being compared. Blocking variables are usually based on the phonetic representation of a field rather than the raw values.

Phonetic algorithms should not be used to compare fields, as they do not possess sufficient specificity. The example below illustrates that two quite different first names share the same Soundex code. For example, John → J500, Jane → J500

In this example, accepting agreement on Soundex codes would result in a false positive. When comparing two fields it is more appropriate to use a string comparison algorithm, which typically measures the difference between two strings, also known as the edit distance. Such functions include the Levenshtein distance and the Jaro-Winkler distance.
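To make the contrast between phonetic codes and string comparison concrete, here is a small Python sketch of Soundex and of the Levenshtein distance. It follows the commonly published American Soundex rules and the standard dynamic-programming form of Levenshtein. It is an illustration only, not necessarily the variant used by any particular linkage product.

```python
# Letter-to-digit table for Soundex; vowels, H, W and Y have no code.
SOUNDEX_CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit:
            if digit != prev:  # adjacent letters with the same code count once
                code += digit
            prev = digit
        elif ch not in "HW":   # vowels separate repeated codes; H and W do not
            prev = ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")  # pad short codes with zeros


def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution
        prev_row = row
    return prev_row[-1]


print(soundex("Johnson"))                  # J525
print(soundex("John"), soundex("Jane"))    # J500 J500
print(levenshtein("JOHN", "JANE"))         # 3
print(levenshtein("JOHNSON", "JOHNSTON"))  # 1
```

John and Jane collide on the same Soundex code, yet their edit distance of 3 on four-letter names shows they are quite different strings, while Johnson and Johnston differ by a single edit. This is why the phonetic code suits blocking and the string comparison suits field comparison.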

1.3.4 Data Standardization

Standardization of the data prior to linkage is very important for reducing variability and thereby increasing the accuracy of the linkage process. It involves removing special characters such as punctuation and extraneous spaces, ensuring the consistent use of upper and lower case, and removing invalid numerics and other noise data. Individual words such as address elements are replaced with standardized words or abbreviations. Organizational noise is removed from business names. For example:

Street → St
Acme Motors International Pty Ltd → ACME MOTORS

More Information

For a detailed description of probabilistic linkage techniques, refer to reference material such as the following publication by Howard Newcombe:

Newcombe, H.B. Handbook of Record Linkage Methods for Health and Statistical Studies, Administration and Business, Oxford, U.K., Oxford University Press.
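The standardization steps described in section 1.3.4 can be sketched in Python as follows. The abbreviation table and the list of organizational noise words are small hypothetical samples, not a complete standardization scheme, and the sketch converts everything to upper case as its way of making case consistent.

```python
import re

# Hypothetical lookup tables; a real scheme would be far more extensive.
ADDRESS_ABBREVIATIONS = {"STREET": "ST", "ROAD": "RD", "AVENUE": "AVE"}
BUSINESS_NOISE = {"INTERNATIONAL", "PTY", "LTD", "LIMITED", "INC"}


def clean(text):
    """Replace punctuation with spaces, collapse spaces, and upper-case."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().upper()


def standardize_address(text):
    """Replace address elements with their standard abbreviations."""
    words = clean(text).split()
    return " ".join(ADDRESS_ABBREVIATIONS.get(w, w) for w in words)


def standardize_business_name(text):
    """Drop organizational noise words from a business name."""
    words = clean(text).split()
    return " ".join(w for w in words if w not in BUSINESS_NOISE)


print(standardize_address("12  Main Street."))
print(standardize_business_name("Acme Motors International Pty Ltd"))
```

Running both values being compared through the same functions before linkage means that differences in punctuation, spacing, case and abbreviation no longer count as disagreements.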
