Data Linkage Methods: Overview of Computer Science Research

Size: px

Start display at page:

Download "Data Linkage Methods: Overview of Computer Science Research"

Rudolf Lyons
6 years ago
Views:

1 Data Linkage Methods: Overview of Computer Science Research Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia Contact: January 2013 p. 1/24

2 Outline Recent interest in data linkage Applications and challenges The data linkage process Types of data linkage techniques Improving scalability: Indexing techniques Improving linkage quality: Learning techniques Synthetic data: Advantages and challenges Privacy-preserving record linkage Challenges and research directions January 2013 p. 2/24

3 Recent interest in data linkage Traditionally, data linkage has been used in statistics (census) and health (epidemiology) In recent years, increased interest from businesses and governments Increased computing power and storage capacities Massive amounts of data are being collected Often data from different sources need to be integrated Need for data sharing between organisations Data mining (analysis) of large data collections E-Commerce and Web applications Geocode matching and spatial data analysis January 2013 p. 3/24

4 Applications of data linkage Remove duplicates in one data set (internal linkage) Merge new records into a larger master data set Create patient or customer oriented statistics (for example for longitudinal studies) Clean and enrich data for analysis and mining Geocode matching (with reference address data) Widespread use of data linkage Immigration, taxation, social security, census Fraud, crime and terrorism intelligence Business mailing lists, exchange of customer data Biomedical and social science research January 2013 p. 4/24

5 Data linkage challenges Often no unique entity identifiers are available Real world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.) Scalability Naïve comparison of all record pairs is quadratic Blocking, searching, or filtering is needed (indexing) No training data in many linkage applications (no record pairs with known true match status) Privacy and confidentiality (because personal information, like names and addresses, are commonly required for linking) January 2013 p. 5/24

6 The data linkage process Database A Database B Data pre processing Data pre processing Indexing / Searching Matches Comparison Classif ication Non matches Evaluation Clerical Review Potential Matches January 2013 p. 6/24

7 Types of data linkage techniques Deterministic linkage Exact linkage (if a unique identifier of high quality is available: precise, robust, stable over time) Examples: Social security or driver s licence numbers Rule-based linkage (complex to build and maintain) Probabilistic linkage (Fellegi and Sunter, 1969) Use available attributes for linkage (often personal information, like names, addresses, dates of birth, etc.) Discussed in Bill s talk Computer science approaches (based on machine learning, data mining, database, or information retrieval techniques) January 2013 p. 7/24

8 Improving scalability: Indexing Number of record pair comparisons equals the product of the sizes of the two databases (linking two databases containing 1 and 5 million records will result in 1,000,000 5,000,000 = record pairs) Number of true matches is generally less than the number of records in the smaller of the two databases (assuming no duplicate records) Performance bottleneck is usually the (expensive) detailed comparison of attribute values between records (using approximate string comparison functions) Aim of indexing: Cheaply remove record pairs that are obviously not matches January 2013 p. 8/24

9 Traditional blocking Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value) Problems with traditional blocking An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this) Values of blocking variable should have uniform frequencies (as the most frequent values determine the size of the largest blocks) Example: Frequency of Smith in NSW: 25,425 January 2013 p. 9/24

10 Recent indexing approaches (1) Sorted neighbourhood approach Sliding window over sorted databases Use several passes with different blocking variables Q-gram based blocking (e.g. 2-grams / bigrams) Convert values into q-gram lists, then generate sub-lists peter [ pe, et, te, er ], [ pe, et, te ], [ pe, et, er ],.. pete [ pe, et, te ], [ pe, et ], [ pe, te ], [ et, te ],... Each record will be inserted into several blocks Overlapping canopy clustering Based on q-grams and a cheap similarity measure, such as Jaccard (set intersection) or TF-IDF/Cosine January 2013 p. 10/24

11 Recent indexing approaches (2) StringMap based blocking Map strings into a multi-dimensional space such that distances between pairs of strings are preserved Use similarity join to find similar pairs (close strings) Suffix array based blocking Generate suffix array based inverted index (suffix array: peter eter, ter, er, r ) Post-blocking filtering (for example, string length or q-grams count differences) US Census Bureau: BigMatch (pre-process smaller data set so its values can be directly accessed; with all blocking passes in one go) January 2013 p. 11/24

12 Improving linkage quality: Learning techniques View record pair classification as a multidimensional binary classification problem (use numerical attribute similarities to classify record pairs as matches or non-matches) Many machine learning techniques can be used Supervised: Decision trees, SVMs, neural networks, learnable string comparisons, active learning, etc. Un-supervised: Various clustering algorithms Recently, collective classification techniques have been investigated (build graph of database and do overall classification, rather than each record pair independently) January 2013 p. 12/24

13 Collective linkage example Paper 2? w1=? Dave White w3=?? Paper 6 Susan Grey Paper 3 Intel Liz Pink MIT w2=? Don White w4=? Paper 5 John Black Paper 1 CMU Paper 4 Joe Brown (A1, Dave White, Intel) (A2, Don White, CMU) (A3, Susan Grey, MIT) (A4, John Black, MIT) (A5, Joe Brown, unknown) (A6, Liz Pink, unknown) (P1, John Black / Don White) (P2, Sue Grey / D. White) (P3, Dave White) (P4, Don White / Joe Brown) (P5, Joe Brown / Liz Pink) (P6, Liz Pink / D. White) Adapted from Kalashnikov and Mehrotra, ACM TODS, 31(2), 2006 January 2013 p. 13/24

14 Managing transitive closure a1 a2 a3 a4 If record a1 is classified as matching with record a2, and record a2 as matching with record a3, then records a1 and a3 must also be matching. Possibility of record chains occurring Various algorithms have been developed to find optimal solutions (special clustering algorithms) Collective classification deals with this problem by default January 2013 p. 14/24

15 Classification challenges In many cases there is no training data available Possible to use results of earlier linkage projects? Or from manual clerical review process? How confident can we be about correct manual classification of potential matches? Often there is no gold standard available (no data sets with known true match status) No large test data set collection available (like in information retrieval or machine learning) Much computer science research in data linkage is using bibliographic data (which have very different characteristics) January 2013 p. 15/24

16 Synthetic data: Advantages Privacy issues prohibit publication of real personal information De-identified or encrypted data cannot be used to link databases (as real name and address values are required) Several advantages of synthetic data Volume and characteristics can be controlled (errors and variations in records, number of duplicates, etc.) It is known which records are duplicates of each other, and so linkage quality can be calculated Data and the data generator program can be published (allowing others to repeat experiments) January 2013 p. 16/24

17 Synthetic data: Challenges Modelling the content and characteristics of real data (frequencies of values; variations and errors) Modelling dependencies between attributes (for example, given names often depend on gender) Several data generators have been developed Hernandez and Stolfo (mid 1990s): Only based on value tables, no frequencies, simple typographic errors Bertolazzi et al. (2003): Added frequency tables, allowed missing values, still simple error generation Christen et al. (2009): Added look-up tables with misspellings and nicknames, model different error types (typing, phonetic, OCR), attribute dependencies, generate households January 2013 p. 17/24

Modelling of variations and errors Printed

attr swap, repl cc (ph) sub, ins, del attr

del, trans attr swap, repl cc (ocr) sub,

ins, del Electronic document cc (ph and or

substitution ins : insertion wc split, merge

recognition del : deletion trans : transpose

18 Modelling of variations and errors Printed Handwritten Memory cc (ph) sub, ins, del attr swap, repl cc (ph) sub, ins, del attr swap, repl Dictate Typed cc (ty) sub, ins, del, trans attr swap, repl cc (ocr) sub, ins, del wc split, merge OCR cc (ph) sub, ins, del Electronic document cc (ph and or ty) sub, ins, del, trans cc (ph,ty) sub, ins, del, trans Abbreviations: cc : character change wc : word change subs : substitution ins : insertion wc split, merge - attr swap, repl attr swap, repl Speech recognition del : deletion trans : transpose repl : replace ty : typographic ph : phonetic attr : attribute January 2013 p. 18/24

19 Privacy-preserving record linkage Alice (2) (1) (2) Bob Alice (1) (2) (2) Bob (3) (3) (3) Carol (3) Assume two data sources, and possibly a third (trusted) party to conduct the linkage Objective: No party learns about the other parties private data, only linked records are revealed Various approaches with different assumptions about threats, what can be inferred by parties, and what is being released Based on some form of encoding or encryption techniques January 2013 p. 19/24

20 Privacy technologies Secure hash encoding ( peter 4R#x+Y4i!e@t4o] ) Generalisation techniques (replace specific values with more general ones, using for example k-anonymity) Secure multi-party computation (calculate a function such that parties only learn the final result) Differential privacy (statistical database queries) Bloom filters (e.g. bit-strings for set membership testing) Reference values (e.g. public telephone directory) Phonetic encoding (such as Soundex, NYSIIS, etc.) Add extra random values (hide sensitive real values) January 2013 p. 20/24

21 The state of PPRL Significant advances to achieving the goal of PPRL have been developed in recent years Various approaches based on different techniques Can link records securely, approximately, and in a (somewhat) scalable fashion So far, most PPRL techniques concentrated on approximate matching techniques, and on making PPRL more scalable to large databases However, no large-scale comparative evaluations of PPRL techniques have been published Only limited investigation of classification and linking assessment in PPRL January 2013 p. 21/24

22 Challenges and research directions Improved classification for data linkage (collective classification for personal data) Linking data from many sources Use cloud computing platforms for large-scale data linkage (cost efficiency) Real-time linking and linking dynamic data Develop and implement frameworks for data linkage that allow comparative studies of different techniques (benchmarks) Develop practical PPRL techniques (that facilitate accurate and automatic classification, as well as evaluation of linkage quality and completeness) January 2013 p. 22/24

23 Definition: Privacy-preserving record linkage Assume O 1 O d are the d owners of their respective databases D 1 D d They wish to determine which of their records r i 1 D 1, r j 2 D 2,, and r k d D d, match according to a decision model C(r i 1, rj 2,, rk d ) that classifies pairs (or groups) of records into one of the two classes M of matches, and U of non-matches O 1 O d do not wish to reveal their actual records r i 1 rk d with any other party (they are however prepared to disclose to a (maybe external) selected party the actual values of some attributes of the record pairs that are in class M to allow further analysis) January 2013 p. 23/24

24 Secure multi-party computation Compute a function across several parties, such that no party learns the information from the other parties, but all receive the final results [Yao 1982; Goldreich 1998/2002] Simple example: Secure summation s = i x i. Step 0: Z=999 Step 4: s = 1169 Z = 170 Party 1 x1=55 Step 1: Z+x1= 1054 Party 2 Step 2: (Z+x1)+x2 = 1127 x2=73 Step 3: ((Z+x1)+x2)+x3=1169 Party 3 x3=42 January 2013 p. 24/24

Privacy-Preserving Data Sharing and Matching

Privacy-Preserving Data Sharing and Matching Peter Christen School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia Contact: