Privacy Preserving Probabilistic Record Linkage


1 Privacy Preserving Probabilistic Record Linkage
Duncan Smith, Natalie Shlomo
Social Statistics, School of Social Sciences, University of Manchester
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/ ) under grant agreement n (DwB - Data without Boundaries).

2 Topics Covered
Introduction
Probabilistic Record Linkage
String Anonymisation
Putting the probabilities back into Privacy Preserving Record Linkage
Experiment
Discussion

3 Introduction
Probabilistic record linkage was developed by Fellegi and Sunter (1969).
Administrative sources are being used to improve the quality of surveys or to replace traditional censuses.
Traditionally, all datasets are held in one location (NSI) and matching variables (first name, last name, address) are used to link data without the need for anonymisation.
Data on individuals may be in distinct databases and may be owned by different custodians: Alice (A) and Bob (B).
Privacy restrictions prevent the release of certain variables, or information that uniquely identifies an individual is suppressed or coarsened.

4 Introduction
The computer science (CS) literature provides techniques for anonymising identifying variables.
A third party (Carole) sees only the matching variables and returns pairs of unique record IDs (assigned by Alice and Bob).
Two possible scenarios (there are more):
Trusted Carole sees the true values of a single matching variable.
Non-trusted Carole sees anonymised values of a single matching variable.
Privacy preserving record linkage (PPRL) allows exact matching and can allow linkage based on similarity scores generated from anonymised values.
Fellegi and Sunter (F&S) probabilistic record linkage is typically not used in PPRL.

5 Introduction
Alice and Bob clean, harmonize and standardize their data and anonymise the matching variables (using the same method and seed).
In our new approach, we apply probabilistic record linkage to anonymised values to obtain a probability of a correct match (PPPRL).
Motivation: data can be held within an archive, and users can carry out PPPRL within a black box for dynamic database integration.
Three-party (Alice, Bob, Carole) scenario, as set out in the UK Beyond 2011 project, where Carole has access to original values and can calculate string comparators.
In PPRL there is no possibility of clerical review, and links are classified into 2 classes: true matches and false matches.

6 Probabilistic Record Linkage
F&S probabilistic record linkage uses a Binomial EM algorithm based on agree/disagree indicators $\gamma_i$, $i = 1, \dots, p$, to estimate the likelihood ratio.
The matching score is the sum of the logs of the likelihood ratios, $\sum_{i=1}^{p} \log\big( m(\gamma_i) / u(\gamma_i) \big)$, where $m(\gamma_i)$ is the probability of agreement given that the pair is a match and $u(\gamma_i)$ is the probability of agreement given that it is not a match.
String comparators, e.g. Jaro-Winkler, are used to adjust the matching score for partial agreements caused by, e.g., typing errors.
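The matching score on this slide can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `match_weight` and the m and u values are hypothetical, and the sketch assumes the usual convention that a disagreement contributes log((1 - m) / (1 - u)).

```python
import math

def match_weight(agreements, m, u):
    """Sum of log likelihood ratios over the matching variables.

    agreements: list of bools, one per matching variable (True = agree)
    m[i]: P(agree on variable i | pair is a match)
    u[i]: P(agree on variable i | pair is not a match)
    """
    total = 0.0
    for agree, m_i, u_i in zip(agreements, m, u):
        if agree:
            total += math.log(m_i / u_i)
        else:
            # Assumed convention: disagreement uses the complementary probabilities.
            total += math.log((1.0 - m_i) / (1.0 - u_i))
    return total

# Hypothetical parameters for three matching variables (illustrative only).
m = [0.95, 0.90, 0.85]
u = [0.50, 0.10, 0.01]
print(match_weight([True, True, True], m, u))    # positive: evidence for a match
print(match_weight([True, False, False], m, u))  # negative: evidence against
```

Pairs are then ranked by this weight and classified by comparing it with a threshold.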

7 String Anonymisation
String anonymisation can use hash functions on bigrams:
'john' → {'jo', 'oh', 'hn'} → { , , }
'jon' → {'jo', 'on'} → { , }
Minwise hashing (Broder 1997) generates a random permutation of a set of elements and returns the hash for the first ordered element.
The probability of a hash collision on the first ordered element is the Jaccard similarity score: $J(A, B) = |A \cap B| / |A \cup B|$
The Jaccard similarity score is estimated from many hash values, where the number of collisions $n$ is distributed as $n \sim \mathrm{Bin}(m, J(A, B))$ ($m$ is the number of hash functions),
and is estimated by $\hat{J}(A, B) = n / m$
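A minimal sketch of minwise hashing on bigrams follows. It rests on one assumption not in the slides: each random permutation is simulated by a seeded SHA-256 hash (the helper `_h` is hypothetical), which is a common shortcut rather than a true permutation. The estimate is the fraction of the m positions where the two signatures collide.

```python
import hashlib

def bigrams(s):
    """Set of character bigrams, e.g. 'john' -> {'jo', 'oh', 'hn'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """True Jaccard similarity |A n B| / |A u B|."""
    return len(a & b) / len(a | b)

def _h(token, seed):
    # Assumption: a seeded hash stands in for one random permutation.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(tokens, m):
    """For each of m hash functions, keep the smallest hash over the token set."""
    return [min(_h(t, seed) for t in tokens) for seed in range(m)]

def estimate_jaccard(sig_a, sig_b):
    """n / m, where n is the number of colliding signature positions."""
    n = sum(x == y for x, y in zip(sig_a, sig_b))
    return n / len(sig_a)

a, b = bigrams("john"), bigrams("jon")
print(jaccard(a, b))  # 0.25
print(estimate_jaccard(minhash_signature(a, 500), minhash_signature(b, 500)))
```

Alice and Bob would have to use the same hash functions and seeds for their signatures to be comparable.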

8 String Anonymisation
Proposed method: concatenated 1-bit minwise hashing.
The estimate of the Jaccard similarity score is: $\hat{J}(A, B) = 2 (n / m) - 1$
Example: minwise hashes and 1-bit minwise hashes under a binary representation for S1 = {'jo', 'oh', 'hn'} and S2 = {'jo', 'on'}.
(Two tables with columns H1, H2, H3, H4, H5, ..., Hm and rows S1, S2, ..., Sn: the first of minwise hashes, the second of 1-bit minwise hashes. The cell values are missing.)
With 5 hash functions, the estimate of the Jaccard similarity score is 2/5 for minwise hashes and 3/5 for 1-bit hashes; the true value is 1/4.
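The 1-bit variant can be sketched as below, under the same assumption that a seeded SHA-256 hash (hypothetical helper `_h`) stands in for each random permutation. Only the lowest bit of each minwise hash is kept. Two bits agree with probability J + (1 - J)/2 = (1 + J)/2, which gives the estimator 2(n/m) - 1, where n counts agreeing bits.

```python
import hashlib

def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def _h(token, seed):
    # Assumption: a seeded hash stands in for one random permutation.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def one_bit_signature(tokens, m):
    """Lowest bit of each of the m minwise hashes, concatenated as a list."""
    return [min(_h(t, seed) for t in tokens) & 1 for seed in range(m)]

def estimate_jaccard_1bit(bits_a, bits_b):
    """2 * (n / m) - 1, where n is the number of agreeing bits."""
    n = sum(x == y for x, y in zip(bits_a, bits_b))
    return 2.0 * n / len(bits_a) - 1.0

a = one_bit_signature(bigrams("john"), 1000)
b = one_bit_signature(bigrams("jon"), 1000)
print(estimate_jaccard_1bit(a, b))
```

Under this formula, an estimate of 3/5 from 5 bits corresponds to 4 agreeing bits. The estimate can fall below zero when fewer than half of the bits agree, so an implementation may want to clip it at 0.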

9 String Anonymisation
Simulation study: File A contains 300 names; File B is obtained by perturbing File A to simulate typographical errors.
Bigrams are tokenized with leading and trailing underscores.
True Jaccard scores are compared with estimated scores on all pairs in A x B.
Bloom filter approaches show bias.
Minwise hashing has smaller variance than concatenated 1-bit hashing but requires more storage.
The concatenated hash has approximately the same MSE as the Bloom filter.
Precision can be controlled by the choice of m, the number of hash functions.
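The padded tokenization can be illustrated as follows. This is a sketch under two assumptions: a single underscore is added on each side, and names are lower-cased first. The function names are hypothetical.

```python
def padded_bigrams(name, pad="_"):
    """Set of character bigrams of the name with one pad character on each side."""
    padded = f"{pad}{name.lower()}{pad}"  # lower-casing is an assumption
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """True Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

print(sorted(padded_bigrams("john")))  # ['_j', 'hn', 'jo', 'n_', 'oh']
print(jaccard(padded_bigrams("john"), padded_bigrams("jon")))  # 0.5
```

Padding makes the first and last characters count as tokens of their own, so 'john' and 'jon' share 3 of 6 distinct bigrams instead of 1 of 4 without padding.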

10 Privacy Preserving Probabilistic Record Linkage
Extend the Binomial EM algorithm to K categories, $k = 1, \dots, K$, where each category is a grouping of similarity scores (Jaro for original values; Jaccard for anonymised values), e.g. 8 categories with (inclusive) upper bounds: [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1].
The element of the agreement vector for variable q of pair j is $\gamma^{j}_{q,k} = 1$ if the similarity score falls in category k, and 0 otherwise.
A Multinomial EM algorithm estimates the matching parameters $\hat{m}_{q,k}$, $\hat{u}_{q,k}$ and $\hat{p}$.
Blocking: methods in the PPRL literature include canopy clustering (McCallum et al., 2000), which divides the pairs into overlapping subsets before classification; multibit tree structures to identify similar comparison vectors under the Bloom filter framework (Bachteler et al., 2013); and more.
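The binning and the multinomial EM step can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes conditional independence between matching variables, uses hypothetical starting values (matches lean towards high categories) and adds a small smoothing constant `eps` to avoid zero probabilities. The names `category` and `multinomial_em` are made up for the example.

```python
from bisect import bisect_left

BOUNDS = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1.0]  # inclusive upper bounds

def category(score, bounds=BOUNDS):
    """Index of the first bin whose inclusive upper bound is >= score."""
    return bisect_left(bounds, score)

def multinomial_em(pairs, n_cat, n_iter=50, p=0.1, eps=1e-9):
    """pairs: list of tuples holding one category index per matching variable.

    Returns (p, m, u, g): the match proportion, m[q][k] and u[q][k], and the
    posterior match probability of each pair from the last E-step.
    """
    n_var = len(pairs[0])
    # Starting values (an assumption): m favours high categories, u low ones.
    w = [k + 1.0 for k in range(n_cat)]
    m = [[x / sum(w) for x in w] for _ in range(n_var)]
    u = [[x / sum(w) for x in reversed(w)] for _ in range(n_var)]
    g = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for pair in pairs:
            lm, lu = p, 1.0 - p
            for q, k in enumerate(pair):
                lm *= m[q][k]
                lu *= u[q][k]
            g.append(lm / (lm + lu))
        # M-step: re-estimate the proportion and the category probabilities.
        g_sum = sum(g)
        p = g_sum / len(pairs)
        for q in range(n_var):
            for k in range(n_cat):
                hit = sum(gj for gj, pair in zip(g, pairs) if pair[q] == k)
                cnt = sum(1 for pair in pairs if pair[q] == k)
                m[q][k] = (hit + eps) / (g_sum + n_cat * eps)
                u[q][k] = (cnt - hit + eps) / (len(pairs) - g_sum + n_cat * eps)
    return p, m, u, g

print(category(0.93))  # 5, the bin with upper bound 0.95
```

Each pair is first reduced to one category index per variable with `category`, and the list of such tuples is passed to `multinomial_em` with `n_cat=len(BOUNDS)`.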

11 Experiment
File A: 1000 records from a Census database with attached English names.
File B: generated by perturbing File A under a probabilistic approach, including swapping, deleting and transposing characters, on the variables Gender, Year of Birth, Month of Birth and First Name.
4 perturbed datasets were generated at different levels of perturbation.
A random sample of 700 records from File A and a random sample of 400 records from the perturbed files were used for matching.
No blocking was carried out.

12 Experiment
PPPRL:
Binary EM: standard EM approach based on exact matching of strings. No similarity score used.
LR weighted: takes the outputs of Binary EM and downweights the likelihood ratios.
Log LR weighted: takes the outputs of Binary EM and downweights the log likelihood ratios.
EM (8): multinomial EM approach with 8 bins having upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]. Jaccard similarity score (with padded underscores on bigrams).
EM (15): multinomial EM approach with 15 bins having upper bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.999, 1]. Jaccard similarity score as for EM (8).
PRL:
Jaro: multinomial EM approach with 8 bins using the Jaro string comparator.
Jaro-Winkler: as for Jaro, but with the Jaro-Winkler string comparator.

13 Experiment
Correct links are identified and used to construct precision-recall plots.
Plots show, for any given threshold, the precision and recall based on false positives (fp), true positives (tp), false negatives (fn) and true negatives, and can be used to compare approaches.
Good approaches will produce curves in the upper right of the plot.
$\mathrm{Precision} = \frac{tp}{tp + fp}$, $\mathrm{Recall} = \frac{tp}{tp + fn}$
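One point on a precision-recall curve can be computed as in the sketch below. The function name, the toy scores and the convention of returning precision 1.0 when nothing is linked are assumptions made for illustration.

```python
def precision_recall(scores, is_match, threshold):
    """Precision and recall when pairs with score >= threshold are linked."""
    tp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, is_match) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: four candidate pairs with known match status.
scores = [0.9, 0.8, 0.4, 0.3]
is_match = [True, False, True, False]
print(precision_recall(scores, is_match, 0.5))  # (0.5, 0.5)
```

Sweeping the threshold over the observed scores and plotting the resulting pairs gives the curve used to compare approaches.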

14 Experiment
(Precision-recall plots, two panels: low perturbation and high perturbation.)
All approaches perform better with a low level of perturbation.
Binary EM without similarity scores performs the worst.
Downweighting log likelihood ratios outperforms downweighting likelihood ratios.
Multinomial EM outperforms Binary EM, with no clear difference between the 8-category and 15-category Jaccard score schemes.
Jaro schemes provide the best performance, although these are not privacy preserving.

15 Discussion
PPPRL does not allow clerical review, and a single threshold is determined based on the posterior probability of a correct link.
PPPRL can be tailored to different types of variables via the choice/design of the tokenization scheme.
So far we have dealt only with 1 to 1 matching.
Multinomial EM offers improved classification over the unweighted and weighted binary EM schemes.
Under trusted Carole, the Jaro and Jaro-Winkler schemes outperformed the padded bigram tokenization scheme under PPPRL.

16 Thank you for your attention
