Anonymisation of Public Use Data Sets

Methods for Reducing Disclosure Risk and the Analysis of Perturbed Data

Harvey Goldstein, University of Bristol and University College London
Natalie Shlomo, University of Manchester

The problem and some solutions

Release of large (pseudonymised) datasets for analysis potentially allows statistical attack via searching for records satisfying certain constraints (e.g. age, location, medication).

The standard solution is to degrade data values to make it unlikely that an attacker could correctly identify individuals. This is typically judged using k-anonymity.

Two types of disclosure control methods under the "safe data" approach:
1. Non-perturbative methods reduce the information content
2. Perturbative methods alter the data to increase uncertainty of identification

The problem and some solutions

Non-perturbative methods:
1. Remove cells with small counts if data are in tabular form, preserving margins
2. Delete sensitive variables
3. Group categories, or categorise continuous variables, of disclosive variables such as postcode and age
4. Sub-sample

Perturbative methods:
1. Add random noise to increase uncertainty around correct identification (this includes random misclassification for categorical variables)
2. Micro-aggregation of similar cases (effectively reduces variation)
3. Create synthetic data values while preserving data structure
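As an illustration of micro-aggregation, the sketch below shows one simple univariate variant: sort the values, form groups of at least k neighbours, and replace each value by its group mean. The function name, the grouping rule and the example ages are assumptions made for illustration, not a method taken from the slides.

```python
import numpy as np

def microaggregate(values, k=3):
    """Replace each value by the mean of its group of >= k rank-adjacent values."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    out = np.empty_like(values)
    n = len(values)
    start = 0
    while start < n:
        end = start + k
        # Fold a short tail (fewer than k records) into the current group.
        if n - end < k:
            end = n
        idx = order[start:end]
        out[idx] = values[idx].mean()
        start = end
    return out

# Hypothetical ages, used only to show the effect.
ages = np.array([23, 25, 31, 34, 35, 47, 52, 58])
masked = microaggregate(ages, k=3)

print(masked)
print("mean:", ages.mean(), masked.mean())      # the mean is preserved
print("variance:", ages.var(), masked.var())    # the variance shrinks
```

The mean is unchanged while the variance falls, which is the "effectively reduces variation" point made above.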

Effects on statistical analysis: a key concern

1. Cell removal: may over-coarsen data and in particular remove interesting interaction effects
2. Grouping: like (1), may smooth over complex relationships
3. Addition of random noise: will lead to incorrect standard errors and also biased coefficients in generalised linear models unless properly adjusted for
4. Synthetic data: may lead to severely biased coefficients if analysis models do not include variables used in the synthesis

Synthetic data

Synthetic data relies on assumed or modelled data relationships to simulate (impute) new data that approximates real data. This can be done for all data or a subset.

Producing multiply imputed datasets allows corrections to be made for imputation variance, or approximations are available.

Few would advocate that such data should be used for a final analysis: rather, they can provide an indication for a small set of final models that can then simply be fitted (in a secure environment) to produce the required model estimates.

Synthetic data

This poses particular problems:
1. There is a strong reliance on producing the right structure, typically via a series of conditional models.
2. Even using synthetic data in exploratory mode can lead users astray, where their models, based upon an approximation to the true structure, become biased and lead to the selection of inappropriate final models to be estimated using the real data.

Adding random noise in general

Adding random noise is less extreme than synthetic data.

We suppose that the attacker has available a set of q values for y (the variables to be used), say $y_i$, that she intends to match against records in the data set. We propose to construct a new set of variables, z, which is what the attacker will see:

$$z_i = y_i + m_i$$

where m has a predefined (normal) distribution (other distributions are possible, e.g. differential privacy techniques often use a double exponential distribution).

For simplicity, assume independence across the variables to be disturbed, or we might consider the case of correlated noise (correlated with true values) to preserve the correlation structure and sufficient statistics.

Note that y can be continuous or discrete (categories numbered 1, ..., p).

Adding random noise in general

The value of the variance ($\sigma_m^2$) will determine the strength of the resistance to attack and can be a function of the true variability of each variable.

We now form a measure of the distance between $y_i$ and each z and then rank these distances. A general distance measure can be written in the form

$$D_{ik} = (z_k - y_i)^T W (z_k - y_i),$$

where, for example, $W = \Omega^{-1}$. But, more simply, we can choose the Euclidean distance for each comparison record:

$$D_{ik} = \sum_{j=1}^{q} (z_{kj} - y_{ij})^2, \qquad D_{ii} = \sum_{j=1}^{q} (z_{ij} - y_{ij})^2, \qquad i = 1, \dots, n.$$
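A minimal sketch of the perturbation and distance step, under stated assumptions: the sizes n and q, the seed, and the choice of Ω as the sample covariance matrix of y are all illustrative, and `distances` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed fixed only for reproducibility

n, q = 200, 3                    # illustrative sizes, not from the slides
sigma2_m = 0.1                   # noise variance

y = rng.normal(size=(n, q))                                 # true records
z = y + rng.normal(scale=np.sqrt(sigma2_m), size=(n, q))    # released records

def distances(y_i, z, W=None):
    """D_ik = (z_k - y_i)^T W (z_k - y_i) for every released record k.

    With W=None this is the squared Euclidean distance.
    """
    diff = z - y_i
    if W is None:
        return np.sum(diff ** 2, axis=1)
    return np.einsum("kj,jl,kl->k", diff, W, diff)

D_euclid = distances(y[0], z)

# Assumption: Omega is taken to be the sample covariance matrix of y.
W = np.linalg.inv(np.cov(y, rowvar=False))
D_general = distances(y[0], z, W)

print(D_euclid[:5])
print(D_general[:5])
```

Passing the identity matrix as W reproduces the Euclidean case, so the two forms can share one function.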

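Ranking the distances

A rational attacker chooses the closest record(s) to their own as the correct one(s). Form

$$R_{ik} = \operatorname{rank}(D_{ik}), \qquad R_{ii} = \operatorname{rank}(D_{ii}).$$

Define the value of k for which $R_{ik} = 1$; this is the record the attacker chooses. Define

$$h_i = R_{ii} - 1.$$

Thus if h = 0 we have the correct match. For example, if the true record is only the third closest, $R_{ii} = 3$, so h = 3 − 1 = 2 if the attacker chooses the closest record.

h measures the difference in rank between the chosen record and the correct one.

So choose the noise added to be large enough so that, say, $\Pr(h < p) < \varepsilon$ (say, p = 3, ε = 0.1).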
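The ranking step can be sketched as follows. The distances are hypothetical numbers chosen so that the true record is the third closest, and both function names are assumptions made for illustration.

```python
import numpy as np

def h_index(D_i, i):
    """h_i = R_ii - 1, where R_ik is the rank of D_ik (1 = closest record)."""
    ranks = np.argsort(np.argsort(D_i, kind="stable"), kind="stable") + 1
    return int(ranks[i] - 1)

def noise_is_sufficient(h_values, p=3, eps=0.1):
    """Check the criterion Pr(h < p) < eps on a sample of h values."""
    return bool(np.mean(np.asarray(h_values) < p) < eps)

# Hypothetical distances from attacker record i = 0 to four released records.
D_0 = np.array([0.9, 0.2, 0.5, 1.4])
print(h_index(D_0, 0))   # 2: two released records are closer than the true one
```

An attacker who picks the closest record here selects record 1, not the true record 0, so the match fails.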

A simulation

Generate $10^3$ records with 5 normal variables and $\sigma_m^2 = 0.1$. All variances are 1 and all covariances 0.25.

For each true value record (the attacker's $y_i$) generate $D_{ik}$, $D_{ii}$.

The following table gives some estimates of disclosiveness in terms of h for a range of individuals at different distances from the median.
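A sketch of a simulation along these lines, assuming squared Euclidean distances and independent normal noise. It reports only the overall cumulative distribution of h; it does not condition on each individual's distance from the median, so its figures should not be expected to match the slide's table. The seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed, for reproducibility only

n, q = 1000, 5
sigma2_m = 0.1
Omega = np.full((q, q), 0.25) + 0.75 * np.eye(q)   # variances 1, covariances 0.25

y = rng.multivariate_normal(np.zeros(q), Omega, size=n)     # true records
z = y + rng.normal(scale=np.sqrt(sigma2_m), size=(n, q))    # released records

# D[i, k]: squared Euclidean distance from attacker record y_i to released z_k.
D = ((y[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)

# h_i: number of released records strictly closer to y_i than its own z_i.
h = (D < np.diag(D)[:, None]).sum(axis=1)

for cutoff in range(6):
    print(f"Pr(h <= {cutoff}) = {100 * np.mean(h <= cutoff):.1f}%")
```

Counting strictly closer records equals $R_{ii} - 1$ whenever there are no tied distances, which holds almost surely for continuous noise.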

Distribution for h

Entries are cumulative percentages, i.e. the percentage of cases with h at or below the row value.

         Cumulative percentile of D distribution
  h        10      20      30      40      50
  0       52.2    49.4    43.9    41.3    41.7
  1       62.9    60.7    56.1    53.1    53.1
  2       70.0    65.3    62.0    61.2    60.8
  3       74.7    70.2    68.6    66.0    65.8
  4       78.5    74.4    72.8    70.1    68.7
  5       80.8    77.5    76.5    72.7    71.5

More results

Lowest decile: Pr(h > 5) is shown for combinations of Ω and $\sigma_m^2$, where Ω always has unit diagonal elements and equal off-diagonal elements (given by the columns, 0.1 to 0.5). Sample size 1000.

                 Off-diagonal element of Ω
  σ²_m      0.1     0.2     0.3     0.4     0.5
  0.1       0.15    0.16    0.19    0.23    0.24
  0.2       0.45    0.43    0.46    0.50    0.54
  0.3       0.58    0.63    0.63    0.65    0.70
  0.4       0.73    0.74    0.74    0.76    0.77

We see that the procedure is readily tuned simply by changing the variance of the noise.

We are also studying the possibility of a more sophisticated attack that uses values of y predicted from the perturbed dataset rather than the z themselves.

The h-index and k-anonymisation

If we have, say, 2-anonymity, this implies that an attacker is able to identify two individual records matching her own information, so choosing either of them at random means that there is a probability of 0.5 that it is the correct one.

The h-index, however, only yields a single individual as the closest, for example with a probability of about 0.5, and thus provides less information to the attacker than in the case of 2-anonymity.
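For comparison, k-anonymity can be checked by counting how many records share each combination of quasi-identifier values. The sketch below uses hypothetical records and variable names; `k_anonymity` is an illustrative helper, not part of any package.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    keys = [tuple(r[name] for name in quasi_identifiers) for r in records]
    return min(Counter(keys).values())

# Hypothetical records: age band and region are treated as quasi-identifiers.
records = [
    {"age_band": "30-39", "region": "North", "income": 21000},
    {"age_band": "30-39", "region": "North", "income": 34000},
    {"age_band": "40-49", "region": "South", "income": 28000},
    {"age_band": "40-49", "region": "South", "income": 51000},
]

k = k_anonymity(records, ["age_band", "region"])
print(k)        # 2
print(1 / k)    # 0.5: chance of picking the right record at random
```

With k = 2 the attacker narrows the search to two candidate records with certainty, which is the contrast with the h-index drawn above.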

The h-index and k-anonymisation II

For k-anonymity an attacker may be quite content that they can access 2 or perhaps even 5 records containing the one that is sought.

By contrast, with the h-index procedure, in our most favourable case, the probability of the sought-for individual being one of the two nearest is just over 60%, and one of the five nearest just under 80%.

Thus it could be argued that this is sufficient to deter an attacker and hence suitable in terms of disclosiveness.

In practice careful attention needs to be paid to the amount of noise required to satisfy disclosure concerns.

How to remove the noise

Assume noise $\eta_i \sim \text{iid}(0, \sigma_\eta^2)$ is added to a continuous variable.

We get unbiased totals and means but larger variance, and biases where predictors incorporate noise.

How to make correct inferences in a general modelling framework?

Assume a simple regression model with a dependent variable $y$ that has been subjected to Gaussian additive noise $\eta$ with a mean of 0 and a positive variance $\sigma_\eta^2$. We assume that the predictor variable is error free.

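How to remove the noise

The model is:

$$y_i = y_i^* + \eta_i, \qquad y_i^* = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \dots, n,$$

where $y_i^*$ denotes the true but unobserved value of the dependent variable.

If we regress $y$ on $x$ then

$$\hat{\beta} \xrightarrow{p} \frac{Cov(y, x)}{Var(x)} = \frac{Cov(y^* + \eta, x)}{Var(x)} = \frac{Cov(y^*, x) + Cov(\eta, x)}{Var(x)} = \frac{Cov(y^*, x)}{Var(x)} = \beta,$$

since $Cov(\eta, x) = 0$.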

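How to remove the noise

Additive noise on the dependent variable thus does not bias the slope coefficient but increases standard errors due to the increase in variance:

$$Var(y) = Var(y^*) + Var(\eta),$$

where $y^*$ is the true but unobserved value of the dependent variable.

Now add noise $\eta$ to the predictor variable. The model is now:

$$y_i = \alpha + \beta x_i^* + \varepsilon_i, \qquad x_i = x_i^* + \eta_i, \qquad i = 1, \dots, n,$$

where $x_i^*$ denotes the true but unobserved value of $x_i$.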
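The effect of noise on the dependent variable can be checked numerically. The sketch below uses made-up parameter values (α = 1, β = 2, noise variance 4) and a hypothetical helper `slope_and_se`; it is a simulation under these assumptions, not the authors' software.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed

n = 50_000
alpha, beta, sigma2_eta = 1.0, 2.0, 4.0   # illustrative values

x = rng.normal(size=n)
y_true = alpha + beta * x + rng.normal(size=n)
y_noisy = y_true + rng.normal(scale=np.sqrt(sigma2_eta), size=n)

def slope_and_se(x, y):
    """Least squares slope of y on x and its usual standard error."""
    xc = x - x.mean()
    yc = y - y.mean()
    b = (xc @ yc) / (xc @ xc)
    resid = yc - b * xc
    s2 = (resid @ resid) / (len(x) - 2)
    return b, np.sqrt(s2 / (xc @ xc))

b_true, se_true = slope_and_se(x, y_true)
b_noisy, se_noisy = slope_and_se(x, y_noisy)

print("slope:", b_true, b_noisy)      # both close to beta
print("std. error:", se_true, se_noisy)   # larger with noise
```

Both slopes sit close to β, while the standard error grows with the added noise variance.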

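How to remove the noise

If we regress $y$ on $x$, where $x = x^* + \eta$ is the noisy version of the true predictor $x^*$, then for the least squares slope coefficient:

$$\hat{\beta} \xrightarrow{p} \frac{Cov(y, x)}{Var(x)} = \frac{Cov(y, x^* + \eta)}{Var(x^*) + Var(\eta)} = \frac{Cov(y, x^*) + Cov(y, \eta)}{Var(x^*) + Var(\eta)} = \frac{Cov(y, x^*)}{Var(x^*) + Var(\eta)},$$

since $Cov(y, \eta) = 0$.

Additive noise on the predictor variable biases the slope coefficient downwards (attenuation).

Thus we need suitable methodology to deal with these measurement errors.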

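How to remove the noise

For the least squares slope coefficient in a simple linear regression, with $x = x^* + \eta$ the noisy predictor and $x^*$ its true value:

$$\hat{\beta} \xrightarrow{p} \frac{Cov(y, x^*)}{Var(x^*) + Var(\eta)} = \frac{\beta \sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_\eta^2} = \frac{\beta}{1 + \sigma_\eta^2 / \sigma_{x^*}^2} = \lambda \beta.$$

We define

$$\lambda = \frac{1}{1 + \sigma_\eta^2 / \sigma_{x^*}^2}$$

as the reliability ratio.

A consistent estimate of the slope coefficient is obtained by dividing the least squares estimate by $\lambda$.

To calculate $\lambda$ we assume that $\sigma_\eta^2$ is released and known to the researcher.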
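A small simulation of the attenuation and its correction, assuming the noise variance is known to the analyst. The parameter values are made up, and the variance of the true predictor is estimated as the observed variance minus the known noise variance, which is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed

n = 100_000
alpha, beta = 1.0, 2.0           # illustrative values
sigma2_eta = 0.5                 # noise variance, assumed released to the analyst

x_true = rng.normal(size=n)                                   # Var(x*) = 1
y = alpha + beta * x_true + rng.normal(size=n)
x = x_true + rng.normal(scale=np.sqrt(sigma2_eta), size=n)    # released predictor

def ols_slope(x, y):
    """Least squares slope of y on x."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

beta_naive = ols_slope(x, y)

# Reliability ratio: Var(x*) / (Var(x*) + sigma2_eta),
# with Var(x*) estimated as Var(x) - sigma2_eta.
var_x = np.var(x, ddof=1)
lam = (var_x - sigma2_eta) / var_x
beta_corrected = beta_naive / lam

print("naive:", beta_naive)          # near beta * lambda = 2 / 1.5
print("lambda:", lam)                # near 1 / 1.5
print("corrected:", beta_corrected)  # near beta = 2
```

The correction needs only the released noise variance and the observed variance of the noisy predictor.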

How to remove the noise in general

Noise is random with known properties, so a measurement error model is required.

This requires that the parameters used to generate the noise are known to the researchers.

Current work (using a CLOSER grant at Bristol) is underway to develop software to show how the noise should be generated in such a way that the parameters can be released under a predetermined h-index to protect against attribute disclosure whilst preserving utility.

How to remove the noise

In simple linear regression, correlated noise can be added which produces unbiased estimates of slope coefficients by using standard regression techniques.

Current work at Bristol is developing algorithms incorporating measurement error models that will handle generalised linear models and multilevel data of different types.

Specialisation to anonymisation with user software is currently funded through ESRC (via CLOSER) at Bristol (Boyd, Goldstein and Burton).

Can be combined with handling missing data values.

Some loss of statistical efficiency, but enables the underlying signal to be extracted and thus provides unbiased parameter estimates.

Further thoughts

Often, a data attacker will have no pre-existing individual data and may trawl the dataset to discover an interesting record, for example an individual with an unusual combination of values. They may then attempt to identify the real person using other variables in the data record. Our procedure is also relevant to such an attack so long as the noise has been applied to the variables in question.

How to tune the noise, and differential noise related to the identifiability of variables, is an area for further research. For example, we might wish to add relatively more noise to a variable such as height than hair colour.

Now, it may well be the case that, conditional on the data available to the attacker, a variable such as income can be predicted with sufficient accuracy within this dataset. If the data structure is well approximated, either by removing noise or via synthesis, then income could be fairly accurately predicted and this may be sufficient for an attacker's purpose. This needs further consideration.

Provision of suitable analysis tools and training for data analysts is important; discussions are underway with Government departments and agencies through ADRN.

Thank you for your attention