Anonymisation of Public Use Data Sets


Anonymisation of Public Use Data Sets: Methods for Reducing Disclosure Risk and the Analysis of Perturbed Data
Harvey Goldstein, University of Bristol and University College London, and Natalie Shlomo, University of Manchester

The problem and some solutions

Release of large (pseudonymised) datasets for analysis potentially allows statistical attack via searching for records satisfying certain constraints (e.g. age, location, medication).

The standard solution is to degrade data values to make it unlikely that an attacker could correctly identify individuals. Disclosure risk is typically judged using k-anonymity.

Two types of disclosure control methods under the safe data approach:
1. Non-perturbative methods reduce the information content.
2. Perturbative methods alter the data to increase uncertainty of identification.

The problem and some solutions

Non-perturbative methods:
1. Remove cells with small counts if data are in tabular form, preserving margins.
2. Delete sensitive variables.
3. Group categories, or categorise continuous variables, of disclosive variables such as postcode and age.
4. Sub-sample.

Perturbative methods:
1. Add random noise to increase uncertainty around correct identification (this includes random misclassification for categorical variables).
2. Micro-aggregation of similar cases (effectively reduces variation).
3. Create synthetic data values while preserving data structure.

Effects on statistical analysis: a key concern

1. Cell removal: may over-coarsen data and in particular remove interesting interaction effects.
2. Grouping: like (1), may smooth over complex relationships.
3. Addition of random noise will lead to incorrect standard errors and also biased coefficients in generalised linear models unless properly adjusted for.
4. Synthetic data may lead to severely biased coefficients if analysis models do not include variables used in the synthesis.

Synthetic data

Synthetic data rely on assumed or modelled data relationships to simulate (impute) new data that approximate the real data. This can be done for all the data or a subset.

Producing multiply imputed datasets allows corrections to be made for imputation variance; alternatively, approximations are available.

Few would advocate that such data should be used for a final analysis: rather, they can provide an indication for a small set of final models that can then simply be fitted (in a secure environment) to produce the required model estimates.

Synthetic data

This poses particular problems:
1. There is a strong reliance on producing the right structure, typically via a series of conditional models.
2. Even using synthetic data in exploratory mode can lead users astray, where their models based upon an approximation to the true structure become biased, and lead to the selection of inappropriate final models to be estimated using the real data.

Adding random noise in general

Adding random noise is less extreme than synthetic data.

We suppose that the attacker has available a set of q values for y (the variables to be used), say $y$, that she intends to match against records in the data set.

We propose to construct a new set of variables, z, which is what the attacker will see:

$z = y + m$

where m has a predefined (normal) distribution. Other distributions are possible, e.g. differential privacy techniques often use a double exponential (Laplace) distribution.

For simplicity, assume independence across the variables to be disturbed; alternatively we might consider the case of correlated noise (correlated with the true values) to preserve the correlation structure and sufficient statistics.

Note that y can be continuous or discrete (categories numbered 1, ..., p).

Adding random noise in general

The value of the variance $\sigma_m^2$ will determine the strength of the resistance to attack and can be a function of the true variability of each variable.

We now form a measure of the distance between the $y$ and each $z$ and then rank these distances. A general distance measure can be written in the form

$D = (z - y)^T W (z - y)$,

where, for example, $W^{-1} = \Omega_k$.

But, more simply, we can choose the Euclidean distance for each comparison record:

$D_i = \sum_{j=1}^{q} (z_{ij} - y_j)^2, \quad i = 1, \ldots, n$

This distance is computed both for the record that truly corresponds to the attacker's $y$ and for every other record in the data set.

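Ranking the distances

A rational attacker chooses the closest record(s) to their own as the correct one(s).

Form $R = \mathrm{Rank}(D)$ for the record that truly matches the attacker's $y$, and $R_i = \mathrm{Rank}(D_i)$ for every record $i$. Define the value of $i$ for which $R_i = 1$; this is the record the attacker chooses. Define $h = R - 1$. Thus if $h = 0$ we have the correct match.

For example: 1 3 2 2 1 3 3 2 1 2, so $h = 3 - 1 = 2$ if the attacker chooses the closest record.

h measures the difference between the chosen and the correct match. So choose the added noise large enough that, say, $\Pr(h < p) < \varepsilon$ (say, $p = 3$, $\varepsilon = 0.1$).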
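A minimal sketch of the ranking step, assuming squared Euclidean distance and taking h as the number of perturbed records that lie strictly closer to the attacker's y than the true record does. The function name `h_index` and the small data values are hypothetical, chosen only for illustration.

```python
import numpy as np

def h_index(y, Z, true_row):
    """Rank (0 = closest) of the true record among all perturbed records.

    y        : attacker's known values, length q
    Z        : perturbed data set, shape (n, q)
    true_row : index in Z of the record that really belongs to y
    """
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    d = ((Z - y) ** 2).sum(axis=1)          # squared Euclidean distance D_i
    return int((d < d[true_row]).sum())     # records strictly closer than the true one

# Hypothetical perturbed records (n = 3, q = 2) and attacker values.
Z = [[0.9, 2.1], [1.2, 1.9], [3.0, 0.5]]
y = [1.0, 2.0]
h = h_index(y, Z, true_row=1)
print(h)  # 1: one other record is closer than the true one
```

With continuous noise, ties in distance have probability zero, so counting strictly closer records is equivalent to "rank minus one".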

A simulation

Generate $10^3$ records with 5 normal variables and $\sigma_m^2 = 0.1$. All variances are 1 and all covariances 0.25.

For each true value record (the attacker's y) generate $D$ and $D_i$.

The following table gives some estimates of disclosiveness in terms of h for a range of individuals at different distances from the median.
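The simulation set-up can be sketched as follows. This is an illustrative reading of the slide rather than the authors' code: it assumes independent normal noise, squared Euclidean distance, and a random subset of records as attack targets, and it does not stratify targets by distance from the median as the slide's table does. The function name `simulate_h` and the parameter `n_targets` are hypothetical.

```python
import numpy as np

def simulate_h(n=1000, q=5, cov=0.25, noise_var=0.1, n_targets=200, seed=0):
    """Simulate h for randomly chosen target records."""
    rng = np.random.default_rng(seed)
    omega = np.full((q, q), cov)            # equal off-diagonal covariances
    np.fill_diagonal(omega, 1.0)            # unit variances
    Y = rng.multivariate_normal(np.zeros(q), omega, size=n)       # true values
    Z = Y + rng.normal(0.0, np.sqrt(noise_var), size=(n, q))      # released values
    targets = rng.choice(n, size=n_targets, replace=False)
    h = np.empty(n_targets, dtype=int)
    for k, i in enumerate(targets):
        d = ((Z - Y[i]) ** 2).sum(axis=1)   # distance from attacker's y to every record
        h[k] = (d < d[i]).sum()             # records closer than the true one
    return h

h = simulate_h()
for k in range(6):
    # Estimated cumulative percentage of targets with h <= k.
    print(k, round(100 * (h <= k).mean(), 1))
```

Rerunning with a larger `noise_var` should shift the distribution of h upwards, which is the tuning knob the slides describe.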

Distribution for h

Entries are cumulative percentages for h; columns are the cumulative percentile of the D distribution.

h      10     20     30     40     50
0      52.2   49.4   43.9   41.3   41.7
1      62.9   60.7   56.1   53.1   53.1
2      70.0   65.3   62.0   61.2   60.8
3      74.7   70.2   68.6   66.0   65.8
4      78.5   74.4   72.8   70.1   68.7
5      80.8   77.5   76.5   72.7   71.5

More results

Lowest decile, Pr(h > 5). Combinations of $\Omega$ and $\sigma_m^2$ are shown, where $\Omega$ always has unit diagonal elements and equal off-diagonal elements (given by the columns, 0.1 to 0.5). Sample size 1000.

             Off-diagonal element of Omega
sigma_m^2    0.1    0.2    0.3    0.4    0.5
0.1          0.15   0.16   0.19   0.23   0.24
0.2          0.45   0.43   0.46   0.50   0.54
0.3          0.58   0.63   0.63   0.65   0.70
0.4          0.73   0.74   0.74   0.76   0.77

We see that the procedure is readily tuned simply by changing the variance of the noise.

We are also studying the possibility of a more sophisticated attack that uses values of y predicted from the perturbed dataset rather than the z themselves.

The h-index and k-anonymisation

If we have, say, 2-anonymity this implies that an attacker is able to identify two individual records matching her own information, so choosing either of them at random means that there is a probability of 0.5 that it is the correct one.

The h-index, however, only yields a single individual as the closest, for example with a probability of about 0.5, and thus provides less information to the attacker than in the case of 2-anonymity.
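For comparison, k-anonymity itself can be checked by counting how many records share each combination of identifying values. The sketch below is illustrative only: the function name `k_anonymity`, the field names and the four records are hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest number of records sharing any one combination of
    quasi-identifier values; the data set is k-anonymous for this k."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical records; income is the sensitive value, not a quasi-identifier.
records = [
    {"age_band": "30-39", "region": "North", "income": 31000},
    {"age_band": "30-39", "region": "North", "income": 45000},
    {"age_band": "40-49", "region": "South", "income": 28000},
    {"age_band": "40-49", "region": "South", "income": 52000},
]
k = k_anonymity(records, ["age_band", "region"])
print(k)  # 2: every attacker query on age band and region returns two records
```

Under 2-anonymity the attacker always receives a candidate set that contains the target, whereas the noise-based procedure returns a single nearest record that is correct only with some probability.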

The h-index and k-anonymisation II

For k-anonymity an attacker may be quite content that they can access 2 or perhaps even 5 records containing the one that is sought.

By contrast, with the h-index procedure, in our most favourable case, the probability of the sought-for individual being one of the two nearest is just over 60% and one of the five nearest just under 80%.

Thus it could be argued that this is sufficient to deter an attacker and hence suitable in terms of disclosiveness.

In practice careful attention needs to be paid to the amount of noise required to satisfy disclosure concerns.

How to remove the noise

Assume noise $\eta \sim \mathrm{iid}(0, \sigma_\eta^2)$ is added to a continuous variable. We get unbiased totals and means, but larger variance, and biases where predictors incorporate noise.

How do we make correct inferences in a general modelling framework?

Assume a simple regression model with a dependent variable $y$ that has been subjected to Gaussian additive noise $\eta$ with a mean of 0 and a positive variance $\sigma_\eta^2$. The predictor variable $x$ is assumed to be error free.

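How to remove the noise

Write $y_i^*$ for the released (noise-added) value. The model is:

$y_i^* = y_i + \eta_i = \alpha + \beta x_i + \varepsilon_i + \eta_i, \quad i = 1, \ldots, n$

where $y_i$ denotes the true but unobserved value of the dependent variable.

If we regress $y^*$ on $x$ then

$\frac{\mathrm{Cov}(y^*, x)}{\mathrm{Var}(x)} = \frac{\mathrm{Cov}(y + \eta, x)}{\mathrm{Var}(x)} = \frac{\mathrm{Cov}(y, x) + \mathrm{Cov}(\eta, x)}{\mathrm{Var}(x)} = \frac{\mathrm{Cov}(y, x)}{\mathrm{Var}(x)} = \beta$

since $\mathrm{Cov}(\eta, x) = 0$.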

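How to remove the noise

Additive noise on the dependent variable thus does not bias the slope coefficient but increases standard errors, due to the increase in variance:

$\mathrm{Var}(y^*) = \mathrm{Var}(y) + \mathrm{Var}(\eta)$

where $y^*$ is the released (noise-added) dependent variable.

Now add noise $\eta$ to the predictor variable. The model is now:

$y_i = \alpha + \beta x_i + \varepsilon_i, \quad x_i^* = x_i + \eta_i, \quad i = 1, \ldots, n$

where $x_i$ denotes the true but unobserved value of the predictor and $x_i^*$ the released value.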
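The claim about noise on the dependent variable can be checked numerically. The sketch below uses hypothetical parameter values (alpha = 1, beta = 2, unit variances) and writes `y_star` for the noise-added dependent variable; it illustrates the algebra and is not part of the authors' software.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta = 100_000, 1.0, 2.0

x = rng.normal(size=n)                          # error-free predictor
y = alpha + beta * x + rng.normal(size=n)       # true dependent variable
y_star = y + rng.normal(0.0, 1.0, size=n)       # released value, noise variance 1

def ols_slope(u, v):
    """Least squares slope of v regressed on u: Cov(u, v) / Var(u)."""
    c = np.cov(u, v)
    return c[0, 1] / c[0, 0]

slope_true = ols_slope(x, y)
slope_noisy = ols_slope(x, y_star)
print(slope_true, slope_noisy)       # both close to beta
print(np.var(y), np.var(y_star))     # variance grows by roughly Var(eta)
```

Both slopes estimate the same beta; the price of the noise is the larger residual variance and hence wider standard errors.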

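How to remove the noise

If we regress $y$ on the noise-added predictor $x^* = x + \eta$ then for the least squares slope coefficient:

$\beta^* = \frac{\mathrm{Cov}(y, x^*)}{\mathrm{Var}(x^*)} = \frac{\mathrm{Cov}(y, x + \eta)}{\mathrm{Var}(x) + \mathrm{Var}(\eta)} = \frac{\mathrm{Cov}(y, x) + \mathrm{Cov}(y, \eta)}{\mathrm{Var}(x) + \mathrm{Var}(\eta)} = \frac{\mathrm{Cov}(y, x)}{\mathrm{Var}(x) + \mathrm{Var}(\eta)}$

since $\mathrm{Cov}(y, \eta) = 0$.

Additive noise on the predictor variable biases the slope coefficient downwards (attenuation). Thus we need suitable methodology to deal with these measurement errors.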

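How to remove the noise

For the least squares slope coefficient in a simple linear regression on the noise-added predictor:

$\hat{\beta} \xrightarrow{p} \frac{\mathrm{Cov}(y, x)}{\mathrm{Var}(x) + \mathrm{Var}(\eta)} = \frac{\beta \sigma_x^2}{\sigma_x^2 + \sigma_\eta^2} = \frac{\beta}{1 + \sigma_\eta^2 / \sigma_x^2} = \beta \lambda$

We define

$\lambda = \frac{1}{1 + \sigma_\eta^2 / \sigma_x^2}$

as the reliability ratio.

A consistent estimate of the slope coefficient is obtained by dividing the least squares estimate by $\lambda$.

To calculate $\lambda$ we assume that $\sigma_\eta^2$ is released and known to the researcher.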
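A numerical sketch of the attenuation and its correction follows. The parameter values are hypothetical, and one step is an assumption beyond the slide: since the true predictor variance is not observed, it is estimated here as the variance of the released predictor minus the known noise variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, beta = 100_000, 1.0, 2.0
var_x, var_eta = 1.0, 0.5                       # var_eta is released to the analyst

x = rng.normal(0.0, np.sqrt(var_x), size=n)     # true, unobserved predictor
y = alpha + beta * x + rng.normal(size=n)
x_star = x + rng.normal(0.0, np.sqrt(var_eta), size=n)   # released predictor

c = np.cov(x_star, y)
naive = c[0, 1] / c[0, 0]                       # attenuated least squares slope

# Assumption: estimate Var(x) by Var(x*) - var_eta (method of moments).
var_x_hat = c[0, 0] - var_eta
lam = var_x_hat / c[0, 0]                       # reliability ratio, 1 / (1 + var_eta / var_x)
corrected = naive / lam

print(naive, lam, corrected)
```

With these values the reliability ratio is about 2/3, so the naive slope should sit near 4/3 and the corrected slope near the true value of 2.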

How to remove the noise in general

Noise is random with known properties, so a measurement error model is required. This requires that the parameters used to generate the noise are known to the researchers.

Current work (using a CLOSER grant at Bristol) is underway to develop software to show how the noise should be generated in such a way that the parameters can be released under a predetermined h-index to protect against attribute disclosure whilst preserving utility.

How to remove the noise

In simple linear regression, correlated noise can be added which produces unbiased estimates of slope coefficients using standard regression techniques.

Current work at Bristol is developing algorithms incorporating measurement error models that will handle generalised linear models and multilevel data of different types.

Specialisation to anonymisation, with user software, is currently funded through ESRC (via CLOSER) at Bristol (Boyd, Goldstein and Burton).

The approach can be combined with the handling of missing data values.

There is some loss of statistical efficiency, but the approach enables the underlying signal to be extracted and thus provides unbiased parameter estimates.

Further thoughts

Often, a data attacker will have no pre-existing individual data and may trawl the dataset to discover an interesting record, for example an individual with an unusual combination of values. They may then attempt to identify the real person using other variables in the data record. Our procedure is also relevant to such an attack so long as the noise has been applied to the variables in question.

How to tune the noise, and how to apply differential noise related to the identifiability of variables, is an area for further research. For example, we might wish to add relatively more noise to a variable such as height than to hair colour.

Now, it may well be the case that, conditional on the data available to the attacker, a variable such as income can be predicted with sufficient accuracy within this dataset. If the data structure is well approximated, either by removing noise or via synthesis, then income could be fairly accurately predicted and this may be sufficient for an attacker's purpose. This needs further consideration.

Provision of suitable analysis tools and training for data analysts is important; discussions are underway with Government departments and agencies through ADRN.

Thank you for your attention