Co-clustering for differentially private synthetic data generation


Tarek Benkhelif, Françoise Fessant, Fabrice Clérot and Guillaume Raschia
Orange Labs & LS2N
January 23, 2018
Journée thématique EGC & IA: Données personnelles, vie privée et éthique (EGC & AI thematic day on personal data, privacy, and ethics)

Context

Privacy preserving data publishing:
- Releasing data, either in their original or aggregated form
- Protecting the individuals represented in the data
- Providing sufficient utility

Privacy preserving data publishing

[Diagram: the original data (private) pass through an anonymisation mechanism and are released (public) for exploratory analysis and supervised classification. Two families of mechanisms: group-based (k-anonymity, l-diversity, t-closeness) and differential privacy.]

The released synthetic data should:
- have the same format as the original data,
- be multidimensional,
- be independent of the data mining task.

Differential Privacy: Intuition

[Figure: an observer of the output cannot tell whether it was computed with Jack's record or without it.]

Differential Privacy
- It should not harm or help you as an individual to enter or to leave the dataset.
- To ensure this property, we need a mechanism whose output is nearly unchanged by the presence or absence of a single respondent in the database.
- In constructing a formal approach, we concentrate on pairs of databases (D1, D2) differing in only one row: one is a subset of the other, and the larger database contains a single additional row.

Differential Privacy

ε-differential privacy [Dwo06]: a data release mechanism A satisfies ε-differential privacy if, for all neighboring databases D1 and D2 and every released output O,

\[ \Pr[A(D_1) = O] \le e^{\varepsilon} \, \Pr[A(D_2) = O]. \]

Achieving ε-DP: the Laplace mechanism adds random noise to the true answer of a query Q,

\[ A_Q(D) = Q(D) + \tilde{N}, \]

where \tilde{N} is the Laplace noise. The magnitude of the noise depends on the privacy level and the query's sensitivity.
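A minimal sketch of the Laplace mechanism for a counting query (sensitivity 1); the function and variable names are illustrative, not from the talk:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        """Perturb a query answer with Laplace noise of scale sensitivity / epsilon."""
        rng = rng or np.random.default_rng()
        return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # A counting query has sensitivity 1: adding or removing one row
    # changes the count by at most 1.
    noisy_count = laplace_mechanism(true_answer=1234, sensitivity=1.0, epsilon=0.5)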

Existing approaches

Baseline algorithm
1. Discretize the attribute domains into cells
2. Add noise to the cell counts (Laplace mechanism)
3. Use the noisy counts to either:
   3.1 Answer queries directly (assuming the distribution is uniform within each cell)
   3.2 Generate synthetic data (derive a distribution from the counts and sample from it)

Limitation: granularity of the discretization
- Coarse: detail is lost
- Fine: noise overwhelms the signal
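A minimal sketch of this baseline, assuming numeric data on a fixed grid; `bins` and the uniform-within-cell draw follow steps 1 to 3.2 above:

    import numpy as np

    def baseline_synthetic(data, bins, epsilon, n_synthetic, rng=None):
        """Histogram the data, add Laplace noise to cell counts, sample synthetic rows."""
        rng = rng or np.random.default_rng()
        counts, edges = np.histogramdd(data, bins=bins)                       # step 1
        noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)  # step 2
        noisy = np.clip(noisy, 0.0, None)                # negative counts are meaningless
        probs = noisy.ravel() / noisy.sum()              # step 3.2: derive a distribution
        cells = rng.choice(probs.size, size=n_synthetic, p=probs)
        idx = np.unravel_index(cells, counts.shape)
        # Draw each synthetic value uniformly inside its sampled cell.
        cols = [rng.uniform(e[idx[d]], e[idx[d] + 1]) for d, e in enumerate(edges)]
        return np.column_stack(cols)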

DP multidimensional data release approaches

Approach            | Dimension | Mixed data type | Parameter(s)
DPCube [XXFG12]     | multi-D   |                 | variance threshold
DP-MHMD [RKS16]     | multi-D   |                 | attribute grouping
DiffGen [MCFY11]    | multi-D   |                 | attribute taxonomy, number of specializations
PrivBayes [ZCP+14]  | multi-D   |                 | Bayesian network degree

PrivBayes [ZCP+14]

Method: use a Bayesian network to learn the data distribution. PrivBayes decomposes the high-dimensional table (attributes A, B, C, D, E, F, G) into low-dimensional tables (e.g., ABC, CD, BE, DEF), adds noise to each of them, and reconstructs a noisy table; once the Bayesian network is learned, synthetic data are generated by sampling from it.

Challenge: privately choosing a good decomposition.

(Slide from the "Differential Privacy in the Wild" tutorial.)
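A toy sketch of the sampling step only, assuming a fixed chain network A -> B -> C whose (noisy) marginal tables have already been estimated, clipped, and renormalized; PrivBayes' private structure learning is not reimplemented here:

    import numpy as np

    def sample_from_chain(p_a, p_b_given_a, p_c_given_b, n, rng=None):
        """Sample rows (a, b, c) from a fixed chain Bayesian network A -> B -> C."""
        rng = rng or np.random.default_rng()
        rows = []
        for _ in range(n):
            a = rng.choice(len(p_a), p=p_a)                          # P(A)
            b = rng.choice(p_b_given_a.shape[1], p=p_b_given_a[a])   # P(B | A = a)
            c = rng.choice(p_c_given_b.shape[1], p=p_c_given_b[b])   # P(C | B = b)
            rows.append((a, b, c))
        return rows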

Proposal: DPCocGen

Co-clustering
- Bi-clustering: simultaneously partition the rows and the columns of a data matrix.
- D-clustering: simultaneously partition the d dimensions of a data hypercube.
- Captures the interaction (underlying structure) between the d entities.

MODL co-clustering features
- Grouping: discovers the best reordering and grouping of the data cube¹ that maximizes the mutual information between the d-clusterings.
- Aggregation: the number of clusters can be decreased in a greedily optimal way.

¹ Boullé, M.: Functional data clustering via piecewise constant nonparametric density estimation.
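A sketch of the grouping criterion for the 2-D case: the mutual information between the row and column clusterings, computed from their co-occurrence counts (MODL's search procedure itself is not reproduced):

    import numpy as np

    def mutual_information(joint_counts):
        """Mutual information (nats) between the two clusterings of a co-occurrence matrix."""
        p = np.asarray(joint_counts, dtype=float)
        p /= p.sum()
        # Outer product of the marginals: what the joint would be under independence.
        outer = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / outer[mask])))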

DPCocGen: differentially private co-clustering

[Pipeline diagram: the original data are transformed into a full-dimensional distribution; Laplace noise (budget ε1) yields a noisy distribution, on which the co-clustering is computed to obtain a co-clustering matrix. The original data are then partitioned along this co-clustering matrix, the partition counts are perturbed with noise (budget ε2), and synthetic data are generated from the noisy co-clustering matrix.]

By the composition theorem, the overall privacy budget is ε = ε1 + ε2.
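A minimal end-to-end sketch of this pipeline under simplifying assumptions: the MODL co-clustering is stubbed out by a `cocluster` callback that maps each fine cell to a cluster id, and within a sampled cluster a fine cell is drawn uniformly:

    import numpy as np

    def dpcocgen_sketch(data, bins, eps1, eps2, cocluster, n_synthetic, rng=None):
        """Noisy full-dim histogram -> co-clustering -> noisy partition counts -> sampling."""
        rng = rng or np.random.default_rng()
        counts, _ = np.histogramdd(data, bins=bins)
        noisy = np.clip(counts + rng.laplace(scale=1.0 / eps1, size=counts.shape), 0.0, None)
        labels = cocluster(noisy)            # stub for MODL: one cluster id per fine cell
        k = int(labels.max()) + 1
        # Partition the original counts along the co-clustering and perturb with eps2;
        # the total budget is eps1 + eps2 by sequential composition.
        part = np.bincount(labels.ravel(), weights=counts.ravel(), minlength=k)
        noisy_part = np.clip(part + rng.laplace(scale=1.0 / eps2, size=k), 0.0, None)
        probs = noisy_part / noisy_part.sum()
        rows = []
        for c in rng.choice(k, size=n_synthetic, p=probs):
            cell = rng.choice(np.flatnonzero(labels.ravel() == c))  # uniform cell in cluster
            rows.append(np.unravel_index(cell, counts.shape))
        return np.array(rows)  # synthetic records as cell indices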

Evaluation of DPCocGen

Evaluation

Criteria:
1. Joint distribution preservation
2. Relative error on random range queries
3. Classification performance of a classifier that learns from the synthetic data

To observe:
1. Impact of the privacy budget ε
2. Impact of the aggregation level (number of cells)
3. Comparison with the baseline algorithm and PrivBayes

Adult dataset
- The dataset² contains 48,842 instances and 14 attributes, whose characteristics are both numeric and nominal.
- The attributes {age, workclass, education, relationship, sex} are retained.
- Continuous attributes are discretized into data-independent equi-width partitions.

² UC Irvine Machine Learning Repository
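A sketch of this preprocessing with pandas, assuming a local copy of the UCI Adult file and its documented column names; the fixed 10-year age bins are an illustrative choice, not from the talk:

    import pandas as pd

    # Hypothetical local copy of the UCI Adult file; column names follow the repository docs.
    cols = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
            "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
            "hours-per-week", "native-country", "income"]
    adult = pd.read_csv("adult.data", names=cols, skipinitialspace=True)

    kept = adult[["age", "workclass", "education", "relationship", "sex"]].copy()
    # Data-independent equi-width partition of the continuous attribute: fixed domain
    # bounds, not derived from the observed data.
    kept["age"] = pd.cut(kept["age"], bins=range(0, 101, 10), right=False)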

Experiment: multivariate distribution preservation

Hellinger distance: the Hellinger distance between two discrete probability distributions P = (p_1, ..., p_k) and Q = (q_1, ..., q_k) is given by

\[ D_{\mathrm{Hellinger}}(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2 } \]

Experiment:
- Compute the multivariate distribution vector P of the original dataset
- Compute the multivariate distribution vector Q of the synthetic data generated with DPCocGen
- Compute the multivariate distribution vector Q' of the synthetic data generated with the baseline
- Compute D_Hellinger(P, Q) and D_Hellinger(P, Q')
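A direct transcription of the formula; `p` and `q` stand for the flattened multivariate cell-probability vectors P and Q:

    import numpy as np

    def hellinger(p, q):
        """Hellinger distance between two discrete probability vectors."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0))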

Results: multivariate distribution preservation

[Two plots: variation of the Hellinger distance for different DP strategies (baseline vs. DPCocGen with 144, 288, 360, 480, 720, 960 and 1200 cells), for ε = 0.1 and ε = 0.5; x-axis: number of cells, y-axis: Hellinger distance.]

50 datasets are generated for each configuration.

Experiment: random range queries
- Generate 100 random queries
- Compute all the queries and report their average error
- Iterate over 15 runs
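A sketch of this evaluation, assuming the original and synthetic data are summarized as count hypercubes over the same grid; the exact error normalization is an assumption, since the talk does not spell it out:

    import numpy as np

    def avg_relative_error(orig_counts, synth_counts, n_queries=100, rng=None):
        """Average relative error (%) of random axis-aligned range count queries."""
        rng = rng or np.random.default_rng()
        errs = []
        for _ in range(n_queries):
            # A random hyper-rectangle over the cell grid, one interval per dimension.
            box = tuple(slice(*sorted(rng.integers(0, s + 1, size=2)))
                        for s in orig_counts.shape)
            true, est = orig_counts[box].sum(), synth_counts[box].sum()
            errs.append(abs(est - true) / max(true, 1.0))  # guard: empty true range
        return 100.0 * float(np.mean(errs))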

Results: random range queries

[Plot: relative error (%) as a function of ε (0.01 to 0.50) for the baseline, DPCocGen and PrivBayes.]

Experiment: classification performance
- Randomly divide the original dataset into two sets: a training set (80% of the data) and a test set (20% of the data)
- Generate synthetic data with DPCocGen, the baseline and PrivBayes on the training set
- Learn a naive Bayes classifier from the synthetic data to predict the value of the attribute Sex
- Measure the classification performance of the trained models on the test set
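A sketch of this protocol with scikit-learn, assuming the retained attributes are integer-encoded with the same category sets in the synthetic and test data; names like `X_synth` are illustrative:

    from sklearn.metrics import roc_auc_score
    from sklearn.naive_bayes import CategoricalNB

    def auc_on_real_test(X_synth, y_synth, X_test, y_test):
        """Train naive Bayes on synthetic rows, score AUC on the held-out real test set."""
        clf = CategoricalNB().fit(X_synth, y_synth)
        return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

    # Hypothetical usage, mirroring the 80/20 protocol above:
    # from sklearn.model_selection import train_test_split
    # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # X_synth, y_synth = <generate with DPCocGen / baseline / PrivBayes on the training set>
    # print(auc_on_real_test(X_synth, y_synth, X_test, y_test))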

Results: classification (predict Sex)

[Figure 1: average AUC across 15 runs as a function of ε (0.01 to 0.50), for the baseline, DPCocGen, PrivBayes and the original data.]

Conclusion

Advantages:
1. Parameter-free
2. Preserves utility

Limits:
1. Limited dimensionality
2. Requires a discretization step

Perspectives:
1. Use differentially private dimension reduction strategies to tackle the dimensionality limitation

Thank you!

References

[Dwo06] Cynthia Dwork. Differential privacy. In Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener, editors, Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, pages 1-12. Springer Berlin Heidelberg, 2006.

[MCFY11] Noman Mohammed, Rui Chen, Benjamin Fung, and Philip S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 493-501. ACM, 2011.

[RKS16] Harichandan Roy, Murat Kantarcioglu, and Latanya Sweeney. Practical differentially private modeling of human movement data. In IFIP Annual Conference on Data and Applications Security and Privacy, pages 170-178. Springer, 2016.

[XXFG12] Yonghui Xiao, Li Xiong, Liyue Fan, and Slawomir Goryczka. DPCube: differentially private histogram release through multidimensional partitioning. arXiv preprint arXiv:1202.5358, 2012.

[ZCP+14] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: private data release via Bayesian networks. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 1423-1434. ACM, 2014.