Synthetic Data. Michael Lin

Overview
- The data privacy problem
- Imputation
- Synthetic data
- Analysis

Data Privacy
- As a data provider, how can we release data containing private information without disclosing this private information?
  - For some values of "private" and "disclosure"
- Many, many approaches. Could teach an entire course about it!
  - Removal of data, k-anonymity, synthetic data, ...

Synthetic Data Overview
- The basic idea is simple:
  1. Analyze the data to determine its statistical properties
  2. Create a data set based on this knowledge
  3. Release the new data set
- Does this satisfy data privacy requirements?
- Is this useful?

Synthetic Data Overview
- Is it even possible to create a data set that preserves the statistical properties of the original?
- How do we do it in general?

Imputation
- Imputation: a statistical method for filling in missing data values
- Multiple imputation: impute data m times and release all m data sets

Example ("-" marks a missing value):

    Before imputation          After imputation
        S   R   A                  S   R   A
    A   M   -   20             A   M   W   20
    B   F   -   21      ->     B   F   B   21
    C   F   B   26             C   F   B   26
    D   -   W   -              D   F   W   22
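To make the "impute m times" step concrete, here is a minimal Python sketch on the toy table above. It is not the method from the slides: as a stand-in for a real imputation model it fills each gap with a randomly chosen observed value from the same column (a simple hot-deck rule). The names `impute_once` and `multiply_impute`, and the reading of S, R, A as sex, race, and age, are assumptions made for illustration.

```python
import random

# Toy records from the slide; None marks a missing value.
# Reading S, R, A as sex, race, age is an assumption.
records = [
    {"id": "A", "S": "M", "R": None, "A": 20},
    {"id": "B", "S": "F", "R": None, "A": 21},
    {"id": "C", "S": "F", "R": "B", "A": 26},
    {"id": "D", "S": None, "R": "W", "A": None},
]


def impute_once(rows, rng):
    """Return a completed copy of rows.

    Each missing value is replaced by a random observed value from the
    same column (hot-deck stand-in, not a real imputation model).
    """
    completed = []
    for row in rows:
        new_row = dict(row)
        for col, value in row.items():
            if value is None:
                donors = [r[col] for r in rows if r[col] is not None]
                new_row[col] = rng.choice(donors)
        completed.append(new_row)
    return completed


def multiply_impute(rows, m, seed=0):
    """Impute m times and return all m completed data sets."""
    rng = random.Random(seed)
    return [impute_once(rows, rng) for _ in range(m)]


datasets = multiply_impute(records, m=5)
for i, d in enumerate(datasets, start=1):
    print(i, [(r["id"], r["S"], r["R"], r["A"]) for r in d])
```

Observed values are never changed; only the gaps differ between the m data sets, and that spread is what the later variance formulas measure.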

Multiple Imputation With Large Sample Sizes
- The original formulation of multiple imputation (Rubin 1987)
- $Y_{obs}$ is the observed data
- $Y_{mis}$ is the data missing due to non-response
- The distribution is described by: $D = (X, Y_{obs}, I, R)$
- $Y_{mis}$ is imputed based on the posterior predictive distribution of $(Y_{mis} \mid D)$

Multiple Imputation With Large Sample Sizes
- I: a vector that indicates whether a given individual is selected to be surveyed
- R: a vector that indicates whether a given individual responded to the survey

    Variables      Design Variables / Predictors
    Y              X1     X2     X3
    Education      Sex    Race   Age

- We assume X is missing no data

Multiple Imputation With Large Sample Sizes
- The data provider repeats the process from the previous slides m times and releases m complete data sets
- Each complete data set can be analyzed with regular statistics and software
- After all m have been analyzed for some quantity Q (e.g., the population mean), three equations give the estimated value of Q and the variance of the estimate

Multiple Imputation With Large Sample Sizes

    \bar{Q}_m = \sum_{l=1}^{m} Q^{(l)} / m                         (sample mean)
    B_m = \sum_{l=1}^{m} (Q^{(l)} - \bar{Q}_m)^2 / (m - 1)         (variance across samples)
    \bar{U}_m = \sum_{l=1}^{m} U^{(l)} / m                         (sample variance)

- $\bar{Q}_m$ estimates Q
- $T_m = (1 + 1/m) B_m + \bar{U}_m$ estimates the variance of Q given this data set (a t-distribution)
- As m increases, these estimates improve
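The three combining equations and T_m translate directly into a few lines of Python. This is a sketch using only the standard library; the function name `combine_multiple_imputation` and the example numbers are made up for illustration and do not come from the slides.

```python
from statistics import mean, variance


def combine_multiple_imputation(q, u):
    """Combine per-data-set results from m imputed data sets.

    q: point estimates Q^(l), one per imputed data set
    u: variance estimates U^(l), one per imputed data set
    Returns (Q_bar, B_m, U_bar, T_m).
    """
    m = len(q)
    q_bar = mean(q)      # sample mean of the m estimates
    b_m = variance(q)    # variance across data sets, divisor m - 1
    u_bar = mean(u)      # average of the per-data-set variances
    t_m = (1 + 1 / m) * b_m + u_bar
    return q_bar, b_m, u_bar, t_m


# Hypothetical results from m = 3 imputed data sets.
q_bar, b_m, u_bar, t_m = combine_multiple_imputation(
    q=[10.0, 12.0, 11.0],
    u=[4.0, 5.0, 6.0],
)
print(q_bar, b_m, u_bar, t_m)
```

With these inputs the sample mean is 11, the between-data-set variance is 1, the average within-data-set variance is 5, and T_m = (4/3)(1) + 5, about 6.33.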

Multiple Imputation and Data Privacy
- What does imputation have to do with data privacy?
- Traditional imputing is a method for using available data to fill in missing data
- What if all the responses are missing?

Creating Fully Synthetic Data
- Previously, we imputed values only for the sample
- Now impute values for the population not in the sample
- This produces the l-th complete data set $(X, Y_{com}^{(l)})$

[Diagram: the population (data unknown), containing the sample data set, is imputed to give the population complete data set, which still contains the sample data set]

Creating Fully Synthetic Data
- Randomly sample from $(X, Y_{com}^{(l)})$ $n_{syn}$ times to produce synthetic data set $d^{(l)} = (X, Y_{syn}^{(l)})$
- There is still a small possibility of sampling real data (can eliminate this possibility)
- Repeat m times and release these data sets

[Diagram: the population complete data set, containing the sample data set, is sampled to give $d^{(l)}$]
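The two steps (impute the unsampled population, then sample from the completed population) can be sketched as follows. Several simplifications are assumptions, not the slides' method: Y given X is modeled as a per-group normal distribution, the model parameters are fixed at their sample estimates rather than drawn from a posterior, and records are drawn only from units outside the original sample, which is one way to rule out releasing real values. The function name `make_fully_synthetic` and all data are hypothetical.

```python
import random
from statistics import mean, stdev


def make_fully_synthetic(population_x, sample, n_syn, m, seed=0):
    """Return m synthetic data sets of n_syn (x, y) records each.

    population_x: design variable X, known for every unit in the population
    sample: dict mapping unit index -> observed y (the real survey data)
    """
    rng = random.Random(seed)

    # Fit a normal model for y within each x group, using the sample only.
    by_group = {}
    for i, y in sample.items():
        by_group.setdefault(population_x[i], []).append(y)
    params = {g: (mean(ys), stdev(ys)) for g, ys in by_group.items()}

    not_sampled = [i for i in range(len(population_x)) if i not in sample]
    released = []
    for _ in range(m):
        # Impute y for every unit outside the sample: this is (X, Y_com).
        y_com = [
            sample[i] if i in sample else rng.gauss(*params[x])
            for i, x in enumerate(population_x)
        ]
        # Draw n_syn units, restricted to imputed units so that no real
        # value is released.
        chosen = rng.sample(not_sampled, n_syn)
        released.append([(population_x[i], y_com[i]) for i in chosen])
    return released


# Hypothetical population of 20 units; 6 of them were surveyed.
population_x = ["M", "F"] * 10
sample = {0: 50.0, 2: 54.0, 4: 52.0, 1: 60.0, 3: 64.0, 5: 62.0}
released = make_fully_synthetic(population_x, sample, n_syn=5, m=3)
for d in released:
    print(d)
```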

Analyzing Fully Synthetic Data
- Calculate $\bar{Q}_m$, $B_m$, and $\bar{U}_m$ as for normal multiple imputation
- However, calculate the variance for Q as $T_f = (1 + 1/m) B_m - \bar{U}_m$, compared to $T_m = (1 + 1/m) B_m + \bar{U}_m$ for normal imputation
- Intuitively, the first term estimates the variance of Q, and the second term estimates the variance due to the random sampling of $(X, Y_{com}^{(l)})$
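The only difference between the two variance formulas is the sign of the average within-data-set variance, which a short Python sketch makes visible. The function name and input numbers are illustrative assumptions. Because T_f is a difference of two estimates, it can come out negative for some inputs; that follows from the arithmetic alone.

```python
from statistics import mean, variance


def variance_estimates(q, u):
    """Return (T_f, T_m) for per-data-set estimates q and variances u.

    T_f = (1 + 1/m) * B_m - U_bar   (fully synthetic data)
    T_m = (1 + 1/m) * B_m + U_bar   (ordinary multiple imputation)
    """
    m = len(q)
    b_m = variance(q)    # between-data-set variance, divisor m - 1
    u_bar = mean(u)      # average within-data-set variance
    first_term = (1 + 1 / m) * b_m
    return first_term - u_bar, first_term + u_bar


# Hypothetical results from m = 3 synthetic data sets.
t_f, t_m = variance_estimates(q=[10.0, 12.0, 11.0], u=[0.5, 0.25, 0.75])
print(t_f, t_m)

# Same point estimates, larger within-data-set variances: T_f goes negative.
t_f_neg, _ = variance_estimates(q=[10.0, 12.0, 11.0], u=[2.0, 2.0, 2.0])
print(t_f_neg)
```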

Partially Synthetic Data
- The same process as normal multiple imputation, except we replace data instead of filling it in

    Original data              Partially synthetic data
        S   R   A                  S   R   A
    A   M   B   20             A   M   W   20
    B   F   W   21      ->     B   F   B   21
    C   F   B   26             C   F   B   24
    D   F   W   23             D   F   W   22

Partially Synthetic Data
- Replacing instead of filling in changes the analysis
- Use the same 3 equations, but now we measure variance with: $T_p = B_m / m + \bar{U}_m$
- Note that it's trivial to identify which variables are synthetic in partially synthetic data
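A small Python sketch of the replace-then-analyze workflow, under stated assumptions: only the age column A is replaced, the replacement values come from a normal distribution fitted to the real ages (a placeholder for a proper imputation model), and the quantity Q being analyzed is the mean age. The names `synthesize_column` and `partially_synthetic_variance` are hypothetical.

```python
import random
from statistics import mean, stdev, variance

# Original table from the slide (S, R, A columns).
rows = [
    {"id": "A", "S": "M", "R": "B", "A": 20},
    {"id": "B", "S": "F", "R": "W", "A": 21},
    {"id": "C", "S": "F", "R": "B", "A": 26},
    {"id": "D", "S": "F", "R": "W", "A": 23},
]


def synthesize_column(data, column, rng):
    """Replace every value in one column with a draw from a fitted normal."""
    values = [r[column] for r in data]
    mu, sigma = mean(values), stdev(values)
    return [{**r, column: round(rng.gauss(mu, sigma))} for r in data]


def partially_synthetic_variance(q, u):
    """T_p = B_m / m + U_bar."""
    m = len(q)
    return variance(q) / m + mean(u)


rng = random.Random(0)
m = 5
datasets = [synthesize_column(rows, "A", rng) for _ in range(m)]

# Q is the mean age; U is the estimated variance of that mean.
q = [mean(r["A"] for r in d) for d in datasets]
u = [variance([r["A"] for r in d]) / len(d) for d in datasets]
t_p = partially_synthetic_variance(q, u)
print(q, t_p)
```

The untouched columns (S and R here) are real data in every released copy, which is why it is easy to see which variables were synthesized.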

Analysis
- As always, we want to measure two things:
  - How useful is this data?
  - How well is confidentiality preserved?
- What trade-offs do we make here?

Confidentiality
- Identifying a person based on fully synthetic data is claimed to be pretty much impossible
- It is easier (but still difficult) to identify the real variables that the synthetic data is based on
- Both these claims are based on the security of using modeled data rather than actual data
- What if the model is too good?

Confidentiality Risks
- Variables imputed from distributions with small variances could be identified from synthetic data
- If the statistical models used for imputation are too accurate, real data can be leaked
- Bootstrapping can leak real data
  - Bootstrapping: a statistical resampling method that re-uses real data
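The bootstrapping risk is easy to see in code: a bootstrap resample is built only from real records, so every row it contains is a row from the original data. This is a minimal Python sketch with made-up records; the function name `bootstrap` is an assumption.

```python
import random


def bootstrap(records, rng):
    """Draw len(records) records with replacement from the real data."""
    return [rng.choice(records) for _ in records]


# Hypothetical real records: (sex, race, age).
real = [("M", "W", 20), ("F", "B", 21), ("F", "B", 26), ("F", "W", 22)]

rng = random.Random(0)
resample = bootstrap(real, rng)

# Every resampled record is an exact copy of a real record.
leaked = [r for r in resample if r in real]
print(resample)
print(len(leaked), "of", len(resample), "resampled records are real")
```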

Confidentiality Risks
- These risks can be controlled:
  - Use less precise distributions when imputing
    - This hurts the utility of the synthetic data
  - Don't bootstrap

Utility
- The utility of synthetic data is based almost entirely on how good the distribution models of the original data are
- If the models are perfect, synthetic data will preserve all correlations and statistical measurements present in the original
- Since perfect models are impossible, very good ones will have to do

Utility
- What are the downsides of synthetic data?
- If an analyst wants to analyze a tenuous or obscure relationship in the original data, the synthetic modeling may not capture it
- Fundamentally: it's impossible to analyze anything that isn't modeled
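A toy Python illustration of "it's impossible to analyze anything that isn't modeled". The setup is entirely hypothetical: the real data has a strong linear relationship between x and y, but the synthesizer models each variable on its own with a normal distribution. The synthetic data then matches the means and spreads but loses the correlation.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_normal(values):
    """Return the sample mean and sample standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return mu, sigma


rng = random.Random(0)
n = 2000

# "Real" data: y depends strongly on x.
x_real = [rng.gauss(0, 1) for _ in range(n)]
y_real = [x + rng.gauss(0, 0.3) for x in x_real]

# Synthesizer that models each variable separately and ignores the link.
mu_x, sd_x = fit_normal(x_real)
mu_y, sd_y = fit_normal(y_real)
x_syn = [rng.gauss(mu_x, sd_x) for _ in range(n)]
y_syn = [rng.gauss(mu_y, sd_y) for _ in range(n)]

corr_real = pearson(x_real, y_real)
corr_syn = pearson(x_syn, y_syn)
print("real correlation:", round(corr_real, 3))
print("synthetic correlation:", round(corr_syn, 3))
```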

Paper Example
- Generally, the synthetic data is very good for most variables, and awful for others
- The bad variables tend to measure relationships not captured in the models
- Does not discuss real or potential reidentification disclosure
- Predictive disclosure example is rather soft

Comments
- Where's the proof that synthetic data makes the risk of reidentification practically non-existent?
- Risk of reidentification is highly dependent on the models used, so this probably can't be proved in general, but at least some mathematical logic is needed
- No mathematical justification or proof given