Synthetic Data
Michael Lin
Overview
- The data privacy problem
- Imputation
- Synthetic data
- Analysis
Data Privacy
- As a data provider, how can we release data containing private information without disclosing that private information?
- ...for some definitions of "private" and "disclosure"
- There are many, many approaches; one could teach an entire course about it: removal of data, k-anonymity, synthetic data...
Synthetic Data Overview
The basic idea is simple:
- Analyze the data to determine its statistical properties
- Create a data set based on this knowledge
- Release the new data set
Does this satisfy data privacy requirements? Is it useful?
Synthetic Data Overview
- Is it even possible to create a data set that preserves the statistical properties of the original?
- How do we do it in general?
Imputation
- Imputation: a statistical method for filling in missing data values
- Multiple imputation: impute the data m times and release all m completed data sets

Original (missing values shown as -):

  ID  S  R  A
  A   M  -  20
  B   F  -  21
  C   F  B  26
  D   -  W  -

After one imputation:

  ID  S  R  A
  A   M  W  20
  B   F  B  21
  C   F  B  26
  D   F  W  22
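A minimal sketch of the multiple-imputation idea in Python. This is illustrative only: real implementations draw from a posterior predictive distribution conditional on the other variables, not from a single fitted normal, and `multiple_impute` is a hypothetical helper name.

```python
import random
import statistics

def multiple_impute(values, m, seed=0):
    """Impute each missing entry (None) m times by drawing from a
    normal distribution fitted to the observed values, returning m
    completed copies of the data."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    completed = []
    for _ in range(m):
        completed.append([v if v is not None else rng.gauss(mu, sigma)
                          for v in values])
    return completed

# Ages with two non-responses; the provider would release all m copies.
completed = multiple_impute([20, 21, 26, None, None, 22], m=3)
```

Each of the m copies keeps the observed values untouched and fills the gaps with a fresh random draw, so the variation across copies reflects the uncertainty about the missing values.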
Multiple Imputation With Large Sample Sizes
- The original formulation of multiple imputation (Rubin 1987)
- Y_obs is the observed data; Y_mis is the data missing due to non-response
- The available information is described by D = (X, Y_obs, I, R)
- Y_mis is imputed by drawing from the posterior predictive distribution of (Y_mis | D)
Multiple Imputation With Large Sample Sizes
- I: a vector indicating whether a given individual was selected to be surveyed
- R: a vector indicating whether a given individual responded to the survey
- Design variables / predictors X = (X1, X2, X3), e.g. Sex, Race, Age; outcome Y, e.g. Education
- We assume X has no missing data
Multiple Imputation With Large Sample Sizes
- The data provider repeats the process from the previous slides m times and releases m complete data sets
- Each complete data set can be analyzed with standard statistics and software
- After all m have been analyzed for some quantity Q (e.g. a population mean), three equations give the estimated value of Q and the variance of that estimate
Multiple Imputation With Large Sample Sizes

  Q_bar_m = (1/m) * sum_{l=1}^{m} Q^(l)                     (mean of the m estimates)
  B_m     = sum_{l=1}^{m} (Q^(l) - Q_bar_m)^2 / (m - 1)     (variance across the m estimates)
  U_bar_m = (1/m) * sum_{l=1}^{m} U^(l)                     (average within-data-set variance)

- Q_bar_m estimates Q; T_m = (1 + 1/m) * B_m + U_bar_m estimates the variance of Q given this data (inference uses a t-distribution)
- As m increases, these estimates improve
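The three combining rules above can be written directly in Python (a sketch; `combine` is an illustrative name, and `estimates`/`variances` hold the per-data-set Q^(l) and U^(l)):

```python
import statistics

def combine(estimates, variances):
    """Pool the m per-data-set estimates Q^(l) and their variances U^(l)
    into one estimate and the variance of that estimate."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)                        # mean of the m estimates
    b_m = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    u_bar = statistics.mean(variances)                        # mean within-data-set variance
    t_m = (1 + 1 / m) * b_m + u_bar                           # total variance of q_bar
    return q_bar, t_m

# Three imputed data sets, each yielding an estimate and its variance:
q_hat, t_m = combine([10.1, 9.8, 10.4], [0.25, 0.30, 0.28])
```

The `(1 + 1/m)` factor inflates the between-imputation variance to account for using only finitely many imputations.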
Multiple Imputation and Data Privacy
- What does imputation have to do with data privacy?
- Traditional imputation uses available data to fill in missing data
- What if all the responses are "missing"?
Creating Fully Synthetic Data
- Previously, we imputed values only for the sample; now impute values for the population units not in the sample
- This produces the l-th complete data set (X, Y_com^(l))

[Diagram: population (data unknown) + sample data set -> impute -> complete population data set]
Creating Fully Synthetic Data
- Randomly sample from (X, Y_com^(l)) n_syn times to produce the synthetic data set d^(l) = (X, Y_syn^(l))
- There is still a small possibility of sampling real data (this possibility can be eliminated)
- Repeat m times and release these m data sets

[Diagram: complete population data set -> sample -> d^(l)]
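The impute-then-resample step might look like this in Python. This is a toy sketch: a single fitted normal stands in for the imputation model, and names like `make_synthetic` and `n_syn` are illustrative.

```python
import random
import statistics

def make_synthetic(sample_y, population_size, n_syn, m, seed=0):
    """For each of m releases: impute Y for every population unit from a
    model fitted to the surveyed sample, then draw n_syn of those
    imputed units to form the released synthetic data set d^(l)."""
    rng = random.Random(seed)
    mu = statistics.mean(sample_y)
    sigma = statistics.stdev(sample_y)
    releases = []
    for _ in range(m):
        y_com = [rng.gauss(mu, sigma) for _ in range(population_size)]  # Y_com^(l)
        d_l = rng.sample(y_com, n_syn)                                  # d^(l) = Y_syn^(l)
        releases.append(d_l)
    return releases

# A 4-person survey sample, a population of 1000, 50 synthetic units per release.
data_sets = make_synthetic([20, 21, 26, 22], population_size=1000, n_syn=50, m=3)
```

Note one simplification: in the formulation on the slide, the completed population keeps the real sampled units' values, which is where the small chance of releasing real data comes from; this sketch re-draws every value, so nothing real can appear in the output.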
Analyzing Fully Synthetic Data
- Calculate Q_bar_m, B_m, and U_bar_m as for normal multiple imputation
- However, estimate the variance of Q with T_f = (1 + 1/m) * B_m - U_bar_m, compared to T_m = (1 + 1/m) * B_m + U_bar_m for normal imputation
- Intuitively, the first term estimates the variance of Q, and subtracting U_bar_m corrects for the variance introduced by the extra random sampling from (X, Y_com^(l))
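As a sketch, the fully synthetic variance estimator differs from the ordinary one only in the sign on U_bar_m (the function name is illustrative):

```python
def t_fully_synthetic(estimates, variances):
    """T_f = (1 + 1/m) * B_m - U_bar_m: note the minus sign, versus the
    plus sign used for ordinary multiple imputation."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    b_m = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    u_bar = sum(variances) / m
    return (1 + 1 / m) * b_m - u_bar

t_f = t_fully_synthetic([10.1, 9.8, 10.4], [0.02, 0.03, 0.025])
```

Because of the subtraction, T_f can come out negative when m is small and the between-imputation variance is modest; adjusted estimators exist for that case.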
Partially Synthetic Data
The same process as normal multiple imputation, except we replace real data instead of filling in missing data

Original:

  ID  S  R  A
  A   M  B  20
  B   F  W  21
  C   F  B  26
  D   F  W  23

Partially synthetic (some values replaced):

  ID  S  R  A
  A   M  W  20
  B   F  B  21
  C   F  B  24
  D   F  W  22
Partially Synthetic Data
- Replacing values instead of filling them in changes the analysis
- Use the same three equations, but now estimate the variance with T_p = B_m / m + U_bar_m
- Note that it's trivial to identify which variables are synthetic in partially synthetic data
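The partially synthetic variance estimator in the same style (an illustrative sketch, function name hypothetical):

```python
def t_partially_synthetic(estimates, variances):
    """T_p = B_m / m + U_bar_m: the variance estimate used when only
    some values have been replaced with synthetic ones."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    b_m = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    u_bar = sum(variances) / m
    return b_m / m + u_bar

t_p = t_partially_synthetic([10.1, 9.8, 10.4], [0.25, 0.30, 0.28])
```

Unlike T_f, this quantity is always non-negative, since both terms are non-negative.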
Analysis
As always, we want to measure two things:
- How useful is this data?
- How well is confidentiality preserved?
What trade-offs do we make here?
Confidentiality
- Identifying a person from fully synthetic data is claimed to be practically impossible
- It is easier (but still difficult) to identify the real values that the synthetic data is based on
- Both claims rest on the security of releasing modeled data rather than actual data
- What if the model is too good?
Confidentiality Risks
- Variables imputed from distributions with small variances could be inferred from the synthetic data
- If the statistical models used for imputation are too accurate, real data can be leaked
- Bootstrapping can leak real data
  - Bootstrapping: a statistical resampling method that re-uses real data
Confidentiality Risks
These risks can be controlled:
- Use less precise distributions when imputing (this hurts the utility of the synthetic data)
- Don't bootstrap
Utility
- The utility of synthetic data depends almost entirely on how good the distributional models of the original data are
- If the models were perfect, the synthetic data would preserve all correlations and statistical measurements present in the original
- Since perfect models are impossible, very good ones will have to do
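One crude way to check this in practice is to compare simple summary statistics between the original and synthetic versions of a column. This is a toy utility check, not a standard metric, and the function name is illustrative:

```python
import statistics

def utility_gaps(original, synthetic):
    """Compare the mean and standard deviation of a column before and
    after synthesis; smaller gaps suggest better-preserved utility."""
    return {
        "mean_gap": abs(statistics.mean(original) - statistics.mean(synthetic)),
        "sd_gap": abs(statistics.stdev(original) - statistics.stdev(synthetic)),
    }

gaps = utility_gaps([20, 21, 26, 22], [19.5, 22.0, 25.0, 23.1])
```

Real evaluations go further, comparing correlations and regression coefficients fitted on both data sets, but the principle is the same: utility is measured by how closely analyses of the synthetic data track analyses of the original.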
Utility
What are the downsides of synthetic data?
- If an analyst wants to study a tenuous or obscure relationship in the original data, the synthetic modeling may not capture it
- Fundamentally: it is impossible to analyze anything that isn't modeled
Paper Example
- Generally, the synthetic data is very good for most variables and poor for others
- The bad variables tend to measure relationships not captured in the models
- The paper does not discuss real or potential re-identification disclosure
- The predictive-disclosure example is rather soft
Comments
- Where is the proof that synthetic data makes the risk of re-identification practically non-existent?
- The risk of re-identification depends heavily on the models used, so this probably can't be proved in general, but at least some mathematical argument is needed
- No mathematical justification or proof is given