CS573 Data Privacy and Security. Differential Privacy tabular data and range queries. Li Xiong

1 CS573 Data Privacy and Security Differential Privacy tabular data and range queries Li Xiong

2 Outline Tabular data and histogram/range queries Algorithms for low dimensional data Algorithms for high dimensional data

3 Example: cohort discovery from medical records
Histograms
Cohort discovery: range queries
SELECT COUNT(*) FROM D WHERE A1 IN I1 AND A2 IN I2 AND ... AND Am IN Im

4 Example: statistical agencies: data publishing
A marginal over attributes A_1, ..., A_k reports the count for each combination of attribute values (aka cube, contingency table)
E.g. 2-way marginal on EmploymentStatus and Gender
U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of available attributes
Hundreds of marginals released
Module 3 Tutorial: Differential Privacy in the Wild
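
A marginal is just a group-by count, which can be sketched as follows. This is a minimal illustration, not code from the lecture: the toy records and the helper name `marginal` are assumptions, and no privacy protection is applied yet.

```python
from collections import Counter

def marginal(records, attrs):
    """Count records for each observed combination of values of `attrs`."""
    return Counter(tuple(r[a] for a in attrs) for r in records)

# Hypothetical toy records; attribute names follow the slide's example.
records = [
    {"EmploymentStatus": "Employed", "Gender": "F", "Age": 34},
    {"EmploymentStatus": "Employed", "Gender": "M", "Age": 51},
    {"EmploymentStatus": "Unemployed", "Gender": "F", "Age": 29},
    {"EmploymentStatus": "Employed", "Gender": "F", "Age": 45},
]

# 2-way marginal on EmploymentStatus and Gender.
two_way = marginal(records, ["EmploymentStatus", "Gender"])
```

Combinations that never occur are simply absent from the counter (looking them up returns 0); a private release would have to add noise to those empty cells as well.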

5 Example: range queries over spatial data
Input: sensitive data D
Input: range query workload W (shown is a workload of 3 range queries)
BeijingTaxi dataset [1]: 4,268,780 records of (lat, lon) pairs of taxi pickup locations in Beijing, China in 1 month (scatter plot of input data)
Task: compute answers to workload W over private input D
[1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China

6 Problem variant: offline vs. online
Offline (batch): entire W given as input, answers computed in batch
Online (adaptive): W is a sequence q_1, q_2, ... that arrives online
Adaptive: analyst's choice for q_i can depend on answers a_1, ..., a_{i-1}

7 Important aspects of problem: data and query complexity
Data complexity
  Dimensionality: number of attributes
  Domain size: number of distinct attribute combinations
  Many techniques specialized for low dimensional data
Query complexity
  Given query workload vs. no query workload
  Classes of queries: histograms, count queries, linear queries (sum, average), median

8 Solution variants: query answers vs. synthetic data
Two high-level approaches to solving the problem
1. Direct: output of the algorithm is a list of query answers
2. Synthetic data: algorithm constructs a synthetic dataset D', which can be queried directly by the analyst
   Analyst can pose additional queries on D' (though answers may not be accurate)

9 Synthetic Data: Categories of Methods Nonparametric methods release empirical distributions, i.e. histograms with differential privacy Parametric and semi-parametric methods learn parameters of a distribution with differential privacy

10 Outline
Tabular data and histogram/range queries
Algorithms for low dimensional data
  Baseline
  Partitioning algorithms: kd tree, quad tree, ...
  Transformation: Wavelet, Fourier Transform, ...
  An evaluation framework: DPBench
Algorithms for high dimensional data

11 Baseline algorithm: IDENTITY
(Figure: scatter plot of input data)
1. Discretize attribute domain into cells
2. Add noise to cell counts (Laplace mechanism): unit histogram
3. Use noisy counts to either
   1. Answer queries directly (assume distribution is uniform within cell)
   2. Generate synthetic data (derive distribution from counts and sample)

12 Baseline algorithm: IDENTITY
(Figure: scatter plot of input data)
1. Discretize attribute domain into cells
2. Add noise to cell counts (Laplace mechanism): unit histogram
3. Use noisy counts to either
   1. Answer queries directly (assume distribution is uniform within cell)
   2. Generate synthetic data (derive distribution from counts and sample)
Limitations
  Granularity of discretization
    Coarse: detail lost
    Fine: noise overwhelms signal
  Noise accumulates: squared error grows linearly with range
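
The IDENTITY steps can be sketched in one dimension as below. This is a minimal illustration rather than the assignment's reference solution: the function names, the equal-width grid, and the add/remove-one-record notion of neighboring datasets (which gives the unit histogram L1 sensitivity 1) are assumptions. Laplace noise is drawn as the difference of two exponential variates so that only the standard library is needed.

```python
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def identity_histogram(values, lo, hi, n_cells, epsilon, rng):
    """Noisy unit histogram over [lo, hi) with n_cells equal-width cells.

    Assumes every value lies in [lo, hi).
    """
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for v in values:
        counts[min(int((v - lo) / width), n_cells - 1)] += 1
    # Adding or removing one record changes one cell by 1, so the scale is 1/epsilon.
    return [c + laplace(1.0 / epsilon, rng) for c in counts], width

def range_count(noisy, lo, width, a, b):
    """Estimate the count in [a, b), assuming uniformity within each cell."""
    total = 0.0
    for i, c in enumerate(noisy):
        cell_lo = lo + i * width
        overlap = max(0.0, min(b, cell_lo + width) - max(a, cell_lo))
        total += c * overlap / width
    return total

rng = random.Random(0)
values = [rng.uniform(0, 100) for _ in range(10000)]  # synthetic stand-in data
noisy, width = identity_histogram(values, 0, 100, 20, 1.0, rng)
estimate = range_count(noisy, 0, width, 10, 35)
```

A query that spans m cells sums m independent noise terms, which is the "squared error grows linearly with range" limitation listed on the slide.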

13 [HMMCZ16] Empirical benchmarks An evaluation framework for standardized evaluation of privacy algorithms for range queries (over 1 and 2D) Demo:

14 Data-Dependent Partitioning
Domain-based (data-independent) partitioning does not work very well
  Equi-width: equal bucket range
  Uniformity assumption
Data-driven partitioning
  V-optimal: with the least frequency variance
  Intuition: highest uniformity within each bucket
How to do it with differential privacy?

15 Histograms (review)
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
  Equi-width: equal bucket range
  Equi-depth: equal frequency
  V-optimal: with the least frequency variance

16 An Early Attempt: DPCube [SDM 2010, ICDE 2012 demo]
Original records:
  Name   Age  Income  HIV+
  Frank  42   30K     Y
  Bob    31   60K     Y
  Mary   28   20K     Y
1. Compute unit histogram with differential privacy (ε/2-DP)
2. kd-tree partitioning
3. Compute merged bin counts with differential privacy (ε/2-DP)
Pipeline: original records -> DP interface -> DP unit histogram -> multi-dimensional partitioning -> DP V-optimal histogram

17 kd-tree based partitioning Choose dimension and splitting point to split (minimize variance) Repeat until: Count of this node less than threshold Variance or entropy of this node less than threshold
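
The splitting rule can be sketched on a 2D grid of cell counts as follows. This is a minimal, non-private illustration: the function names, the use of the sum of squared deviations as the variance measure, and the thresholds are assumptions. In a DPCube-style pipeline the grid would hold the already-noisy unit histogram, so the partitioning itself is post-processing.

```python
def sse(grid, r0, r1, c0, c1):
    """Sum of squared deviations of the cell counts in a block from their mean."""
    cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    mean = sum(cells) / len(cells)
    return sum((x - mean) ** 2 for x in cells)

def kd_partition(grid, r0, r1, c0, c1, min_count, min_sse):
    """Recursively split a block; returns leaf blocks as (r0, r1, c0, c1)."""
    total = sum(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
    # Stop when the block is small or already close to uniform.
    if total < min_count or sse(grid, r0, r1, c0, c1) <= min_sse:
        return [(r0, r1, c0, c1)]
    best = None
    for r in range(r0 + 1, r1):  # candidate cuts along the row dimension
        cost = sse(grid, r0, r, c0, c1) + sse(grid, r, r1, c0, c1)
        if best is None or cost < best[0]:
            best = (cost, (r0, r, c0, c1), (r, r1, c0, c1))
    for c in range(c0 + 1, c1):  # candidate cuts along the column dimension
        cost = sse(grid, r0, r1, c0, c) + sse(grid, r0, r1, c, c1)
        if best is None or cost < best[0]:
            best = (cost, (r0, r1, c0, c), (r0, r1, c, c1))
    if best is None:  # a single cell cannot be split further
        return [(r0, r1, c0, c1)]
    _, first, second = best
    return (kd_partition(grid, *first, min_count, min_sse)
            + kd_partition(grid, *second, min_count, min_sse))

# Toy grid: dense left half, empty right half.
grid = [[10, 10, 0, 0] for _ in range(4)]
blocks = kd_partition(grid, 0, 4, 0, 4, min_count=1, min_sse=0.0)
```

On this grid the first cut separates the dense columns from the empty ones, after which both halves are uniform and the recursion stops.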

18 DPCube [SecureDM 2010, ICDE 2012 demo]
Original records:
  Name   Age  Income  HIV+
  Frank  42   30K     Y
  Bob    31   60K     Y
  Mary   28   20K     Y
Pipeline: original records -> DP interface -> DP unit histogram (ε/2-DP) -> multi-dimensional partitioning -> DP V-optimal histogram (ε/2-DP)
Limitations:
  DP unit histogram very noisy
  Affects the accuracy of partitioning
  Sequential composition

19 A Later Improvement: Private Spatial Decompositions [CPSSY12]
(Figure: quadtree and kd-tree)
Approach: (top down) partitioning with differential privacy
Quad tree and hybrid/kd-tree

20 Building a Private kd-tree
Process to build a private kd-tree
Input: maximum height h, minimum leaf size L, data set
  Choose dimension to split
  Get (private) median in this dimension
  Create child nodes and add noise to the counts
  Recurse until:
    Max height is reached
    Noisy count of this node is less than L
    Budget along the root-leaf path has been used up
The entire PSD satisfies DP by the composition property

21 Building a Private kd-tree
Process to build a private kd-tree
Input: maximum height h, minimum leaf size L, data set
  Choose dimension to split
  Get (private) median in this dimension: exponential mechanism with utility function u(x) = -|rank(x) - rank(median)|
  Create child nodes and add noise to the counts
  Recurse until:
    Max height is reached
    Noisy count of this node is less than L
    Budget along the root-leaf path has been used up
The entire PSD satisfies DP by the composition property
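
The private-median step can be sketched with the exponential mechanism as below. This is a minimal sketch under stated assumptions: the candidate split points come from a fixed, data-independent grid, rank(x) is taken to be the number of data points below x, and the function name is hypothetical. Adding or removing one record shifts each utility by at most 1, so candidates are weighted by exp(ε·u/2).

```python
import bisect
import math
import random

def private_median(values, candidates, epsilon, rng):
    """Choose a split point from `candidates` via the exponential mechanism."""
    data = sorted(values)
    target = len(data) / 2.0  # rank of the true median
    # Utility: minus the distance between a candidate's rank and the median's rank.
    utils = [-abs(bisect.bisect_left(data, x) - target) for x in candidates]
    top = max(utils)  # shifting utilities leaves the sampling probabilities unchanged
    weights = [math.exp(epsilon * (u - top) / 2.0) for u in utils]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(1)
values = list(range(1000))             # toy data with true median near 500
candidates = list(range(0, 1000, 10))  # data-independent candidate grid
split = private_median(values, candidates, 1.0, rng)
```

With ε = 1 the weight drops by a factor of about e^5 for every 10 ranks away from the median, so the selected split lands close to 500 with overwhelming probability.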

22 Building Private Spatial Decompositions: privacy budget allocation
Budget is split between medians and counts at each node
  Trade off accuracy of division against accuracy of counts
Budget is split across levels of the tree
  Privacy budget used along any root-leaf path should total ε (sequential composition along a path, parallel composition across the disjoint nodes of one level)
Optimal budget allocation
Post processing with consistency check
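
As a point of reference, the simplest allocation that meets the root-leaf constraint is a uniform split, sketched below. This is only an illustrative baseline, not the optimal allocation mentioned on the slide; the function name and the `median_share` parameter are assumptions.

```python
def uniform_budget(epsilon, height, median_share=0.3):
    """Per-level (median_budget, count_budget) pairs for depths 0..height.

    Each level gets epsilon / (height + 1). Internal levels divide their
    share between the private median and the noisy count; leaves release
    only a count.
    """
    per_level = epsilon / (height + 1)
    levels = []
    for depth in range(height + 1):
        if depth == height:  # leaf level: no split point to choose
            levels.append((0.0, per_level))
        else:
            levels.append((median_share * per_level,
                           (1 - median_share) * per_level))
    return levels

levels = uniform_budget(epsilon=1.0, height=3)
path_total = sum(m + c for m, c in levels)
```

Nodes on the same level cover disjoint regions, so each level is charged once (parallel composition), while the levels along a path add up (sequential composition) to ε. A quadtree splits at fixed midpoints and would need no median budget at all.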

23 Data-dependent partitioning
Heuristics based methods
  Kd-tree, quad-tree
Optimal methods
  V-optimal histogram (1D or 2D)

24 Data-aware/Workload-Aware Mechanism [LHMY14] Step 1: dynamic programming based methods for optimal partitioning Step 2: matrix mechanism for optimal noise given a query workload

25 Data Transformations
Can think of trees as a data-dependent transform of input
Can apply other data transformations
General idea:
  Apply transform of data
  Add noise in the transformed space (based on sensitivity)
  Publish noisy coefficients, or invert transform (post-processing)
Goal: pick a transform that preserves good properties of data and that has low sensitivity, so noise does not corrupt
Pipeline: original data -> transform -> coefficients -> noise -> noisy coefficients -> invert -> private data
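
The transform, add noise, invert pipeline can be sketched with an orthonormal Haar wavelet transform. This is a simplified illustration and not the Privelet algorithm itself (which weights the noise per coefficient): the function names are assumptions, the input length must be a power of two, and neighboring datasets are assumed to differ by one record, that is, by 1 in a single count. Because the transform is linear, its L1 sensitivity is found by transforming each unit vector.

```python
import math
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def haar_forward(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    cur, coeffs = list(x), []
    while len(cur) > 1:
        avg = [(cur[i] + cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        det = [(cur[i] - cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        coeffs = det + coeffs  # coarser detail coefficients go in front
        cur = avg
    return cur + coeffs

def haar_inverse(c):
    """Inverse of haar_forward."""
    cur, pos = c[:1], 1
    while pos < len(c):
        det = c[pos:pos + len(cur)]
        pos += len(cur)
        nxt = []
        for a, d in zip(cur, det):
            nxt += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        cur = nxt
    return cur

def l1_sensitivity(n):
    """Largest L1 change in the coefficients when one count changes by 1."""
    worst = 0.0
    for i in range(n):
        unit = [0.0] * n
        unit[i] = 1.0
        worst = max(worst, sum(abs(v) for v in haar_forward(unit)))
    return worst

def noisy_haar_release(counts, epsilon, rng):
    scale = l1_sensitivity(len(counts)) / epsilon
    noisy = [v + laplace(scale, rng) for v in haar_forward(counts)]
    return haar_inverse(noisy)  # inverting is post-processing

counts = [3, 1, 4, 1, 5, 9, 2, 6]
released = noisy_haar_release(counts, 1.0, random.Random(0))
```

The sensitivity grows only logarithmically with the domain size here, which is the "low sensitivity" property the slide asks of a good transform.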

26 [HMMCZ16] Empirical benchmarks
An evaluation framework for standardized evaluation of privacy algorithms for range queries (over 1D and 2D)
Demo:
Key findings:
  Scale/size and shape of data significantly affect algorithm error
  In a high signal regime (high scale, high epsilon), simpler data-independent methods such as IDENTITY work well
  In a low signal regime (low scale, low epsilon), data-dependent algorithms should be considered, but they come with no guarantees
  While no algorithm universally dominates across settings, DAWA is a competitive choice on most datasets

27 Programming Assignment and Competition: Laplace mechanism for Range queries Required: Implement the baseline IDENTITY histogram algorithm Evaluate accuracy for random set of range queries Optional: Optimizations and enhancement Competition

28 Outline
Tabular data and histogram/range queries
Algorithms for low dimensional data
  Baseline
  Partitioning algorithms: kd tree, quad tree, ...
  An evaluation framework: DPBench
Algorithms for high dimensional data
  Copula functions [LXJ14]
  Bayesian networks [ZCPSX14]

29 Traditional Approaches
Parametric methods
  Fit the data to a distribution, make inferences about parameters
  e.g. PrivacyOnTheMap
Non-parametric methods
  Learn empirical distribution through histograms
  e.g. PSD, Privelet, FP, P-HP
(Figure labels: original data, histogram, perturbation, synthetic data)

30 Semi-parametric methods: semi-parametric modeling using copula functions
Haoran Li, Li Xiong, Xiaoqian Jiang. Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions, EDBT 2014

31 Gaussian copula: models the dependence with arbitrary margins Gaussian distribution: models the joint distribution

32 DPCopula: from original data set to DP synthetic data set
Original data set:
  Age  Hours/week  Income
  42   64          30K
  31   82          60K
  28   40          20K
  43   36          80K
Step 1: Computing DP marginal histograms (Age, Hours/week, Income)
Step 2: Computing DP correlation matrix through DP MLE (Maximum Likelihood Estimation); this gives the DP dependence structure
Step 3: Sampling DP synthetic data (Age, Hours/week, Income)
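
The sampling step (Step 3) can be sketched for two attributes as follows. This is a rough illustration under strong assumptions, not the DPCopula implementation: the noisy marginal counts and the correlation `rho` are hypothetical inputs assumed to have been privatized already in Steps 1 and 2, and all names are made up for the example.

```python
import bisect
import math
import random
from statistics import NormalDist

def cumulative(marginal):
    """Cumulative distribution from (value, noisy_count) pairs."""
    counts = [max(c, 0.0) for _, c in marginal]  # clip negative noisy counts
    total = sum(counts)
    cum, acc = [], 0.0
    for c in counts:
        acc += c / total
        cum.append(acc)
    return cum

def sample_gaussian_copula(marg_x, marg_y, rho, n, rng):
    """Draw n synthetic (x, y) pairs with Gaussian dependence and given margins."""
    cx, cy = cumulative(marg_x), cumulative(marg_y)
    std = NormalDist()
    out = []
    for _ in range(n):
        # Correlated standard normals carry the dependence structure.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        # Map to uniforms, then through each marginal's inverse CDF.
        u1, u2 = std.cdf(z1), std.cdf(z2)
        ix = min(bisect.bisect_left(cx, u1), len(cx) - 1)
        iy = min(bisect.bisect_left(cy, u2), len(cy) - 1)
        out.append((marg_x[ix][0], marg_y[iy][0]))
    return out

# Hypothetical noisy marginal histograms: (bucket value, noisy count).
marg_age = [(25, 30.4), (35, 41.2), (45, 28.9)]
marg_hours = [(20, 18.7), (40, 60.1), (60, 21.5)]
synthetic = sample_gaussian_copula(marg_age, marg_hours, 0.9, 2000, random.Random(3))
```

Sampling touches only the already-privatized marginals and correlation, so it consumes no additional privacy budget.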

33 Overview
Data set with attributes Age, Hours/week, Income, Gender
Partition the records by Gender (Gender = F, Gender = M)
Run DPCopula on the (Age, Hours/week, Income) records of each partition
The size of each synthetic partition is a noisy count: $\tilde{n} = n + \mathrm{Lap}(1/\varepsilon)$

34 Datasets US Census data: 4 attributes, 100,000 records Brazil data: 8 attributes, 188,846 records Synthetic data Comparison: PSD, Privelet+, FP, P-HP Metrics: Random range-count queries with random query predicates covering all attributes Relative error: Absolute error:

35 Query accuracy vs. differential privacy budget

36 Gaussian dependence assumption Pair-wise attribute correlation does not scale with high dimensions Works well for continuous data or attributes with large domains

37 Tabular data and histogram/range queries
Algorithms for low dimensional data
  Baseline
  Partitioning algorithms: kd tree, quad tree, ...
  An evaluation framework: DPBench
Algorithms for high dimensional data
  Copula functions [LXJ14]
  Bayesian networks [ZCPSX14]

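38 (Figure: two routes from the sensitive database D to a synthetic database)
Direct route: sensitive database D -> convert -> full-dim tuple distribution -> + noise -> noisy distribution -> sample -> synthetic database
Approximate route: approximate the full-dim tuple distribution with a set of low-dim distributions -> + noise -> noisy low-dim distributions -> sample -> synthetic database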

39 Bayesian network example
(Figure: Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls. Each node carries a conditional probability table: P(B), P(E), P(A | B, E), P(J | A), P(M | A).)

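40 A 5-dimensional database with attributes age, workclass, education, title, income
(Figure: Bayesian network annotated with Pr[age], Pr[work | age], Pr[edu | age], Pr[title | work], Pr[income | work])

41 A 5-dimensional database with attributes age, workclass, income, education, title
$\Pr[\ast] \approx \Pr[\text{age}] \cdot \Pr[\text{work} \mid \text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{title} \mid \text{work}] \cdot \Pr[\text{income} \mid \text{work}]$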

42 STEP 1: Choose a suitable Bayesian network N
  Must be done in a differentially private way
STEP 2: Compute conditional distributions implied by N
  Straightforward to do under differential privacy: inject noise (Laplace mechanism)
STEP 3: Generate synthetic data by sampling from N
  Post-processing: no privacy issues
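
Steps 2 and 3 can be sketched for a single parent-child edge as below. This is a simplified illustration and not the PrivBayes procedure: the function names and toy records are assumptions, noise is added to joint counts under an add/remove-one-record notion of neighbors (sensitivity 1), and `epsilon` here is the share of the budget assigned to this one edge.

```python
import random
from collections import Counter

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditional(rows, parent, child, parent_vals, child_vals, epsilon, rng):
    """Estimate Pr[child | parent] from Laplace-noised joint counts."""
    joint = Counter((r[parent], r[child]) for r in rows)
    cond = {}
    for p in parent_vals:
        # Noise every cell, including empty ones, then clip negatives to zero.
        noisy = [max(joint[(p, c)] + laplace(1.0 / epsilon, rng), 0.0)
                 for c in child_vals]
        total = sum(noisy)
        # Fall back to uniform if every noisy count was clipped to zero.
        cond[p] = [x / total if total > 0 else 1.0 / len(child_vals) for x in noisy]
    return cond

def sample_child(cond, parent_value, child_vals, rng):
    """Step 3: sample a child value given the parent value (post-processing)."""
    return rng.choices(child_vals, weights=cond[parent_value], k=1)[0]

rng = random.Random(7)
rows = ([{"work": "private", "income": "low"}] * 40
        + [{"work": "private", "income": "high"}] * 10
        + [{"work": "gov", "income": "low"}] * 20
        + [{"work": "gov", "income": "high"}] * 30)
cond = noisy_conditional(rows, "work", "income",
                         ["private", "gov"], ["low", "high"], 1.0, rng)
value = sample_child(cond, "private", ["low", "high"], rng)
```

Clipping and normalizing the noisy counts is post-processing, so the privacy cost comes only from the Laplace step.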

43 Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu 68]. It is a DAG of maximum in-degree 1 that maximizes the sum of mutual information I of its edges. It is found by computing the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).
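
The non-private construction can be sketched as a Prim-style maximum spanning tree over pairwise mutual information. This is a minimal sketch: the function names and the toy binary table are assumptions, and attributes are referred to by column index (0 = A, 1 = B, 2 = C, 3 = D).

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(rows)
    pij = Counter((r[i], r[j]) for r in rows)
    pi = Counter(r[i] for r in rows)
    pj = Counter(r[j] for r in rows)
    return sum((c / n) * math.log2(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_tree(rows, n_attrs, root=0):
    """Grow a maximum spanning tree from `root`; returns (parent, child) edges."""
    mi = {frozenset(p): mutual_information(rows, *p)
          for p in combinations(range(n_attrs), 2)}
    in_tree, edges = {root}, []
    while len(in_tree) < n_attrs:
        # Among edges leaving the current tree, take the one with the largest I.
        parent, child = max(
            ((u, v) for u in in_tree for v in range(n_attrs) if v not in in_tree),
            key=lambda e: mi[frozenset(e)],
        )
        edges.append((parent, child))
        in_tree.add(child)
    return edges

# Toy table: B copies A, C agrees with A on 6 of 8 rows, D is independent of the rest.
rows = [
    (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0), (0, 0, 1, 1),
    (1, 1, 1, 0), (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1),
]
edges = chow_liu_tree(rows, 4)
```

Directing every edge away from the root gives each non-root attribute exactly one parent, which is the in-degree-1 property.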

45 Build a 1-degree BN for a database with attributes A, B, C, D (records: Alan, Bob, Cykie, David, Eric, Frank, George, Helen, Ivan, Jack)

46 Start from a random attribute, here A (nodes: A, B, C, D)

47 Select next tree edge by its mutual information
candidates: A→B, A→C, A→D

48 Select next tree edge by its mutual information
candidates: A→B, A→C, A→D
mutual information values shown: I = 1, I = 0.4, I = 0

49 Select next tree edge by its mutual information (nodes: A, B, C, D)

50 Select next tree edge by its mutual information
candidates: A→C, A→D, B→C, B→D
mutual information values shown: I = 0, I = 0.4, I = 0.2, I = 0

51 Select next tree edge by its mutual information
DONE! (nodes: A, B, C, D)

52 Do it under Differential Privacy! (Non-private) select the edge with maximum I (Private) I is data-sensitive -> the best edge is also data-sensitive

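53 Exponential mechanism for edge selection
Databases D, edges e
Define a quality function $q(D, e) \in \mathbb{R}$: how good edge e is as the result of selection, given database D
Return e with probability
$\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{q(D, e)}{\Delta_q}\right)$
where $q(D, e)$ carries the information, $\Delta_q$ sets the amount of noise, and $\Delta_q = \max_{D, D', e} \lvert q(D, e) - q(D', e) \rvert$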
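
The selection rule can be sketched generically as below. This is a minimal illustration: the candidate labels, their scores, and the sensitivity value 0.05 are placeholders chosen for the example, not values derived in the lecture (a real deployment needs a proven bound on the sensitivity of the quality function).

```python
import math
import random
from collections import Counter

def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng):
    """Sample a candidate with probability proportional to
    exp(epsilon * quality(c) / (2 * sensitivity))."""
    scores = [quality(c) for c in candidates]
    top = max(scores)  # shifting scores leaves the probabilities unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical candidate edges with made-up mutual information scores.
scores = {"e1": 0.0, "e2": 0.4, "e3": 0.2, "e4": 0.0}
rng = random.Random(0)
picks = Counter(
    exponential_mechanism(list(scores), scores.get, 1.0, 0.05, rng)
    for _ in range(1000)
)
```

The best edge is chosen most often but not always; smaller ε or larger sensitivity flattens the distribution toward uniform.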

54 STEP 1: Choose a suitable Bayesian network N
  Must be done in a differentially private way
STEP 2: Compute conditional distributions implied by N
  Straightforward to do under differential privacy: inject noise (Laplace mechanism)
STEP 3: Generate synthetic data by sampling from N
  Post-processing: no privacy issues

55 Outline
Tabular data and histogram/range queries
Algorithms for low dimensional data
  Baseline
  Partitioning algorithms: kd tree, quad tree, ...
  Transformation: Wavelet, Fourier Transform, ...
Algorithms for high dimensional data
  Copula functions
  Bayesian networks

56 Open questions
High dimensional data
Robust and private algorithm selection
Error bounds for data-dependent algorithms

57 References
[ACC12] Aćs et al. Differentially private histogram publishing through lossy compression. In ICDM.
[BBDS12] Blocki et al. The Johnson-Lindenstrauss transform itself preserves differential privacy. In FOCS.
[BCDKMT07] Barak et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS.
[BLR08] Blum et al. A learning theory approach to noninteractive database privacy. In STOC.
[DNRR15] Dwork et al. Pure Differential Privacy for Rectangle Queries via Private Partitions. In ASIACRYPT.
[CPSSY12] Cormode et al. Differentially Private Spatial Decompositions. In ICDE.
[GAHRW14] Gaboardi et al. Dual Query: Practical Private Query Release for High Dimensional Data. In ICML.
[HLM12] Hardt et al. A simple and practical algorithm for differentially private data release. In NIPS.
[HMMCZ16] Hay et al. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD.
[HRMS10] Hay et al. Boosting the accuracy of differentially private histograms through consistency. In PVLDB.
[LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB.
[LHRMM10] Li et al. Optimizing linear counting queries under differential privacy. In PODS.
[LM12] Li et al. An adaptive mechanism for accurate query answering under differential privacy. In PVLDB.
[LM13] Li et al. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT.
[LZWY11] Li et al. Compressive mechanism: utilizing sparse representation in differential privacy. In WPES.
[QYL13] Qardaji et al. Understanding hierarchical methods for differentially private histograms. In PVLDB.
[QYL13] Qardaji et al. Differentially private grids for geospatial data. In ICDE.
[RN10] Rastogi et al. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD.
[WWLTRD09] Wang et al. Privacy-preserving genomic computation through program specialization. In CCS.
[XWG10] Xiao et al. Differential privacy via wavelet transforms. In ICDE.
[ZCPSX14] Zhang et al. PrivBayes: private data release via Bayesian networks. In SIGMOD.
[ZXX16] Zhang et al. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD.
