Differential Privacy
Seminar: Robust Techniques
Thomas Edlich
Technische Universität München, Department of Informatics
kdd.in.tum.de
July 16, 2017
Outline
1. Introduction
2. Definition and Features of Differential Privacy
3. Techniques
4. Practical Issues and Limitations
5. Differential Privacy in Machine Learning & Data Mining
Introduction
Privacy Protection
Anonymization: removal of identifying attributes such as names or social security numbers. Often considered enough to protect privacy.
However, linkage attacks remain possible. Netflix Prize Dataset [9]:

Netflix Ratings
Movie  A  B  C
1      5  3  1
2      1  5  3
3      3  1  5

Linkage attack with IMDb Ratings
Movie  A  B  C
1      9  5  2
2      1  10 6
3      4  2  9

Reidentification of medical records using publicly available voting records [12]
Privacy Protection
What is privacy protection?
"Nothing about an individual should be learnable from the database that cannot be learned without access to the database." [2]
This goal is provably impossible to achieve if the privacy mechanism provides any useful information. Reason: auxiliary information. [3]
Definition and Features of Differential Privacy
Differential Privacy
A mathematical definition of privacy which bounds the privacy risk for any participant in a database.
Makes it possible to learn properties of a population while protecting the privacy of individuals.
Formalizing Differential Privacy [11]

ε-differential Privacy
An algorithm $A_{\mathrm{priv}}$ with $A_{\mathrm{priv}}(D) \in T$ provides ε-differential privacy if
$$\Pr[A_{\mathrm{priv}}(D) \in S] \leq e^{\varepsilon} \Pr[A_{\mathrm{priv}}(D') \in S] \qquad (1)$$
for all $S \subseteq T$ and all datasets $D, D'$ differing in only a single entry.

(ε, δ)-differential Privacy
An algorithm $A_{\mathrm{priv}}$ with $A_{\mathrm{priv}}(D) \in T$ provides (ε, δ)-differential privacy if
$$\Pr[A_{\mathrm{priv}}(D) \in S] \leq e^{\varepsilon} \Pr[A_{\mathrm{priv}}(D') \in S] + \delta \qquad (2)$$
for all $S \subseteq T$ and all datasets $D, D'$ differing in only a single entry.
Resilience to Arbitrary Auxiliary Information [4]
Differential privacy provides plausible deniability to each participant, since the same outcome could have been produced from a dataset that does not contain their data.
The definition is independent of available side information.
Furthermore: differential privacy holds regardless of what auxiliary information is available now or will become available in the future.
Postprocessing [4]
Let A be an (ε, δ)-differentially private mechanism and f an arbitrary mapping. Then the composition f ∘ A is (ε, δ)-differentially private.
Composition [11]
Let $A^1_{\mathrm{priv}}$ and $A^2_{\mathrm{priv}}$ be algorithms with privacy guarantees $\varepsilon_1$ and $\varepsilon_2$. Then applying both algorithms to the same data has a privacy risk of at most $\varepsilon_1 + \varepsilon_2$.
Example: answering two queries, each with ε = 0.5, costs at most ε = 1 in total.
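A toy sketch of how one might track the cumulative privacy cost under this sequential composition rule (the class name and budget values are illustrative, not part of the slides):

```python
class PrivacyAccountant:
    """Tracks the total epsilon spent under sequential composition."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        # Sequential composition: total privacy cost is the sum of the epsilons.
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


# Usage: two queries with eps = 0.5 each cost at most eps = 1.0 in total.
acct = PrivacyAccountant(budget=1.0)
acct.charge(0.5)   # first query
acct.charge(0.5)   # second query
print(acct.spent)  # 1.0
```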
Techniques
Techniques
Approaches:
- Input Perturbation
- Output Perturbation
- Algorithm Perturbation
Input Perturbation [11]
- Add noise directly to the database D.
- The perturbed dataset can then be published and guarantees differential privacy for any subsequent algorithm.
- Example: Randomized Response
Input Perturbation [4] [13]
Randomized Response
Question: Have you ever committed a crime?
Randomization process:
1. Flip a coin.
2. If tails: answer truthfully.
3. If heads: flip a second coin.
   - tails: say "no".
   - heads: say "yes".
Plausible deniability for the individual, yet the true distribution can still be estimated.
Let p be the fraction of people who have committed a crime and y the fraction of people who answered "yes":
$$E[y] = 0.5\,p + 0.25\,p + 0.25\,(1-p) = 0.5\,p + 0.25 \;\Rightarrow\; \hat{p} = 2y - 0.5$$
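A minimal sketch of this randomized response scheme (assumes NumPy is available; the function names and the example population are illustrative):

```python
import numpy as np


def randomized_response(truth: np.ndarray, rng=None) -> np.ndarray:
    """Coin-flip scheme: with prob. 1/2 report the truth,
    otherwise report a uniformly random yes/no answer."""
    rng = rng or np.random.default_rng()
    report_truth = rng.random(truth.shape) < 0.5    # first coin: tails -> truthful
    random_answer = rng.random(truth.shape) < 0.5   # second coin: heads -> "yes"
    return np.where(report_truth, truth, random_answer)


def estimate_p(responses: np.ndarray) -> float:
    """Invert E[y] = 0.5*p + 0.25 to recover an estimate of p."""
    y = responses.mean()
    return 2 * y - 0.5


# Usage: 10,000 people, 30% of whom have committed a crime.
rng = np.random.default_rng(0)
truth = rng.random(10_000) < 0.3
responses = randomized_response(truth, rng)
print(estimate_p(responses))  # close to 0.3
```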
Input Perturbation: Pros & Cons
Pro:
- Results can be reproduced.
- Privacy does not depend on a specific algorithm.
Contra:
- Determining the amount of noise needed, and therefore ε, is not trivial.
- Privacy guarantees might be worse than for algorithm-specific techniques.
Output Perturbation [4] [7] [11]
- Add noise to the results of $A_{\mathrm{nonpriv}}$.
- Only publish the perturbed results.
- Destroy the original data.
Output Perturbation

ℓ1-sensitivity
The maximum difference of the function over all pairs of databases D and D′ differing in a single record:
$$S(A) = \max_{D, D'} \lVert A(D) - A(D') \rVert_1 \qquad (3)$$

Laplace Mechanism
Given an algorithm $A_{\mathrm{nonpriv}}: \mathcal{D} \to \mathbb{R}^k$, the Laplace mechanism adds Laplacian noise to the result of $A_{\mathrm{nonpriv}}$ [7]:
$$A_{\mathrm{priv}}(x, \varepsilon) = A_{\mathrm{nonpriv}}(x) + (Z_1, \ldots, Z_k) \qquad (4)$$
where the $Z_i$ are i.i.d. random variables with $Z_i \sim \mathrm{Lap}\!\left(\tfrac{S(A_{\mathrm{nonpriv}})}{\varepsilon}\right)$.
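A minimal sketch of the Laplace mechanism applied to a counting query, whose sensitivity is 1 (the function name and example data are illustrative assumptions):

```python
import numpy as np


def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng=None) -> float:
    """Release true_value plus Laplace noise of scale sensitivity/epsilon,
    which gives epsilon-differential privacy for this query."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)


# Usage: counting query "how many records have age > 40?"
ages = np.array([23, 45, 31, 67, 52, 38, 41])
true_count = float((ages > 40).sum())          # adding/removing one record changes the count by at most 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(true_count, noisy_count)
```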
Output Perturbation: Pros & Cons
Pro:
- Better privacy guarantees than input perturbation.
- Easier to add noise and to control the privacy level.
Contra:
- Results cannot be reproduced.
Exponential Mechanism [7] [11]
Sometimes adding noise to the input or output is not possible.
Example [4]: an item for sale, with buyers willing to pay A: $1.00, B: $1.00, C: $3.01.
- Best price: $3.01, revenue: $3.01
- 2nd best price: $1.00, revenue: $3.00
- Revenue for price $3.02: $0
- Revenue for price $1.01: $1.01
A tiny perturbation of the output (the price) can destroy the utility (the revenue), so simply adding noise does not work.
Exponential Mechanism [7]
Construct a utility measure q over the dataset D and all possible outputs k:
$$q(D, k) = u, \quad u \in \mathbb{R} \qquad (5)$$
The sensitivity of q is:
$$S(q) = \max_{k, D, D'} \lvert q(D, k) - q(D', k) \rvert \qquad (6)$$
The exponential mechanism picks a random value for k with distribution:
$$p(k) \propto \exp\!\left(\frac{\varepsilon\, q(D, k)}{2 S(q)}\right) \qquad (7)$$
Therefore the exponential mechanism is biased towards values of k with higher utility.
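A minimal sketch of the exponential mechanism applied to the pricing example above (the candidate price grid, the revenue utility, and the loose sensitivity bound are illustrative assumptions):

```python
import numpy as np


def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=None):
    """Pick one candidate k with probability proportional to
    exp(epsilon * utility(k) / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    scores = np.array([utility(k) for k in candidates], dtype=float)
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    weights = np.exp(epsilon * (scores - scores.max()) / (2 * sensitivity))
    probs = weights / weights.sum()
    return rng.choice(candidates, p=probs)


# Usage: buyers are willing to pay these amounts; the utility of a price is
# the resulting revenue. Removing one buyer changes the revenue by at most
# that buyer's bid, so the maximum bid is used as a (loose) sensitivity bound.
bids = [1.00, 1.00, 3.01]
revenue = lambda price: price * sum(b >= price for b in bids)
prices = np.arange(0.50, 4.01, 0.01)  # candidate prices
chosen = exponential_mechanism(prices, revenue, epsilon=1.0, sensitivity=3.01)
print(round(float(chosen), 2))        # high-revenue prices are chosen more often
```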
Exponential Mechanism: Pros & Cons
Pro:
- Biased towards the more useful values.
Contra:
- Computationally expensive.
- Requires modification of existing algorithms.
Practical Issues and Limitations
Practical Issues and Limitations
- Some solutions for achieving differential privacy still rely on technical assumptions about the data (e.g. discrete vs. continuous data).
- How to choose ε and δ? Rule of thumb: $\delta \ll \frac{1}{|D|}$.
- The lower, the better: but what is low enough?
- Trade-off between privacy and utility (see the plot sketch below).
Figure: Lap(1/ε) for different ε
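A small sketch that reproduces such a plot (assumes matplotlib is available; the chosen ε values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Density of Lap(b): f(x) = exp(-|x| / b) / (2b), with scale b = 1 / epsilon.
x = np.linspace(-10, 10, 1000)
for epsilon in (0.1, 0.5, 1.0, 2.0):
    b = 1.0 / epsilon
    density = np.exp(-np.abs(x) / b) / (2 * b)
    plt.plot(x, density, label=f"eps = {epsilon}")

plt.xlabel("added noise")
plt.ylabel("density")
plt.legend()
plt.show()  # smaller epsilon -> wider noise -> more privacy, less utility
```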
Differential Privacy in Machine Learning & Data Mining
Differential Privacy in Machine Learning & Data Mining
How to achieve differential privacy in machine learning and data mining:
- Input and output perturbation: these techniques let us use standard machine learning methods and achieve differential privacy at the same time.
- Algorithm perturbation: this requires the machine learning technique itself to be modified.
Differentially Private Graph Clustering [8]
Motivation
- Suppose there exists a graph consisting of the users of a social network and their relationships.
- Consider detecting a connection between two users a privacy violation.
- Publishing the original graph as a whole is a clear privacy breach.
- Even if only the community structure of the graph is revealed, an attacker might be able to infer the existence (or non-existence) of edges between nodes.
Differentially Private Graph Clustering [8]
Graph Perturbation (PIG) [8]
- PIG: Privacy-Integrated Graph clustering.
- Guarantees edge-differential privacy.
- Perturbs the input graph, so privacy is guaranteed independently of the clustering algorithm.
Differentially Private Graph Clustering [8]

Algorithm 1: Graph Perturbation Algorithm PIG
function PerturbGraph(adjacency matrix A, privacy parameter s)
    for all a_ij ∈ A with i < j do
        if preservation is chosen (with probability 1 − s) then
            continue
        else  (the value of a_ij is randomized)
            if 0 is chosen (with probability 1/2) then
                a_ij ← a_ji ← 0
            else
                a_ij ← a_ji ← 1
            end if
        end if
    end for
    return A
end function
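A minimal NumPy sketch of this edge-perturbation step (assumes an undirected graph given as a symmetric 0/1 adjacency matrix; the function name and example graph are illustrative):

```python
import numpy as np


def perturb_graph(A: np.ndarray, s: float, rng=None) -> np.ndarray:
    """PIG-style edge perturbation: each upper-triangle entry is kept with
    probability 1 - s, otherwise replaced by a fair coin flip (0 or 1)."""
    rng = rng or np.random.default_rng()
    A = A.copy()
    n = A.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < s:                        # randomization branch
                A[i, j] = A[j, i] = rng.integers(0, 2)  # 0 or 1, prob. 1/2 each
    return A


# Usage: s = 0.2 corresponds to eps = ln(2/0.2 - 1) = ln(9) ≈ 2.2 (see next slide).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(perturb_graph(A, s=0.2, rng=np.random.default_rng(0)))
```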
Differentially Private Graph Clustering [8]
Evaluation
It can be proven that a PIG-perturbed graph guarantees edge-differential privacy for $\varepsilon \geq \ln\!\left(\frac{2}{s} - 1\right)$.
Figure: Clustering quality using the algorithm SCAN. [8]
Other DM/ML Techniques Using Differential Privacy [8]
- Community Detection (Input Perturbation, Algorithm Perturbation) [10]
- Deep Learning (Algorithm Perturbation) [1]
- Decision Trees (Output Perturbation) [6]
Questions?
References I
[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 308–318, New York, NY, USA, 2016. ACM.
[2] T. Dalenius. Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 15:429–444, 1977.
[3] C. Dwork. Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), volume 4052, pages 1–12, Venice, Italy, July 2006. Springer Verlag.
References II
[4] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[5] A. Gupta, K. Ligett, F. McSherry, A. Roth, and K. Talwar. Differentially private approximation algorithms. Aug. 2009.
[6] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A practical differentially private random decision tree classifier. In 2009 IEEE International Conference on Data Mining Workshops, pages 114–121, Dec. 2009.
[7] Z. Ji, Z. C. Lipton, and C. Elkan. Differential privacy and machine learning: a survey and review. ArXiv e-prints, Dec. 2014.
References III
[8] Y. Mülle, C. Clifton, and K. Böhm. Privacy-integrated graph clustering through differential privacy. In EDBT/ICDT Workshops, 2015.
[9] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, SP '08, pages 111–125, Washington, DC, USA, 2008. IEEE Computer Society.
[10] H. H. Nguyen, A. Imine, and M. Rusinowitch. Detecting communities under differential privacy. CoRR, abs/1607.02060, 2016.
[11] A. D. Sarwate and K. Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE Signal Processing Magazine, 30(5):86–94, Sept. 2013.
References IV
[12] L. Sweeney. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, Oct. 2002.
[13] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.