Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University
1 Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy Xiaokui Xiao Nanyang Technological University
2 Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy Conclusion and open problems
3 Privacy Preserving Data Publishing (Diagram: contributors --data--> curator --data--> recipient.) Each contributor provides data about herself. The curator collects the data and releases it in a certain form. The recipient uses the released data for analysis.
4 Example: Census Data Release (Diagram: individuals --data--> census bureau --data--> general public.) Each contributor provides data about herself. The curator collects the data and releases it in a certain form. The recipient uses the released data for analysis.
5 Example: Medical Data Release (Diagram: patients --data--> hospital --data--> medical researcher.) Each contributor provides data about herself. The curator collects the data and releases it in a certain form. The recipient uses the released data for analysis.
6 Privacy Preserving Data Publishing Objectives: The privacy of the contributors is protected, and the recipient gets useful data.
7 Why is this important? Many types of research rely on the availability of private data Demographic research Medical research Social network studies Web search studies
8 Why is it a difficult problem? Intuition: There are only 7 billion people on earth, and 7 billion < 2^33. Theoretically speaking, we need only 33 bits of information to pinpoint an individual.
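The 33-bit figure is just an information-theoretic back-of-the-envelope count, which a couple of lines of Python can confirm:

```python
import math

# 2^32 < 7 billion < 2^33, so 33 bits suffice to give every
# person on earth a unique identifier.
population = 7_000_000_000
bits_needed = math.ceil(math.log2(population))
print(bits_needed)  # 33
```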
9 Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy Conclusions and open problems
10 Privacy Breach: The MGIC Case Time: mid-1990s. Curator: Massachusetts Group Insurance Commission (MGIC). Data released: anonymized medical records. Intention: facilitate medical research. Name Birth Date Gender ZIP Disease Alice 1960/01/01 F flu Bob 1965/02/02 M dyspepsia Cathy 1970/03/03 F pneumonia David 1975/04/04 M gastritis (medical records)
11 Privacy Breach: The MGIC Case Time: mid-1990s. Curator: Massachusetts Group Insurance Commission (MGIC). Data released: anonymized medical records. Intention: facilitate medical research. The adversary matches the voter registration list (Name Birth Date Gender ZIP: Alice 1960/01/01 F, Bob 1965/02/02 M, Cathy 1970/03/03 F, David 1975/04/04 M) against the medical records (Birth Date Gender ZIP Disease: 1960/01/01 F flu, 1965/02/02 M dyspepsia, 1970/03/03 F pneumonia, 1975/04/04 M gastritis).
12 Privacy Breach: The AOL Case Time: 2006. Curator: America Online. Data released: anonymized search log. Intention: facilitate research on web search. Log record: < User ID, Query, > Example: < , UQ, >
13 Privacy Breach: The AOL Case Log record: < User ID, Query, > Example: < , UQ, > Attacker: New York Times. Method: Find all log entries for the AOL user: many queries for businesses and services in Lilburn, GA (population 11K), and a number of queries for different persons with the last name Arnold. Lilburn has 14 people with the last name Arnold. The New York Times contacted them and found that the AOL user is Thelma Arnold.
14 Privacy Breach: The DNA case Time: reported in 2005. Curator: A sperm bank. Data released: A sperm donor's date of birth and birthplace. Result: The donor's offspring (a boy) was able to identify his biological father. How? By exploiting the information on the Y-chromosome.
15 Y-chromosome vs. Surname (Diagram: family trees in which the Y chromosome is passed from father to son along with the surname.) There is a strong correlation between Y-chromosomes and surnames.
16 Y-chromosome vs. Surname The 15-year-old boy purchased the service from a company that had collected DNA samples from 45,000 individuals and provided a Y-chromosome matching service. Result: There were two (relatively) close matches, with almost identical surnames. The boy thus learned the possible surnames of his biological father.
17 Privacy Breach: The DNA case The boy knew the possible last names of his biological father, as well as his date of birth and birthplace. He paid another company to retrieve the names of persons born in that place on that date. Only one person had a matching surname. Summary: The sperm bank indirectly revealed the identity of the donor by disclosing his date of birth and birthplace.
18 Lessons Learned Any information released by the data curator can potentially be exploited by the adversary In the MGIC case: genders, birth dates, ZIP codes In the AOL case: keywords in search queries In the DNA case: date of birth, birthplace Solution? Do not release the exact information from the original data
19 Privacy Preserving Data Publishing (Diagram: contributors --data--> curator --modified data--> recipient.) Publish a modified version of the data, such that the contributors' privacy is adequately protected, and the published data is useful for its intended purpose (at least to some degree).
20 Privacy Preserving Data Publishing Two issues. Privacy principle: what do we mean by adequately protected privacy? Modification method: how should we modify the data to ensure privacy while maximizing utility?
21 Existing Solutions The earliest solutions date back to the 1960s. Solutions before 2000: mostly without a formal privacy model, evaluating privacy based on empirical studies only. This talk focuses on solutions with formal privacy models (developed after 2000): k-anonymity, l-diversity, differential privacy.
22 Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity [Sweeney 2002] l-diversity differential privacy Conclusions and open problems
23 k-anonymity: Example Suppose that we want to publish the medical records below Name Age ZIP Disease Andy flu Bob dyspepsia Cathy pneumonia Diane gastritis
24 k-anonymity: Example Suppose that we want to publish the medical records below. We know that eliminating names is not enough, because an adversary may identify patients by Age and ZIP. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease flu dyspepsia pneumonia gastritis (medical records)
25 k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients. How? Make Age and ZIP less specific in the medical records. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease flu dyspepsia pneumonia gastritis (medical records)
26 k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease flu dyspepsia pneumonia gastritis (medical records)
27 k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients. Generalization: Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (medical records)
28 k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-anonymous table)
29 k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-anonymous table)
30 k-anonymity: General Approach Identify the attributes that the adversary may know, referred to as quasi-identifiers (QI). Divide the tuples in the table into groups of size at least k. Generalize the QI values of each group to make them identical. Group 1, group 2: Age ZIP Disease flu dyspepsia pneumonia gastritis (medical records; Age and ZIP are the QI)
31 k-anonymity: General Approach Identify the attributes that the adversary may know, referred to as quasi-identifiers (QI). Divide the tuples in the table into groups of size at least k. Generalize the QI values of each group to make them identical. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-anonymous table)
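The grouping-and-generalization recipe above can be sketched in a few lines of Python. The ages and ZIP codes below are illustrative stand-ins (the slide's concrete values are not reproduced in this transcription), and the naive fixed-size grouping, which assumes the table size is a multiple of k, is only one of many possible partitioning strategies:

```python
def k_anonymize(records, k):
    """Toy k-anonymizer (assumes len(records) is a multiple of k):
    sort by the quasi-identifiers, cut into groups of size k, and
    replace each group's QI values with the group's min-max range."""
    rows = sorted(records, key=lambda r: (r["Age"], r["ZIP"]))
    anonymized = []
    for i in range(0, len(rows), k):
        group = rows[i:i + k]
        age_lo, age_hi = group[0]["Age"], group[-1]["Age"]
        zips = [r["ZIP"] for r in group]
        for r in group:
            anonymized.append({
                "Age": (age_lo, age_hi),
                "ZIP": (min(zips), max(zips)),
                "Disease": r["Disease"],
            })
    return anonymized

records = [
    {"Age": 23, "ZIP": 12000, "Disease": "flu"},
    {"Age": 27, "ZIP": 14000, "Disease": "dyspepsia"},
    {"Age": 42, "ZIP": 33000, "Disease": "pneumonia"},
    {"Age": 48, "ZIP": 35000, "Disease": "gastritis"},
]
for row in k_anonymize(records, k=2):
    print(row)
```

Each printed row now carries a range for Age and ZIP shared by at least k=2 records, mirroring the 2-anonymous table on the slide.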
32 k-anonymity: Algorithms Numerous algorithms for k-anonymity have been proposed. Objective: achieve k-anonymity with the least amount of generalization. This line of research became obsolete. Reason: k-anonymity was found to be vulnerable [Machanavajjhala et al. 2006]. QI Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-anonymous table)
33 k-anonymity: Vulnerability k-anonymity requires that each combination of quasi-identifiers (QI) is hidden in a group of size at least k, but it says nothing about the remaining attributes. Result: Disclosure of sensitive attributes is possible. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-anonymous table; Disease is the sensitive attribute)
34 k-anonymity: Vulnerability k-anonymity requires that each combination of quasi-identifiers (QI) is hidden in a group of size at least k, but it says nothing about the remaining attributes. Result: Disclosure of sensitive attributes is possible. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] dyspepsia [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-anonymous table; Disease is the sensitive attribute)
35 k-anonymity: Vulnerability Intuition: Hiding in a group of k is not sufficient; the group should have a diverse set of sensitive values. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] dyspepsia [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-anonymous table; Disease is the sensitive attribute)
36 Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity [Machanavajjhala et al. 2006] differential privacy Conclusions and open problems
37 l-diversity [Machanavajjhala et al. 2006] Approach: (similar to k-anonymity) Divide tuples into groups, and make the QI of each group identical Requirement: (different from k-anonymity) Each group has at least l well-represented sensitive values Several definitions of well-represented exist Simplest one: in each group, no sensitive value is associated with more than 1/l of the tuples Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-diverse table
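The simplest criterion above is mechanical to check. A small sketch, with group contents mirroring the example tables (each inner list holds one QI group's sensitive values):

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Simplest l-diversity check: within every QI group, no single
    sensitive value is carried by more than 1/l of the tuples."""
    for group in groups:
        counts = Counter(group)
        if max(counts.values()) > len(group) / l:
            return False
    return True

diverse = [["flu", "dyspepsia"], ["pneumonia", "gastritis"]]
skewed = [["dyspepsia", "dyspepsia"], ["pneumonia", "gastritis"]]
print(is_l_diverse(diverse, l=2))  # True
print(is_l_diverse(skewed, l=2))   # False: one group is all dyspepsia
```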
38 l-diversity [Machanavajjhala et al. 2006] Rationale: The 1/l association in the generalized table leads to 1/l confidence for the adversary. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-diverse table)
39 l-diversity: Follow-up Research Algorithms: achieve l-diversity with the least amount of generalization. Patches: identify vulnerabilities of l-diversity, and propose an improved privacy notion. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-diverse table)
40 l-diversity: Vulnerability Suppose that the adversary wants to find out the disease of Bob. The adversary knows that Bob is unlikely to have breast cancer, so he knows that Bob is likely to have dyspepsia. Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] breast cancer [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-diverse table)
41 l-diversity: Vulnerability Intuition: It is not sufficient to impose constraints on the diversity of sensitive values in each group; we also need to take into account the adversary's background knowledge (e.g., males are unlikely to have breast cancer). Name Age ZIP Andy Bob Cathy Diane (adversary's knowledge) Age ZIP Disease [20,30] [10000,20000] breast cancer [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis (2-diverse table)
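The effect of background knowledge can be quantified with a one-line Bayes update. The 0.001 likelihood below is an illustrative number, not a medical statistic:

```python
# Bob sits in a 2-diverse group whose sensitive values are
# {breast cancer, dyspepsia}, so without background knowledge the
# adversary's confidence in either disease is 1/2. Adding the prior
# belief that a male has breast cancer with probability ~0.001
# (illustrative) shifts almost all posterior mass onto dyspepsia.
prior = {"breast cancer": 0.5, "dyspepsia": 0.5}
likelihood = {"breast cancer": 0.001, "dyspepsia": 1.0}  # P(plausible | Bob is male)

unnorm = {d: prior[d] * likelihood[d] for d in prior}
total = sum(unnorm.values())
posterior = {d: p / total for d, p in unnorm.items()}
print(posterior)  # dyspepsia gets ~0.999 of the mass
```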
42 l-diversity: Vulnerability Intuition: It is not sufficient to impose constraints on the diversity of sensitive values in each group; we also need to take into account the adversary's background knowledge (e.g., males are unlikely to have breast cancer). Follow-up research on three issues: How to express the background knowledge of the adversary? How to derive background knowledge? How to generalize data to protect privacy against background knowledge? This led to paper after paper.
43 Algorithm-Based Attacks Algorithms designed for l-diversity (and its improvements) are often vulnerable to algorithm-based attacks. Intuition: Those algorithms always try to use the least amount of generalization to achieve l-diversity, so an adversary can exploit the characteristics of the algorithms to reverse-engineer the generalized tables.
44 l-diversity: Summary l-diversity and its follow-up approaches address the weakness of k-anonymity and tackle much more advanced adversary models. Problems with this line of research: it does not converge to a final adversary model, and most proposed methods are vulnerable to algorithm-based attacks.
45 Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy [Dwork 2006] Conclusions and open problems
46 Differential Privacy [Dwork 2006] A privacy principle proposed by theoreticians. More difficult to understand than k-anonymity and l-diversity. It became well-adopted because its privacy model is generally considered strong enough, and its definition naturally takes into account algorithm-based attacks.
47 Differential Privacy: Intuition Suppose that we have a dataset D that contains the medical record of every individual in Australia, and suppose that Alice is in the dataset. Intuitively, is it OK to publish the following information? (a) Whether Alice has diabetes. (b) The total number of diabetes patients in D. Why is it OK to publish the latter but not the former? Intuition: The former completely depends on Alice; the latter does not depend much on Alice.
48 Differential Privacy: Intuition In general, we should only publish information that does not highly depend on any particular individual This motivates the definition of differential privacy
49 Differential Privacy: Definition (Diagram: data --> randomized algorithm A --> modified data --> recipient.) Neighboring datasets: two datasets D and D', such that D' can be obtained by changing one single tuple in D. A randomized algorithm A satisfies ε-differential privacy, iff for any two neighboring datasets D and D' and for any output O of A, Pr[A(D) = O] <= exp(ε) · Pr[A(D') = O]. Rationale: The output of the algorithm does not highly depend on any particular tuple in the input.
50 Differential Privacy: Definition Illustration of ε-differential privacy: exp(-ε) <= Pr[A(D) = O] / Pr[A(D') = O] <= exp(ε).
51 Comparison with k-anonymity and l-diversity Differential privacy does not directly model the adversary's knowledge, but its privacy protection is generally considered strong enough.
52 Comparison with k-anonymity and l-diversity Differential privacy is more general: there is no restriction on the type of the output O. It can be a table, a set of frequent itemsets, a regression model, etc.
53 The Differential Privacy Landscape This leads to a lot of research interests from various communities Database and data mining (SIGMOD, VLDB, ICDE, KDD, ) Security (CCS, ) Machine learning (NIPS, ICML, ) Systems (SIGCOMM, OSDI, ) Theory (STOC, FOCS, )
54 Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy General approaches for achieving differential privacy Research issues Conclusions and open problems
55 Achieving Differential Privacy: Example Suppose that we have a set D of medical records We want to release the number of diabetes patients in D (say, 1000) How to do it in a differentially private manner?
56 Achieving Differential Privacy: Example Naïve solution: Release 1000 directly. But this violates differential privacy: a deterministic release gives Pr[A(D) = 1000] = 1 but Pr[A(D') = 1000] = 0 for a neighboring dataset D' with a different count, so the bound Pr[A(D) = O] <= exp(ε) · Pr[A(D') = O] does not hold. Better solution: Add noise to 1000 before releasing it. Intuition: the noise could achieve the requirements of differential privacy. Question: what kind of noise should we add?
57 Laplace Distribution Density: f(x) = exp(-|x - μ| / b) / (2b). When x increases or decreases by 1, f(x) changes by a factor of exp(1/b). Variance: 2b². b is referred to as the scale.
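The "changes by a factor of exp(1/b)" property is exactly what makes Laplace noise fit the multiplicative guarantee of differential privacy. A quick numerical check of the density formula:

```python
import math

def laplace_pdf(x, mu=0.0, b=1.0):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

# Shifting x by 1 (on one side of the mean) changes the density by
# exactly a factor of exp(1/b); in general the factor is at most exp(1/b).
b = 2.0
ratio = laplace_pdf(3.0, b=b) / laplace_pdf(4.0, b=b)
print(ratio, math.exp(1 / b))  # the two values agree
```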
58 Achieving Differential Privacy: Example Add Laplace noise to the number of diabetes patients in D before releasing it. Changing one tuple in D shifts the mean of the Laplace distribution by 1. The two output distributions therefore have a bounded ratio, Pr[A(D) = O] / Pr[A(D') = O] <= exp(ε), so differential privacy is satisfied. (Figure: two overlapping Laplace densities centered at neighboring values of the number of diabetes patients.)
59 The Laplace Mechanism In general, if we want to release a set of values (e.g., counts) from a dataset, we add i.i.d. Laplace noise to each value to achieve differential privacy. This general approach is called the Laplace mechanism. Figuring out the correct amount of noise to use can itself be a research issue.
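A stdlib-only sketch of the Laplace mechanism for a single count (sensitivity 1, so scale 1/ε), using the fact that a Laplace variate is the difference of two independent exponential variates:

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) sample: difference of two Exponential(1) draws, scaled."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def noisy_count(true_count, epsilon):
    """Laplace mechanism for one count: sensitivity 1, so add Lap(1/epsilon)."""
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(noisy_count(1000, epsilon=0.1))  # 1000 plus noise with std ~14
```

Smaller ε means stronger privacy and larger noise: the scale 1/ε, and hence the standard deviation sqrt(2)/ε, grows as ε shrinks.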
60 Histogram Suppose that we want to release a histogram on Age from a dataset D. Using the Laplace mechanism, we add i.i.d. Laplace noise to the histogram counts. How much noise? Previous example: Lap(λ) noise leads to (1/λ)-differential privacy. This case: Lap(2λ) noise leads to (1/λ)-differential privacy. Rationale: Changing one tuple may change two counts simultaneously in the histogram, so we need twice the noise to conceal such changes.
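A sketch of the histogram case, again with stdlib Laplace sampling. The scale is doubled relative to the single-count case because one tuple can touch two buckets; this assumes the "change one tuple" notion of neighboring datasets used earlier, and the bucket counts are illustrative:

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) sample: difference of two Exponential(1) draws, scaled."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def noisy_histogram(counts, epsilon):
    """Changing one tuple can move a person from one bucket to another,
    altering two counts by 1 each; sensitivity 2, so add Lap(2/epsilon)."""
    scale = 2.0 / epsilon
    return [c + laplace_noise(scale) for c in counts]

random.seed(1)
print(noisy_histogram([120, 340, 95, 60], epsilon=0.5))
```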
61 Histogram In general, if the values to be published are obtained from a complex task (e.g., regression), it can be much more challenging to derive the correct amount of noise to use.
62 Optimization of Accuracy A more common research issue: choosing a good strategy to publish data. Example: Histogram vs. histogram + binary tree. The latter is good for range queries: a range query is answered by taking the sum of O(log n) noisy counts, whereas the former may require summing O(n) noisy counts. Note: the latter requires log n times the noise required by the former, but it pays off.
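The histogram-plus-binary-tree strategy can be sketched as a segment tree of noisy subtotals. The per-node scale below follows the standard analysis (each tuple touches one node per level, so the sensitivity of the whole tree is the number of levels), and the power-of-two leaf count is an assumption made for brevity:

```python
import math
import random

def laplace_noise(scale):
    """Laplace(0, scale) sample: difference of two Exponential(1) draws, scaled."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def build_tree(counts, epsilon):
    """Noisy binary tree over counts (length assumed to be a power of two).
    A tuple affects one node per level, so the sensitivity equals the number
    of levels and every node gets Lap(levels / epsilon) noise."""
    levels = int(math.log2(len(counts))) + 1
    scale = levels / epsilon
    tree = [list(counts)]                    # level 0: the leaves
    while len(tree[-1]) > 1:
        prev = tree[-1]
        tree.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return [[c + laplace_noise(scale) for c in level] for level in tree]

def range_query(tree, lo, hi):
    """Sum over leaves [lo, hi) using O(log n) maximal tree nodes."""
    total, level = 0.0, 0
    while lo < hi:
        if lo % 2 == 1:
            total += tree[level][lo]
            lo += 1
        if hi % 2 == 1:
            hi -= 1
            total += tree[level][hi]
        lo //= 2
        hi //= 2
        level += 1
    return total

random.seed(7)
counts = [5, 8, 2, 9, 4, 7, 1, 6]           # per-bucket counts
tree = build_tree(counts, epsilon=100.0)     # large epsilon: noise is tiny
print(range_query(tree, 1, 6))               # true answer is 8+2+9+4+7 = 30
```

The query over 5 buckets touches only 3 tree nodes here, which is where the accuracy gain over summing 5 independently noised leaf counts comes from.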
63 Choosing a Good Strategy In general, choosing a good strategy requires exploiting the characteristics of the input data the output results the way that users may use the output results Most differential privacy papers focus on this issue
64 Other Research Issues General approaches beyond the Laplace mechanism E.g., the exponential mechanism [McSherry et al. 2007], which is suitable for problems with non-numeric outputs
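A minimal sketch of the exponential mechanism, assuming its standard form, which samples output r with probability proportional to exp(ε·u(r)/(2Δu)) for a utility function u with sensitivity Δu; the disease-frequency task is an illustrative toy:

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity=1.0):
    """Pick a (possibly non-numeric) output: sample candidate r with
    probability proportional to exp(epsilon * u(r) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * utility(r) / (2 * sensitivity)) for r in candidates]
    x = random.random() * sum(weights)
    for r, w in zip(candidates, weights):
        x -= w
        if x <= 0:
            return r
    return candidates[-1]

# Toy task: privately report the most common disease in a table.
data = ["flu"] * 50 + ["dyspepsia"] * 30 + ["gastritis"] * 20
random.seed(3)
choice = exponential_mechanism(
    candidates=["flu", "dyspepsia", "gastritis"],
    utility=data.count,            # utility = frequency; sensitivity 1
    epsilon=1.0,
)
print(choice)  # almost surely "flu", the highest-utility candidate
```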
65 Conclusion An overview of existing solutions for privacy preserving data publishing k-anonymity l-diversity differential privacy General methods for differential privacy, and related research issues
66 Open Problems Differentially private algorithms for complex data/tasks, e.g., graph queries, principal component analysis (PCA), trajectories. Main challenge: Difficult to identify a good strategy.
67 Open Problems Differential privacy might be too strong. It requires that changing one tuple should not bring much change to the published result. Alternative interpretation: Even if an adversary knows n - 1 tuples in the input data, he won't be able to infer information about the remaining tuple. But an adversary knowing n - 1 individuals is often impossible in practice. How should we relax differential privacy?
68 Open Problems How to choose an appropriate ε for ε-differential privacy? We need a way to quantify the cost of privacy and the gain of utility in releasing data.
69 Open Problems What do we do to protect genome data? Challenges The data is highly complex The queries are highly complex Definition of privacy is unclear
More informationPrivacy Preserving Machine Learning: A Theoretically Sound App
Privacy Preserving Machine Learning: A Theoretically Sound Approach Outline 1 2 3 4 5 6 Privacy Leakage Events AOL search data leak: New York Times journalist was able to identify users from the anonymous
More informationCS573 Data Privacy and Security. Differential Privacy tabular data and range queries. Li Xiong
CS573 Data Privacy and Security Differential Privacy tabular data and range queries Li Xiong Outline Tabular data and histogram/range queries Algorithms for low dimensional data Algorithms for high dimensional
More informationwith BLENDER: Enabling Local Search a Hybrid Differential Privacy Model
BLENDER: Enabling Local Search with a Hybrid Differential Privacy Model Brendan Avent 1, Aleksandra Korolova 1, David Zeber 2, Torgeir Hovden 2, Benjamin Livshits 3 University of Southern California 1
More informationImplementation of Privacy Mechanism using Curve Fitting Method for Data Publishing in Health Care Domain
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 5, May 2014, pg.1105
More informationComposition Attacks and Auxiliary Information in Data Privacy
Composition Attacks and Auxiliary Information in Data Privacy Srivatsava Ranjit Ganta Pennsylvania State University University Park, PA 1682 ranjit@cse.psu.edu Shiva Prasad Kasiviswanathan Pennsylvania
More informationL-Diversity Algorithm for Incremental Data Release
Appl. ath. Inf. Sci. 7, No. 5, 2055-2060 (203) 2055 Applied athematics & Information Sciences An International Journal http://dx.doi.org/0.2785/amis/070546 L-Diversity Algorithm for Incremental Data Release
More informationSIMPLE AND EFFECTIVE METHOD FOR SELECTING QUASI-IDENTIFIER
31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. SIMPLE AND EFFECTIVE METHOD FOR SELECTING QUASI-IDENTIFIER 1 AMANI MAHAGOUB OMER, 2 MOHD MURTADHA BIN MOHAMAD 1 Faculty of Computing,
More informationMaintaining K-Anonymity against Incremental Updates
Maintaining K-Anonymity against Incremental Updates Jian Pei Jian Xu Zhibin Wang Wei Wang Ke Wang Simon Fraser University, Canada, {jpei, wang}@cs.sfu.ca Fudan University, China, {xujian, 55, weiwang}@fudan.edu.cn
More informationPreserving Privacy during Big Data Publishing using K-Anonymity Model A Survey
ISSN No. 0976-5697 Volume 8, No. 5, May-June 2017 International Journal of Advanced Research in Computer Science SURVEY REPORT Available Online at www.ijarcs.info Preserving Privacy during Big Data Publishing
More informationBOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen
BOOLEAN MATRIX FACTORIZATIONS with applications in data mining Pauli Miettinen MATRIX FACTORIZATIONS BOOLEAN MATRIX FACTORIZATIONS o THE BOOLEAN MATRIX PRODUCT As normal matrix product, but with addition
More informationSlicing Technique For Privacy Preserving Data Publishing
Slicing Technique For Privacy Preserving Data Publishing D. Mohanapriya #1, Dr. T.Meyyappan M.Sc., MBA. M.Phil., Ph.d., 2 # Department of Computer Science and Engineering, Alagappa University, Karaikudi,
More informationAn Iterative Approach to Examining the Effectiveness of Data Sanitization
An Iterative Approach to Examining the Effectiveness of Data Sanitization By ANHAD PREET SINGH B.Tech. (Punjabi University) 2007 M.S. (University of California, Davis) 2012 DISSERTATION Submitted in partial
More informationAn Efficient Clustering Method for k-anonymization
An Efficient Clustering Method for -Anonymization Jun-Lin Lin Department of Information Management Yuan Ze University Chung-Li, Taiwan jun@saturn.yzu.edu.tw Meng-Cheng Wei Department of Information Management
More informationAlgorithmic Approaches to Preventing Overfitting in Adaptive Data Analysis. Part 1 Aaron Roth
Algorithmic Approaches to Preventing Overfitting in Adaptive Data Analysis Part 1 Aaron Roth The 2015 ImageNet competition An image classification competition during a heated war for deep learning talent
More informationDifferential Privacy. Cynthia Dwork. Mamadou H. Diallo
Differential Privacy Cynthia Dwork Mamadou H. Diallo 1 Focus Overview Privacy preservation in statistical databases Goal: to enable the user to learn properties of the population as a whole, while protecting
More informationMichelle Hayes Mary Joel Holin. Michael Roanhouse Julie Hovden. Special Thanks To. Disclaimer
Further Understanding the Intersection of Technology and Privacy to Ensure and Protect Client Data Special Thanks To Michelle Hayes Mary Joel Holin We can provably know where domestic violence shelter
More informationOptimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching
Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Tiancheng Li Ninghui Li CERIAS and Department of Computer Science, Purdue University 250 N. University Street, West
More informationCrowd-Blending Privacy
Crowd-Blending Privacy Johannes Gehrke, Michael Hay, Edward Lui, and Rafael Pass Department of Computer Science, Cornell University {johannes,mhay,luied,rafael}@cs.cornell.edu Abstract. We introduce a
More informationDifferentially Private H-Tree
GeoPrivacy: 2 nd Workshop on Privacy in Geographic Information Collection and Analysis Differentially Private H-Tree Hien To, Liyue Fan, Cyrus Shahabi Integrated Media System Center University of Southern
More informationClustering-based Multidimensional Sequence Data Anonymization
Clustering-based Multidimensional Sequence Data Anonymization Morvarid Sehatar University of Ottawa Ottawa, ON, Canada msehatar@uottawa.ca Stan Matwin 1 Dalhousie University Halifax, NS, Canada 2 Institute
More informationPrajapati Het I., Patel Shivani M., Prof. Ketan J. Sarvakar IT Department, U. V. Patel college of Engineering Ganapat University, Gujarat
Security and Privacy with Perturbation Based Encryption Technique in Big Data Prajapati Het I., Patel Shivani M., Prof. Ketan J. Sarvakar IT Department, U. V. Patel college of Engineering Ganapat University,
More informationAnonymizing Sequential Releases
Anonymizing Sequential Releases Ke Wang School of Computing Science Simon Fraser University Canada V5A 1S6 wangk@cs.sfu.ca Benjamin C. M. Fung School of Computing Science Simon Fraser University Canada
More informationApproaches to distributed privacy protecting data mining
Approaches to distributed privacy protecting data mining Bartosz Przydatek CMU Approaches to distributed privacy protecting data mining p.1/11 Introduction Data Mining and Privacy Protection conflicting
More informationPrivacy-Preserving Data Publishing: A Survey of Recent Developments
Privacy-Preserving Data Publishing: A Survey of Recent Developments BENJAMIN C. M. FUNG Concordia University, Montreal KE WANG Simon Fraser University, Burnaby RUI CHEN Concordia University, Montreal 14
More informationService-Oriented Architecture for Privacy-Preserving Data Mashup
Service-Oriented Architecture for Privacy-Preserving Data Mashup Thomas Trojer a Benjamin C. M. Fung b Patrick C. K. Hung c a Quality Engineering, Institute of Computer Science, University of Innsbruck,
More informationPRACTICAL K-ANONYMITY ON LARGE DATASETS. Benjamin Podgursky. Thesis. Submitted to the Faculty of the. Graduate School of Vanderbilt University
PRACTICAL K-ANONYMITY ON LARGE DATASETS By Benjamin Podgursky Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of
More informationPrivacy Preserved Data Publishing Techniques for Tabular Data
Privacy Preserved Data Publishing Techniques for Tabular Data Keerthy C. College of Engineering Trivandrum Sabitha S. College of Engineering Trivandrum ABSTRACT Almost all countries have imposed strict
More informationA Case Study: Privacy Preserving Release of Spa9o- temporal Density in Paris
A Case Study: Privacy Preserving Release of Spa9o- temporal Density in Paris Gergely Acs (INRIA) gergely.acs@inria.fr!! Claude Castelluccia (INRIA) claude.castelluccia@inria.fr! Outline 2! Dataset descrip9on!
More informationMaintaining K-Anonymity against Incremental Updates
Maintaining K-Anonymity against Incremental Updates Jian Pei 1 Jian Xu 2 Zhibin Wang 2 Wei Wang 2 Ke Wang 1 1 Simon Fraser University, Canada, {jpei, wang}@cs.sfu.ca 2 Fudan University, China, {xujian,
More informationPrivate Database Synthesis for Outsourced System Evaluation
Private Database Synthesis for Outsourced System Evaluation Vani Gupta 1, Gerome Miklau 1, and Neoklis Polyzotis 2 1 Dept. of Computer Science, University of Massachusetts, Amherst, MA, USA 2 Dept. of
More informationPrivacy-Enhancing Technologies & Applications to ehealth. Dr. Anja Lehmann IBM Research Zurich
Privacy-Enhancing Technologies & Applications to ehealth Dr. Anja Lehmann IBM Research Zurich IBM Research Zurich IBM Research founded in 1945 employees: 3,000 12 research labs on six continents IBM Research
More informationDistributed Data Anonymization with Hiding Sensitive Node Labels
Distributed Data Anonymization with Hiding Sensitive Node Labels C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan University,Trichy
More informationPrivacy Challenges in Big Data and Industry 4.0
Privacy Challenges in Big Data and Industry 4.0 Jiannong Cao Internet & Mobile Computing Lab Department of Computing Hong Kong Polytechnic University Email: csjcao@comp.polyu.edu.hk http://www.comp.polyu.edu.hk/~csjcao/
More informationComparative Analysis of Anonymization Techniques
International Journal of Electronic and Electrical Engineering. ISSN 0974-2174 Volume 7, Number 8 (2014), pp. 773-778 International Research Publication House http://www.irphouse.com Comparative Analysis
More informationVPriv: Protecting Privacy in Location- Based Vehicular Services
VPriv: Protecting Privacy in Location- Based Vehicular Services Raluca Ada Popa and Hari Balakrishnan Computer Science and Artificial Intelligence Laboratory, M.I.T. Andrew Blumberg Department of Mathematics
More informationAmbiguity: Hide the Presence of Individuals and Their Privacy with Low Information Loss
: Hide the Presence of Individuals and Their Privacy with Low Information Loss Hui (Wendy) Wang Department of Computer Science Stevens Institute of Technology Hoboken, NJ, USA hwang@cs.stevens.edu Abstract
More informationPrivacy, Security & Ethical Issues
Privacy, Security & Ethical Issues How do we mine data when we can t even look at it? 2 Individual Privacy Nobody should know more about any entity after the data mining than they did before Approaches:
More informationDistributed Private Data Collection at Scale
Distributed Private Data Collection at Scale Graham Cormode g.cormode@warwick.ac.uk Tejas Kulkarni (Warwick) Divesh Srivastava (AT&T) 1 Big data, big problem? The big data meme has taken root Organizations
More informationPreserving Privacy in High-Dimensional Data Publishing
Preserving Privacy in High-Dimensional Data Publishing Khalil Al-Hussaeni A Thesis in The Department of Electrical and Computer Engineering Presented in Partial Fulfillment of the Requirements for the
More informationFMC: An Approach for Privacy Preserving OLAP
FMC: An Approach for Privacy Preserving OLAP Ming Hua, Shouzhi Zhang, Wei Wang, Haofeng Zhou, Baile Shi Fudan University, China {minghua, shouzhi_zhang, weiwang, haofzhou, bshi}@fudan.edu.cn Abstract.
More informationAlpha Anonymization in Social Networks using the Lossy-Join Approach
TRANSACTIONS ON DATA PRIVACY 11 (2018) 1 22 Alpha Anonymization in Social Networks using the Lossy-Join Kiran Baktha*, B K Tripathy** * Department of Electronics and Communication Engineering, VIT University,
More informationDistributed Data Mining with Differential Privacy
Distributed Data Mining with Differential Privacy Ning Zhang, Ming Li, Wenjing Lou Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, MA Email: {ning, mingli}@wpi.edu,
More informationAdding Differential Privacy in an Open Board Discussion Board System
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 5-26-2017 Adding Differential Privacy in an Open Board Discussion Board System Pragya Rana San
More informationResearch Paper SECURED UTILITY ENHANCEMENT IN MINING USING GENETIC ALGORITHM
Research Paper SECURED UTILITY ENHANCEMENT IN MINING USING GENETIC ALGORITHM 1 Dr.G.Kirubhakar and 2 Dr.C.Venkatesh Address for Correspondence 1 Department of Computer Science and Engineering, Surya Engineering
More informationDifferentially-Private Network Trace Analysis. Frank McSherry and Ratul Mahajan Microsoft Research
Differentially-Private Network Trace Analysis Frank McSherry and Ratul Mahajan Microsoft Research Overview. 1 Overview Question: Is it possible to conduct network trace analyses in a way that provides
More informationSecure Multi-party Computation Protocols For Collaborative Data Publishing With m-privacy
Secure Multi-party Computation Protocols For Collaborative Data Publishing With m-privacy K. Prathyusha 1 M.Tech Student, CSE, KMMITS, JNTU-A, TIRUPATHI,AP Sakshi Siva Ramakrishna 2 Assistant proffesor,
More informationAttacks on Privacy and definetti s Theorem
Attacks on Privacy and definetti s Theorem Daniel Kifer Penn State University ABSTRACT In this paper we present a method for reasoning about privacy using the concepts of exchangeability and definetti
More informationIncognito: Efficient Full Domain K Anonymity
Incognito: Efficient Full Domain K Anonymity Kristen LeFevre David J. DeWitt Raghu Ramakrishnan University of Wisconsin Madison 1210 West Dayton St. Madison, WI 53706 Talk Prepared By Parul Halwe(05305002)
More informationPrivately Solving Linear Programs
Privately Solving Linear Programs Justin Hsu 1 Aaron Roth 1 Tim Roughgarden 2 Jonathan Ullman 3 1 University of Pennsylvania 2 Stanford University 3 Harvard University July 8th, 2014 A motivating example
More informationPartition Based Perturbation for Privacy Preserving Distributed Data Mining
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation
More informationA FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING
A FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING 1 B.KARTHIKEYAN, 2 G.MANIKANDAN, 3 V.VAITHIYANATHAN 1 Assistant Professor, School of Computing, SASTRA University, TamilNadu, India. 2 Assistant
More information