Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy Xiaokui Xiao Nanyang Technological University
Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy Conclusion and open problems
Privacy Preserving Data Publishing data data curator recipient contributors Each contributor: provide data about herself Curator: collects data and releases them in a certain form Recipient: uses the released data for analysis
Example: Census Data Release data data individuals census bureau general public Each contributor: provide data about herself Curator: collects data and releases them in a certain form Recipient: uses the released data for analysis
Example: Medical Data Release data data patients hospital medical researcher Each contributor: provide data about herself Curator: collects data and releases them in a certain form Recipient: uses the released data for analysis
Privacy Preserving Data Publishing data data curator recipient contributors Objectives: The privacy of the contributors is protected The recipient gets useful data
Why is this important? Many types of research rely on the availability of private data Demographic research Medical research Social network studies Web search studies
Why is it a difficult problem? Intuition: There are only 7 billion people on earth, and 7 billion ≈ 2^33 Theoretically speaking, we need only 33 bits of information to pinpoint an individual
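The 33-bit figure follows from taking a base-2 logarithm; a quick check:

```python
import math

# Sanity-check the "33 bits" claim: the number of bits needed to give
# every person on earth a distinct identifier.
world_population = 7_000_000_000
bits_needed = math.ceil(math.log2(world_population))
print(bits_needed)  # 33
```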
Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy Conclusions and open problems
Privacy Breach: The MGIC Case Time: mid-1990s Curator: Massachusetts Group Insurance Commission (MGIC) Data released: anonymized medical records Intention: facilitate medical research Name Birth Date Gender ZIP Disease Alice 1960/01/01 F 10000 flu Bob 1965/02/02 M 20000 dyspepsia Cathy 1970/03/03 F 30000 pneumonia David 1975/04/04 M 40000 gastritis Medical Records
Privacy Breach: The MGIC Case Time: mid-1990s Curator: Massachusetts Group Insurance Commission (MGIC) Data released: anonymized medical records Intention: facilitate medical research match Name Birth Date Gender ZIP Alice 1960/01/01 F 10000 Bob 1965/02/02 M 20000 Cathy 1970/03/03 F 30000 David 1975/04/04 M 40000 Voter Registration List Birth Date Gender ZIP Disease 1960/01/01 F 10000 flu 1965/02/02 M 20000 dyspepsia 1970/03/03 F 30000 pneumonia 1975/04/04 M 40000 gastritis Medical Records
Privacy Breach: The AOL Case Time: 2006 Curator: America Online (AOL) Data released: anonymized search log Intention: facilitate research on web search Log record: < User ID, Query, > Example: < 4417749, UQ, >
Privacy Breach: The AOL Case Log record: < User ID, Query, > Example: < 4417749, UQ, > Attacker: New York Times Method: Find all log entries for AOL user 4417749 Many queries for businesses and services in Lilburn, GA (population 11K) A number of queries for different persons with the last name Arnold Lilburn has 14 people with the last name Arnold The New York Times contacted them and found that AOL User 4417749 is Thelma Arnold
Privacy Breach: The DNA case Time: reported in 2005 Curator: A sperm bank Data released: A sperm donor's date of birth and birthplace Result: The donor offspring (a boy) was able to identify his biological father How? By exploiting the information on the Y-chromosome
Y-chromosome vs. Surname [diagram: the Y-chromosome, like the surname, is passed from father (XY) to son (XY) but not to daughters (XX)] There is a strong correlation between Y-chromosomes and surnames
Y-chromosome vs. Surname The 15-year-old boy purchased the service from a company that had collected DNA samples from 45,000 individuals and provided a Y-chromosome matching service Result: There were two (relatively) close matches, with almost identical surnames The boy thus learned the possible surnames of his biological father
Privacy Breach: The DNA case The boy knew The possible last names of his biological father, as well as his date of birth and birthplace He paid another company to retrieve the names of persons born in that place on that date Only one person had a matching surname Summary: The sperm bank indirectly revealed the identity of the donor, by disclosing his date of birth and birthplace
Lessons Learned Any information released by the data curator can potentially be exploited by the adversary In the MGIC case: genders, birth dates, ZIP codes In the AOL case: keywords in search queries In the DNA case: date of birth, birthplace Solution? Do not release the exact information from the original data
Privacy Preserving Data Publishing data modified data curator recipient contributors Publish a modified version of the data, such that the contributors' privacy is adequately protected the published data is useful for its intended purpose (at least to some degree)
Privacy Preserving Data Publishing data modified data curator recipient contributors Two issues privacy principle: what do we mean by adequately protected privacy? modification method: how should we modify the data to ensure privacy while maximizing utility?
Existing Solutions The earliest solutions date back to the 1960s Solutions before 2000 Mostly without a formal privacy model Evaluated privacy based on empirical studies only This talk will focus on solutions with formal privacy models (developed after 2000) k-anonymity l-diversity differential privacy
Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity [Sweeney 2002] l-diversity differential privacy Conclusions and open problems
k-anonymity: Example Suppose that we want to publish the medical records below Name Age ZIP Disease Andy 20 10000 flu Bob 30 20000 dyspepsia Cathy 40 30000 pneumonia Diane 50 40000 gastritis
k-anonymity: Example Suppose that we want to publish the medical records below We know that eliminating names is not enough, because an adversary may identify patients by Age and ZIP Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease 20 10000 flu 30 20000 dyspepsia 40 30000 pneumonia 50 40000 gastritis medical records
k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients How? Make Age and ZIP less specific in the medical records Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease 20 10000 flu 30 20000 dyspepsia 40 30000 pneumonia 50 40000 gastritis medical records
k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients generalization Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis medical records
k-anonymity: Example k-anonymity [Sweeney 2002] requires that each (Age, ZIP) combination can be matched to at least k patients Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-anonymous table
k-anonymity: General Approach Identify the attributes that the adversary may know Referred to as Quasi-Identifiers (QI) Divide tuples in the table into groups of sizes at least k Generalize the QI values of each group to make them identical QI group 1 group 2 Age ZIP Disease 20 10000 flu 30 20000 dyspepsia 40 30000 pneumonia 50 40000 gastritis medical records
k-anonymity: General Approach Identify the attributes that the adversary may know Referred to as Quasi-Identifiers (QI) Divide tuples in the table into groups of sizes at least k Generalize the QI values of each group to make them identical QI Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-anonymous table
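A minimal sketch of this general approach in Python. The partitioning here (sort by QI, chunk into groups of at least k, generalize each group's QI values to covering ranges) is purely illustrative; real k-anonymization algorithms choose the partition to minimize information loss.

```python
# Toy k-anonymization: sort records by their quasi-identifiers, partition
# them into groups of at least k, and generalize each group's QI values
# to the covering range. Record layout is illustrative: (age, zip, disease).

def k_anonymize(records, k):
    records = sorted(records)                    # order records by QI (age, zip)
    groups = [records[i:i + k] for i in range(0, len(records), k)]
    if len(groups) > 1 and len(groups[-1]) < k:  # merge an undersized tail group
        groups[-2].extend(groups.pop())
    published = []
    for g in groups:
        ages = [r[0] for r in g]
        zips = [r[1] for r in g]
        for _, _, disease in g:                  # QI generalized, sensitive value kept
            published.append(((min(ages), max(ages)),
                              (min(zips), max(zips)),
                              disease))
    return published

table = [(20, 10000, "flu"), (30, 20000, "dyspepsia"),
         (40, 30000, "pneumonia"), (50, 40000, "gastritis")]
# Each (Age, ZIP) range in the output now matches at least two patients.
print(k_anonymize(table, 2))
```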
k-anonymity: Algorithms Numerous algorithms for k-anonymity have been proposed Objective: achieve k-anonymity with the least amount of generalization This line of research became obsolete Reason: k-anonymity was found to be vulnerable [Machanavajjhala et al. 2006] QI Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-anonymous table
k-anonymity: Vulnerability k-anonymity requires that each combination of quasi-identifiers (QI) is hidden in a group of size at least k But it says nothing about the remaining attributes Result: Disclosure of sensitive attributes is possible QI sensitive Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-anonymous table
k-anonymity: Vulnerability k-anonymity requires that each combination of quasi-identifiers (QI) is hidden in a group of size at least k But it says nothing about the remaining attributes Result: Disclosure of sensitive attributes is possible QI sensitive Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] dyspepsia [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-anonymous table
k-anonymity: Vulnerability Intuition: Hiding in a group of k is not sufficient The group should have a diverse set of sensitive values QI sensitive Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] dyspepsia [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-anonymous table
Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity [Machanavajjhala et al. 2006] differential privacy Conclusions and open problems
l-diversity [Machanavajjhala et al. 2006] Approach: (similar to k-anonymity) Divide tuples into groups, and make the QI of each group identical Requirement: (different from k-anonymity) Each group has at least l "well-represented" sensitive values Several definitions of "well-represented" exist Simplest one: in each group, no sensitive value is associated with more than 1/l of the tuples Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-diverse table
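The simplest (frequency-based) requirement is easy to check mechanically; a sketch, with group contents taken from the example tables:

```python
from collections import Counter

def is_l_diverse(groups, l):
    """groups: one list of sensitive values per QI group.
    Simplest definition of l-diversity: in each group, no sensitive
    value may account for more than 1/l of the tuples."""
    for g in groups:
        if max(Counter(g).values()) > len(g) / l:
            return False
    return True

# The 2-diverse table from the slide: each group has two distinct diseases.
print(is_l_diverse([["flu", "dyspepsia"], ["pneumonia", "gastritis"]], 2))        # True
# The vulnerable 2-anonymous table: one group is entirely "dyspepsia".
print(is_l_diverse([["dyspepsia", "dyspepsia"], ["pneumonia", "gastritis"]], 2))  # False
```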
l-diversity [Machanavajjhala et al. 2006] Rationale: The 1/l association in the generalized table leads to at most 1/l confidence for the adversary Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-diverse table
l-diversity: Follow-up Research Algorithms: achieve l-diversity with the least amount of generalization "Patches": identify vulnerabilities of l-diversity, and propose an improved privacy notion Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] flu [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-diverse table
l-diversity: Vulnerability Suppose that the adversary wants to find out the disease of Bob The adversary knows that Bob is unlikely to have breast cancer So he knows that Bob is likely to have dyspepsia Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] breast cancer [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-diverse table
l-diversity: Vulnerability Intuition: It is not sufficient to impose constraints on the diversity of sensitive values in each group We also need to take into account the adversary's background knowledge (e.g., males are unlikely to have breast cancer) Name Age ZIP Andy 20 10000 Bob 30 20000 Cathy 40 30000 Diane 50 40000 adversary's knowledge Age ZIP Disease [20,30] [10000,20000] breast cancer [20,30] [10000,20000] dyspepsia [40,50] [30000,40000] pneumonia [40,50] [30000,40000] gastritis 2-diverse table
l-diversity: Vulnerability Intuition: It is not sufficient to impose constraints on the diversity of sensitive values in each group We also need to take into account the adversary's background knowledge (e.g., males are unlikely to have breast cancer) Follow-up research on three issues: How to express the background knowledge of the adversary? How to derive background knowledge? How to generalize data to protect privacy against background knowledge? This led to paper after paper
Algorithm-Based Attacks Algorithms designed for l-diversity (and its improvements) are often vulnerable to algorithm-based attacks Intuition: Those algorithms always try to use the least amount of generalization to achieve l-diversity An adversary can utilize the characteristics of the algorithms to do some reverse engineering on the generalized tables
l-diversity: Summary l-diversity and its follow-up approaches address the weakness of k-anonymity tackle much more advanced adversary models Problems with this line of research It does not converge to a final adversary model Most proposed methods are vulnerable to algorithm-based attacks
Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy [Dwork 2006] Conclusions and open problems
Differential Privacy [Dwork 2006] A privacy principle proposed by theoreticians More difficult to understand than k-anonymity and l-diversity Has become well-adopted because Its privacy model is generally considered strong enough Its definition naturally takes into account algorithm-based attacks
Differential Privacy: Intuition Suppose that we have a dataset D that contains the medical record of every individual in Australia Suppose that Alice is in the dataset Intuitively, is it OK to publish the following information? Whether Alice has diabetes The total number of diabetes patients in D Why is it OK to publish the latter but not the former? Intuition: The former completely depends on Alice The latter does not depend much on Alice
Differential Privacy: Intuition In general, we should only publish information that does not highly depend on any particular individual This motivates the definition of differential privacy
Differential Privacy: Definition data randomized algorithm A modified data recipient Neighboring datasets: two datasets D1 and D2, such that D2 can be obtained by changing one single tuple in D1 A randomized algorithm A satisfies ε-differential privacy, iff for any two neighboring datasets D1 and D2 and for any output O of A, Pr[A(D1) = O] ≤ exp(ε) ∙ Pr[A(D2) = O] Rationale: The output of the algorithm does not highly depend on any particular tuple in the input
Differential Privacy: Definition Illustration of ε-differential privacy: exp(−ε) ≤ Pr[A(D1) = O] / Pr[A(D2) = O] ≤ exp(ε)
Comparison with k-anonymity and l-diversity Differential privacy does not directly model the adversary's knowledge But its privacy protection is generally considered strong enough
Comparison with k-anonymity and l-diversity Differential privacy is more general There is no restriction on the type of the output O It can be a table, a set of frequent itemsets, a regression model, etc.
The Differential Privacy Landscape This leads to a lot of research interest from various communities Database and data mining (SIGMOD, VLDB, ICDE, KDD, ...) Security (CCS, ...) Machine learning (NIPS, ICML, ...) Systems (SIGCOMM, OSDI, ...) Theory (STOC, FOCS, ...)
Outline Privacy preserving data publishing: What and Why Examples of privacy attacks Existing solutions k-anonymity l-diversity differential privacy General approaches for achieving differential privacy Research issues Conclusions and open problems
Achieving Differential Privacy: Example Suppose that we have a set D of medical records We want to release the number of diabetes patients in D (say, 1000) How to do it in a differentially private manner?
Achieving Differential Privacy: Example Naïve solution: Release 1000 directly But it violates differential privacy: if a neighboring dataset D2 contains 1001 diabetes patients, then Pr[A(D2) = 1000] = 0 while Pr[A(D1) = 1000] = 1, so Pr[A(D1) = 1000] ≤ exp(ε) ∙ Pr[A(D2) = 1000] does not hold Better solution: Add noise into 1000 before releasing it Intuition: appropriately chosen noise can achieve the requirements of differential privacy Question: what kind of noise should we add?
Laplace Distribution pdf: f(x) = (1 / (2λ)) ∙ exp(−|x| / λ) Increasing/decreasing x by 1 changes f(x) by a factor of exp(1/λ) Variance: 2λ²; λ is referred to as the scale [figure: Laplace densities with scales λ = 1, 2, 4]
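These properties can be verified numerically. The sketch below uses the density above and the standard fact that a Laplace variable with scale λ is the difference of two exponential variables with mean λ:

```python
import math
import random

def laplace_pdf(x, lam):
    """Density f(x) = (1 / (2 * lam)) * exp(-|x| / lam)."""
    return math.exp(-abs(x) / lam) / (2 * lam)

def laplace_sample(lam):
    """Draw from Laplace(0, lam): the difference of two i.i.d. exponential
    variables with mean lam is Laplace-distributed with scale lam."""
    return random.expovariate(1 / lam) - random.expovariate(1 / lam)

# Moving x by 1 away from the mode changes the density by a factor exp(-1/lam):
lam = 2.0
print(laplace_pdf(3, lam) / laplace_pdf(2, lam))  # exp(-1/2) ≈ 0.607
```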
Achieving Differential Privacy: Example Add Laplace noise of scale λ before releasing the number of diabetes patients in D Changing one tuple in D shifts the mean of the Laplace distribution by at most 1 The two output distributions have a ratio bounded by exp(1/λ) at every point, so (1/λ)-differential privacy is satisfied [figure: output distributions centered at 1000 and 1001 (# of diabetes patients), with bounded ratio]
The Laplace Mechanism In general, if we want to release a set of values (e.g., counts) from a dataset, we add i.i.d. Laplace noise to each value to achieve differential privacy This general approach is called the Laplace mechanism Figuring out the correct amount of noise to use can be a research issue
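A sketch of the Laplace mechanism for releasing counts. The function names and the choice of ε are illustrative; the key point is that the noise scale is sensitivity/ε:

```python
import random

def laplace_noise(scale):
    # Laplace(scale) noise as the difference of two exponentials with mean `scale`.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def laplace_mechanism(true_values, sensitivity, epsilon):
    """Release values with i.i.d. Laplace(sensitivity / epsilon) noise.
    `sensitivity` bounds how much the values can change in total when
    one tuple of the input dataset changes."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale) for v in true_values]

# Releasing a single count (sensitivity 1) under 0.1-differential privacy:
print(laplace_mechanism([1000], sensitivity=1, epsilon=0.1))  # a noisy count near 1000
```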
Histogram Suppose that we want to release a histogram on Age from a dataset D Using the Laplace mechanism: we add i.i.d. Laplace noise to the histogram counts How much noise? Previous example: Laplace noise of scale λ leads to (1/λ)-differential privacy This case: Laplace noise of scale λ leads to (2/λ)-differential privacy Rationale: Changing one tuple may change two counts simultaneously in the histogram We need twice the noise scale to conceal such changes
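The histogram case as a self-contained sketch (the bin counts are made up); note the sensitivity of 2:

```python
import random

def laplace_noise(scale):
    # Laplace(scale) noise as the difference of two exponentials with mean `scale`.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

# Noisy age histogram under epsilon-differential privacy. Changing one
# tuple can move it between bins, altering two counts by 1 each, so the
# sensitivity is 2 and each bin gets Laplace(2 / epsilon) noise.
epsilon = 0.1
hist = [12, 45, 30, 8]  # illustrative bin counts
noisy_hist = [c + laplace_noise(2 / epsilon) for c in hist]
print(noisy_hist)
```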
Histogram In general, if the values to be published are obtained from a complex task (e.g., regression) It could be much more challenging to derive the correct amount of noise to use
Optimization of Accuracy A more common research issue: choosing a good strategy to publish data Example: Histogram vs. Histogram + binary tree The latter is good for range queries: a range query is answered by taking the sum of O(log n) noisy counts, whereas the former may require summing O(n) noisy counts Note: the latter requires log n times the noise required by the former, but it pays off
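A sketch of the binary-tree strategy: build noisy counts at every level of a dyadic tree over the histogram, then answer a range query from the coarsest nodes that cover it. It assumes the number of bins is a power of two; the counts are illustrative:

```python
import random

def laplace_noise(scale):
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def build_noisy_tree(counts, scale):
    """Noisy binary tree over a histogram (len(counts) a power of two):
    every node stores the true sum of its range plus Laplace noise."""
    levels, cur = [], list(counts)
    while True:
        levels.append([c + laplace_noise(scale) for c in cur])
        if len(cur) == 1:
            break
        cur = [cur[i] + cur[i + 1] for i in range(0, len(cur), 2)]
    return levels  # levels[0] = leaves, levels[-1] = [root]

def range_query(levels, lo, hi):
    """Sum over bins [lo, hi) using the coarsest nodes that fit,
    touching only O(log n) noisy counts (segment-tree decomposition)."""
    total, level = 0.0, 0
    while lo < hi:
        if lo % 2 == 1:          # lo is a right child: take it, move right
            total += levels[level][lo]
            lo += 1
        if hi % 2 == 1:          # hi - 1 is a left child: take it, move left
            hi -= 1
            total += levels[level][hi]
        lo //= 2
        hi //= 2
        level += 1
    return total

tree = build_noisy_tree([12, 45, 30, 8, 21, 9, 14, 3], scale=1.0)
print(range_query(tree, 1, 6))  # ≈ 45 + 30 + 8 + 21 + 9 = 113, plus noise
```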
Choosing a Good Strategy In general, choosing a good strategy requires exploiting the characteristics of the input data the output results the way that users may use the output results Most differential privacy papers focus on this issue
Other Research Issues General approaches beyond the Laplace mechanism E.g., the exponential mechanism [McSherry et al. 2007], which is suitable for problems with non-numeric outputs
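A sketch of the exponential mechanism under its standard definition: sample a candidate output with probability proportional to exp(ε ∙ score / (2 ∙ sensitivity)), where the sensitivity bounds how much the score can change when one tuple changes. The data and scoring function below are made up:

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity):
    """Pick one candidate with probability proportional to
    exp(epsilon * score(c) / (2 * sensitivity)): high-scoring outputs
    are likely, but no single tuple sways the choice too much."""
    weights = [math.exp(epsilon * score(c) / (2 * sensitivity))
               for c in candidates]
    return random.choices(candidates, weights=weights)[0]

# Hypothetical use: privately pick the most common disease, scoring each
# candidate by its frequency in the data (sensitivity 1: changing one
# tuple changes any frequency by at most 1).
data = ["flu", "flu", "flu", "dyspepsia"]
print(exponential_mechanism(["flu", "dyspepsia"], data.count,
                            epsilon=1.0, sensitivity=1))
```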
Conclusion An overview of existing solutions for privacy preserving data publishing k-anonymity l-diversity differential privacy General methods for differential privacy, and related research issues
Open Problems Differentially private algorithms for complex data/tasks, e.g., Graph queries Principal component analysis (PCA) Trajectories Main challenge: Difficult to identify a good strategy
Open Problems Differential privacy might be too strong It requires that changing one tuple should not bring much change to the published result Alternative interpretation: Even if an adversary knows n − 1 tuples in the input data, he won't be able to infer information about the remaining tuple Knowing n − 1 individuals is often impossible in practice How should we relax differential privacy?
Open Problems How to choose an appropriate ε for ε-differential privacy? Need a way to quantify the cost of privacy and the gain of utility in releasing data
Open Problems What do we do to protect genome data? Challenges The data is highly complex The queries are highly complex Definition of privacy is unclear