K-Anonymity and Other Cluster-Based Methods. Ge Ruan, Oct. 11, 2007


1 K-Anonymity and Other Cluster-Based Methods
Ge Ruan, Oct. 11, 2007

2 Data Publishing and Data Privacy
Society is experiencing exponential growth in the number and variety of data collections containing person-specific information.
This collected information is valuable in both research and business, so data sharing is common.
Publishing the data may put the respondents' privacy at risk.
Objective: maximize data utility while limiting disclosure risk to an acceptable level.

3 Related Works
Statistical Databases
  The most common approach is to add noise while still maintaining some statistical invariant.
  Disadvantage: the noise destroys the integrity of the data.

4 Related Works (Cont'd)
Multi-level Databases
  Data is stored at different security classifications, and users have different security clearances (Denning and Lunt).
  Eliminating precise inference: sensitive information is suppressed, i.e., simply not released (Su and Ozsoyoglu).
  Disadvantages:
    It is impossible to consider every possible attack.
    Many data holders share the same data, but their concerns are different.
    Suppression can drastically reduce the quality of the data.

5 Related Works (Cont'd)
Computer Security
  Access control and authentication ensure that the right people have the right authority over the right object at the right time and place.
  That is not what we want here.
  A general doctrine of data privacy is to release as much information as possible, as long as the identities of the subjects (people) are protected.

6 K-Anonymity
Sweeney proposed a formal protection model named k-anonymity.
What is k-anonymity?
  A release provides k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
  Example: you try to identify a man in a release, but the only information you have is his birth date and gender. If k people in the release match that birth date and gender, you cannot tell which of them he is. This is k-anonymity.
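
The definition above can be checked mechanically. The following is a minimal sketch, assuming a release is a list of dictionaries and that the caller names the quasi-identifier columns; the function name `is_k_anonymous` and the sample rows are illustrative, not taken from the slides.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())

# Hypothetical release: two groups of two records each.
release = [
    {"birth_year": "1976", "sex": "Male", "disease": "Heart Disease"},
    {"birth_year": "1976", "sex": "Male", "disease": "Broken Arm"},
    {"birth_year": "1986", "sex": "Female", "disease": "Hepatitis"},
    {"birth_year": "1986", "sex": "Female", "disease": "Flu"},
]

print(is_k_anonymous(release, ["birth_year", "sex"], 2))
print(is_k_anonymous(release, ["birth_year", "sex"], 3))
```

With these sample rows the release is 2-anonymous but not 3-anonymous, because each (birth year, sex) combination appears exactly twice.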

7 Classification of Attributes
Key Attribute: name, address, cell phone number
  Uniquely identifies an individual directly.
  Always removed before release.
Quasi-Identifier: 5-digit ZIP code, birth date, gender
  A set of attributes that can potentially be linked with external information to re-identify entities.
  87% of the US population can be uniquely identified based on these attributes, according to the Census summary data in 1991.
  Suppressed or generalized before release.

8 Classification of Attributes (Cont'd)
Hospital Patient Data
DOB      Sex     Zipcode  Disease
1/21/76  Male             Heart Disease
4/13/86  Female           Hepatitis
2/28/76  Male             Bronchitis
1/21/76  Male             Broken Arm
4/13/86  Female           Flu
2/28/76  Female           Hang Nail
Voter Registration Data
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male
Beth   1/10/81  Female
Carol  10/1/44  Female
Dan    2/21/84  Male
Ellen  4/19/72  Female
Linking the two tables on the quasi-identifier reveals that Andre has heart disease!
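
The linking attack on this slide can be sketched as a join on the quasi-identifier. This is an illustration under stated assumptions: the Zipcode values did not survive in the tables above, so the placeholders `ZIP-A` through `ZIP-J` are hypothetical and chosen only so that Andre matches exactly one hospital row; the helper name `link` is also illustrative.

```python
QI = ("dob", "sex", "zip")

# Zip values are HYPOTHETICAL placeholders; the slide's values were lost.
hospital = [
    {"dob": "1/21/76", "sex": "Male", "zip": "ZIP-A", "disease": "Heart Disease"},
    {"dob": "4/13/86", "sex": "Female", "zip": "ZIP-B", "disease": "Hepatitis"},
    {"dob": "2/28/76", "sex": "Male", "zip": "ZIP-C", "disease": "Bronchitis"},
    {"dob": "1/21/76", "sex": "Male", "zip": "ZIP-D", "disease": "Broken Arm"},
    {"dob": "4/13/86", "sex": "Female", "zip": "ZIP-E", "disease": "Flu"},
    {"dob": "2/28/76", "sex": "Female", "zip": "ZIP-F", "disease": "Hang Nail"},
]
voters = [
    {"name": "Andre", "dob": "1/21/76", "sex": "Male", "zip": "ZIP-A"},
    {"name": "Beth", "dob": "1/10/81", "sex": "Female", "zip": "ZIP-G"},
    {"name": "Carol", "dob": "10/1/44", "sex": "Female", "zip": "ZIP-H"},
    {"name": "Dan", "dob": "2/21/84", "sex": "Male", "zip": "ZIP-I"},
    {"name": "Ellen", "dob": "4/19/72", "sex": "Female", "zip": "ZIP-J"},
]

def link(released, external, quasi_identifiers):
    """Return (name, disease) pairs where an external record matches exactly one released row."""
    matches = []
    for person in external:
        key = tuple(person[a] for a in quasi_identifiers)
        hits = [r for r in released if tuple(r[a] for a in quasi_identifiers) == key]
        if len(hits) == 1:  # a unique match re-identifies the person
            matches.append((person["name"], hits[0]["disease"]))
    return matches

reidentified = link(hospital, voters, QI)
print(reidentified)
```

Note that removing the name from the hospital table did not help: the combination of birth date, sex, and zip code is enough to single out one row.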

9 Classification of Attributes (Cont'd)
Sensitive Attribute: medical record, wage, etc.
  Always released directly.
  These attributes are what the researchers need.
  Which attributes are sensitive depends on the requirement.

10 K-Anonymity Protection Model
PT: private table
RT, GT1, GT2: released tables
QI: quasi-identifier (Ai, ..., Aj)
(A1, A2, ..., An): attributes
Lemma:

12 Attacks Against K-Anonymity
Unsorted Matching Attack
  This attack is based on the order in which tuples appear in the released table.
  Solution: randomly sort the tuples before releasing.

13 Attacks Against K-Anonymity (Cont'd)
Complementary Release Attack
  Different releases can be linked together to compromise k-anonymity.
  Solution: consider all previously released tables before releasing a new one, and try to avoid linking.
  Other data holders may release data that can be used in this kind of attack.
  In general, this kind of attack is hard to prevent completely.

14 Attacks Against K-Anonymity (Cont'd)
Complementary Release Attack (Cont'd)

15 Attacks Against K-Anonymity (Cont'd)
Complementary Release Attack (Cont'd)

16 Attacks Against K-Anonymity (Cont'd)
Temporal Attack (Cont'd)
  Adding or removing tuples may compromise k-anonymity protection.

17 Attacks Against K-Anonymity (Cont'd)
k-anonymity does not provide privacy if:
  Sensitive values in an equivalence class lack diversity.
  The attacker has background knowledge.
Homogeneity Attack: the attacker knows Bob's Zipcode and Age.
Background Knowledge Attack: the attacker knows Carl's Zipcode and Age.
A 3-anonymous patient table:
Zipcode  Age  Disease
476**    2*   Heart Disease
476**    2*   Heart Disease
476**    2*   Heart Disease
4790*    ≥40  Flu
4790*    ≥40  Heart Disease
4790*    ≥40  Cancer
476**    3*   Heart Disease
476**    3*   Cancer
476**    3*   Cancer
A. Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE 2006.

18 l-diversity
Distinct l-diversity
  Each equivalence class has at least l well-represented sensitive values.
  Limitation: it does not prevent probabilistic inference attacks.
  Example: one equivalence class contains ten tuples. In the Disease column, one of them is Cancer, one is Heart Disease, and the remaining eight are Flu. This satisfies 3-diversity, but the attacker can still conclude that the target person's disease is Flu with 80% accuracy.
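
The ten-tuple example can be reproduced in a few lines. This is a minimal sketch; the function names `is_distinct_l_diverse` and `most_likely_value` are illustrative, and an equivalence class is assumed to be given as a plain list of its sensitive values.

```python
from collections import Counter

def is_distinct_l_diverse(sensitive_values, l):
    """Distinct l-diversity: at least l different sensitive values in the class."""
    return len(set(sensitive_values)) >= l

def most_likely_value(sensitive_values):
    """Return the most frequent value and the attacker's confidence in guessing it."""
    value, count = Counter(sensitive_values).most_common(1)[0]
    return value, count / len(sensitive_values)

# One equivalence class with ten tuples, as in the example.
diseases = ["Cancer", "Heart Disease"] + ["Flu"] * 8

print(is_distinct_l_diverse(diseases, 3))
print(most_likely_value(diseases))
```

The class passes the distinct 3-diversity check, yet guessing Flu is right for eight of the ten records, i.e. with probability 0.8.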

19 l-diversity (Cont'd)
Entropy l-diversity
  Each equivalence class must not only have enough different sensitive values; those values must also be distributed evenly enough.
  Formally, the entropy of the distribution of sensitive values in each equivalence class is at least log(l).
  Sometimes this may be too restrictive: when some values are very common, the entropy of the entire table may be very low.
  This leads to a less conservative notion of l-diversity.
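
As a sketch of the entropy condition, the check below computes the entropy of one equivalence class and compares it with log(l). It assumes natural logarithms on both sides (any base works as long as it is the same on both sides); the function names and the small tolerance `tol` for floating-point comparison are illustrative choices.

```python
import math
from collections import Counter

def entropy(sensitive_values):
    """Shannon entropy (natural log) of the sensitive-value distribution in one class."""
    n = len(sensitive_values)
    counts = Counter(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def is_entropy_l_diverse(sensitive_values, l, tol=1e-12):
    """Entropy l-diversity for a single equivalence class: entropy >= log(l)."""
    return entropy(sensitive_values) >= math.log(l) - tol

skewed = ["Cancer", "Heart Disease"] + ["Flu"] * 8   # 3 distinct values, 80% Flu
even = ["Cancer", "Heart Disease", "Flu"]            # 3 distinct values, uniform

print(is_entropy_l_diverse(skewed, 2))
print(is_entropy_l_diverse(even, 3))
```

The skewed class from the distinct l-diversity example has three distinct values but fails even entropy 2-diversity, while the evenly distributed class satisfies entropy 3-diversity.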

20 l-diversity (Cont'd)
Recursive (c,l)-diversity
  The most frequent value does not appear too frequently:
  $r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$
  where $r_i$ is the number of times the i-th most frequent sensitive value appears in the equivalence class.
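
The inequality r_1 < c(r_l + ... + r_m) can be sketched directly, assuming r_i is the count of the i-th most frequent sensitive value in one equivalence class. The function name `is_recursive_cl_diverse` and the sample class are illustrative.

```python
from collections import Counter

def is_recursive_cl_diverse(sensitive_values, c, l):
    """Check r_1 < c * (r_l + r_{l+1} + ... + r_m) for one equivalence class."""
    # r[0] is r_1, the count of the most frequent value; r[l-1] is r_l.
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    return r[0] < c * sum(r[l - 1:])

# Counts are r = [8, 1, 1]: Flu eight times, the other two values once each.
diseases = ["Cancer", "Heart Disease"] + ["Flu"] * 8

print(is_recursive_cl_diverse(diseases, c=4, l=2))  # 8 < 4 * (1 + 1) ?
print(is_recursive_cl_diverse(diseases, c=5, l=2))  # 8 < 5 * (1 + 1) ?
```

A larger c tolerates a more dominant top value; if the class has fewer than l distinct values the tail sum is zero and the check fails.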

21 Limitations of l-diversity
l-diversity may be difficult and unnecessary to achieve.
Consider a single sensitive attribute with two values: HIV positive (1%) and HIV negative (99%). The two values have very different degrees of sensitivity.
l-diversity is unnecessary to achieve:
  2-diversity is unnecessary for an equivalence class that contains only negative records.
l-diversity is difficult to achieve:
  Suppose there are 10000 records in total. To have distinct 2-diversity, there can be at most 10000*1% = 100 equivalence classes.

22 Limitations of l-diversity (Cont'd)
l-diversity is insufficient to prevent attribute disclosure.
Skewness Attack
  Two sensitive values: HIV positive (1%) and HIV negative (99%).
  Serious privacy risk: consider an equivalence class that contains an equal number of positive and negative records. Anyone in that class appears 50% likely to be positive, compared with 1% in the overall population.
  l-diversity does not differentiate between:
    Equivalence class 1: 49 positive + 1 negative
    Equivalence class 2: 1 positive + 49 negative
l-diversity does not consider the overall distribution of sensitive values.

23 Limitations of l-diversity (Cont'd)
l-diversity is insufficient to prevent attribute disclosure.
Similarity Attack: the attacker knows Bob's Zip and Age.
A 3-diverse patient table:
Zipcode  Age  Salary  Disease
476**    2*   20K     Gastric Ulcer
476**    2*   30K     Gastritis
476**    2*   40K     Stomach Cancer
4790*    ≥40  50K     Gastritis
4790*    ≥40  100K    Flu
4790*    ≥40  70K     Bronchitis
476**    3*   60K     Bronchitis
476**    3*   80K     Pneumonia
476**    3*   90K     Stomach Cancer
Conclusion:
  1. Bob's salary is in [20k,40k], which is relatively low.
  2. Bob has some stomach-related disease.
l-diversity does not consider the semantic meanings of sensitive values.

24 t-closeness: A New Privacy Measure
Rationale
A completely generalized table:
Age  Zipcode  Gender  Disease
*    *        *       Flu
*    *        *       Heart Disease
*    *        *       Cancer
*    *        *       Gastritis
Belief  Knowledge
B0      External knowledge
B1      Overall distribution Q of sensitive values

25 t-closeness: A New Privacy Measure
Rationale
A released table:
Age  Zipcode  Gender  Disease
2*   479**    Male    Flu
2*   479**    Male    Heart Disease
2*   479**    Male    Cancer
* * Gastritis
Belief  Knowledge
B0      External knowledge
B1      Overall distribution Q of sensitive values
B2      Distribution Pi of sensitive values in each equi-class

26 t-closeness: A New Privacy Measure
Rationale
Belief  Knowledge
B0      External knowledge
B1      Overall distribution Q of sensitive values
B2      Distribution Pi of sensitive values in each equi-class
Observations
  Q should be public.
  Knowledge gain comes in two parts:
    About the whole population (from B0 to B1)
    About specific individuals (from B1 to B2)
  We bound the knowledge gain between B1 and B2 instead.
Principle
  The distance between Q and Pi should be bounded by a threshold t.

27 Distance Measures
P = (p_1, p_2, ..., p_m), Q = (q_1, q_2, ..., q_m)
Trace distance: $D[P,Q] = \sum_{i=1}^{m} \frac{1}{2}\,|p_i - q_i|$
KL-divergence: $D[P,Q] = \sum_{i=1}^{m} p_i \log\frac{p_i}{q_i}$
Neither of these measures reflects the semantic distance among values.
  Q: {3K,4K,5K,6K,7K,8K,9K,10K,11K}
  P1: {3K,4K,5K}
  P2: {6K,8K,11K}
  Intuitively, D[P1,Q] > D[P2,Q].
Define a ground distance for every pair of values; D[P,Q] then depends on the ground distances.
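
To make the limitation concrete, the sketch below evaluates both measures on the salary example. It assumes the standard definitions (half the L1 distance for the trace/variational distance, and sum of p_i log(p_i/q_i) for KL-divergence) and takes P2 as {6k, 8k, 11k}, the values used in the Earth Mover's Distance example slide; all function names are illustrative.

```python
import math

SALARIES = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # in thousands

def uniform_on(subset):
    """Distribution over SALARIES that is uniform on `subset` and zero elsewhere."""
    return [1 / len(subset) if s in subset else 0.0 for s in SALARIES]

def variational_distance(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    # Terms with p_i = 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = uniform_on(set(SALARIES))
p1 = uniform_on({3, 4, 5})
p2 = uniform_on({6, 8, 11})

print(variational_distance(p1, q), variational_distance(p2, q))
print(kl_divergence(p1, q), kl_divergence(p2, q))
```

Both measures assign P1 and P2 the same distance from Q, because they only compare probabilities position by position and ignore how far apart the salary values are.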

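28 Earth Mover's Distance
Formulation
P = (p_1, p_2, ..., p_m), Q = (q_1, q_2, ..., q_m)
d_ij: the ground distance between element i of P and element j of Q
Find a flow F = [f_ij], where f_ij is the flow of mass from element i of P to element j of Q, that minimizes the overall work:
$WORK(P,Q,F) = \sum_{i=1}^{m}\sum_{j=1}^{m} d_{ij} f_{ij}$
subject to the constraints:
$f_{ij} \ge 0, \quad 1 \le i \le m,\ 1 \le j \le m$
$p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i, \quad 1 \le i \le m$
$\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} q_i = 1$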

29 Earth Mover's Distance
Example
P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k}
Move 1/9 probability mass for each of the following pairs:
  3k->6k, 3k->7k    cost: 1/9*(3+4)/8
  4k->8k, 4k->9k    cost: 1/9*(4+5)/8
  5k->10k, 5k->11k  cost: 1/9*(5+6)/8
Total cost: 1/9*27/8 = 0.375
With P2 = {6k,8k,11k}, the total cost is 0.167 < 0.375.
This makes more sense than the other two distance measures.

30 How to calculate EMD
EMD for numerical attributes
Ordered distance: $\text{ordered\_dist}(v_i, v_j) = \frac{|i - j|}{m - 1}$
Ordered distance is a metric: non-negative, symmetric, and satisfying the triangle inequality.
Let $r_i = p_i - q_i$; then D[P,Q] is calculated as:
$D[P,Q] = \frac{1}{m-1}\left(|r_1| + |r_1 + r_2| + \dots + |r_1 + r_2 + \dots + r_{m-1}|\right) = \frac{1}{m-1}\sum_{i=1}^{m}\left|\sum_{j=1}^{i} r_j\right|$
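
The closed form for ordered distance is a running sum of the differences r_i = p_i - q_i. The following sketch applies it to the salary example from the Earth Mover's Distance example slide; the helper names `ordered_emd` and `uniform_on` are illustrative, and values are assumed to be listed in sorted order.

```python
from itertools import accumulate

SALARIES = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # in thousands, sorted

def uniform_on(subset):
    """Distribution over SALARIES that is uniform on `subset` and zero elsewhere."""
    return [1 / len(subset) if s in subset else 0.0 for s in SALARIES]

def ordered_emd(p, q):
    """EMD under ordered distance: sum of |prefix sums of (p_i - q_i)| divided by (m - 1)."""
    m = len(p)
    r = [pi - qi for pi, qi in zip(p, q)]
    return sum(abs(prefix) for prefix in accumulate(r)) / (m - 1)

q = uniform_on(set(SALARIES))
d1 = ordered_emd(uniform_on({3, 4, 5}), q)
d2 = ordered_emd(uniform_on({6, 8, 11}), q)

print(round(d1, 3), round(d2, 3))
```

Up to floating-point rounding, this reproduces the totals from the example, 0.375 for {3k,4k,5k} and 0.167 for {6k,8k,11k}, without solving the general flow problem.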

31 How to calculate EMD
EMD for categorical attributes
Equal distance: $\text{equal\_dist}(v_i, v_j) = 1$
Equal distance is a metric.
D[P,Q] is calculated as:
$D[P,Q] = \frac{1}{2}\sum_{i=1}^{m} |p_i - q_i| = \sum_{p_i \ge q_i} (p_i - q_i) = -\sum_{p_i < q_i} (p_i - q_i)$

32 How to calculate EMD (Cont'd)
EMD for categorical attributes
Hierarchical distance: $\text{hierarchical\_dist}(v_i, v_j) = \frac{\text{level}(v_i, v_j)}{H}$
  where level(v_i, v_j) is the height of the lowest common ancestor of v_i and v_j, and H is the height of the hierarchy.
Hierarchical distance is a metric.
Example hierarchy:
Respiratory & digestive system diseases
  Respiratory system diseases
    Respiratory infection
      Flu
      Pneumonia
      Bronchitis
    Vascular lung diseases
      Pulmonary edema
      Pulmonary embolism
  Digestive system diseases
    Stomach diseases
      Gastric ulcer
      Stomach cancer
    Colon diseases
      Colitis
      Colon cancer

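33 How to calculate EMD (Cont'd)
EMD for categorical attributes
$\text{extra}(N) = \begin{cases} p_i - q_i & \text{if } N \text{ is a leaf} \\ \sum_{C \in Child(N)} \text{extra}(C) & \text{otherwise} \end{cases}$
$\text{pos\_extra}(N) = \sum_{C \in Child(N),\ \text{extra}(C) > 0} |\text{extra}(C)|$
$\text{neg\_extra}(N) = \sum_{C \in Child(N),\ \text{extra}(C) < 0} |\text{extra}(C)|$
$\text{cost}(N) = \frac{\text{height}(N)}{H}\,\min(\text{pos\_extra}(N), \text{neg\_extra}(N))$
D[P,Q] is calculated as:
$D[P,Q] = \sum_{N} \text{cost}(N)$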
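
The recursive formulas can be sketched as a single bottom-up pass over the hierarchy. This is a minimal illustration under stated assumptions: the hierarchy is the disease taxonomy from the hierarchical-distance slide encoded as nested dicts and lists, distributions are dicts from leaf name to probability, and the names `ROOT`, `_walk`, and `hierarchical_emd` are hypothetical.

```python
# Disease hierarchy: inner nodes are dicts (or lists of leaves); leaves are strings.
ROOT = {
    "Respiratory system diseases": {
        "Respiratory infection": ["Flu", "Pneumonia", "Bronchitis"],
        "Vascular lung diseases": ["Pulmonary edema", "Pulmonary embolism"],
    },
    "Digestive system diseases": {
        "Stomach diseases": ["Gastric ulcer", "Stomach cancer"],
        "Colon diseases": ["Colitis", "Colon cancer"],
    },
}
TREE_HEIGHT = 3  # H: number of levels above the leaves

def _walk(node, p, q, tree_height):
    """Return (extra, height, accumulated cost) for the subtree rooted at node."""
    if isinstance(node, str):  # leaf: extra = p_i - q_i
        return p.get(node, 0.0) - q.get(node, 0.0), 0, 0.0
    children = node if isinstance(node, list) else list(node.values())
    results = [_walk(child, p, q, tree_height) for child in children]
    extras = [extra for extra, _, _ in results]
    height = 1 + max(h for _, h, _ in results)
    pos_extra = sum(e for e in extras if e > 0)
    neg_extra = sum(-e for e in extras if e < 0)
    cost = height / tree_height * min(pos_extra, neg_extra)
    return sum(extras), height, cost + sum(c for _, _, c in results)

def hierarchical_emd(root, p, q, tree_height):
    """D[P,Q] = sum of cost(N) over all non-leaf nodes N."""
    return _walk(root, p, q, tree_height)[2]

same_branch = hierarchical_emd(ROOT, {"Flu": 1.0}, {"Pneumonia": 1.0}, TREE_HEIGHT)
far_apart = hierarchical_emd(ROOT, {"Flu": 1.0}, {"Colitis": 1.0}, TREE_HEIGHT)
print(same_branch, far_apart)
```

Moving all mass between two respiratory infections costs 1/3, since their lowest common ancestor has height 1, while moving it from Flu to Colitis has to pass through the root and costs 1.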

34 Experiments
Goal
  To show that l-diversity does not provide sufficient privacy protection (the similarity attack).
  To show that the efficiency and data quality of using t-closeness are comparable with other privacy measures.
Setup
  Adult dataset from UC Irvine ML repository
  tuples, 9 attributes (2 sensitive attributes)
  Algorithm: Incognito

35 Experiments
Similarity attack (Occupation)
  13 of 21 entropy 2-diversity tables are vulnerable.
  17 of 26 recursive (4,4)-diversity tables are vulnerable.
Comparisons of privacy measurements
  k-anonymity
  Entropy l-diversity
  Recursive (c,l)-diversity
  k-anonymity with t-closeness

36 Experiments
Efficiency
  The efficiency of using t-closeness is comparable with other privacy measurements.

37 Experiments
Data utility
  Metrics: discernibility metric; minimum average group size.
  The data quality of using t-closeness is comparable with other privacy measurements.

38 Conclusion
Limitations of l-diversity
  l-diversity is difficult and unnecessary to achieve.
  l-diversity is insufficient in preventing attribute disclosure.
t-closeness as a new privacy measure
  The overall distribution of sensitive values should be public information.
  The separation of the knowledge gain.
EMD to measure distance
  EMD captures semantic distance well.
  Simple formulas for three ground distances.

39 Questions? Thank you!
