Achieving k-Anonymity Privacy Protection Using Generalization and Suppression

UT DALLAS, Erik Jonsson School of Engineering & Computer Science
Achieving k-Anonymity Privacy Protection Using Generalization and Suppression
Murat Kantarcioglu (based on Sweeney's 2002 paper)

Releasing Private Data
Problem: publishing private data while, at the same time, protecting individual privacy.
Challenges:
- How to quantify privacy protection
- How to maximize the usefulness of published data
- How to minimize the risk of disclosure

Sanitization
- Automated de-identification of private data with certain privacy guarantees, as opposed to the formal determination by statisticians that HIPAA requires.
- Two major research directions:
  1. Perturbation (e.g., random noise addition)
  2. Anonymization (e.g., k-anonymization)

Anonymization: HIPAA Revisited
- Limited data set: no unique identifiers. Safe enough?
- It was not for the Governor of Massachusetts: 87% of US citizens can possibly be uniquely identified using ZIP code, sex, and birth date.#
# L. Sweeney, k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570, 2002.

Anonymization
- Removing unique identifiers is not sufficient.
- Quasi-identifier (QI): a maximal set of attributes that could help identify individuals, assumed to be publicly available (e.g., in voter registration lists).

Anonymization as a Process
1. Remove all unique identifiers.
2. Identify the QI attributes and model the adversary's background knowledge.
3. Enforce some privacy definition (e.g., k-anonymity).

k-Anonymity
- Each released record should be indistinguishable from at least (k-1) others on its QI attributes.
- Alternatively: the cardinality of any query result on the released data should be at least k.
- k-anonymity is (the first) one of many privacy definitions in this line of work: l-diversity, t-closeness, m-invariance, δ-presence, ... A minimal check is sketched below.
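As a minimal sketch in Python (assuming the table is held as a list of dicts; `is_k_anonymous` is a name chosen here for illustration, not from the paper), the check is a frequency count over QI projections:

```python
from collections import Counter

def is_k_anonymous(table, qi_attrs, k):
    """True iff every record shares its QI values with at least k-1 others."""
    counts = Counter(tuple(row[a] for a in qi_attrs) for row in table)
    return all(c >= k for c in counts.values())

# Example: QI = {Race, ZIP}, k = 2
rows = [
    {"Race": "White", "ZIP": "02138"},
    {"Race": "White", "ZIP": "02138"},
    {"Race": "Black", "ZIP": "02139"},
]
print(is_k_anonymous(rows, ["Race", "ZIP"], 2))  # False: the third row is unique
```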

Hardness
- Given some data set R and a QI Q, does R satisfy k-anonymity over Q? Easy to tell in polynomial time, so the problem is in NP.
- Finding an optimal anonymization is not easy: NP-hard, by a reduction from k-dimensional perfect matching.* A polynomial solution would imply P = NP.
- Heuristic solutions: DataFly, Incognito, Mondrian, TDS, ...
* A. Meyerson and R. Williams. On the Complexity of Optimal k-Anonymity. In PODS '04.

Tools
- Generalization: replacing (recoding) a value with a less specific but semantically consistent one.
- Suppression: not releasing any value at all.
Advantages:
1. Reveals what was done to the data
2. Truthful (no incorrect implications)
3. Trade-off between anonymity and distortion
4. Adjustable to the recipient's needs

DGH / VGH
[Figure: domain generalization hierarchy (DGH) and value generalization hierarchy (VGH) for the ZIP attribute, from the ground domain up to the suppression value.]
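One simple way to encode such a hierarchy is a parent map; the sketch below uses illustrative ZIP codes and a hypothetical `generalize` helper, not the exact values from the slide's figure:

```python
# Parent map encoding a 3-level ZIP DGH, from the ground domain up to
# the suppression value "*****" (ZIP codes here are illustrative).
ZIP_DGH = {
    "02138": "0213*", "02139": "0213*",
    "02141": "0214*", "02142": "0214*",
    "0213*": "*****", "0214*": "*****",
    "*****": None,  # suppression value = root of the hierarchy
}

def generalize(value, dgh, levels=1):
    """Walk `levels` steps up the hierarchy from `value`."""
    for _ in range(levels):
        parent = dgh.get(value)
        if parent is None:  # already at the suppression value
            break
        value = parent
    return value

print(generalize("02139", ZIP_DGH))     # 0213*
print(generalize("02139", ZIP_DGH, 2))  # *****
```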

Example
- QI = {Race, ZIP}, k = 2.
- A k-anonymous relation should have at least 2 tuples with the same values on Dom(Race_i) x Dom(ZIP_j), where Race_i and ZIP_j are domains chosen from the corresponding DGHs.

Example
[Table: the example relation and its candidate generalizations.]

k-Minimal Generalization
- Given |R| ≥ k, there is always a trivial solution: generalize all attributes to the VGH root. Not very useful if there exists another k-anonymization with higher-granularity (more specific) values.
- A k-minimal generalization:
  - satisfies k-anonymity, and
  - none of its specializations satisfies k-anonymity.
- E.g., [0,2] is not minimal, since [0,1] is k-anonymous; [1,0] is minimal, since [0,0] is not k-anonymous. A sketch of the minimality test follows.
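Under the slide's [i, j] notation, a generalization is a vector of per-attribute heights. A sketch of the minimality test, reusing `is_k_anonymous` and `generalize` from the sketches above (`apply_levels` and `is_k_minimal` are names chosen here; `dghs` maps each attribute to its parent map):

```python
def apply_levels(table, qi_attrs, dghs, levels):
    """Recode each QI attribute up to the height named in `levels`."""
    return [
        {**row, **{a: generalize(row[a], dghs[a], lv)
                   for a, lv in zip(qi_attrs, levels)}}
        for row in table
    ]

def is_k_minimal(table, qi_attrs, dghs, levels, k):
    """k-anonymous, and no single-attribute specialization still is."""
    if not is_k_anonymous(apply_levels(table, qi_attrs, dghs, levels), qi_attrs, k):
        return False
    for i, lv in enumerate(levels):
        if lv > 0:
            lower = list(levels)
            lower[i] -= 1  # specialize one attribute by one level
            if is_k_anonymous(apply_levels(table, qi_attrs, dghs, lower), qi_attrs, k):
                return False  # a more specific table already satisfies k-anonymity
    return True
```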

Precision Metric, Prec(.)
- Multiple k-minimal generalizations may exist, e.g., [1,0] and [0,1] from the example.
- The precision metric indicates the generalization with minimal information loss and maximal usefulness (informally so, since Prec is not based on entropy).
- Problem: how to define usefulness?

Precision Metric, Prec(.)
Precision is the average height of the generalized values, normalized by DGH depth, per attribute per record:

\mathrm{Prec}(GT) = 1 - \frac{\sum_{i=1}^{N_A} \sum_{j=1}^{|PT|} h_{ij} / |DGH_{A_i}|}{|PT| \cdot N_A}

where N_A is the number of attributes, |PT| is the data set size, |DGH_{A_i}| is the depth of the DGH for attribute A_i, and h_{ij} is the generalization height of record j's value on A_i.
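A direct transcription of the formula into code, assuming per-cell heights are tracked alongside the released table (`precision` and `dgh_depths` are names chosen here for illustration):

```python
def precision(heights, dgh_depths, qi_attrs):
    """Prec(GT) = 1 - average normalized generalization height.

    heights: one dict per record mapping attribute -> height of the
             released value in that attribute's DGH (0 = ground domain).
    dgh_depths: attribute -> |DGH_Ai|, the depth of its hierarchy.
    """
    total = sum(h[a] / dgh_depths[a] for h in heights for a in qi_attrs)
    return 1 - total / (len(heights) * len(qi_attrs))

# Two records; ZIP generalized 1 of 2 levels, Race left at the ground level:
hs = [{"Race": 0, "ZIP": 1}, {"Race": 0, "ZIP": 1}]
print(precision(hs, {"Race": 1, "ZIP": 2}, ["Race", "ZIP"]))  # 0.75
```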

Precision Metric, Prec(.)
- Notice that precision depends on the DGH/VGH: different DGHs result in different precision measurements for the same table, and the structure of the DGHs might determine the generalization of choice.
- DGHs should therefore be semantically meaningful, i.e., created by domain experts.

k-Minimal Distortion
- The most precise release that adheres to k-anonymity, with precision measured by Prec(.).
- Any k-minimal distortion is a k-minimal generalization.
- In the example, only [0,1] is a k-minimal distortion: [0,0] is not k-anonymous, and [1,0] and the others are less precise.

MinGen Algorithm
Steps:
1. Generate all generalizations of the private table.
2. Discard those that violate k-anonymity.
3. Find all generalizations with the highest precision.
4. Return one based on some preference criteria.
Unrealistic: even with attribute-level generalization/suppression, there are too many candidates. A brute-force sketch follows.
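A sketch restricted to attribute-level recoding, reusing the helpers above; full MinGen would also have to enumerate cell-level generalizations, which is exactly why it is unrealistic:

```python
from itertools import product

def mingen(table, qi_attrs, dghs, dgh_depths, k):
    """Try every attribute-level generalization vector, keep the k-anonymous
    ones, and return one with the highest Prec. Exponential in N_A."""
    best, best_prec = None, -1.0
    for levels in product(*(range(dgh_depths[a] + 1) for a in qi_attrs)):
        gt = apply_levels(table, qi_attrs, dghs, levels)
        if is_k_anonymous(gt, qi_attrs, k):
            # Attribute-level recoding: every record has the same heights.
            heights = [dict(zip(qi_attrs, levels))] * len(table)
            p = precision(heights, dgh_depths, qi_attrs)
            if p > best_prec:
                best, best_prec = gt, p
    return best  # None if no attribute-level generalization is k-anonymous
```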

MinGen Algorithm
- Attribute-level generalization = global recoding; cell (tuple)-level generalization = local recoding.
[Figure: example tables contrasting global and local recoding.]


DataFly Algorithm
Steps:
1. Create equivalences over the Cartesian product of the QI attributes.
2. Heuristically select an attribute to generalize.
3. Continue until fewer than k records remain outside k-anonymous groups, then suppress those outliers.
- Too much distortion, due to attribute-level generalization and greedy choices, but k-anonymity is guaranteed. A greedy sketch follows.
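A greedy sketch in the spirit of DataFly, reusing `generalize` from above; Sweeney's actual algorithm differs in bookkeeping details, so treat this as illustrative:

```python
from collections import Counter

def datafly(table, qi_attrs, dghs, k):
    """Greedy attribute-level generalization, then outlier suppression."""
    table = [dict(row) for row in table]
    while True:
        counts = Counter(tuple(row[a] for a in qi_attrs) for row in table)
        outliers = sum(c for c in counts.values() if c < k)
        if outliers <= k:
            # Few enough stragglers: suppress them and release the rest.
            ok = {qi for qi, c in counts.items() if c >= k}
            return [row for row in table if tuple(row[a] for a in qi_attrs) in ok]
        # Heuristic: generalize the attribute with the most distinct values.
        target = max(qi_attrs, key=lambda a: len({row[a] for row in table}))
        for row in table:
            row[target] = generalize(row[target], dghs[target])
```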


μ-Argus Algorithm
Steps:
1. Generalize until each QI attribute value appears at least k times.
2. Check k-anonymity over combinations of 2 and 3 QI attributes only, generalizing further according to the data holder's choices.
3. Suppress any remaining outliers.
- k-anonymity is not guaranteed, but it is faster than DataFly. A sketch of the combination check follows.
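μ-Argus's key shortcut is checking only low-order attribute combinations rather than the full QI cross-product; a sketch of that check (`rare_combinations` is a hypothetical helper) shows why full-QI k-anonymity can slip through:

```python
from collections import Counter
from itertools import combinations

def rare_combinations(table, qi_attrs, k):
    """Value combinations over 2 or 3 QI attributes occurring fewer than
    k times; these are what mu-Argus generalizes or suppresses. A table can
    pass this check yet still violate k-anonymity on the full QI."""
    rare = []
    for size in (2, 3):
        for attrs in combinations(qi_attrs, size):
            counts = Counter(tuple(row[a] for a in attrs) for row in table)
            rare.extend((attrs, vals) for vals, c in counts.items() if c < k)
    return rare
```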


What's Next?
- l-diversity: k-anonymity allows a homogeneous distribution of sensitive attribute values within an anonymized group, which can still leak information.
- E.g., background knowledge reveals that Japanese Umeko has a viral infection; homogeneity reveals that neighbor Bob has cancer.

UTD Anonymization Library
- Contains 5 different methods of anonymization.
- Soon to come: support for 2 other anonymity definitions, integration with Weka, and perturbation methods.