Data Anonymization. Graham Cormode.

Transcription:

Data Anonymization Graham Cormode graham@research.att.com

Why Anonymize? For data sharing: give real(istic) data to others to study without compromising the privacy of individuals in the data; this allows third parties to try new analysis and mining techniques not thought of by the data owner. For data retention and usage: various requirements prevent companies from retaining customer information indefinitely (e.g. Google progressively anonymizes IP addresses in search logs), and anonymization supports internal sharing across departments (e.g. billing, marketing).

Models of Anonymization. Interactive model (akin to statistical databases): the data owner acts as gatekeeper to the data; researchers pose queries in some agreed language; the gatekeeper gives an (anonymized) answer, or refuses to answer. "Send me your code" model: the data owner executes the code on their system and reports the result; the owner cannot be sure that the code is not malicious, or even that it compiles. Offline, aka "publish and be damned" model: the data owner somehow anonymizes the data set, publishes the result, and retires; this seems to best model many real releases.

Objectives for Anonymization. Prevent (high-confidence) inference of associations: prevent inference of the salary of an individual in census data, of an individual's video viewing history, of an individual's search history in search logs. All aim to prevent linking sensitive information to an individual. We have to model what knowledge might be known to the attacker. Background knowledge: facts about the data set (X has salary Y). Domain knowledge: broad properties of the data (illness Z is rare in men).

Utility. Anonymization is meaningless if the utility of the data is not considered: the empty data set has perfect privacy but no utility, while the original data has full utility but no privacy. What is utility? It depends on the application. For a fixed query set, one can look at the maximum or average distortion. The problem for publishing: we want to support unknown applications! We need some way to quantify the utility of alternate anonymizations.

Part 1: Syntactic Anonymizations. Syntactic anonymization modifies the input data set to achieve some syntactic property intended to make reidentification difficult. Many variations have been proposed: k-anonymity, l-diversity, t-closeness, and many, many more.

Tabular Data Example. Census data recording incomes and demographics:

SSN       DOB      Sex  ZIP    Salary
11-1-111  1/21/76  M    53715  50,000
22-2-222  4/13/86  F    53715  55,000
33-3-333  2/28/76  M    53703  60,000
44-4-444  1/21/76  M    53703  65,000
55-5-555  4/13/86  F    53706  70,000
66-6-666  2/28/76  F    53706  75,000

Releasing the (SSN, Salary) association violates an individual's privacy. SSN is an identifier, Salary is a sensitive attribute (SA).

Tabular Data Example: De-Identification. Census data: remove SSN to create a de-identified table:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

Does the de-identified table preserve an individual's privacy? It depends on what other information an attacker knows.

Tabular Data Example: Linking Attack. De-identified private data plus publicly available data:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

Public data:
SSN       DOB
11-1-111  1/21/76
33-3-333  2/28/76

Using DOB alone, the attacker cannot uniquely identify either individual's salary. DOB is a quasi-identifier (QI).

Tabular Data Example: Linking Attack. De-identified private data plus publicly available data:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

Public data:
SSN       DOB      Sex  ZIP
11-1-111  1/21/76  M    53715
33-3-333  2/28/76  M    53703

With the combination [DOB, Sex, ZIP], both individuals' salaries are uniquely identified. [DOB, Sex, ZIP] is unique for the majority of US residents [Sweeney 02].
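To make the attack concrete, here is a minimal sketch in Python using pandas and the toy rows above; the "public" table stands in for an assumed external source such as a voter list and is not part of the slides' own data.

```python
# A minimal sketch of the linking attack (assumes pandas is available).
import pandas as pd

deidentified = pd.DataFrame({
    "DOB":    ["1/21/76", "4/13/86", "2/28/76", "1/21/76", "4/13/86", "2/28/76"],
    "Sex":    ["M", "F", "M", "M", "F", "F"],
    "ZIP":    ["53715", "53715", "53703", "53703", "53706", "53706"],
    "Salary": [50000, 55000, 60000, 65000, 70000, 75000],
})

public = pd.DataFrame({
    "SSN": ["11-1-111", "33-3-333"],
    "DOB": ["1/21/76", "2/28/76"],
    "Sex": ["M", "M"],
    "ZIP": ["53715", "53703"],
})

# Join on the quasi-identifier [DOB, Sex, ZIP]: each public record matches
# exactly one de-identified row, re-attaching SSN to Salary.
linked = public.merge(deidentified, on=["DOB", "Sex", "ZIP"])
print(linked[["SSN", "Salary"]])
```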

Tabular Data Example: Anonymization. Anonymization through QI attribute generalization:

DOB      Sex  ZIP    Salary
1/21/76  M    537**  50,000
4/13/86  F    537**  55,000
2/28/76  *    537**  60,000
1/21/76  M    537**  65,000
4/13/86  F    537**  70,000
2/28/76  *    537**  75,000

Public data:
SSN       DOB      Sex  ZIP
11-1-111  1/21/76  M    53715
33-3-333  2/28/76  M    53703

The attacker cannot uniquely identify a tuple even with knowledge of the QI values. E.g., ZIP = 537** means ZIP ∈ {53700, ..., 53799}.
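A small sketch of the generalization step itself; the helper names are hypothetical, not from the slides.

```python
# Illustrative QI generalization: coarsen ZIP codes to a prefix and,
# where needed, suppress an attribute value entirely.
def generalize_zip(zip_code: str, keep_digits: int = 3) -> str:
    """Replace the trailing digits of a ZIP code with '*', e.g. 53715 -> 537**."""
    return zip_code[:keep_digits] + "*" * (len(zip_code) - keep_digits)

def suppress(value: str) -> str:
    """Fully suppress an attribute value."""
    return "*"

assert generalize_zip("53715") == "537**"
assert suppress("M") == "*"
```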

Tabular Data Example: Anonymization. Anonymization through sensitive attribute (SA) permutation:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  55,000
4/13/86  F    53715  50,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  75,000
2/28/76  F    53706  70,000

Public data:
SSN       DOB      Sex  ZIP
11-1-111  1/21/76  M    53715
33-3-333  2/28/76  M    53703

The attacker can uniquely identify a tuple, but is left uncertain about its SA value. This is a much more precise form of uncertainty than generalization.

k-Anonymization [Samarati, Sweeney 98]. k-anonymity: a table T satisfies k-anonymity w.r.t. quasi-identifiers QI iff each tuple in (the multiset) T[QI] appears at least k times; this protects against the linking attack. k-anonymization: a table T* is a k-anonymization of T if T* is generated from T and T* satisfies k-anonymity.

Original table T:                2-anonymization T*:
DOB      Sex  ZIP    Salary      DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000      1/21/76  M    537**  50,000
4/13/86  F    53715  55,000      4/13/86  F    537**  55,000
2/28/76  M    53703  60,000      2/28/76  *    537**  60,000
1/21/76  M    53703  65,000      1/21/76  M    537**  65,000
4/13/86  F    53706  70,000      4/13/86  F    537**  70,000
2/28/76  F    53706  75,000      2/28/76  *    537**  75,000
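A minimal sketch of checking the k-anonymity condition, assuming the table is held as a pandas DataFrame as in the linking-attack sketch above.

```python
import pandas as pd

def is_k_anonymous(table: pd.DataFrame, qi, k: int) -> bool:
    """True iff every combination of QI values appears at least k times in table."""
    return table.groupby(qi).size().min() >= k

# The generalized table T* above is 2-anonymous w.r.t. QI = ["DOB", "Sex", "ZIP"]:
# every (DOB, Sex, 537**) combination occurs exactly twice.
```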

Homogeneity Attack [Machanavajjhala+ 06]. Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear k times, but says nothing about the SA values. If (almost) all SA values in a QI group are equal, there is a loss of privacy! The problem is with the choice of grouping, not the data: for some groupings there is no loss of privacy.

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  50,000
4/13/86  F    53706  55,000
2/28/76  F    53706  60,000

Grouping on DOB (generalizing ZIP to 537**): every QI group has a single salary value. Not ok!
Grouping on ZIP (generalizing DOB to 76-86 and Sex to *): every QI group contains two distinct salaries. Ok!

l-Diversity [Machanavajjhala+ 06]. Intuition: the most frequent SA value should not appear too often compared to the less frequent values in a QI group. Simplified l-diversity definition: for each group, the maximum SA value frequency is at most 1/l.

DOB      Sex  ZIP    Salary
1/21/76  *    537**  50,000
4/13/86  *    537**  55,000
2/28/76  *    537**  60,000
1/21/76  *    537**  50,000
4/13/86  *    537**  55,000
2/28/76  *    537**  60,000

l-diversity((1/21/76, *, 537**)) = 1: both tuples in this group have salary 50,000, so the most frequent value has frequency 1 and the group is only 1-diverse.

Simple Algorithm for l-Diversity. A simple greedy algorithm provides l-diversity: sort the tuples on the QI attributes so similar tuples are close; start with a group containing just the first tuple; keep adding tuples to the group, in order, until l-diversity is met; output the group, and repeat on the remaining tuples.

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  50,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  50,000
2/28/76  F    53706  60,000

For 2-diversity on this table, the greedy algorithm outputs two groups: the first four tuples and the last two. Caution: knowledge of the algorithm used can reveal associations!
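A sketch of this greedy sweep in Python, using the simplified definition above (in every output group the most frequent sensitive value has relative frequency at most 1/l); the handling of leftover rows is an assumption, since the slide does not specify it.

```python
from collections import Counter

def max_frequency(sa_values):
    """Relative frequency of the most common sensitive value."""
    counts = Counter(sa_values)
    return max(counts.values()) / sum(counts.values())

def greedy_l_diverse_groups(rows, sa_index, l):
    """Sweep rows (already sorted on the QI attributes), closing a group
    as soon as it satisfies simplified l-diversity."""
    groups, current = [], []
    for row in rows:
        current.append(row)
        if max_frequency(r[sa_index] for r in current) <= 1.0 / l:
            groups.append(current)
            current = []
    if current:
        # Leftover rows that never reached l-diversity: fold them into the
        # last group (an assumption; the merged group should be re-checked).
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups
```

Running it with l = 2 on the six tuples above, in the order shown, yields the two groups noted on the slide.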

Syntactic Anonymization Summary. Pros: provides natural definitions (e.g. k-anonymity); keeps the data in a form similar to the input (e.g. as tuples); gives privacy beyond simply removing identifiers. Cons: no strong guarantees are known against arbitrary adversaries; the resulting data is not always convenient to work with; the cycle of attack and patching has led to a glut of definitions.

Part 2: Differential Privacy. A randomized algorithm K satisfies ε-differential privacy if, for any pair of neighboring data sets D and D′ and any property S: Pr[K(D) ∈ S] ≤ e^ε · Pr[K(D′) ∈ S]. Introduced by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith in 2006.

Differential Privacy for Numeric Functions. Sensitivity of publishing a numeric function f: s = max |f(X) − f(X′)| over all pairs of data sets X, X′ that differ by one individual. To give ε-differential privacy for a function with sensitivity s: add Laplace noise Lap(s/ε) to the true answer.

Laplace Distribution. The Laplace distribution with parameter λ is symmetric about 0 with exponential tails: the density at x is f(x) ∝ exp(−|x|/λ). Hence f(x)/f(x+δ) = exp(−|x|/λ)/exp(−|x+δ|/λ) ≤ exp(δ/λ). Differential privacy for numeric values: the sensitivity is s, hence δ = s; set λ = s/ε. The ratio of densities is then at most exp(ε) at every point x.
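A minimal sketch of the resulting Laplace mechanism; numpy's Laplace sampler takes this same scale parameter λ = s/ε.

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Release true_answer under epsilon-DP by adding Laplace(scale = s/epsilon) noise."""
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query (sensitivity 1) answered with epsilon = 0.1.
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.1)
```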

Sensitivity of Some Functions. Count has sensitivity 1 (e.g. count how many students are left-handed). Sum and median have sensitivity equal to the maximum range of possible values. Histograms / contingency tables have sensitivity 2 (e.g. count how many people fall into each salary range: 0-50K, 50-100K, 100-150K, 150-200K, 200K+).

Dealing with Sensitivity. Sometimes the sensitivity (and hence the noise) can be very high: the sensitivity of a sum of salaries is roughly $1BN, since some people make this much. Remedies: replace each value with a clipped value (e.g. cut off at $1M), or work with histograms/contingency tables instead.
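A sketch of the clipping idea: cap each salary at an assumed bound before summing, so the sum's sensitivity becomes the cap rather than the largest salary anyone might earn.

```python
import numpy as np

def clipped_noisy_sum(salaries, cap=1_000_000, epsilon=0.1):
    """Clip each value to [0, cap]; the clipped sum then has sensitivity cap,
    so Laplace noise of scale cap/epsilon suffices for epsilon-DP."""
    clipped = np.clip(salaries, 0, cap)
    return float(clipped.sum() + np.random.laplace(scale=cap / epsilon))
```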

Contingency Tables.

Zip    0-50K  50-100K  100-150K  150K+
53703  200    11       10        5
53706  18     5        65        200
53715  60     100      100       40

Noisy Contingency Tables.

Zip    0-50K  50-100K  100-150K  150K+
53703  205    8        9         7
53706  19     8        66        201
53715  59     97       98        40

Does this provide sufficient privacy?
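A sketch of how such a table might be produced: add independent Laplace noise of scale 2/ε to every cell (using the sensitivity of 2 stated earlier) and round. The slides do not specify the exact noise draw, so the output below will differ from run to run and from the numbers shown above.

```python
import numpy as np

# True counts from the table above: rows are ZIPs 53703, 53706, 53715;
# columns are the salary bands 0-50K, 50-100K, 100-150K, 150K+.
counts = np.array([[200,  11,  10,   5],
                   [ 18,   5,  65, 200],
                   [ 60, 100, 100,  40]])

epsilon, sensitivity = 1.0, 2.0
noisy = counts + np.random.laplace(scale=sensitivity / epsilon, size=counts.shape)
noisy = np.rint(noisy).astype(int)   # published table; cells may go slightly negative
```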

Exponential Mechanism. The exponential mechanism gives a more general way to release functions. Given input x, define a quality function q_x(y) over possible outputs that captures the desirability of outputting y: q_x(y) = 0 means a perfect match, and larger q values are less desirable. Define s = sensitivity of q. Output y with probability proportional to exp(−ε·q_x(y)). Claim (without proof): this process has (εs)-differential privacy. Note: the mechanism must range over all possible outputs for correctness, which may be very slow to compute if there are many possible outputs.

Exponential Mechanism for Median. Given input X = a set of n elements in the range {0, ..., U}. Define rank(x) = the number of elements less than x; the median is the x such that rank(x) = n/2. Set q(y) = |rank(y) − n/2|; the sensitivity of rank is 2. Use the exponential mechanism with q: all elements in a range [x_j, x_{j+1}] between consecutive data points have the same rank, and hence the same q value. Compute the probability of [x_j, x_{j+1}] as proportional to (x_{j+1} − x_j) · exp(−ε·|rank(x_j) − n/2|), then pick an element uniformly from [x_j, x_{j+1}]. Computing the median now takes time O(n), not O(U).
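A sketch of the interval-based sampling just described; this is one plausible reading of the slide, with the range endpoints lo and hi taken as assumed inputs and boundary handling kept simple.

```python
import numpy as np

def private_median(data, lo, hi, epsilon):
    """Exponential mechanism for the median over the range [lo, hi].
    Scores each interval between consecutive data points (all points in an
    interval share the same rank), samples an interval, then a point in it."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    edges = np.concatenate(([lo], x, [hi]))           # interval boundaries
    widths = np.diff(edges)                            # x_{j+1} - x_j
    ranks = np.arange(n + 1)                           # rank of any y in interval j
    scores = -epsilon * np.abs(ranks - n / 2)          # -eps * |rank(y) - n/2|
    weights = widths * np.exp(scores - scores.max())   # unnormalised, numerically stabilised
    j = np.random.choice(n + 1, p=weights / weights.sum())
    return np.random.uniform(edges[j], edges[j + 1])
```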

State of Anonymization. Data privacy and anonymization remains a subject of ongoing research (as of 2011). Many unresolved challenges: How can a social network release a substantial data set without revealing private connections between users? How can a video website release information on viewing patterns without disclosing who watched what? How can a search engine release information on search queries without revealing who searched for what? How can private information be released efficiently over large-scale data?

Concluding Remarks. Like cryptography, anonymization proceeds by proposing anonymization methods and attacks upon them. The difference: successful attacks on crypto reveal messages, whereas attacks on anonymization increase the probability of inference. Long-term goal: propose anonymization methods that resist all feasible attacks. Anonymization should not be the weakest link.