Introduction to Data Mining


Introduction to Data Mining: Privacy-Preserving Data Mining. Li Xiong. Slide credits: Chris Clifton; Agrawal and Srikant. 4/3/2011

Privacy-Preserving Data Mining. Privacy concerns about personal data: the AOL query log release, the Netflix challenge, data scraping.

A race to the bottom: privacy ranking of Internet service companies. A study by Privacy International into the privacy practices of key Internet-based companies: Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube.

A Race to the Bottom: Methodologies. Corporate administrative details; data collection and processing; data retention; openness and transparency; customer and user control; privacy-enhancing innovations and privacy-invasive innovations.

A race to the bottom: interim results revealed

Why Google? It retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limitation on subsequent use or disclosure; maintains records of all search strings with associated IP addresses and time stamps for at least 18-24 months; collects additional personal information from user profiles in Orkut; and uses an advanced profiling system for ads.

Remember, they are always watching

Some advice from privacy campaigners: Use cash when you can. Do not give out your phone number, social-security number or address unless you absolutely have to. Do not fill in questionnaires or respond to telemarketers. Demand that credit and data-marketing firms produce all information they have on you, correct errors and remove you from marketing lists. Check your medical records often. Block caller ID on your phone, and keep your number unlisted. Never leave your mobile phone on; your movements can be traced. Do not use store credit or discount cards. If you must use the Internet, encrypt your e-mail, reject all cookies and never give your real name when registering at websites. Better still, use somebody else's computer.

Privacy-Preserving Data Mining (diagram). Data obfuscation (non-interactive model): Original Data → Anonymization → Sanitized Data → Miner. Output perturbation (interactive model): Original Data → Access Interface → Perturbed Results → Miner.

Classes of Solutions. Methods: input obfuscation (perturbation, generalization) and output perturbation (differential privacy). Metrics: privacy vs. utility.

Data Perturbation. Randomization (additive noise); geometric perturbation and projection (multiplicative noise); randomized response technique (categorical data).

Randomization-Based Decision Tree Learning (Agrawal and Srikant '00). Basic idea: perturb data with value distortion. The user provides x_i + r instead of x_i, where r is a random value drawn either from a uniform distribution on [-α, α] or from a Gaussian distribution with µ = 0 and standard deviation σ. Hypothesis: the miner doesn't see the real data and can't reconstruct real values, but can reconstruct enough information to identify patterns.
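The value-distortion step is easy to sketch in code. Below is a minimal illustration (assuming NumPy; `perturb` and its default parameters are my own, not from the paper):

```python
import numpy as np

def perturb(values, method="uniform", alpha=20.0, sigma=10.0, seed=None):
    """Return x_i + r, with r drawn uniformly from [-alpha, alpha]
    or from a Gaussian with mean 0 and standard deviation sigma."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    if method == "uniform":
        noise = rng.uniform(-alpha, alpha, size=values.shape)
    else:  # "gaussian"
        noise = rng.normal(0.0, sigma, size=values.shape)
    return values + noise

ages = [30, 50, 25, 40]
noisy = perturb(ages, method="uniform", alpha=35.0, seed=0)  # e.g. 30 may become 65
```

Each user runs this locally, so only the distorted values ever leave their machine.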

Classification using Randomization (diagram): each user adds a random number to sensitive attributes before submission, e.g., Alice's age 30 becomes 65 (30 + 35); the randomized records (age, salary) are fed to the classification algorithm to build a model.

Output: a decision tree for buys_computer. Root: age? On <=30, test student? (no → no; yes → yes); on 31..40, predict yes; on >40, test credit_rating? (excellent → no; fair → yes). February 12, 2008 Data Mining: Concepts and Techniques

Attribute Selection Measure: Gini Index (CART). If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 − Σ_{j=1}^{n} p_j², where p_j is the relative frequency of class j in D. If D is split on attribute A into two subsets D1 and D2, the gini index of the split is defined as gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2), and the reduction in impurity is Δgini(A) = gini(D) − gini_A(D). The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node.
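As a concrete check of these formulas, here is a small sketch in plain Python (the helper names are mine):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, with p_j the relative frequency of class j in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2) for a binary split on A."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["yes", "yes", "no", "no"]
print(gini(labels))                               # 0.5: maximally impure 2-class node
print(gini_split(["yes", "yes"], ["no", "no"]))   # 0.0: the split yields pure children
```

A split that separates the classes perfectly drives the weighted impurity to zero, which is exactly the largest possible reduction in impurity for this node.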

Randomization Approach Overview (diagram): users randomize their records (Alice's age 30 becomes 65); the server reconstructs the distribution of Age and the distribution of Salary from the randomized values and then runs the classification algorithm to build a model.

Original Distribution Reconstruction. x_1, x_2, ..., x_n are the n original data values, drawn from n i.i.d. random variables with distribution X. Using value distortion, the given values are w_1 = x_1 + y_1, w_2 = x_2 + y_2, ..., w_n = x_n + y_n, where the y_i come from n i.i.d. random variables with distribution Y. Reconstruction problem: given F_Y and the w_i's, estimate F_X.

Original Distribution Reconstruction: Method. Bayes' theorem for continuous distributions gives the estimated density function (the minimum mean squared error estimator):

f_X(a) = (1/n) Σ_{i=1}^{n} f_Y(w_i − a) f_X(a) / ∫ f_Y(w_i − z) f_X(z) dz

Iterative estimation: the initial estimate f_X^0 (j = 0) is the uniform distribution; then iterate

f_X^{j+1}(a) = (1/n) Σ_{i=1}^{n} f_Y(w_i − a) f_X^j(a) / ∫ f_Y(w_i − z) f_X^j(z) dz

Stopping criterion: the difference between successive iterations is small.
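A discretized version of this iteration can be sketched as follows (assuming NumPy; the grid resolution, iteration cap, and tolerance are my choices, not the paper's):

```python
import numpy as np

def reconstruct(w, f_Y, grid, iters=100, tol=1e-6):
    """Iteratively estimate f_X on `grid` from perturbed values w_i = x_i + y_i,
    given the noise density f_Y (Agrawal-Srikant style, discretized)."""
    w = np.asarray(w, dtype=float)
    da = grid[1] - grid[0]
    fx = np.full(grid.size, 1.0 / (grid[-1] - grid[0]))  # uniform initial estimate
    K = f_Y(w[:, None] - grid[None, :])                  # K[i, a] = f_Y(w_i - a)
    for _ in range(iters):
        denom = (K * fx).sum(axis=1) * da                # integral of f_Y(w_i - z) f_X(z) dz
        new = fx * (K / denom[:, None]).mean(axis=0)     # Bayes update, averaged over i
        new /= new.sum() * da                            # keep it a proper density
        done = np.abs(new - fx).max() < tol
        fx = new
        if done:
            break
    return fx
```

For example, with uniform noise on [-10, 10] one would pass `f_Y = lambda t: (np.abs(t) <= 10) / 20.0`; the returned array approximates the density of the original values, not the noisy ones.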

Reconstruction of Distribution (plot): number of people (0-1200) vs. age (20-60) for the original, randomized, and reconstructed distributions.

Original Distribution Reconstruction

Original Distribution Reconstruction for Decision Trees. When are the distributions reconstructed? Global: reconstruct for each attribute once at the beginning, then build the decision tree using the reconstructed data. ByClass: first split the training data, reconstruct for each class separately, then build the decision tree using the reconstructed data. Local: first split the training data, reconstruct for each class separately, and reconstruct again at each node while building the tree.

Accuracy vs. Randomization Level (chart, function Fn 3): accuracy (40-100%) against randomization level (10-200) for the Original, Randomized, and ByClass settings.

More Results. Global performs worse than ByClass and Local. ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy. Overall, all are much better than the Randomized accuracy.

Privacy Metrics. Privacy metrics of random additive data perturbation. 4/3/2011 Data Mining: Principles and Algorithms

Unfortunately... random additive data perturbation is subject to data reconstruction attacks: the original data can be estimated using spectral filtering techniques. (H. Kargupta, S. Datta. On the privacy preserving properties of random data perturbation techniques, ICDM 2003.)

Estimating distribution and data values

Follow-up Work. Multiplicative randomization and geometric randomization are also subject to data reconstruction attacks, via known input-output pairs or known samples!

Data Perturbation. Randomization (additive noise); geometric perturbation and projection (multiplicative noise); randomized response technique (categorical data).

Data Collection Model. Data cannot be shared directly because of privacy concerns.

Background: Randomized Response. "Do you smoke?" Suppose the true answer is Yes. The respondent flips a biased coin with P(Head) = θ (θ ≠ 0.5): on Head, report the true answer (Yes); on Tail, report the opposite (No). The reported distribution then satisfies:

P'(Yes) = P(Yes)·θ + P(No)·(1 − θ)
P'(No) = P(Yes)·(1 − θ) + P(No)·θ
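These two equations invert directly, which is the whole trick: from the reported proportion of Yes answers one can estimate the true proportion without learning any individual's answer. A small simulation (assuming NumPy; the function names are mine):

```python
import numpy as np

def randomized_response(truth, theta, rng):
    """Each respondent reports the truth with probability theta, the opposite otherwise."""
    truth = np.asarray(truth, dtype=bool)
    tell_truth = rng.random(truth.size) < theta
    return np.where(tell_truth, truth, ~truth)

def estimate_p_yes(reports, theta):
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta); requires theta != 0.5."""
    p_reported = float(np.mean(reports))
    return (p_reported - (1.0 - theta)) / (2.0 * theta - 1.0)

rng = np.random.default_rng(0)
truth = rng.random(100_000) < 0.3              # 30% of respondents truly smoke
reports = randomized_response(truth, theta=0.8, rng=rng)
estimate = estimate_p_yes(reports, theta=0.8)  # close to 0.3
```

Note the trade-off already visible here: as θ approaches 0.5 the individual answers reveal almost nothing, but the denominator 2θ − 1 shrinks and the aggregate estimate becomes noisier.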

Decision Tree Mining using Randomized Response. Multiple attributes are encoded in bits. Flip a biased coin with P(Head) = θ (θ ≠ 0.5): on Head, report the true answer E (e.g., 110); on Tail, report the false answer !E (001). Column distributions can be estimated for learning a decision tree. (Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003.)

Generalization for Multi-Valued Categorical Data. With probability q_k, the true value S_i is reported as S_{i+k-1} (so q1 keeps the true value; q2, q3, q4 shift it cyclically to S_{i+1}, S_{i+2}, S_{i+3}). In matrix form:

[P'(s1)]   [q1 q4 q3 q2] [P(s1)]
[P'(s2)] = [q2 q1 q4 q3] [P(s2)]
[P'(s3)]   [q3 q2 q1 q4] [P(s3)]
[P'(s4)]   [q4 q3 q2 q1] [P(s4)]
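Since P' = M·P and the cyclic matrix M is public, the data collector can recover the true distribution by solving the linear system. A sketch (assuming NumPy; invertibility of M is assumed, which holds for the q values used here):

```python
import numpy as np

def rr_matrix(q):
    """Cyclic RR matrix: with probability q_k the true value s_j is reported
    as s_{j+k-1} (indices mod m)."""
    m = len(q)
    return np.array([[q[(i - j) % m] for j in range(m)] for i in range(m)])

def estimate_true_dist(p_reported, q):
    """Recover P from P' = M P by solving the linear system."""
    return np.linalg.solve(rr_matrix(q), p_reported)

q = [0.7, 0.1, 0.1, 0.1]
p_true = np.array([0.4, 0.3, 0.2, 0.1])
p_reported = rr_matrix(q) @ p_true
recovered = estimate_true_dist(p_reported, q)   # approximately p_true
```

In practice p_reported comes from noisy sample frequencies rather than the exact product, so the recovered distribution is an estimate whose variance grows as M gets closer to singular.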

A Generalization: RR Matrices [Warner 65], [R. Agrawal 05], [S. Agrawal 05]. The RR matrix M can be arbitrary:

    [a11 a12 a13 a14]
M = [a21 a22 a23 a24]
    [a31 a32 a33 a34]
    [a41 a42 a43 a44]

Can we find optimal RR matrices? (OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008.)

What is an optimal matrix? Which of the following is better?

     [1 0 0]        [1/3 1/3 1/3]
M1 = [0 1 0]   M2 = [1/3 1/3 1/3]
     [0 0 1]        [1/3 1/3 1/3]

Privacy: M2 is better. Utility: M1 is better. So, what is an optimal matrix?

Optimal RR Matrix. An RR matrix M is optimal if no other RR matrix's privacy and utility are both better than M's (i.e., no other matrix dominates M). Privacy and utility metrics: privacy is how accurately one can estimate individual information; utility is how accurately we can estimate aggregate information.
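The dominance relation in this definition is simple to state in code (a generic sketch; representing each matrix by a (privacy, utility) score with "higher is better" is my convention, not OptRR's):

```python
def dominates(a, b):
    """a = (privacy, utility) dominates b if it is at least as good on both
    objectives and strictly better on at least one (higher = better)."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_optimal(candidates):
    """Keep only the candidates dominated by no other candidate."""
    return [a for a in candidates if not any(dominates(b, a) for b in candidates)]

scores = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
front = pareto_optimal(scores)   # [(1, 3), (2, 2), (3, 1)]
```

The surviving set is exactly the Pareto front plotted on the next slides: no member can be improved on one objective without losing on the other.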

Optimization Algorithm: Evolutionary Multi-Objective Optimization (EMOO). Start with a set of initial RR matrices, then repeat in each iteration: mating (select two RR matrices from the pool); crossover (exchange several columns between the two matrices); mutation (change some values in an RR matrix); filter out the resultant matrices that fail the privacy bound; evaluate the fitness value of the new RR matrices. Note: the fitness value is defined in terms of the privacy and utility metrics.

Output of Optimization. The optimal set is often plotted in the objective space as a Pareto front (plot: utility vs. privacy, better toward the upper right; matrices on the front dominate those behind it).

Classes of Solutions. Methods: input obfuscation (perturbation, generalization) and output perturbation (differential privacy). Metrics: privacy vs. utility.

Data Re-identification (diagram): a record's sensitive attribute (Disease) can be linked to a Name by joining on the quasi-identifiers Birthdate, Sex, and Zip.

k-anonymity & l-diversity

Privacy-preserving data mining: generalization. Principles: k-anonymity, l-diversity, ... Methods: optimal; greedy; top-down vs. bottom-up.

Mondrian: Greedy Partitioning Algorithm. Problem: we need an algorithm to find multi-dimensional partitions, but optimal k-anonymous strict multi-dimensional partitioning is NP-hard. Solution: use a greedy algorithm based on k-d trees, with complexity O(n log n).
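The greedy recursion can be sketched in a few lines (assuming NumPy; this is a simplified variant that tries the widest dimension first and sends median ties to the left half, not the paper's exact pseudocode):

```python
import numpy as np

def mondrian(records, k):
    """Greedy median-split partitioning of an (n, d) quasi-identifier array.

    Returns a list of index arrays, each covering at least k records."""
    def split(idx):
        part = records[idx]
        spans = part.max(axis=0) - part.min(axis=0)   # range of each dimension
        for dim in np.argsort(spans)[::-1]:           # widest dimension first
            median = np.median(part[:, dim])
            left = idx[part[:, dim] <= median]
            right = idx[part[:, dim] > median]
            if len(left) >= k and len(right) >= k:    # allowable cut: both halves legal
                return split(left) + split(right)
        return [idx]                                  # no allowable cut: final partition
    return split(np.arange(len(records)))
```

Generalizing each returned partition's quasi-identifier values to their ranges then yields a k-anonymous table, since every partition holds at least k records.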

Example: k = 2; quasi-identifiers: Age, Zipcode. What should the splitting criterion be? (figure: a patient data table and its multi-dimensional partitioning)

Unfortunately... generalization-based principles and methods are subject to attacks: they are sensitive to background knowledge and attack dependent.

Classes of Solutions. Methods: input obfuscation (perturbation, generalization) and output perturbation (differential privacy). Metrics: privacy vs. utility.

Differential Privacy. Differential privacy requires the outcome to be formally indistinguishable when the computation is run with and without any particular record in the data set. (Diagram: D1 with Bob in, D2 with Bob out; a differentially private interface answers the user's query Q with A(D1) = Q(D1) + Y1 and A(D2) = Q(D2) + Y2.)

Differential Privacy: Laplace Mechanism. Answer Q(D) + Y, where Y is drawn from a Laplace distribution with scale ΔQ/ε, and ΔQ, the query sensitivity, is the maximum change in Q between data sets differing in a single record. (Diagram: as on the previous slide, with the noisy answers Q(D1) + Y1 and Q(D2) + Y2 returned through the differentially private interface.)
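A minimal sketch of the Laplace mechanism (assuming NumPy; the function name and the counting-query example are mine):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Answer Q(D) + Y with Y ~ Laplace(scale = sensitivity / epsilon)."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

# A counting query has sensitivity 1: adding or removing one record
# changes the count by at most 1, so the noise scale is 1/epsilon.
rng = np.random.default_rng(0)
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller ε means a stronger privacy guarantee but a larger noise scale, which is the privacy-utility trade-off this lecture keeps returning to.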

Coming up: data mining algorithms using differential privacy. Decision tree learning (Data Mining with Differential Privacy, SIGKDD '10); frequent itemset mining (Discovering Frequent Patterns in Sensitive Data, SIGKDD '10).

Midterm Exam. Adjusted mean: 85.3; adjusted max: 101. Your favorite topics: clustering, frequent itemset mining, decision trees. Your favorite assignment: Apriori. Your least favorites: SOM, Weka analysis.