SIMPLE AND EFFECTIVE METHOD FOR SELECTING QUASI-IDENTIFIER

Similar documents
A Review of Privacy Preserving Data Publishing Technique

Achieving k-anonmity* Privacy Protection Using Generalization and Suppression

An Efficient Clustering Method for k-anonymization

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

Survey Result on Privacy Preserving Techniques in Data Publishing

NON-CENTRALIZED DISTINCT L-DIVERSITY

Data Anonymization - Generalization Algorithms

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method

K ANONYMITY. Xiaoyong Zhou

Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching

Comparison and Analysis of Anonymization Techniques for Preserving Privacy in Big Data

Emerging Measures in Preserving Privacy for Publishing The Data

Survey of Anonymity Techniques for Privacy Preserving

CS573 Data Privacy and Security. Li Xiong

Comparative Analysis of Anonymization Techniques

K-Anonymity and Other Cluster- Based Methods. Ge Ruan Oct. 11,2007

Slicing Technique For Privacy Preserving Data Publishing

An efficient hash-based algorithm for minimal k-anonymity

(δ,l)-diversity: Privacy Preservation for Publication Numerical Sensitive Data

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud

Data Anonymization. Graham Cormode.

Survey of k-anonymity

Secured Medical Data Publication & Measure the Privacy Closeness Using Earth Mover Distance (EMD)

FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING

Efficient Algorithms for Masking and Finding Quasi-Identifiers

Enhanced Slicing Technique for Improving Accuracy in Crowdsourcing Database

Privacy Preserved Data Publishing Techniques for Tabular Data

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

Data Security and Privacy. Topic 18: k-anonymity, l-diversity, and t-closeness

Privacy Preserving in Knowledge Discovery and Data Publishing

Data attribute security and privacy in Collaborative distributed database Publishing

Distributed Data Anonymization with Hiding Sensitive Node Labels

Implementation of Privacy Mechanism using Curve Fitting Method for Data Publishing in Health Care Domain

A FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING

(α, k)-anonymity: An Enhanced k-anonymity Model for Privacy-Preserving Data Publishing

Service-Oriented Architecture for Privacy-Preserving Data Mashup

Anonymization Algorithms - Microaggregation and Clustering

MultiRelational k-anonymity

AN EFFECTIVE FRAMEWORK FOR EXTENDING PRIVACY- PRESERVING ACCESS CONTROL MECHANISM FOR RELATIONAL DATA

Efficient k-anonymization Using Clustering Techniques

Evaluating the Classification Accuracy of Data Mining Algorithms for Anonymized Data

The K-Anonymization Method Satisfying Personalized Privacy Preservation

SMMCOA: Maintaining Multiple Correlations between Overlapped Attributes Using Slicing Technique

Maintaining K-Anonymity against Incremental Updates

Privacy-Preserving of Check-in Services in MSNS Based on a Bit Matrix

Enhancing Fine Grained Technique for Maintaining Data Privacy

Improving Privacy And Data Utility For High- Dimensional Data By Using Anonymization Technique

Record Linkage using Probabilistic Methods and Data Mining Techniques

Privacy Preserving Data Sharing in Data Mining Environment

Injector: Mining Background Knowledge for Data Anonymization

Co-clustering for differentially private synthetic data generation

k-anonymization May Be NP-Hard, but Can it Be Practical?

Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust

Maintaining K-Anonymity against Incremental Updates

An Enhanced Approach for Privacy Preservation in Anti-Discrimination Techniques of Data Mining

IDENTITY DISCLOSURE PROTECTION IN DYNAMIC NETWORKS USING K W STRUCTURAL DIVERSITY ANONYMITY

CERIAS Tech Report

CERIAS Tech Report Injector: Mining Background Knowledge for Data Anonymization by Tiancheng Li; Ninghui Li Center for Education and Research

Datasets Size: Effect on Clustering Results

Research Paper SECURED UTILITY ENHANCEMENT IN MINING USING GENETIC ALGORITHM

Parallel Composition Revisited

Personalized Privacy Preserving Publication of Transactional Datasets Using Concept Learning

Pufferfish: A Semantic Approach to Customizable Privacy

An Efficient Approach for Leakage Tracing

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Data Distortion for Privacy Protection in a Terrorist Analysis System

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

GAIN RATIO BASED FEATURE SELECTION METHOD FOR PRIVACY PRESERVATION

Incognito: Efficient Full Domain K Anonymity

Relational Database Watermarking for Ownership Protection

Lightning: Utility-Driven Anonymization of High-Dimensional Data

m-privacy for Collaborative Data Publishing

Privacy-Preserving Machine Learning

On Privacy-Preservation of Text and Sparse Binary Data with Sketches

Utility-Based k-anonymization

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Leveraging Set Relations in Exact Set Similarity Join

Mining of Web Server Logs using Extended Apriori Algorithm

Introduction to Data Mining

Hiding the Presence of Individuals from Shared Databases: δ-presence

Survey Paper on Efficient and Secure Dynamic Auditing Protocol for Data Storage in Cloud

Security Control Methods for Statistical Database

Comparative Evaluation of Synthetic Dataset Generation Methods

Synthetic Data. Michael Lin

Privacy Preserving Data Mining: An approach to safely share and use sensible medical data

Alpha Anonymization in Social Networks using the Lossy-Join Approach

Deduplication of Hospital Data using Genetic Programming

Randomized Response Technique in Data Mining

Privacy Preserving Health Data Mining

Learning Social Graph Topologies using Generative Adversarial Neural Networks

Preserving Privacy during Big Data Publishing using K-Anonymity Model A Survey

Utility-Based Anonymization Using Local Recoding

Research Article Apriori Association Rule Algorithms using VMware Environment

Privacy-Preserving. Introduction to. Data Publishing. Concepts and Techniques. Benjamin C. M. Fung, Ke Wang, Chapman & Hall/CRC. S.

A Review on Privacy Preserving Data Mining Approaches

Information Security in Big Data: Privacy & Data Mining

Using Multi-objective Optimization to Analyze Data Utility and. Privacy Tradeoffs in Anonymization Techniques

Privacy Preserving Data Mining. Danushka Bollegala COMP 527

Mining for User Navigation Patterns Based on Page Contents

Towards the Anonymisation of RDF Data

Transcription:

31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. SIMPLE AND EFFECTIVE METHOD FOR SELECTING QUASI-IDENTIFIER 1 AMANI MAHAGOUB OMER, 2 MOHD MURTADHA BIN MOHAMAD 1 Faculty of Computing, University Technology Malaysia 2 Faculty of Computing, University Technology Malaysia E-mail: 1 Amani_ome@hotmail.com, 2 murtadha@utm.my ABSTRACT In this paper, a new method to select quasi-identifier (QI) to achieve k-anonymity for protecting privacy is introduced. For this purpose, two algorithms, Selective followed by Decompose algorithm, are proposed. The simulation results show that the proposed algorithm is better. Extensive experimental results on real world data sets confirm efficiency and accuracy of our algorithms. Keywords: K-anonymity, Privacy Preserving, Quasi-identifier. 1. INTRODUCTION The propagation of data along the Internet and access to fast computers with great memory capacities has increased the intensity of data compiled and disseminated about individuals[1]. Needs of this information is valuable in both research and business. Researcher needs for classification, analysis, statistics and computation. But, sharing and publishing the data may put the respondent s privacy at risk. Data publishing concerned with, authorized or proper disclosure of information to outside organizations or people [2]. Information should be disclosed only when specifically authorized and solely for the limited use specified. So, data holders need to release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful. The most common approach used to preserving the privacy, is by removing all information that can directly link data items with individuals. This process is referred to anonymization. However, the rest of attribute contain information that can be used to link with other data to infer identity of responders, those type of attribute call quasi identifier for example (age, sex, Zipcode ). Popular example which can uniquely determine about 87% of the population in United States. To overcome the problem of this type of linking via quasi-identifiers, k-anonymity concept was proposed [1]. This study focuses on the data representation and selection of quasi identifier. The current anonymity methods needs to distort a big amount of information during anonymization process which decrease the data utility. Incautious publication of quasi-identifiers will lead to privacy leakage. On the other hand, data set sometimes contains compact values as one attribute, like zip code and telephone number, those attributes used with other attribute to join the data of an individual so as to infer identity of individual, so our proposed algorithm work with a similar type of attribute to decompose the data into sub attributes. The existing work addressing formal selection of quasi identifier attribute [3]. This algorithms for finding keys/quasi-identifiers exploit the thought of using random samples to tradeoff between accuracy and space complexity, and can be watched as streaming algorithms. Other study addressing QI problem in [4].demonstrating the role of QI in k anonymity. 2. PROBLEM BACKGROUND Anonymity is always related to the identification of a user rather than the specification of that user. For instance, a user can be identified through his/her SSN but in the absence of an information source that associates that SSN with a specific identity, the user is still anonymous[5]. Ensuring proper anonymity protection requires the investigation of the following different issues [6]. Identity disclosure protection. Identity disclosure occurs whenever it is possible to re identify a user, called respondent, from the released data. Techniques for limiting the possibility of re identifying respondents should therefore be adopted. Attribute disclosure protection. Identity 512

31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. disclosure protection alone does not guarantee privacy of sensitive information because all the respondents in a group could have the same sensitive information. To overcome this issue, mechanisms that protect sensitive information about respondents should be adopted[7]. Record linkage attack is one of the major channels for violating privacy. To address the problem of record linkage attack, different techniques of statistical disclosure control are employed. One such approach called k-anonymity, works by reducing data across a set of key variables to a set of classes [1] [8]. Other variations of k- anonymity can be found in [7] [9] [1] [11]. In a k- anonymized dataset each record is indistinguishable from at least k-1 other records. Therefore, an attacker cannot link the data records to population units with certainty thus reducing the probability of disclosure. However, preserving privacy through statistical disclosure control techniques leads to loss of a big amount of information to satisfy the privacy requirement. Most of the techniques proposed in literature do not focus on the information loss issues. Rather than privacy and computing time to achieve k-anonymity. The method described in this paper maintains a balance between information loss and privacy. Introduced of formal selection of quasi identifier attribute [3] followed by decomposition algorithm deployed to achieve the balance between information loss and privacy. 3. DATA PRE-PROCESSING AND SELECTIVE ALGORITHM The first objective is to minimize loss of data during anonymization process by filtering out tuple with missing data, un-known data and duplicate tuple. Into the preprocessing stage. Secondly, we seeks to identify quasi identifier attribute, significant minimal attribute subset and to evaluate its significance in terms of personal identity. The initial investigation was aimed at finding a basic attribute subset that is appropriate to identify the maximum number of tuples in the dataset. Insufficient selection results using randomly attribute subset leads to an attribute investigation to find specific attribute subsets identifying each tuple of the dataset. Figure 1: gives an overview of the experimental procedure for quasi identifier attribute selection. Collect data from a user Data Set Anonmyize dataset for identiity in table Remove all key /keys attribute from the table Anonmyize dataset for outer tables Nominate set of attributes with full of dependence of person which may be found in different recourses of data (data owners) Powerset of nominated attribute Find the power set of the set of nominated attribute Anvestigation of attribute For each element/elements in the power set, generating table to contain all elements with their total distinct values in table Selection of quasi identifer attribute The element from the power set with maximum number of tuple will be the set of QI attribute output new data set without key attribute and with selective QI attribute Theme Selective algorithm of proper quasi identifier attribute Figure 1: Selective Quasi identifier attributes To select quasi identifier attribute, firstly we nominate multiple attribute as set, and then we generate P(S) from the set to collect all possible combinations of the attribute. Each element in the set examination by the distinct value in the table, the candidate element from the power set will be the element with maximum distinct value, if the maximum distinct value duplicates in many element, we select the element with minimum attribute. The intuition is to identify a minimal set of attributes from T that has the ability to (almost) distinctly identify a record and the ability to separate two data records This study present formal selection procedure depends on probability of ability to infer the identity in the table. Although quasi identifier attributes are an input of any algorithm to anonymize data, but the formal selection of them is still not researched. 3.1 Selective Quasi identifier attributes Algorithm Steps Step 1: Nominate set of attributes with full of dependence of person that may be found in different recourses of data (data owners). 513

31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. Step 2: Find P(S) of the set of nominated attribute. Number of tuples Step 3: For each element/elements in the P(S), generating table to contain all elements with their total distinct values in table. 1 8 881 Step 4: Element of P(S) with maximum number of tuple will be the set of QI attribute, if there is more than one element, select the element with minimum number of attribute. 4. EXPERIMENTAL RESULT 6 4 2 2 5 1 Sex 449 182 91 Age,Sex Different subset of dataset are used for experimental with different scenarios, Figure 2 shows the distinct number of tuples for each combination of Census_Income subset where Figure 3 d.n.o. for each combination of adult dataset. Lastly figure 4 shows combination for both Census_income and Adult dataset. Details of experiment as follow: Algorithm name: Selective algorithm Data Set: Census -income Total number of tuple: 199523 Nominated set: (Age, Sex, Mace) P(S): (Age, Sex, Mace, (Age, Sex), (Age, Mace), (Sex, Mace), (Age, Sex, Mace)) Table 1: Census Income-Number of Tuples Element Number of Tuples Age 91 Sex 2 Mace 5 Age, Sex 182 Age, Mace 449 Sex, Mace 1 Age, Sex, Mace 881 Figure 2: Census -Income Number of Tuples Data Set: Adult dataset Total number of tuple: 32561 Nominated set (Zip,Sex,Race) P(S): (Zip, Sex, Race,(Zip,Sex), (Zip, Race),(Sex, Race), (Zip, Sex, Race)) Table 2: Adult- Number of Tuples Number of tuple Element1 Number of Tuples Sex 2 Race 5 Sex, Race 1 Zip 21648 Zip, Sex 2219 Zip, Race 21942 Zip, Sex, Race 22188 Number of tuple 21648 2219 21942 22188 2 5 1 Sex Sex,Race Zip,Sex Zip,Sex,Race Figure 3: Adult -Number of Tuples 514

31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. 25 2 15 1 5 Sex Adult And Census Zip,Sex Number of tuple Adult Number of tuple Census Figure 4: Adult and Census -Income Number of Tuples From experimental we found that only one attribute with highest ability to infer the identity, normally it is one of continuous attribute and by joining any other attribute with it, will increase this ability, namely, in the adult dataset it is Age attribute, and Zip attribute in Census dataset. To overcome the problem of continuous attribute we proposed decomposed attribute algorithm. 5. DECOMPOSER ALGORITHM FOR DATA REPRESENTATION 1 8 6 4 2 Although k-anonymity is a concept that applies to any kind of data, for simplicity its formulation considers data represented by a relational table. Formally, let A be a set of attributes, D be a set of domains, and Dom: A D be a function that associates with each attribute A A a domain D=Dom (A) D, containing the set of values that A can assume. A tuple to over a set {A 1... A p } of attributes is a function that associates with each attribute Ai value v Dom (Ai), i=1... P. DEFINITION 1: (Relational table) let A be a set of attributes, D be a set of domains, and Dom: A D be a function associating each attribute with its domain. A relational table T over a finite set {A1,.,A p } A, of attributes, denoted T(A1,.., A p ) is a set of tuples over the set {A 1,., A p } of attributes. Notation Dom (A, T) denotes the domain of attribute A in T, T denotes the number of tuples in T, and t represents the value v associated with attribute A in T. Similarly, t denotes the sub-tuple of t containing the values of attributes {A 1... A k }. By extending this notation, T represents the subtuples of T containing the values of attributes {A 1... A k }, that is the projection of T over {A1... A k }, keeping duplicates. DEFINITION 2: (Domain generalization relationship) let Dom be a set of ground and generalized domains. A domain generalization relationship, denoted D, is a partial order relation on Dom that satisfies the following conditions: C1: Di, Dj, Dz Dom: Di D Dj,Di D Dz Dj D Dz Dz D Dj C2: all maximal elements of Dom are singleton Condition C1 states that for each domain Di, the set of its generalized domains is totally ordered and each Di has at most one direct generalized domain, Dj. This condition ensures determinism in the generalization process. Condition C2 ensures that all values in each domain can always be generalized to a single value. The definition of the domain generalization relationship implies the existence, for each domain D Dom, of a totally ordered hierarchy, called domain generalization hierarchy and denoted DGHD. Each DGHD can be graphically represented as a chain of vertices, where the top element corresponds to the singleton generalized domain, and the bottom element corresponds to D. Figure 5: shows an example of DGH. Figure 5[1]: Examples of DGH Figure 5 shows an example of domain generalization hierarchies for attributes ZIP, Sex, and Marital Status. DEFINITION 3: (Value generalization relationship) denoted V, can also be defined that associates with each value Vi Di a unique value Vj Dj, where Dj is the direct generalization of Di. The definition of the value generalization relationship implies the existence, for each domain D Dom, of a partially ordered hierarchy, called value generalization hierarchy and denoted VGHD. Each VGHD can be graphically represented as a tree, where the root element corresponds to the unique value in the top domain in DGHD, and the leaves correspond to the values in D. Figure 4.2 shows an example of value generalization 515

31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. hierarchies for attributes ZIP, Sex, and Marital Status. Figure 6 shows an example of value generalization hierarchies for attributes ZIP, Sex, and Marital Status. Figure 6[1]: example of value generalization hierarchies data set sometimes contains compact values as one attribute, like zip code system which includes (State- city-local address) and telephone number system (country code- area code- personal number) those attributes used with other attribute to join the data of an individual so as to infer the Identity of individual, So decompose algorithm work only with a specific type of attribute to split the data into many attribute 5.1 Code Systems for Numbering It is clear that any international code must have numbering system, for example, Telephone Code Number System Malaysia (61) 6 1... Another example Postal or zip code System Indonesia (124 or 1241 1 2 4... 1 2 4 1... If we split the data in two different table the first one contain the classification according to the system number and the rest of value in another attribute with the same name, for zip code it can be decomposed into 2 digit number as the system in US to represent regional class of zip code, and the rest of code will represent local zip code. Table 4 represent distinct values of sample table 3 for the same number of a tuple. Table 3: ZIP Code Structure Zip State Code Zip Code 28496 28 496 32214 32 214 32275 32 275 28781 28 781 51618 51 618 51835 51 835 54835 54 835 54352 54 352 59496 59 496 59951 59 951 As a result of decomposition algorithm, Stat Code attribute can substitute by identification number of each state in separated table or state name, new Zip code with less number of digits can be generalized or used in data anonymity Table 4: Distinct value of table 3 attributes State Zip Zip Code Code 1 5 8 Table 4: show that the ability to identify each tuple is Zip is 1%, but when we split the Zip to state Code and Zip Code the ability is decreased to 5% in state Code and to 8% in new Zip Code. 6. RESULTS AND DISCUSSION The two algorithms were tested on datasets obtained from UCI Machine Learning Repository: the adult dataset has 32,561 records and 15 attributes of which three attributes (Zip, Race and, Sex) were considered to be quasi-identifiers. The goal is to go to outliers because it contains the value which needs to suppress or change. Total number of distinct record of QI Adult dataset are 168, total number of adult dataset record 3256. The result of algorithm demonstrated in Figure 7 which shows the relation between outlier before and after applying the algorithm for 3 QID and Figure 8 for 2 QID and Figure 9 appalling for only zip cod. 516

31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. Z level no of record outlier 2 1 1 2 3 4 5 6 method does not consider what attributes the adversary could potentially have. If the adversary can obtain a bit more information about the target victim beyond the minimal set, then he may be able to conduct a successful linking attack. So the choice of QI remains an open issue. REFERENCES: Figure 7: Result of Outlier for Zip, Sex, Rase QI= 3 Result after method of quasi-identifier representation 2 1 2 4 6 Figure 8: Result of Outlier for sex, Rase QI =2, for 2 15 1 5 32561 Tuples Figure 9: outlier of attribute Zip code before and after appalling the method. 7. CONCLUSION No Of record outlier No Of record outlier level level 1 level 2 level 3 level 4 ILoss Ater outlier Iloss Before In this paper, we present simple algorithms for selecting QID [3] followed by decompose algorithm. From the results we show that our method is decreasing loss of information which affect directly of data utility, nonetheless, the minimal set of QI does not imply the most appropriate privacy protection setting because the [1]. Sweeney, L., k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 22. 1(5): p. 557-57. [2]. Fung, B., et al., Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR), 21. 42(4): p. 14. [3]. Motwani, R. and Y. Xu. Efficient algorithms for masking and finding quasi-identifiers. in Proceedings of the Conference on Very Large Data Bases (VLDB). 27. [4]. Bettini, C., X.S. Wang, and S. Jajodia, The role of quasi-identifiers in k-anonymity revisited. arxiv preprint cs/61135, 26. [5]. Gokila, S. and P. Venkateswari, ASurvey ON PRIVACY PRESERVING DATA PUBLISHING. International Journal on Cybernetics & Informatics ( IJCI) Vol. 3, No. 1, February 214, 214. [6]. Meyerson, A. and R. Williams. On the complexity of optimal k-anonymity. in Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 24. ACM. [7]. Machanavajjhala, A., et al., l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 27. 1(1): p. 3. [8]. Bayardo, R.J. and R. Agrawal. Data privacy through optimal k-anonymization. in Data Engineering, 25. ICDE 25. Proceedings. 21st International Conference on. 25. IEEE. [9]. Li, N., T. Li, and S. Venkatasubramanian. t- closeness: Privacy beyond k-anonymity and l- diversity. in Data Engineering, 27. ICDE 27. IEEE 23rd International Conference on. 27. IEEE. [1]. Sun, X., et al., Enhanced p-sensitive k- anonymity models for privacy preserving data publishing. Transactions on Data Privacy, 28. 1(2): p. 53-66. [11]. i, T. and N. Li, Towards optimal k- anonymization. Data & Knowledge Engineering, 28. 65(1): p. 22-39. 517