
1 Privacy Preserving Data Sanitization and Publishing
by
A. N. K. Zaman
A Thesis presented to The University of Guelph
In partial fulfillment of requirements for the degree of Doctor of Philosophy in Computer Science
Guelph, Ontario, Canada
© A. N. K. Zaman, December, 2017

2 ABSTRACT
Privacy Preserving Data Sanitization and Publishing
A. N. K. Zaman, University of Guelph, 2017
Advisor: Dr. Charlie Obimbo
Recent trends have shown a drastic increase in the large data repositories held by corporations, governments, and healthcare organizations. According to Bernard Marr of Forbes Tech magazine (2015), the data created in 2014/15 alone was twice that created in the entire previous history of the human race. Data sharing is beneficial in areas such as healthcare services and collaborative research. However, there is a significant risk of compromising sensitive information, for example through de-anonymization. Privacy Preserving Data Publishing (PPDP) is a way to allow one to share sanitized data while ensuring protection against identity disclosure of an individual. Removing explicit identifiers/personally identifiable information (PII) from a data set and making the data set compliant with the Health Insurance Portability and Accountability Act (HIPAA) does not guarantee the privacy of data donors. Data sanitization may be achieved in different ways, by k-anonymization, l-diversity, or δ-presence, to name but a few; however, the differential privacy paradigm provides the strongest privacy guarantee for sanitized data publishing. This research proposes

3 two privacy preserving algorithms that satisfy the ε-differential privacy requirement and adopt the non-interactive privacy model for sanitizing and publishing data. Along with differential privacy, generalization and suppression of attributes are applied to impose privacy and to prevent re-identification of the records of a data set. The key contributions of this thesis are: 1) the proposed algorithm adopts the non-interactive model for data publishing; as a result, data miners have full access to the published data set for further processing, which promotes data sharing in a safe way; 2) the algorithm can sanitize micro-data and/or HIPAA-compliant data sets for publishing; 3) the published data is independent of an adversary's background knowledge; 4) the algorithm is independent of the choice of quasi-identifiers (QIDs); and finally, 5) it protects the published data set from re-identification risk. Data sanitized and published using the proposed algorithm is shown to have higher usability, in terms of classification accuracy, than in other existing works, and a significantly reduced risk of re-identification.

4 Dedication
Dedicated to those who lost their lives and those who survived but were left physically and/or mentally handicapped for the rest of their lives.
The Rana Plaza Tragedy (2013) * Savar * Dhaka * Bangladesh iv

5 Acknowledgements
In the name of God, the Most Beneficent, the Most Merciful. First and foremost, I would like to express my sincere appreciation and gratitude to my advisor, Dr. Charlie Obimbo, who always offered valuable support, understanding, and encouragement. His enthusiasm inspired me to research and write this thesis, and I will forever be grateful to him. I would also like to extend my sincere gratitude to Dr. Rozita Dara for her advice and support during this journey. I am sincerely grateful to the members of my advisory committee, Dr. David Chiu and Dr. Radu Muresan, who provided me with their feedback throughout this journey. Next, I would like to extend my admiration to my loving wife, Majida, my beloved son, Rafan, and my loving daughter, Safa, for their constant love, encouragement, support, and sacrifice. Finally, I would like to express my gratitude to my caring mother, Mabia Khatun, for her love, well wishes, and inspiration. I also would like to remember my beloved father, the late Mr. Saifuddin Ahmad, for his love and courage throughout my entire life. v

6 Contents
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization
  1.4 Publications Related to This Thesis
2 Literature Review
  2.1 Introduction
  2.2 Preliminaries
  2.3 Privacy Models and Different Attacks
  2.4 Record Linkage Attack
    2.4.1 k-Anonymity
    2.4.2 (X, Y)-Anonymity
    2.4.3 MultiRelational k-Anonymity
    2.4.4 Discussion
  2.5 Attribute Linkage Attacks
    2.5.1 l-Diversity
    2.5.2 t-Closeness
    2.5.3 Confidence Bounding Attack
  2.6 Table Linkage Attacks
    2.6.1 δ-Presence
  2.7 Probabilistic Attack
    2.7.1 (c, t)-Isolation
    2.7.2 (d, λ)-Privacy
    2.7.3 ε-Differential Privacy
  2.8 Anonymization Mechanisms
    2.8.1 Generalization
    2.8.2 Suppression
  2.9 Bucketization
  2.10 Perturbation
  2.11 Conclusion
3 Methodology
  3.1 Proposed System and Experimental Design
    3.1.1 Privacy Constraint
    3.1.2 Laplace Mechanism
    3.1.3 Anonymization
    3.1.4 Data Flow Diagram of the Proposed System
  3.2 Utility Measures
    3.2.1 Classification Accuracy
    3.2.2 Re-identification Risk
  3.3 Conclusion
4 Sanitizing and Publishing Electronic Health Record
  4.1 Introduction
  4.2 Problem Definition
  4.3 Related Work
  4.4 Proposed Algorithm
  4.5 Working Example
  4.6 Data Sets
    4.6.1 Data Set Preprocessing
  4.7 Results and Discussions
    4.7.1 Risk of Re-identification
    4.7.2 Scalability
  4.8 Conclusion
5 Sanitizing and Publishing Real-World Data Set
  5.1 Introduction
  5.2 Problem Definition
  5.3 Related Works
  5.4 Proposed Algorithm
  5.5 Working Example
  5.6 Data Sets
  5.7 Result and Discussion
    5.7.1 Risk of Re-identification
    5.7.2 Scalability
  5.8 Conclusion
6 Conclusion and Future Work
  6.1 Limitations of Existing Systems
  6.2 Summary of Contributions
  6.3 Future Work
A Mathematical Symbols Used in Thesis 117

11 List of Tables
2.1 Examples of Explicit Identifiers, QIDs, and Sensitive Attributes
2.2 Patient Table
2.3 External Table Contains Person Specific Data
2.4 3-Anonymous Patient Table
2.5 A published patient data table T1
2.6 A published patient data table T2
2.7 Data table formed by joining T1 and T2
2.8 Patients Micro Data
2.9 Patients Generalized Data
2.10 3-anonymous Patient Table
2.11 Different privacy preserving algorithms and attacks [33][41][97]
4.1 Sample small data set
4.2 Anonymized form of the sample data set
4.3 Anonymized form of the sample data set
4.4 Noisy frequencies for the sanitized data
4.5 Data Set Descriptions
4.6 Attributes of the Doctor's Bills Data Set V1
4.7 Attributes of the Haberman's Survival Data Set
4.8 Classification Accuracy for the Doctor's Bill V1 Data Set
4.9 Classification Accuracy for the Haberman's Survival Data Set
4.10 Comparison of the re-identification risk between sanitized and non-sanitized data sets
5.1 Anonymized form of the sample data set with group frequencies
5.2 Noisy frequencies for the sanitized data
5.3 Attributes of the Adult Data Set
5.4 Attributes of the Doctor's Bills Data Set V2
5.5 Classification accuracy using the Decision Tree classifier for the Adult Data Set
5.6 Classification Accuracy for the Doctor's Bill V2 Data Set
5.7 Comparison of the re-identification risk between sanitized and non-sanitized data sets

13 List of Figures
2.1 Data collection, anonymization, and application areas
2.2 Presenting quasi-identifiers, linked to re-identify personal data [1][82]
2.3 Taxonomy trees for profession, gender, and age
3.1 Suppression of Zip Codes of two German Cities
3.2 Data Flow of the Proposed Algorithms
4.1 A sample Doctor's Bill from the Data Set
4.2 Classification Accuracy for the Doctor's Bill V1 Data Set
4.3 Classification Accuracy for the Haberman's Survival Data Set
4.4 Comparisons among the proposed algorithm and five other algorithms
4.5 Risk of Re-identification for the Raw Doctor's Bill V1 Data Set
4.6 Risk of Re-identification for the Sanitized Doctor's Bill V1 Data Set
4.7 Risk of Re-identification for the Raw Haberman's Survival Data Set
4.8 Risk of Re-identification for the Sanitized Haberman's Survival Data Set
4.9 Runtime for the 2LPP Algorithm
5.1 Classification Accuracy for the Adult Data Set
5.2 Classification Accuracy for the Doctor's Bill V2 Data Set
5.3 Comparisons among the proposed algorithm and five other algorithms
5.4 Risk of Re-identification for the Raw Doctor's Bill V2 Data Set
5.5 Risk of Re-identification for the Sanitized Doctor's Bill V2 Data Set
5.6 Risk of Re-identification for the Raw Adult Data Set
5.7 Risk of Re-identification for the Sanitized Adult Data Set
5.8 Runtime for the ADiffP Algorithm

15 Chapter 1 Introduction
The huge increase in large data repositories held by corporations, governments, and healthcare organizations has given impetus to the development of information-based decision-making systems. Various interested parties mine trends and patterns from these data sets to improve and design customer services. As a result, data sharing is essential. However, data custodians have legal and ethical responsibilities to maintain the privacy of the data donors.
1.1 Motivation
Data breaches have also increased tremendously, which is not only alarming but has also affected personal lives, governments, and businesses in many ways. Some of the effects include identity theft, financial losses, and interference with political elections. According to the Verizon Data Breach Investigations Report [89], in 2016 there were 3,141 confirmed cases of data breaches. In a recent data-breach case, a malicious user publicly re- 1

16 leased personally identifiable information (PII) of 112,000 French police officers on Google Drive [14]. The revelation of indirect information such as postal code, gender, and race can also make a person vulnerable to exposure by an intruder; such attributes are called quasi-identifiers (QIDs). Data breaches occur in all areas, such as healthcare, academia, banking, and retail; however, our focus will be on healthcare data. This research proposes a privacy preserving algorithm to publish sanitized data in order to promote data sharing for designing and implementing public-spirited policies that expedite effective services and development. Removing explicit identifiers from a data set and making the data set compliant with the Health Insurance Portability and Accountability Act (HIPAA) [43] or a similar regulation does not guarantee the privacy of data donors. To extract knowledge from data, different parties such as researchers and marketers need to process and share data for their own benefit. Data sharing methods and the use of the shared data among interested parties are controlled by certain guidelines and policies. To protect data donors' privacy and to prevent misuse of data, removing identifying attributes such as the names, social insurance numbers, and addresses of individuals is a common practice before releasing any data. However, this simplified method is not adequate to ensure the privacy of record owners/donors. The following section presents some real-world examples to highlight the necessity of privacy preserving methods and to clarify the obstacles to developing such techniques to preserve person-specific data privacy. 2

17 Montjoye et al. [29] studied three months of credit card records of 1.1 million individuals and uniquely identified 90% of the record owners by analyzing the spatiotemporal information. They also reported that knowing the exact price of an item increases the re-identification risk by 22%, and that women are more identifiable than men from credit card metadata. A person's credit card buying pattern thus makes his/her privacy vulnerable. Another example concerns the de-identification [47] of the Resident Registration Number (RRN) of South Koreans. The RRN is a 13-digit number that encodes demographic information, and its pattern is publicly known. Sweeney and Yoo [83] conducted an experiment on 23,163 prescription records that contained weakly encrypted RRN codes. The authors reported that they were able to de-anonymize 100% of the data, and concluded that encrypted national identifiers are also vulnerable. In a similar study, Song et al. [79] showed that improper use of the RRN makes Korean individuals vulnerable. In 2013, Sweeney [81] collected a health data set for the year 2011 from Washington State that did not contain patients' names or addresses (zip codes). However, the author linked newspaper stories from the same year containing the keyword "hospitalized" to the data set and was able to identify 43% of the individuals in it. Earlier, Sweeney [80] presented an attack that breaks person-specific privacy by linking the medical data (of state employees) collected by the Group Insurance Commission (GIC), Massachusetts, US, with the Massachusetts voter registration list. A medical data set was distributed by the GIC to researchers that contained demographic information such as 3

18 gender, postal code, and date of birth. A copy of the voter registration list of Massachusetts was bought by the author and then combined with the GIC health data set. She was able to identify the former governor of the state of Massachusetts, William Weld. Sweeney showed that, on the basis of gender, 5-digit postal code, and date of birth, 87% of the U.S. population is unique; i.e., only 13% of the population share their combination of zip code, gender, and date of birth with someone else. An attack that uses an external data set to identify a person in an anonymized data set is referred to as a linking attack. These kinds of attacks have become widespread and are a source of concern, since it is now fairly easy to collect external data from the internet. In 2006, a compressed text file containing twenty million keywords from the search history of more than 650,000 AOL users over a three-month time slot was released by AOL Research [12]. A numeric key was assigned as an ID for every searcher; however, a 62-year-old widow from Georgia, Thelma Arnold, was identified by The New York Times from the queries associated with her numeric ID. In this case, the metadata of the data set (here, the search keywords) disclosed the identity of the user. The search keywords included "landscapers in Lilburn, GA" and searches for a number of persons having the last name Arnold. Another query, "homes sold in Shadow Lake Subdivision Gwinnett county Georgia", helped to identify Thelma. Netflix, one of the largest movie rental companies in the world, also once made its users vulnerable. Netflix released a data set [68] of anonymous movie ratings from 500,000 of its subscribers, referred to as the Netflix Prize data set. According to the Netflix website, "To protect customer privacy, all personal information identifying individual customers 4

19 have been removed and all customer IDs have been replaced by randomly-assigned IDs." Narayanan and Shmatikov [65] applied a de-anonymization technique to the Netflix data using background knowledge from the Internet Movie Database (IMDb) site, where users post non-anonymous reviews. They were able to identify 99% of the users precisely. From the examples and discussion above, it is clear that the mere removal of person-specific information does not guarantee the privacy of a data donor. Robust data sanitization techniques are needed to preserve person-specific privacy while keeping the data useful for knowledge mining. Privacy preserving data publishing [70] is important for the following reasons:
- To adhere to legal obligations to prevent data breaches by following laws such as the Health Insurance Portability and Accountability Act (HIPAA) in the U.S. [43] and the Personal Health Information Protection Act (PHIPA) [2] in Canada.
- To share data among organizations and their partners without disclosing the privacy of any individual.
1.2 Contributions
This research has the following key contributions: Whereas current systems are interactive, in other words users have to query the data set and await the response, sometimes being limited in the response they get, 5

20 the system built adopts a non-interactive model for data sanitization and release, so that data miners have complete access to the sanitized, released data for further processing. The proposed algorithm fulfills ε-differential privacy [33], and Laplace noise is added to sanitize data sets. Differential privacy prevents an attacker from learning any information associated with a particular person in a data set. In Chapter 3, ε-differential privacy is discussed in detail. Generalization and suppression techniques are applied to achieve anonymization; this helps prevent association of the sanitized data set with external data sets (e.g., data from a social network). Generalization is done by substituting an original attribute value with a more generalized form of that value according to the characteristics of that attribute (e.g., age 47 may be substituted by the range 45-50). A suppression operation replaces an attribute value fully or partially by a special symbol (e.g., * or Any), which indicates that the value has been suppressed. Any data set sanitized with the proposed algorithm will be free from re-identification using quasi-identifiers (QIDs). QIDs are a set of attributes in a data set that are used to identify an individual with the help of external knowledge (please see Figure 2.2). The proposed algorithms do not treat any particular attributes as QIDs, in order to avoid syntactic processing of a data set. 6

21 The proposed algorithm can handle real-life data sets containing categorical, numerical, and set-valued attributes. Data sets sanitized and published using the proposed algorithm remain usable for classification. The proposed algorithm de-identifies a data set in a secure way, so that the risk of re-identification is very low, which means the data set is safe to publish.
1.3 Organization
The rest of the thesis is organized as follows: Chapter 2 presents a literature review of the area of privacy preserving data publishing (PPDP), including existing privacy models, distinct privacy preserving algorithms, various types of attacks, and privacy breaches. Chapter 3 presents the theoretical background of the differential privacy paradigm and Laplace noise. It also discusses the techniques used to measure the data usability of the sanitized and published data sets. Chapter 4 presents the proposed two-layer privacy preserving (2LPP) algorithm for sanitizing and publishing health care data sets. The usability of the sanitized and published data sets is also presented there. 7

22 Chapter 5 presents the proposed adaptive differential privacy (ADiffP) algorithm for sanitizing and publishing a census data set and another data set that contains a set-valued attribute. The usability of the sanitized and published data sets is also presented there. Chapter 6 presents the concluding remarks of this thesis and suggests some future directions for this research.
1.4 Publications Related to This Thesis
1. A. N. K. Zaman, C. Obimbo, and R. A. Dara, "An improved differential privacy algorithm to protect re-identification of data," in Proceedings of the IEEE Canada International Humanitarian Technology Conference (IHTC 2017), Toronto, Ontario, Canada, July 20-22, 2017, pp. (Best Paper Award)
2. A. N. K. Zaman, C. Obimbo, and R. A. Dara, "An Improved Data Sanitization Algorithm for Privacy Preserving Medical Data Publishing," in Proceedings of the Advances in Artificial Intelligence: 30th Canadian Conference on Artificial Intelligence, Canadian AI 2017, Edmonton, AB, Canada, May 16-19, 2017, pp. Cham: Springer International Publishing.
3. A. N. K. Zaman, C. Obimbo, and R. A. Dara, "A novel differential privacy approach that enhances classification accuracy," in Proceedings of the Ninth International C* 8

23 Conference on Computer Science & Software Engineering, ACM C3S2E '16, Porto, Portugal, July 20-22, 2016, pp.
4. A. N. K. Zaman and C. Obimbo, "Privacy preserving data publishing: A classification perspective," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 5, no. 9, pp.
5. A. N. K. Zaman, C. Obimbo, R. A. Dara, and David Chiu, "Minimizing re-identification risk of personal medical data," IEEE Consumer Electronics Magazine, Special Issue on Humanitarian Technology. (To be Submitted) 9

24 Chapter 2 Literature Review
2.1 Introduction
Over the past decade, the rate at which governments and corporations have collected their citizens' and customers' data containing private information has grown exponentially. These data create opportunities for developing knowledge- and information-based decision making systems by means of data mining. Thus, the publication of these data enables them to be shared with various parties. For example, all California-based licensed hospitals have to submit person-specific data (date of birth, admission and release dates, Zip code, principal language spoken, etc.) of all discharged patients to the California Health Facilities Commission, which makes that data available to interested parties (e.g., insurers, researchers) to promote Equitable Healthcare Accessibility for California [11]. In 2004, the Information Technology Advisory Committee of the President of the United States 10

25 published a report titled Revolutionizing Health Care through Information Technology [26], which emphasized the importance of implementing a nationwide electronic medical record (EMR) system to promote and encourage medical knowledge sharing through computerized clinical decision support systems. Publishing data is beneficial in many other areas. As discussed earlier (in Chapter 1), in 2006 Netflix (an online DVD rental company) published a movie-ratings data set of 500,000 subscribers to encourage research on improving movie recommendation accuracy on the basis of personal movie preferences [68]. In October 2012, the Canadian and United States governments started a pilot project called the Entry/Exit pilot project [20]. The intent of this project is to share the biographic data of travelers who cross the USA/Canada border between the Canada Border Services Agency (CBSA) and the Department of Homeland Security (DHS). This is an example of data sharing between two governments. In general, an individual may not be willing to share his/her personal information with another person, group, or society due to his/her privacy concerns. Privacy should be considered a privilege, so that an individual is able to prevent his/her information from becoming public. Privacy has many aspects, such as physical, organizational, intellectual, and informational. This thesis deals with informational privacy related to personal data. In 1890, Warren and Brandeis [92] published their concerns about privacy in response to technological improvements in photography and faster newspaper printing. The authors regarded privacy as related to the inviolate personality of an individual. Nowadays, privacy is 11

26 regarded as a basic human right; however, the notion of privacy varies in different contexts. Here are a few definitions of privacy given by leading researchers in this area: Westin wrote in [93]: Privacy is the claim of individuals, groups or institutions to determine for themselves when, how, and to what extent information about them is communicated to others. Gavison wrote in [42]: A loss of privacy occurs as others obtain information about an individual, pay attention to him, or gain access to him. These three elements of secrecy, anonymity, and solitude are distinct and independent, but interrelated, and the complex concept of privacy is richer than any definition centered around only one of them. Barth et al., in [13]: Privacy as a right to appropriate flows of personal information. Bertino et al., in [16]: The right of an entity to be secure from unauthorized disclosure of sensitive information that are contained in an electronic repository or that can be derived as aggregate and complex information from data stored in an electronic repository. The above definitions focus on the concept of privacy as release of information in a controlled way. One can summarize this as privacy determines what type of personal information should be released and which group or person can access and use it. For the purposes of this Thesis, we give a few relevant definitions: Definition 2.1. Privacy Preserving Data Publishing (PPDP) encompasses privacy models and techniques, which allow one to share anonymous data to ensure protection against identity disclosure. Data anonymization is a technique for PPDP, which makes sure the 12

27 published data is practically useful for processing (mining) while preserving individuals' sensitive information [40]. Classification is a fundamental problem in statistics, machine learning, and pattern recognition.
Definition 2.2. Let a data set have N classifiable attributes and a set L of labels. The task of classification can be defined as the assignment of a specific label L_i ∈ L to every attribute in a consistent, predefined way, so that data groups are identified according to their common attributes/characteristics [51].
Definition 2.3. Differential privacy is a privacy model that ensures the highest level of privacy for a record owner while providing actual information about the data set.
The definitions above will be discussed in detail in later chapters. The following sections discuss the fundamental ideas of privacy preserving models, different attacks, and their classifications.
2.2 Preliminaries
Researchers in the data mining, statistics, database, and security communities have worked on the privacy of data for the last few decades [41][5]. The task of preserving the privacy of a data set can be categorized [41][5][48] as:
Interactive frameworks 13

28 Non-interactive frameworks
In the interactive framework, a privacy-preserving mechanism resides between the users' and/or researchers' queries and a raw data set. The queries and/or their responses are evaluated by the privacy preserving mechanism to guarantee privacy. Examples of query responses in interactive frameworks are SUM, COUNT, etc., so this method is also called statistical disclosure control (SDC). The interactive framework encompasses two different techniques: query auditing and perturbation of outputs. In the case of query auditing, if the response to a query would disclose any sensitive information, then the query is denied; otherwise, the exact answer is disclosed. With output perturbation, on the other hand, the privacy mechanism alters the exact answer of a query into a perturbed form (e.g., by the addition of noise) for publication. In the non-interactive paradigm, sanitization (e.g., anonymization) is applied to the raw data set to make it anonymous and preserve individuals' privacy, and the altered data is then published for analysis or processing. As soon as the data is published, the publisher has no further control over the data set. The non-interactive privacy preserving framework is also called PPDP. Existing works on data privacy can also be categorized in two ways:
Centralized model, and
Distributed model
In the centralized approach, a single owner publishes the data set, and the key challenge is to alter the data to preserve privacy and to process the modified data set to 14

29 mine results. One of the most common methods used in the centralized approach is randomization [6][35]. The key idea behind the distributed approach is that different parties are willing to collaborate to obtain aggregate results, but they do not trust each other enough to share their own data sets. The main challenge is to execute the multiparty computation while preserving the privacy of the data and of the inputs and outputs. This paradigm is called Secure Multiparty Computation (SMC). Privacy and correctness are the two main requirements for SMC [15][17]. PPDP plays a leading role in many application areas, including:
Microdata publishing (for sharing, classification, etc.)
Data outsourcing (for cloud computing)
Collaborative computing (e.g., multiparty computations)
Mobile computing (e.g., location privacy and location-based service quality), etc.
Figure 2.1 presents the data flow diagram of a PPDP system, from data collection through the application phases. Figure 2.1: Data collection, anonymization, and application areas 15
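To make the contrast between the interactive and non-interactive frameworks described above concrete, here is a minimal, illustrative Python sketch (not part of the thesis; the toy table, the ε value, and the 5-year age buckets are all assumptions chosen for the example). The interactive path answers a COUNT query with a Laplace-perturbed output, while the non-interactive path releases a coarsened copy of the whole table once.

```python
import math
import random

# Toy raw data set: each record is (age, disease). In a real deployment this
# would be the custodian's private table.
RAW = [(34, "Hepatitis B"), (39, "Hepatitis B"), (37, "Influenza"),
       (31, "Dengue"), (31, "Influenza"), (31, "Influenza")]

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of Laplace(0, scale) noise.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Interactive (statistical disclosure control) style: the raw data never leaves
# the curator; each query receives a perturbed answer.
def interactive_count(predicate, epsilon: float = 0.5) -> float:
    true_count = sum(1 for rec in RAW if predicate(rec))
    return true_count + laplace_noise(1.0 / epsilon)  # sensitivity of COUNT is 1

# Non-interactive (PPDP) style: a sanitized copy is released once and analysts
# get full access to it; here "sanitization" is simply coarsening the age.
def noninteractive_release():
    return [(f"[{5 * (age // 5)}-{5 * (age // 5) + 5})", disease)
            for age, disease in RAW]

if __name__ == "__main__":
    print("noisy COUNT(disease == Influenza):",
          interactive_count(lambda rec: rec[1] == "Influenza"))
    print("released table:", noninteractive_release())
```

In the non-interactive case the publisher has no further control once the sanitized table is out, which is exactly why the sanitization step itself must carry the privacy guarantee.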

30 When an anonymous data set is published, it is expected to be used by researchers for lawful data analysis. However, there is a high risk that illegitimate users could analyze the published data and discover someone's personal sensitive information. For example, in 2013 the news about PRISM was made public by Edward Snowden, a former National Security Agency (NSA) technology contractor. PRISM is the code name of a data mining program used by the NSA to access the servers of big technology companies such as Google, Yahoo, Skype, and Facebook in order to collect users' data, including messages, log-in activity, voice and video chat, etc. [58]. A data publisher needs a robust data anonymization algorithm that can protect against different attacks as well as keep the data useful for further processing. It is important to note a few definitions [19] which will be useful in the following sections:
Definition 2.4. An identifier that helps to recognize an individual explicitly in a data collection using a set of attributes is called an explicit identifier, e.g., social insurance number (SIN) and name.
Definition 2.5. If the values of a set of attributes can be linked to locate or identify a person in a data set, then these attributes are called quasi-identifiers (QIDs), e.g., postal code, gender, and date of birth.
Definition 2.6. Some attributes are considered sensitive and person-specific; these attributes are called sensitive attributes, e.g., salary, disease, and disability status. All other attributes are considered non-sensitive.
In this document, the term victim refers to an individual (data donor/owner) who is targeted by an attacker. Table 2.1 presents examples of the above-mentioned terms. 16

31 Table 2.1: Examples of Explicit Identifiers, QIDs, and Sensitive Attributes
(Explicit identifiers: Name, Social Insurance Number (SIN); quasi-identifiers (QID): Date of Birth (dd/mm/yy), Gender, Zip Code; sensitive attribute: Disease)
Name | SIN | Date of Birth (dd/mm/yy) | Gender | Zip Code | Disease
Ruby | | /11/74 | Female | | Dengue
Jenny | | /11/84 | Female | | Flu
Dan | | /12/89 | Male | | Cancer
Ella | | /09/81 | Female | | Broken Leg
Max | | /02/85 | Male | | Asthma
Figure 2.2 shows how QIDs can be linked with external data to identify an individual. In this example, a medical data set is linked to a voter list to identify a targeted individual.
2.3 Privacy Models and Different Attacks
Before discussing the privacy models and different attacks, it is necessary to know the definition of privacy protection [41][28]: privacy control is effectively implemented for a published data set if an attacker who has full access to the published data set, and who also has background knowledge (from different sources) about a person in that data set, is still not able to find the targeted person. The following sections discuss different privacy preserving models and attacks. 17

32 Figure 2.2: Presenting quasi-identifiers, linked to re-identify personal data [1][82] 18
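As a concrete illustration of the linkage shown in Figure 2.2, the following minimal Python sketch (illustrative only, not part of the thesis; the two toy tables and the attribute names are assumptions mirroring Tables 2.2 and 2.3) joins a published, de-identified medical table with an external, identified table on the shared quasi-identifiers in order to re-identify individuals.

```python
# Published table: explicit identifiers removed, QIDs and sensitive value kept.
published = [
    {"job": "Professor", "sex": "Male", "age": 37, "disease": "Influenza"},
    {"job": "Pilot", "sex": "Male", "age": 34, "disease": "Hepatitis B"},
    {"job": "Singer", "sex": "Female", "age": 31, "disease": "Influenza"},
]

# External, identified table (e.g., a voter list or public directory).
external = [
    {"name": "Max", "job": "Professor", "sex": "Male", "age": 37},
    {"name": "Bobby", "job": "Pilot", "sex": "Male", "age": 34},
    {"name": "Lolo", "job": "Singer", "sex": "Female", "age": 31},
]

QIDS = ("job", "sex", "age")

def link(published_rows, external_rows, qids=QIDS):
    """Join the two tables on the quasi-identifiers (a record linkage attack)."""
    index = {}
    for row in published_rows:
        index.setdefault(tuple(row[q] for q in qids), []).append(row)
    matches = []
    for person in external_rows:
        for row in index.get(tuple(person[q] for q in qids), []):
            matches.append((person["name"], row["disease"]))
    return matches

print(link(published, external))
# A unique QID combination (e.g., Professor/Male/37) re-identifies Max's disease.
```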

33 2.4 Record Linkage Attack
In a record linkage attack [23][94][24][25], an attacker might have some auxiliary knowledge about an individual from other sources, such as a telephone directory. Let T be a published table, let QID represent the quasi-identifier attributes of T, and let qid be a value of QID shared by only a small number of records in T. The records of T that share the same value qid of QID (qid ∈ QID) form a group; if that group is small and the victim is known to belong to it, the attacker can easily identify the victim. In this way, a record linkage attack makes data donors vulnerable. Tables 2.2, 2.3, and 2.4 present examples of various attacks.
Table 2.2: Patient Table
Job | Sex | Age | Disease
Pilot | Male | 34 | Hepatitis B
Pilot | Male | 39 | Hepatitis B
Professor | Male | 37 | Influenza
Filmmaker | Female | 31 | Dengue
Filmmaker | Female | 31 | Influenza
Singer | Female | 31 | Influenza
Singer | Female | 31 | Influenza
Let us consider a hospital that released Table 2.2 for research purposes. If an attacker has access to Table 2.3 and (s)he knows the victim, then by combining the two tables the attacker can easily identify the victim's disease. In both Tables 2.2 and 2.3, there are common 19

34 Table 2.3: External Table Contains Person Specific Data
Name | Job | Sex | Age
Cindy | Filmmaker | Female | 31
Lolo | Singer | Female | 31
Kim | Filmmaker | Female | 31
Ruby | Singer | Female | 33
Sara | Singer | Female | 31
Bobby | Pilot | Male | 34
Max | Professor | Male | 37
Peter | Pilot | Male | 39
Joe | Professor | Male | 39
Table 2.4: 3-Anonymous Patient Table
Job | Sex | Age | Disease
Artist | Female | [30-35) | Dengue
Artist | Female | [30-35) | Influenza
Artist | Female | [30-35) | Influenza
Artist | Female | [30-35) | Influenza
Professional | Male | [35-40) | Hepatitis B
Professional | Male | [35-40) | Hepatitis B
Professional | Male | [35-40) | Influenza
20

35 attributes: job, sex, and age. For example, Max, a male professor who is 37 years old, is identified as an Influenza patient by qid = (Professor, Male, 37) by joining the two given tables. This is an example of a record linkage attack. k-anonymity [76][80] is a technique that protects against record linkage attacks.
2.4.1 k-Anonymity
The idea of k-anonymity was introduced by Samarati and Sweeney [76][80] to protect the privacy of data donors against record linkage using QIDs. Sweeney explains k-anonymity [71][80] as "The information for each person contained in the released table cannot be distinguished from at least k − 1 individuals whose information also appears in the release." In other words, at least k records must share the same quasi-identifier values in the publicly released table T. Table 2.4 represents a 3-anonymous table obtained by generalization (publishing more general values of the attributes) of the QIDs. Table 2.2 contains person-specific information; however, Table 2.4 contains no person-specific information, as it is generalized according to k-anonymity and at least three records share the same QID. Taxonomy trees of the attributes profession, age, and sex used for the generalization of Table 2.2 are given in Figure 2.3.
2.4.2 (X, Y)-Anonymity
To overcome the limitations of k-anonymity and to ease sequential data release, the idea of (X, Y)-Anonymity was introduced by Wang and Fung [90]. The sequential release 21

36 Figure 2.3: Taxonomy trees for profession, gender, and age 22

37 of a data set is a way to release different attributes as subsets in a sequential manner. For example, in Table 2.5 a data publisher published a table T1, and afterwards the publisher decided to publish another table T2 (see Table 2.6) of the same data set for classification analysis. The column Pid (person identifier) is added for the sake of discussion, not for publication. According to the privacy requirement, a data donor's record should not be identifiable by an attacker from a published table. However, if an attacker joins tables T1 and T2, he can identify the (Sam, Dengue) pair by matching names and diseases, as this group has size 1. In the same way, the attacker is able to infer (Jay, Flu) with 100% confidence for the group of persons named Jay. From the above discussion, it follows that the sequential publication of data makes individuals' privacy vulnerable.
Table 2.5: A published patient data table T1
Pid | Job | Disease
1 | Driver | Flu
2 | Driver | Flu
3 | Chef | Dengue
4 | Teacher | Asthma
5 | Pilot | Dengue
In (X, Y)-anonymity, X and Y represent two disjoint sets of attributes. According to (X, Y)-anonymity, at least k different values of Y are linked with every value of X. k-anonymity can be expressed as a special form of (X, Y)-anonymity, 23

38 Table 2.6: A published patient data table T2
Pid | Name | Job | Class
1 | Jay | Driver | CL1
2 | Jay | Driver | CL1
3 | Sam | Chef | CL2
4 | Sam | Teacher | CL3
5 | Rosy | Pilot | CL4
Table 2.7: Data table formed by joining T1 and T2 (T3, after joining the above tables on T1.Job = T2.Job)
Pid | Name | Job | Disease | Class
1 | Jay | Driver | Flu | CL1
2 | Jay | Driver | Flu | CL1
3 | Sam | Chef | Dengue | CL2
4 | Sam | Teacher | Asthma | CL3
5 | Rosy | Pilot | Dengue | CL4
24

39 where X is a QID and Y is a key in table T that identifies record owners uniquely. (X, Y)-anonymity provides a uniform and flexible way to limit the linkability between the attributes of X and Y when tables are joined.
2.4.3 MultiRelational k-Anonymity
One of the major limitations of the k-anonymity algorithm is that it only deals with a single data table. To overcome this limitation, Nergiz et al. [67] proposed the MultiR anonymity (multi-relational anonymity) algorithm to achieve privacy while publishing data from a data set that consists of multiple tables. The MultiR anonymity algorithm represents a relational database using PT, which stands for a person-specific table, and a collection of n tables T1, T2, ..., Tn, where Pid stands for the person identifier. Sensitive attributes are contained in PT; each table Ti, on the other hand, contains foreign keys, QID attributes, and other sensitive attributes. If all the given tables are joined together as PT ⋈ T1 ⋈ ... ⋈ Tn, MultiR anonymity ensures that, for each record owner (each group of tuples) RO sharing a QID, the same QID belongs to at least k − 1 other record owners. MultiR anonymity can be expressed as (X, Y)-anonymity with X = QID and Y = Pid.
2.4.4 Discussion
Anonymity-based techniques such as MultiR k-anonymity, k-anonymity, and (X, Y)-anonymity are proposed to protect against record linkage attacks by making data donors' information anonymous in a data set. However, an attacker is still able to locate a targeted 25

40 owner of a record without identifying her/his record precisely. For example, in Table 2.4, if an attacker targets the arrangement qid = (Artist, Female, [30-35)), the chance that the targeted person has Influenza is 75%, because, out of the 4 records in this group, the owners of 3 have Influenza. The algorithms designed so far to protect against record linkage attacks are therefore not sufficient to preserve the privacy of record owners. As a result, researchers have proposed other privacy preserving models to protect against attribute linkage.
2.5 Attribute Linkage Attacks
If, in a k-anonymous table T, a collection of similar attribute values forms a prominent group, then an attacker can easily locate the targeted record holder belonging to that group with high confidence [40]. For example, Table 2.4 represents a 3-anonymous table; however, an attacker can easily conclude with 100% confidence that Max is a professor who has Influenza, as (Professor, Male, 37) → Influenza. The attacker used his knowledge to locate Max from Tables 2.2 and 2.3. The following sections discuss algorithms to prevent attribute linkage attacks.
2.5.1 l-Diversity
In order to protect against attribute linkage, a privacy-preserving algorithm called l-diversity was proposed by Machanavajjhala et al. [63]. Even if a data table is k-anonymous, a lack of diversity in a QID group leaks information about data donors. Machanavajjhala et al. [63] showed that the background knowledge of a malicious user makes data donors vulnerable. 26
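The 75% homogeneity figure above can be reproduced mechanically. The following minimal Python sketch (illustrative, not from the thesis; the table literal mirrors Table 2.4) groups records by their QID values, reports the smallest group size (the k actually achieved), and the highest within-group confidence for any sensitive value, which is exactly the quantity an attribute linkage attack exploits.

```python
from collections import Counter, defaultdict

# Rows mirroring Table 2.4: (job, sex, age_range, disease).
TABLE_2_4 = [
    ("Artist", "Female", "[30-35)", "Dengue"),
    ("Artist", "Female", "[30-35)", "Influenza"),
    ("Artist", "Female", "[30-35)", "Influenza"),
    ("Artist", "Female", "[30-35)", "Influenza"),
    ("Professional", "Male", "[35-40)", "Hepatitis B"),
    ("Professional", "Male", "[35-40)", "Hepatitis B"),
    ("Professional", "Male", "[35-40)", "Influenza"),
]

def group_stats(rows):
    """Group by QID (all but the last, sensitive column); measure k and confidence."""
    groups = defaultdict(list)
    for *qid, sensitive in rows:
        groups[tuple(qid)].append(sensitive)
    k = min(len(values) for values in groups.values())
    stats = {}
    for qid, values in groups.items():
        value, count = Counter(values).most_common(1)[0]
        stats[qid] = (value, count / len(values))
    return k, stats

k, stats = group_stats(TABLE_2_4)
print("table is k-anonymous with k =", k)       # 3 for Table 2.4
for qid, (value, conf) in stats.items():
    print(qid, "->", value, f"{conf:.0%}")      # e.g. the Artist group -> Influenza 75%
```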

41 The authors demonstrated two attacks, called the homogeneity attack and the background-knowledge attack. Table 2.8 represents a micro data set, and Table 2.9 represents the generalized form of Table 2.8. Table 2.9 can be used to illustrate a homogeneity attack. Ruby and Cory are neighbours, and Ruby knows that Cory is a 22-year-old girl living in the city area of a particular postal code. If Ruby knows that Cory is in Table 2.9, then she can easily guess that Cory's information is in the first four rows of the table. Ruby cannot identify Cory uniquely; however, she breaches Cory's privacy by learning that Cory has Dengue. Another incident based on Table 2.9 illustrates a background-knowledge attack. Say Jon and Karl are pen pals. Karl is a 35-year-old Japanese man who lives in Zip code 14068, and Jon knows all this information. If Jon knows that Karl is in Table 2.9, then he can easily guess that Karl has either heart disease or a viral infection. However, it is medically known that, due to their diet, young Japanese men have a low chance of heart disease. Thus, Jon can conclude that Karl has a viral infection, breaching his privacy. According to the l-diversity [77] method, the sensitive values of every QID group must be well represented, so that at least l diverse values are assigned to each group. Ohrn and Ohno-Machado [69] had also proposed a similar idea previously. The understanding of the idea "well represented" may differ from instance to instance (in terms of the data set). The p-sensitive k-anonymity [87] method is essentially the same as the l-diversity privacy model. Two different versions of the l-diversity algorithm, known as disclosure-recursive (c, l)-diversity and negative/positive disclosure-recursive (c, l)-diversity, were also proposed by Machanavajjhala et al. [63]. To satisfy 27

42 Table 2.8: Patients Micro Data
Serial | Name | Age | Gender | Zip code | Nationality | Disease
1 | Ana | 27 | F | | American | Dengue
2 | Rocky | 28 | M | | American | Dengue
3 | Cory | 22 | F | | American | Dengue
4 | Paul | 24 | M | | American | Dengue
5 | Stewart | 53 | M | | Korean | Influenza
6 | Fillip | 56 | M | | Japanese | Hepatitis B
7 | Gale | 45 | M | | Indian | Flu
8 | Hadi | 48 | F | | Chinese | Influenza
9 | Ian | 32 | M | | Russian | Heart Failure
10 | Julia | 36 | F | | Chinese | Heart Failure
11 | Karl | 35 | M | | Japanese | Viral Infection
12 | Leo | 36 | M | | American | Viral Infection
28

43 Table 2.9: Patients Generalized Data
Serial | Name | Age | Gender | Zip code | Nationality | Disease
1 | (Ana) | | F | 1405* | American | Dengue
2 | (Rocky) | | M | 1406* | American | Dengue
3 | (Cory) | | F | 1406* | American | Dengue
4 | (Paul) | | M | 1405* | American | Dengue
5 | (Stewart) | | M | 15*** | Asian | Influenza
6 | (Fillip) | | M | 15*** | Asian | Hepatitis B
7 | (Gale) | | M | 15*** | Asian | Flu
8 | (Hadi) | | F | 15*** | Asian | Influenza
9 | (Ian) | | M | 140** | Any | Heart Failure
10 | (Julia) | | F | 140** | Any | Heart Failure
11 | (Karl) | | M | 140** | Any | Viral Infection
12 | (Leo) | | M | 140** | Any | Viral Infection
29

44 the recursive (c, l)-diversity mechanism, all QID groups of a table must be (c, l)-diverse, where c is a constant specified by the publisher and l relates to the sensitive values in the rows. For a specific data set, if a collection of sensitive values occurs more often than the other values belonging to a group, this scenario helps an attacker conclude that it is very likely that a certain record in that group has those values. This is called a probabilistic inference attack. The different versions of the l-diversity method are unable to protect against probabilistic inference attacks.
2.5.2 t-Closeness
l-diversity suffers from similarity and skewness attacks, and is also, in certain cases, difficult and unnecessary to achieve; see [56] for details. To overcome these limitations, Li et al. [56] proposed a method called t-closeness. In the t-closeness algorithm, t is a given threshold value. t-closeness [77] calculates the Earth Mover's Distance (EMD) between the distribution of a sensitive attribute in an equivalence class and its distribution in the entire data set; the calculated distance must be within the given threshold value t. Let P and Q represent the distribution of a sensitive attribute in the equivalence class and the distribution of the sensitive attribute in the entire data set, respectively; then, according to t-closeness, EMD(P, Q) ≤ t. The t-closeness algorithm has a number of limitations [41]. Firstly, there is a correlation between the sensitive attributes of a data set and its QIDs; t-closeness degrades the utility of the privacy-preserved data by wiping out this correlation in order to enforce t-closeness. Secondly, for sensitive numerical data, t-closeness is unable to prevent attribute linkage attacks [55]. Thirdly, t- 30

45 closeness uses the EMD measure, which is not perfect or flexible enough to impose different privacy levels on different sensitive attributes, although it is an alternative to generalization and suppression for data anonymization.
2.5.3 Confidence Bounding Attack
To protect against attribute linkage attacks, Wang et al. [91] proposed a new privacy model called confidence bounding. According to this method, for every qid group, privacy templates are created of the form (QID → s, h), where QID is a quasi-identifier, s is a sensitive attribute, and h is a threshold. The confidence bounding algorithm works by limiting the confidence with which a data miner can infer the sensitive properties. The confidence is denoted by Conf(QID → s) and its calculated value is expressed as a percentage. A data table satisfies confidence bounding if it fulfills the condition Conf(QID → s) ≤ h. Table 2.10 presents a 3-anonymous table and provides an example of confidence bounding. For example, for the sensitive attribute value Flu, the threshold is set to 15% for the data given in Table 2.10. For these data, the inferred confidence for Flu is 75% in the group (Artist, Male, [35-40)), which violates the given template of the confidence bounding method. 31
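A confidence bounding template can be checked mechanically. The short Python sketch below is illustrative only (the rows mirror Table 2.10 and the 15% threshold is the one quoted above): it computes Conf(QID → Flu) for every QID group and flags the groups that violate the template.

```python
from collections import defaultdict

# Rows mirroring Table 2.10: (job, sex, age_range, disease).
TABLE_2_10 = [
    ("Health Professional", "Female", "[40-45)", "Dengue"),
    ("Health Professional", "Female", "[40-45)", "Dengue"),
    ("Health Professional", "Female", "[40-45)", "Flu"),
    ("Artist", "Male", "[35-40)", "HIV"),
    ("Artist", "Male", "[35-40)", "Flu"),
    ("Artist", "Male", "[35-40)", "Flu"),
    ("Artist", "Male", "[35-40)", "Flu"),
]

def violates_template(rows, sensitive_value, h):
    """Return the QID groups where Conf(QID -> sensitive_value) exceeds h."""
    groups = defaultdict(list)
    for *qid, disease in rows:
        groups[tuple(qid)].append(disease)
    violations = {}
    for qid, diseases in groups.items():
        conf = diseases.count(sensitive_value) / len(diseases)
        if conf > h:
            violations[qid] = conf
    return violations

print(violates_template(TABLE_2_10, "Flu", h=0.15))
# Both QID groups exceed the 15% bound for Flu; the Artist group reaches 0.75.
```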

46 Table 2.10: 3-anonymous Patient Table
Job | Sex | Age | Disease
Health Professional | Female | [40-45) | Dengue
Health Professional | Female | [40-45) | Dengue
Health Professional | Female | [40-45) | Flu
Artist | Male | [35-40) | HIV
Artist | Male | [35-40) | Flu
Artist | Male | [35-40) | Flu
Artist | Male | [35-40) | Flu
2.6 Table Linkage Attacks
In the cases of attribute linkage and record linkage attacks, it is assumed that the attacker already knows that the targeted person's record is in the published table. A table linkage attack takes place when an attacker tries to confirm whether or not the targeted person's record is present in the published, anonymized data set.
2.6.1 δ-Presence
k-anonymity-based algorithms are designed to deter an attacker from locating a targeted person's record, but the attacker might still learn of the presence of the targeted person in a certain data set; i.e., table linkage attacks are not preventable by k-anonymity algorithms. To overcome this limitation, Nergiz et al. [66] proposed the δ-presence algorithm, where δ represents a satisfactory range of probability (threshold), δ = (δ_min, δ_max). Let us consider 32

47 two tables: an external public table (say, a voter list) and a private table, T_E and T_P respectively, where T_P ⊆ T_E. A generalized, anonymized version of T_P is T̂_P. T̂_P fulfills the threshold δ = (δ_min, δ_max) if, for any targeted person t with t ∈ T_E, δ_min ≤ Pr(t ∈ T_P | T̂_P) ≤ δ_max. One of the limitations of δ-presence is that it assumes that the external table T_E used for the data breach is available to both the attacker and the publisher, and this assumption may not be realistic [41].
2.7 Probabilistic Attack
The probabilistic attack privacy model [41] deals with the probabilistic belief of an attacker attempting to identify a certain record of a targeted person in a data set. The previously mentioned models, in contrast, deal with records, sensitive attributes, and tables to protect against linkage attacks established by an attacker. Algorithms of this category are presented below.
2.7.1 (c, t)-Isolation
Chawla et al. [22][21] proposed a method called (c, t)-isolation that prevents an attacker from isolating a numerical value from a real database (RDB), where t is a numerical value in the real database and the degree of similarity is defined by c. The main limitation of this model is that it is only applicable to numeric data. 33

48 2.7.2 (d, λ)-Privacy
Rastogi et al. [75] proposed the (d, λ)-privacy model, which deals with the prior and posterior beliefs of an attacker about a tuple t from a table T containing r tuples, where d ∈ (0, 1) and γ represents the attacker's posterior probability. According to the (d, λ)-privacy model, the value of the prior probability P(r) for every tuple t is either equal to 1 or smaller. If P(r) = 1, this reflects the fact that the attacker is certain about the presence of t in the table T, and the algorithm is unable to hide the information. On the other hand, the algorithm hides the tuple from the attacker if the value of P(r) is smaller than 1. Calculating the dependency of an attacker's knowledge in terms of d may not be feasible for many real-life applications [40][62], which is an important limitation of this algorithm.
2.7.3 ε-Differential Privacy
All the aforementioned privacy models are called partition-based models. They provide privacy protection by enforcing certain syntactic requirements on the released data. Recent research indicates that the partition-based [56][95] privacy models are unable to withstand an attacker's background knowledge. In contrast, differential privacy [30] is a more semantic definition, which accommodates strong guarantees for privacy regardless of an attacker's background knowledge and computational expertise/power [49]. Dwork et al. [33] proposed the ε-differential privacy model, where ε specifies the degree of privacy of the algorithm. According to this model, the result of an analysis is not affected extensively by the addition or deletion of a single record to or from the database. By the same notion, even 34

49 if an attacker joins different databases together, there is no chance of breaching the privacy of any data donor. In Chapter 3, ε-differential privacy is discussed in detail. Table 2.11 summarizes the various privacy models and the attacks handled by the corresponding models.
Table 2.11: Different privacy preserving algorithms and attacks [33][41][97]
Privacy Model | Record Linkage | Attribute Linkage | Table Linkage | Probabilistic Attack
k-Anonymity | Y | | |
MultiR k-Anonymity | Y | | |
l-Diversity | Y | Y | |
Confidence Bounding | | Y | |
(X, Y)-Privacy | Y | Y | |
t-Closeness | | Y | | Y
δ-Presence | | | Y |
(c, t)-Isolation | Y | | | Y
(d, γ)-Privacy | | | Y | Y
ε-Differential Privacy | Y | Y | Y | Y
35

50 2.8 Anonymization Mechanisms
Normally, a given raw data set is very unlikely to satisfy a specified privacy model. Certain anonymization mechanisms need to be applied to the raw data set, making it less precise, in order to support a privacy model. In every privacy model, a trade-off takes place between the privacy guarantee and data usability due to the various anonymization operations. It is worth mentioning that there is more than one anonymization mechanism for achieving a specific privacy model. However, in many cases, it is important to choose the right anonymization mechanism [10] in order to obtain a better trade-off. So far, four kinds of anonymization mechanisms have been widely used, namely generalization, suppression, bucketization, and perturbation.
2.8.1 Generalization
The generalization mechanism generates anonymous releases by replacing some attribute values with their more general forms. In the case of a numerical value, an interval that covers the exact value takes its place. A categorical value, on the other hand, is replaced with a more general value based on the taxonomy used to design the privacy algorithm. Usually, no predetermined taxonomy is assigned to a numerical attribute. Taxonomy trees for numerical and categorical attributes are presented in Figure 2.3 (a, b, and c). In Table 2.4, 3-anonymity is applied by generalizing the QIDs according to the taxonomy trees in Figure 2.3. Generalization can be performed using either a global recoding scheme or a local recoding scheme. The global recoding scheme fur- 36

51 ther includes the full-domain generalization scheme, the sub-tree generalization scheme, and the sibling generalization scheme.
2.8.2 Suppression
Suppression is a straightforward anonymization mechanism. A special symbol (e.g., * or Any) is used to replace values of an attribute in a release candidate, which indicates that the attribute/value has been suppressed.
2.9 Bucketization
The basic idea of the bucketization mechanism is to break the correlation between the sensitive values of a data set and the quasi-identifiers (QIDs). It first partitions the records of the actual data table into non-overlapping buckets, each of which is assigned a unique bucket identifier (BID). Then, for each bucket, it randomly permutes the sensitive attribute values, and publishes its projection on the quasi-identifier attributes together with its projection on the permuted sensitive attributes.
2.10 Perturbation
Perturbation mechanisms have long been used in the field of statistical disclosure control. Adam and Workman [5] have provided a complete summary of the perturbation mechanisms that have been widely employed. There are two standard perturbation mechanisms 37

52 that are used for implementing differential privacy algorithms, namely the Laplace mechanism and the exponential mechanism. The statistical disclosure control mechanism uses perturbation to sanitize data because this technique is efficient, simple, and able to preserve statistical information. In general, a perturbation technique uses a synthetic data value instead of an original value. As a result, there is no remarkable difference in statistical information between the perturbed and the original data [32].
2.11 Conclusion
This chapter summarizes the concept of data privacy, the different existing data privacy models, and various attacks that breach privacy, with examples. In this work, privacy refers to an individual's privacy. The models and algorithms discussed above have the following key limitations. Partition-based privacy preserving data sanitization algorithms [95][23][41][64] are variations of the k-anonymity algorithm. Partition-based algorithms impose syntactic constraints (on the raw data) to ensure privacy. These algorithms are unable to protect data donors from an adversary's background knowledge attacks; they are only able to prevent record linkage [23][94][24][25] and attribute linkage attacks [55]. Another limitation of most existing algorithms is the need to identify and choose quasi-identifiers (QIDs) from a data set before sanitizing it. Partition-based algorithms [41][98] such as k-anonymity, l-diversity, M-map, etc., fully suppress QID attributes. As a result, due to loss 38

53 of information, the utility (such as classification accuracy) of the data is reduced significantly. 39

54 Chapter 3 Methodology 3.1 Proposed System and Experimental Design This Chapter presents the mathematical background of the proposed algorithms. The formal definition of differential privacy and the idea of Laplace noise are introduced here. The data flow of the proposed algorithms is also explained here. Methods used to measure the risk of re-identification are also described here. The following sections will discuss the detailed implementation of the proposed system Privacy Constraint Current privacy preserving models (such as partition based models and interactive models) [23][95] are vulnerable to different privacy-breaching attacks. In the proposed system, 40

55 ε-differential privacy will be used. It is capable of protecting published data sets from different privacy breach attacks. Differential privacy is a new paradigm that provides a strong privacy guarantee [33]. Partition-based privacy models [23][95] ensure privacy by imposing syntactic constraints on the output. For example, the output may be required to be indistinguishable among k records, or the sensitive values may be required to be well represented in every equivalence group. Instead, differential privacy makes sure that a malicious user is not able to learn any information about a targeted person, whether or not a data set contains that person's record. Informally, a differentially private output is insensitive to any particular record. Thus, while preserving the privacy of an individual, the output of the differential privacy method is computed as if from a data set that does not contain the targeted person's record. Current research shows that ε-differential privacy is able to protect against most attacks.
ε-Differential Privacy
In this section the formal definition of ε-differential privacy [33] is given. Before that, the difference between two databases is defined below. Let a data set DB be a collection of records from a universal sample space χ. A histogram is a convenient way to represent a data set: the data set may be represented by DB ∈ N^|χ|, where every entry DB_i represents the number of elements of type i in the database DB, for every i ∈ χ, and N = {0, 1, 2, ...}. Let DB_1 and DB_2 be two databases; the distance between them will be their norm distance. 41

56 Definition 3.1. (Distance between databases). The ℓ1 norm of a database DB is denoted by ‖DB‖₁ and defined as:

‖DB‖₁ = Σ_{i=1}^{|χ|} |DB_i|   (3.1)

The ℓ1 distance between two databases DB_1 and DB_2 is ‖DB_1 − DB_2‖₁. It is important to note that ‖DB_1‖₁ represents the database size, i.e., how many records it has, whereas ‖DB_1 − DB_2‖₁ represents the number of records that differ between DB_1 and DB_2.

Definition 3.2. (Differential Privacy). Let M be a randomized algorithm over the domain N^|χ|. Then M is (ε, δ)-differentially private if, for all S ⊆ Range(M) and all DB_1, DB_2 ∈ N^|χ| with ‖DB_1 − DB_2‖₁ ≤ 1:

Pr[M(DB_1) ∈ S] ≤ exp(ε) · Pr[M(DB_2) ∈ S] + δ   (3.2)

Now, if δ = 0, the randomized algorithm M becomes ε-differentially private:

Pr[M(DB_1) ∈ S] / Pr[M(DB_2) ∈ S] ≤ exp(ε)   (3.3)

There is a significant difference between (ε, δ)- and (ε, 0)-differential privacy. For every output of data processing using M(DB), (ε, 0)- (i.e., ε-) differential privacy ensures that the outputs on two neighbouring databases are almost equally likely [32][33]. 42
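To ground Definitions 3.1 and 3.2, here is a small, illustrative Python sketch (not part of the thesis; the toy record types are assumptions): it represents two databases as histograms over the sample space and computes the ℓ1 distance of Definition 3.1, confirming that removing one record yields neighbouring databases.

```python
from collections import Counter

# Assumed sample space: each record is identified by its (age-band, disease) type.
db1 = [("30-35", "Dengue"), ("30-35", "Influenza"), ("35-40", "Hepatitis B")]
db2 = [("30-35", "Dengue"), ("30-35", "Influenza")]  # one record removed

def histogram(db):
    """Histogram representation DB in N^|chi|: a count per record type."""
    return Counter(db)

def l1_norm(hist):
    return sum(abs(v) for v in hist.values())

def l1_distance(h1, h2):
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0) - h2.get(k, 0)) for k in keys)

h1, h2 = histogram(db1), histogram(db2)
print("size of DB1:", l1_norm(h1))         # ||DB1||_1 = 3 records
print("distance:", l1_distance(h1, h2))    # ||DB1 - DB2||_1 = 1 -> neighbouring databases
```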

57 A stronger privacy guarantee may be achieved by choosing a lower value of ε. Typical values are 0.01 or 0.1, or perhaps ln 2 or ln 3 [31]. In this research, the value of ε is used in the range 0.01 ≤ ε ≤ 1.0. For a very small ε,

exp(ε) ≈ 1 + ε   (3.4)

To process numeric and non-numeric data with the differential privacy model, the following techniques are needed.
3.1.2 Laplace Mechanism
Dwork et al. [32] proposed the Laplace mechanism, which adds noise to numerical values to ensure differential privacy. The Laplace mechanism takes a database DB as input and consists of a function f and the privacy parameter λ. The privacy parameter λ specifies how much noise should be added to produce the privacy-preserving output. The mechanism first computes the true output f(DB), and then perturbs this output with noise. The noise is generated from a Laplace distribution with probability density function π,

π(x | λ) = (1 / 2λ) · exp(−|x| / λ)   (3.5)

where x is a random variable; its variance is 2λ² and its mean is 0. The noisy output is then computed using the following formula: 43

\hat{f}(DB) = f(DB) + \mathrm{lap}(\lambda)    (3.6)

where lap(λ) is sampled from the Laplace distribution; the expected magnitude of lap(λ) is approximately λ. In a similar way, the following mechanism ensures ε-differential privacy:

\hat{f}(DB) = f(DB) + \mathrm{lap}(1/\varepsilon)    (3.7)

For a random variable v, the random Laplace noise N_r = lap(·) is generated using the following equation [72]:

N_r = -\mathrm{sign}(v) \ln(1 - 2|v|)    (3.8)

For this research, v = 1/ε, since the value of 1/ε is always positive and varies for every group of data. Finally, the random Laplace noise N_r is generated using the following equation:

N_r = -\mathrm{sign}(1/\varepsilon) \ln(1 - 2(1/\varepsilon)) = -1 \cdot \ln(1 - 2(1/\varepsilon)) = -\ln(1 - 2(1/\varepsilon))    (3.9)

Thus:

N_r = \left| \ln(1 - 2(1/\varepsilon)) \right|    (3.10)

Within the last five years, several published works [9][33][53] have also proved that adding Laplace noise secures data against the adversary.

Theorem 3.3 ([33]). The Laplace mechanism satisfies (ε, 0)-differential privacy.

Proof. Consider DB_1 ∈ N^|χ| and DB_2 ∈ N^|χ| such that \|DB_1 - DB_2\|_1 ≤ 1. Let f : N^|χ| → R^k be some function with sensitivity Δf, let p_{DB_1} denote the probability density function (π) of M_L(DB_1, f, ε), and let p_{DB_2} denote the probability density function of M_L(DB_2, f, ε). At an arbitrary point x ∈ R^k,

\frac{p_{DB_1}(x)}{p_{DB_2}(x)} = \prod_{i=1}^{k} \frac{\exp(-\varepsilon |f(DB_1)_i - x_i| / \Delta f)}{\exp(-\varepsilon |f(DB_2)_i - x_i| / \Delta f)}
= \prod_{i=1}^{k} \exp\left( \frac{\varepsilon (|f(DB_2)_i - x_i| - |f(DB_1)_i - x_i|)}{\Delta f} \right)
\le \prod_{i=1}^{k} \exp\left( \frac{\varepsilon |f(DB_1)_i - f(DB_2)_i|}{\Delta f} \right)
= \exp\left( \frac{\varepsilon \, \|f(DB_1) - f(DB_2)\|_1}{\Delta f} \right)
\le \exp(\varepsilon)    (3.11)
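A minimal sketch of the Laplace mechanism for a counting query is given below. It uses NumPy's standard Laplace sampler with scale Δf/ε (for a count, Δf = 1) rather than the closed-form noise magnitude of Equation 3.10, and the example records and query are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return the true query answer perturbed with Laplace noise of scale sensitivity/epsilon."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical counting query f(DB): how many records have Class = 'Y' (sensitivity 1).
classes = ["Y", "N", "Y", "Y", "N"]
true_count = sum(1 for c in classes if c == "Y")   # f(DB) = 3

epsilon = 0.1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon)
print(true_count, round(noisy_count, 2))
```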

Theorem 3.4 ([33]). If M_L satisfies (ε, 0)-differential privacy, then it also satisfies (kε, 0)-differential privacy for any group of size k. That is, if \|DB_1 - DB_2\|_1 ≤ k, then for all S ⊆ R (R is the range of M_L):

\Pr[M_L(DB_1) \in S] \le \exp(k\varepsilon) \Pr[M_L(DB_2) \in S]

Proof. Let DB_1 and DB_2 be any pair of data sets satisfying \|DB_1 - DB_2\|_1 ≤ k. Then there exist databases d_0, d_1, ..., d_k such that d_0 = DB_1, d_k = DB_2, and \|d_i - d_{i+1}\|_1 ≤ 1 for every i. For any event S ⊆ Ŕ (Ŕ is the range of M_L):

\Pr[M_L(DB_1) \in S] = \Pr[M_L(d_0) \in S]
\le \exp(\varepsilon) \Pr[M_L(d_1) \in S]
\le \exp(\varepsilon)\exp(\varepsilon) \Pr[M_L(d_2) \in S] = \exp(2\varepsilon) \Pr[M_L(d_2) \in S]
\cdots
\le \exp(k\varepsilon) \Pr[M_L(d_k) \in S] = \exp(k\varepsilon) \Pr[M_L(DB_2) \in S]    (3.12)

3.1.3 Anonymization

Data anonymization is a procedure that converts data into a new form that is secure and prevents information leakage from the data set. At the same time, the anonymized data should still be minable for useful information and patterns. Data anonymization may be achieved in different ways; data suppression and generalization are the standard methods. In this research, generalization and suppression are used to achieve data anonymization.

Generalization

To anonymize a data set DB, generalization substitutes an original attribute value with a more general value, chosen according to the characteristics of the attribute. For example, in this work the professions filmmaker and singer are generalized to artist, and the age 34 is generalized to the range [30-35).

Definition 3.5. Let

DB = \{r_1, r_2, \ldots, r_n\}    (3.13)

be a set of records, where every record r_i represents the information of an individual with attributes

A = \{A_1, A_2, \ldots, A_d\}    (3.14)

It is assumed that each attribute A_i has a finite domain, denoted by Ω(A_i). The domain of DB is defined as

\Omega(DB) = \Omega(A_1) \times \Omega(A_2) \times \cdots \times \Omega(A_d)    (3.15)

Suppression

Suppression is a straightforward anonymization mechanism. A suppression operation replaces an attribute value, fully or partially, with a special symbol (e.g., * or Any) that indicates the value has been suppressed. Suppression is used to prevent disclosure of a value from a data set. Figure 3.1 shows the taxonomy tree (TT) for Zip code suppression of German cities. The first two digits of a Zip code represent a city; for example, 42 represents the City of Velbert. At the root of the TT is any/*****, and a full zip code is placed at each leaf. The taxonomy tree depth (TTD) plays a role in the usability of the published data.

3.1.4 Data Flow Diagram of the Proposed System

The data flow diagram of the proposed system is presented in Figure 3.2. Data donors provide their personal data for various reasons, for example for online shopping. As soon as the proposed system receives the raw data, it removes the personally identifiable information (PII) from the raw data set.
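As an illustration of the suppression and generalization operations used at this stage of the pipeline, the sketch below truncates a zip code to a chosen taxonomy tree depth (replacing the suppressed digits with *) and generalizes an age into a fixed-width range, in the spirit of Figure 3.1. The one-digit-per-level convention and the bin width are assumptions made only for this example.

```python
def suppress_zip(zip_code: str, depth: int) -> str:
    """Keep the first `depth` digits of a zip code and suppress the rest with '*'.
    depth = 0 corresponds to the root of the taxonomy tree (fully suppressed)."""
    kept = zip_code[:depth] if depth > 0 else ""
    return kept + "*" * (len(zip_code) - len(kept))

def generalize_age(age: int, width: int = 5) -> str:
    """Generalize an exact age into a half-open range [lo, hi) of the given width."""
    lo = (age // width) * width
    return f"[{lo}-{lo + width})"

print(suppress_zip("42103", depth=2))  # '42***': only the city prefix is kept
print(suppress_zip("42103", depth=0))  # '*****': the root of the taxonomy tree
print(generalize_age(34))              # '[30-35)', as in the generalization example above
```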

Figure 3.1: Suppression of Zip Codes of two German Cities

Figure 3.2: Data Flow of the Proposed Algorithms

Then, generalization and/or suppression is used to anonymize the data. At the final stage, Laplace noise is added to the anonymized data so that it satisfies ε-differential privacy. Once the sanitized data is published, data usability measures are applied to check the quality of the data set.

3.2 Utility Measures

There is a trade-off between the data sanitization (for publishing) process and the usability of the published data set, since sanitization introduces noise or data loss relative to the actual data set. To measure the utility of the published data set, the following two measures are used:

- Classification accuracy of the sanitized data
- Re-identification risk measurement

3.2.1 Classification Accuracy

Classification is a fundamental problem in statistics, machine learning, and pattern recognition. Let a data set have N classifiable instances and a set of labels L. The task of classification is to assign a specific tag or label L_i ∈ L to every instance in a consistent way, so that data groups are identified according to their common attributes/characteristics.

3.2.2 Re-identification Risk

The risk of re-identification measures how vulnerable a particular record is to being re-identified from a sanitized data set. Three scenarios are considered to estimate the re-identification risk [34]: the Prosecutor, Journalist, and Marketer scenarios. In the Prosecutor scenario (PS), the attacker has background knowledge that the targeted person is already in the data set. In contrast, in the Journalist scenario (JS), the attacker does not have any background knowledge about whether the targeted person is in the data set or not. In the Marketer scenario (MS), the attacker is not interested in identifying a specific person, but in successfully identifying a significant percentage of the records in the data set. The risk of re-identification measures the probability that a sanitized record (or a given number of records) can be correctly associated with the individual to whom it originally corresponds. Let there be n records in a data set, i = 1, 2, 3, ..., n, and let ρ_i be the probability of correctly identifying record i. Let C be an equivalence class of the published data set containing J records, and let the probability for each record j ∈ J in C be denoted ρ_j. The probability of correctly identifying a record is defined by [34]:

\rho_j = \frac{1}{|C_j|}    (3.16)

where |C_j| is the size of the equivalence class, in the published data set, that contains record j.

The number of records at higher risk than a threshold value t in a published data set is measured by the following equation [34]:

R_r = \frac{1}{n} \sum_{j \in J} |C_j| \cdot I(\rho_j > t)    (3.17)

where the indicator function I returns 1 if the condition is true and 0 otherwise. The highest risk associated with a record or a set of records is given by [34]:

R_{max} = \max_{j \in J} (\rho_j)    (3.18)

The average rate of correctly identified records is called the success rate and is given by [34]:

R_s = \frac{1}{n} \sum_{j \in J} |C_j| \cdot \rho_j    (3.19)

3.3 Conclusion

This chapter discussed the theory of differential privacy and the related techniques (e.g., noise generation) that are used in the proposed algorithms and their evaluation. The differential privacy paradigm provides the strongest privacy guarantee and is independent of the adversary's background knowledge; that is why this research adopts differential privacy for the design of the proposed algorithms.

The measure of classification accuracy reflects the quality of the data set: higher accuracy means better-quality sanitized data. The risk of re-identification, on the other hand, indicates whether the sanitized data is safe; a risk of re-identification lower than the threshold means the data is secure enough to publish.

Chapter 4

Sanitizing and Publishing Electronic Health Record

4.1 Introduction

With the development and integration of information and communication technology (ICT), the collection, management, and sharing of electronic health records (EHR) are very common nowadays. The regulations on EHR, the Personal Health Information Protection Act (PHIPA) [2] in Canada and the Health Insurance Portability and Accountability Act (HIPAA) [43] in the United States of America, encourage diverse use of EHR without disclosing data donors' private information [60]. As mentioned earlier, the sharing and exchange of electronic health records (EHR) is beneficial; however, the health information exchange (HIE) program between the USA and Canada has not been very successful.

El Emam (2013) [34] states that 29% of health care providers have suffered breaches of their customer or employee data. Another finding from the same author [34] is that 38% of health care providers do not report data breaches to their patients. In [8], Almoaber and Amyot reported that 33 barriers need to be overcome to make HIE viable. Privacy was one of the topmost concerns for the HIE project. The privacy concerns include: 1) identity theft or fraud, 2) use of information for purposes other than the care of a patient, and 3) illegitimate use of patients' information, e.g., mental health conditions or genome data.

4.2 Problem Definition

The key challenge for a privacy preserving data publishing (PPDP) technique is to guarantee data donors' privacy while maintaining data usability for further processing by data miners or other interested parties. The purpose of this research is to develop a framework that satisfies the differential privacy standard defined by Dwork and Roth [33], neutralizes the risk of re-identification, and maximizes data usability for the classification tasks of knowledge miners. The proposed work has the following two phases:

- Sanitize HIPAA-compliant and/or micro-data (data that contains identifiable and sensitive information about a person) into an anonymous form that satisfies ε-differential privacy
- Measure the risk of re-identification and the classification accuracy, to judge how safe the sanitized data set is and how usable the sanitized, published data is, respectively

One of the main benefits of this work is to make high quality data readily available, encouraging collaborative scientific research and new findings.

4.3 Related Work

Most of the privacy preserving partition-based models found in the literature [23][41][64][95] are variations of the k-anonymity algorithm. When a partition-based algorithm sanitizes a data set, it imposes syntactic constraints on the raw data to ensure privacy: it partitions the records of the data set into groups such that k different sensitive items must be present in each group to protect data items from identification. These algorithms are unable to protect data donors from an adversary's background knowledge attacks. Data donors are becoming more vulnerable to background knowledge attacks because of the availability of external data sets, such as voter lists or publicly available data from social networks, combined with powerful data mining tools. Another limitation of most existing algorithms is the need to identify and choose quasi-identifiers (QIDs) from a data set before sanitizing. Partition-based algorithms [41][98] such as k-anonymity, l-diversity, and M-map fully suppress the QID attributes. As a result, due to the loss of information, the utility (e.g., classification accuracy) of the data is reduced significantly. Some researchers have integrated privacy into machine learning algorithms [88][22] to publish privacy preserving results instead of publishing secure data sets for sharing.

The above discussion presents the limitations of existing published works. The proposed algorithm addresses those limitations to publish a sanitized data set with better usability. A classification task maps a data item to its predefined class; however, in the area of privacy preserving data publishing, little work is found in the literature that addresses classification [18][41]. Some recent works on privacy preserving data publishing are reported below.

In [74], Qin et al. proposed a local differential privacy algorithm called LDPMiner to anonymize set-valued data. They claimed that the algorithmic analysis of their method shows it to be practical in terms of efficiency. In [38], Fan and Jin implemented two methods, the Handpicked algorithm (HPA) and the Simple random algorithm (SRA), which are alterations of the l-diversity approach [23] that sanitize data by adding Laplace noise [33]. Wu et al. [96] proposed a privacy preserving algorithm based on changing quasi-identifiers and anonymization, and evaluated the published data using classification accuracy and F-measure. In [64], Mohammed et al. proposed a DP-based algorithm and measured the classification accuracy of the sanitized data set. Several researchers have added privacy to existing machine learning approaches to publish privacy preserving results, e.g., classification results [88] or histograms [22] of data sets. Those techniques do not fulfill the criteria of sanitized data sharing; rather, they only publish useful results.

4.4 Proposed Algorithm

This research proposes a 2-Layer Privacy Preserving (2LPP) algorithm that satisfies the ε-differential privacy guarantee. Algorithm 1 shows the 2LPP algorithm. In the first layer (lines 1-7), the proposed algorithm applies the generalization technique to transform the attributes of the input data set into their generalized form. For example, the attribute value age 34 is generalized to the range [30-35). In the second layer (lines 9-16), the proposed algorithm adds Laplace noise to the generalized data to make the data items anonymous.

In layer 1 of the proposed algorithm, data generalization takes place. From lines 1 to 7, the algorithm converts the raw data set into its generalized form to add a layer of privacy against data breaches. A taxonomy tree gives the hierarchical relation between an actual attribute value and its general form; the taxonomy tree is never published with the sanitized data set. The proposed algorithm then (line 3) groups the generalized data set based on the similarity of the predictor attributes under the taxonomy tree. Any value that is not supported by the taxonomy tree is discarded (lines 5 to 7).

In layer 2, the algorithm selects the initial privacy budget (line 9). Next, it recalculates the privacy budget based on the size of each group and adds Laplace noise to that group (lines 11 to 13). This process repeats until noise has been added to all groups of the generalized data set. As soon as the noise addition is complete, the algorithm merges all subgroups to form the anonymized, differentially private sanitized data set (line 15). Finally, this data set is published for interested parties (e.g., data miners).

Algorithm 1: The Proposed 2LPP Algorithm
Input : Raw data set DB [predictor attributes A_Pr, class attribute A_Cl], privacy budget ε, taxonomy tree depth d
Output: Sanitized data set DB'
/* Layer 1: generalization */
1  if d > 0 then
       /* Predictor attribute generalization */
2      Generalize A_Pr → Â_Pr, based on the taxonomy tree
3      Split the generalized data set DB_g based on the predictor attribute similarities, i.e.,
4      DB_g = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn
5      [where DB_gi ⊆ DB_g and i = 1, 2, 3, ..., n]
6  else
       /* Remove attributes not in a taxonomy tree */
7      Discard the attribute
8  end
/* Layer 2: adding noise for randomization */
9  Set up the initial privacy budget ε = 0.1   /* a small value such as 0.25, 0.5, or ln 2 */
10 START for: i = 1 to n
11     Count the frequency f_r of each generalized group
12     Recalculate the privacy budget for DB_gi: ε_i = ε / (2(|f_r| + d))
13     Add Laplace noise to the frequency: f_r + lap(1/ε_i)
14 END for
15 Merge the subgroups with the new frequencies: DB' = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn
16 Publish the sanitized data set DB'
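A minimal Python sketch of layer 2 of Algorithm 1 is given below, assuming the generalization of layer 1 has already been applied (for example with helpers like those in Chapter 3) and that each record is a tuple of generalized predictor values plus the class value. The noise is drawn with NumPy's standard Laplace sampler of scale 1/ε_i rather than the closed-form magnitude of Equation 3.10, and all names are illustrative; this is not the implementation used for the experiments.

```python
from collections import Counter
import numpy as np

def two_layer_sanitize(generalized_records, epsilon=0.1, depth=2, rng=None):
    """Sketch of layer 2 of Algorithm 1: group identical generalized records (line 11),
    recompute a per-group budget eps_i = epsilon / (2 * (f_r + depth)) (line 12),
    and attach a Laplace-noised frequency to each group (line 13)."""
    rng = rng or np.random.default_rng()
    groups = Counter(generalized_records)              # frequency f_r of each group
    sanitized = []
    for group, f_r in groups.items():
        eps_i = epsilon / (2 * (f_r + depth))          # recalculated privacy budget
        noisy_freq = f_r + rng.laplace(0.0, 1.0 / eps_i)
        sanitized.append((group, max(0, round(noisy_freq))))  # clamp to a non-negative count
    return sanitized                                   # merged groups with noisy frequencies (line 15)

# Hypothetical generalized records (job, age range, class), as produced by layer 1.
records = [
    ("Health professional", "20-35", "Y"), ("Health professional", "35-50", "N"),
    ("Health professional", "20-35", "N"), ("Health professional", "20-35", "Y"),
    ("Media artist", "20-35", "Y"), ("Media artist", "20-35", "N"),
    ("Media artist", "20-35", "Y"), ("Media artist", "20-35", "N"),
]
print(two_layer_sanitize(records))
```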

4.5 Working Example

Table 4.1 describes the small data set that will be used as the working example for the proposed algorithm.

Table 4.1: Sample small data set

Job        Age  Class
Doctor     34   Y
Nurse      50   N
Doctor     33   N
Nurse      33   Y
Actor      20   Y
Dramatist  31   N
Dramatist  32   Y
Actor      25   N

Using the taxonomy trees for the attributes Job: {any job {health professional {doctor} {nurse}} {media artist {actor} {dramatist}}} and Age: {20-50 {20-35} {35-50}}, with a taxonomy tree depth of d = 2, an anonymized table (Table 4.2) is created that contains the anonymized attributes. In Table 4.2 there are 5 groups of anonymized records, based on the generalized attribute values. Table 4.3 shows those 5 groups with their frequencies.

Table 4.2: Anonymized form of the sample data set

Job                  Age    Class
Health professional  20-35  Y
Health professional  35-50  N
Health professional  20-35  N
Health professional  20-35  Y
Media artist         20-35  Y
Media artist         20-35  N
Media artist         20-35  Y
Media artist         20-35  N

Table 4.3: Groups of the anonymized sample data set with their frequencies

Job                  Age    Class  Frequency
Health professional  20-35  Y      2
Health professional  35-50  N      1
Health professional  20-35  N      1
Media artist         20-35  Y      2
Media artist         20-35  N      2

Let the initial privacy budget be ε = 0.1 and the taxonomy tree depth be d = 2. Consider the group ⟨health professional, 20-35, Y⟩. At line 12 of Algorithm 1, the privacy budget is recalculated as:

\hat{\varepsilon} = \frac{0.1}{2(2 + 2)} = \frac{0.1}{8} = 0.0125    (4.1)

where ε̂ is the privacy budget for the group ⟨health professional, 20-35, Y⟩. The amount of noise is then calculated using the following equation:

N_r = \left| \ln(1 - 2(1/\hat{\varepsilon})) \right|    (4.2)

According to Equation 4.2, the amount of noise for this group is:

N_r = \left| \ln(1 - 2(1/0.0125)) \right| \approx 5    (4.3)

In a similar way, the noise for all the other groups is calculated. Table 4.4 shows the noisy frequencies for all 5 groups.

Table 4.4: Noisy frequencies for the sanitized data

Job                  Age    Class  Noisy Frequency
Health professional  20-35  Y      2+5
Health professional  35-50  N      1+4
Health professional  20-35  N      1+4
Media artist         20-35  Y      2+5
Media artist         20-35  N      2+5

4.6 Data Sets

Two different data sets are used to test the proposed algorithm for sanitizing and publishing secure data.

The Haberman's Survival Data Set is from the UCI machine learning repository [57], and the Doctor's Bills Data Set (DBDS) Version 1 (V1) was created using the doctors' bills from the Multimedia Analysis and Data Mining Research Group, German Research Center for Artificial Intelligence [39]. The DBDS is freely available for download from that research group's website [39]. The DBDS is a distinctive case of a publicly available micro data set. The Haberman's Survival Data Set consists of 306 tuples and contains 3 numerical attributes and 1 class attribute. The DBDS V1 data set consists of 46 tuples and contains 3 numerical, 3 categorical, and 1 class attributes.

Table 4.5: Data Set Descriptions

Data Set             Numerical Attributes  Categorical Attributes  Missing Values  Class Information
Doctor's Bills V1    3                     3                       None            > 60K, ≤ 60K
Haberman's Survival  3                     None                    None            1 = Survived, 2 = Died

Table 4.6: Attributes of the Doctor's Bills Data Set V1

Attribute            Type
Sex                  Categorical
City                 Categorical
Age                  Numerical
Disease              Categorical
Year                 Numerical
Diagnostic Spending  Numerical
Class                > 60K, ≤ 60K

Table 4.7: Attributes of the Haberman's Survival Data Set

Attribute                                   Type
Age                                         Numerical
Patient's year of operation                 Numerical
Number of positive axillary nodes detected  Numerical
Class                                       1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

4.6.1 Data Set Preprocessing

The DBDS is a collection of Portable Document Format (PDF) files and scanned images of doctors' bills. Every bill contains patient information, such as name, address with zip code, date of birth, type of diagnosis, completion date, diagnosis results, and the cost of the diagnosis. It also contains the name and address of the doctor involved in the diagnostic process. All of this information is provided in German. We first converted the bills to English using Google Translate [46] and extracted the personal information to create a data file in comma-separated values (CSV) format. The prepared data set consists of 46 tuples. It contains real-life data with 3 numeric attributes, 3 categorical/non-numeric attributes, and class information that distinguishes two levels, > 60K and ≤ 60K. To make the DBDS V1 HIPAA compliant, we excluded the zip code attribute from the data set. Table 4.6 presents all attributes of the Doctor's Bills data set with their types.

4.7 Results and Discussions

The sanitized, published data set is evaluated to measure its usability for data classification and the associated re-identification risk. Tables 4.8 and 4.9 show the classification accuracy numerically, and the corresponding graphs are shown in Figures 4.2 and 4.3. The same experiment was run five times at each taxonomy tree (TT) depth, d = 2, 4, 8, 12, or 16, and for different values of the privacy budget ε (e.g., 0.5, 0.25, 0.1) to generate the sanitized data.

Figure 4.1: A sample Doctor's Bill from the Data Set

Then, to classify the sanitized data, the decision tree classification algorithm [78] was applied. Each time, the parameters d and ε were varied and the experiment was run five times to produce anonymized data; the resulting data set was then classified to measure the accuracy. The average (arithmetic mean) classification accuracy over the five runs is reported in this thesis. The classification accuracy for data sanitized by the proposed algorithm is above 83% even at the lowest privacy budget, ε = 0.1, for the higher values of d such as 12 and 16.

Table 4.8: Classification Accuracy for the Doctor's Bill V1 Data Set (accuracy by taxonomy tree depth TTD and privacy budget ε)

Table 4.9: Classification Accuracy for the Haberman's Survival Data Set (accuracy by taxonomy tree depth TTD and privacy budget ε)
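The accuracy measurements reported here can be reproduced in outline with scikit-learn; the sketch below is illustrative only (the file name, class column name, and split ratio are assumptions, not the exact experimental setup). It trains a decision tree on a sanitized CSV file and reports the mean accuracy over five runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def mean_accuracy(csv_path, class_column="Class", runs=5, test_size=0.34):
    """Train a decision tree on one-hot encoded predictors and average the accuracy over runs."""
    data = pd.read_csv(csv_path)
    X = pd.get_dummies(data.drop(columns=[class_column]))  # encode categorical predictors
    y = data[class_column]
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return sum(scores) / len(scores)

# Hypothetical usage on a sanitized CSV file produced by the proposed algorithm:
# print(mean_accuracy("sanitized_doctors_bills_v1.csv"))
```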

Figure 4.2: Classification Accuracy for the Doctor's Bill V1 Data Set

Figure 4.3: Classification Accuracy for the Haberman's Survival Data Set

The performance of the proposed 2LPP algorithm has been compared with five other anonymization algorithms: DiffGen [64], k-anonymity (k = 5), k-map (k = 5), δ-presence (0.5 ≤ δ ≤ 1.0), and (ε, δ)-differential privacy (ε = 2, δ = 1E-6) [37][36]. With the decision tree classifier, the 2LPP algorithm shows better performance than the five other algorithms, as shown in Figure 4.4. The major reason the classification accuracy of the partition-based algorithms drops is the difficulty of choosing QIDs: in those algorithms the QIDs are fully suppressed, and the resulting loss of information significantly reduces the usability of the sanitized data.

Figure 4.4: Comparisons among the proposed algorithm and five other algorithms

4.7.1 Risk of Re-identification

Table 4.10 presents the re-identification risk measured under the three attack models described in Chapter 3. The ARX software [37][36] is used to measure the risk of re-identification for the sanitized and raw data sets.

The risk of re-identification for the data sanitized by 2LPP is close to 0 (see Table 4.10), which means the sanitized data is safe to publish.

Table 4.10: Comparison of re-identification risk between sanitized and non-sanitized data sets

Data Set             Re-identification Risk of HIPAA-Compliant Data (%)  Re-identification Risk of Data Sanitized by 2LPP (%)
Doctor's Bill V1     PS=97.82; JS=97.82; MS=97.82                        PS=0.11; JS=0.09; MS=0.09
Haberman's Survival  PS=60.45; JS=60.45; MS=60.45                        PS=0.05; JS=0.05; MS=0.05

Doctor's Bill V1

Figures 4.5 and 4.6 show the risk of re-identification before and after sanitizing the Doctor's Bill data set V1. In Figure 4.5, the success rate of re-identification is approximately 98% for all three attack models, namely the Prosecutor, Journalist, and Marketer attacker models. After sanitization, the risk of re-identification drops to 0.11% for the Prosecutor attacker model and 0.09% for the Journalist and Marketer attacker models.
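For comparison with the ARX measurements, the prosecutor-style metrics of Section 3.2.2 can be sketched directly from a released table. In the fragment below the quasi-identifier columns, the threshold, and the example records are assumptions chosen only to show the calculation of ρ_j, R_r, R_max, and R_s.

```python
import pandas as pd

def reidentification_risk(df, quasi_identifiers, threshold=0.2):
    """Prosecutor-style risk: each record's risk is 1/|equivalence class| (Equation 3.16).
    Returns the share of records above the threshold (R_r), the maximum risk (R_max),
    and the average success rate (R_s)."""
    class_size = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    rho = 1.0 / class_size                 # rho_j for every record
    return {
        "records_at_risk": float((rho > threshold).mean()),  # R_r
        "max_risk": float(rho.max()),                        # R_max
        "success_rate": float(rho.mean()),                   # R_s
    }

# Hypothetical released table with two quasi-identifier columns.
released = pd.DataFrame({
    "Job": ["Health professional"] * 4 + ["Media artist"] * 4,
    "Age": ["20-35", "20-35", "20-35", "35-50", "20-35", "20-35", "20-35", "20-35"],
})
print(reidentification_risk(released, quasi_identifiers=["Job", "Age"]))
```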

Figure 4.5: Risk of Re-identification for the Raw Doctor's Bill V1 Data Set

Figure 4.6: Risk of Re-identification for the Sanitized Doctor's Bill V1 Data Set

Haberman's Survival Data Set

Figures 4.7 and 4.8 show the risk of re-identification before and after sanitizing the Haberman's Survival data set. In Figure 4.7, the success rate of re-identification is approximately 60% for all three attack models, namely the Prosecutor, Journalist, and Marketer attacker models. After sanitization, the risk of re-identification is reduced to 0.05% for all three attacker models.

Figure 4.7: Risk of Re-identification for the Raw Haberman's Survival Data Set

Scalability

The run time of the proposed 2LPP algorithm, measured while varying the size of the data set, is presented in Figure 4.9. For this measurement we used the Doctor's Bill data set V1.

Figure 4.8: Risk of Re-identification for the Sanitized Haberman's Survival Data Set

The proposed 2LPP algorithm spends most of its time reading and generalizing the data set. On average, the data sanitization and writing time is above 20 seconds. For this experiment, a laptop with an Intel Core i7 (2.10 GHz) processor, 8 GB of RAM, and the Windows 8.1 operating system was used.

4.8 Conclusion

This chapter demonstrates the 2-layer privacy preserving (2LPP) algorithm, which uses generalization and Laplace noise to sanitize a micro data set and publish it in anonymized form. The proposed algorithm also ensures the ε-differential privacy guarantee.

Figure 4.9: Runtime for the 2LPP Algorithm

Using this algorithm, the Haberman's Survival data set and a new data set called the Doctor's Bills data set were used to perform experiments for testing and to evaluate the performance of the proposed algorithm. The experiments indicate that the proposed 2-layer privacy preserving algorithm is capable of publishing a useful sanitized data set while significantly reducing the risk of re-identification.

Chapter 5

Sanitizing and Publishing Real-World Data Set

5.1 Introduction

This chapter presents the performance evaluation of the proposed Adaptive Differential Privacy (ADiffP) algorithm for the task of data classification. Data sanitized and published using the proposed algorithm shows better classification accuracy than that of other existing algorithms, which indicates that the proposed algorithm is robust.

5.2 Problem Definition

Publishing high quality sanitized data is challenging due to the varied nature of data (e.g., census, transaction, and location data) and the easy availability of external data sources on the internet.

The aim of this research is to develop a framework that satisfies the differential privacy standard [33] and produces high quality data. High quality refers mainly to two aspects: first, robust sanitization of the raw data to prevent data breaches; and second, a published data set that retains the quality and flavor of the original data set, so that data miners find it useful. The proposed work has the following key goals:

- Sanitize a census data set into an anonymous form that satisfies ε-differential privacy
- Sanitize raw data that contains personally identifiable information (PII) and is not HIPAA compliant
- Evaluate the sanitized data to measure data usability (e.g., classification accuracy) and the ability to prevent data breaches, by measuring the risk of record re-identification

5.3 Related Works

There are various algorithms for privacy preserving data mining (PPDM) and privacy preserving data publishing (PPDP); however, little is found in the literature that addresses privacy preservation for the goal of classification [18][41]. Some of the recent research on privacy preserving data publishing is reported below.

In [38], Fan and Jin implemented two methods, the Hand-picked algorithm (HPA) and the Simple random algorithm (SRA), which are variations of the l-diversity technique [23] and use the Laplace mechanism [33] to add noise and make the data secure.

The authors also claimed that their methods satisfy ε-differential privacy. They used four real-world data sets, Gowalla, Foursquare, Netflix, and MovieLens, and performed empirical studies to evaluate their work, reporting some data loss while imposing privacy on the raw data.

In [61], Loukides et al. implemented a disassociation algorithm for electronic health record privacy. They used an anonymization technique along with horizontal partitioning, vertical partitioning, and refining operations on the data set as needed to impose privacy, and used an EHR data set for their experiments. The proposed algorithm, called k^m-anonymity, is a variation of the k-anonymization algorithm, and the authors follow an interactive model for their implementation; both of these (k-anonymization and the interactive model) are limitations [33][41] of their work.

In [7], Al-Hussaeni, Fung, and Cheung implemented the Incremental Trajectory Stream Anonymizer (ITSA) algorithm to publish private trajectory data (e.g., GPS data of a moving entity). The authors use anonymization and LKC-privacy (L: a positive integer, K: an anonymity threshold K ≥ 1, and C: a confidence threshold 0 ≤ C ≤ 1) to develop the proposed technique. They tested their algorithm on two different data sets, MetroData and Oldenburg, compared their results with the k-anonymity algorithm, and showed that their algorithm works better.

Kisilevich et al. [50] presented a multidimensional hybrid approach called kACTUS-2, which achieves privacy by using suppression and swapping techniques; the method is developed by adopting the k-anonymization model.

The authors investigated data anonymization for data classification, adopting five data sets for their experiments: Adult, German Credit, TTT, Glass Identification, and Waveform. They claim that their work produces better classification accuracy for anonymized data. As the proposed algorithm is based on the k-anonymization model, it inherits all the limitations [41] of the k-anonymity model; in addition, because the suppression technique is applied, a major drawback is that sparse data results in high information loss [59].

Li et al. [54] proposed and demonstrated two k-anonymity based algorithms: Information-based Anonymization for Classification given k (IACk) and a variant of IACk for given distributional constraints (IACc). They used global attribute generalization and local value suppression techniques to produce anonymized data for classification, adopting the Adult data set for their experiments. The authors report that the IACk algorithm shows better classification performance than InfoGain Mondrian [52]. Again, as the proposed algorithm is based on the k-anonymization model, it inherits all the limitations [41] of the k-anonymity model.

5.4 Proposed Algorithm

This research proposes an Adaptive Differential Privacy (ADiffP) algorithm that satisfies the ε-differential privacy guarantee. Algorithm 2 presents the ADiffP algorithm. In line 3, the algorithm converts the raw data set into its generalized form to add a layer of privacy against data breaches. A taxonomy tree gives the hierarchical relation between an actual attribute value and its general form. The taxonomy tree is never published with the sanitized data set.

Algorithm 2: The Proposed ADiffP Algorithm
1  Inputs : Raw data set DB, predictor attributes A_Pr, class attribute A_Cl, privacy budget ε, taxonomy tree depth (TTd) d
2  Output: Sanitized data set DB'
3  Predictor attribute generalization: A_Pr → Â_Pr, based on the taxonomy tree
4  Split the generalized data set DB_g by traversing the taxonomy tree and the predictor attribute similarities, i.e., DB_g = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn [where DB_gi ⊆ DB_g and i = 1, 2, 3, ..., n]
5  Set up the initial privacy budget: ε̂ = ε / |Â_Pr|   /* ε is initially a small number such as 0.1, 0.25, or 0.5 */
6  START for: i = 1 to n
7      Count the frequency f_r of each generalized group
8      Set the adaptive privacy budget for DB_gi: ε_i = ε̂ / (|f_r| + d)
9      Add Laplace noise to the frequency: f_r + lap(1/ε_i)
10 END for
11 Merge the subgroups with the new frequencies: DB' = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn
12 The output is the differentially private, anonymized data set DB'
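A minimal sketch of the adaptive budget step of Algorithm 2 follows. It assumes the generalized groups and their frequencies are already available (as in lines 4 and 7), computes ε̂ = ε/|A_Pr| and ε_i = ε̂/(f_r + d), and perturbs each frequency with NumPy's Laplace sampler of scale 1/ε_i. As before, this is an illustration with hypothetical names and data, not the implementation used for the reported experiments.

```python
import numpy as np

def adiffp_noisy_frequencies(group_frequencies, num_predictors, epsilon=0.1, depth=2, rng=None):
    """group_frequencies maps a generalized group (a tuple of attribute values) to its count f_r."""
    rng = rng or np.random.default_rng()
    eps_hat = epsilon / num_predictors              # line 5: eps_hat = eps / |A_Pr|
    noisy = {}
    for group, f_r in group_frequencies.items():
        eps_i = eps_hat / (f_r + depth)             # line 8: adaptive per-group budget
        noisy[group] = max(0, round(f_r + rng.laplace(0.0, 1.0 / eps_i)))  # line 9
    return noisy                                    # line 11: groups with noisy frequencies

# Hypothetical generalized groups; the data set is assumed to have 5 predictor attributes and d = 2.
groups = {
    ("Koln", "Health prof.", "20-40", "50-80", "Y"): 8,
    ("Berlin", "Media artist", "40-60", "30-60", "N"): 3,
}
print(adiffp_noisy_frequencies(groups, num_predictors=5))
```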

The proposed algorithm (line 4) then partitions the generalized data set (non-synthetically) based on the similarities of the predictor attributes under the taxonomy tree. At this stage, the algorithm also counts the frequency of each group (the number of rows in that group). In line 5, the algorithm calculates the initial privacy budget based on the number of predictor attributes in the input data set. The final privacy budget for a particular group is then calculated in line 8. Next, Laplace noise is generated and added to the frequency of that group. This process repeats until noise has been added to all groups of the generalized data set (lines 6 to 10). Because the proposed algorithm recalculates the privacy budget depending on the number of predictor attributes and the size of each group, we consider this procedure an adaptive noise addition. As soon as the noise addition is complete, the algorithm merges all the subgroups to form the anonymized, differentially private sanitized data set (line 12). Finally, this data set is published for interested parties (e.g., data miners).

5.4.1 Working Example

This section presents a working example of the proposed algorithm. The generalization process is explained in Section 4.5. Let us consider a data set having 5 predictor attributes and 1 class attribute. After applying the generalization process, Table 5.1 is generated with the frequencies of two different groups:

Table 5.1: Anonymized form of the sample data set with group frequencies

City    Job           Age    Year  Expense  Class  Frequency
Koln    Health prof.  20-40        50-80    Y      8
Berlin  Media artist  40-60        30-60    N      3

Let the initial privacy budget be ε = 0.1 and the taxonomy tree depth be d = 2. As there are 5 predictor attributes (excluding the class attribute) in the data set, the privacy budget is recalculated according to the ADiffP algorithm (line 5) as:

\hat{\varepsilon} = \frac{0.1}{5} = 0.02    (5.1)

Now, the group g_1 = ⟨Koln, health prof., 20-40, ..., 50-80, Y⟩ in Table 5.1 has frequency f_i = |g_1| = 8. At line 8 of Algorithm 2, the privacy budget for the group g_1 is recalculated as:

\varepsilon_i = \hat{\varepsilon} / (|f_r| + d)    (5.2)

According to Equation 5.2, the privacy budget for this group is:

\varepsilon_{i=1} = \frac{0.02}{(8 + 2)} = 0.002    (5.3)

Then the amount of noise is calculated using the following equation:

N_r = \left| \ln(1 - 2(1/\varepsilon_i)) \right|    (5.4)

According to Equation 5.4, the amount of noise for g_1 is:

N_r = \left| \ln(1 - 2(1/0.002)) \right| \approx 7    (5.5)

Similarly, the noise for the other group, g_2 = ⟨Berlin, Media artist, 40-60, ..., 30-60, N⟩, is calculated as 6. Table 5.2 shows the noisy frequencies for both groups.

Table 5.2: Noisy frequencies for the sanitized data

City    Job           Age    Year  Expense  Class  Noisy Frequency
Koln    Health prof.  20-40        50-80    Y      8+7
Berlin  Media artist  40-60        30-60    N      3+6

5.5 Data Sets

Two different data sets are used to test the proposed algorithm:

- The Adult Data Set [57]
- The Doctor's Bill Data Set V2 [39]

The Adult data set consists of 45,222 tuples and is 5.4 MB in size. It is a census data set and is publicly available for download.

It contains real-life data with 6 numeric attributes, 8 categorical/non-numerical attributes, and class information that distinguishes two income levels, > 50K and ≤ 50K. Table 5.3 presents all attributes of the Adult data set with their types.

Table 5.3: Attributes of the Adult Data Set

Attribute                  Type
Work Class                 Categorical
Marital Status             Categorical
Occupation                 Categorical
Race                       Categorical
Sex                        Categorical
Relationship               Categorical
Native Country             Categorical
Education                  Categorical
Age                        Numerical
Capital-gain               Numerical
Capital-loss               Numerical
Hours-per-week             Numerical
Final-Weight               Numerical
Education Number of Years  Numerical
Class                      > 50K, ≤ 50K

Chapter 4 (page 65) discusses the preprocessing of the Doctor's Bill data set. The Doctor's Bill data set Version 2 (DBDS V2) has an extra attribute, zip code, compared to Version 1 (V1).

Table 5.4 lists the attributes of the DBDS V2 and their types. The DBDS V2 data set consists of 46 tuples and contains 3 numerical, 3 categorical, 1 set-valued, and 1 class attribute.

Table 5.4: Attributes of the Doctor's Bills Data Set V2

Attribute            Type
Sex                  Categorical
City                 Categorical
Age                  Numerical
Disease              Categorical
Year                 Numerical
Zip Code             Set-valued
Diagnostic Spending  Numerical
Class                > 60K, ≤ 60K

5.6 Result and Discussion

The classification accuracies of the data sets sanitized by the proposed algorithm, for both the Adult data set and the Doctor's Bill data set V2, are reported in Tables 5.5 and 5.6 respectively. The corresponding graphs are shown in Figures 5.1 and 5.2. The accuracy results in Tables 5.5 and 5.6 were produced using the Decision Tree method [44].

To evaluate the utility of the produced differentially private anonymized data sets, experiments were completed at five different taxonomy tree depths, d = 2, 4, 8, 12, and 16. For every depth d, the value of ε was varied over 0.1, 0.25, 0.5, 1, 2, 3, and 4 to obtain the classification accuracy, and every set of experiments was repeated five times to generate sanitized data and the corresponding classification accuracies. The average (arithmetic mean) of the accuracies is reported in the tables. In these experiments, out of the roughly 45K data instances, we used 34% of the data set as the test set and the remaining 66% as the training set.

Table 5.5: Classification accuracy using the Decision Tree classifier for the Adult Data Set (accuracy by taxonomy tree depth TTd and privacy budget ε)

Table 5.6: Classification Accuracy for the Doctor's Bill V2 Data Set (accuracy by taxonomy tree depth TTD and privacy budget ε)

Figure 5.1: Classification Accuracy for the Adult Data Set

Figure 5.2: Classification Accuracy for the Doctor's Bill V2 Data Set

Figure 5.3: Comparisons among the proposed algorithm and five other algorithms

5.6.1 Risk of Re-identification

Table 5.7 shows the re-identification risk measured under the three attack models described in Chapter 3. The ARX software [37][36] is used to measure the risk of re-identification for the sanitized and raw data sets. The risk of re-identification for the data sanitized by ADiffP is very low (see Table 5.7), which means the sanitized data is safe to publish.

Table 5.7: Comparison of re-identification risk between sanitized and non-sanitized data sets

Data Set          Re-identification Risk of Raw/HIPAA-Compliant Data (%)  Re-identification Risk of Data Sanitized by ADiffP (%)
Doctor's Bill V2  PS=100; JS=100; MS=100                                  PS=0.004; JS=0.004; MS=0.004
Adult             PS=0.003; JS=0.003; MS=0.003                            PS=0.001; JS=0.001; MS=0.001

Doctor's Bill V2

Figures 5.4 and 5.5 show the risk of re-identification before and after sanitization of the Doctor's Bill data set V2. In Figure 5.4, the success rate of re-identification is 100% for all three attack models, namely the Prosecutor, Journalist, and Marketer attacker models. This means that every record of this data set is identifiable, as the data set contains the zip code of every patient; as mentioned earlier, we intentionally kept that attribute in the data set to test the robustness of the proposed algorithm. After sanitization, the risk of re-identification drops to 0.004% for all three attacker models. This risk is below the threshold and so small (0.004%) that it is close to zero, which means that the proposed algorithm sanitizes the data robustly.

Figure 5.4: Risk of Re-identification for the Raw Doctor's Bill V2 Data Set


More information

Eagles Charitable Foundation Privacy Policy

Eagles Charitable Foundation Privacy Policy Eagles Charitable Foundation Privacy Policy Effective Date: 1/18/2018 The Eagles Charitable Foundation, Inc. ( Eagles Charitable Foundation, we, our, us ) respects your privacy and values your trust and

More information

Data Inventory and Classification, Physical Devices and Systems ID.AM-1, Software Platforms and Applications ID.AM-2 Inventory

Data Inventory and Classification, Physical Devices and Systems ID.AM-1, Software Platforms and Applications ID.AM-2 Inventory Audience: NDCBF IT Security Team Last Reviewed/Updated: March 2018 Contact: Henry Draughon hdraughon@processdeliveysystems.com Overview... 2 Sensitive Data Inventory and Classification... 3 Applicable

More information

CruiseSmarter PRIVACY POLICY. I. Acceptance of Terms

CruiseSmarter PRIVACY POLICY. I. Acceptance of Terms I. Acceptance of Terms This Privacy Policy describes CRUISE SMARTER policies and procedures on the collection, use and disclosure of your information. CRUISE SMARTER LLC (hereinafter referred to as "we",

More information

Privacy Preserving Health Data Mining

Privacy Preserving Health Data Mining IJCST Vo l. 6, Is s u e 4, Oc t - De c 2015 ISSN : 0976-8491 (Online) ISSN : 2229-4333 (Print) Privacy Preserving Health Data Mining 1 Somy.M.S, 2 Gayatri.K.S, 3 Ashwini.B 1,2,3 Dept. of CSE, Mar Baselios

More information

Best Practices. Contents. Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL Meridiantechnologies.net

Best Practices. Contents. Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL Meridiantechnologies.net Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL 32257 Meridiantechnologies.net Contents Overview... 2 A Word on Data Profiling... 2 Extract... 2 De- Identification... 3 PHI... 3 Subsets...

More information

Privacy-preserving machine learning. Bo Liu, the HKUST March, 1st, 2015.

Privacy-preserving machine learning. Bo Liu, the HKUST March, 1st, 2015. Privacy-preserving machine learning Bo Liu, the HKUST March, 1st, 2015. 1 Some slides extracted from Wang Yuxiang, Differential Privacy: a short tutorial. Cynthia Dwork, The Promise of Differential Privacy.

More information

Virtua Health, Inc. is a 501 (c) (3) non-profit corporation located in Marlton, New Jersey ( Virtua ).

Virtua Health, Inc. is a 501 (c) (3) non-profit corporation located in Marlton, New Jersey ( Virtua ). myvirtua.org Terms of Use PLEASE READ THESE TERMS OF USE CAREFULLY Virtua Health, Inc. is a 501 (c) (3) non-profit corporation located in Marlton, New Jersey ( Virtua ). Virtua has partnered with a company

More information

A Review of Privacy Preserving Data Publishing Technique

A Review of Privacy Preserving Data Publishing Technique A Review of Privacy Preserving Data Publishing Technique Abstract:- Amar Paul Singh School of CSE Bahra University Shimla Hills, India Ms. Dhanshri Parihar Asst. Prof (School of CSE) Bahra University Shimla

More information

Remote Access to a Healthcare Facility and the IT professional s obligations under HIPAA and the HITECH Act

Remote Access to a Healthcare Facility and the IT professional s obligations under HIPAA and the HITECH Act Remote Access to a Healthcare Facility and the IT professional s obligations under HIPAA and the HITECH Act Are your authentication, access, and audit paradigms up to date? Table of Contents Synopsis...1

More information

All Aboard the HIPAA Omnibus An Auditor s Perspective

All Aboard the HIPAA Omnibus An Auditor s Perspective All Aboard the HIPAA Omnibus An Auditor s Perspective Rick Dakin CEO & Chief Security Strategist February 20, 2013 1 Agenda Healthcare Security Regulations A Look Back What is the final Omnibus Rule? Changes

More information

Privacy Preserving Data Mining. Danushka Bollegala COMP 527

Privacy Preserving Data Mining. Danushka Bollegala COMP 527 Privacy Preserving ata Mining anushka Bollegala COMP 527 Privacy Issues ata mining attempts to ind mine) interesting patterns rom large datasets However, some o those patterns might reveal inormation that

More information

Website Privacy Policy

Website Privacy Policy Website Privacy Policy Last updated: May 12, 2016 This privacy policy (the Privacy Policy ) applies to this website and all services provided through this website, including any games or sweepstakes (collectively,

More information

Comparison and Analysis of Anonymization Techniques for Preserving Privacy in Big Data

Comparison and Analysis of Anonymization Techniques for Preserving Privacy in Big Data Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 2 (2017) pp. 247-253 Research India Publications http://www.ripublication.com Comparison and Analysis of Anonymization

More information

Re: Special Publication Revision 4, Security Controls of Federal Information Systems and Organizations: Appendix J, Privacy Control Catalog

Re: Special Publication Revision 4, Security Controls of Federal Information Systems and Organizations: Appendix J, Privacy Control Catalog April 6, 2012 National Institute of Standards and Technology 100 Bureau Drive, Stop 1070 Gaithersburg, MD 20899-1070 Re: Special Publication 800-53 Revision 4, Security Controls of Federal Information

More information

ISAO SO Product Outline

ISAO SO Product Outline Draft Document Request For Comment ISAO SO 2016 v0.2 ISAO Standards Organization Dr. Greg White, Executive Director Rick Lipsey, Deputy Director May 2, 2016 Copyright 2016, ISAO SO (Information Sharing

More information

BCN Telecom, Inc. Customer Proprietary Network Information Certification Accompanying Statement

BCN Telecom, Inc. Customer Proprietary Network Information Certification Accompanying Statement BCN Telecom, Inc. Customer Proprietary Network Information Certification Accompanying Statement BCN TELECOM, INC. ( BCN" or "Company") has established practices and procedures adequate to ensure compliance

More information

Privacy & Information Security Protocol: Breach Notification & Mitigation

Privacy & Information Security Protocol: Breach Notification & Mitigation The VUMC Privacy Office coordinates compliance with the required notification steps and prepares the necessary notification and reporting documents. The business unit from which the breach occurred covers

More information

TREND MICRO PRIVACY POLICY (Updated May 2012)

TREND MICRO PRIVACY POLICY (Updated May 2012) TREND MICRO PRIVACY POLICY (Updated May 2012) Trend Micro Incorporated and its subsidiaries and affiliates (collectively, "Trend Micro") are committed to protecting your privacy and ensuring you have a

More information

HF Markets SA (Pty) Ltd Protection of Personal Information Policy

HF Markets SA (Pty) Ltd Protection of Personal Information Policy Protection of Personal Information Policy Protection of Personal Information Policy This privacy statement covers the website www.hotforex.co.za, and all its related subdomains that are registered and

More information

SANMINA CORPORATION PRIVACY POLICY. Effective date: May 25, 2018

SANMINA CORPORATION PRIVACY POLICY. Effective date: May 25, 2018 SANMINA CORPORATION PRIVACY POLICY Effective date: May 25, 2018 This Privacy Policy (the Policy ) sets forth the privacy principles that Sanmina Corporation and its subsidiaries (collectively, Sanmina

More information

Adopter s Site Support Guide

Adopter s Site Support Guide Adopter s Site Support Guide Provincial Client Registry Services Version: 1.0 Copyright Notice Copyright 2016, ehealth Ontario All rights reserved No part of this document may be reproduced in any form,

More information

Jeffrey Friedberg. Chief Trust Architect Microsoft Corporation. July 12, 2010 Microsoft Corporation

Jeffrey Friedberg. Chief Trust Architect Microsoft Corporation. July 12, 2010 Microsoft Corporation Jeffrey Friedberg Chief Trust Architect Microsoft Corporation July 2, 200 Microsoft Corporation Secure against attacks Protects confidentiality, integrity and availability of data and systems Manageable

More information

Differential Privacy. CPSC 457/557, Fall 13 10/31/13 Hushiyang Liu

Differential Privacy. CPSC 457/557, Fall 13 10/31/13 Hushiyang Liu Differential Privacy CPSC 457/557, Fall 13 10/31/13 Hushiyang Liu Era of big data Motivation: Utility vs. Privacy large-size database automatized data analysis Utility "analyze and extract knowledge from

More information

74% 2014 SIEM Efficiency Report. Hunting out IT changes with SIEM

74% 2014 SIEM Efficiency Report. Hunting out IT changes with SIEM 2014 SIEM Efficiency Report Hunting out IT changes with SIEM 74% OF USERS ADMITTED THAT DEPLOYING A SIEM SOLUTION DIDN T PREVENT SECURITY BREACHES FROM HAPPENING Contents Introduction 4 Survey Highlights

More information

Janie Appleseed Network Privacy Policy

Janie Appleseed Network Privacy Policy Last Updated: April 26, 2017 Janie Appleseed Network Privacy Policy The Janie Appleseed Network respects and values your privacy. This Privacy Policy describes how Janie Appleseed Network, a Rhode Island

More information

Hippocratic Databases and Fine Grained Access Control

Hippocratic Databases and Fine Grained Access Control Hippocratic Databases and Fine Grained Access Control Li Xiong CS573 Data Privacy and Security Review Anonymity - an individual (or an element) not identifiable within a well-defined set Confidentiality

More information

Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust

Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust G.Mareeswari 1, V.Anusuya 2 ME, Department of CSE, PSR Engineering College, Sivakasi, Tamilnadu,

More information

General Data Protection Regulation Frequently Asked Questions (FAQ) General Questions

General Data Protection Regulation Frequently Asked Questions (FAQ) General Questions General Data Protection Regulation Frequently Asked Questions (FAQ) This document addresses some of the frequently asked questions regarding the General Data Protection Regulation (GDPR), which goes into

More information

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Partition Based Perturbation for Privacy Preserving Distributed Data Mining BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation

More information

Microdata Publishing with Algorithmic Privacy Guarantees

Microdata Publishing with Algorithmic Privacy Guarantees Microdata Publishing with Algorithmic Privacy Guarantees Tiancheng Li and Ninghui Li Department of Computer Science, Purdue University 35 N. University Street West Lafayette, IN 4797-217 {li83,ninghui}@cs.purdue.edu

More information

Privacy Policy Manhattan Neighborhood Network Policies 2017

Privacy Policy Manhattan Neighborhood Network Policies 2017 Privacy Policy Manhattan Neighborhood Network Policies 2017 Table of Contents Manhattan Neighborhood Network Policies 3 MNN s Privacy Policy 3 Information Collection, Use and Sharing 4 Your Access to and

More information

Elements of a Swift (and Effective) Response to a HIPAA Security Breach

Elements of a Swift (and Effective) Response to a HIPAA Security Breach Elements of a Swift (and Effective) Response to a HIPAA Security Breach Susan E. Ziel, RN BSN MPH JD Krieg DeVault LLP Past President, The American Association of Nurse Attorneys Disclaimer The information

More information

Keeping It Under Wraps: Personally Identifiable Information (PII)

Keeping It Under Wraps: Personally Identifiable Information (PII) Keeping It Under Wraps: Personally Identifiable Information (PII) Will Robinson Assistant Vice President Information Security Officer & Data Privacy Officer Federal Reserve Bank of Richmond March 14, 2018

More information

VIACOM INC. PRIVACY SHIELD PRIVACY POLICY

VIACOM INC. PRIVACY SHIELD PRIVACY POLICY VIACOM INC. PRIVACY SHIELD PRIVACY POLICY Last Modified and Effective as of October 23, 2017 Viacom respects individuals privacy, and strives to collect, use and disclose personal information in a manner

More information

Content. Privacy Policy

Content. Privacy Policy Content 1. Introduction...2 2. Scope...2 3. Application...3 4. Information Required...3 5. The Use of Personal Information...3 6. Third Parties...4 7. Security...5 8. Updating Client s Information...5

More information

HIPAA COMPLIANCE AND DATA PROTECTION Page 1

HIPAA COMPLIANCE AND DATA PROTECTION Page 1 HIPAA COMPLIANCE AND DATA PROTECTION info@resultstechnology.com 877.435.8877 Page 1 CONTENTS Introduction..... 3 The HIPAA Security Rule... 4 The HIPAA Omnibus Rule... 6 HIPAA Compliance and RESULTS Cloud

More information

Village Software. Security Assessment Report

Village Software. Security Assessment Report Village Software Security Assessment Report Version 1.0 January 25, 2019 Prepared by Manuel Acevedo Helpful Village Security Assessment Report! 1 of! 11 Version 1.0 Table of Contents Executive Summary

More information

Securing the Internet of Things (IoT) at the U.S. Department of Veterans Affairs

Securing the Internet of Things (IoT) at the U.S. Department of Veterans Affairs Securing the Internet of Things (IoT) at the U.S. Department of Veterans Affairs Dominic Cussatt Acting Deputy Assistant Secretary / Chief Information Security Officer (CISO) February 20, 2017 The Cyber

More information

HIPAA AND SECURITY. For Healthcare Organizations

HIPAA AND  SECURITY. For Healthcare Organizations HIPAA AND EMAIL SECURITY For Healthcare Organizations Table of content Protecting patient information 03 Who is affected by HIPAA? 06 Why should healthcare 07 providers care? Email security & HIPPA 08

More information

Information Privacy Statement

Information Privacy Statement Information Privacy Statement Commitment to Privacy The University of Florida values individuals' privacy and actively seeks to preserve the privacy rights of those who share information with us. Your

More information

EU GDPR and . The complete text of the EU GDPR can be found at What is GDPR?

EU GDPR and  . The complete text of the EU GDPR can be found at  What is GDPR? EU GDPR and Email The EU General Data Protection Regulation (GDPR) is the new legal framework governing the use of the personal data of European Union (EU) citizens across all EU markets. It replaces existing

More information

Electronic Communication of Personal Health Information

Electronic Communication of Personal Health Information Electronic Communication of Personal Health Information A presentation to the Porcupine Health Unit (Timmins, Ontario) May 11 th, 2017 Nicole Minutti, Health Policy Analyst Agenda 1. Protecting Privacy

More information