
1 Privacy Preserving Data Sanitization and Publishing
by
A. N. K. Zaman
A Thesis presented to The University of Guelph
In partial fulfillment of requirements for the degree of Doctor of Philosophy in Computer Science
Guelph, Ontario, Canada
© A. N. K. Zaman, December, 2017

2 ABSTRACT
Privacy Preserving Data Sanitization and Publishing
A. N. K. Zaman, University of Guelph, 2017
Advisor: Dr. Charlie Obimbo
Recent trends have shown a drastic increase in the large data repositories held by corporations, governments, and healthcare organizations. According to Bernard Marr of Forbes Tech magazine (2015), the data created in 2014/15 alone was twice that created in the entire previous history of the human race. Data sharing is beneficial in areas such as healthcare services and collaborative research. However, there is a significant risk of compromising sensitive information, for example through de-anonymization. Privacy Preserving Data Publishing (PPDP) is a way to allow one to share sanitized data while ensuring protection against identity disclosure of an individual. Removing explicit identifiers/personally identifiable information (PII) from a data set and making the data set compliant with the Health Insurance Portability and Accountability Act (HIPAA) does not guarantee the privacy of data donors. Data sanitization may be achieved in different ways, by k-anonymization, l-diversity, or δ-presence, to name but a few; however, the differential privacy paradigm provides the strongest privacy guarantee for sanitized data publishing. This research proposes

3 two privacy preserving algorithms that satisfy the ε-differential privacy requirement and adopt the non-interactive privacy model for sanitizing and publishing data. Along with differential privacy, generalization and suppression of attributes are applied to impose privacy and to prevent re-identification of the records of a data set. The key contributions of this thesis are: 1) the proposed algorithm adopts the non-interactive model for data publishing; as a result, data miners have full access to the published data set for further processing, which promotes data sharing in a safe way; 2) the algorithm can sanitize micro-data and/or HIPAA-compliant data sets for publishing; 3) the published data is independent of an adversary's background knowledge; 4) the algorithm is independent of the choice of quasi-identifiers (QIDs); and finally, 5) it protects the published data set from re-identification risk. Data sanitized and published using the proposed algorithm is shown to have higher usability, in terms of classification accuracy, than in other existing works, and a significantly reduced risk of re-identification.

4 Dedication
Dedicated to those who lost their lives and those who survived but were left physically and/or mentally handicapped for the rest of their lives.
The Rana Plaza Tragedy (2013) * Savar * Dhaka * Bangladesh iv

5 Acknowledgements
In the name of God, the Most Beneficent, the Most Merciful. First and foremost, I would like to express my sincere appreciation and gratitude to my advisor, Dr. Charlie Obimbo, who always offered valuable support, understanding, and encouragement. His enthusiasm inspired me to research and write this thesis, and I will forever be grateful to him. I would also like to extend my sincere gratitude to Dr. Rozita Dara for her advice and support during this journey. I am sincerely grateful to the members of my advisory committee, Dr. David Chiu and Dr. Radu Muresan, who provided me with their feedback throughout this journey. Next, I would like to extend my admiration to my loving wife, Majida, my beloved son, Rafan, and my loving daughter, Safa, for their constant love, encouragement, support, and sacrifice. Finally, I would like to express my gratitude to my caring mother, Mabia Khatun, for her love, well wishes, and inspiration. I also would like to remember my beloved father, the late Mr. Saifuddin Ahmad, for his love and courage throughout my entire life. v

6 Contents
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization
  1.4 Publications Related to This Thesis
2 Literature Review
  2.1 Introduction
  2.2 Preliminaries
  2.3 Privacy Models and Different Attacks
  2.4 Record Linkage Attack
    2.4.1 k-Anonymity
    2.4.2 (X, Y)-Anonymity
    2.4.3 MultiRelational k-Anonymity
    2.4.4 Discussion
  2.5 Attribute Linkage Attacks
    2.5.1 l-Diversity
    2.5.2 t-Closeness
    2.5.3 Confidence Bounding Attack
  2.6 Table Linkage Attacks
    2.6.1 δ-Presence
  2.7 Probabilistic Attack
    2.7.1 (c, t)-Isolation
    2.7.2 (d, λ)-Privacy
    2.7.3 ε-Differential Privacy
  2.8 Anonymization Mechanisms
    2.8.1 Generalization
    2.8.2 Suppression
  2.9 Bucketization
  2.10 Perturbation
  2.11 Conclusion
3 Methodology
  3.1 Proposed System and Experimental Design
    3.1.1 Privacy Constraint
    3.1.2 Laplace Mechanism
    3.1.3 Anonymization
    3.1.4 Data Flow Diagram of the Proposed System
  3.2 Utility Measures
    3.2.1 Classification Accuracy
    3.2.2 Re-identification Risk
  3.3 Conclusion
4 Sanitizing and Publishing Electronic Health Record
  4.1 Introduction
  4.2 Problem Definition
  4.3 Related Work
  4.4 Proposed Algorithm
  4.5 Working Example
  4.6 Data Sets
    4.6.1 Data Set Preprocessing
  4.7 Results and Discussions
    4.7.1 Risk of Re-identification
    4.7.2 Scalability
  4.8 Conclusion
5 Sanitizing and Publishing Real-World Data Set
  5.1 Introduction
  5.2 Problem Definition
  5.3 Related Works
  5.4 Proposed Algorithm
  5.5 Working Example
  5.6 Data Sets
  5.7 Result and Discussion
    5.7.1 Risk of Re-identification
    5.7.2 Scalability
  5.8 Conclusion
6 Conclusion and Future Work
  6.1 Limitations of Existing Systems
  6.2 Summary of Contributions
  6.3 Future Work
A Mathematical Symbols Used in Thesis 117

11 List of Tables
2.1 Examples of Explicit Identifiers, QIDs, and Sensitive Attributes
2.2 Patient Table
2.3 External Table Contains Person Specific Data
2.4 3-Anonymous Patient Table
2.5 A published patient data table T1
2.6 A published patient data table T2
2.7 Data table formed by joining T1 and T2
2.8 Patients Micro Data
2.9 Patients Generalized Data
2.10 3-anonymous Patient Table
2.11 Different privacy preserving algorithms and attacks [33][41][97]
4.1 Sample small data set
4.2 Anonymized form of the sample data set
4.3 Anonymized form of the sample data set
4.4 Noisy frequencies for the sanitized data
4.5 Data Set Descriptions
4.6 Attributes of the Doctor's Bills Data Set V1
4.7 Attributes of the Haberman's Survival Data Set
4.8 Classification Accuracy for the Doctor's Bill V1 Data Set
4.9 Classification Accuracy for the Haberman's Survival Data Set
4.10 Comparison of the re-identification risk between sanitized and non-sanitized data sets
5.1 Anonymized form of the sample data set with group frequencies
5.2 Noisy frequencies for the sanitized data
5.3 Attributes of the Adult Data Set
5.4 Attributes of the Doctor's Bills Data Set V2
5.5 Classification accuracy using the Decision Tree classifier for the Adult Data Set
5.6 Classification Accuracy for the Doctor's Bill V2 Data Set
5.7 Comparison of the re-identification risk between sanitized and non-sanitized data sets

13 List of Figures
2.1 Data collection, anonymization, and application areas
2.2 Presenting quasi-identifiers, linked to re-identify personal data [1][82]
2.3 Taxonomy trees for profession, gender, and age
3.1 Suppression of Zip Codes of two German Cities
3.2 Data Flow of the Proposed Algorithms
4.1 A sample Doctor's Bill from the Data Set
4.2 Classification Accuracy for the Doctor's Bill V1 Data Set
4.3 Classification Accuracy for the Haberman's Survival Data Set
4.4 Comparisons among the proposed algorithm and five other algorithms
4.5 Risk of Re-identification for the Raw Doctor's Bill V1 Data Set
4.6 Risk of Re-identification for the Sanitized Doctor's Bill V1 Data Set
4.7 Risk of Re-identification for the Raw Haberman's Survival Data Set
4.8 Risk of Re-identification for the Sanitized Haberman's Survival Data Set
4.9 Runtime for the 2LPP Algorithm
5.1 Classification Accuracy for the Adult Data Set
5.2 Classification Accuracy for the Doctor's Bill V2 Data Set
5.3 Comparisons among the proposed algorithm and five other algorithms
5.4 Risk of Re-identification for the Raw Doctor's Bill V2 Data Set
5.5 Risk of Re-identification for the Sanitized Doctor's Bill V2 Data Set
5.6 Risk of Re-identification for the Raw Adult Data Set
5.7 Risk of Re-identification for the Sanitized Adult Data Set
5.8 Runtime for the ADiffP Algorithm

15 Chapter 1 Introduction
The huge increase in large data repositories held by corporations, governments, and healthcare organizations has given impetus to the development of information-based decision-making systems. Various interested parties mine trends and patterns from these data sets to improve and design customer services. As a result, data sharing is essential. However, data custodians have legal and ethical responsibilities to maintain the privacy of the data donors.
1.1 Motivation
Data breaches have also increased tremendously, which is not only alarming but has also affected personal lives, governments, and businesses in many ways. Some of the effects include identity theft, financial losses, and interference with political elections. According to the Verizon Data Breach Investigations Report [89], in 2016 there were 3,141 confirmed cases of data breaches. In a recent data-breach case, a malicious user publicly re- 1

16 leased personally identifiable information (PII) of 112,000 French police officers on Google Drive [14]. The revelation of indirect information such as postal code, gender, and race can also make a person vulnerable to exposure by an intruder; such attributes are called quasi-identifiers (QIDs). Data breaches occur in all areas, such as healthcare, academia, banking, and retail; however, our focus will be on healthcare data. This research proposes a privacy preserving algorithm to publish sanitized data in order to promote data sharing for designing and implementing public-spirited policies that expedite effective services and development. Removing explicit identifiers from a data set and making the data set compliant with the Health Insurance Portability and Accountability Act (HIPAA) [43] or a similar regulation does not guarantee the privacy of data donors. To extract knowledge from data, different parties such as researchers and marketers need to process and share data for their own benefit. Data sharing methods and the use of the shared data among interested parties are controlled by certain guidelines and policies. To protect data donors' privacy and to prevent misuse of data, removing identifying attributes such as the names, social insurance numbers, and addresses of individuals is a common practice before releasing any data. However, this simplified method is not adequate to ensure the privacy of record owners/donors. The following section presents some real-world examples to highlight the necessity of privacy preserving methods and to clarify the obstacles to developing such techniques to preserve person-specific data privacy. 2

17 Montjoye et al. [29] studied three months of credit card records of 1.1 million individuals and uniquely identified 90% of the record owners by analyzing the spatiotemporal information. They also reported that knowing the exact price of an item increases the re-identification risk by 22%, and that women are more identifiable than men from credit card metadata. A person's credit card buying pattern thus makes his/her privacy vulnerable. Another example concerns the de-identification [47] of the Resident Registration Number (RRN) of South Koreans. The RRN is a 13-digit number that encodes demographic information, and its pattern is publicly known. Sweeney and Yoo [83] conducted an experiment on 23,163 prescription records that contained weakly encrypted RRN codes. The authors reported that they were able to de-anonymize 100% of the data, and concluded that encrypted national identifiers are also vulnerable. In a similar study, Song et al. [79] showed that improper use of the RRN makes Korean individuals vulnerable. In 2013, Sweeney [81] collected a health data set for the year 2011 from Washington State that did not contain patients' names or addresses (zip codes). However, the author linked newspaper stories from the same year containing the keyword "hospitalized" to the data set and was able to identify 43% of the individuals in it. Earlier, Sweeney [80] presented an attack that breaks person-specific privacy by linking the medical data (of state employees) collected by the Group Insurance Commission (GIC), Massachusetts, US, with the Massachusetts voter registration list. A medical data set was distributed by the GIC to researchers that contained demographic information such as 3

18 gender, postal code, and date of birth. A copy of the voter registration list of Massachusetts was bought by the author and then combined with the GIC health data set. She was able to identify the former governor of the state of Massachusetts, William Weld. Sweeney showed that, on the basis of gender, 5-digit postal code, and date of birth, 87% of the U.S. population is unique; i.e., only 13% of the population share their combination of zip code, gender, and date of birth with someone else. An attack that uses an external data set to identify a person in an anonymized data set is referred to as a linking attack. These kinds of attacks have become widespread and are a source of concern, since it is now fairly easy to collect external data from the internet. In 2006, a compressed text file containing twenty million keywords from the search history of more than 650,000 AOL users over a three-month time slot was released by AOL Research [12]. A numeric key was assigned as an ID for every searcher; however, a 62-year-old widow from Georgia, Thelma Arnold, was identified by The New York Times from the queries associated with her numeric ID. In this case, the metadata of the data set (here, the search keywords) disclosed the identity of the user. The search keywords included "landscapers in Lilburn, GA" and searches for a number of persons having the last name Arnold. Another query, "homes sold in Shadow Lake Subdivision Gwinnett county Georgia", helped to identify Thelma. Netflix, one of the largest movie rental companies in the world, also once made its users vulnerable. Netflix released a data set [68] of anonymous movie ratings from 500,000 of its subscribers, referred to as the Netflix Prize data set. According to the Netflix website, "To protect customer privacy, all personal information identifying individual customers 4

19 have been removed and all customer IDs have been replaced by randomly-assigned IDs." Narayanan and Shmatikov [65] applied a de-anonymization technique to the Netflix data using background knowledge from the Internet Movie Database (IMDb) site, where users post non-anonymous reviews. They were able to identify 99% of the users precisely. From the examples and discussion above, it is clear that the mere removal of person-specific information does not guarantee the privacy of a data donor. Robust data sanitization techniques are needed to preserve person-specific privacy while keeping the data useful for knowledge mining. Privacy preserving data publishing [70] is important for the following reasons:
- To adhere to legal obligations to prevent data breaches by following laws such as the Health Insurance Portability and Accountability Act (HIPAA) in the U.S. [43] and the Personal Health Information Protection Act (PHIPA) [2] in Canada.
- To share data among organizations and their partners without disclosing the privacy of any individual.
1.2 Contributions
This research has the following key contributions: Whereas current systems are interactive, in other words users have to query the data set and await the response, sometimes being limited in the response they get, 5

20 the system built adopts a non-interactive model for data sanitization and release, so that data miners have complete access to the sanitized, released data for further processing. The proposed algorithm fulfills ε-differential privacy [33], and Laplace noise is added to sanitize data sets. Differential privacy prevents an attacker from learning any information associated with a particular person in a data set. In Chapter 3, ε-differential privacy is discussed in detail. Generalization and suppression techniques are applied to achieve anonymization; this helps prevent association of the sanitized data set with external data sets (e.g., data from a social network). Generalization is done by substituting an original attribute value with a more generalized form of that value according to the characteristics of that attribute (e.g., age 47 may be substituted by the range 45-50). A suppression operation replaces an attribute value fully or partially by a special symbol (e.g., * or Any), which indicates that the value has been suppressed. Any data set sanitized with the proposed algorithm will be free from re-identification using quasi-identifiers (QIDs). QIDs are a set of attributes in a data set that are used to identify an individual with the help of external knowledge (please see Figure 2.2). The proposed algorithms do not treat any particular attributes as QIDs, in order to avoid syntactic processing of a data set. 6

21 The proposed algorithm can handle real-life data sets containing categorical, numerical, and set-valued attributes. Data sets sanitized and published using the proposed algorithm remain usable for classification. The proposed algorithm de-identifies a data set in a secure way, so that the risk of re-identification is very low, which means the data set is safe to publish.
1.3 Organization
The rest of the thesis is organized as follows: Chapter 2 presents a literature review of the area of privacy preserving data publishing (PPDP), including existing privacy models, distinct privacy preserving algorithms, various types of attacks, and privacy breaches. Chapter 3 presents the theoretical background of the differential privacy paradigm and Laplace noise. It also discusses the techniques used to measure the data usability of the sanitized and published data sets. Chapter 4 presents the proposed two-layer privacy preserving (2LPP) algorithm for sanitizing and publishing health care data sets. The usability of the sanitized and published data sets is also presented there. 7

22 Chapter 5 presents the proposed adaptive differential privacy (ADiffP) algorithm for sanitizing and publishing a census data set and another data set that contains a set-valued attribute. The usability of the sanitized and published data sets is also presented there. Chapter 6 presents the concluding remarks of this thesis and suggests some future directions for this research.
1.4 Publications Related to This Thesis
1. A. N. K. Zaman, C. Obimbo, and R. A. Dara, "An improved differential privacy algorithm to protect re-identification of data," in Proceedings of the IEEE Canada International Humanitarian Technology Conference (IHTC 2017), Toronto, Ontario, Canada, July 20-22, 2017, pp. (Best Paper Award)
2. A. N. K. Zaman, C. Obimbo, and R. A. Dara, "An Improved Data Sanitization Algorithm for Privacy Preserving Medical Data Publishing," in Proceedings of the Advances in Artificial Intelligence: 30th Canadian Conference on Artificial Intelligence, Canadian AI 2017, Edmonton, AB, Canada, May 16-19, 2017, pp. Cham: Springer International Publishing.
3. A. N. K. Zaman, C. Obimbo, and R. A. Dara, "A novel differential privacy approach that enhances classification accuracy," in Proceedings of the Ninth International C* 8

23 Conference on Computer Science & Software Engineering, ACM C3S2E '16, Porto, Portugal, July 20-22, 2016, pp.
4. A. N. K. Zaman and C. Obimbo, "Privacy preserving data publishing: A classification perspective," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 5, no. 9, pp.
5. A. N. K. Zaman, C. Obimbo, R. A. Dara, and David Chiu, "Minimizing re-identification risk of personal medical data," IEEE Consumer Electronics Magazine, Special Issue on Humanitarian Technology. (To be Submitted) 9

24 Chapter 2 Literature Review
2.1 Introduction
Over the past decade, the rate at which governments and corporations have collected their citizens' and customers' data containing private information has grown exponentially. These data create opportunities for developing knowledge- and information-based decision making systems by means of data mining. Thus, the publication of these data enables them to be shared with various parties. For example, all California-based licensed hospitals have to submit person-specific data (date of birth, admission and release dates, Zip code, principal language spoken, etc.) of all discharged patients to the California Health Facilities Commission, which makes that data available to interested parties (e.g., insurers, researchers) to promote Equitable Healthcare Accessibility for California [11]. In 2004, the Information Technology Advisory Committee of the President of the United States 10

25 published a report titled Revolutionizing Health Care through Information Technology [26], which emphasized the importance of implementing a nationwide electronic medical record (EMR) system to promote and encourage medical knowledge sharing through computerized clinical decision support systems. Publishing data is beneficial in many other areas. As discussed earlier (in Chapter 1), in 2006 Netflix (an online DVD rental company) published a movie-ratings data set of 500,000 subscribers to encourage research on improving movie recommendation accuracy on the basis of personal movie preferences [68]. In October 2012, the Canadian and United States governments started a pilot project called the Entry/Exit pilot project [20]. The intent of this project is to share the biographic data of travelers who cross the USA/Canada border between the Canada Border Services Agency (CBSA) and the Department of Homeland Security (DHS). This is an example of data sharing between two governments. In general, an individual may not be willing to share his/her personal information with another person, group, or society due to his/her privacy concerns. Privacy should be considered a privilege, so that an individual is able to prevent his/her information from becoming public. Privacy has many aspects, such as physical, organizational, intellectual, and informational. This thesis deals with informational privacy related to personal data. In 1890, Warren and Brandeis [92] published their concerns about privacy in response to technological improvements in photography and faster newspaper printing. The authors regarded privacy as related to the inviolate personality of an individual. Nowadays, privacy is 11

26 regarded as a basic human right; however, the notion of privacy varies in different contexts. Here are a few definitions of privacy given by leading researchers in this area: Westin wrote in [93]: Privacy is the claim of individuals, groups or institutions to determine for themselves when, how, and to what extent information about them is communicated to others. Gavison wrote in [42]: A loss of privacy occurs as others obtain information about an individual, pay attention to him, or gain access to him. These three elements of secrecy, anonymity, and solitude are distinct and independent, but interrelated, and the complex concept of privacy is richer than any definition centered around only one of them. Barth et al., in [13]: Privacy as a right to appropriate flows of personal information. Bertino et al., in [16]: The right of an entity to be secure from unauthorized disclosure of sensitive information that are contained in an electronic repository or that can be derived as aggregate and complex information from data stored in an electronic repository. The above definitions focus on the concept of privacy as release of information in a controlled way. One can summarize this as privacy determines what type of personal information should be released and which group or person can access and use it. For the purposes of this Thesis, we give a few relevant definitions: Definition 2.1. Privacy Preserving Data Publishing (PPDP) encompasses privacy models and techniques, which allow one to share anonymous data to ensure protection against identity disclosure. Data anonymization is a technique for PPDP, which makes sure the 12

27 published data is practically useful for processing (mining) while preserving individuals' sensitive information [40]. Classification is a fundamental problem in statistics, machine learning, and pattern recognition.
Definition 2.2. Let a data set have N classifiable attributes and a set L of labels. The task of classification can be defined as the assignment of a specific label L_i ∈ L to every attribute in a consistent, predefined way, so that data groups are identified according to their common attributes/characteristics [51].
Definition 2.3. Differential privacy is a privacy model that ensures the highest level of privacy for a record owner while providing actual information about the data set.
The definitions above will be discussed in detail in later chapters. The following sections discuss the fundamental ideas of privacy preserving models, different attacks, and their classifications.
2.2 Preliminaries
Researchers in the data mining, statistics, database, and security communities have worked on the privacy of data for the last few decades [41][5]. The task of preserving the privacy of a data set can be categorized [41][5][48] as:
Interactive frameworks 13

28 Non-interactive frameworks
In the interactive framework, a privacy-preserving mechanism resides between the users' and/or researchers' queries and a raw data set. The queries and/or their responses are evaluated by the privacy preserving mechanism to guarantee privacy. Examples of query responses in interactive frameworks are SUM, COUNT, etc., so this method is also called statistical disclosure control (SDC). The interactive framework encompasses two different techniques: query auditing and perturbation of outputs. In the case of query auditing, if the response to a query would disclose any sensitive information, then the query is denied; otherwise, the exact answer is disclosed. With output perturbation, on the other hand, the privacy mechanism alters the exact answer of a query into a perturbed form (e.g., by the addition of noise) for publication. In the non-interactive paradigm, sanitization (e.g., anonymization) is applied to the raw data set to make it anonymous and preserve individuals' privacy, and the altered data is then published for analysis or processing. As soon as the data is published, the publisher has no further control over the data set. The non-interactive privacy preserving framework is also called PPDP. Existing works on data privacy can also be categorized in two ways:
Centralized model, and
Distributed model
In the centralized approach, a single owner publishes the data set, and the key challenge is to alter the data to preserve privacy and to process the modified data set to 14

29 mine results. One of the most common methods used in the centralized approach is randomization [6][35]. The key idea behind the distributed approach is that different parties are willing to collaborate to obtain aggregate results, but they do not trust each other enough to share their own data sets. The main challenge is to execute the multiparty computation while preserving the privacy of the data and of the inputs and outputs. This paradigm is called Secure Multiparty Computation (SMC). Privacy and correctness are the two main requirements for SMC [15][17]. PPDP plays a leading role in many application areas, including:
Microdata publishing (for sharing, classification, etc.)
Data outsourcing (for cloud computing)
Collaborative computing (e.g., multiparty computations)
Mobile computing (e.g., location privacy and location-based service quality), etc.
Figure 2.1 presents the data flow diagram of a PPDP system, from data collection through the application phases. Figure 2.1: Data collection, anonymization, and application areas 15
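To make the contrast between the interactive and non-interactive frameworks described above concrete, here is a minimal, illustrative Python sketch (not part of the thesis; the toy table, the ε value, and the 5-year age buckets are all assumptions chosen for the example). The interactive path answers a COUNT query with a Laplace-perturbed output, while the non-interactive path releases a coarsened copy of the whole table once.

```python
import math
import random

# Toy raw data set: each record is (age, disease). In a real deployment this
# would be the custodian's private table.
RAW = [(34, "Hepatitis B"), (39, "Hepatitis B"), (37, "Influenza"),
       (31, "Dengue"), (31, "Influenza"), (31, "Influenza")]

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of Laplace(0, scale) noise.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Interactive (statistical disclosure control) style: the raw data never leaves
# the curator; each query receives a perturbed answer.
def interactive_count(predicate, epsilon: float = 0.5) -> float:
    true_count = sum(1 for rec in RAW if predicate(rec))
    return true_count + laplace_noise(1.0 / epsilon)  # sensitivity of COUNT is 1

# Non-interactive (PPDP) style: a sanitized copy is released once and analysts
# get full access to it; here "sanitization" is simply coarsening the age.
def noninteractive_release():
    return [(f"[{5 * (age // 5)}-{5 * (age // 5) + 5})", disease)
            for age, disease in RAW]

if __name__ == "__main__":
    print("noisy COUNT(disease == Influenza):",
          interactive_count(lambda rec: rec[1] == "Influenza"))
    print("released table:", noninteractive_release())
```

In the non-interactive case the publisher has no further control once the sanitized table is out, which is exactly why the sanitization step itself must carry the privacy guarantee.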

30 When an anonymous data set is published, it is expected to be used by researchers for lawful data analysis. However, there is a high risk that illegitimate users could analyze the published data and discover someone's personal sensitive information. For example, in 2013 the news about PRISM was made public by Edward Snowden, a former National Security Agency (NSA) technology contractor. PRISM is the code name of a data mining program used by the NSA to access the servers of big technology companies such as Google, Yahoo, Skype, and Facebook in order to collect users' data, including messages, log-in activity, voice and video chat, etc. [58]. A data publisher needs a robust data anonymization algorithm that can protect against different attacks as well as keep the data useful for further processing. It is important to note a few definitions [19] which will be useful in the following sections:
Definition 2.4. An identifier that helps to recognize an individual explicitly in a data collection using a set of attributes is called an explicit identifier, e.g., social insurance number (SIN) and name.
Definition 2.5. If the values of a set of attributes can be linked to locate or identify a person in a data set, then these attributes are called quasi-identifiers (QIDs), e.g., postal code, gender, and date of birth.
Definition 2.6. Some attributes are considered sensitive and person-specific; these attributes are called sensitive attributes, e.g., salary, disease, and disability status. All other attributes are considered non-sensitive.
In this document, the term victim refers to an individual (data donor/owner) who is targeted by an attacker. Table 2.1 presents examples of the above-mentioned terms. 16

31 Table 2.1: Examples of Explicit Identifiers, QIDs, and Sensitive Attributes
(Explicit identifiers: Name, Social Insurance Number (SIN); quasi-identifiers (QID): Date of Birth (dd/mm/yy), Gender, Zip Code; sensitive attribute: Disease)
Name | SIN | Date of Birth (dd/mm/yy) | Gender | Zip Code | Disease
Ruby | | /11/74 | Female | | Dengue
Jenny | | /11/84 | Female | | Flu
Dan | | /12/89 | Male | | Cancer
Ella | | /09/81 | Female | | Broken Leg
Max | | /02/85 | Male | | Asthma
Figure 2.2 shows how QIDs can be linked with external data to identify an individual. In this example, a medical data set is linked to a voter list to identify a targeted individual.
2.3 Privacy Models and Different Attacks
Before discussing the privacy models and different attacks, it is necessary to know the definition of privacy protection [41][28]: privacy control is effectively implemented for a published data set if an attacker who has full access to the published data set, and who also has background knowledge (from different sources) about a person in that data set, is still not able to find the targeted person. The following sections discuss different privacy preserving models and attacks. 17

32 Figure 2.2: Presenting quasi-identifiers, linked to re-identify personal data [1][82] 18
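As a concrete illustration of the linkage shown in Figure 2.2, the following minimal Python sketch (illustrative only, not part of the thesis; the two toy tables and the attribute names are assumptions mirroring Tables 2.2 and 2.3) joins a published, de-identified medical table with an external, identified table on the shared quasi-identifiers in order to re-identify individuals.

```python
# Published table: explicit identifiers removed, QIDs and sensitive value kept.
published = [
    {"job": "Professor", "sex": "Male", "age": 37, "disease": "Influenza"},
    {"job": "Pilot", "sex": "Male", "age": 34, "disease": "Hepatitis B"},
    {"job": "Singer", "sex": "Female", "age": 31, "disease": "Influenza"},
]

# External, identified table (e.g., a voter list or public directory).
external = [
    {"name": "Max", "job": "Professor", "sex": "Male", "age": 37},
    {"name": "Bobby", "job": "Pilot", "sex": "Male", "age": 34},
    {"name": "Lolo", "job": "Singer", "sex": "Female", "age": 31},
]

QIDS = ("job", "sex", "age")

def link(published_rows, external_rows, qids=QIDS):
    """Join the two tables on the quasi-identifiers (a record linkage attack)."""
    index = {}
    for row in published_rows:
        index.setdefault(tuple(row[q] for q in qids), []).append(row)
    matches = []
    for person in external_rows:
        for row in index.get(tuple(person[q] for q in qids), []):
            matches.append((person["name"], row["disease"]))
    return matches

print(link(published, external))
# A unique QID combination (e.g., Professor/Male/37) re-identifies Max's disease.
```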

33 2.4 Record Linkage Attack
In a record linkage attack [23][94][24][25], an attacker might have some auxiliary knowledge about an individual from other sources, such as a telephone directory. Let T be a published table, let QID represent the quasi-identifier attributes of T, and let qid be a value of QID shared by only a small number of records in T. The records of T that share the same value qid of QID (qid ∈ QID) form a group; if that group is small and the victim is known to belong to it, the attacker can easily identify the victim. In this way, a record linkage attack makes data donors vulnerable. Tables 2.2, 2.3, and 2.4 present examples of various attacks.
Table 2.2: Patient Table
Job | Sex | Age | Disease
Pilot | Male | 34 | Hepatitis B
Pilot | Male | 39 | Hepatitis B
Professor | Male | 37 | Influenza
Filmmaker | Female | 31 | Dengue
Filmmaker | Female | 31 | Influenza
Singer | Female | 31 | Influenza
Singer | Female | 31 | Influenza
Let us consider a hospital that released Table 2.2 for research purposes. If an attacker has access to Table 2.3 and (s)he knows the victim, then by combining the two tables the attacker can easily identify the victim's disease. In both Tables 2.2 and 2.3, there are common 19

34 Table 2.3: External Table Contains Person Specific Data
Name | Job | Sex | Age
Cindy | Filmmaker | Female | 31
Lolo | Singer | Female | 31
Kim | Filmmaker | Female | 31
Ruby | Singer | Female | 33
Sara | Singer | Female | 31
Bobby | Pilot | Male | 34
Max | Professor | Male | 37
Peter | Pilot | Male | 39
Joe | Professor | Male | 39
Table 2.4: 3-Anonymous Patient Table
Job | Sex | Age | Disease
Artist | Female | [30-35) | Dengue
Artist | Female | [30-35) | Influenza
Artist | Female | [30-35) | Influenza
Artist | Female | [30-35) | Influenza
Professional | Male | [35-40) | Hepatitis B
Professional | Male | [35-40) | Hepatitis B
Professional | Male | [35-40) | Influenza
20

35 attributes: job, sex, and age. For example, Max, a male professor who is 37 years old, is identified as an Influenza patient by qid = (Professor, Male, 37) by joining the two given tables. This is an example of a record linkage attack. k-anonymity [76][80] is a technique that protects against record linkage attacks.
2.4.1 k-Anonymity
The idea of k-anonymity was introduced by Samarati and Sweeney [76][80] to protect the privacy of data donors against record linkage using QIDs. Sweeney explains k-anonymity [71][80] as "The information for each person contained in the released table cannot be distinguished from at least k − 1 individuals whose information also appears in the release." In other words, at least k records must share the same quasi-identifier values in the publicly released table T. Table 2.4 represents a 3-anonymous table obtained by generalization (publishing more general values of the attributes) of the QIDs. Table 2.2 contains person-specific information; however, Table 2.4 contains no person-specific information, as it is generalized according to k-anonymity and at least three records share the same QID. Taxonomy trees of the attributes profession, age, and sex used for the generalization of Table 2.2 are given in Figure 2.3.
2.4.2 (X, Y)-Anonymity
To overcome the limitations of k-anonymity and to ease sequential data release, the idea of (X, Y)-Anonymity was introduced by Wang and Fung [90]. The sequential release 21

36 Figure 2.3: Taxonomy trees for profession, gender, and age 22

37 of a data set is a way to release different attributes as subsets in a sequential manner. For example, in Table 2.5 a data publisher published a table T1, and afterwards the publisher decided to publish another table T2 (see Table 2.6) of the same data set for classification analysis. The column Pid (person identifier) is added for the sake of discussion, not for publication. According to the privacy requirement, a data donor's record should not be identifiable by an attacker from a published table. However, if an attacker joins tables T1 and T2, he can identify the (Sam, Dengue) pair by matching names and diseases, as this group has size 1. In the same way, the attacker is able to infer (Jay, Flu) with 100% confidence for the group of persons named Jay. From the above discussion, it follows that the sequential publication of data makes individuals' privacy vulnerable.
Table 2.5: A published patient data table T1
Pid | Job | Disease
1 | Driver | Flu
2 | Driver | Flu
3 | Chef | Dengue
4 | Teacher | Asthma
5 | Pilot | Dengue
In (X, Y)-anonymity, X and Y represent two disjoint sets of attributes. According to (X, Y)-anonymity, at least k different values of Y are linked with every value of X. k-anonymity can be expressed as a special form of (X, Y)-anonymity, 23

38 Table 2.6: A published patient data table T2
Pid | Name | Job | Class
1 | Jay | Driver | CL1
2 | Jay | Driver | CL1
3 | Sam | Chef | CL2
4 | Sam | Teacher | CL3
5 | Rosy | Pilot | CL4
Table 2.7: Data table formed by joining T1 and T2 (T3, after joining the above tables on T1.Job = T2.Job)
Pid | Name | Job | Disease | Class
1 | Jay | Driver | Flu | CL1
2 | Jay | Driver | Flu | CL1
3 | Sam | Chef | Dengue | CL2
4 | Sam | Teacher | Asthma | CL3
5 | Rosy | Pilot | Dengue | CL4
24

39 where X is a QID and Y is a key in table T that identifies record owners uniquely. (X, Y)-anonymity provides a uniform and flexible way to limit the linkability between the attributes of X and Y when tables are joined.
2.4.3 MultiRelational k-Anonymity
One of the major limitations of the k-anonymity algorithm is that it only deals with a single data table. To overcome this limitation, Nergiz et al. [67] proposed the MultiR anonymity (multi-relational anonymity) algorithm to achieve privacy while publishing data from a data set that consists of multiple tables. The MultiR anonymity algorithm represents a relational database using PT, which stands for a person-specific table, and a collection of n tables T1, T2, ..., Tn, where Pid stands for the person identifier. Sensitive attributes are contained in PT; each table Ti, on the other hand, contains foreign keys, QID attributes, and other sensitive attributes. If all the given tables are joined together as PT ⋈ T1 ⋈ ... ⋈ Tn, MultiR anonymity ensures that, for each record owner (each group of tuples) RO sharing a QID, the same QID belongs to at least k − 1 other record owners. MultiR anonymity can be expressed as (X, Y)-anonymity with X = QID and Y = Pid.
2.4.4 Discussion
Anonymity-based techniques such as MultiR k-anonymity, k-anonymity, and (X, Y)-anonymity are proposed to protect against record linkage attacks by making data donors' information anonymous in a data set. However, an attacker is still able to locate a targeted 25

40 owner of a record without identifying her/his record precisely. For example, in Table 2.4, if an attacker targets the arrangement qid = (Artist, Female, [30-35)), the chance that the targeted person has Influenza is 75%, because, out of the 4 records in this group, the owners of 3 have Influenza. The algorithms designed so far to protect against record linkage attacks are therefore not sufficient to preserve the privacy of record owners. As a result, researchers have proposed other privacy preserving models to protect against attribute linkage.
2.5 Attribute Linkage Attacks
If, in a k-anonymous table T, a collection of similar attribute values forms a prominent group, then an attacker can easily locate the targeted record holder belonging to that group with high confidence [40]. For example, Table 2.4 represents a 3-anonymous table; however, an attacker can easily conclude with 100% confidence that Max is a professor who has Influenza, as (Professor, Male, 37) → Influenza. The attacker used his knowledge to locate Max from Tables 2.2 and 2.3. The following sections discuss algorithms to prevent attribute linkage attacks.
2.5.1 l-Diversity
In order to protect against attribute linkage, a privacy-preserving algorithm called l-diversity was proposed by Machanavajjhala et al. [63]. Even if a data table is k-anonymous, a lack of diversity in a QID group leaks information about data donors. Machanavajjhala et al. [63] showed that the background knowledge of a malicious user makes data donors vulnerable. 26
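The 75% homogeneity figure above can be reproduced mechanically. The following minimal Python sketch (illustrative, not from the thesis; the table literal mirrors Table 2.4) groups records by their QID values, reports the smallest group size (the k actually achieved), and the highest within-group confidence for any sensitive value, which is exactly the quantity an attribute linkage attack exploits.

```python
from collections import Counter, defaultdict

# Rows mirroring Table 2.4: (job, sex, age_range, disease).
TABLE_2_4 = [
    ("Artist", "Female", "[30-35)", "Dengue"),
    ("Artist", "Female", "[30-35)", "Influenza"),
    ("Artist", "Female", "[30-35)", "Influenza"),
    ("Artist", "Female", "[30-35)", "Influenza"),
    ("Professional", "Male", "[35-40)", "Hepatitis B"),
    ("Professional", "Male", "[35-40)", "Hepatitis B"),
    ("Professional", "Male", "[35-40)", "Influenza"),
]

def group_stats(rows):
    """Group by QID (all but the last, sensitive column); measure k and confidence."""
    groups = defaultdict(list)
    for *qid, sensitive in rows:
        groups[tuple(qid)].append(sensitive)
    k = min(len(values) for values in groups.values())
    stats = {}
    for qid, values in groups.items():
        value, count = Counter(values).most_common(1)[0]
        stats[qid] = (value, count / len(values))
    return k, stats

k, stats = group_stats(TABLE_2_4)
print("table is k-anonymous with k =", k)       # 3 for Table 2.4
for qid, (value, conf) in stats.items():
    print(qid, "->", value, f"{conf:.0%}")      # e.g. the Artist group -> Influenza 75%
```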

41 The authors demonstrated two attacks, called the homogeneity attack and the background-knowledge attack. Table 2.8 represents a micro data set, and Table 2.9 represents the generalized form of Table 2.8. Table 2.9 can be used to illustrate a homogeneity attack. Ruby and Cory are neighbours, and Ruby knows that Cory is a 22-year-old girl living in the city area of a particular postal code. If Ruby knows that Cory is in Table 2.9, then she can easily guess that Cory's information is in the first four rows of the table. Ruby cannot identify Cory uniquely; however, she breaches Cory's privacy by learning that Cory has Dengue. Another incident based on Table 2.9 illustrates a background-knowledge attack. Say Jon and Karl are pen pals. Karl is a 35-year-old Japanese man who lives in Zip code 14068, and Jon knows all this information. If Jon knows that Karl is in Table 2.9, then he can easily guess that Karl has either heart disease or a viral infection. However, it is medically known that, due to their diet, young Japanese men have a low chance of heart disease. Thus, Jon can conclude that Karl has a viral infection, breaching his privacy. According to the l-diversity [77] method, the sensitive values of every QID group must be well represented, so that at least l diverse values are assigned to each group. Ohrn and Ohno-Machado [69] had also proposed a similar idea previously. The understanding of the idea "well represented" may differ from instance to instance (in terms of the data set). The p-sensitive k-anonymity [87] method is essentially the same as the l-diversity privacy model. Two different versions of the l-diversity algorithm, known as disclosure-recursive (c, l)-diversity and negative/positive disclosure-recursive (c, l)-diversity, were also proposed by Machanavajjhala et al. [63]. To satisfy 27

42 Table 2.8: Patients Micro Data
Serial | Name | Age | Gender | Zip code | Nationality | Disease
1 | Ana | 27 | F | | American | Dengue
2 | Rocky | 28 | M | | American | Dengue
3 | Cory | 22 | F | | American | Dengue
4 | Paul | 24 | M | | American | Dengue
5 | Stewart | 53 | M | | Korean | Influenza
6 | Fillip | 56 | M | | Japanese | Hepatitis B
7 | Gale | 45 | M | | Indian | Flu
8 | Hadi | 48 | F | | Chinese | Influenza
9 | Ian | 32 | M | | Russian | Heart Failure
10 | Julia | 36 | F | | Chinese | Heart Failure
11 | Karl | 35 | M | | Japanese | Viral Infection
12 | Leo | 36 | M | | American | Viral Infection
28

43 Table 2.9: Patients Generalized Data
Serial | Name | Age | Gender | Zip code | Nationality | Disease
1 | (Ana) | | F | 1405* | American | Dengue
2 | (Rocky) | | M | 1406* | American | Dengue
3 | (Cory) | | F | 1406* | American | Dengue
4 | (Paul) | | M | 1405* | American | Dengue
5 | (Stewart) | | M | 15*** | Asian | Influenza
6 | (Fillip) | | M | 15*** | Asian | Hepatitis B
7 | (Gale) | | M | 15*** | Asian | Flu
8 | (Hadi) | | F | 15*** | Asian | Influenza
9 | (Ian) | | M | 140** | Any | Heart Failure
10 | (Julia) | | F | 140** | Any | Heart Failure
11 | (Karl) | | M | 140** | Any | Viral Infection
12 | (Leo) | | M | 140** | Any | Viral Infection
29

44 the recursive (c, l)-diversity mechanism, all QID groups of a table must be (c, l)-diverse, where c is a constant specified by the publisher and l relates to the sensitive values in the rows. For a specific data set, if a collection of sensitive values occurs more often than the other values belonging to a group, this scenario helps an attacker conclude that it is very likely that a certain record in that group has those values. This is called a probabilistic inference attack. The different versions of the l-diversity method are unable to protect against probabilistic inference attacks.
2.5.2 t-Closeness
l-diversity suffers from similarity and skewness attacks, and is also, in certain cases, difficult and unnecessary to achieve; see [56] for details. To overcome these limitations, Li et al. [56] proposed a method called t-closeness. In the t-closeness algorithm, t is a given threshold value. t-closeness [77] calculates the Earth Mover's Distance (EMD) between the distribution of a sensitive attribute in an equivalence class and its distribution in the entire data set; the calculated distance must be within the given threshold value t. Let P and Q represent the distribution of a sensitive attribute in the equivalence class and the distribution of the sensitive attribute in the entire data set, respectively; then, according to t-closeness, EMD(P, Q) ≤ t. The t-closeness algorithm has a number of limitations [41]. Firstly, there is a correlation between the sensitive attributes of a data set and its QIDs; t-closeness degrades the utility of the privacy-preserved data by wiping out this correlation in order to enforce t-closeness. Secondly, for sensitive numerical data, t-closeness is unable to prevent attribute linkage attacks [55]. Thirdly, t- 30

45 closeness uses the EMD measure, which is not perfect or flexible enough to impose different privacy levels on different sensitive attributes, although it is an alternative to generalization and suppression for data anonymization.
2.5.3 Confidence Bounding Attack
To protect against attribute linkage attacks, Wang et al. [91] proposed a new privacy model called confidence bounding. According to this method, for every qid group, privacy templates are created of the form (QID → s, h), where QID is a quasi-identifier, s is a sensitive attribute, and h is a threshold. The confidence bounding algorithm works by limiting the confidence with which a data miner can infer the sensitive properties. The confidence is denoted by Conf(QID → s) and its calculated value is expressed as a percentage. A data table satisfies confidence bounding if it fulfills the condition Conf(QID → s) ≤ h. Table 2.10 presents a 3-anonymous table and provides an example of confidence bounding. For example, for the sensitive attribute value Flu, the threshold is set to 15% for the data given in Table 2.10. For these data, the inferred confidence for Flu is 75% in the group (Artist, Male, [35-40)), which violates the given template of the confidence bounding method. 31
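A confidence bounding template can be checked mechanically. The short Python sketch below is illustrative only (the rows mirror Table 2.10 and the 15% threshold is the one quoted above): it computes Conf(QID → Flu) for every QID group and flags the groups that violate the template.

```python
from collections import defaultdict

# Rows mirroring Table 2.10: (job, sex, age_range, disease).
TABLE_2_10 = [
    ("Health Professional", "Female", "[40-45)", "Dengue"),
    ("Health Professional", "Female", "[40-45)", "Dengue"),
    ("Health Professional", "Female", "[40-45)", "Flu"),
    ("Artist", "Male", "[35-40)", "HIV"),
    ("Artist", "Male", "[35-40)", "Flu"),
    ("Artist", "Male", "[35-40)", "Flu"),
    ("Artist", "Male", "[35-40)", "Flu"),
]

def violates_template(rows, sensitive_value, h):
    """Return the QID groups where Conf(QID -> sensitive_value) exceeds h."""
    groups = defaultdict(list)
    for *qid, disease in rows:
        groups[tuple(qid)].append(disease)
    violations = {}
    for qid, diseases in groups.items():
        conf = diseases.count(sensitive_value) / len(diseases)
        if conf > h:
            violations[qid] = conf
    return violations

print(violates_template(TABLE_2_10, "Flu", h=0.15))
# Both QID groups exceed the 15% bound for Flu; the Artist group reaches 0.75.
```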

46 Table 2.10: 3-anonymous Patient Table
Job | Sex | Age | Disease
Health Professional | Female | [40-45) | Dengue
Health Professional | Female | [40-45) | Dengue
Health Professional | Female | [40-45) | Flu
Artist | Male | [35-40) | HIV
Artist | Male | [35-40) | Flu
Artist | Male | [35-40) | Flu
Artist | Male | [35-40) | Flu
2.6 Table Linkage Attacks
In the cases of attribute linkage and record linkage attacks, it is assumed that the attacker already knows that the targeted person's record is in the published table. A table linkage attack takes place when an attacker tries to confirm whether or not the targeted person's record is present in the published, anonymized data set.
2.6.1 δ-Presence
k-anonymity-based algorithms are designed to deter an attacker from locating a targeted person's record, but the attacker might still learn of the presence of the targeted person in a certain data set; i.e., table linkage attacks are not preventable by k-anonymity algorithms. To overcome this limitation, Nergiz et al. [66] proposed the δ-presence algorithm, where δ represents a satisfactory range of probability (threshold), δ = (δ_min, δ_max). Let us consider 32

47 two tables: an external public table (say, a voter list) and a private table, T_E and T_P respectively, where T_P ⊆ T_E. A generalized, anonymized version of T_P is T̂_P. T̂_P fulfills the threshold δ = (δ_min, δ_max) if, for any targeted person t with t ∈ T_E, δ_min ≤ Pr(t ∈ T_P | T̂_P) ≤ δ_max. One of the limitations of δ-presence is that it assumes that the external table T_E used for the data breach is available to both the attacker and the publisher, and this assumption may not be realistic [41].
2.7 Probabilistic Attack
The probabilistic attack privacy model [41] deals with the probabilistic belief of an attacker attempting to identify a certain record of a targeted person in a data set. The previously mentioned models, in contrast, deal with records, sensitive attributes, and tables to protect against linkage attacks established by an attacker. Algorithms of this category are presented below.
2.7.1 (c, t)-Isolation
Chawla et al. [22][21] proposed a method called (c, t)-isolation that prevents an attacker from isolating a numerical value from a real database (RDB), where t is a numerical value in the real database and the degree of similarity is defined by c. The main limitation of this model is that it is only applicable to numeric data. 33

48 2.7.2 (d, λ)-Privacy
Rastogi et al. [75] proposed the (d, λ)-privacy model, which deals with the prior and posterior beliefs of an attacker about a tuple t from a table T containing r tuples, where d ∈ (0, 1) and γ represents the attacker's posterior probability. According to the (d, λ)-privacy model, the value of the prior probability P(r) for every tuple t is either equal to 1 or smaller. If P(r) = 1, this reflects the fact that the attacker is certain about the presence of t in the table T, and the algorithm is unable to hide the information. On the other hand, the algorithm hides the tuple from the attacker if the value of P(r) is smaller than 1. Calculating the dependency of an attacker's knowledge in terms of d may not be feasible for many real-life applications [40][62], which is an important limitation of this algorithm.
2.7.3 ε-Differential Privacy
All the aforementioned privacy models are called partition-based models. They provide privacy protection by enforcing certain syntactic requirements on the released data. Recent research indicates that the partition-based [56][95] privacy models are unable to withstand an attacker's background knowledge. In contrast, differential privacy [30] is a more semantic definition, which accommodates strong guarantees for privacy regardless of an attacker's background knowledge and computational expertise/power [49]. Dwork et al. [33] proposed the ε-differential privacy model, where ε specifies the degree of privacy of the algorithm. According to this model, the result of an analysis is not affected extensively by the addition or deletion of a single record to or from the database. By the same notion, even 34

49 if an attacker joins different databases together, there is no chance of breaching the privacy of any data donor. In Chapter 3, ε-differential privacy is discussed in detail. Table 2.11 summarizes the various privacy models and the attacks handled by the corresponding models.
Table 2.11: Different privacy preserving algorithms and attacks [33][41][97]
Privacy Model | Record Linkage | Attribute Linkage | Table Linkage | Probabilistic Attack
k-Anonymity | Y | | |
MultiR k-Anonymity | Y | | |
l-Diversity | Y | Y | |
Confidence Bounding | | Y | |
(X, Y)-Privacy | Y | Y | |
t-Closeness | | Y | | Y
δ-Presence | | | Y |
(c, t)-Isolation | Y | | | Y
(d, γ)-Privacy | | | Y | Y
ε-Differential Privacy | Y | Y | Y | Y
35

50 2.8 Anonymization Mechanisms
Normally, a given raw data set is very unlikely to satisfy a specified privacy model. Certain anonymization mechanisms need to be applied to the raw data set, making it less precise, in order to support a privacy model. In every privacy model, a trade-off takes place between the privacy guarantee and data usability due to the various anonymization operations. It is worth mentioning that there is more than one anonymization mechanism for achieving a specific privacy model. However, in many cases, it is important to choose the right anonymization mechanism [10] in order to obtain a better trade-off. So far, four kinds of anonymization mechanisms have been widely used, namely generalization, suppression, bucketization, and perturbation.
2.8.1 Generalization
The generalization mechanism generates anonymous releases by replacing some attribute values with their more general forms. In the case of a numerical value, an interval that covers the exact value takes its place. A categorical value, on the other hand, is replaced with a more general value based on the taxonomy used to design the privacy algorithm. Usually, no predetermined taxonomy is assigned to a numerical attribute. Taxonomy trees for numerical and categorical attributes are presented in Figure 2.3 (a, b, and c). In Table 2.4, 3-anonymity is applied by generalizing the QIDs according to the taxonomy trees in Figure 2.3. Generalization can be performed using either a global recoding scheme or a local recoding scheme. The global recoding scheme fur- 36

51 ther includes the full-domain generalization scheme, the sub-tree generalization scheme, and the sibling generalization scheme.
2.8.2 Suppression
Suppression is a straightforward anonymization mechanism. A special symbol (e.g., * or Any) is used to replace values of an attribute in a release candidate, which indicates that the attribute/value has been suppressed.
2.9 Bucketization
The basic idea of the bucketization mechanism is to break the correlation between the sensitive values of a data set and the quasi-identifiers (QIDs). It first partitions the records of the actual data table into non-overlapping buckets, each of which is assigned a unique bucket identifier (BID). Then, for each bucket, it randomly permutes the sensitive attribute values, and publishes its projection on the quasi-identifier attributes together with its projection on the permuted sensitive attributes.
2.10 Perturbation
Perturbation mechanisms have long been used in the field of statistical disclosure control. Adam and Workman [5] have provided a complete summary of the perturbation mechanisms that have been widely employed. There are two standard perturbation mechanisms 37

52 that are used for implementing differential privacy algorithms, namely the Laplace mechanism and the exponential mechanism. The statistical disclosure control mechanism uses perturbation to sanitize data because this technique is efficient, simple, and able to preserve statistical information. In general, a perturbation technique uses a synthetic data value instead of an original value. As a result, there is no remarkable difference in statistical information between the perturbed and the original data [32].
2.11 Conclusion
This chapter summarizes the concept of data privacy, the different existing data privacy models, and various attacks that breach privacy, with examples. In this work, privacy refers to an individual's privacy. The models and algorithms discussed above have the following key limitations. Partition-based privacy preserving data sanitization algorithms [95][23][41][64] are variations of the k-anonymity algorithm. Partition-based algorithms impose syntactic constraints (on the raw data) to ensure privacy. These algorithms are unable to protect data donors from an adversary's background knowledge attacks; they are only able to prevent record linkage [23][94][24][25] and attribute linkage attacks [55]. Another limitation of most existing algorithms is the need to identify and choose quasi-identifiers (QIDs) from a data set before sanitizing it. Partition-based algorithms [41][98] such as k-anonymity, l-diversity, M-map, etc., fully suppress QID attributes. As a result, due to loss 38

53 of information, the utility (such as classification accuracy) of the data is reduced significantly. 39

54 Chapter 3 Methodology 3.1 Proposed System and Experimental Design This Chapter presents the mathematical background of the proposed algorithms. The formal definition of differential privacy and the idea of Laplace noise are introduced here. The data flow of the proposed algorithms is also explained here. Methods used to measure the risk of re-identification are also described here. The following sections will discuss the detailed implementation of the proposed system Privacy Constraint Current privacy preserving models (such as partition based models and interactive models) [23][95] are vulnerable to different privacy-breaching attacks. In the proposed system, 40

55 ε-differential privacy will be used. It is capable of protecting published data sets from different privacy breach attacks. Differential privacy is a new paradigm that provides a strong privacy guarantee [33]. Partition-based privacy models [23][95] ensure privacy by imposing syntactic constraints on the output. For example, the output may be required to be indistinguishable among k records, or the sensitive values may be required to be well represented in every equivalence group. Instead, differential privacy makes sure that a malicious user is not able to learn any information about a targeted person, whether or not a data set contains that person's record. Informally, a differentially private output is insensitive to any particular record. Thus, while preserving the privacy of an individual, the output of the differential privacy method is computed as if from a data set that does not contain the targeted person's record. Current research shows that ε-differential privacy is able to protect against most attacks.
ε-Differential Privacy
In this section the formal definition of ε-differential privacy [33] is given. Before that, the difference between two databases is defined below. Let a data set DB be a collection of records from a universal sample space χ. A histogram is a convenient way to represent a data set: the data set may be represented by DB ∈ N^|χ|, where every entry DB_i represents the number of elements of type i in the database DB, for every i ∈ χ, and N = {0, 1, 2, ...}. Let DB_1 and DB_2 be two databases; the distance between them will be their norm distance. 41

56 Definition 3.1. (Distance between databases). The ℓ1 norm of a database DB is denoted by ‖DB‖₁ and defined as:

‖DB‖₁ = Σ_{i=1}^{|χ|} |DB_i|   (3.1)

The ℓ1 distance between two databases DB_1 and DB_2 is ‖DB_1 − DB_2‖₁. It is important to note that ‖DB_1‖₁ represents the database size, i.e., how many records it has, whereas ‖DB_1 − DB_2‖₁ represents the number of records that differ between DB_1 and DB_2.

Definition 3.2. (Differential Privacy). Let M be a randomized algorithm over the domain N^|χ|. Then M is (ε, δ)-differentially private if, for all S ⊆ Range(M) and all DB_1, DB_2 ∈ N^|χ| with ‖DB_1 − DB_2‖₁ ≤ 1:

Pr[M(DB_1) ∈ S] ≤ exp(ε) · Pr[M(DB_2) ∈ S] + δ   (3.2)

Now, if δ = 0, the randomized algorithm M becomes ε-differentially private:

Pr[M(DB_1) ∈ S] / Pr[M(DB_2) ∈ S] ≤ exp(ε)   (3.3)

There is a significant difference between (ε, δ)- and (ε, 0)-differential privacy. For every output of data processing using M(DB), (ε, 0)- (i.e., ε-) differential privacy ensures that the outputs on two neighbouring databases are almost equally likely [32][33]. 42
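To ground Definitions 3.1 and 3.2, here is a small, illustrative Python sketch (not part of the thesis; the toy record types are assumptions): it represents two databases as histograms over the sample space and computes the ℓ1 distance of Definition 3.1, confirming that removing one record yields neighbouring databases.

```python
from collections import Counter

# Assumed sample space: each record is identified by its (age-band, disease) type.
db1 = [("30-35", "Dengue"), ("30-35", "Influenza"), ("35-40", "Hepatitis B")]
db2 = [("30-35", "Dengue"), ("30-35", "Influenza")]  # one record removed

def histogram(db):
    """Histogram representation DB in N^|chi|: a count per record type."""
    return Counter(db)

def l1_norm(hist):
    return sum(abs(v) for v in hist.values())

def l1_distance(h1, h2):
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0) - h2.get(k, 0)) for k in keys)

h1, h2 = histogram(db1), histogram(db2)
print("size of DB1:", l1_norm(h1))         # ||DB1||_1 = 3 records
print("distance:", l1_distance(h1, h2))    # ||DB1 - DB2||_1 = 1 -> neighbouring databases
```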

57 A stronger privacy guarantee may be achieved by choosing a lower value of ε. Typical values are 0.01 or 0.1, or perhaps ln 2 or ln 3 [31]. In this research, the value of ε is used in the range 0.01 ≤ ε ≤ 1.0. For a very small ε,

exp(ε) ≈ 1 + ε   (3.4)

To process numeric and non-numeric data with the differential privacy model, the following techniques are needed.
3.1.2 Laplace Mechanism
Dwork et al. [32] proposed the Laplace mechanism, which adds noise to numerical values to ensure differential privacy. The Laplace mechanism takes a database DB as input and consists of a function f and the privacy parameter λ. The privacy parameter λ specifies how much noise should be added to produce the privacy-preserving output. The mechanism first computes the true output f(DB), and then perturbs this output with noise. The noise is generated from a Laplace distribution with probability density function π,

π(x | λ) = (1 / 2λ) · exp(−|x| / λ)   (3.5)

where x is a random variable; its variance is 2λ² and its mean is 0. The noisy output is then computed using the following formula: 43

\hat{f}(DB) = f(DB) + \mathrm{lap}(\lambda)    (3.6)

where lap(λ) is sampled from the Laplace distribution; the expected magnitude of lap(λ) is approximately λ. In a similar way, the following mechanism ensures ε-differential privacy:

\hat{f}(DB) = f(DB) + \mathrm{lap}(1/\varepsilon)    (3.7)

For a random variable v, the random Laplace noise N_r = lap(·) is generated using the following equation [72]:

N_r = -\mathrm{sign}(v) \ln(1 - 2|v|)    (3.8)

For this research, v = 1/ε, since the value of 1/ε is always positive and varies for every group of data. Finally, the random Laplace noise N_r is generated using the following equation:

N_r = -\mathrm{sign}(1/\varepsilon) \ln(1 - 2(1/\varepsilon)) = -1 \cdot \ln(1 - 2(1/\varepsilon)) = -\ln(1 - 2(1/\varepsilon))    (3.9)

Thus:

N_r = \left| \ln(1 - 2(1/\varepsilon)) \right|    (3.10)

Within the last five years, several published works [9][33][53] have also proved that adding Laplace noise secures data against the adversary.

Theorem 3.3 ([33]). The Laplace mechanism satisfies (ε, 0)-differential privacy.

Proof. Consider DB_1 ∈ N^|χ| and DB_2 ∈ N^|χ| such that \|DB_1 - DB_2\|_1 ≤ 1. Let f : N^|χ| → R^k be some function with sensitivity Δf, let p_{DB_1} denote the probability density function (π) of M_L(DB_1, f, ε), and let p_{DB_2} denote the probability density function of M_L(DB_2, f, ε). At an arbitrary point x ∈ R^k,

\frac{p_{DB_1}(x)}{p_{DB_2}(x)} = \prod_{i=1}^{k} \frac{\exp(-\varepsilon |f(DB_1)_i - x_i| / \Delta f)}{\exp(-\varepsilon |f(DB_2)_i - x_i| / \Delta f)}
= \prod_{i=1}^{k} \exp\left( \frac{\varepsilon (|f(DB_2)_i - x_i| - |f(DB_1)_i - x_i|)}{\Delta f} \right)
\le \prod_{i=1}^{k} \exp\left( \frac{\varepsilon |f(DB_1)_i - f(DB_2)_i|}{\Delta f} \right)
= \exp\left( \frac{\varepsilon \, \|f(DB_1) - f(DB_2)\|_1}{\Delta f} \right)
\le \exp(\varepsilon)    (3.11)
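A minimal sketch of the Laplace mechanism for a counting query is given below. It uses NumPy's standard Laplace sampler with scale Δf/ε (for a count, Δf = 1) rather than the closed-form noise magnitude of Equation 3.10, and the example records and query are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return the true query answer perturbed with Laplace noise of scale sensitivity/epsilon."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical counting query f(DB): how many records have Class = 'Y' (sensitivity 1).
classes = ["Y", "N", "Y", "Y", "N"]
true_count = sum(1 for c in classes if c == "Y")   # f(DB) = 3

epsilon = 0.1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon)
print(true_count, round(noisy_count, 2))
```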

Theorem 3.4 ([33]). If M_L satisfies (ε, 0)-differential privacy, then it also satisfies (kε, 0)-differential privacy for any group of size k. That is, if \|DB_1 - DB_2\|_1 ≤ k, then for all S ⊆ R (R is the range of M_L):

\Pr[M_L(DB_1) \in S] \le \exp(k\varepsilon) \Pr[M_L(DB_2) \in S]

Proof. Let DB_1 and DB_2 be any pair of data sets satisfying \|DB_1 - DB_2\|_1 ≤ k. Then there exist databases d_0, d_1, ..., d_k such that d_0 = DB_1, d_k = DB_2, and \|d_i - d_{i+1}\|_1 ≤ 1 for every i. For any event S ⊆ Ŕ (Ŕ is the range of M_L):

\Pr[M_L(DB_1) \in S] = \Pr[M_L(d_0) \in S]
\le \exp(\varepsilon) \Pr[M_L(d_1) \in S]
\le \exp(\varepsilon)\exp(\varepsilon) \Pr[M_L(d_2) \in S] = \exp(2\varepsilon) \Pr[M_L(d_2) \in S]
\cdots
\le \exp(k\varepsilon) \Pr[M_L(d_k) \in S] = \exp(k\varepsilon) \Pr[M_L(DB_2) \in S]    (3.12)

3.1.3 Anonymization

Data anonymization is a procedure that converts data into a new form that is secure and prevents information leakage from the data set. At the same time, the anonymized data should still be minable for useful information and patterns. Data anonymization may be achieved in different ways; data suppression and generalization are the standard methods. In this research, generalization and suppression are used to achieve data anonymization.

Generalization

To anonymize a data set DB, generalization substitutes an original attribute value with a more general value, chosen according to the characteristics of the attribute. For example, in this work the professions filmmaker and singer are generalized to artist, and the age 34 is generalized to the range [30-35).

Definition 3.5. Let

DB = \{r_1, r_2, \ldots, r_n\}    (3.13)

be a set of records, where every record r_i represents the information of an individual with attributes

A = \{A_1, A_2, \ldots, A_d\}    (3.14)

It is assumed that each attribute A_i has a finite domain, denoted by Ω(A_i). The domain of DB is defined as

\Omega(DB) = \Omega(A_1) \times \Omega(A_2) \times \cdots \times \Omega(A_d)    (3.15)

Suppression

Suppression is a straightforward anonymization mechanism. A suppression operation replaces an attribute value, fully or partially, with a special symbol (e.g., * or Any) that indicates the value has been suppressed. Suppression is used to prevent disclosure of a value from a data set. Figure 3.1 shows the taxonomy tree (TT) for Zip code suppression of German cities. The first two digits of a Zip code represent a city; for example, 42 represents the City of Velbert. At the root of the TT is any/*****, and a full zip code is placed at each leaf. The taxonomy tree depth (TTD) plays a role in the usability of the published data.

3.1.4 Data Flow Diagram of the Proposed System

The data flow diagram of the proposed system is presented in Figure 3.2. Data donors provide their personal data for various reasons, for example for online shopping. As soon as the proposed system receives the raw data, it removes the personally identifiable information (PII) from the raw data set.
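As an illustration of the suppression and generalization operations used at this stage of the pipeline, the sketch below truncates a zip code to a chosen taxonomy tree depth (replacing the suppressed digits with *) and generalizes an age into a fixed-width range, in the spirit of Figure 3.1. The one-digit-per-level convention and the bin width are assumptions made only for this example.

```python
def suppress_zip(zip_code: str, depth: int) -> str:
    """Keep the first `depth` digits of a zip code and suppress the rest with '*'.
    depth = 0 corresponds to the root of the taxonomy tree (fully suppressed)."""
    kept = zip_code[:depth] if depth > 0 else ""
    return kept + "*" * (len(zip_code) - len(kept))

def generalize_age(age: int, width: int = 5) -> str:
    """Generalize an exact age into a half-open range [lo, hi) of the given width."""
    lo = (age // width) * width
    return f"[{lo}-{lo + width})"

print(suppress_zip("42103", depth=2))  # '42***': only the city prefix is kept
print(suppress_zip("42103", depth=0))  # '*****': the root of the taxonomy tree
print(generalize_age(34))              # '[30-35)', as in the generalization example above
```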

Figure 3.1: Suppression of Zip Codes of two German Cities

Figure 3.2: Data Flow of the Proposed Algorithms

Then, generalization and/or suppression is used to anonymize the data. At the final stage, Laplace noise is added to the anonymized data so that it satisfies ε-differential privacy. Once the sanitized data is published, data usability measures are applied to check the quality of the data set.

3.2 Utility Measures

There is a trade-off between the data sanitization (for publishing) process and the usability of the published data set, since sanitization introduces noise or data loss relative to the actual data set. To measure the utility of the published data set, the following two measures are used:

- Classification accuracy of the sanitized data
- Re-identification risk measurement

3.2.1 Classification Accuracy

Classification is a fundamental problem in statistics, machine learning, and pattern recognition. Let a data set have N classifiable instances and a set of labels L. The task of classification is to assign a specific tag or label L_i ∈ L to every instance in a consistent way, so that data groups are identified according to their common attributes/characteristics.

3.2.2 Re-identification Risk

The risk of re-identification measures how vulnerable a particular record is to being re-identified from a sanitized data set. Three scenarios are considered to estimate the re-identification risk [34]: the Prosecutor, Journalist, and Marketer scenarios. In the Prosecutor scenario (PS), the attacker has background knowledge that the targeted person is already in the data set. In contrast, in the Journalist scenario (JS), the attacker does not have any background knowledge about whether the targeted person is in the data set or not. In the Marketer scenario (MS), the attacker is not interested in identifying a specific person, but in successfully identifying a significant percentage of the records in the data set. The risk of re-identification measures the probability that a sanitized record (or a given number of records) can be correctly associated with the individual to whom it originally corresponds. Let there be n records in a data set, i = 1, 2, 3, ..., n, and let ρ_i be the probability of correctly identifying record i. Let C be an equivalence class of the published data set containing J records, and let the probability for each record j ∈ J in C be denoted ρ_j. The probability of correctly identifying a record is defined by [34]:

\rho_j = \frac{1}{|C_j|}    (3.16)

where |C_j| is the size of the equivalence class, in the published data set, that contains record j.

The number of records at higher risk than a threshold value t in a published data set is measured by the following equation [34]:

R_r = \frac{1}{n} \sum_{j \in J} |C_j| \cdot I(\rho_j > t)    (3.17)

where the indicator function I returns 1 if the condition is true and 0 otherwise. The highest risk associated with a record or a set of records is given by [34]:

R_{max} = \max_{j \in J} (\rho_j)    (3.18)

The average rate of correctly identified records is called the success rate and is given by [34]:

R_s = \frac{1}{n} \sum_{j \in J} |C_j| \cdot \rho_j    (3.19)

3.3 Conclusion

This chapter discussed the theory of differential privacy and the related techniques (e.g., noise generation) that are used in the proposed algorithms and their evaluation. The differential privacy paradigm provides the strongest privacy guarantee and is independent of the adversary's background knowledge; that is why this research adopts differential privacy for the design of the proposed algorithms.

The measure of classification accuracy reflects the quality of the data set: higher accuracy means better-quality sanitized data. The risk of re-identification, on the other hand, indicates whether the sanitized data is safe; a risk of re-identification lower than the threshold means the data is secure enough to publish.

Chapter 4

Sanitizing and Publishing Electronic Health Record

4.1 Introduction

With the development and integration of information and communication technology (ICT), the collection, management, and sharing of electronic health records (EHR) are very common nowadays. The regulations on EHR, the Personal Health Information Protection Act (PHIPA) [2] in Canada and the Health Insurance Portability and Accountability Act (HIPAA) [43] in the United States of America, encourage diverse use of EHR without disclosing data donors' private information [60]. As mentioned earlier, the sharing and exchange of electronic health records (EHR) is beneficial; however, the health information exchange (HIE) program between the USA and Canada has not been very successful.

El Emam (2013) [34] states that 29% of health care providers have suffered breaches of their customer or employee data. Another finding from the same author [34] is that 38% of health care providers do not report data breaches to their patients. In [8], Almoaber and Amyot reported that 33 barriers need to be overcome to make HIE viable. Privacy was one of the topmost concerns for the HIE project. The privacy concerns include: 1) identity theft or fraud, 2) use of information for purposes other than the care of a patient, and 3) illegitimate use of patients' information, e.g., mental health conditions or genome data.

4.2 Problem Definition

The key challenge for a privacy preserving data publishing (PPDP) technique is to guarantee data donors' privacy while maintaining data usability for further processing by data miners or other interested parties. The purpose of this research is to develop a framework that satisfies the differential privacy standard defined by Dwork and Roth [33], neutralizes the risk of re-identification, and maximizes data usability for the classification tasks of knowledge miners. The proposed work has the following two phases:

- Sanitize HIPAA-compliant and/or micro-data (data that contains identifiable and sensitive information about a person) into an anonymous form that satisfies ε-differential privacy
- Measure the risk of re-identification and the classification accuracy, to judge how safe the sanitized data set is and how usable the sanitized, published data is, respectively

One of the main benefits of this work is to make high quality data readily available, encouraging collaborative scientific research and new findings.

4.3 Related Work

Most of the privacy preserving partition-based models found in the literature [23][41][64][95] are variations of the k-anonymity algorithm. When a partition-based algorithm sanitizes a data set, it imposes syntactic constraints on the raw data to ensure privacy: it partitions the records of the data set into groups such that k different sensitive items must be present in each group to protect data items from identification. These algorithms are unable to protect data donors from an adversary's background knowledge attacks. Data donors are becoming more vulnerable to background knowledge attacks because of the availability of external data sets, such as voter lists or publicly available data from social networks, combined with powerful data mining tools. Another limitation of most existing algorithms is the need to identify and choose quasi-identifiers (QIDs) from a data set before sanitizing. Partition-based algorithms [41][98] such as k-anonymity, l-diversity, and M-map fully suppress the QID attributes. As a result, due to the loss of information, the utility (e.g., classification accuracy) of the data is reduced significantly. Some researchers have integrated privacy into machine learning algorithms [88][22] to publish privacy preserving results instead of publishing secure data sets for sharing.

The above discussion presents the limitations of existing published works. The proposed algorithm addresses those limitations to publish a sanitized data set with better usability. A classification task maps a data item to its predefined class; however, in the area of privacy preserving data publishing, little work is found in the literature that addresses classification [18][41]. Some recent works on privacy preserving data publishing are reported below.

In [74], Qin et al. proposed a local differential privacy algorithm called LDPMiner to anonymize set-valued data. They claimed that the algorithmic analysis of their method shows it to be practical in terms of efficiency. In [38], Fan and Jin implemented two methods, the Handpicked algorithm (HPA) and the Simple random algorithm (SRA), which are alterations of the l-diversity approach [23] that sanitize data by adding Laplace noise [33]. Wu et al. [96] proposed a privacy preserving algorithm based on changing quasi-identifiers and anonymization, and evaluated the published data using classification accuracy and F-measure. In [64], Mohammed et al. proposed a DP-based algorithm and measured the classification accuracy of the sanitized data set. Several researchers have added privacy to existing machine learning approaches to publish privacy preserving results, e.g., classification results [88] or histograms [22] of data sets. Those techniques do not fulfill the criteria of sanitized data sharing; rather, they only publish useful results.

4.4 Proposed Algorithm

This research proposes a 2-Layer Privacy Preserving (2LPP) algorithm that satisfies the ε-differential privacy guarantee. Algorithm 1 shows the 2LPP algorithm. In the first layer (lines 1-7), the proposed algorithm applies the generalization technique to transform the attributes of the input data set into their generalized form. For example, the attribute value age 34 is generalized to the range [30-35). In the second layer (lines 9-16), the proposed algorithm adds Laplace noise to the generalized data to make the data items anonymous.

In layer 1 of the proposed algorithm, data generalization takes place. From lines 1 to 7, the algorithm converts the raw data set into its generalized form to add a layer of privacy against data breaches. A taxonomy tree gives the hierarchical relation between an actual attribute value and its general form; the taxonomy tree is never published with the sanitized data set. The proposed algorithm then (line 3) groups the generalized data set based on the similarity of the predictor attributes under the taxonomy tree. Any value that is not supported by the taxonomy tree is discarded (lines 5 to 7).

In layer 2, the algorithm selects the initial privacy budget (line 9). Next, it recalculates the privacy budget based on the size of each group and adds Laplace noise to that group (lines 11 to 13). This process repeats until noise has been added to all groups of the generalized data set. As soon as the noise addition is complete, the algorithm merges all subgroups to form the anonymized, differentially private sanitized data set (line 15). Finally, this data set is published for interested parties (e.g., data miners).

Algorithm 1: The Proposed 2LPP Algorithm
Input : Raw data set DB [predictor attributes A_Pr, class attribute A_Cl], privacy budget ε, taxonomy tree depth d
Output: Sanitized data set DB'
/* Layer 1: generalization */
1  if d > 0 then
       /* Predictor attribute generalization */
2      Generalize A_Pr → Â_Pr, based on the taxonomy tree
3      Split the generalized data set DB_g based on the predictor attribute similarities, i.e.,
4      DB_g = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn
5      [where DB_gi ⊆ DB_g and i = 1, 2, 3, ..., n]
6  else
       /* Remove attributes not in a taxonomy tree */
7      Discard the attribute
8  end
/* Layer 2: adding noise for randomization */
9  Set up the initial privacy budget ε = 0.1   /* a small value such as 0.25, 0.5, or ln 2 */
10 START for: i = 1 to n
11     Count the frequency f_r of each generalized group
12     Recalculate the privacy budget for DB_gi: ε_i = ε / (2(|f_r| + d))
13     Add Laplace noise to the frequency: f_r + lap(1/ε_i)
14 END for
15 Merge the subgroups with the new frequencies: DB' = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn
16 Publish the sanitized data set DB'
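A minimal Python sketch of layer 2 of Algorithm 1 is given below, assuming the generalization of layer 1 has already been applied (for example with helpers like those in Chapter 3) and that each record is a tuple of generalized predictor values plus the class value. The noise is drawn with NumPy's standard Laplace sampler of scale 1/ε_i rather than the closed-form magnitude of Equation 3.10, and all names are illustrative; this is not the implementation used for the experiments.

```python
from collections import Counter
import numpy as np

def two_layer_sanitize(generalized_records, epsilon=0.1, depth=2, rng=None):
    """Sketch of layer 2 of Algorithm 1: group identical generalized records (line 11),
    recompute a per-group budget eps_i = epsilon / (2 * (f_r + depth)) (line 12),
    and attach a Laplace-noised frequency to each group (line 13)."""
    rng = rng or np.random.default_rng()
    groups = Counter(generalized_records)              # frequency f_r of each group
    sanitized = []
    for group, f_r in groups.items():
        eps_i = epsilon / (2 * (f_r + depth))          # recalculated privacy budget
        noisy_freq = f_r + rng.laplace(0.0, 1.0 / eps_i)
        sanitized.append((group, max(0, round(noisy_freq))))  # clamp to a non-negative count
    return sanitized                                   # merged groups with noisy frequencies (line 15)

# Hypothetical generalized records (job, age range, class), as produced by layer 1.
records = [
    ("Health professional", "20-35", "Y"), ("Health professional", "35-50", "N"),
    ("Health professional", "20-35", "N"), ("Health professional", "20-35", "Y"),
    ("Media artist", "20-35", "Y"), ("Media artist", "20-35", "N"),
    ("Media artist", "20-35", "Y"), ("Media artist", "20-35", "N"),
]
print(two_layer_sanitize(records))
```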

4.5 Working Example

Table 4.1 describes the small data set that will be used as the working example for the proposed algorithm.

Table 4.1: Sample small data set

Job        Age  Class
Doctor     34   Y
Nurse      50   N
Doctor     33   N
Nurse      33   Y
Actor      20   Y
Dramatist  31   N
Dramatist  32   Y
Actor      25   N

Using the taxonomy trees for the attributes Job: {any job {health professional {doctor} {nurse}} {media artist {actor} {dramatist}}} and Age: {20-50 {20-35} {35-50}}, with a taxonomy tree depth of d = 2, an anonymized table (Table 4.2) is created that contains the anonymized attributes. In Table 4.2 there are 5 groups of anonymized records, based on the generalized attribute values. Table 4.3 shows those 5 groups with their frequencies.

Table 4.2: Anonymized form of the sample data set

Job                  Age    Class
Health professional  20-35  Y
Health professional  35-50  N
Health professional  20-35  N
Health professional  20-35  Y
Media artist         20-35  Y
Media artist         20-35  N
Media artist         20-35  Y
Media artist         20-35  N

Table 4.3: Groups of the anonymized sample data set with their frequencies

Job                  Age    Class  Frequency
Health professional  20-35  Y      2
Health professional  35-50  N      1
Health professional  20-35  N      1
Media artist         20-35  Y      2
Media artist         20-35  N      2

Let the initial privacy budget be ε = 0.1 and the taxonomy tree depth be d = 2. Consider the group ⟨health professional, 20-35, Y⟩. At line 12 of Algorithm 1, the privacy budget is recalculated as:

\hat{\varepsilon} = \frac{0.1}{2(2 + 2)} = \frac{0.1}{8} = 0.0125    (4.1)

where ε̂ is the privacy budget for the group ⟨health professional, 20-35, Y⟩. The amount of noise is then calculated using the following equation:

N_r = \left| \ln(1 - 2(1/\hat{\varepsilon})) \right|    (4.2)

According to Equation 4.2, the amount of noise for this group is:

N_r = \left| \ln(1 - 2(1/0.0125)) \right| \approx 5    (4.3)

In a similar way, the noise for all the other groups is calculated. Table 4.4 shows the noisy frequencies for all 5 groups.

Table 4.4: Noisy frequencies for the sanitized data

Job                  Age    Class  Noisy Frequency
Health professional  20-35  Y      2+5
Health professional  35-50  N      1+4
Health professional  20-35  N      1+4
Media artist         20-35  Y      2+5
Media artist         20-35  N      2+5

4.6 Data Sets

Two different data sets are used to test the proposed algorithm for sanitizing and publishing secure data.

The Haberman's Survival Data Set is from the UCI machine learning repository [57], and the Doctor's Bills Data Set (DBDS) Version 1 (V1) was created using the doctors' bills from the Multimedia Analysis and Data Mining Research Group, German Research Center for Artificial Intelligence [39]. The DBDS is freely available for download from that research group's website [39]. The DBDS is a distinctive case of a publicly available micro data set. The Haberman's Survival Data Set consists of 306 tuples and contains 3 numerical attributes and 1 class attribute. The DBDS V1 data set consists of 46 tuples and contains 3 numerical, 3 categorical, and 1 class attributes.

Table 4.5: Data Set Descriptions

Data Set             Numerical Attributes  Categorical Attributes  Missing Values  Class Information
Doctor's Bills V1    3                     3                       None            > 60K, ≤ 60K
Haberman's Survival  3                     None                    None            1 = Survived, 2 = Died

Table 4.6: Attributes of the Doctor's Bills Data Set V1

Attribute            Type
Sex                  Categorical
City                 Categorical
Age                  Numerical
Disease              Categorical
Year                 Numerical
Diagnostic Spending  Numerical
Class                > 60K, ≤ 60K

Table 4.7: Attributes of the Haberman's Survival Data Set

Attribute                                   Type
Age                                         Numerical
Patient's year of operation                 Numerical
Number of positive axillary nodes detected  Numerical
Class                                       1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

4.6.1 Data Set Preprocessing

The DBDS is a collection of Portable Document Format (PDF) files and scanned images of doctors' bills. Every bill contains patient information, such as name, address with zip code, date of birth, type of diagnosis, completion date, diagnosis results, and the cost of the diagnosis. It also contains the name and address of the doctor involved in the diagnostic process. All of this information is provided in German. We first converted the bills to English using Google Translate [46] and extracted the personal information to create a data file in comma-separated values (CSV) format. The prepared data set consists of 46 tuples. It contains real-life data with 3 numeric attributes, 3 categorical/non-numeric attributes, and class information that distinguishes two levels, > 60K and ≤ 60K. To make the DBDS V1 HIPAA compliant, we excluded the zip code attribute from the data set. Table 4.6 presents all attributes of the Doctor's Bills data set with their types.

4.7 Results and Discussions

The sanitized, published data set is evaluated to measure its usability for data classification and the associated re-identification risk. Tables 4.8 and 4.9 show the classification accuracy numerically, and the corresponding graphs are shown in Figures 4.2 and 4.3. The same experiment was run five times at each taxonomy tree (TT) depth, d = 2, 4, 8, 12, or 16, and for different values of the privacy budget ε (e.g., 0.5, 0.25, 0.1) to generate the sanitized data.

Figure 4.1: A sample Doctor's Bill from the Data Set

Then, to classify the sanitized data, the decision tree classification algorithm [78] was applied. Each time, the parameters d and ε were varied and the experiment was run five times to produce anonymized data; the resulting data set was then classified to measure the accuracy. The average (arithmetic mean) classification accuracy over the five runs is reported in this thesis. The classification accuracy for data sanitized by the proposed algorithm is above 83% even at the lowest privacy budget, ε = 0.1, for the higher values of d such as 12 and 16.

Table 4.8: Classification Accuracy for the Doctor's Bill V1 Data Set (accuracy by taxonomy tree depth TTD and privacy budget ε)

Table 4.9: Classification Accuracy for the Haberman's Survival Data Set (accuracy by taxonomy tree depth TTD and privacy budget ε)
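The accuracy measurements reported here can be reproduced in outline with scikit-learn; the sketch below is illustrative only (the file name, class column name, and split ratio are assumptions, not the exact experimental setup). It trains a decision tree on a sanitized CSV file and reports the mean accuracy over five runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def mean_accuracy(csv_path, class_column="Class", runs=5, test_size=0.34):
    """Train a decision tree on one-hot encoded predictors and average the accuracy over runs."""
    data = pd.read_csv(csv_path)
    X = pd.get_dummies(data.drop(columns=[class_column]))  # encode categorical predictors
    y = data[class_column]
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return sum(scores) / len(scores)

# Hypothetical usage on a sanitized CSV file produced by the proposed algorithm:
# print(mean_accuracy("sanitized_doctors_bills_v1.csv"))
```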

Figure 4.2: Classification Accuracy for the Doctor's Bill V1 Data Set

Figure 4.3: Classification Accuracy for the Haberman's Survival Data Set

The performance of the proposed 2LPP algorithm has been compared with five other anonymization algorithms: DiffGen [64], k-anonymity (k = 5), k-map (k = 5), δ-presence (0.5 ≤ δ ≤ 1.0), and (ε, δ)-differential privacy (ε = 2, δ = 1E-6) [37][36]. With the decision tree classifier, the 2LPP algorithm shows better performance than the five other algorithms, as shown in Figure 4.4. The major reason the classification accuracy of the partition-based algorithms drops is the difficulty of choosing QIDs: in those algorithms the QIDs are fully suppressed, and the resulting loss of information significantly reduces the usability of the sanitized data.

Figure 4.4: Comparisons among the proposed algorithm and five other algorithms

4.7.1 Risk of Re-identification

Table 4.10 presents the re-identification risk measured under the three attack models described in Chapter 3. The ARX software [37][36] is used to measure the risk of re-identification for the sanitized and raw data sets.

The risk of re-identification for the data sanitized by 2LPP is close to 0 (see Table 4.10), which means the sanitized data is safe to publish.

Table 4.10: Comparison of re-identification risk between sanitized and non-sanitized data sets

Data Set             Re-identification Risk of HIPAA-Compliant Data (%)  Re-identification Risk of Data Sanitized by 2LPP (%)
Doctor's Bill V1     PS=97.82; JS=97.82; MS=97.82                        PS=0.11; JS=0.09; MS=0.09
Haberman's Survival  PS=60.45; JS=60.45; MS=60.45                        PS=0.05; JS=0.05; MS=0.05

Doctor's Bill V1

Figures 4.5 and 4.6 show the risk of re-identification before and after sanitizing the Doctor's Bill data set V1. In Figure 4.5, the success rate of re-identification is approximately 98% for all three attack models, namely the Prosecutor, Journalist, and Marketer attacker models. After sanitization, the risk of re-identification drops to 0.11% for the Prosecutor attacker model and 0.09% for the Journalist and Marketer attacker models.
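For comparison with the ARX measurements, the prosecutor-style metrics of Section 3.2.2 can be sketched directly from a released table. In the fragment below the quasi-identifier columns, the threshold, and the example records are assumptions chosen only to show the calculation of ρ_j, R_r, R_max, and R_s.

```python
import pandas as pd

def reidentification_risk(df, quasi_identifiers, threshold=0.2):
    """Prosecutor-style risk: each record's risk is 1/|equivalence class| (Equation 3.16).
    Returns the share of records above the threshold (R_r), the maximum risk (R_max),
    and the average success rate (R_s)."""
    class_size = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    rho = 1.0 / class_size                 # rho_j for every record
    return {
        "records_at_risk": float((rho > threshold).mean()),  # R_r
        "max_risk": float(rho.max()),                        # R_max
        "success_rate": float(rho.mean()),                   # R_s
    }

# Hypothetical released table with two quasi-identifier columns.
released = pd.DataFrame({
    "Job": ["Health professional"] * 4 + ["Media artist"] * 4,
    "Age": ["20-35", "20-35", "20-35", "35-50", "20-35", "20-35", "20-35", "20-35"],
})
print(reidentification_risk(released, quasi_identifiers=["Job", "Age"]))
```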

Figure 4.5: Risk of Re-identification for the Raw Doctor's Bill V1 Data Set

Figure 4.6: Risk of Re-identification for the Sanitized Doctor's Bill V1 Data Set

Haberman's Survival Data Set

Figures 4.7 and 4.8 show the risk of re-identification before and after sanitizing the Haberman's Survival data set. In Figure 4.7, the success rate of re-identification is approximately 60% for all three attack models, namely the Prosecutor, Journalist, and Marketer attacker models. After sanitization, the risk of re-identification is reduced to 0.05% for all three attacker models.

Figure 4.7: Risk of Re-identification for the Raw Haberman's Survival Data Set

Scalability

The run time of the proposed 2LPP algorithm, measured while varying the size of the data set, is presented in Figure 4.9. For this measurement we used the Doctor's Bill data set V1.

Figure 4.8: Risk of Re-identification for the Sanitized Haberman's Survival Data Set

The proposed 2LPP algorithm spends most of its time reading and generalizing the data set. On average, the data sanitization and writing time is above 20 seconds. For this experiment, a laptop with an Intel Core i7 (2.10 GHz) processor, 8 GB of RAM, and the Windows 8.1 operating system was used.

4.8 Conclusion

This chapter demonstrates the 2-layer privacy preserving (2LPP) algorithm, which uses generalization and Laplace noise to sanitize a micro data set and publish it in anonymized form. The proposed algorithm also ensures the ε-differential privacy guarantee.

Figure 4.9: Runtime for the 2LPP Algorithm

Using this algorithm, the Haberman's Survival data set and a new data set called the Doctor's Bills data set were used to perform experiments for testing and to evaluate the performance of the proposed algorithm. The experiments indicate that the proposed 2-layer privacy preserving algorithm is capable of publishing a useful sanitized data set while significantly reducing the risk of re-identification.

Chapter 5

Sanitizing and Publishing Real-World Data Set

5.1 Introduction

This chapter presents the performance evaluation of the proposed Adaptive Differential Privacy (ADiffP) algorithm for the task of data classification. Data sanitized and published using the proposed algorithm shows better classification accuracy than that of other existing algorithms, which indicates that the proposed algorithm is robust.

5.2 Problem Definition

Publishing high quality sanitized data is challenging due to the varied nature of data (e.g., census, transaction, and location data) and the easy availability of external data sources on the internet.

The aim of this research is to develop a framework that satisfies the differential privacy standard [33] and produces high quality data. High quality refers mainly to two aspects: first, robust sanitization of the raw data to prevent data breaches; and second, a published data set that retains the quality and flavor of the original data set, so that data miners find it useful. The proposed work has the following key goals:

- Sanitize a census data set into an anonymous form that satisfies ε-differential privacy
- Sanitize raw data that contains personally identifiable information (PII) and is not HIPAA compliant
- Evaluate the sanitized data to measure data usability (e.g., classification accuracy) and the ability to prevent data breaches, by measuring the risk of record re-identification

5.3 Related Works

There are various algorithms for privacy preserving data mining (PPDM) and privacy preserving data publishing (PPDP); however, little is found in the literature that addresses privacy preservation for the goal of classification [18][41]. Some of the recent research on privacy preserving data publishing is reported below.

In [38], Fan and Jin implemented two methods, the Hand-picked algorithm (HPA) and the Simple random algorithm (SRA), which are variations of the l-diversity technique [23] and use the Laplace mechanism [33] to add noise and make the data secure.

The authors also claimed that their methods satisfy ε-differential privacy. They used four real-world data sets, Gowalla, Foursquare, Netflix, and MovieLens, and performed empirical studies to evaluate their work, reporting some data loss while imposing privacy on the raw data.

In [61], Loukides et al. implemented a disassociation algorithm for electronic health record privacy. They used an anonymization technique along with horizontal partitioning, vertical partitioning, and refining operations on the data set as needed to impose privacy, and used an EHR data set for their experiments. The proposed algorithm, called k^m-anonymity, is a variation of the k-anonymization algorithm, and the authors follow an interactive model for their implementation; both of these (k-anonymization and the interactive model) are limitations [33][41] of their work.

In [7], Al-Hussaeni, Fung, and Cheung implemented the Incremental Trajectory Stream Anonymizer (ITSA) algorithm to publish private trajectory data (e.g., GPS data of a moving entity). The authors use anonymization and LKC-privacy (L: a positive integer, K: an anonymity threshold K ≥ 1, and C: a confidence threshold 0 ≤ C ≤ 1) to develop the proposed technique. They tested their algorithm on two different data sets, MetroData and Oldenburg, compared their results with the k-anonymity algorithm, and showed that their algorithm works better.

Kisilevich et al. [50] presented a multidimensional hybrid approach called kACTUS-2, which achieves privacy by using suppression and swapping techniques; the method is developed by adopting the k-anonymization model.

The authors investigated data anonymization for data classification, adopting five data sets for their experiments: Adult, German Credit, TTT, Glass Identification, and Waveform. They claim that their work produces better classification accuracy for anonymized data. As the proposed algorithm is based on the k-anonymization model, it inherits all the limitations [41] of the k-anonymity model; in addition, because the suppression technique is applied, a major drawback is that sparse data results in high information loss [59].

Li et al. [54] proposed and demonstrated two k-anonymity based algorithms: Information-based Anonymization for Classification given k (IACk) and a variant of IACk for given distributional constraints (IACc). They used global attribute generalization and local value suppression techniques to produce anonymized data for classification, adopting the Adult data set for their experiments. The authors report that the IACk algorithm shows better classification performance than InfoGain Mondrian [52]. Again, as the proposed algorithm is based on the k-anonymization model, it inherits all the limitations [41] of the k-anonymity model.

5.4 Proposed Algorithm

This research proposes an Adaptive Differential Privacy (ADiffP) algorithm that satisfies the ε-differential privacy guarantee. Algorithm 2 presents the ADiffP algorithm. In line 3, the algorithm converts the raw data set into its generalized form to add a layer of privacy against data breaches. A taxonomy tree gives the hierarchical relation between an actual attribute value and its general form. The taxonomy tree is never published with the sanitized data set.

Algorithm 2: The Proposed ADiffP Algorithm
1  Inputs : Raw data set DB, predictor attributes A_Pr, class attribute A_Cl, privacy budget ε, taxonomy tree depth (TTd) d
2  Output: Sanitized data set DB'
3  Predictor attribute generalization: A_Pr → Â_Pr, based on the taxonomy tree
4  Split the generalized data set DB_g by traversing the taxonomy tree and the predictor attribute similarities, i.e., DB_g = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn [where DB_gi ⊆ DB_g and i = 1, 2, 3, ..., n]
5  Set up the initial privacy budget: ε̂ = ε / |Â_Pr|   /* ε is initially a small number such as 0.1, 0.25, or 0.5 */
6  START for: i = 1 to n
7      Count the frequency f_r of each generalized group
8      Set the adaptive privacy budget for DB_gi: ε_i = ε̂ / (|f_r| + d)
9      Add Laplace noise to the frequency: f_r + lap(1/ε_i)
10 END for
11 Merge the subgroups with the new frequencies: DB' = DB_g1 ∪ DB_g2 ∪ ... ∪ DB_gn
12 The output is the differentially private, anonymized data set DB'
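A minimal sketch of the adaptive budget step of Algorithm 2 follows. It assumes the generalized groups and their frequencies are already available (as in lines 4 and 7), computes ε̂ = ε/|A_Pr| and ε_i = ε̂/(f_r + d), and perturbs each frequency with NumPy's Laplace sampler of scale 1/ε_i. As before, this is an illustration with hypothetical names and data, not the implementation used for the reported experiments.

```python
import numpy as np

def adiffp_noisy_frequencies(group_frequencies, num_predictors, epsilon=0.1, depth=2, rng=None):
    """group_frequencies maps a generalized group (a tuple of attribute values) to its count f_r."""
    rng = rng or np.random.default_rng()
    eps_hat = epsilon / num_predictors              # line 5: eps_hat = eps / |A_Pr|
    noisy = {}
    for group, f_r in group_frequencies.items():
        eps_i = eps_hat / (f_r + depth)             # line 8: adaptive per-group budget
        noisy[group] = max(0, round(f_r + rng.laplace(0.0, 1.0 / eps_i)))  # line 9
    return noisy                                    # line 11: groups with noisy frequencies

# Hypothetical generalized groups; the data set is assumed to have 5 predictor attributes and d = 2.
groups = {
    ("Koln", "Health prof.", "20-40", "50-80", "Y"): 8,
    ("Berlin", "Media artist", "40-60", "30-60", "N"): 3,
}
print(adiffp_noisy_frequencies(groups, num_predictors=5))
```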

The proposed algorithm (line 4) then partitions the generalized data set (non-synthetically) based on the similarities of the predictor attributes under the taxonomy tree. At this stage, the algorithm also counts the frequency of each group (the number of rows in that group). In line 5, the algorithm calculates the initial privacy budget based on the number of predictor attributes in the input data set. The final privacy budget for a particular group is then calculated in line 8. Next, Laplace noise is generated and added to the frequency of that group. This process repeats until noise has been added to all groups of the generalized data set (lines 6 to 10). Because the proposed algorithm recalculates the privacy budget depending on the number of predictor attributes and the size of each group, we consider this procedure an adaptive noise addition. As soon as the noise addition is complete, the algorithm merges all the subgroups to form the anonymized, differentially private sanitized data set (line 12). Finally, this data set is published for interested parties (e.g., data miners).

5.4.1 Working Example

This section presents a working example of the proposed algorithm. The generalization process is explained in Section 4.5. Let us consider a data set having 5 predictor attributes and 1 class attribute. After applying the generalization process, Table 5.1 is generated with the frequencies of two different groups:

Table 5.1: Anonymized form of the sample data set with group frequencies

City    Job           Age    Year  Expense  Class  Frequency
Koln    Health prof.  20-40        50-80    Y      8
Berlin  Media artist  40-60        30-60    N      3

Let the initial privacy budget be ε = 0.1 and the taxonomy tree depth be d = 2. As there are 5 predictor attributes (excluding the class attribute) in the data set, the privacy budget is recalculated according to the ADiffP algorithm (line 5) as:

\hat{\varepsilon} = \frac{0.1}{5} = 0.02    (5.1)

Now, the group g_1 = ⟨Koln, health prof., 20-40, ..., 50-80, Y⟩ in Table 5.1 has frequency f_i = |g_1| = 8. At line 8 of Algorithm 2, the privacy budget for the group g_1 is recalculated as:

\varepsilon_i = \hat{\varepsilon} / (|f_r| + d)    (5.2)

According to Equation 5.2, the privacy budget for this group is:

\varepsilon_{i=1} = \frac{0.02}{(8 + 2)} = 0.002    (5.3)

Then the amount of noise is calculated using the following equation:

N_r = \left| \ln(1 - 2(1/\varepsilon_i)) \right|    (5.4)

According to Equation 5.4, the amount of noise for g_1 is:

N_r = \left| \ln(1 - 2(1/0.002)) \right| \approx 7    (5.5)

Similarly, the noise for the other group, g_2 = ⟨Berlin, Media artist, 40-60, ..., 30-60, N⟩, is calculated as 6. Table 5.2 shows the noisy frequencies for both groups.

Table 5.2: Noisy frequencies for the sanitized data

City    Job           Age    Year  Expense  Class  Noisy Frequency
Koln    Health prof.  20-40        50-80    Y      8+7
Berlin  Media artist  40-60        30-60    N      3+6

5.5 Data Sets

Two different data sets are used to test the proposed algorithm:

- The Adult Data Set [57]
- The Doctor's Bill Data Set V2 [39]

The Adult data set consists of 45,222 tuples and is 5.4 MB in size. It is a census data set and is publicly available for download.

It contains real-life data with 6 numeric attributes, 8 categorical/non-numerical attributes, and class information that distinguishes two income levels, > 50K and ≤ 50K. Table 5.3 presents all attributes of the Adult data set with their types.

Table 5.3: Attributes of the Adult Data Set

Attribute                  Type
Work Class                 Categorical
Marital Status             Categorical
Occupation                 Categorical
Race                       Categorical
Sex                        Categorical
Relationship               Categorical
Native Country             Categorical
Education                  Categorical
Age                        Numerical
Capital-gain               Numerical
Capital-loss               Numerical
Hours-per-week             Numerical
Final-Weight               Numerical
Education Number of Years  Numerical
Class                      > 50K, ≤ 50K

Chapter 4 (page 65) discusses the preprocessing of the Doctor's Bill data set. The Doctor's Bill data set Version 2 (DBDS V2) has an extra attribute, zip code, compared to Version 1 (V1).

Table 5.4 lists the attributes of the DBDS V2 and their types. The DBDS V2 data set consists of 46 tuples and contains 3 numerical, 3 categorical, 1 set-valued, and 1 class attribute.

Table 5.4: Attributes of the Doctor's Bills Data Set V2

Attribute            Type
Sex                  Categorical
City                 Categorical
Age                  Numerical
Disease              Categorical
Year                 Numerical
Zip Code             Set-valued
Diagnostic Spending  Numerical
Class                > 60K, ≤ 60K

5.6 Result and Discussion

The classification accuracies of the data sets sanitized by the proposed algorithm, for both the Adult data set and the Doctor's Bill data set V2, are reported in Tables 5.5 and 5.6 respectively. The corresponding graphs are shown in Figures 5.1 and 5.2. The accuracy results in Tables 5.5 and 5.6 were produced using the Decision Tree method [44].

To evaluate the utility of the produced differentially private anonymized data sets, experiments were completed at five different taxonomy tree depths, d = 2, 4, 8, 12, and 16. For every depth d, the value of ε was varied over 0.1, 0.25, 0.5, 1, 2, 3, and 4 to obtain the classification accuracy, and every set of experiments was repeated five times to generate sanitized data and the corresponding classification accuracies. The average (arithmetic mean) of the accuracies is reported in the tables. In these experiments, out of the roughly 45K data instances, we used 34% of the data set as the test set and the remaining 66% as the training set.

Table 5.5: Classification accuracy using the Decision Tree classifier for the Adult Data Set (accuracy by taxonomy tree depth TTd and privacy budget ε)

Table 5.6: Classification Accuracy for the Doctor's Bill V2 Data Set (accuracy by taxonomy tree depth TTD and privacy budget ε)

Figure 5.1: Classification Accuracy for the Adult Data Set

Figure 5.2: Classification Accuracy for the Doctor's Bill V2 Data Set

Figure 5.3: Comparisons among the proposed algorithm and five other algorithms

5.6.1 Risk of Re-identification

Table 5.7 shows the re-identification risk measured under the three attack models described in Chapter 3. The ARX software [37][36] is used to measure the risk of re-identification for the sanitized and raw data sets. The risk of re-identification for the data sanitized by ADiffP is very low (see Table 5.7), which means the sanitized data is safe to publish.

Table 5.7: Comparison of re-identification risk between sanitized and non-sanitized data sets

Data Set          Re-identification Risk of Raw/HIPAA-Compliant Data (%)  Re-identification Risk of Data Sanitized by ADiffP (%)
Doctor's Bill V2  PS=100; JS=100; MS=100                                  PS=0.004; JS=0.004; MS=0.004
Adult             PS=0.003; JS=0.003; MS=0.003                            PS=0.001; JS=0.001; MS=0.001

Doctor's Bill V2

Figures 5.4 and 5.5 show the risk of re-identification before and after sanitization of the Doctor's Bill data set V2. In Figure 5.4, the success rate of re-identification is 100% for all three attack models, namely the Prosecutor, Journalist, and Marketer attacker models. This means that every record of this data set is identifiable, as the data set contains the zip code of every patient; as mentioned earlier, we intentionally kept that attribute in the data set to test the robustness of the proposed algorithm. After sanitization, the risk of re-identification drops to 0.004% for all three attacker models. This risk is below the threshold and so small (0.004%) that it is close to zero, which means that the proposed algorithm sanitizes the data robustly.

Figure 5.4: Risk of Re-identification for the Raw Doctor's Bill V2 Data Set


More information

Eagles Charitable Foundation Privacy Policy

Eagles Charitable Foundation Privacy Policy Eagles Charitable Foundation Privacy Policy Effective Date: 1/18/2018 The Eagles Charitable Foundation, Inc. ( Eagles Charitable Foundation, we, our, us ) respects your privacy and values your trust and

More information

Data Inventory and Classification, Physical Devices and Systems ID.AM-1, Software Platforms and Applications ID.AM-2 Inventory

Data Inventory and Classification, Physical Devices and Systems ID.AM-1, Software Platforms and Applications ID.AM-2 Inventory Audience: NDCBF IT Security Team Last Reviewed/Updated: March 2018 Contact: Henry Draughon hdraughon@processdeliveysystems.com Overview... 2 Sensitive Data Inventory and Classification... 3 Applicable

More information

CruiseSmarter PRIVACY POLICY. I. Acceptance of Terms

CruiseSmarter PRIVACY POLICY. I. Acceptance of Terms I. Acceptance of Terms This Privacy Policy describes CRUISE SMARTER policies and procedures on the collection, use and disclosure of your information. CRUISE SMARTER LLC (hereinafter referred to as "we",

More information

Privacy Preserving Health Data Mining

Privacy Preserving Health Data Mining IJCST Vo l. 6, Is s u e 4, Oc t - De c 2015 ISSN : 0976-8491 (Online) ISSN : 2229-4333 (Print) Privacy Preserving Health Data Mining 1 Somy.M.S, 2 Gayatri.K.S, 3 Ashwini.B 1,2,3 Dept. of CSE, Mar Baselios

More information

Best Practices. Contents. Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL Meridiantechnologies.net

Best Practices. Contents. Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL Meridiantechnologies.net Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL 32257 Meridiantechnologies.net Contents Overview... 2 A Word on Data Profiling... 2 Extract... 2 De- Identification... 3 PHI... 3 Subsets...

More information

Privacy-preserving machine learning. Bo Liu, the HKUST March, 1st, 2015.

Privacy-preserving machine learning. Bo Liu, the HKUST March, 1st, 2015. Privacy-preserving machine learning Bo Liu, the HKUST March, 1st, 2015. 1 Some slides extracted from Wang Yuxiang, Differential Privacy: a short tutorial. Cynthia Dwork, The Promise of Differential Privacy.

More information

Virtua Health, Inc. is a 501 (c) (3) non-profit corporation located in Marlton, New Jersey ( Virtua ).

Virtua Health, Inc. is a 501 (c) (3) non-profit corporation located in Marlton, New Jersey ( Virtua ). myvirtua.org Terms of Use PLEASE READ THESE TERMS OF USE CAREFULLY Virtua Health, Inc. is a 501 (c) (3) non-profit corporation located in Marlton, New Jersey ( Virtua ). Virtua has partnered with a company

More information

A Review of Privacy Preserving Data Publishing Technique

A Review of Privacy Preserving Data Publishing Technique A Review of Privacy Preserving Data Publishing Technique Abstract:- Amar Paul Singh School of CSE Bahra University Shimla Hills, India Ms. Dhanshri Parihar Asst. Prof (School of CSE) Bahra University Shimla

More information

Remote Access to a Healthcare Facility and the IT professional s obligations under HIPAA and the HITECH Act

Remote Access to a Healthcare Facility and the IT professional s obligations under HIPAA and the HITECH Act Remote Access to a Healthcare Facility and the IT professional s obligations under HIPAA and the HITECH Act Are your authentication, access, and audit paradigms up to date? Table of Contents Synopsis...1

More information

All Aboard the HIPAA Omnibus An Auditor s Perspective

All Aboard the HIPAA Omnibus An Auditor s Perspective All Aboard the HIPAA Omnibus An Auditor s Perspective Rick Dakin CEO & Chief Security Strategist February 20, 2013 1 Agenda Healthcare Security Regulations A Look Back What is the final Omnibus Rule? Changes

More information

Privacy Preserving Data Mining. Danushka Bollegala COMP 527

Privacy Preserving Data Mining. Danushka Bollegala COMP 527 Privacy Preserving ata Mining anushka Bollegala COMP 527 Privacy Issues ata mining attempts to ind mine) interesting patterns rom large datasets However, some o those patterns might reveal inormation that

More information

Website Privacy Policy

Website Privacy Policy Website Privacy Policy Last updated: May 12, 2016 This privacy policy (the Privacy Policy ) applies to this website and all services provided through this website, including any games or sweepstakes (collectively,

More information

Comparison and Analysis of Anonymization Techniques for Preserving Privacy in Big Data

Comparison and Analysis of Anonymization Techniques for Preserving Privacy in Big Data Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 2 (2017) pp. 247-253 Research India Publications http://www.ripublication.com Comparison and Analysis of Anonymization

More information

Re: Special Publication Revision 4, Security Controls of Federal Information Systems and Organizations: Appendix J, Privacy Control Catalog

Re: Special Publication Revision 4, Security Controls of Federal Information Systems and Organizations: Appendix J, Privacy Control Catalog April 6, 2012 National Institute of Standards and Technology 100 Bureau Drive, Stop 1070 Gaithersburg, MD 20899-1070 Re: Special Publication 800-53 Revision 4, Security Controls of Federal Information

More information

ISAO SO Product Outline

ISAO SO Product Outline Draft Document Request For Comment ISAO SO 2016 v0.2 ISAO Standards Organization Dr. Greg White, Executive Director Rick Lipsey, Deputy Director May 2, 2016 Copyright 2016, ISAO SO (Information Sharing

More information

BCN Telecom, Inc. Customer Proprietary Network Information Certification Accompanying Statement

BCN Telecom, Inc. Customer Proprietary Network Information Certification Accompanying Statement BCN Telecom, Inc. Customer Proprietary Network Information Certification Accompanying Statement BCN TELECOM, INC. ( BCN" or "Company") has established practices and procedures adequate to ensure compliance

More information

Privacy & Information Security Protocol: Breach Notification & Mitigation

Privacy & Information Security Protocol: Breach Notification & Mitigation The VUMC Privacy Office coordinates compliance with the required notification steps and prepares the necessary notification and reporting documents. The business unit from which the breach occurred covers

More information

TREND MICRO PRIVACY POLICY (Updated May 2012)

TREND MICRO PRIVACY POLICY (Updated May 2012) TREND MICRO PRIVACY POLICY (Updated May 2012) Trend Micro Incorporated and its subsidiaries and affiliates (collectively, "Trend Micro") are committed to protecting your privacy and ensuring you have a

More information

HF Markets SA (Pty) Ltd Protection of Personal Information Policy

HF Markets SA (Pty) Ltd Protection of Personal Information Policy Protection of Personal Information Policy Protection of Personal Information Policy This privacy statement covers the website www.hotforex.co.za, and all its related subdomains that are registered and

More information

SANMINA CORPORATION PRIVACY POLICY. Effective date: May 25, 2018

SANMINA CORPORATION PRIVACY POLICY. Effective date: May 25, 2018 SANMINA CORPORATION PRIVACY POLICY Effective date: May 25, 2018 This Privacy Policy (the Policy ) sets forth the privacy principles that Sanmina Corporation and its subsidiaries (collectively, Sanmina

More information

Adopter s Site Support Guide

Adopter s Site Support Guide Adopter s Site Support Guide Provincial Client Registry Services Version: 1.0 Copyright Notice Copyright 2016, ehealth Ontario All rights reserved No part of this document may be reproduced in any form,

More information

Jeffrey Friedberg. Chief Trust Architect Microsoft Corporation. July 12, 2010 Microsoft Corporation

Jeffrey Friedberg. Chief Trust Architect Microsoft Corporation. July 12, 2010 Microsoft Corporation Jeffrey Friedberg Chief Trust Architect Microsoft Corporation July 2, 200 Microsoft Corporation Secure against attacks Protects confidentiality, integrity and availability of data and systems Manageable

More information

Differential Privacy. CPSC 457/557, Fall 13 10/31/13 Hushiyang Liu

Differential Privacy. CPSC 457/557, Fall 13 10/31/13 Hushiyang Liu Differential Privacy CPSC 457/557, Fall 13 10/31/13 Hushiyang Liu Era of big data Motivation: Utility vs. Privacy large-size database automatized data analysis Utility "analyze and extract knowledge from

More information

74% 2014 SIEM Efficiency Report. Hunting out IT changes with SIEM

74% 2014 SIEM Efficiency Report. Hunting out IT changes with SIEM 2014 SIEM Efficiency Report Hunting out IT changes with SIEM 74% OF USERS ADMITTED THAT DEPLOYING A SIEM SOLUTION DIDN T PREVENT SECURITY BREACHES FROM HAPPENING Contents Introduction 4 Survey Highlights

More information

Janie Appleseed Network Privacy Policy

Janie Appleseed Network Privacy Policy Last Updated: April 26, 2017 Janie Appleseed Network Privacy Policy The Janie Appleseed Network respects and values your privacy. This Privacy Policy describes how Janie Appleseed Network, a Rhode Island

More information

Hippocratic Databases and Fine Grained Access Control

Hippocratic Databases and Fine Grained Access Control Hippocratic Databases and Fine Grained Access Control Li Xiong CS573 Data Privacy and Security Review Anonymity - an individual (or an element) not identifiable within a well-defined set Confidentiality

More information

Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust

Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust G.Mareeswari 1, V.Anusuya 2 ME, Department of CSE, PSR Engineering College, Sivakasi, Tamilnadu,

More information

General Data Protection Regulation Frequently Asked Questions (FAQ) General Questions

General Data Protection Regulation Frequently Asked Questions (FAQ) General Questions General Data Protection Regulation Frequently Asked Questions (FAQ) This document addresses some of the frequently asked questions regarding the General Data Protection Regulation (GDPR), which goes into

More information

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Partition Based Perturbation for Privacy Preserving Distributed Data Mining BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation

More information

Microdata Publishing with Algorithmic Privacy Guarantees

Microdata Publishing with Algorithmic Privacy Guarantees Microdata Publishing with Algorithmic Privacy Guarantees Tiancheng Li and Ninghui Li Department of Computer Science, Purdue University 35 N. University Street West Lafayette, IN 4797-217 {li83,ninghui}@cs.purdue.edu

More information

Privacy Policy Manhattan Neighborhood Network Policies 2017

Privacy Policy Manhattan Neighborhood Network Policies 2017 Privacy Policy Manhattan Neighborhood Network Policies 2017 Table of Contents Manhattan Neighborhood Network Policies 3 MNN s Privacy Policy 3 Information Collection, Use and Sharing 4 Your Access to and

More information

Elements of a Swift (and Effective) Response to a HIPAA Security Breach

Elements of a Swift (and Effective) Response to a HIPAA Security Breach Elements of a Swift (and Effective) Response to a HIPAA Security Breach Susan E. Ziel, RN BSN MPH JD Krieg DeVault LLP Past President, The American Association of Nurse Attorneys Disclaimer The information

More information

Keeping It Under Wraps: Personally Identifiable Information (PII)

Keeping It Under Wraps: Personally Identifiable Information (PII) Keeping It Under Wraps: Personally Identifiable Information (PII) Will Robinson Assistant Vice President Information Security Officer & Data Privacy Officer Federal Reserve Bank of Richmond March 14, 2018

More information

VIACOM INC. PRIVACY SHIELD PRIVACY POLICY

VIACOM INC. PRIVACY SHIELD PRIVACY POLICY VIACOM INC. PRIVACY SHIELD PRIVACY POLICY Last Modified and Effective as of October 23, 2017 Viacom respects individuals privacy, and strives to collect, use and disclose personal information in a manner

More information

Content. Privacy Policy

Content. Privacy Policy Content 1. Introduction...2 2. Scope...2 3. Application...3 4. Information Required...3 5. The Use of Personal Information...3 6. Third Parties...4 7. Security...5 8. Updating Client s Information...5

More information

HIPAA COMPLIANCE AND DATA PROTECTION Page 1

HIPAA COMPLIANCE AND DATA PROTECTION Page 1 HIPAA COMPLIANCE AND DATA PROTECTION info@resultstechnology.com 877.435.8877 Page 1 CONTENTS Introduction..... 3 The HIPAA Security Rule... 4 The HIPAA Omnibus Rule... 6 HIPAA Compliance and RESULTS Cloud

More information

Village Software. Security Assessment Report

Village Software. Security Assessment Report Village Software Security Assessment Report Version 1.0 January 25, 2019 Prepared by Manuel Acevedo Helpful Village Security Assessment Report! 1 of! 11 Version 1.0 Table of Contents Executive Summary

More information

Securing the Internet of Things (IoT) at the U.S. Department of Veterans Affairs

Securing the Internet of Things (IoT) at the U.S. Department of Veterans Affairs Securing the Internet of Things (IoT) at the U.S. Department of Veterans Affairs Dominic Cussatt Acting Deputy Assistant Secretary / Chief Information Security Officer (CISO) February 20, 2017 The Cyber

More information

HIPAA AND SECURITY. For Healthcare Organizations

HIPAA AND  SECURITY. For Healthcare Organizations HIPAA AND EMAIL SECURITY For Healthcare Organizations Table of content Protecting patient information 03 Who is affected by HIPAA? 06 Why should healthcare 07 providers care? Email security & HIPPA 08

More information

Information Privacy Statement

Information Privacy Statement Information Privacy Statement Commitment to Privacy The University of Florida values individuals' privacy and actively seeks to preserve the privacy rights of those who share information with us. Your

More information

EU GDPR and . The complete text of the EU GDPR can be found at What is GDPR?

EU GDPR and  . The complete text of the EU GDPR can be found at  What is GDPR? EU GDPR and Email The EU General Data Protection Regulation (GDPR) is the new legal framework governing the use of the personal data of European Union (EU) citizens across all EU markets. It replaces existing

More information

Electronic Communication of Personal Health Information

Electronic Communication of Personal Health Information Electronic Communication of Personal Health Information A presentation to the Porcupine Health Unit (Timmins, Ontario) May 11 th, 2017 Nicole Minutti, Health Policy Analyst Agenda 1. Protecting Privacy

More information