ADVANCES in NATURAL and APPLIED SCIENCES


Published by AENSI Publication. 11(7), May 2017, pages 585-591. Open Access Journal.

A Privacy Preserving Data Mining Approach for Handling Data with Outliers

1 V.V. Vishnu Priya, 2 A.K. Ilavarasi, 3 Dr. B. Sathiya Bhama
1 PG Student, CSE, Sona College of Technology, Salem, India.
2 Assistant Professor, CSE, Sona College of Technology, Salem, India.
3 HOD, CSE, Sona College of Technology, Salem, India.

Received 28 January 2017; Accepted 22 May 2017; Available online 28 May 2017

Address For Correspondence: V.V. Vishnu Priya, PG Student, CSE, Sona College of Technology, Salem, India. vishnupriya31.v@gmail.com

Copyright 2017 by authors and American-Eurasian Network for Scientific Information (AENSI Publication). This work is licensed under the Creative Commons Attribution International License (CC BY).

ABSTRACT
Organizations publish their private data for research analysis. Publishing datasets for analysis raises serious data privacy concerns. The published data may contain outliers. Outliers are easily identifiable, so adversaries can capture private information about an individual by linking an outlier with other attributes published in an external database. The motivation is to prevent the disclosure of sensitive information. A distinguishability attack occurs when the published datasets contain outliers. The syntactic privacy models alone cannot prevent the attack, whereas plain l-diversity can defend against it. The existing plain l-diversity protects the dataset from the distinguishability attack, but it results in information loss. In this paper we improve the algorithm to minimize information loss using the K Nearest Neighbour algorithm.

KEYWORDS: Privacy, Data Mining, Outliers, Data Sharing.

INTRODUCTION
Data Mining [1] is the process of extracting knowledge from data stored in large repositories.
The Privacy Preserving Data Mining (PPDM) problem [2] has received growing attention in recent years because huge amounts of information about individuals are stored at different vendors for research purposes. PPDM is a new research topic in data mining and statistical databases, in which data mining algorithms are analyzed to check whether they preserve the privacy of the data. Protecting individuals' data from disclosure is an important function for maintaining privacy, so privacy plays a major role in the data mining process. The problem with data mining output is that it can reveal an individual's personal data, which threatens that individual's privacy. People expect that their personal information will not become known to others without their knowledge, but data mining algorithms have failed to protect the privacy of individuals.

Privacy is defined as the right of an individual to keep their sensitive information from being disclosed. Privacy requires that, given a set of records, the adversary cannot identify the person associated with a record. The results of data mining operations are sensitive. Privacy is one of the important properties [4] that a system needs to satisfy. For this purpose, numerous efforts have been devoted to PPDM algorithms that protect information from being disclosed.

One of the basic data mining problems is outlier detection. An outlier is an observation point that deviates from the other observations or from the rest of the data [6]. Outliers can be novel, abnormal, unusual or noisy information. Outliers may be real or erroneous.

To cite this article: V.V. Vishnu Priya, A.K. Ilavarasi, Dr. B. Sathiya Bhama, A Privacy Preserving Data Mining Approach for Handling Data with Outliers. Advances in Natural and Applied Sciences. 11(7); Pages: 585-591

V.V. Vishnu Priya et al., 2017 / Advances in Natural and Applied Sciences. 11(7) May 2017, Pages: 585-591

Related work:
In recent years privacy preserving data publishing has drawn more attention. To protect the privacy of individuals [3], the dataset must be anonymized before it is released. A previous study [3] showed that removing explicit identifiers such as name and SSN (Social Security Number) from the dataset does not maintain privacy, because quasi-identifiers such as zip code and gender jointly help to identify a person. The identity of the person can be revealed easily when the dataset is compared with a public dataset (e.g. a voter list).

Sweeney [5] proposed the k-anonymity method, which is treated as the conventional method for anonymization. Quasi-identifiers consist of person-specific attribute information. k-anonymity is achieved using generalization and suppression, so that each individual is indistinguishable from at least k-1 other records. Generalization replaces a value with a less specific one that is still semantically consistent. Suppression [8] does not release the value at all, which reduces the exactness of applications. The k-anonymity method protects against the linkage attack. The authors of [10][5] showed that removing quasi-identifiers from the dataset does not ensure privacy, so they suggested that the k-anonymity method is better for publishing microdata. The author of [9] suggested a novel approach based on a bottom-up method to group the quasi-identifiers. The k-anonymity model [5] is proved to be theoretically NP-hard. Two types of attack remain possible against it: the background knowledge attack and the homogeneity attack.

The l-diversity model [7] was introduced to protect against attribute disclosure. It requires at least l distinct, well-represented sensitive values in each equivalence class. Improved methods such as t-closeness, p-sensitive anonymity and (k,e)-anonymity [11] are described there.
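The k-anonymity condition described above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the toy records, the attribute names (`zip`, `gender`, `disease`) and the digit-masking generalization are all assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

def generalize_zip(record, digits_kept=3):
    """Generalization: replace trailing zip digits with '*' (hypothetical scheme)."""
    out = dict(record)
    z = record["zip"]
    out["zip"] = z[:digits_kept] + "*" * (len(z) - digits_kept)
    return out

# Toy microdata; all values are invented for illustration.
records = [
    {"zip": "63601", "gender": "F", "disease": "flu"},
    {"zip": "63605", "gender": "F", "disease": "cancer"},
    {"zip": "63612", "gender": "M", "disease": "flu"},
    {"zip": "63619", "gender": "M", "disease": "asthma"},
]
qi = ["zip", "gender"]

raw_ok = is_k_anonymous(records, qi, k=2)
generalized = [generalize_zip(r) for r in records]
gen_ok = is_k_anonymous(generalized, qi, k=2)
```

In this toy table every raw record is unique on (zip, gender), so the 2-anonymity check fails; after masking the last two zip digits each quasi-identifier group holds two records and the check passes.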
Like the l-diversity model, several other approaches have been proposed to achieve the principle of privacy [13,15,16,18]. They are classified into partitioning methods and randomization methods. In the partitioning-based method the dataset [15,18] is divided into quasi-identifier groups and only the anonymized groups are published. To increase the utility of the anonymized dataset, a non-homogeneous generalization method was proposed by Koudas [12]. In the randomization approach the original values are replaced with noise or duplicate values [15]. Li et al. proposed the t-closeness method [14], in which the distribution of sensitive values in the released dataset must be close to the original. If outliers are present in the original dataset, they must appear in both the original and the modified dataset, so the outliers can be easily detected using the distribution. A few studies show the possibility of attacks on the partition-based schemes. Machanavajjhala et al. [17] described some of the attacks on k-anonymity and proposed l-diversity. Our work adapts the l-diversity model.

In recent years a new privacy model has emerged, known as differential privacy [20]. Under differential privacy the removal or addition of any one record does not significantly affect the result computed from the dataset [19]. Numerous techniques have been proposed to publish different types of data while satisfying differential privacy [21,23]. Barak et al. [21] proposed a method to publish the marginals of the dataset. Blum et al. [22] proposed an approach for releasing one-dimensional data that satisfies differential privacy in a non-interactive way. Hay et al. [24] improved the performance of [22]. The wavelet-based approach [25] is used by Xiao et al. for publishing multi-dimensional datasets.
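To make the differential privacy idea concrete, here is a small sketch of a noisy counting query using the Laplace mechanism. The paper only cites differential privacy as related work and does not describe a mechanism, so the choice of the Laplace mechanism, the toy records and the helper names are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Counting query with Laplace noise.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def is_flu(record):
    return record["disease"] == "flu"

# Toy data, invented for illustration.
records = [
    {"disease": "flu"},
    {"disease": "cancer"},
    {"disease": "flu"},
    {"disease": "asthma"},
]

noisy = private_count(records, is_flu, epsilon=0.5, rng=random.Random(0))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of a less accurate answer.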
3. Privacy preserving method for data containing Outliers:
Organizations are increasingly publishing microdata that contains non-aggregated information about individuals. Non-aggregated data that contains outliers raises serious data privacy concerns. When outliers exist in the dataset, they are easier to distinguish from the crowd and privacy is breached. In a distinguishability-based attack the adversary identifies outliers and reveals their private information from an anonymized dataset. The existing plain l-diversity protects the dataset from the distinguishability attack, but it results in information loss since it hides only the hideable outliers present in the dataset.

In l-diversity, all records that share the same quasi-identifier values should have at least l distinct values for their sensitive attributes. In previous studies [3] using the l-diversity method, the QI attributes are generalized and the outliers present in the data are hidden in order to maintain privacy. However, generalizing and hiding data that contains outliers results in information loss, since only the hideable outliers are hidden and the unhideable outliers are eliminated. In the proposed system the information loss is reduced using the KNN algorithm by enhancing the l-diversity model. The proposed system consists of five steps, described in Fig. 3.1.
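The l-diversity requirement stated above can be sketched as a simple check over equivalence classes. This covers only the "distinct values" reading of l-diversity; the table contents and attribute names are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict

def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

# A 2-anonymous toy table (values are assumptions for illustration).
table = [
    {"zip": "636**", "gender": "F", "disease": "flu"},
    {"zip": "636**", "gender": "F", "disease": "cancer"},
    {"zip": "636**", "gender": "M", "disease": "flu"},
    {"zip": "636**", "gender": "M", "disease": "flu"},
]
qi = ["zip", "gender"]

two_diverse = is_distinct_l_diverse(table, qi, "disease", l=2)
one_diverse = is_distinct_l_diverse(table, qi, "disease", l=1)
```

The male equivalence class holds a single sensitive value, so the table fails the check for l = 2 even though it is 2-anonymous; this is the homogeneity problem that l-diversity is meant to address.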

Fig. 3.1: System Architecture of Proposed System
Load Dataset -> Cluster records -> KNN Classifier -> Find Outliers -> Group outliers to its nearest cluster

In the proposed system we first import the dataset and apply fuzzy clustering. Fuzzy clustering groups the data into n clusters in which each data point belongs to each cluster to a certain degree; in simple words, each point can belong to more than one cluster. Fuzzy clustering is applied because it helps each data point move to its nearest cluster; the KNN classifier is then applied to find the outliers. Fuzzy clustering finds the combination weights, membership functions and cluster centres that minimize the objective function. Outliers are observation points that deviate from the others. KNN classifies the new cases (outliers) based on a distance function. The outliers present in the data are moved to their nearest bucket (cluster).

KNN algorithm steps:
1. Determine the parameter K = number of nearest neighbours.
2. Calculate the distance between the query instance and all training samples.
3. Find the nearest neighbours by sorting, and gather the categories of the nearest neighbours.
4. Use the simple majority of the categories of the nearest neighbours as the prediction value of the query instance.

RESULT AND DISCUSSION
The input is the Adult dataset, which is first loaded; the fuzzy clustering method is then applied to cluster the records. It is used especially for mapping the outliers to their nearest cluster.

Fig. 4.1: Raw Dataset
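The four KNN steps listed in Section 3 can be sketched as follows. This is a minimal illustration on invented two-dimensional points, assuming Euclidean distance (the distance the paper names later); the sample data and function name are hypothetical.

```python
import math
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: distance between the query instance and every training sample.
    distances = [(math.dist(query, s), lab) for s, lab in zip(samples, labels)]
    # Step 3: sort by distance and gather the categories of the k nearest.
    distances.sort(key=lambda pair: pair[0])
    nearest = [lab for _, lab in distances[:k]]
    # Step 4: simple majority of the neighbours' categories.
    return Counter(nearest).most_common(1)[0][0]

# Step 1: choose K. The training points below are invented for illustration.
K = 3
samples = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict((2, 2), samples, labels, K)
near_b = knn_predict((7, 9), samples, labels, K)
```

With two classes, an odd K avoids tied votes; with ties, this sketch simply returns whichever tied category was encountered first.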

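The outlier detection and mapping steps can be illustrated with a small sketch. The paper does not give the exact scoring rule, so the score used here (mean Euclidean distance to the k nearest neighbours), the threshold value, the points and the cluster centres are all assumptions chosen for illustration.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def nearest_cluster(point, centres):
    """Index of the cluster centre closest to `point` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda c: math.dist(point, centres[c]))

# Two tight groups plus one isolated record (invented data).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (3.4, 3.6)]
centres = [(1.0, 1.0), (5.0, 5.0)]   # hypothetical cluster centres
threshold = 1.0                      # hypothetical cut-off for the score

scores = knn_outlier_scores(points, k=2)
outliers = [i for i, s in enumerate(scores) if s > threshold]
# Move each outlier to its nearest bucket instead of discarding it.
assignment = {i: nearest_cluster(points[i], centres) for i in outliers}
```

Only the isolated record exceeds the threshold here, and it is assigned to the second cluster, whose centre is closer. Keeping the record in a bucket rather than suppressing it is the step the paper credits with reducing information loss.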

Fig. 4.2: Fuzzy Clustering
Fig. 4.3: Assigning K value for KNN classification
Fig. 4.4: Outlier Detection
Fig. 4.5: Outlier Mapping

After that, the KNN classifier is applied and finds the outliers based on distance. The Euclidean distance method is used to calculate the distance between the records. Then the outliers are moved to their nearest bucket (cluster). In this way the privacy of the dataset is maintained using l-diversity, and information loss is reduced using the KNN classifier by assigning the outliers to their nearest clusters.

Table 1 describes the performance of the information loss metrics. The information loss is analyzed in terms of the outlier detection error ratio, with results in figure 4.6. Information loss is defined as,

1 Loss = D 2 O 1 + N DC 1 where, N DC ( D 2, N DC D ) 2


D: Dimensionality of the vector (2, 3, 4, 5, ...)
O: Outlier
NDc: The number of training samples per class (> D+1)

Table 1: Comparisons of Information Loss during number of runs
Methods   Existing   Proposed

Table 2 describes the performance of the system using silhouette metrics, shown in figure 4.7. Accuracy is defined as

$$\mathrm{Accuracy}(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the average dissimilarity of i to the other members of its own cluster, and b(i) is the lowest average dissimilarity of i to any other cluster of which i is not a member. The cluster with this lowest average dissimilarity is said to be the neighbouring cluster of i because it is the next best fit cluster for point i.

Table 2: Comparisons of Cluster Accuracy during number of runs
Methods   Existing   Proposed

Table 3 describes the performance of the system using time metrics. The computational time metric is analyzed in terms of outlier detection in figure 4.8. Computational Time (CT) is defined as

CT = Process End Time - Process Start Time

Table 3: Comparisons of Computational Time (CPU seconds) during number of runs
Methods   Existing   Proposed

Fig. 4.6: Information loss Graph
The information loss is reduced compared to the existing method.
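The silhouette-based accuracy measure defined above can be computed per point as in the sketch below. It assumes Euclidean distance as the dissimilarity and uses invented points and labels; it is an illustration of the formula, not the authors' evaluation code.

```python
import math

def silhouette(i, points, labels):
    """Silhouette value (b - a) / max(a, b) for point i.

    Assumes point i's own cluster has at least one other member;
    otherwise a(i) is undefined and this sketch would divide by zero.
    """
    def mean_dist(cluster):
        members = [p for j, p in enumerate(points)
                   if labels[j] == cluster and j != i]
        return sum(math.dist(points[i], p) for p in members) / len(members)

    a = mean_dist(labels[i])                       # fit to own cluster
    b = min(mean_dist(c) for c in set(labels)      # neighbouring cluster
            if c != labels[i])
    return (b - a) / max(a, b)

# Two well-separated clusters (invented data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels = [0, 0, 1, 1]

s0 = silhouette(0, points, labels)
```

Values near 1 mean the point sits much closer to its own cluster than to the neighbouring one; values near 0 or below suggest it lies between clusters or was assigned to the wrong one.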

Fig. 4.7: Accuracy Graph
Fig. 4.8: Compilation Time Graph

Conclusion:
In this paper the problem of publishing data with outliers in a privacy preserving fashion is studied. The microdata containing outliers are published in a privacy preserving way. The existing plain l-diversity system provides privacy only for the hideable outliers, and it results in information loss. In this paper we improved the algorithm using the K Nearest Neighbour algorithm to reduce information loss.

REFERENCES
1. Han, J. and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor.
2. Ankita Shrivastava, U. Dutta, An Emblematic Study of Different Techniques in PPDM, International Journal of Advanced Research in Computer Science and Software Engineering.
3. Hui (Wendy) Wang, Ruilin Liu, Hiding outliers into crowd: Privacy Preserving data publishing with outliers, Elsevier.
4. Elisa Bertino, Dan Lin, and Wei Jiang, A Survey of Quantification of Privacy Preserving Data Mining Algorithms.
5. Sweeney, L., k-anonymity: a model for protecting privacy, Int. J. Uncertain. Fuzziness Knowl. Based Syst., 10(5).
6. Williams, G., R. Baxter, H. He, S. Hawkins and L. Gu, A Comparative Study for RNN for Outlier Detection in Data Mining. In Proceedings of the 2nd IEEE International Conference on Data Mining, page 709, Maebashi City, Japan.
7. Tiancheng Li, Ninghui Li, Jian Zhang, Ian Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing", IEEE Transactions on Knowledge & Data Engineering, 24(3).
8. Samarati, P., Protecting respondent's privacy in Microdata release, IEEE Transactions on Knowledge and Data Engineering, 13.
9. Tiancheng Li, Ninghui Li, Towards Optimal k-anonymization, Data & Knowledge Engineering, Elsevier. 303.
10. Machanavajjhala, A., J. Gehrke, D. Kifer, l-diversity: Privacy Beyond k-anonymity, ACM Transactions on Knowledge Discovery from Data.
11. Sergio Martínez, David Sánchez, Aida Valls, A semantic framework to protect the privacy of electronic health records with non-numerical attributes, Journal of Biomedical Informatics, 46.
12. Wong, W.K., N. Mamoulis, D.W.L. Cheung, Non-homogeneous generalization in privacy preserving data publishing, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD).
13. LeFevre, K., D.J. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD).
14. Li, N., T. Li, t-closeness: Privacy beyond k-anonymity and l-diversity, Proceedings of the International Conference on Data Engineering (ICDE).
15. Koudas, N., D. Srivastava, T. Yu, Q. Zhang, Distribution based microdata anonymization, Proc. VLDB Endow. 2(1).
16. Li, J., Y. Tao, X. Xiao, Preservation of proximity privacy in publishing numerical sensitive data, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD).
17. Li, N., T. Li, t-closeness: Privacy beyond k-anonymity and l-diversity, Proceedings of the IEEE 23rd International Conference on Data Engineering.
18. Chaytor, R., K. Wang, Small domain randomization: same privacy, more utility, Proc. VLDB Endow. 3(1-2).
19. Dwork, C., Differential privacy, Proceedings of International Colloquium on Automata, Languages and Programming (ICALP).
20. Dwork, C., F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, Proceedings of the Conference on Theory of Cryptography (TCC).
21. Barak, B., K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, K. Talwar, Privacy, accuracy, and consistency too: a holistic solution to contingency table release, Proceedings of ACM Symposium on Principles of Database Systems (PODS).
22. Blum, A., K. Ligett, A. Roth, A learning theory approach to non-interactive database privacy, Proceedings of the ACM Symposium on Theory of Computing (STOC).
23. Xiao, X., G. Bender, M. Hay, J. Gehrke, iReduct: differential privacy with reduced relative errors, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD).
24. Li, C., M. Hay, V. Rastogi, G. Miklau, A. McGregor, Optimizing linear counting queries under differential privacy, Proceedings of ACM Symposium on Principles of Database Systems (PODS).
25. Xiao, X., G. Wang, J. Gehrke, Differential privacy via wavelet transforms, IEEE Trans. Knowl. Data Eng., 23(8).
