Privacy Preserving Classification of heterogeneous Partition Data through ID3 Technique


Saurabh Karsoliya
B.Tech. (CSE), MANIT, Bhopal, M.P., India

Abstract: The goal of data mining is to extract, or mine, knowledge from large amounts of data. Several data mining classification techniques are used to extract this knowledge, and the ID3 algorithm is a widely used technique in this area. ID3 classifies data by creating a decision tree, here over heterogeneously partitioned data. Micro data is often collected by several different sites, and privacy, legal and commercial concerns restrict centralized access to this data. In this paper we propose classifying vertically partitioned microarray data while preserving privacy through privacy-preserving methods such as secure multi-party computation. Together, these enable the secure mining of knowledge. We focus on the problem of decision tree learning with the popular ID3 algorithm. We assume that the database is vertically partitioned into two pieces. The database considered is microarray data that is heterogeneously partitioned.

Keywords: Privacy Preserving, ID3, Decision Tree, Classification, Microarray Data.

1. INTRODUCTION
In data mining, knowledge is extracted through techniques such as classification, clustering and association. The ID3 algorithm is a standard, popular, and simple method for data classification and decision tree creation. It was developed by J. R. Quinlan, also known as Ross Quinlan [3]. Since privacy-preserving data mining should be taken into consideration, several secure multi-party computation protocols have been presented based on this technique [2]. In this paper, all extracted knowledge is expressed as a decision tree, and the input for decision tree creation is microarray data. A decision tree is a rooted tree containing nodes and edges.
Each internal node is a test node and corresponds to an attribute; the edges leaving a node correspond to the possible values taken by that attribute. For example, the attribute Home-Owner would have two edges leaving it, one for Yes and one for No. Finally, the leaves of the tree contain the expected class value for transactions matching the path from the root to that leaf [3]. ID3 uses entropy or the Gini index as its basic building block for creating the tree [3, 4]. There are two main operations during tree building to obtain the information gain:

Step 1: Evaluation of splits for each attribute and selection of the best split.
Step 2: Creation of partitions using the best split. Having determined the overall best split, partitions can be created by a simple application of the splitting criterion to the data.

Entropy and the Gini index are two measures from which the information gain is computed at each step while producing a decision tree. The Gini index, however, has been less studied in privacy-preserving data mining for classifying microarray data. The formulas used to calculate entropy and the Gini index are as follows:

$$\mathrm{Entropy}(S) = -\sum_{j} P_j \log_2 P_j \qquad\qquad \mathrm{Gini}(S) = 1 - \sum_{j} P_j^{2}$$

where $P_j$ is the relative frequency of class $j$ in $S$. Based on the entropy or the Gini index, we can compute the information gain if attribute $A$ is used to partition the data set $S$:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in A} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where $v$ represents any possible value of attribute $A$; $S_v$ is the subset of $S$ for which attribute $A$ has value $v$; $|S_v|$ is the number of elements in $S_v$; and $|S|$ is the number of elements in $S$. The Gini-based gain is obtained by using Gini in place of Entropy in this expression.

With the Gini index, splits are made such that the largest class goes into one pure node while the other classes go into the other node. Entropy normally tries to create a balanced tree. In this paper, we propose how the Gini index can be used in privacy-preserving classification of DNA microarray data with the ID3 algorithm to create a decision tree.
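The impurity measures and the information gain described above can be made concrete with a short Python sketch. It is a plain, centralized, non-private illustration only; the function names, the `home_owner` attribute key and the toy class labels are assumptions introduced here and are not part of the paper's protocol.

```python
from collections import Counter
from math import log2


def class_frequencies(labels):
    """Relative frequency P_j of each class j in the label list S."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Entropy(S) = -sum_j P_j * log2(P_j)."""
    return -sum(p * log2(p) for p in class_frequencies(labels))


def gini(labels):
    """Gini(S) = 1 - sum_j P_j^2."""
    return 1.0 - sum(p * p for p in class_frequencies(labels))


def information_gain(rows, labels, attribute, impurity=entropy):
    """Impurity of S minus the size-weighted impurity of each subset S_v."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    total = len(labels)
    remainder = sum(len(s) / total * impurity(s) for s in subsets.values())
    return impurity(labels) - remainder


# Hypothetical toy data: one attribute that separates the classes perfectly.
rows = [{"home_owner": "yes"}, {"home_owner": "yes"},
        {"home_owner": "no"}, {"home_owner": "no"}]
labels = ["low", "low", "high", "high"]

print(information_gain(rows, labels, "home_owner"))                 # entropy-based gain
print(information_gain(rows, labels, "home_owner", impurity=gini))  # Gini-based gain
```

Passing the impurity function as a parameter keeps the gain computation identical for both measures, which mirrors the statement that the gain can be based on either entropy or the Gini index.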
ID3 works iteratively. It uses a top-down approach in which all training cases initially belong to a single root node, which is then successively split to form a tree.

Building a decision tree with the ID3 algorithm:

Volume 1, Issue 4 November - December 2012 Page 135

Step 1: Select the attribute with the most information gain.
Step 2: Create a subset for each value of the attribute.
Step 3: For each subset, if not all elements of the subset belong to the same class, repeat steps 1-3 for that subset.

Empirical evidence suggests that a correct decision tree is usually found more quickly by this iterative method than by forming a tree directly from the entire training set. ID3 was designed for situations where there are many attributes and the training set contains many objects, but where a reasonably good decision tree is required without much computation. DNA microarray data is such a case.

In a DNA microarray, a typical glass slide is used on which DNA molecules are fixed in an orderly manner at specific locations called spots (or features). A microarray may contain thousands of spots, and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene [5]. The DNA in a spot may be either genomic DNA or a short stretch of oligonucleotide strands that correspond to a gene. The spots are printed onto the glass slide by a robot or are synthesized by photolithography [5, 6]. Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare the expression of a set of genes from a cell maintained in a particular condition to the same set of genes from a reference cell.

(Figure: Family of algorithms for Top-Down Induction of Decision Trees)

DNA microarray data classification is done in such a way that the involved parties can jointly compute the gain value of each normal attribute without revealing their own private information to each other, while the database is vertically partitioned over two or more parties. Microarrays have opened the possibility of creating data sets of molecular information to represent many systems of biological or clinical interest.
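The three ID3 steps listed at the start of this section can be sketched as a short recursive Python function. This is a minimal, centralized sketch under stated assumptions: it uses entropy-based gain, represents the tree as nested dictionaries, and runs on hypothetical discretized gene attributes (`gene_a`, `gene_b`) invented for illustration. It does not implement any of the privacy-preserving protocol.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:          # all elements belong to the same class
        return labels[0]
    if not attributes:                 # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select the attribute with the most information gain.
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Step 2: create a subset for each value of the chosen attribute.
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        # Step 3: repeat steps 1-3 on each subset.
        branches[value] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx], remaining)
    return {best: branches}


def classify(tree, row):
    """Follow the path from the root to a leaf for one record."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[row[attr]]
    return tree


# Hypothetical discretized expression levels for two genes.
rows = [{"gene_a": "high", "gene_b": "low"}, {"gene_a": "high", "gene_b": "high"},
        {"gene_a": "low", "gene_b": "low"}, {"gene_a": "low", "gene_b": "high"}]
labels = ["tumor", "tumor", "normal", "normal"]
tree = id3(rows, labels, ["gene_a", "gene_b"])
print(tree)
```

In the vertically partitioned setting the paper targets, the `gain` call is the part the parties would have to compute jointly; everything else in the recursion only needs the resulting gain values.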
Gene expression profiles can be used as inputs to large-scale data analysis, for example to serve as fingerprints for building more accurate molecular classification, to discover hidden taxonomies, or to increase our understanding of normal and disease states. The main types of data analysis needed for biomedical applications include:

Gene selection: in data mining terms this is a process of attribute selection, which finds the genes most strongly related to a particular class.

Classification: classifying diseases or predicting outcomes based on gene expression patterns, and perhaps even identifying the best treatment for a given genetic signature. Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known. The classifier analyzes the training data set and constructs a model based on the class label. It is a kind of supervised learning because the class field is known. Real-life examples of classification include: the diagnosis of a medical condition from symptoms, in which the classes could be either the various disease states or the possible therapies; determining the game-theoretic value of a chess position, with the classes won for white, lost for white, and drawn; and deciding from atmospheric observations whether a severe thunderstorm is unlikely, possible or probable.

Clustering: finding new biological classes or refining existing ones.

Gene selection is also used on DNA microarray data. Because a microarray dataset has many more features than records, common statistical and machine learning procedures such as classification can lead to false discoveries due to random chance. Reference [2] highlights common errors in identifying informative features and developing accurate classifiers, and shows the correct approach.
The author of [3] presents a review of the methods available for microarray classification, which cover the full spectrum of microarray data analysis, including data preprocessing, experimental design, quality control, gene selection and differential expression analysis, classification, and clustering. One would expect that different datasets representing the same biological system will display some amount of invariant biological characteristics, independent of the idiosyncrasies or details of the sample sources, the preparation procedures and the technological platforms used to obtain the data. These invariant biological characteristics, when properly captured and exposed, can provide the basis for building more robust, general and accurate classification models. Classification based on impact factors (IFs) addresses this problem. The IFs provide a way to measure the variations between individual classes in training and test samples, and can be integrated into standard classifiers such as Weighted Voting or k-NN, resulting in a significant improvement in the accuracy of classifying heterogeneous samples.

2. RELATED WORK
In data mining, knowledge is extracted through techniques such as classification, clustering and association. Early work in the field of privacy-preserving data mining proposed a solution to the privacy

preserving classification problem using the oblivious transfer protocol, a powerful tool developed in secure multi-party computation studies [4]. The solution, however, deals only with horizontally partitioned data and targets only the ID3 algorithm (because it only emulates the computation of the ID3 algorithm). Another approach to the privacy-preserving classification problem was proposed and studied in [4, 6]. In this approach, each individual data item is perturbed and the distribution of all the data is reconstructed at an aggregate level. The technique works for data mining algorithms that use probability distributions rather than individual records. An example of a classification algorithm that uses such aggregate information is also discussed in [7].

Research has also considered preserving privacy in other types of data mining. For instance, a solution to the privacy-preserving distributed association mining problem is discussed in [6].

Secure Multi-party Computation. The problem we are studying is actually a special case of a more general problem, the Secure Multi-party Computation (SMC) problem. Briefly, an SMC problem deals with computing any function on any input, in a distributed network where each participant holds one of the inputs, while ensuring that no more information is revealed to a participant in the computation than can be inferred from that participant's input and output [8]. The SMC literature is extensive, having been introduced by [7] and expanded in [6, 9]. It has been proved that for any function there is a secure multi-party computation solution [4]. The approach used is as follows: the function F to be computed is first represented as a combinatorial circuit, and then the parties run a short protocol for every gate in the circuit. Every participant gets corresponding shares of the input wires and the output wires for every gate. This approach, though appealing in its generality and simplicity, means that the size of the protocol depends on the size of the circuit, which depends on the size of the input. This is highly inefficient for large inputs, as in data mining [8]. It is well accepted that for special cases of computation, special solutions should be developed for efficiency reasons. Existing work considers either horizontal or vertical partitioning; we propose to consider ID3 classification over vertically partitioned DNA microarray data while also preserving privacy.

3. HORIZONTAL AND VERTICAL PARTITIONING
In horizontal partitioning (a.k.a. homogeneous distribution), different sites collect the same set of information, but about different entities. An example would be grocery shopping data collected by different supermarkets (also known as market-basket data in the data mining literature) [11]. Figure 3.1 illustrates horizontal partitioning and shows the credit card databases of two different (local) credit unions. Taken together, one may find that fraudulent customers often have similar transaction histories, etc. Horizontally partitioned data is data which is homogeneously distributed, meaning that all data tuples range over the same item or feature set. Essentially this boils down to different data sites collecting the same kind of information about different individuals. For horizontally partitioned data, the database scheme looks like Figure 3.1.

Figure 3.1

Example: Consider a supermarket chain which gathers information on the buying behavior of its customers. Typically, such a company has different branches, implying that the data is horizontally distributed. Horizontal partitioning involves putting different rows into different tables. Perhaps customers with ZIP codes below some threshold are stored in Customers-East, while customers with ZIP codes at or above it are stored in Customers-West. The two partition tables are then Customers-East and Customers-West, while a view with a union might be created over both of them to provide a complete view of all customers.

In this paper we consider heterogeneously distributed data, also known as vertically partitioned data. In a database system, a database can be partitioned in different ways: horizontal partitioning, vertical partitioning, and grid partitioning, which combines horizontal and vertical partitioning. For vertically partitioned data, the database scheme looks like Figure 3.2.
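The difference between the two partitioning schemes can be illustrated with a small Python sketch. The customer records, the column names and the ZIP threshold of 50000 are all hypothetical values chosen for illustration; the source does not give them. The sketch only shows how the data is laid out across sites, not how it is mined privately.

```python
# Hypothetical customer table; every value here is invented for illustration.
records = [
    {"id": 1, "zip": 10001, "balance": 250, "purchases": 12},
    {"id": 2, "zip": 94105, "balance": 900, "purchases": 3},
    {"id": 3, "zip": 30301, "balance": 40, "purchases": 27},
    {"id": 4, "zip": 60601, "balance": 610, "purchases": 8},
]


def horizontal_partition(rows, predicate):
    """Split by rows: both parts keep every column but hold different entities."""
    matching = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return matching, rest


def vertical_partition(rows, key, columns_a):
    """Split by columns: both sites keep the key but hold different features."""
    site_a = [{k: r[k] for k in r if k == key or k in columns_a} for r in rows]
    site_b = [{k: r[k] for k in r if k == key or k not in columns_a} for r in rows]
    return site_a, site_b


def union_view(part1, part2, key):
    """Restore a horizontally partitioned table by taking the union of rows."""
    return sorted(part1 + part2, key=lambda r: r[key])


def join_view(site_a, site_b, key):
    """Restore a vertically partitioned table by joining on the shared key."""
    index_b = {r[key]: r for r in site_b}
    return [{**r, **index_b[r[key]]} for r in site_a]


# Horizontal: Customers-East / Customers-West split on an assumed ZIP threshold.
east, west = horizontal_partition(records, lambda r: r["zip"] < 50000)
# Vertical: one site holds balances, the other holds the remaining columns.
bank, card = vertical_partition(records, "id", {"balance"})

print(union_view(east, west, "id") == records)
print(join_view(bank, card, "id") == records)
```

In the privacy-preserving setting, the join in `join_view` is exactly what the parties want to avoid performing in the clear, which is why the gain values have to be computed jointly instead.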

Figure 3.2

Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns. Database normalization also involves splitting columns across tables, but vertical partitioning goes beyond that and partitions columns even when they are already normalized. Different physical storage might be used to realize vertical partitioning as well; storing infrequently used or very wide columns on a different device, for example, is a method of vertical partitioning. Done explicitly or implicitly, this type of partitioning is called "row splitting" (the row is split by its columns). A common form of vertical partitioning is to split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static data. Creating a view across the two newly created tables restores the original table with a performance penalty; however, performance will increase when accessing the static data, e.g. for statistical analysis.

Vertically distributed data is data which is heterogeneously distributed. Basically this means that data is collected by different sites or parties on the same individuals but with differing item or feature sets. Consider, for instance, financial institutions such as banks and credit card companies: both collect data on customers who have a credit card, but with differing item sets. Vertical partitioning is also known as heterogeneous distribution of data, which implies that although different sites gather information about the same set of entities, they collect different feature sets.

4. IMPLEMENTATION
To check the performance of the proposed algorithm, four different datasets are used to see how much communication overhead is caused by the proposed algorithm and by the algorithms of [1, 2].

5. EXPERIMENTAL SETUP
For testing the proposed algorithm, four different datasets were used; the DNA dataset was taken from the UCI Machine Learning Repository [11]. The DNA dataset consists of 150 entities, 3 classes and 4 attributes for each entity. The results of the experiment are compared in Table 5.1, and the graph in Figure 5.1 shows the comparison results for the DNA dataset.

Table 5.1. Columns: No. of DNA pairs | Horizontal | Vertical | Proposed heterogeneous-partition-based method

Figure 5.1

6. CONCLUSION
Microarrays are a revolutionary new technology with great potential to provide accurate medical diagnostics, to help find the right treatment and cure for many diseases, and to provide a detailed genome-wide molecular portrait of cellular states. By considering the vertical partitioning of the data, a good decision tree can be created using the ID3 classification algorithm, so that accurate medical decisions and diagnostics can be made and better cures for diseases provided by creating a decision tree on the basis of the gene data. Finding new insights into the molecular basis of biological processes and searching for new drugs and treatments is a problem of high complexity, to which the techniques of molecular biology have been applied for many decades. The process is analogous to a large search for a few molecular entities, connections or relationships in a large sea of possibilities. We hope that this special issue on Microarray Data Mining will make more researchers interested in the field and its challenges, and will be a contribution towards realizing the potential of microarrays for biology and medicine.

REFERENCES
[1] M. C. Doganay, T. B. Pedersen, Y. Saygin, E. Savas and A. Levi, "Distributed privacy preserving k-means clustering with additive secret sharing," Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS '08, Mar.
[2] Jaideep Vaidya and Chris Clifton, "Privacy-preserving k-means clustering over vertically partitioned data," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA, '03, Dec.
[3] R. Agrawal and R. Srikant, "Privacy-preserving data mining," Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, USA, Mar.
[4] Margaret H. Dunham, Data Mining - Introductory and Advanced Concepts, Pearson Education.
[5] H. Kargupta, S. Datta, Q. Wang and K. Sivakumar, "Random-data perturbation techniques and privacy-preserving data mining," IEEE Conference on Knowledge and Information System on Data Mining, London, Sep.
[6] S. V. Kaya, T. B. Pedersen, E. Savas and Y. Saygin, "Efficient privacy-preserving distributed clustering based on secret sharing," in PAKDD 2007 International Workshops: Emerging Technologies in Knowledge Discovery and Data Mining, Springer, Mar.
[7] Random-permutation: /Random Permutation.
[8] Pascal Paillier, "Public key cryptosystem based on composite degree residuosity class," Advances in Cryptology, EUROCRYPT '99, International Conference on the Theory and Application of Cryptographic Techniques, May.
[9] Jaideep Vaidya and Chris Clifton, "Privacy-preserving association rules in vertically partitioned data," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Canada, '02, July.
[10] Secure-multiparty-computation: multiparty computation.
[11] C. J. Merz and P. M. Murphy, "UCI Repository of Machine Learning Databases," Available mlearn/.
