Distributed Data Mining for Pervasive and Privacy-Sensitive Applications


Hillol Kargupta
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County
1000 Hilltop Circle, Baltimore, MD 21250

Abstract

This paper considers the distributed data mining (DDM) problem where transmission or sharing of data is not desirable because of limited bandwidth or the privacy-sensitive nature of the distributed, possibly multi-party, data. It notes that most DDM algorithms for such applications produce an ensemble of models (e.g. clusters, classifiers, and associations) generated from the data observed at different sites, sometimes using different techniques. These ensembles are usually difficult to interpret and translate into useful knowledge. The paper argues that linear representations of models are very promising for performing meta-analysis of ensembles, which may be useful for addressing this problem. It particularly considers the Fourier representation of discrete structures like decision trees as an example. It also points out several possible applications of this technique, such as PCA-based visualization, aggregation, and construction of a redundancy-free ensemble of orthogonal decision trees.

1 Introduction

Data mining deals with the problem of extracting interesting associations, classifiers, clusters, and other patterns from data by paying careful attention to the available computing, storage, communication, and human resources. The emergence of network-based environments has introduced many data mining applications where the resources are distributed. The Internet, sensor networks, mobile applications, and widely prevalent networks of desktop computers are some examples of such environments. The field of Distributed Data Mining (DDM) [5, 22] deals with the problem of mining data using distributed resources.
When the data can be freely and efficiently transported from one node to another without significant overhead, DDM algorithms may offer better scalability and response time by (1) properly redistributing the data in different partitions, (2) distributing the computation, or (3) a combination of both. However, when the data sources are distributed and network bandwidth is restricted, DDM algorithms work by avoiding or minimizing communication of data. If the distributed data belong to different parties who do not want to share the raw data, or when the privacy and security of the data are of utmost importance, collecting data at a single site may not be possible. DDM techniques may offer a solution to both of these classes of problems.

This paper considers DDM applications where communication of a large amount of data in its original form is not desirable because of constraints like limited bandwidth and privacy of the data. There exists a large body of literature (for a review see [22]) on this class of DDM algorithms. Most of the techniques that handle this scenario deal with either the homogeneous case [4, 7, 2, 23, 28] (sites observing an identical set of attributes) or the heterogeneous case [5, 8, 5, 3] (sites observing different sets of attributes). Many of these DDM techniques [28, 3, 20] produce an ensemble of models generated from different data sites, possibly using different algorithms and systems. Combining these models in an appropriate manner to generate a global perspective of the data is a key problem in distributed data mining. Most ensembles work by combining only the outputs (e.g. the class label assigned by a classifier, or the cluster membership in a clustering problem) using various methods like the weighted average, bagging [2], order statistics-based techniques [3], voting [28], and mutual information maximization [30]. However, aggregating the outputs of the models does not solve the complete problem, particularly for a distributed data mining application.
We need to analyze and understand the characteristics of the aggregated models. This paper focuses on the growing problem of aggregating, understanding, and manipulating the ensemble of models often produced by pervasive and privacy preserving DDM applications. Section 2 discusses the role of DDM for pervasive and privacy preserving applications. Section 3 offers a framework for aggregating and manipulating an ensemble of decision trees using the Fourier basis. Section 4 describes a novel scheme for PCA-based visualization of an ensemble of decision trees. Section 5 presents a scalable PCA-based scheme for removing redundancy in a decision tree ensemble and constructing orthogonal decision trees. Finally, Section 6 concludes the paper.

2 Pervasive and Privacy Preserving Applications of DDM

This section considers a class of distributed data mining applications where downloading data to a single location for subsequent mining is difficult and undesirable. It particularly focuses on distributed data mining in (1) a mobile, pervasive environment and (2) a privacy-sensitive multi-party environment.

2.1 Distributed Data Mining in a Pervasive Environment

Analyzing and monitoring time-critical data streams generated from distributed sources in a pervasive environment is important for many domains like financial data monitoring, process control, regulation compliance, security, and defense applications. Truly pervasive applications usually involve a heterogeneous collection of computing devices connected by networks with various types of bandwidth constraints. Mobile devices like PDAs, cellphones, laptops, and wearable computers are usually connected over low-bandwidth wireless networks. The emergence of powerful mobile devices with reasonable computing and storage capacity is ushering in an era of advanced data and computation-intensive mobile applications. Monitoring and mining time-critical data in a ubiquitous fashion is one such possibility.

Sensor webs offer another class of data mining applications. They usually involve different types of sensors and data processing nodes connected over a wireless network. Central collection of data from every sensor node may create heavy traffic over the limited-bandwidth wireless channels, and this may also drain a lot of power from the devices (data transmission consumes considerable power).
A carefully designed distributed architecture for data mining in a sensor network is likely to reduce the communication load and also to drain battery power more evenly across the different nodes in the sensor network. Data mining in such pervasive environments calls for an approach that can extract patterns from distributed data without necessarily downloading a large volume of data on a regular basis. A recently developed experimental mobile data mining system, MobiMine [16], points out the need for a new generation of DDM algorithms that pay careful attention to the cost of transmitting data and models generated from data streams over low-bandwidth networks.

Figure 1. (Top) Main screen of the MobiMine. (Bottom) A MobiMine interface for visualizing decision tree ensembles.

MobiMine is a mobile application for monitoring, management, and mining of financial data streams from PDAs. Figure 1 shows some of the interfaces of the system. MobiMine deals with a stream of models, continuously generated from the financial data. A similar situation arises in many other DDM applications. For example, consider the distributed on-board vehicle data stream mining system that is currently being developed at the UMBC DIADIC laboratory. The system is designed to mine data from multiple vehicles in a fleet for online health monitoring. The vehicles are connected to a central control station through a wireless network. The vehicles are also equipped with on-board computer systems that process the continuous stream of locally collected data. The on-board module processes the data, sends generated models to the central control station, and reports unusual changes in the observed processes. Figure 2 shows some of the interfaces of the system. Data mining techniques for analyzing, aggregating, and visualizing models generated from the streams play a key role in such pervasive DDM applications.

Figure 2. Distributed vehicle data stream mining system: (Top) The main interface, (Bottom) The interface for monitoring the health of the vehicle. The interface shows the state space of the vehicle.

However, pervasive applications over limited-bandwidth networks are not the only applications of DDM technology. There are many other applications of DDM that deal with distributed environments where bandwidth is not the bottleneck. The following section considers one such class of applications.

2.2 Privacy-Sensitive Distributed Data Mining Applications

Preserving the privacy of data is important in many data mining applications. Privacy of the data can depend on many different aspects. In most applications, the privacy issue is related to an individual or a group of individuals sharing some common characteristics in a given context. Sometimes the patterns detected by a data mining system may be used in a counter-productive manner that violates the privacy of an individual or a group of individuals. Therefore, it is important to protect the privacy of the data and its context while mining. If the privacy is associated with the identity of an individual, then removing the identification information from the data may sometimes solve the problem. However, there exist many applications where such simple solutions do not work. The data set may still reveal certain information that violates the privacy of different entities associated with the data. Therefore, in a privacy-sensitive application it is important to create a shield between the data and the data mining program in order to deny direct access to the raw data. Many privacy-sensitive applications involve multiple data and computing nodes. In fact, even if we have a single source of data, by definition the separation between the data and the data mining program forces us to treat them as separate entities in a distributed environment.
Therefore, it is useful to consider the general problem of mining multiple data sets, located at different sites, that belong to different parties. We assume that the data sets are proprietary and privacy-sensitive. Therefore, exchanging raw data among different parties or sending data out of its owner's secured environment is not preferred. Many distributed data mining algorithms are appropriate for this class of applications since they try to minimize communication of raw data, and there exists a growing body of literature on this topic. Some DDM algorithms work without transferring any raw data from the sources, and some move part of the data if necessary. Some of them preserve privacy and some do not. For example, the meta-learning approach [28], the Fourier spectrum-based approach to combine decision trees [6, 20], and collective hierarchical clustering [10] can be used with minor modifications for privacy-preserving mining from distributed data.

There is also a body of work that approaches this problem from a cryptographic perspective. One way to hide the data is to distort it in some way while making sure that the data mining techniques can still find the type of patterns we are interested in. A value-distortion-based technique to protect data privacy is suggested in [1]. Value distortion is defined as $x_i + r$, where $x_i$ denotes the original value and $r$ is a random value drawn from some distribution. The key idea is that, using the distribution of $r$, the original distribution of $x$ can be approximated. Cryptographic tools are suggested in [32] in order to secure data transmission, along with communication between local sites as opposed to one centralized site. A privacy preserving technique to construct decision trees [24] is reported in [17]. The approach depends on a completely reliable intermediary party in order to regulate the privacy preservation. Kantarcioglu and Clifton [11] investigated association rule mining from homogeneous data using a commutative encryption tool.

We are currently building a DDM environment for privacy-sensitive applications at the UMBC DIADIC laboratory. It is currently equipped with different techniques to mine data without directly accessing it. Figure 3 shows one of the main interfaces of the system.

Figure 3. An interface for computing feature dependencies from privacy-sensitive data in a distributed environment. It shows the module for computing correlation without directly accessing the raw data.

In this paper we consider the DDM perspective of privacy-sensitive applications. Most DDM algorithms that do not share any data with other participating sites share locally generated models and combine them using different techniques. Therefore, just like the pervasive DDM applications, privacy-sensitive applications also require proper aggregation, transformation, and understanding of the ensemble of models collected from different sites. This paper considers an algebraic framework to do so. The rest of this paper presents a linear representation-based approach and identifies many different research directions.

3 Linear Representations for Aggregation, Manipulation, and Better Understanding of Ensemble of Models

This section considers ensembles of classifiers represented using discrete structures and proposes a framework to aggregate, understand, and manipulate them using formal algebraic operations. It particularly considers linear representations of decision trees (e.g., CART [3], ID3 [24], and C4.5 [25]) for demonstrating the possibility of going beyond the traditional ensembles that just combine the outputs of the models. This section considers the decision tree since it is a popular technique to learn classifiers from data and it is represented by a discrete structure. Learning decision trees from distributed and stream data often produces large ensembles [6, 20, 26, 29]. The rest of this paper considers the Fourier representation of decision trees, which allows efficient representation, aggregation, and manipulation of tree ensembles.

3.1 Decision Trees as Numeric Functions

A decision tree defined over a domain of categorical attributes can be treated as a numeric function. First note that a decision tree is a function that maps its domain members to a range of class labels. Sometimes it is a symbolic function where features take symbolic (non-numeric) values. However, a symbolic function can be easily converted to a numeric function by simply replacing the symbols with numeric values in a consistent manner. Once the tree is converted to a discrete numeric function, we can apply any appropriate analytical transformation. Fourier transformation is one such interesting possibility. The Fourier representation of a function is a linear combination of the Fourier basis functions. The weights, called Fourier coefficients, completely define the representation. Each coefficient is associated with a Fourier basis function that depends on a certain subset of the features defining the domain. This section reviews the Fourier representation of decision tree ensembles, introduced elsewhere [14, 16].

3.2 A Brief Review of the Fourier Basis in the Boolean Domain

Fourier bases are orthogonal functions that can be used to represent any discrete function. In other words, the Fourier basis is a functionally complete representation. Consider the set of all $l$-dimensional feature vectors where the $m$-th feature can take $\lambda_m$ different categorical values. The Fourier basis set that spans this space is comprised of $\prod_{m=1}^{l} \lambda_m$ basis functions. Each Fourier basis function is defined as

$$\psi_{\mathbf{j}}^{\lambda}(\mathbf{x}) = \frac{1}{\sqrt{\prod_{m=1}^{l} \lambda_m}} \prod_{m=1}^{l} \exp\left(\frac{2\pi i}{\lambda_m} x_m j_m\right),$$

where $\mathbf{j}$ and $\mathbf{x}$ are strings of length $l$; $x_m$ and $j_m$ are the $m$-th attribute-values in $\mathbf{x}$ and $\mathbf{j}$, respectively; $x_m, j_m \in \{0, 1, \ldots, \lambda_m - 1\}$; $i$ is the imaginary unit; and $\lambda$ represents the feature-cardinality vector $(\lambda_1, \ldots, \lambda_l)$. $\psi_{\mathbf{j}}^{\lambda}(\mathbf{x})$ is called the $\mathbf{j}$-th basis function. In the Boolean domain ($\lambda_m = 2$ for all $m$) it reduces to $(-1)^{\mathbf{x} \cdot \mathbf{j}} / \sqrt{2^l}$.

The vector $\mathbf{j}$ is called a partition, and the order of a partition $\mathbf{j}$ is the number of non-zero feature values it contains. A Fourier basis function depends on some $x_m$ only when the corresponding $j_m \neq 0$. If a partition $\mathbf{j}$ has exactly $\alpha$ non-zero values, then we say the partition is of order $\alpha$, since the corresponding Fourier basis function depends only on those $\alpha$ variables that take non-zero values in the partition $\mathbf{j}$.

A function $f$ that maps an $l$-dimensional discrete domain $\Omega$ to a real-valued range can be represented using the Fourier basis functions:

$$f(\mathbf{x}) = \sum_{\mathbf{j}} w_{\mathbf{j}} \, \overline{\psi}_{\mathbf{j}}^{\lambda}(\mathbf{x}),$$

where $w_{\mathbf{j}}$ is the Fourier Coefficient (FC) corresponding to the partition $\mathbf{j}$ and $\overline{\psi}_{\mathbf{j}}^{\lambda}(\mathbf{x})$ is the complex conjugate of $\psi_{\mathbf{j}}^{\lambda}(\mathbf{x})$; $w_{\mathbf{j}} = \sum_{\mathbf{x} \in \Omega} \psi_{\mathbf{j}}^{\lambda}(\mathbf{x}) f(\mathbf{x})$. The Fourier coefficient $w_{\mathbf{j}}$ can be viewed as the relative contribution of the partition $\mathbf{j}$ to the function value of $f(\mathbf{x})$. Therefore, the absolute value of $w_{\mathbf{j}}$ can be used as the significance of the corresponding partition $\mathbf{j}$. If the magnitude of some $w_{\mathbf{j}}$ is very small compared to other coefficients, we may consider the $\mathbf{j}$-th partition to be insignificant and neglect its contribution. The order of a Fourier coefficient is nothing but the order of the corresponding partition. We shall often use terms like high order or low order coefficients to refer to a set of Fourier coefficients whose orders are relatively large or small, respectively.

The energy of a spectrum is defined by the summation $\sum_{\mathbf{j}} |w_{\mathbf{j}}|^2$. Let us also define the inner product between two spectra $\mathbf{w}_{(1)}$ and $\mathbf{w}_{(2)}$, where $\mathbf{w}_{(i)} = [w_{1,(i)} \; w_{2,(i)} \; \cdots \; w_{|J|,(i)}]^T$ is the column matrix of all Fourier coefficients in an arbitrary but fixed order. Superscript $T$ denotes the transpose operation and $|J|$ denotes the total number of coefficients in the spectrum. The inner product is

$$\langle \mathbf{w}_{(1)}, \mathbf{w}_{(2)} \rangle = \sum_{\mathbf{j}} w_{\mathbf{j},(1)} \, \overline{w}_{\mathbf{j},(2)},$$

which reduces to $\sum_{\mathbf{j}} w_{\mathbf{j},(1)} w_{\mathbf{j},(2)}$ in the Boolean domain, where all coefficients are real. We will also use the definition of the inner product between a pair of real-valued functions defined over some domain $\Omega$. This is defined as

$$\langle f_{(1)}(\mathbf{x}), f_{(2)}(\mathbf{x}) \rangle = \sum_{\mathbf{x} \in \Omega} f_{(1)}(\mathbf{x}) f_{(2)}(\mathbf{x}).$$

The following section considers the Fourier spectrum of decision trees and discusses some of its useful properties.

3.3 Properties of Decision Trees in the Fourier Domain

This section considers the Fourier spectrum of decision trees with finite depths, bounded by some constant.
The underlying functions in such decision trees can be represented by a constant depth Boolean AND and OR circuit (or equivalently an $AC^0$ circuit). Linial et al. [18] noted that the Fourier spectrum of an $AC^0$ circuit has very interesting properties and proved the following lemma.

Lemma 1 (Linial, 1993) Let $M$ and $d$ be the size and depth of an $AC^0$ circuit. Then

$$\sum_{\{\mathbf{j} \,\mid\, o(\mathbf{j}) > t\}} w_{\mathbf{j}}^2 \le 2M \, 2^{-t^{1/d}/20},$$

where $o(\mathbf{j})$ denotes the order of the partition $\mathbf{j}$ and $t$ is a non-negative integer. Here the coefficients are normalized as in [18], so that the total energy of a Boolean function is at most 1.

The term on the left hand side of the inequality represents the energy of the spectrum captured by the coefficients with order greater than a given constant $t$. Lemma 1 essentially states the following property about decision trees: the energy captured by all high order Fourier coefficients is small. This is because the energy of the Fourier coefficients of higher order decays exponentially. This observation suggests that the spectrum of a Boolean decision tree (or equivalently a bounded depth function) can be approximated by computing only a small number of low order Fourier coefficients. So the Fourier basis offers an efficient numeric representation of a decision tree in the form of a linear function that can be easily stored and manipulated. The exponential decay property of the Fourier spectrum also holds for non-Boolean decision trees. The complete proof is available elsewhere [21].

There are two additional important characteristics of the Fourier spectrum of a decision tree:
1. The Fourier spectrum of a decision tree can be efficiently computed [16].
2. The Fourier spectrum can be directly used for constructing the tree [21].
In other words, we can go back and forth between the tree and its spectrum. This is philosophically similar to switching between the time and frequency domains in the traditional application of Fourier analysis for signal processing.

Fourier transformation of decision trees also preserves the inner product. The functional behavior of a decision tree is defined by the class labels it assigns.
Therefore, if $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|\Omega|}$ are the members of the domain $\Omega$, then the functional behavior of a decision tree $f$ can be captured by the vector $[f(\mathbf{x}_1) \; f(\mathbf{x}_2) \; \cdots \; f(\mathbf{x}_{|\Omega|})]^T$, where the superscript $T$ denotes the transpose operation. The following lemma proves that the inner product between two such vectors is identical to the inner product between their respective Fourier spectra.

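Figure 4. Visualization of an ensemble of decision trees using PCA. Each point represents a single decision tree. (Axes: 1st Principal Component, 2nd Principal Component.)

Lemma 2 Let $f_{(1)}(\mathbf{x}) = \sum_{\mathbf{j}} w_{\mathbf{j},(1)} \overline{\psi}_{\mathbf{j}}(\mathbf{x})$ and $f_{(2)}(\mathbf{x}) = \sum_{\mathbf{j}} w_{\mathbf{j},(2)} \overline{\psi}_{\mathbf{j}}(\mathbf{x})$ for all $\mathbf{x} \in \Omega$. Then $\langle f_{(1)}(\mathbf{x}), f_{(2)}(\mathbf{x}) \rangle = \langle \mathbf{w}_{(1)}, \mathbf{w}_{(2)} \rangle$, where $\langle \mathbf{w}_{(1)}, \mathbf{w}_{(2)} \rangle = \sum_{\mathbf{j}} w_{\mathbf{j},(1)} \overline{w}_{\mathbf{j},(2)}$.

Proof. Since $f_{(2)}$ is real-valued it equals its own complex conjugate, so

$$\langle f_{(1)}(\mathbf{x}), f_{(2)}(\mathbf{x}) \rangle = \sum_{\mathbf{x} \in \Omega} f_{(1)}(\mathbf{x}) f_{(2)}(\mathbf{x}) = \sum_{\mathbf{x} \in \Omega} \Big( \sum_{\mathbf{j}} w_{\mathbf{j},(1)} \overline{\psi}_{\mathbf{j}}(\mathbf{x}) \Big) \Big( \sum_{\mathbf{i}} \overline{w}_{\mathbf{i},(2)} \psi_{\mathbf{i}}(\mathbf{x}) \Big) = \sum_{\mathbf{j}} \sum_{\mathbf{i}} w_{\mathbf{j},(1)} \overline{w}_{\mathbf{i},(2)} \sum_{\mathbf{x} \in \Omega} \overline{\psi}_{\mathbf{j}}(\mathbf{x}) \psi_{\mathbf{i}}(\mathbf{x}) = \sum_{\mathbf{j}} w_{\mathbf{j},(1)} \overline{w}_{\mathbf{j},(2)} = \langle \mathbf{w}_{(1)}, \mathbf{w}_{(2)} \rangle.$$

The fourth step is true since the Fourier basis functions are orthonormal.

The Fourier spectrum of a decision tree offers a real-valued representation that allows a wide range of data analysis techniques to be used for analyzing and understanding an ensemble of decision trees. The rest of this paper considers several such possibilities.

4 Visualizing Ensemble of Decision Trees

Visual inspection of ensemble models for identifying their relative similarities and dissimilarities is one of the basic needs for better understanding of an ensemble. This section offers a technique for visualizing decision tree ensembles using Principal Component Analysis (PCA) [9]. Consider a set of decision trees $T_1, T_2, \ldots, T_k$; let $\mathbf{w}_{(1)}, \mathbf{w}_{(2)}, \ldots, \mathbf{w}_{(k)}$ be their respective spectra. In order to visualize the functional behavior of any given tree $T_i$ we need to somehow represent the vector $\mathbf{f}_{(i)}$ of its outputs over the whole domain. The inner product between $\mathbf{f}_{(i)}$ and $\mathbf{f}_{(j)}$ provides a measure of their similarity. Therefore, the inner product matrix can be useful for studying the ensemble. Unfortunately, for most real-life applications explicit operations using $\mathbf{f}_{(i)}$ are not practical, since the domain of $T_i$ is usually very large. However, Lemma 2 offers a practical way to solve this problem. Since the Fourier spectrum preserves the inner product, we can operate in the Fourier domain efficiently without explicitly dealing with the $\mathbf{f}_{(i)}$-s.

The inner product matrix is useful for measuring pairwise similarity between trees. However, sometimes we may need to represent the trees in a new embedding. PCA is a popular technique to construct a smaller dimensional representation of high dimensional data.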
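The inner-product matrix described above can be checked numerically on a small Boolean domain. The following is a minimal sketch, not the paper's implementation: the three toy trees and the helper name `orthonormal_walsh` are assumptions made for illustration, and the basis is the Boolean special case $(-1)^{\mathbf{x} \cdot \mathbf{j}} / \sqrt{2^l}$. It enumerates the whole domain only so that both sides of Lemma 2 can be compared; in practice only the spectra would be used.

```python
import itertools
import numpy as np

def orthonormal_walsh(n_features):
    """Return (points, psi) for the Boolean domain with n_features inputs.

    psi[x, j] = (-1)**(x . j) / sqrt(2**n_features), so the columns of psi
    are orthonormal Fourier basis functions.
    """
    points = np.array(list(itertools.product([0, 1], repeat=n_features)))
    psi = (-1.0) ** (points @ points.T) / np.sqrt(len(points))
    return points, psi

# Hypothetical toy decision trees over three Boolean features.
def tree_1(x):
    return 1.0 if x[0] == 1 else float(x[1])   # test x0, then x1

def tree_2(x):
    return float(x[1]) if x[0] == 1 else 0.0   # test x0, then x1

def tree_3(x):
    return 1.0 if x[2] == 1 else float(x[0])   # test x2, then x0

points, psi = orthonormal_walsh(3)
trees = [tree_1, tree_2, tree_3]

# Column i holds the outputs of tree i over the whole domain.
outputs = np.array([[tree(x) for tree in trees] for x in points])
# Column i holds the Fourier spectrum of tree i.
spectra = psi.T @ outputs

gram_from_outputs = outputs.T @ outputs   # pairwise <f_i, f_j>
gram_from_spectra = spectra.T @ spectra   # pairwise <w_i, w_j>
print(np.round(gram_from_spectra, 6))
```

The two Gram matrices agree up to floating-point error, which is what Lemma 2 states for an orthonormal basis.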
Although PCA may not be directly applicable to discrete structures like trees, the technique works fine with the representation of decision trees in the Fourier domain. Let $\Xi$ be the union of all partitions with non-zero coefficients from all the spectra under consideration, and let $|\Xi|$ be the cardinality of this set. Consider an arbitrary but fixed ordering of all the members of $\Xi$. Let us now define the matrix $S$ such that $S_{i,j} = w_{j,(i)}$, where $w_{j,(i)}$ denotes the Fourier coefficient corresponding to the $j$-th partition in the ordering from the spectrum of the tree $T_i$. After translating the column-means of the matrix $S$ to zero, we get the new matrix $\tilde{S}$. Therefore, $\tilde{S}_{i,j} = S_{i,j} - \mu_j$, where $\mu_j$ is the mean of the $j$-th column of the matrix $S$. This is a real-valued $k \times |\Xi|$ matrix. The covariance matrix of $\tilde{S}$ is therefore $\tilde{S}^T \tilde{S}$ (up to a constant factor). This is a symmetric matrix with an eigenvalue decomposition. A straightforward application of PCA on $\tilde{S}$ and its subsequent projection along the dominant eigenvectors can be used to create a compact, smaller dimensional representation of the trees. Figure 4 shows a two-dimensional representation of an ensemble of trees using the two most dominant eigenvectors.

Visualization of trees is not the only thing we can do using the Fourier representation of trees. It also offers us a way to create linear combinations of the trees and to develop the notion of redundancy-free orthogonal trees. The following section outlines these possibilities.
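The projection described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions: the five toy trees, the helper names `spectrum_matrix` and `pca_project`, and the use of a singular value decomposition to obtain the eigenvectors of the covariance matrix are choices made here, not details taken from the paper. The spectra are computed by brute-force enumeration of a four-feature Boolean domain, which is only feasible because the example is tiny.

```python
import itertools
import numpy as np

def spectrum_matrix(trees, n_features):
    """Row i is the Fourier spectrum of tree i over the Boolean domain."""
    points = np.array(list(itertools.product([0, 1], repeat=n_features)))
    psi = (-1.0) ** (points @ points.T) / np.sqrt(len(points))
    outputs = np.array([[float(tree(x)) for x in points] for tree in trees])
    return outputs @ psi  # psi is symmetric, so row i = spectrum of tree i

def pca_project(matrix, n_components=2):
    """Center the columns, then project rows onto the dominant eigenvectors."""
    centered = matrix - matrix.mean(axis=0)
    # The right singular vectors of the centered matrix are the eigenvectors
    # of its covariance matrix, ordered by decreasing eigenvalue.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T, singular_values

# Hypothetical ensemble of small trees over four Boolean features.
trees = [
    lambda x: x[0] | x[1],
    lambda x: x[0] & x[1],
    lambda x: x[0] | x[2],
    lambda x: x[1] & x[3],
    lambda x: x[0] ^ x[2],
]

spectra = spectrum_matrix(trees, n_features=4)
# Keep only partitions with a non-zero coefficient in at least one spectrum.
spectra = spectra[:, np.any(np.abs(spectra) > 1e-12, axis=0)]
coords, singular_values = pca_project(spectra)
for i, (pc1, pc2) in enumerate(coords, start=1):
    print(f"tree {i}: PC1={pc1:+.3f}  PC2={pc2:+.3f}")
```

Each row of `coords` is one tree placed in the plane of the two dominant principal components, which is the kind of picture Figure 4 describes.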

5 Removing Redundancies from Ensembles

Combining the outputs of base models is the central issue in ensemble learning. There exist several well-known techniques [2, 27, 33, 3] to do that. All of these techniques combine the outputs of the base classifiers in different ways. They do not structurally combine the classifiers themselves. The Fourier representation offers a unique way to do that. The Fourier spectrum of a linear combination of decision tree classifiers can be computed by first computing the Fourier spectrum of every tree and then aggregating them using the chosen scheme for constructing the ensemble. Let $f(\mathbf{x})$ be the underlying function representing the ensemble of $k$ different decision trees, where the output is a weighted linear combination of the outputs of the base classifiers. Then we can write

$$f(\mathbf{x}) = a_1 f_{(1)}(\mathbf{x}) + a_2 f_{(2)}(\mathbf{x}) + \cdots + a_k f_{(k)}(\mathbf{x}) = a_1 \sum_{\mathbf{j} \in \Xi_1} w_{\mathbf{j},(1)} \overline{\psi}_{\mathbf{j}}(\mathbf{x}) + \cdots + a_k \sum_{\mathbf{j} \in \Xi_k} w_{\mathbf{j},(k)} \overline{\psi}_{\mathbf{j}}(\mathbf{x}),$$

where $a_i$ is the weight of the $i$-th decision tree and $\Xi_i$ is the set of all partitions with non-zero Fourier coefficients in its spectrum. Therefore,

$$f(\mathbf{x}) = \sum_{\mathbf{j} \in \Xi} w_{\mathbf{j}} \overline{\psi}_{\mathbf{j}}(\mathbf{x}), \quad \text{where } \Xi = \bigcup_{i=1}^{k} \Xi_i \text{ and } w_{\mathbf{j}} = \sum_{i=1}^{k} a_i w_{\mathbf{j},(i)}.$$

Therefore, the Fourier spectrum of $f(\mathbf{x})$ (a linear ensemble classifier) is simply the weighted sum of the spectra of the member trees.

The base models of an ensemble often share redundancy resulting from similar observations noted at different sites. Continuous data stream environments may also introduce redundancy in the generated models because of the underlying periodicity in the data. Therefore, removing redundancy from the base models may be useful for creating the ensemble. The following part of this section explores a Fourier spectrum-based approach to do that.

Consider the matrix $D$ where $D_{i,j} = f_{(j)}(\mathbf{x}_i)$, the output of the tree $T_j$ for input $\mathbf{x}_i \in \Omega$. $D$ is an $|\Omega| \times k$ matrix, where $|\Omega|$ is the size of the input domain and $k$ is the total number of trees in the ensemble. An ensemble classifier that combines the outputs of the base classifiers can be viewed as a function defined over the set of all rows in $D$. If $D_{*,j}$ denotes the $j$-th column matrix of $D$, then the ensemble classifier can be viewed as a function of $D_{*,1}, D_{*,2}, \ldots, D_{*,k}$.
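The claim that the spectrum of a linear ensemble is the weighted sum of the member spectra follows from the linearity of the transform, and can be illustrated with a short sketch. The three toy trees and the weights below are hypothetical values chosen for the example, and the Boolean orthonormal basis is assumed.

```python
import itertools
import numpy as np

# All inputs of a three-feature Boolean domain and the orthonormal basis.
points = np.array(list(itertools.product([0, 1], repeat=3)))
psi = (-1.0) ** (points @ points.T) / np.sqrt(len(points))

def spectrum(outputs):
    """Fourier spectrum of a function given by its outputs over the domain."""
    return psi.T @ outputs

# Output vectors of three hypothetical trees.
f1 = np.array([float(x[0] | x[1]) for x in points])
f2 = np.array([float(x[1] & x[2]) for x in points])
f3 = np.array([float(x[0] ^ x[2]) for x in points])
members = [f1, f2, f3]

# Hypothetical ensemble weights a_1, a_2, a_3.
weights = np.array([0.5, 0.3, 0.2])

# Route 1: combine the outputs first, then transform.
ensemble_outputs = sum(a * f for a, f in zip(weights, members))
spectrum_of_ensemble = spectrum(ensemble_outputs)

# Route 2: transform each tree, then combine the spectra.
weighted_sum_of_spectra = sum(a * spectrum(f) for a, f in zip(weights, members))

# Going back from the combined spectrum recovers the ensemble outputs.
reconstructed = psi @ weighted_sum_of_spectra
print(np.round(weighted_sum_of_spectra, 4))
```

Both routes give the same coefficient vector, so an ensemble can be aggregated at the level of spectra without touching the input domain again.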
When the ensemble classifier is a linear combination of the outputs of the base classifiers we have $F = a_1 D_{*,1} + a_2 D_{*,2} + \cdots + a_k D_{*,k}$, where $F$ is the column matrix of the overall ensemble output. Since the base classifiers may have redundancy, we would like to construct a compact low-dimensional representation of the matrix $D$. However, explicit construction and manipulation of the matrix $D$ is difficult, since most practical applications deal with a very large domain. We can try to construct an approximation of $D$ using only the available training data. One such approximation of $D$ and its Principal Component Analysis-based projection is reported elsewhere [19]. That technique performs PCA of the matrix $D$, projects the data into the representation defined by the eigenvectors of the covariance matrix of $D$, and then performs linear regression for computing the coefficients $a_1, a_2, \ldots, a_k$. While the approach is interesting, it has serious problems. First of all, the construction of an approximation of $D$ even for the training data is computationally prohibitive for most large scale data mining applications. Moreover, this is only an approximation, since the matrix is computed over the observed data set rather than the entire domain. In the following we demonstrate a novel way to perform a PCA of the matrix $D$, defined over the entire domain. The approach uses the Fourier spectra of the trees and Lemma 2, and works without explicitly generating the matrix $D$.

The following analysis will assume that the columns of the matrix $D$ are mean-zero. This restriction can be easily removed with a simple extension of the analysis. Note that the covariance of the matrix $D$ is $D^T D$. Let us denote this covariance matrix by $C$. The $(i, j)$-th entry of the matrix is

$$C_{i,j} = \langle D_{*,i}, D_{*,j} \rangle = \langle f_{(i)}(\mathbf{x}), f_{(j)}(\mathbf{x}) \rangle = \sum_{\mathbf{p}} w_{\mathbf{p},(i)} w_{\mathbf{p},(j)} = \langle \mathbf{w}_{(i)}, \mathbf{w}_{(j)} \rangle. \quad (1)$$

The third step is true by Lemma 2. Now let us consider the matrix $W$ where $W_{i,j} = w_{i,(j)}$, i.e. the coefficient corresponding to the $i$-th member of the partition set $\Xi$ from the spectrum of the tree $T_j$. Equation (1) implies that the covariance matrices of $D$ and $W$ are identical.
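Equation (1) can be checked numerically on a toy ensemble. The sketch below is an illustration under assumptions, not the paper's procedure: the three trees are hypothetical, the domain is Boolean with four features, and the full output matrix is built only so that the two covariance matrices can be compared. The spectrum matrix keeps just the partitions that have a non-zero coefficient in at least one tree.

```python
import itertools
import numpy as np

n_features = 4
points = np.array(list(itertools.product([0, 1], repeat=n_features)))
psi = (-1.0) ** (points @ points.T) / np.sqrt(len(points))

# Hypothetical trees, each testing only two of the four features.
trees = [
    lambda x: x[0] | x[1],
    lambda x: x[1] & x[2],
    lambda x: x[0] & x[3],
]

# D: one row per domain member, one column per tree, columns shifted to mean zero.
D = np.array([[float(tree(x)) for tree in trees] for x in points])
D = D - D.mean(axis=0)

# W: one row per partition, one column per tree.
W_full = psi.T @ D
nonzero = np.any(np.abs(W_full) > 1e-12, axis=1)
W = W_full[nonzero]  # drop partitions whose coefficient is zero in every tree

cov_from_outputs = D.T @ D
cov_from_spectra = W.T @ W
print("rows of D:", D.shape[0], " rows of W:", W.shape[0])
```

Because shallow trees touch few features, `W` has fewer rows than `D` while yielding the same covariance matrix; centering the outputs only removes the coefficient of the all-zero partition.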
Note that $W$ is an $|\Xi| \times k$ dimensional matrix. For most practical applications $|\Xi| \ll |\Omega|$. Therefore, analyzing $W$ rather than $D$ using techniques like PCA is significantly easier. The following discourse outlines a PCA-based approach.

PCA of the matrix $W$ produces a set of eigenvectors, which in turn defines a set of principal components $Y_1, Y_2, \ldots, Y_k$. Let $v_{j,(q)}$ be the $j$-th component of the $q$-th eigenvector of the matrix $W^T W$. Then

$$Y_q = \sum_{j=1}^{k} v_{j,(q)} D_{*,j}, \quad \text{i.e.} \quad Y_q(\mathbf{x}) = \sum_{j=1}^{k} v_{j,(q)} f_{(j)}(\mathbf{x}) = \sum_{\mathbf{i} \in \Xi} \Big( \sum_{j=1}^{k} v_{j,(q)} w_{\mathbf{i},(j)} \Big) \overline{\psi}_{\mathbf{i}}(\mathbf{x}) = \sum_{\mathbf{i} \in \Xi} b_{\mathbf{i},(q)} \overline{\psi}_{\mathbf{i}}(\mathbf{x}),$$

where $b_{\mathbf{i},(q)} = \sum_{j=1}^{k} v_{j,(q)} w_{\mathbf{i},(j)}$. The eigenvalue decomposition constructs a new representation of the underlying domain in which the features corresponding to different column vectors $Y_q$ are uncorrelated, i.e., $\langle Y_q, Y_r \rangle = 0$ for $q \neq r$. Note that $\mathbf{b}_{(q)}$ is a linear combination of a set of Fourier spectra and therefore it is also a Fourier spectrum. Also note that the $Y_q$-s are orthogonal.

The above analysis offers a way to construct the Fourier spectra of a set of functions that are orthogonal to each other. We can construct decision trees from each of these spectra using the tree construction technique developed elsewhere, and these trees will be mutually orthogonal. These trees can be used to create a less redundant and more efficient ensemble of classifiers. The following section concludes this paper.

6 Conclusions

This paper considers one of the central research issues in the field of distributed data mining: understanding, aggregating, and manipulating models generated by different types of algorithms from different data sites. It argues that the traditional ensemble-based approaches, which combine only the outputs of the base models, do not serve the purpose very well as far as distributed data mining is concerned. We need techniques that allow advanced meta-level analysis of models, such as detecting the underlying redundancies, visualizing the evolution of the patterns, and detecting the stability of the ensemble. This paper proposed an approach based on linear representations of discrete structures. It particularly considered the Fourier representation of decision trees and showed that such representations can be very useful for visualizing, aggregating, and removing redundancies from an ensemble. Although the paper considers the Fourier representation, this is clearly not the only available linear representation.
Distributed data mining applications deal with many discrete structures, such as graphs and clusters, that can also benefit from appropriate linear decompositions. Eigenvectors and wavelets are other interesting choices for representing ensembles that need further investigation.

Acknowledgments

The author acknowledges support from the United States National Science Foundation CAREER award IIS, NASA (NRA) NAS2-3743, and TEDCO, Maryland Technology Development Center. The author would also like to thank Ligong Yang for producing Figure 4.

References

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, May. ACM Press.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[4] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes AAAI Work. Knowledge Discovery in Databases. AAAI, 1993.
[5] R. Chen, S. Krishnamoorthy, and H. Kargupta. Distributed web mining using Bayesian networks from multiple data streams. In IEEE International Conference on Data Mining, CA, USA, 2001.
[6] W. Fan, S. Stolfo, and J. Zhang. The application of AdaBoost for distributed, scalable and on-line learning. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, 1999.
[7] G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. In SIGKDD Explorations, volume 2 of 2. ACM.
[8] D. Hershberger and H. Kargupta. Distributed multivariate regression using wavelet-based collective data mining. Journal of Parallel and Distributed Computing, 61, 2001.
[9] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 1933.
[10] E. Johnson and H. Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In Lecture Notes in Computer Science, volume 1759. Springer-Verlag, 1999.
[11] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In SIGMOD Workshop on DMKD, Madison, WI, June.
[12] H. Kargupta, I. Hamzaoglu, and B. Stafford. Scalable, distributed data mining using an agent based architecture. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors, Proceedings of Knowledge Discovery And Data Mining, pages 211-214, Menlo Park, CA, 1997. AAAI Press.
[13] H. Kargupta, W. Huang, K. S., and E. Johnson. Distributed clustering using collective principal component analysis. Knowledge and Information Systems Journal Special Issue on Distributed and Parallel Knowledge Discovery, 3, 2001.
[14] H. Kargupta and B. Park. Mining time-critical data stream using the Fourier spectrum of decision trees. In Proceedings of the IEEE International Conference on Data Mining. IEEE Press, 2001.
[15] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective towards distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery, Eds: Kargupta, Hillol and Chan, Philip. AAAI/MIT Press.
[16] H. Kargupta, B. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar. MobiMine: Monitoring the stock market from a PDA. ACM SIGKDD Explorations, 3(2):37-46, January.
[17] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology CRYPTO 2000, pages 36-54, August.
[18] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM, 40, 1993.
[19] C. J. Merz and M. J. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36(1-2):9-32, 1999.
[20] B. Park, A. R., and H. Kargupta. A Fourier analysis-based approach to learn classifier from distributed heterogeneous data. In Proceedings of the First SIAM International Conference on Data Mining, Chicago, US, 2001.
[21] B. H. Park and H. Kargupta. Constructing simpler decision trees from the Fourier spectrum of ensemble models: Theoretical issues and application in mining data streams. In communication (shorter version published in SIGMOD DMKD 02 Workshop).
[22] B. H. Park and H. Kargupta. Distributed data mining: Algorithms, systems, and applications. In Data Mining Handbook. To be published.
[23] S. Parthasarathy and M. Ogihara. Clustering distributed homogeneous datasets. In PDKK.
[24] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
[25] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[26] J. R. Quinlan. Bagging, boosting and C4.5. In Proceedings of AAAI 96 National Conference on Artificial Intelligence, 1996.
[27] P. Smyth and D. Wolpert. Linearly combining density estimators via stacking. Machine Learning, 36(1-2):59-83, 1999.
[28] S. Stolfo et al. JAM: Java agents for meta-learning over distributed databases. In Proceedings Third International Conference on Knowledge Discovery and Data Mining, pages 74-81, Menlo Park, CA, 1997. AAAI Press.
[29] W. N. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2001.
[30] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining partitionings. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), July, Edmonton, Canada. AAAI.
[31] K. Tumer and J. Ghosh. Robust order statistics based ensemble for distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery, Eds: Kargupta, Hillol and Chan, Philip. MIT.
[32] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, CA, July.
[33] D. Wolpert. Stacked generalization. Neural Networks, 5:241-259.


Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set

A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set Renu Vashist School of Computer Science and Engineering Shri Mata Vaishno Devi University, Katra,

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Cluster Ensemble Algorithm using the Binary k-means and Spectral Clustering

Cluster Ensemble Algorithm using the Binary k-means and Spectral Clustering Journal of Computational Information Systems 10: 12 (2014) 5147 5154 Available at http://www.jofcis.com Cluster Ensemble Algorithm using the Binary k-means and Spectral Clustering Ye TIAN 1, Peng YANG

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Privacy Preserving Two-Layer Decision Tree Classifier for Multiparty Databases

Privacy Preserving Two-Layer Decision Tree Classifier for Multiparty Databases Privacy Preserving Two-Layer Decision Tree Classifier for Multiparty Databases Alka Gangrade T.I.T.-M.C.A. Technocrats Institute of Technology Bhopal, India alkagangrade@yahoo.co.in Ravindra Patel Dept.

More information

Service-Oriented Architecture for Privacy-Preserving Data Mashup

Service-Oriented Architecture for Privacy-Preserving Data Mashup Service-Oriented Architecture for Privacy-Preserving Data Mashup Thomas Trojer a Benjamin C. M. Fung b Patrick C. K. Hung c a Quality Engineering, Institute of Computer Science, University of Innsbruck,

More information

Some questions of consensus building using co-association

Some questions of consensus building using co-association Some questions of consensus building using co-association VITALIY TAYANOV Polish-Japanese High School of Computer Technics Aleja Legionow, 4190, Bytom POLAND vtayanov@yahoo.com Abstract: In this paper

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

DEPARTMENT OF COMPUTER SCIENCE

DEPARTMENT OF COMPUTER SCIENCE Department of Computer Science 1 DEPARTMENT OF COMPUTER SCIENCE Office in Computer Science Building, Room 279 (970) 491-5792 cs.colostate.edu (http://www.cs.colostate.edu) Professor L. Darrell Whitley,

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

MINING ASSOCIATION RULE FOR HORIZONTALLY PARTITIONED DATABASES USING CK SECURE SUM TECHNIQUE

MINING ASSOCIATION RULE FOR HORIZONTALLY PARTITIONED DATABASES USING CK SECURE SUM TECHNIQUE MINING ASSOCIATION RULE FOR HORIZONTALLY PARTITIONED DATABASES USING CK SECURE SUM TECHNIQUE Jayanti Danasana 1, Raghvendra Kumar 1 and Debadutta Dey 1 1 School of Computer Engineering, KIIT University,

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Cyber attack detection using decision tree approach

Cyber attack detection using decision tree approach Cyber attack detection using decision tree approach Amit Shinde Department of Industrial Engineering, Arizona State University,Tempe, AZ, USA {amit.shinde@asu.edu} In this information age, information

More information

Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers

Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers A. Srivastava E. Han V. Kumar V. Singh Information Technology Lab Dept. of Computer Science Information Technology Lab Hitachi

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information