Weighted-Object Ensemble Clustering
2013 IEEE 13th International Conference on Data Mining

Weighted-Object Ensemble Clustering

Yazhou Ren, School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
Guoji Zhang, School of Sciences, South China University of Technology, Guangzhou 510640, China
Carlotta Domeniconi, Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
Guoxian Yu, College of Computer and Information Science, Southwest University, Chongqing 400715, China

Abstract—Ensemble clustering, also known as consensus clustering, aims to generate a stable and robust clustering through the consolidation of multiple base clusterings. In recent years many ensemble clustering methods have been proposed, most of which treat each clustering and each object as equally important. Some approaches make use of weights associated with clusters, or with clusterings, when assembling the different base clusterings. Boosting algorithms developed for classification have also led to the idea of considering weighted objects during the clustering process. However, not much effort has been put towards incorporating weighted objects into the consensus process. To fill this gap, in this paper we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated to objects. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-art consensus clustering methods and is robust to parameter settings.

Keywords—Ensemble clustering, consensus clustering, graph partition, weighted objects.

I.
INTRODUCTION

Clustering is a key step in many data mining tasks. It is well known that off-the-shelf clustering methods may discover different patterns in a given set of data, because each algorithm has its own bias due to the optimization of different criteria; furthermore, there is no ground truth against which to validate the result. Recently, clustering ensembles have emerged as a technique for overcoming these problems [1], [2]. A clustering ensemble technique is characterized by two components: the mechanism used to generate diverse partitions, and the consensus function used to combine the input partitions into a final clustering. A clustering ensemble consists of different clusterings, obtained from multiple applications of a single algorithm with different initializations, or on various bootstrap samples of the available data, or from the application of different algorithms to the same data set. Clustering ensembles offer a solution to challenges inherent to clustering, arising from its ill-posed nature: they can provide more robust and stable solutions by exploiting the consensus across multiple clustering results, while averaging out spurious structures that arise due to the various biases to which each participating algorithm is tuned, or to the variance induced by different data samples. Researchers have investigated how to combine clustering ensembles with subspace clusterings, in an effort to address both the ill-posed nature of clustering and the curse of dimensionality in high-dimensional spaces [3], [4]. A subspace clustering is a collection of weighted clusters, where each cluster has a weight vector representing the relevance of the features for that cluster. In a subspace clustering ensemble, the consensus function makes use of both the clusters and the weight vectors provided by the base subspace clusterings.
Work has also been done to evaluate the relevance of each base clustering (and assign weights accordingly), in an effort to improve the final consensus clustering [5]. To the best of our knowledge, however, no previous work has investigated how to use weights associated to objects within the clustering ensemble framework. Several approaches have studied the benefits of weighting the objects in iterative clustering methods combined with boosting techniques [6], [7], [8]. Empirical observations suggest that large weights should be assigned to the objects that are hard to cluster. As a result, centroid-based clustering methods move the centers towards the regions containing the objects whose cluster membership is hard to determine. In boosting, the weights bias the data distribution, thus making the region around the difficult points denser. With this interpretation of the weights, shifting the centers of the clusters corresponds to moving them towards the modes of the distribution modified by the weights. Nock and Nielsen [9] formulated clustering as a constrained minimization using a Bregman divergence, and recommended that the hard objects be given large weights. They analyzed the benefits of using boosting techniques in clustering and introduced several weighted versions of classical clustering algorithms. Inspired by this work, we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as
weights associated to objects. Different from boosting [10], the weights are not subject to iterative changes. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-art consensus clustering methods and is robust to parameter settings. The main contributions of this paper are as follows: i) to the best of our knowledge, this is the first time weights associated to objects are used within clustering ensembles; in WOEC, a one-shot method is proposed to assign weights to objects, and three weighted-object techniques are then presented to compute the consolidated clustering; ii) a Weighted-Object k-means algorithm (WOKmeans) is proposed, whose superiority to k-means demonstrates the effectiveness of the one-shot weight assignment; iii) extensive experiments are conducted to analyze the performance of WOEC against existing consensus clustering algorithms, on several data sets and using three popular clustering evaluation measures.

The rest of this paper is organized as follows. In Section II, we review related work on ensemble clustering. In Section III, we introduce the WOEC methodology. Section IV gives the experimental settings, and Section V analyzes the experimental results. Conclusions and future work are provided in Section VI.

II. RELATED WORK

Ensemble techniques were first developed for supervised settings. Empirical results have shown that they are capable of improving the generalization performance of the base classifiers [11]. This has inspired researchers to further investigate ensemble techniques for unsupervised problems, namely clustering.
Fred and Jain [12], [13] captured the information provided by the base clusterings in a co-association matrix, where each entry is the frequency with which two objects are clustered together. The co-association matrix was used as a similarity matrix to compute the final clustering. Strehl and Ghosh [2] proposed three graph-based methods, namely CSPA, HGPA, and MCLA. CSPA constructs a co-association matrix, which is again viewed as a pairwise similarity matrix. A graph is then constructed, where each object corresponds to a node, and each entry of the co-association matrix gives the weight of the edge between two objects. The METIS [14] package is used to partition the objects into k clusters. HGPA combines multiple clusterings by solving a hypergraph partitioning problem: in the hypergraph, each hyperedge represents a cluster, and it connects all the objects that belong to the corresponding cluster. The package HMETIS [15] is utilized to generate the final clustering. MCLA considers each cluster as a node in a meta-graph, and sets the pairwise similarity between two clusters to the ratio between the number of shared objects and the total number of objects in the two clusters. In the meta-graph, each cluster is represented by a hyperedge. MCLA groups and collapses related hyperedges, and it assigns an object to the collapsed hyperedge in which it participates most strongly. MCLA also uses METIS [14] to partition the meta-graph. Fern and Brodley [16] introduced HBGF to integrate base clusterings. HBGF constructs a bipartite graph that models both objects and clusters simultaneously as vertices. In the bipartite graph, an object is connected to several clusters, whereas there is no edge between pairs of objects or pairs of clusters. HBGF applies two graph partitioning algorithms, spectral graph partitioning [17], [18] and METIS, to obtain the final partitions.
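Several of the methods above start from the co-association matrix. As a concrete illustration (our own sketch, not code from any of the cited papers), the matrix can be built from base clusterings represented as integer label arrays:

```python
import numpy as np

def co_association(base_labelings):
    """Fraction of base clusterings that put each pair of objects
    in the same cluster (the CSPA-style co-association matrix)."""
    base_labelings = [np.asarray(l) for l in base_labelings]
    n = len(base_labelings[0])
    A = np.zeros((n, n))
    for labels in base_labelings:
        # labels[:, None] == labels[None, :] marks pairs sharing a cluster
        A += (labels[:, None] == labels[None, :]).astype(float)
    return A / len(base_labelings)

# two base clusterings of four objects: they agree on the pair (2, 3)
# and disagree on where object 1 belongs
A = co_association([[0, 0, 1, 1], [0, 1, 1, 1]])
```

The resulting matrix can be fed to any similarity-based partitioner; CSPA, for instance, runs METIS on the graph it induces.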
To handle missing values, Wang, Shan, and Banerjee proposed Bayesian cluster ensembles (BCE) [19], which treats all base clustering results as feature vectors and learns a Bayesian mixed-membership model from this feature representation. Domeniconi and Al-Razgan [3], [4] combined the clustering ensemble framework with subspace clustering. A subspace clustering is a collection of weighted clusters, where each cluster has a weight vector representing the relevance of the features for that cluster; the input to the consensus function is a collection of subspace clusterings. The subspace clustering method used is Locally Adaptive Clustering (LAC) [20], and the resulting consensus techniques are WSPA and WBPA. Li and Ding [21] specified different weights for different clusterings and proposed an approach called Weighted Consensus Clustering (WCC): the optimal weights are sought iteratively through optimization, and are then used to compute the consensus result within the framework of nonnegative matrix factorization (NMF) [5]. More recently, a method called ensemble clustering by matrix completion (ECMC) was proposed [22]. ECMC uses reliable pairs of objects to construct a partially observed co-association matrix, and exploits a matrix completion algorithm to fill in the missing entries of the co-association matrix; a pair of objects is reliable if the two objects are almost always, or almost never, clustered together. ECMC then applies spectral clustering to the completed matrix to obtain the final clustering. However, ECMC has two disadvantages: (i) the notion of reliability between objects is hard to define, and (ii) the matrix completion process may result in information loss. None of the aforementioned methods assigns weights to objects. In contrast, our proposed WOEC algorithms use the information provided by the base clusterings to define object weights that reflect how difficult it is to cluster each object. The weights are then embedded within the consensus function.
Specifically, we propose three different WOEC approaches: (i) Weighted-Object Meta Clustering (WOMC); (ii) Weighted-Object Similarity Partitioning (WOSPA); and (iii) Weighted-Object Hybrid Bipartite Graph Partitioning (WOHBGP).

III. WEIGHTED-OBJECT ENSEMBLE CLUSTERING

This section introduces the ensemble clustering problem and presents the details of our weighted-object ensemble clustering.

A. Ensemble Clustering Problem Formulation

An ensemble clustering approach consists of two steps: first, multiple base clusterings are generated; second, the base clusterings are consolidated into the final clustering result. We assume here that the base clusterings have already been generated, and we only discuss hard clustering; however, the proposed methods can be extended to soft clustering problems as well.
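Throughout, a hard base clustering can be stored as an integer label array over the n objects; the following minimal sketch fixes this representation (our choice of encoding, not one prescribed by the paper):

```python
import numpy as np

# A hard clustering of n objects assigns every object exactly one
# cluster id, so a length-n integer array encodes disjoint clusters
# whose union is the whole data set. An ensemble is a list of R such
# arrays; a consensus function maps that list to one final array.
def is_hard_clustering(labels, n):
    labels = np.asarray(labels)
    return labels.shape == (n,) and np.issubdtype(labels.dtype, np.integer)

ensemble = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
ok = all(is_hard_clustering(c, 4) for c in ensemble)
```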
Let X = {x_1, x_2, ..., x_n} denote the data set and X = [x_1^T; x_2^T; ...; x_n^T] ∈ R^{n×d} its matrix representation, where n is the number of objects and d is the dimensionality of each object x_i = (x_{i1}, x_{i2}, ..., x_{id})^T. Let us consider a set of R clustering solutions C = {C^1, C^2, ..., C^R}, where each clustering component C^r = {C_1^r, C_2^r, ..., C_{k_r}^r}, r = 1, 2, ..., R, partitions the data set X into k_r disjoint clusters, i.e., C_i^r ∩ C_j^r = ∅ (i ≠ j; i, j = 1, 2, ..., k_r) and ∪_{k=1}^{k_r} C_k^r = X. The ensemble clustering problem consists in defining a consensus function Γ that maps the set of base clusterings C into a single consolidated clustering C*.

B. One-shot Weight Assignment

Boosting [10], [11] is a supervised technique which seeks to create a strong learner from a set of weak learners. AdaBoost (short for Adaptive Boosting) [23] is the most commonly used boosting algorithm. It iteratively generates a distribution over the data, which is then used to train the next weak classifier. In each run, objects that are hard to classify gain weight, while easy-to-classify objects lose weight, so that the newly trained classifier focuses on the points with larger weights. At termination, AdaBoost combines the weak classifiers into a strong one. The effectiveness of boosting has been demonstrated both theoretically and experimentally. Boosting algorithms iteratively assign weights to objects using label information, which is typically unavailable in clustering applications. This makes it difficult to apply boosting techniques to clustering, and explains why little research has been performed on weighted-object ensemble clustering methods. This paper aims at reducing this gap. Our goal is to enable difficult-to-cluster points to play a bigger role when producing the final consolidated clustering result.
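The one-shot weighting scheme developed next (Eqs. (1)-(4)) can be sketched as follows; this is our own illustration of the scheme, with e standing for the smoothing term of Eq. (4):

```python
import numpy as np

def object_weights(base_labelings, e=0.01):
    """One-shot object weights from the ensemble's co-association
    matrix: w_i is large when the base clusterings disagree on how
    to cluster x_i (a sketch of Eqs. (1)-(4); e is the smoothing term)."""
    base_labelings = [np.asarray(l) for l in base_labelings]
    n = len(base_labelings[0])
    A = np.zeros((n, n))
    for labels in base_labelings:
        A += (labels[:, None] == labels[None, :]).astype(float)
    A /= len(base_labelings)             # Eq. (1): A_ij in [0, 1], A_ii = 1
    confusion = A * (1.0 - A)            # Eq. (2): peaks at A_ij = 0.5
    w = 4.0 / n * confusion.sum(axis=1)  # Eq. (3): w_i in [0, 1]
    return (w + e) / (1.0 + e)           # Eq. (4): smoothed, w_i in (0, 1]

# object 1 is the ambiguous one: the three base clusterings disagree on it
w = object_weights([[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]])
```

Here object 1, whose membership flips across the base clusterings, receives the largest weight, while the consistently grouped objects receive small ones.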
Following this objective, we propose a one-shot weight assignment to objects using the results of all base clusterings, and then embed the weights into the subsequent consensus clustering process. Similar to boosting, points that are hard to cluster receive larger weights, while easy-to-cluster points are given smaller weights. The difference is that boosting is an iterative process, while our weight assignment is performed in one shot. The details are given below.

Let A be the n × n co-association matrix built from the clustering solutions C:

A_ij = V_ij / R    (1)

where V_ij is the number of times objects x_i and x_j co-occur in the same cluster, and R = |C| is the ensemble size. Obviously, A_ij ∈ [0, 1]. We set A_ii = 1, i = 1, 2, ..., n. A_ij ≈ 1 means that x_i and x_j are often placed in the same cluster; A_ij ≈ 0 means that x_i and x_j are often placed in different clusters. In both cases, the base clusterings show a high level of agreement. When A_ij ≈ 0.5, roughly half of the clusterings group x_i and x_j together, and the other half place them in different clusters. This scenario is the most uncertain one, since the clustering components show no agreement on how to cluster points x_i and x_j. We can capture this trend by mapping the A_ij values through a quadratic function y(x) = x(1 − x), x ∈ [0, 1]. This function achieves its peak at x = 0.5, and its minima at x = 0 and x = 1, as illustrated in Fig. 1. (Other functions satisfying these properties can be used as well.) Hence, we can measure the level of uncertainty in clustering two points x_i and x_j as follows:

confusion(x_i, x_j) = A_ij (1 − A_ij)    (2)

Fig. 1. Quadratic function y(x) = x(1 − x).

The confusion index reaches its maximum of 0.25 when A_ij = 0.5, and its minimum of 0 when A_ij = 0 or A_ij = 1. We use this confusion measure to define the weight associated with each object as follows:

w_i = (4/n) Σ_{j=1}^{n} confusion(x_i, x_j)    (3)

The normalization factor 4/n guarantees that w_i ∈ [0, 1]. To avoid zero-valued weights, which can lead to instability, we add a smoothing term:

w_i ← (w_i + e) / (1 + e)    (4)

where e is a small positive number (e = 0.01 in our experiments). As a result, w_i ∈ (0, 1]. A large w_i value means that confusion(x_i, x_j) is large for many x_j, and hence measures how hard it is to cluster x_i. The assignment of large weights to points that are hard to cluster is consistent with the way boosting techniques operate [10], [11].

C. Algorithms

In this section we introduce three different Weighted-Object Ensemble Clustering (WOEC) algorithms. They all make use of the weights associated with objects (computed as explained above) to determine the consensus clustering. The overall process is shown in Fig. 2. As illustrated in the figure, the consensus functions of WOSPA and WOHBGP make use of the feature vectors, while WOMC does not.

1) Weighted-Object Meta Clustering Algorithm (WOMC): The first approach we introduce is a weighted version of the Meta Clustering Algorithm (MCLA) introduced in [2]. We call the resulting algorithm Weighted-Object Meta Clustering (WOMC). The key step of meta clustering algorithms is the clustering of clusters. When measuring the similarity between two clusters, MCLA treats all the points present in both clusters equally. In contrast, the proposed WOMC technique distinguishes the contributions of hard-to-cluster and easy-to-cluster points. Specifically, a point in the intersection of two clusters contributes to the clusters' similarity in a way that is proportional to its weight. The details of the approach follow. We make the components of each C^r in C explicit. This gives C = {C_1^1, ..., C_{k_1}^1, C_1^2, ..., C_{k_2}^2, ..., C_1^R, ..., C_{k_R}^R}, where each
Fig. 2. The process map of Weighted-Object Ensemble Clustering (WOEC).

component now corresponds to a cluster. The WOMC algorithm groups the clusters in C. To this end, it proceeds as follows. It constructs an undirected meta-graph G = (V, E) with |V| = n_c = Σ_{r=1}^{R} k_r vertices, each representing a cluster in C. The edge E_ij connecting vertices V_i and V_j is assigned the weight value S_ij defined as follows:

S_ij = (Σ_{x_k ∈ C_i ∩ C_j} w_k) / (Σ_{x_k ∈ C_i ∪ C_j} w_k)    (5)

Clearly, S_ij ∈ [0, 1]. If S_ij = 0, there is no edge between C_i and C_j in G; S_ij = 1 indicates that C_i and C_j contain the same points. WOMC partitions the meta-graph G into k (the number of clusters in the final clustering result) meta-clusters, each representing a group of clusters, by using the similarity measure S_ij and by applying the well-known graph partitioning algorithm METIS [14]. Similarly to MCLA, WOMC assigns each point to the meta-cluster in which it participates most strongly. Specifically, a data point may occur in different meta-clusters, and it is associated to a given meta-cluster according to the fraction of that meta-cluster's constituent clusters it belongs to. A point is eventually assigned to the meta-cluster with the highest such fraction for that point. Ties are broken randomly. There might be a situation in which no object is assigned to a meta-cluster; thus, the final combined clustering C* may contain fewer than k clusters.

The similarity measure defined in Eq. (5) computes the similarity of two clusters based not only on the fraction of shared points, but also on the values of the weights associated to the points. In contrast, the binary Jaccard measure |C_i ∩ C_j| / |C_i ∪ C_j| computes the similarity as the ratio between the size of the intersection and the size of the union of C_i and C_j. Suppose C_i and C_j have fixed points assigned to them; their Jaccard similarity is then also fixed. Let us see how changing the weights assigned to the points affects the value of S_ij. Consider D = (C_i ∪ C_j) \ (C_i ∩ C_j), i.e., the set of points not in the intersection. Let the weights assigned to the points in D be fixed. Then, increasing (decreasing) the weights assigned to the points in C_i ∩ C_j causes an increase (decrease) of S_ij. As such, the more hard-to-cluster points C_i and C_j share, the more similar they are. Let us now fix the weights assigned to the points in C_i ∩ C_j. Increasing (decreasing) the weights assigned to the points in D causes a decrease (increase) of S_ij. As a consequence, the more hard-to-cluster points C_i and C_j do not share, the smaller their similarity. Algorithm 1 describes the WOMC algorithm.

Algorithm 1 Weighted-Object Meta Clustering Algorithm (WOMC)
Input: Set of base clusterings C; graph partitioning algorithm METIS; number of clusters k.
Output: The final consensus clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the meta-graph G = (V, E), where the weights associated to the edges are computed using Eq. (5).
3: MetaClusters = METIS(G, k). // Partition the meta-graph into k meta-clusters.
4: C* = Assign(MetaClusters). // Assign each object to the meta-cluster in which it participates most strongly.
5: return C*.

2) Weighted-Object Similarity Partitioning Algorithm (WOSPA): Previous work on clustering [6], [7], [8] suggested assigning large weights to objects that are hard to cluster, and iteratively moving cluster centers towards areas that contain objects with large weights. Inspired by this work, we proceed similarly while leveraging the information provided by the ensemble to estimate the weights. The motivation stems from boosting. The difference is that clustering methods combined with a boosting technique assign weights to objects and move cluster centers iteratively, while we move the centers only once for each base clustering, by making use of the original data and the objects' weight information.
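The weighted cluster similarity of Eq. (5) can be sketched as follows (our code, with clusters as sets of object indices and w holding the one-shot weights). It also illustrates the property discussed above: two clusters sharing heavy (hard-to-cluster) points score higher than two sharing only light (easy) points, even at equal Jaccard similarity:

```python
import numpy as np

def weighted_cluster_similarity(Ci, Cj, w):
    """Eq. (5): weight mass of the intersection over weight mass of
    the union (a weighted Jaccard index between two clusters)."""
    union = Ci | Cj
    if not union:
        return 0.0
    inter = Ci & Cj
    return sum(w[k] for k in inter) / sum(w[k] for k in union)

w = np.array([0.9, 0.9, 0.1, 0.1])  # objects 0, 1 hard; objects 2, 3 easy

# both pairs below have plain Jaccard similarity 2/4 = 0.5 ...
S_hard = weighted_cluster_similarity({0, 1, 2}, {0, 1, 3}, w)  # share hard points
S_easy = weighted_cluster_similarity({0, 2, 3}, {1, 2, 3}, w)  # share easy points
```

... yet S_hard comes out far larger than S_easy, exactly the asymmetry WOMC relies on when clustering the clusters.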
Clustering with a boosting technique aims to generate a better single clustering result, while we seek a better consolidated consensus solution. To this end, for each cluster in the base clustering C^r (r = 1, 2, ..., R), we compute the corresponding weighted center as follows:

m_l^r = (Σ_{x_i ∈ C_l^r} w_i x_i) / (Σ_{x_i ∈ C_l^r} w_i)    (6)

where l = 1, ..., k_r. The basic idea behind this step is to move the center of each cluster towards the region containing the objects that are hard to cluster. We call this the shifting-center technique. The experiments presented in Section V demonstrate the rationale for, and the improvement achieved by, this transformation. The similarity between a point x_i and a weighted center m_l^r is calculated using the following exponential function:

d(x_i, m_l^r) = exp{−||x_i − m_l^r||² / t}    (7)

The probability of cluster C_l^r, given x_i, can then be defined as:

P(C_l^r | x_i) = d(x_i, m_l^r) / Σ_{l'=1}^{k_r} d(x_i, m_{l'}^r)    (8)

The smaller ||x_i − m_l^r||² is, the larger P(C_l^r | x_i) will be. We can now define the vector P_i^r of posterior probabilities associated with x_i:

P_i^r = (P(C_1^r | x_i), P(C_2^r | x_i), ..., P(C_{k_r}^r | x_i))    (9)

where Σ_{l=1}^{k_r} P(C_l^r | x_i) = 1. P_i^r provides a new representation of x_i in a space of relative coordinates with respect to the cluster centroids, where each dimension corresponds to one cluster. This new representation embeds information from both the original input data and the clustering ensemble. We use the cosine similarity to measure the similarity S_ij^r between x_i and x_j with respect to the base clustering C^r:

S_ij^r = P_i^r (P_j^r)^T / (||P_i^r|| ||P_j^r||)    (10)

where T denotes the transpose of a vector. There are R similarity matrices S^r (r = 1, ..., R), each corresponding to a base clustering. We combine these matrices into one final similarity matrix S:

S = (1/R) Σ_{r=1}^{R} S^r    (11)

S_ij represents the average similarity between x_i and x_j (through the vectors P_i and P_j), across the R contributing clusterings. Hence we can construct an undirected graph G = (V, E), where |V| = n and each vertex represents an object in X. The edge E_ij connecting vertices V_i and V_j is assigned the weight value S_ij. A graph partitioning algorithm (METIS in our experiments) can be applied to G to compute the partitioning of the n vertices that minimizes the edge weight-cut. This gives the consensus clustering we seek. The WOSPA algorithm is illustrated in Algorithm 2.

Algorithm 2 Weighted-Object Similarity Partitioning Algorithm (WOSPA)
Input: Data set X; set of base clusterings C; graph partitioning algorithm METIS; number of clusters k.
Output: The final consensus clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the corresponding graph G = (V, E), with weights computed through Eqs. (6)-(11).
3: C* = METIS(G, k). // Partition the graph into k parts, each representing a final cluster.
4: return C*.

3) Weighted-Object Hybrid Bipartite Graph Partitioning Algorithm (WOHBGP): WOSPA measures pairwise similarities which are solely instance-based, thus ignoring cluster similarities. We now introduce our third technique, WOHBGP, which attempts to capture both kinds of similarities. This approach reduces the clustering consensus problem to a bipartite graph partitioning problem, which partitions cluster vertices and instance vertices simultaneously; thus, it also accounts for similarities between clusters. Like WOSPA, WOHBGP makes use of the shifting-center technique and computes probabilities as in Eq. (8). WOHBGP constructs an undirected hybrid bipartite graph G = (V, E), where V contains n_c + n vertices (n_c = Σ_{r=1}^{R} k_r). The first n_c vertices represent all the clusters in C, and the last n vertices represent the objects. The edge E_ij connecting vertices V_i and V_j is assigned the weight value S_ij defined as follows:

S_ij = S_ji = 0 if vertices i and j are both clusters or both objects;
S_ij = S_ji = P(C_j | x_i) otherwise    (12)

where P(C_j | x_i) is the probability of cluster C_j given x_i, computed as in Eq. (8). The matrix S of elements S_ij can be written as

S = [ 0    B^T
      B    0  ]

where the n × n_c matrix B is defined as follows:

B = [ P_1^1  P_1^2  ...  P_1^R
      P_2^1  P_2^2  ...  P_2^R
      ...
      P_n^1  P_n^2  ...  P_n^R ]    (13)

where P_i^r is the row vector defined in Eq. (9). A graph partitioning algorithm (METIS in our experiments) is then used to partition this graph into k hybrid parts (each containing objects and clusters), so that the edge weight-cut is minimized. The partition of the objects provides the final clustering result. The algorithm is given in Algorithm 3.

Algorithm 3 Weighted-Object Hybrid Bipartite Graph Partitioning Algorithm (WOHBGP)
Input: Data set X; set of base clusterings C; graph partitioning algorithm METIS; number of clusters k.
Output: The final consensus clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the hybrid bipartite graph G = (V, E) as described in this subsection.
3: C_hybrid = METIS(G, k). // Partition the graph into k parts.
4: return the partition of the objects in C_hybrid as C*.

D. Weighted-Object k-means

To verify the effect of moving the center of each cluster according to Eq. (6), we introduce here a modified version of
Algorithm 4 Weighted-Object k-means (WOKmeans)
Input: Data set X; set of base clusterings C; number of clusters k.
Output: The final clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Randomly generate k weighted cluster centers c_1, ..., c_k, each corresponding to a cluster C_k = ∅ (k = 1, ..., k).
3: repeat
4:   // Assign each object to its closest weighted center:
     ∀i, C_k = C_k ∪ {x_i}, where k = arg min_k' ||x_i − c_k'||.
5:   // Generate the new weighted centers:
     c_k = (Σ_{x_i ∈ C_k} w_i x_i) / (Σ_{x_i ∈ C_k} w_i), k = 1, ..., k    (14)
6: until no weighted center changes, or the maximum number of iterations is reached
7: return C* = {C_1, C_2, ..., C_k}.

k-means that makes use of such weighted means. The resulting technique is called Weighted-Object k-means (WOKmeans), and it is illustrated in Algorithm 4. WOKmeans computes the weighted centers after assigning objects to clusters in each iteration, as shown in Eq. (14). The main idea is to iteratively move cluster centers towards regions with larger weights. Compared to the weighted version of k-means in [9], WOKmeans computes all object weights in advance, using the co-association matrix provided by the base clusterings, while [9] assigns weights to objects iteratively.

IV. EMPIRICAL EVALUATION

A. Data Sets

We conducted experiments on one simulated data set and eight UCI data sets to compare the performance of the WOEC methods against existing state-of-the-art techniques. A description of the data sets used is given in Table I. The simulated data is discussed in Section V-A. The Satimage data set originally contained 6435 objects; we randomly selected 4200 images, equally distributed among the six classes, for our experiments. The Spambase data set had 4601 objects; we randomly sampled 500 objects (250 for each class). The other six data sets were used in their original form. For each data set, each feature is preprocessed to have zero mean and unit variance.

TABLE I. DATA SETS USED IN THE EXPERIMENTS (#Objects, #Dimensions, and #Classes for: Simulated data, AustralianCredit, BalanceScale, BreastTissue, Glass, HayesRoth, Iris, Satimage, and Spambase)

B. Evaluation Measures

Various metrics exist to evaluate clustering results [24], [25]. Since the labels of the data are known, we use the Rand Index (RI) [26], the Adjusted Rand Index (ARI) [26], and the Normalized Mutual Information (NMI) [2] as validity indices. Let C* = {C_1*, C_2*, ..., C_k*} be the consensus clustering result (k is the number of clusters in the final consensus clustering), and let C̄ = {C̄_1, C̄_2, ..., C̄_k̄} be the ground-truth clustering (k̄ is the number of classes).

1) RI measures the similarity between C* and C̄ as follows:

RI(C*, C̄) = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)    (15)

where a is the number of pairs of points that are in the same cluster in both C* and C̄; b is the number of pairs of points that are in different clusters in both C* and C̄; c is the number of pairs of points that are in the same cluster in C* and in different clusters in C̄; and d is the number of pairs of points that are in different clusters in C* and in the same cluster in C̄. RI ranges from 0 to 1.

2) ARI defines the similarity between C* and C̄ as follows:

ARI(C*, C̄) = [Σ_{i,j} C(n_ij, 2) − S·S̄/C(n, 2)] / [½(S + S̄) − S·S̄/C(n, 2)]    (16)

where S = Σ_i C(n_i, 2), S̄ = Σ_j C(n̄_j, 2), and n_i and n̄_j denote the number of objects in clusters C_i* and C̄_j (i = 1, ..., k; j = 1, ..., k̄), respectively, while n_ij = |C_i* ∩ C̄_j|. The ARI values range from −1 to +1.

3) NMI [2] treats the clustering labels of C* and C̄ as two random variables. The mutual information of these two random variables is evaluated, and then normalized to the [0, 1] range.

A larger value of RI (ARI or NMI) means that C* and C̄ are more similar.

C. Base Clusterings

To generate diverse clusterings we used k-means, applying random sampling of the data and feature subset selection.
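This generation scheme can be sketched as follows; the tiny farthest-point-initialized Lloyd iteration below is our stand-in for any off-the-shelf k-means, and the 0.7 subsampling fraction is used for illustration:

```python
import numpy as np

def kmeans(X, k, iters=20, rng=None):
    """Tiny Lloyd-style k-means with farthest-point initialization;
    a stand-in for any off-the-shelf k-means implementation."""
    rng = np.random.default_rng(rng)
    centers = X[[rng.integers(len(X))]]              # first center: random point
    while len(centers) < k:                          # farthest-point init
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                         # nearest-center assignment
        centers = np.vstack([X[labels == j].mean(0) if (labels == j).any()
                             else centers[j] for j in range(k)])
    return labels, centers

def sampled_base_clustering(X, k, frac=0.7, rng=None):
    """One base clustering: k-means on a random `frac` object sample,
    then every object is labeled by its nearest resulting center."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), int(frac * len(X)), replace=False)
    _, centers = kmeans(X[idx], k, rng=int(rng.integers(1 << 30)))
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

# two well-separated Gaussian blobs: the recovered base clustering
# separates them regardless of which objects land in the subsample
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = sampled_base_clustering(X, k=2, rng=1)
```

The feature-subsampled half of the ensemble works the same way, except that columns rather than rows of X are sampled before clustering.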
Specifically, for the first half of the base clusterings, in each run we randomly selected 70% of the objects and performed k-means on this subset of the data. The remaining objects were assigned to the closest cluster, based on the Euclidean distances between each object and the cluster centers. For the second half of the base clusterings, in each run we selected 70% of the features and ran k-means in the corresponding subspace. When using k-means, we randomly generated the initial centers in each independent run.

V. RESULTS AND ANALYSIS

For a fair comparison, we used the METIS package [14] for all the ensemble clustering methods that need to partition a graph. The number of clusters k in the consensus clustering is set equal to the number of classes, for all data sets and all
TABLE II. RESULTS ON SIMULATED DATA (RI, ARI, and NMI for the min, max, and mean of the base clusterings, and for CSPA, MCLA, HBGF, BCE, WCC, ECMC, and the WOEC methods)

TABLE III. COMPARISON BETWEEN k-MEANS AND WOKMEANS ON SIMULATED DATA (mean ± standard deviation of RI, ARI, and NMI)

methods. In our experiments, the value of t (see Eq. (7)) is set, for each base clustering, to the average of the squared Euclidean distances between the data and the weighted means. All the experimental results are averages over 10 independent runs. We also performed a paired t-test to assess the statistical significance of the results; the significance level was always set to 0.05. We compare the proposed WOEC methods against eight state-of-the-art ensemble clustering algorithms, namely CSPA [2], MCLA [2], HBGF [16], BCE [19], WCC [21], ECMC [22], WSPA [3], [4], and WBPA [3], [4]. (HGPA [2] was not included in our experiments since it typically performs worse than CSPA and MCLA.)

A. Simulated Data

To illustrate the effectiveness of the three proposed WOEC algorithms and of WOKmeans, we first generated a simulated data set. It consists of d = 2 input features and 2 classes, each containing 150 points. Both classes are normally distributed according to multivariate Gaussian distributions, as shown in Fig. 3. The mean vector and the covariance matrix of class 1 are (5, 10) and [10 0; 0 1], respectively. For class 2, the mean vector is (20, 10) and the covariance matrix is again [10 0; 0 1]. We set the ensemble size R = 40. RI, ARI, and NMI are used to assess the performance of all algorithms. For each of the 40 base clusterings, the data sampling technique was used: we randomly selected 70% of the points and performed k-means on this subset of the data. The remaining points were assigned to the closest cluster, based on the Euclidean distance between the points and the cluster centers. Each feature was also standardized to zero mean and unit variance.
For each independent run, we record the minimum (min), maximum (max), and mean value (of RI, ARI, and NMI) over all the base clusterings. The corresponding averages over the independent runs are also reported. The results are shown in Table II; significantly better values are given in boldface. Two of the WOEC methods achieve the best performance for all the evaluation indices (RI, ARI, and NMI). This is because the hard-to-cluster points (i.e., the points near the cluster boundaries) make the largest contribution to the center-shifting procedure, which shifts the cluster centers towards the region containing hard-to-cluster points. As such, the influence of the pseudo-outliers⁴ is reduced, thereby leading to more robust ensemble clustering results. BCE performs the worst. WOMC achieves performance comparable to the other methods, but loses to the other two WOEC variants, showing that the technique of shifting centers works better on this data set.

³In real applications, classes may be multimodal, and thus k should be larger than the number of classes. In other cases, there may be fewer clusters than classes. These scenarios are not considered in this paper; we simply set k equal to the number of classes.

Fig. 3. The simulated data: class 1 (mean vector (5, 1)) and class 2 (mean vector (2, 1)), with the mean vectors and the pseudo-outliers marked.

To further demonstrate the effectiveness of shifting centers, we conducted experiments on the simulated data to compare the performance of k-means and WOKmeans. The experimental setup is the same as before. Table III shows the results. Clearly, WOKmeans achieves a considerable improvement over k-means, and its standard deviations are also smaller. k-means is negatively affected by the pseudo-outliers. Although WOKmeans improves upon k-means, its performance is considerably worse than that of the ensemble clustering methods, demonstrating the effectiveness of ensemble techniques. The points that receive the largest weights are highlighted with squares in Fig.
3, showing that between-cluster points tend to receive large weights.

⁴As shown in Fig. 3, the circled points on the left are far from the mean of class 1, and the density around them is much lower than around the other points of class 1. These points can bias the computation of the mean vector. They are generated from the same distribution as the other points of class 1, and should be grouped in the same cluster, but behave like outliers for the purpose of this discussion. For this reason we call them pseudo-outliers. Similarly, the circled points on the right are the pseudo-outliers of class 2.

From Tables II and III, we can see that a better RI value generally corresponds to better ARI and NMI values. Therefore, due to space limitations, we only report ARI values for the rest of the experiments.

B. Evaluation of the WOEC Techniques

We conducted experiments to evaluate the effectiveness of the proposed WOEC methods on eight UCI data sets. Base clusterings were generated as explained in Section IV-C. We experimented with three ensemble sizes R: 20, 40, and 60. The results in terms of ARI are shown in Table IV. Each reported ARI value is the average over independent runs. In each row, the values in boldface are the ARI values that
are statistically significantly better than the remaining ones. We also ran k-means for baseline comparison; its average ARI is given in Table IV in parentheses, following the name of each data set. For each independent run, we record the minimum, maximum, and mean ARI over all the base clusterings; the corresponding averages are also reported in the table. We can see that the mean ARI of the base clusterings is very close to the baseline for each data set.

TABLE IV. COMPARISON AGAINST EXISTING METHODS (ARI VALUES). For each data set and ensemble size R, the columns report the min, max, and mean ARI of the base clusterings, followed by CSPA, MCLA, HBGF, BCE, WCC, ECMC, and the three WOEC methods. The average k-means baseline ARI appears in parentheses after each data set name: AustralianCredit (.4189), BalanceScale (.1336), BreastTissue (.2952), Glass (.168), HayesRoth (.61), Iris (.5784), Satimage (.699), Spambase (.1412).

From Table IV we can see that at least one of the WOEC methods always achieves the best performance on each data set, except for R = 20 and R = 40 on BreastTissue. More specifically, WOMC gives the best performance on AustralianCredit, BalanceScale, and Satimage; the other two WOEC variants give the best performance on Glass and HayesRoth, and one of them significantly outperforms all other methods on Spambase. On the Iris data, the latter two achieve comparable performance and both outperform all other methods. As expected, the ensemble clustering methods achieve superior performance with respect to the baseline on most of the problems. In particular, on the Spambase data set the ensemble techniques provide a considerable improvement; the poor performance of the baseline in this case may be due to the large dimensionality of the data. There are exceptions to this general trend. For instance, most of the ensemble clustering methods are inferior to the baseline on BalanceScale, Glass, and HayesRoth. Nevertheless, at least one of the WOEC methods performs well on these data sets too, i.e., WOMC on BalanceScale and the other WOEC variants on Glass and HayesRoth.
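The ARI used throughout is Hubert and Arabie's adjusted Rand index [26]. As a minimal reference implementation (label encodings are arbitrary, so permuted labelings of the same partition score identically):

```python
from collections import Counter
from math import comb

def ari(labels_a, labels_b):
    # Adjusted Rand Index: pair-counting agreement between two partitions,
    # corrected for chance agreement (Hubert and Arabie, 1985)
    n = len(labels_a)
    pair_sum = lambda counts: sum(comb(c, 2) for c in counts)
    index = pair_sum(Counter(zip(labels_a, labels_b)).values())  # agreeing pairs
    sum_a = pair_sum(Counter(labels_a).values())
    sum_b = pair_sum(Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # expectation under random labeling
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

A relabeled copy of a partition scores exactly 1.0 (e.g. ari([0, 0, 1, 1], [1, 1, 0, 0]) returns 1.0), while unrelated partitions score near zero or below.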
An interesting result is that the performance of the two WOEC methods other than WOMC is close to the max index on HayesRoth and Spambase, respectively, and that WOMC achieves ARI values better than the max index on Satimage. As pointed out earlier, our WOMC technique is a variation of MCLA. The two methods have comparable performance on AustralianCredit, Glass, HayesRoth, and Spambase, while WOMC outperforms MCLA on the other four data sets. This shows the benefit of considering object weights when measuring the similarity of two meta clusters in a meta clustering algorithm.

Compared against the most recent ECMC technique, the WOEC methods achieve much better results in general. The main reason is that ECMC assigns a value of one to the entries of the co-association matrix that are larger than a threshold d1, and a value of zero to the entries that are smaller than a threshold d0. This leads to information loss, in particular the loss of information regarding the hard-to-cluster objects. In addition, the setting of thresholds like d0 and d1 is always problematic; in our experiments we set d0 = 0.2 and d1 = 0.8, as suggested in [22]. ECMC also needs to choose a suitable value for a parameter C that minimizes 1ᵀM1, where M is a pairwise similarity matrix (details are in [22]), which is time consuming. In contrast, WOMC does not need any parameter selection, while the other two WOEC methods are robust to the setting of their parameter t, as the analysis in Section V-E shows.

C. Comparison against WSPA and WBPA

To compare against WSPA and WBPA, we generated the base clusterings using the LAC algorithm, as described in [3], [4]. LAC assigns weights to features, locally at each cluster. Both WSPA and WBPA require such weights, while our WOEC techniques have wider applicability. The weight vectors computed by the LAC algorithm are discarded when LAC is combined with the WOEC approaches. Results are shown in Table V. As before, statistically significant results are in boldface.
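The information-loss argument against ECMC above hinges on the co-association matrix. A small sketch (our own illustration, with d0 and d1 set as in [22]) shows how thresholding discards exactly the graded evidence about hard-to-cluster objects:

```python
def co_association(base_clusterings):
    # M[i][j]: fraction of base clusterings that place objects i and j
    # in the same cluster (the evidence-accumulation matrix)
    n = len(base_clusterings[0])
    r = len(base_clusterings)
    return [[sum(c[i] == c[j] for c in base_clusterings) / r
             for j in range(n)] for i in range(n)]

def ecmc_style_threshold(m, d0=0.2, d1=0.8):
    # quantize: confident pairs become 0 or 1; mid-range entries, which
    # carry the information about hard-to-cluster objects, are dropped
    # (None marks them here purely for illustration)
    return [[1 if v >= d1 else 0 if v <= d0 else None for v in row]
            for row in m]

# three toy base clusterings of four objects
bases = [[0, 0, 1, 1],
         [0, 0, 0, 1],
         [0, 1, 1, 1]]
m = co_association(bases)    # m[0][1] == 2/3: an ambiguous pair
t = ecmc_style_threshold(m)  # t[0][1] is None: graded evidence discarded
```

WOEC instead keeps the full graded matrix and turns this ambiguity itself into object weights.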
We can see that the WOEC methods always achieve the best performance, while WSPA and WBPA are comparable to the best only on the BreastTissue and Glass data sets.

D. Evaluation of WOKmeans

In Section V-A we discussed the superiority of WOKmeans over k-means on the simulated data. To further investigate the behavior of WOKmeans, in this section we add comparisons on three data sets: BreastTissue, Glass, and Iris. The reported results are averages over independent runs. In each run, the maximum number of iterations of both methods is set to the same fixed value, and the number of clusters is set to the number of classes given by the ground-truth labels. The ensemble sizes are again 20, 40, and 60. Note that only WOKmeans is affected by the ensemble size. Table VI gives
Fig. 4. Sensitivity analysis of parameter t on (a) AustralianCredit, (b) HayesRoth, and (c) Spambase.

Fig. 5. Sensitivity analysis of ensemble size R on (a) AustralianCredit, (b) HayesRoth, and (c) Spambase.

TABLE V. COMPARISON AGAINST WSPA AND WBPA (ARI). For each data set (AustralianCredit, BalanceScale, BreastTissue, Glass, HayesRoth, Iris, Satimage, and Spambase) and each ensemble size R, the columns report WSPA, WBPA, and the three WOEC methods.

the results. We observe that WOKmeans significantly outperforms k-means on the three data sets in all cases. On the Iris data, WOKmeans not only achieves better performance, but it is also more stable, with much smaller standard deviations.

TABLE VI. EXPERIMENTAL RESULTS OF k-MEANS AND WOKMEANS (ARI). Mean ± standard deviation for BreastTissue, Glass, and Iris at each ensemble size R.

These results again empirically demonstrate the effect of Eq. (6) on the resulting clusters. As suggested by previous analyses [6], [7], [8], moving the centers towards the hard-to-cluster data can improve the clustering of those data.

E. Sensitivity Analysis of Parameter t

Here we test how sensitive the two WOEC methods that use the parameter t in Eq. (7) are to its value. We set the ensemble size R = 40 and vary t from 0.5 to 100, using the AustralianCredit, HayesRoth, and Spambase data sets. Fig. 4 shows the results. Both methods tend to be stable when t > 5 on HayesRoth and Spambase, while the performance is almost the same for all values of t on AustralianCredit. In fact, the probability of a cluster
C_k^r given an object x_i, computed as in Eq. (8), is essentially determined by the Euclidean distances between x_i and all the cluster centers in clustering C^r. Different values of t do not change the ordering of the probability values, which explains why both methods are robust to the value of t. Nevertheless, we should avoid setting t to values that are too small or too large: in these two cases, in Eq. (7), the exponential term involving d(x_i, m_k^r) approaches 0 and 1, respectively, for all k, and this affects the consensus clustering process. Therefore, the average of the squared Euclidean distances between the data and the weighted means is a reasonable default value of t for each base clustering.

F. Sensitivity Analysis of Ensemble Size R

In this section we investigate the sensitivity of the three WOEC methods to the ensemble size R. We vary R from 5 to 100, using the same three data sets: AustralianCredit, HayesRoth, and Spambase. Fig. 5 gives the results. In general, the performance of each WOEC method improves as the ensemble size increases. The WOEC methods tend to be stable when R is larger than 10, 60, and 30 on AustralianCredit, HayesRoth, and Spambase, respectively. On AustralianCredit, all three WOEC methods achieve good performance already at R = 5. The two WOEC methods other than WOMC also achieve stable performance on Spambase with a small ensemble size. Each method gives the highest ARI on a different data set.

VI. CONCLUSIONS AND FUTURE WORK

We have introduced an approach called Weighted-Object Ensemble Clustering (WOEC). Three different consensus techniques have been proposed to leverage weighted objects. The experimental results indicate that each of the three techniques leads in performance on different data sets, suggesting that the three methods are most effective under different conditions. Achieving a better understanding of such conditions is an interesting question we will investigate.
Like most existing ensemble methods, the WOEC approaches are quadratic in the number of points and linear in the dimensionality. Applying WOEC methods effectively and efficiently to large-scale, high-dimensional data is another challenging direction for future work.

VII. ACKNOWLEDGMENTS

This work is partially supported by grants from the Natural Science Foundation of China, the Fundamental Research Funds for the Central Universities of China (Project Nos. XDJK21B2, XDJK213C123), the Doctoral Fund of Southwest University (Nos. SWU1163 and SWU11334), and the China Scholarship Council (CSC).

REFERENCES

[1] J. Ghosh and A. Acharya, "Cluster ensembles," WIREs Data Mining and Knowledge Discovery, vol. 1, no. 4, 2011.
[2] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, pp. 583-617, 2002.
[3] M. Al-Razgan and C. Domeniconi, "Weighted clustering ensembles," in Proceedings of the 6th SIAM International Conference on Data Mining, 2006.
[4] C. Domeniconi and M. Al-Razgan, "Weighted cluster ensembles: Methods and analysis," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 4, pp. 17:1-40, 2009.
[5] T. Li, C. Ding, and M. I. Jordan, "Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization," in Proceedings of the 7th IEEE International Conference on Data Mining, 2007.
[6] G. Hamerly and C. Elkan, "Alternatives to the k-means algorithm that find better clusterings," in Proceedings of the 11th International Conference on Information and Knowledge Management, 2002.
[7] A. Topchy, B. Minaei-Bidgoli, A. K. Jain, and W. F. Punch, "Adaptive clustering ensembles," in Proceedings of the 17th International Conference on Pattern Recognition, 2004.
[8] B. Zhang, M. Hsu, and U. Dayal, "K-harmonic means: a spatial clustering algorithm with boosting," in Temporal, Spatial, and Spatio-Temporal Data Mining, 2001.
[9] R. Nock and F. Nielsen, "On weighting clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, 2006.
[10] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[11] Z. Zhou, Ensemble Methods: Foundations and Algorithms. Chapman & Hall, 2012.
[12] A. Fred and A. K. Jain, "Data clustering using evidence accumulation," in Proceedings of the 16th International Conference on Pattern Recognition, 2002.
[13] A. Fred and A. K. Jain, "Evidence accumulation clustering based on the k-means algorithm," in Proceedings of the Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition, 2002.
[14] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359-392, 1998.
[15] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Application in VLSI domain," in Proceedings of the Design Automation Conference, 1997.
[16] X. Z. Fern and C. E. Brodley, "Solving cluster ensemble problems by bipartite graph partitioning," in Proceedings of the 21st International Conference on Machine Learning, 2004.
[17] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems, 2001.
[18] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
[19] H. Wang, H. Shan, and A. Banerjee, "Bayesian cluster ensembles," in Proceedings of the 9th SIAM International Conference on Data Mining, 2009.
[20] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, "Subspace clustering of high dimensional data," in Proceedings of the SIAM International Conference on Data Mining, 2004.
[21] T. Li and C. Ding, "Weighted consensus clustering," in Proceedings of the 8th SIAM International Conference on Data Mining, 2008.
[22] J. Yi, T. Yang, R. Jin, A. K. Jain, and M. Mahdavi, "Robust ensemble clustering by matrix completion," in Proceedings of the 12th IEEE International Conference on Data Mining, 2012.
[23] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[24] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, "An extensive comparative study of cluster validity indices," Pattern Recognition, vol. 46, no. 1, pp. 243-256, 2013.
[25] C. Legány, S. Juhász, and A. Babos, "Cluster validity measurement techniques," in Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, 2006.
[26] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193-218, 1985.
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationInternational Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani
LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models
More informationUnsupervised Learning
Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,
More informationAn Improvement of Centroid-Based Classification Algorithm for Text Classification
An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationA Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2
A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation Kwanyong Lee 1 and Hyeyoung Park 2 1. Department of Computer Science, Korea National Open
More informationIdentifying Layout Classes for Mathematical Symbols Using Layout Context
Rochester Institute of Technology RIT Scholar Works Articles 2009 Identifying Layout Classes for Mathematical Symbols Using Layout Context Ling Ouyang Rochester Institute of Technology Richard Zanibbi
More informationClustering of Data with Mixed Attributes based on Unified Similarity Metric
Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1
More informationJournal of Statistical Software
JSS Journal of Statistical Software August 2010, Volume 36, Issue 9. http://www.jstatsoft.org/ LinkCluE: A MATLAB Package for Link-Based Cluster Ensembles Natthakan Iam-on Aberystwyth University Simon
More informationA Memetic Heuristic for the Co-clustering Problem
A Memetic Heuristic for the Co-clustering Problem Mohammad Khoshneshin 1, Mahtab Ghazizadeh 2, W. Nick Street 1, and Jeffrey W. Ohlmann 1 1 The University of Iowa, Iowa City IA 52242, USA {mohammad-khoshneshin,nick-street,jeffrey-ohlmann}@uiowa.edu
More informationAutomatic Group-Outlier Detection
Automatic Group-Outlier Detection Amine Chaibi and Mustapha Lebbah and Hanane Azzag LIPN-UMR 7030 Université Paris 13 - CNRS 99, av. J-B Clément - F-93430 Villetaneuse {firstname.secondname}@lipn.univ-paris13.fr
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationSupplementary Figure 1. Decoding results broken down for different ROIs
Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas
More informationExplore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan
Explore Co-clustering on Job Applications Qingyun Wan SUNet ID:qywan 1 Introduction In the job marketplace, the supply side represents the job postings posted by job posters and the demand side presents
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 11, November 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationImage Segmentation for Image Object Extraction
Image Segmentation for Image Object Extraction Rohit Kamble, Keshav Kaul # Computer Department, Vishwakarma Institute of Information Technology, Pune kamble.rohit@hotmail.com, kaul.keshav@gmail.com ABSTRACT
More informationLogical Templates for Feature Extraction in Fingerprint Images
Logical Templates for Feature Extraction in Fingerprint Images Bir Bhanu, Michael Boshra and Xuejun Tan Center for Research in Intelligent Systems University of Califomia, Riverside, CA 9252 1, USA Email:
More informationModel-based Clustering With Probabilistic Constraints
To appear in SIAM data mining Model-based Clustering With Probabilistic Constraints Martin H. C. Law Alexander Topchy Anil K. Jain Abstract The problem of clustering with constraints is receiving increasing
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationOn Error Correlation and Accuracy of Nearest Neighbor Ensemble Classifiers
On Error Correlation and Accuracy of Nearest Neighbor Ensemble Classifiers Carlotta Domeniconi and Bojun Yan Information and Software Engineering Department George Mason University carlotta@ise.gmu.edu
More informationAn Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification
An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification Flora Yu-Hui Yeh and Marcus Gallagher School of Information Technology and Electrical Engineering University
More informationLearning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li
Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationIncluding the Size of Regions in Image Segmentation by Region Based Graph
International Journal of Emerging Engineering Research and Technology Volume 3, Issue 4, April 2015, PP 81-85 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) Including the Size of Regions in Image Segmentation
More informationSeminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm
Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of
More informationFast Fuzzy Clustering of Infrared Images. 2. brfcm
Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.
More informationLecture 10: Semantic Segmentation and Clustering
Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305
More informationCost-sensitive Boosting for Concept Drift
Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems
More informationCHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM
96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays
More informationMotivation. Technical Background
Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering
More informationCOMP 465: Data Mining Still More on Clustering
3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following
More informationColour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation
ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology
More informationGraph-based High Level Motion Segmentation using Normalized Cuts
Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies
More informationAN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION
AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationSupervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More information