Weighted-Object Ensemble Clustering

2013 IEEE 13th International Conference on Data Mining

Weighted-Object Ensemble Clustering

Yazhou Ren, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
Guoji Zhang, School of Sciences, South China University of Technology, Guangzhou, China
Carlotta Domeniconi, Department of Computer Science, George Mason University, Fairfax, VA, USA
Guoxian Yu, College of Computer and Information Science, Southwest University, Chongqing, China

Abstract—Ensemble clustering, also known as consensus clustering, aims to generate a stable and robust clustering through the consolidation of multiple base clusterings. In recent years many ensemble clustering methods have been proposed, most of which treat each clustering and each object as equally important. Some approaches make use of weights associated with clusters, or with clusterings, when assembling the different base clusterings. Boosting algorithms developed for classification have also led to the idea of considering weighted objects during the clustering process. However, not much effort has been put towards incorporating weighted objects into the consensus process. To fill this gap, in this paper we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated to objects. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-art consensus clustering methods and is robust to parameter settings.

Keywords—Ensemble clustering, consensus clustering, graph partitioning, weighted objects.

I. INTRODUCTION

Clustering is a key step for many tasks in data mining. It is well known that off-the-shelf clustering methods may discover different patterns in a given set of data. This is because each algorithm has its own bias due to the optimization of different criteria. Furthermore, there is no ground truth to validate the result. Recently the use of clustering ensembles has emerged as a technique for overcoming problems with clustering algorithms [1], [2]. A clustering ensemble technique is characterized by two components: the mechanism to generate diverse partitions, and the consensus function to combine the input partitions into a final clustering. A clustering ensemble consists of different clusterings, obtained from multiple applications of a single algorithm with different initializations, or on various bootstrap samples of the available data, or from the application of different algorithms to the same data set. Clustering ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature: they can provide more robust and stable solutions by making use of the consensus across multiple clustering results, while averaging out the spurious structures that arise due to the various biases to which each participating algorithm is tuned, or to the variance induced by different data samples. Researchers have investigated how to combine clustering ensembles with subspace clusterings in an effort to address both the ill-posed nature of clustering and the curse of dimensionality in high-dimensional spaces [3], [4].
A subspace clustering is a collection of weighted clusters, where each cluster has a weight vector representing the relevance of the features for that cluster. In a subspace clustering ensemble, the consensus function makes use of both the clusters and the weight vectors provided by the base subspace clusterings. Work has also been done to evaluate the relevance of each base clustering (and assign weights accordingly), in an effort to improve the final consensus clustering [5]. To the best of our knowledge, no previous work has investigated how to use weights associated to objects within the clustering ensemble framework.

Several approaches have studied the benefits of weighting the objects in iterative clustering methods combined with boosting techniques [6], [7], [8]. Empirical observations suggest that large weights should be assigned to the objects that are hard to cluster. As a result, centroid-based clustering methods move the centers towards the regions containing the objects whose cluster membership is hard to determine. In boosting, the weights bias the data distribution, thus making the region around the difficult points denser. With this interpretation of the weights, shifting the centers of the clusters corresponds to moving them towards the modes of the distribution modified by the weights. Nock and Nielsen [9] formulated clustering as a constrained minimization using a Bregman divergence, and recommended that hard objects should be given large weights. They analyzed the benefits of using boosting techniques in clustering and introduced weighted versions of several classical clustering algorithms.

Inspired by this work, we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated to objects.

Different from boosting [10], the weights are not subject to iterative changes. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-art consensus clustering methods and is robust to parameter settings.

The main contributions of this paper are as follows: i) To the best of our knowledge, this is the first time that weights associated to objects are used within clustering ensembles. In WOEC, a one-shot method is proposed to assign weights to objects, and three weighted-object techniques are then presented to compute the consolidated clustering. ii) A Weighted-Object k-means algorithm (WOKmeans) is proposed. Its superiority to k-means demonstrates the effectiveness of the one-shot weight assignment. iii) Extensive experiments are conducted to analyze the performance of WOEC against existing consensus clustering algorithms, on several data sets and using three popular clustering evaluation measures.

The rest of this paper is organized as follows. In Section II, we review related work on ensemble clustering. In Section III, we introduce the WOEC methodology. Section IV gives the experimental settings and Section V analyzes the experimental results. Conclusions and future work are provided in Section VI.

II. RELATED WORK

Ensemble techniques were first developed for supervised settings. Empirical results have shown that they are capable of improving the generalization performance of the base classifiers [11]. This result has inspired researchers to further investigate the development of ensemble techniques for unsupervised problems, namely clustering. Fred and Jain [12], [13] captured the information provided by the base clusterings in a co-association matrix, where each entry is the frequency with which two objects are clustered together. The co-association matrix was used as a similarity matrix to compute the final clustering. Strehl and Ghosh [2] proposed three graph-based methods, namely CSPA, HGPA, and MCLA. CSPA constructs a co-association matrix, which is again viewed as a pairwise similarity matrix. A graph is then constructed, where each object corresponds to a node and each entry of the co-association matrix gives the weight of the edge between two objects. The METIS [14] package is used to partition the objects into k clusters. HGPA combines multiple clusterings by solving a hypergraph partitioning problem: in the hypergraph, each hyperedge represents a cluster, and it connects all the objects that belong to the corresponding cluster. The package HMETIS [15] is utilized to generate the final clustering. MCLA considers each cluster as a node in a meta-graph, and sets the pairwise similarity between two clusters as the ratio between the number of shared objects and the total number of objects in the two clusters. In the meta-graph, each cluster is represented by a hyperedge. MCLA groups and collapses related hyperedges, and it assigns each object to the collapsed hyperedge in which it participates most strongly. MCLA also uses METIS [14] to partition the meta-graph. Fern and Brodley [16] introduced HBGF to integrate base clusterings. HBGF constructs a bipartite graph that models both objects and clusters simultaneously as vertices.
In the bipartite graph, an object is connected to several clusters, whereas there is no edge between pairs of objects or pairs of clusters. HBGF applies two graph partitioning algorithms, spectral graph partitioning [17], [18] and METIS, to obtain the final partitions. To handle missing values, Wang, Shan, and Banerjee proposed Bayesian cluster ensembles (BCE) [19], which considers all base clustering results as feature vectors, and then learns a Bayesian mixed-membership model from this feature representation. Domeniconi and Al-Razgan [3], [4] combined the clustering ensemble framework with subspace clustering. A subspace clustering is a collection of weighted clusters, where each cluster has a weight vector representing the relevance of the features for that cluster. The input to the consensus function is a collection of subspace clusterings. The subspace clustering method used is Locally Adaptive Clustering (LAC) [20], and the resulting consensus techniques are WSPA and WBPA. Li and Ding [21] specified different weights for different clusterings and proposed an approach called Weighted Consensus Clustering (WCC). The optimal weights are sought iteratively through optimization and are then used to compute the consensus result within the framework of nonnegative matrix factorization (NMF) [5]. More recently, a method called ensemble clustering by matrix completion (ECMC) was proposed [22]. ECMC uses reliable pairs of objects to construct a partially observed co-association matrix, and exploits a matrix completion algorithm to fill in the missing entries of the co-association matrix. Two objects form a reliable pair if they are often clustered together or seldom clustered together. ECMC then applies spectral clustering to the completed matrix to obtain the final clustering. However, ECMC has two disadvantages: (i) the notion of reliability between objects is hard to define, and (ii) the matrix completion process may result in information loss.

None of the aforementioned methods assigns weights to objects. In contrast, our proposed WOEC algorithms use the information provided by the base clusterings to define object weights which reflect how difficult it is to cluster the corresponding objects. The weights are then embedded within the consensus function. Specifically, we propose three different WOEC approaches: (i) Weighted-Object Meta Clustering (WOMC), (ii) Weighted-Object Similarity Partitioning (WOSPA), and (iii) Weighted-Object Hybrid Bipartite Graph Partitioning (WOHBGP).

III. WEIGHTED-OBJECT ENSEMBLE CLUSTERING

This section introduces the ensemble clustering problem and presents the details of our weighted-object ensemble clustering.

A. Ensemble Clustering Problem Formulation

An ensemble clustering approach consists of two steps: first, multiple base clusterings are generated; second, the base clusterings are consolidated into the final clustering result. We assume here that the base clusterings have already been generated, and we only discuss hard clustering; however, the proposed methods can be extended to solve soft clustering problems as well.

Let $X = \{x_1, x_2, \ldots, x_n\}$ denote the data set and $\mathbf{X} = [x_1^T; x_2^T; \ldots; x_n^T] \in \mathbb{R}^{n \times d}$ be its matrix representation, where $n$ is the number of objects and $d$ is the dimensionality of each object $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$. Let us consider a set of $R$ clustering solutions $C = \{C^1, C^2, \ldots, C^R\}$, where each clustering component $C^r = \{C^r_1, C^r_2, \ldots, C^r_{k_r}\}$, $r = 1, 2, \ldots, R$, partitions the data set $X$ into $k_r$ disjoint clusters, i.e. $C^r_i \cap C^r_j = \emptyset$ ($i \neq j$; $i, j = 1, 2, \ldots, k_r$) and $\bigcup_{k=1}^{k_r} C^r_k = X$. The ensemble clustering problem consists in defining a consensus function $\Gamma$ that maps the set of base clusterings $C$ into a single consolidated clustering $C^*$.

B. One-shot Weight Assignment

Boosting [10], [11] is a supervised technique which seeks to create a strong learner from a set of weak learners. AdaBoost (short for Adaptive Boosting) [23] is the most commonly used boosting algorithm. It iteratively generates a distribution over the data, which is then used to train the next weak classifier. In each run, objects that are hard to classify gain weight, while easy-to-classify objects lose weight, so the newly trained classifier focuses on the points with larger weights. At termination, AdaBoost combines the weak classifiers into a strong one. The effectiveness of boosting has been proven both theoretically and experimentally. Boosting algorithms iteratively assign weights to objects using label information, which is typically unknown in clustering applications. This makes it difficult to apply boosting techniques to clustering, and explains why little research has been performed on weighted-object ensemble clustering methods. This paper aims at reducing this gap. Our goal is to enable difficult-to-cluster points to play a bigger role when producing the final consolidated clustering result. Following this objective, we propose a one-shot assignment of weights to objects using the results of all base clusterings, and then embed the weights into the subsequent consensus clustering process. Similar to boosting, points that are hard to cluster receive larger weights, while easy-to-cluster points are given smaller weights. The difference is that boosting is an iterative process, while our weight assignment scheme is performed in one shot. The details are given below.

Let $A$ be the $n \times n$ co-association matrix built from the clustering solutions $C$:

$A_{ij} = V_{ij} / R$   (1)

where $V_{ij}$ is the number of times objects $x_i$ and $x_j$ co-occur in the same cluster, and $R = |C|$ is the ensemble size. Obviously, $A_{ij} \in [0, 1]$. We set $A_{ii} = 1$, $i = 1, 2, \ldots, n$. $A_{ij} \approx 1$ means that $x_i$ and $x_j$ are often placed in the same cluster; $A_{ij} \approx 0$ means that $x_i$ and $x_j$ are often placed in different clusters. In both cases, the base clusterings show a high level of agreement. When $A_{ij} \approx 0.5$, roughly half of the clusterings group $x_i$ and $x_j$ together, and the other half place them in different clusters. This scenario is the most uncertain one, since the clustering components show no agreement on how to cluster the points $x_i$ and $x_j$. We can capture this trend by mapping the $A_{ij}$ values through a quadratic function: $y(x) = x(1-x)$, $x \in [0, 1]$. This function achieves its peak at $x = 0.5$ and its minima at $x = 0$ and $x = 1$, as illustrated in Fig. 1. (Other functions satisfying these properties can be used as well.)

Fig. 1. The quadratic function $y(x) = x(1-x)$.
Hence, we can measure the level of uncertainty in clustering two points $x_i$ and $x_j$ as follows:

$\mathrm{confusion}(x_i, x_j) = A_{ij}(1 - A_{ij})$   (2)

The confusion index reaches its maximum of 0.25 when $A_{ij} = 0.5$, and its minimum of 0 when $A_{ij} = 0$ or $A_{ij} = 1$. We use this confusion measure to define the weight associated with each object as follows:

$w_i = \frac{4}{n} \sum_{j=1}^{n} \mathrm{confusion}(x_i, x_j)$   (3)

The normalization factor $4/n$ guarantees that $w_i \in [0, 1]$. To avoid a weight of value 0, which can lead to instability, we add a smoothing term:

$w_i = \frac{w_i + e}{1 + e}$   (4)

where $e$ is a small positive number ($e = 0.1$ in our experiments). As a result, $w_i \in (0, 1]$. A large $w_i$ value means that $\mathrm{confusion}(x_i, x_j)$ is large for many different $x_j$; as such, $w_i$ measures how hard it is to cluster $x_i$. The assignment of large weights to points that are hard to cluster is consistent with the way boosting techniques operate [10], [11].
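As a concrete illustration of the one-shot weight assignment of Eqs. (1)-(4), the following Python sketch computes the object weights from a list of base label vectors. The function name and the NumPy-based representation are our own and not part of the paper; this is a minimal sketch assuming hard base clusterings given as integer label arrays.

```python
import numpy as np

def one_shot_weights(base_labels, e=0.1):
    """base_labels: list of R arrays of shape (n,), one label vector per base clustering."""
    base_labels = [np.asarray(lab) for lab in base_labels]
    n, R = base_labels[0].shape[0], len(base_labels)
    # Eq. (1): co-association matrix, A_ij = (number of clusterings grouping i and j) / R
    A = sum((lab[:, None] == lab[None, :]).astype(float) for lab in base_labels) / R
    np.fill_diagonal(A, 1.0)
    # Eq. (2): confusion(x_i, x_j) = A_ij (1 - A_ij), maximal (0.25) when A_ij = 0.5
    confusion = A * (1.0 - A)
    # Eq. (3): w_i = (4/n) * sum_j confusion(x_i, x_j), so that w_i lies in [0, 1]
    w = 4.0 / n * confusion.sum(axis=1)
    # Eq. (4): smoothing keeps every weight strictly positive
    return (w + e) / (1.0 + e)
```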

C. Algorithms

In this section we introduce three different Weighted-Object Ensemble Clustering (WOEC) algorithms. They all make use of the weights associated with objects (computed as explained above) to determine the consensus clustering. The overall process is shown in Fig. 2. As illustrated in the figure, the consensus functions of WOSPA and WOHBGP make use of the feature vectors, while WOMC does not.

Fig. 2. The process map of Weighted-Object Ensemble Clustering (WOEC).

1) Weighted-Object Meta Clustering Algorithm (WOMC): The first approach we introduce is a weighted version of the Meta Clustering Algorithm (MCLA) introduced in [2]. We call the resulting algorithm Weighted-Object Meta Clustering (WOMC). The key step of meta clustering algorithms is the clustering of clusters. When measuring the similarity between two clusters, MCLA treats all the points present in both clusters equally. In contrast, the proposed WOMC technique distinguishes the contributions of hard-to-cluster and easy-to-cluster points. Specifically, a point in the intersection of two clusters contributes to the clusters' similarity in a way that is proportional to its weight. The details of the approach follow.

We make explicit the components of each $C^r$ in $C$. This gives $C = \{C^1_1, \ldots, C^1_{k_1}, C^2_1, \ldots, C^2_{k_2}, \ldots, C^R_1, \ldots, C^R_{k_R}\}$, where each component now corresponds to a cluster. The WOMC algorithm groups the clusters in $C$. To this end, it proceeds as follows. It constructs an undirected meta-graph $G = (V, E)$ with $|V| = n_c = \sum_{r=1}^{R} k_r$ vertices, each representing a cluster in $C$. The edge $E_{ij}$ connecting vertices $V_i$ and $V_j$ is assigned the weight value $S_{ij}$ defined as follows:

$S_{ij} = \frac{\sum_{x_k \in C_i \cap C_j} w_k}{\sum_{x_k \in C_i \cup C_j} w_k}$   (5)

Clearly, $S_{ij} \in [0, 1]$. If $S_{ij} = 0$, there is no edge between $C_i$ and $C_j$ in $G$; $S_{ij} = 1$ indicates that $C_i$ and $C_j$ contain the same points. WOMC partitions the meta-graph $G$ into $k'$ (the number of clusters in the final clustering result) meta-clusters, each representing a group of clusters, by using the similarity measure $S_{ij}$ and by applying the well-known graph partitioning algorithm METIS [14]. Similarly to MCLA, WOMC then assigns each point to the meta-cluster in which it participates most strongly. Specifically, a data point may occur in different meta-clusters, and it is associated to a given meta-cluster according to the ratio of that meta-cluster's constituent clusters it belongs to. A point is eventually assigned to the meta-cluster with the highest ratio for that point. Ties are broken randomly. There might be a situation in which no object is assigned to a meta-cluster; thus, the final combined clustering $C^*$ may contain fewer than $k'$ clusters.

The similarity measure defined in Eq. (5) computes the similarity of two clusters not only based on the fraction of shared points, but also on the values of the weights associated to the points. In contrast, the binary Jaccard measure $|C_i \cap C_j| / |C_i \cup C_j|$ computes the similarity as the ratio between the size of the intersection and the size of the union of $C_i$ and $C_j$. Suppose $C_i$ and $C_j$ have fixed points assigned to them; their Jaccard similarity is then also fixed. Let us see how changing the weights assigned to the points affects the value of $S_{ij}$. Consider $D = (C_i \cup C_j) \setminus (C_i \cap C_j)$, i.e. the set of points not in the intersection. Let the weights assigned to the points in $D$ be fixed. Then, increasing (decreasing) the weights assigned to the points in $C_i \cap C_j$ will cause an increase (decrease) of $S_{ij}$. As such, the more hard-to-cluster points $C_i$ and $C_j$ share, the more similar they are. Let us now fix the weights assigned to the points in $C_i \cap C_j$. Increasing (decreasing) the weights assigned to the points in $D$ will cause a decrease (increase) of $S_{ij}$. As a consequence, the more hard-to-cluster points $C_i$ and $C_j$ do not share, the smaller their similarity is. Algorithm 1 describes the WOMC algorithm.

Algorithm 1 Weighted-Object Meta Clustering Algorithm (WOMC)
Input: Set of base clusterings $C$; graph partitioning algorithm METIS; number of clusters $k'$.
Output: The final consensus clustering $C^*$.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the meta-graph $G = (V, E)$, where the weights associated to the edges are computed using Eq. (5).
3: MetaClusters = METIS($G$, $k'$). // Partition the meta-graph into $k'$ meta-clusters.
4: $C^*$ = Assign(MetaClusters). // Assign each object to the meta-cluster in which it participates most strongly.
5: return $C^*$.
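To make Eq. (5) concrete, the sketch below (our own illustration, with hypothetical function names) computes the weighted similarity between two clusters given as sets of object indices, and assembles the meta-graph adjacency matrix that WOMC hands to the graph partitioner.

```python
import numpy as np

def cluster_similarity(ci, cj, w):
    """Eq. (5): weighted overlap between two clusters (sets of object indices)."""
    ci, cj = set(ci), set(cj)
    num = sum(w[k] for k in ci & cj)   # weights of the shared points
    den = sum(w[k] for k in ci | cj)   # weights of all points in either cluster
    return num / den if den > 0 else 0.0

def meta_graph(clusters, w):
    """Adjacency matrix over all n_c clusters of the ensemble (meta-graph edge weights)."""
    nc = len(clusters)
    S = np.zeros((nc, nc))
    for i in range(nc):
        for j in range(i + 1, nc):
            S[i, j] = S[j, i] = cluster_similarity(clusters[i], clusters[j], w)
    return S
```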
2) Weighted-Object Similarity Partitioning Algorithm (WOSPA): Previous work on clustering [6], [7], [8] suggested assigning large weights to objects that are hard to cluster, and iteratively moving cluster centers towards areas that contain objects with large weights. Inspired by this work, we proceed similarly while leveraging the information provided by the ensemble to estimate the weights. The motivation stems from boosting. The difference is that clustering methods combined with a boosting technique assign weights to objects and move cluster centers iteratively, while we move the centers only once for each base clustering, by making use of the original data and of the objects' weight information. Clustering with a boosting technique aims to generate a better single clustering result, while we seek to find a better consolidated consensus solution. To this end, for each cluster in the base clustering $C^r$ ($r = 1, 2, \ldots, R$), we compute the corresponding weighted center as follows:

$m^r_l = \frac{\sum_{x_i \in C^r_l} w_i x_i}{\sum_{x_i \in C^r_l} w_i}$   (6)

where $l = 1, \ldots, k_r$. The basic idea behind this step is to move the center of each cluster towards the region consisting of the objects that are hard to cluster. We call this the shifting-center technique. The experiments presented in Section V demonstrate the rationale behind, and the improvement achieved by, such a transformation.
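A minimal sketch of the shifting-center step of Eq. (6), assuming one base clustering given as a label vector; the weighted average moves each center towards its hard-to-cluster members. The helper name is our own.

```python
import numpy as np

def weighted_centers(X, labels, w):
    """Eq. (6): weight-shifted center of every cluster of one base clustering.
    X: (n, d) data matrix, labels: (n,) cluster labels, w: (n,) object weights."""
    return np.vstack([np.average(X[labels == l], axis=0, weights=w[labels == l])
                      for l in np.unique(labels)])
```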

The similarity between a point $x_i$ and a weighted center $m^r_l$ is calculated using the following exponential function:

$d(x_i, m^r_l) = \exp\{-\|x_i - m^r_l\|^2 / t\}$   (7)

The probability of cluster $C^r_l$, given $x_i$, can then be defined as:

$P(C^r_l \mid x_i) = \frac{d(x_i, m^r_l)}{\sum_{l'=1}^{k_r} d(x_i, m^r_{l'})}$   (8)

The smaller $\|x_i - m^r_l\|^2$ is, the larger $P(C^r_l \mid x_i)$ will be. We can now define the vector $P^r_i$ of posterior probabilities associated with $x_i$:

$P^r_i = (P(C^r_1 \mid x_i), P(C^r_2 \mid x_i), \ldots, P(C^r_{k_r} \mid x_i))$   (9)

where $\sum_{l=1}^{k_r} P(C^r_l \mid x_i) = 1$. $P^r_i$ provides a new representation of $x_i$ in a space of relative coordinates with respect to the cluster centroids, where each dimension corresponds to one cluster. This new representation embeds information from both the original input data and the clustering ensemble. We use the cosine similarity to measure the similarity $S^r_{ij}$ between $x_i$ and $x_j$ with respect to the base clustering $C^r$:

$S^r_{ij} = \frac{P^r_i (P^r_j)^T}{\|P^r_i\| \, \|P^r_j\|}$   (10)

where $T$ denotes the transpose of a vector. There are $R$ similarity matrices $S^r$ ($r = 1, \ldots, R$), each one corresponding to a base clustering. We combine these matrices into one final similarity matrix $S$:

$S = \frac{1}{R} \sum_{r=1}^{R} S^r$   (11)

$S$ represents the average similarity between $x_i$ and $x_j$ (through the vectors $P_i$ and $P_j$) across the $R$ contributing clusterings. Hence we can construct an undirected graph $G = (V, E)$, where $|V| = n$ and each vertex represents an object in $X$. The edge $E_{ij}$ connecting vertices $V_i$ and $V_j$ is assigned the weight value $S_{ij}$. A graph partitioning algorithm (METIS in our experiments) can be applied to the graph $G$ to compute the partitioning of the $n$ vertices that minimizes the edge weight-cut. This gives the consensus clustering we seek. The algorithm is illustrated in Algorithm 2.

Algorithm 2 Weighted-Object Similarity Partitioning Algorithm (WOSPA)
Input: Data set $X$; set of base clusterings $C$; graph partitioning algorithm METIS; number of clusters $k'$.
Output: The final consensus clustering $C^*$.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the corresponding graph $G = (V, E)$, with weights computed through Eqs. (6)-(11).
3: $C^*$ = METIS($G$, $k'$). // Partition the graph into $k'$ parts, each representing a final cluster.
4: return $C^*$.

3) Weighted-Object Hybrid Bipartite Graph Partitioning Algorithm (WOHBGP): WOSPA measures pairwise similarities which are solely instance-based, thus ignoring cluster similarities. We now introduce our third technique, WOHBGP, which attempts to capture both kinds of similarities. This approach reduces the clustering consensus problem to a bipartite graph partitioning problem, which partitions both cluster vertices and instance vertices simultaneously; thus, it also accounts for similarities between clusters. Like WOSPA, WOHBGP makes use of the shifting-center technique and computes probabilities as in Eq. (8). WOHBGP constructs an undirected hybrid bipartite graph $G = (V, E)$, where $V$ contains $n_c + n$ vertices ($n_c = \sum_{r=1}^{R} k_r$). The first $n_c$ vertices represent all the clusters in $C$ and the last $n$ vertices represent the objects. The edge $E_{ij}$ connecting vertices $V_i$ and $V_j$ is assigned the weight value $S_{ij}$ defined as follows:

$S_{ij} = S_{ji} = \begin{cases} 0 & \text{if vertices } i \text{ and } j \text{ are both clusters or both objects} \\ P(C_j \mid x_i) & \text{otherwise} \end{cases}$   (12)

where $P(C_j \mid x_i)$ is the probability of cluster $C_j$ given $x_i$, and can be computed from Eq. (8). The matrix $S$ of elements $S_{ij}$ can be written as

$S = \begin{bmatrix} 0 & B^T \\ B & 0 \end{bmatrix}$

where the $n \times n_c$ matrix $B$ is defined as follows:

$B = \begin{bmatrix} P^1_1 & P^2_1 & \ldots & P^R_1 \\ P^1_2 & P^2_2 & \ldots & P^R_2 \\ \vdots & \vdots & & \vdots \\ P^1_n & P^2_n & \ldots & P^R_n \end{bmatrix}$   (13)

where $P^r_i$ is the row vector defined in Eq. (9).
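The following sketch ties Eqs. (7)-(13) together: it builds the posterior vectors $P^r_i$ from the shifted centers, the averaged object-object similarity used by WOSPA, and the hybrid bipartite matrix used by WOHBGP. Function names are our own, and the code is only an illustration of the formulas above under those naming assumptions.

```python
import numpy as np

def membership_probs(X, centers, t):
    """Eqs. (7)-(9): row-stochastic matrix P with P[i, l] = P(C_l^r | x_i)."""
    C = np.vstack(centers)                                                # (k_r, d) shifted centers
    D = np.exp(-((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) / t)   # Eq. (7)
    return D / D.sum(axis=1, keepdims=True)                               # Eq. (8), rows sum to 1

def wospa_similarity(P_list):
    """Eqs. (10)-(11): cosine similarities per base clustering, averaged over the ensemble."""
    S = 0.0
    for P in P_list:                                   # one (n, k_r) matrix per base clustering
        Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
        S = S + Pn @ Pn.T
    return S / len(P_list)

def wohbgp_bipartite(P_list):
    """Eqs. (12)-(13): B = [P^1 | ... | P^R] (n x n_c) and S = [[0, B^T], [B, 0]],
    with the n_c cluster vertices listed first and the n object vertices last."""
    B = np.hstack(P_list)
    n, nc = B.shape
    S = np.zeros((nc + n, nc + n))
    S[:nc, nc:] = B.T
    S[nc:, :nc] = B
    return S
```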
A graph partitioning algorithm (METIS in our experiments) is then used to partition this graph into $k'$ hybrid parts (each containing objects and clusters), so that the edge weight-cut is minimized. The partition of the objects provides the final clustering result. The algorithm is given in Algorithm 3.

Algorithm 3 Weighted-Object Hybrid Bipartite Graph Partitioning Algorithm (WOHBGP)
Input: Data set $X$; set of base clusterings $C$; graph partitioning algorithm METIS; number of clusters $k'$.
Output: The final consensus clustering $C^*$.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the hybrid bipartite graph $G = (V, E)$ as described in this subsection.
3: $C_{hybrid}$ = METIS($G$, $k'$). // Partition the graph into $k'$ parts.
4: return the partition of the objects in $C_{hybrid}$ as $C^*$.
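All three WOEC variants end with a $k'$-way partition of a weighted graph; the paper uses METIS for this step. The sketch below substitutes scikit-learn's spectral clustering on the precomputed affinity matrix as a readily available stand-in — an assumption for illustration only, not the setup used in the experiments.

```python
from sklearn.cluster import SpectralClustering

def partition_graph(S, k):
    """Partition a precomputed (symmetric, non-negative) affinity matrix S into k parts.
    Stand-in for the METIS call used in Algorithms 1-3."""
    model = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=0)
    return model.fit_predict(S)

# For WOHBGP, partition the (n_c + n) x (n_c + n) hybrid matrix and keep only the
# labels of the last n vertices (the objects) as the consensus clustering.
```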

D. Weighted-Object k-means

To verify the effect of moving the center of each cluster according to Eq. (6), we introduce here a modified version of k-means that makes use of such weighted means. The resulting technique is called Weighted-Object k-means (WOKmeans), and it is illustrated in Algorithm 4.

Algorithm 4 Weighted-Object k-means (WOKmeans)
Input: Data set $X$; set of base clusterings $C$; number of clusters $k'$.
Output: The final clustering $C^*$.
1: Assign weights to objects using the one-shot weight assignment.
2: Randomly generate $k'$ weighted cluster centers $c_1, \ldots, c_{k'}$, each corresponding to a cluster $C_k = \emptyset$ ($k = 1, \ldots, k'$).
3: repeat
4:   Assign each object to its closest weighted center: $\forall i$, $C_k = C_k \cup \{x_i\}$, where $k = \arg\min_k \|x_i - c_k\|$.
5:   Generate new weighted centers: $c_k = \frac{\sum_{x_i \in C_k} w_i x_i}{\sum_{x_i \in C_k} w_i}$, $k = 1, \ldots, k'$   (14)
6: until the weighted centers no longer change or the maximum number of iterations is reached
7: return $C^* = \{C_1, C_2, \ldots, C_{k'}\}$.

WOKmeans computes the weighted centers after assigning objects to clusters in each iteration, as shown in Eq. (14). The main idea is to iteratively move the cluster centers towards regions with larger weights. Compared to the weighted version of k-means in [9], WOKmeans computes all object weights in advance using the co-association matrix provided by the base clusterings, while [9] assigns weights to objects iteratively.

IV. EMPIRICAL EVALUATION

A. Data Sets

We conducted experiments on one simulated data set and eight UCI data sets to compare the performance of the WOEC methods against existing state-of-the-art techniques. A description of the data sets used is given in Table I. The simulated data set is discussed in Section V-A. The Satimage data set originally contained 6435 objects; we randomly selected 4200 images equally distributed among the six classes for our experiments. The Spambase data set had 4601 objects; we randomly sampled 500 objects (250 for each class). The other six data sets were used in their original form. For each data set, each feature is preprocessed to have zero mean and unit variance.

TABLE I. DATA SETS USED IN THE EXPERIMENTS (data sets: simulated data, AustralianCredit, BalanceScale, BreastTissue, Glass, HayesRoth, Iris, Satimage, Spambase; columns: #Objects, #Dimensions, #Classes)

B. Evaluation Measures

There exist various metrics to evaluate clustering results [24], [25]. Since the labels of the data are known, we use the Rand Index (RI) [26], the Adjusted Rand Index (ARI) [26], and the Normalized Mutual Information (NMI) [2] as validity indices. Let $C^* = \{C^*_1, C^*_2, \ldots, C^*_{k'}\}$ be the consensus clustering result ($k'$ is the number of clusters in the final consensus clustering), and let $\bar{C} = \{\bar{C}_1, \bar{C}_2, \ldots, \bar{C}_k\}$ be the ground-truth clustering ($k$ is the number of classes).

1) RI measures the similarity between $C^*$ and $\bar{C}$ as follows:

$RI(C^*, \bar{C}) = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}$   (15)

where $a$ is the number of pairs of points that are in the same cluster in both $C^*$ and $\bar{C}$; $b$ is the number of pairs of points that are in different clusters in both $C^*$ and $\bar{C}$; $c$ is the number of pairs of points that are in the same cluster in $C^*$ and in different clusters in $\bar{C}$; and $d$ is the number of pairs of points that are in different clusters in $C^*$ and in the same cluster in $\bar{C}$. RI ranges from 0 to 1.

2) ARI defines the similarity between $C^*$ and $\bar{C}$ as follows:

$ARI(C^*, \bar{C}) = \frac{\sum_{i,j} \binom{n_{ij}}{2} - S\bar{S}/\binom{n}{2}}{\frac{1}{2}(S + \bar{S}) - S\bar{S}/\binom{n}{2}}$   (16)

where $S = \sum_i \binom{n_i}{2}$, $\bar{S} = \sum_j \binom{\bar{n}_j}{2}$, $n_i$ and $\bar{n}_j$ denote the number of objects in clusters $C^*_i$ and $\bar{C}_j$ ($i = 1, \ldots, k'$; $j = 1, \ldots, k$), respectively, and $n_{ij} = |C^*_i \cap \bar{C}_j|$. The ARI values range from -1 to +1.

3) NMI [2] treats the clustering labels of $C^*$ and $\bar{C}$ as two random variables. The mutual information of these two random variables is evaluated and then normalized to the $[0, 1]$ range.

A larger value of RI (ARI or NMI) means that $C^*$ and $\bar{C}$ are more similar.
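As a sanity check on Eqs. (15)-(16), the sketch below computes RI by pair counting and ARI from the contingency table; it assumes hard labelings given as integer vectors, and the helper names are our own rather than from the paper.

```python
import numpy as np
from scipy.special import comb

def rand_index(pred, true):
    """Eq. (15): RI = (a + b) / C(n, 2), counting agreeing pairs of points."""
    pred, true = np.asarray(pred), np.asarray(true)
    n = len(true)
    iu = np.triu_indices(n, k=1)
    same_p = (pred[:, None] == pred[None, :])[iu]
    same_t = (true[:, None] == true[None, :])[iu]
    a = np.sum(same_p & same_t)       # same cluster in both partitions
    b = np.sum(~same_p & ~same_t)     # different clusters in both partitions
    return (a + b) / comb(n, 2)

def adjusted_rand_index(pred, true):
    """Eq. (16): ARI from the contingency table n_ij.
    Should agree with sklearn.metrics.adjusted_rand_score."""
    pred, true = np.asarray(pred), np.asarray(true)
    nij = np.array([[np.sum((pred == p) & (true == t)) for t in np.unique(true)]
                    for p in np.unique(pred)])
    sum_ij = comb(nij, 2).sum()
    S1, S2 = comb(nij.sum(axis=1), 2).sum(), comb(nij.sum(axis=0), 2).sum()
    expected = S1 * S2 / comb(len(true), 2)
    return (sum_ij - expected) / (0.5 * (S1 + S2) - expected)
```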
C. Base Clusterings

To generate diverse clusterings we used k-means, applying random sampling of the data and feature subset selection. Specifically, for the first half of the base clusterings, in each run we randomly selected 70% of the objects and performed k-means on this subset of the data; the remaining objects were assigned to the closest cluster based on the Euclidean distances between the object and the cluster centers. For the second half of the base clusterings, in each run we selected 70% of the features and ran k-means in the corresponding subspace. When using k-means, we randomly generated the initial centers in each independent run.
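A sketch of the base-clustering generation scheme of Section IV-C: half of the base clusterings come from k-means on a random 70% object sample (the remaining objects are assigned to the nearest center), and half from k-means on a random 70% feature subset. The scikit-learn calls and the helper name are our own; only the sampling fractions and the random initialization follow the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_base_clusterings(X, k, R, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labelings = []
    for r in range(R):
        if r < R // 2:
            # k-means on a random 70% subsample; remaining objects go to the nearest center
            idx = rng.choice(n, size=int(0.7 * n), replace=False)
            km = KMeans(n_clusters=k, init="random", n_init=1).fit(X[idx])
            labelings.append(km.predict(X))
        else:
            # k-means in a random 70% feature subspace
            feats = rng.choice(d, size=int(0.7 * d), replace=False)
            km = KMeans(n_clusters=k, init="random", n_init=1).fit(X[:, feats])
            labelings.append(km.labels_)
    return labelings
```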

V. RESULTS AND ANALYSIS

For a fair comparison, we used the METIS package [14] for all the clustering ensemble methods that need to partition a graph. The number of clusters $k'$ in the consensus clustering is set equal to the number of classes, for all data sets and all methods. (In real applications, classes may be multimodal, and thus $k'$ should be larger than the number of classes; in other cases, there may be fewer clusters than classes. These scenarios are not considered in this paper, and we simply set $k'$ equal to the number of classes.) In our experiments, the value of $t$ (see Eq. (7)) is set, for each base clustering, to the average of the squared Euclidean distances between the data and the weighted means. All the experimental results are averages over 100 independent runs. We also performed a paired t-test to assess the statistical significance of the results; the significance level was always set to 0.05. We compare the proposed WOEC methods against eight state-of-the-art ensemble clustering algorithms, namely: CSPA [2], MCLA [2], HBGF [16], BCE [19], WCC [21], ECMC [22], WSPA [3], [4], and WBPA [3], [4]. (HGPA [2] was not included in our experiments since it typically performs worse than CSPA and MCLA.)

A. Simulated Data

To illustrate the effectiveness of the proposed three WOEC algorithms and of WOKmeans, we first generated a simulated data set. The simulated data consist of $d = 2$ input features and 2 classes, each consisting of 150 points. Both classes are normally distributed according to multivariate Gaussian distributions, as shown in Fig. 3. The mean vector and the covariance matrix of class 1 are (5, 1) and [1 0; 0 1], respectively; for class 2, the mean vector is (2, 1) and the covariance matrix is again [1 0; 0 1]. We set the ensemble size R = 40. RI, ARI, and NMI are applied to assess the performance of all algorithms. For each of the 40 base clusterings, the data sampling technique was used: we randomly selected 70% of the points and performed k-means on this subset of the data, and the remaining points were assigned to the closest cluster based on the Euclidean distance between the points and the cluster centers. Each feature was also standardized to zero mean and unit variance. For each independent run, we record the minimum (min), maximum (max), and mean value (of RI, ARI, and NMI) over all the base clusterings. The corresponding averages over 100 runs are also reported. The results are shown in Table II; significantly better values are given in boldface.

Fig. 3. Simulated data (features 1 and 2): class 1 and class 2 with mean vectors (5, 1) and (2, 1); the pseudo-outliers are highlighted.

TABLE II. RESULTS ON SIMULATED DATA (rows: RI, ARI, NMI; columns: min, max, and mean of the base clusterings, CSPA, MCLA, HBGF, BCE, WCC, ECMC, and the WOEC methods WOMC, WOSPA, and WOHBGP)

WOSPA and WOHBGP achieve the best performance for all the evaluation indices (RI, ARI, and NMI). This is because the hard-to-cluster points (i.e., the points near the cluster boundaries) have the largest contribution to the shifting-center procedure. This causes a shift of the cluster centers towards the region with hard-to-cluster points. As such, the influence of the pseudo-outliers is reduced, thereby leading to more robust ensemble clustering results. (As shown in Fig. 3, the circled points on the left are far away from the mean of class 1, and the density around them is much lower than around the other points of class 1. These points can bias the computation of the mean vector. They are generated from the same distribution as the other points of class 1 and should be grouped in the same cluster, but they behave like outliers for the purpose of this discussion; for this reason we call them pseudo-outliers. Similarly, the points in the right circle are the pseudo-outliers of class 2.) BCE performs the worst. WOMC achieves comparable performance to the other methods, but loses to WOSPA and WOHBGP, showing that the shifting-center technique works better on this data set.

To further demonstrate the effectiveness of shifting centers, we conducted experiments on the simulated data to compare the performance of k-means and WOKmeans. The experimental setup is the same as before. Table III shows the results.

TABLE III. COMPARISON BETWEEN k-MEANS AND WOKMEANS ON SIMULATED DATA (RI, ARI, and NMI of k-means and WOKmeans, reported as mean ± standard deviation)
Clearly, WOKmeans achieves a considerable improvement with respect to k-means, and its standard deviations are also smaller. k-means is negatively affected by the pseudo-outliers. Although WOKmeans improves upon k-means, its performance is considerably worse than that of the ensemble clustering methods, demonstrating the effectiveness of ensemble techniques. The points that receive the largest weights are highlighted with squares in Fig. 3, showing that between-cluster points tend to gain large weights. From Tables II and III, we can see that a better RI value generally corresponds to better ARI and NMI values. Therefore, we only report ARI values for the rest of the experiments, due to space limitations.

B. Evaluation of the WOEC Techniques

We conducted experiments to evaluate the effectiveness of the proposed WOEC methods on the eight UCI data sets. Base clusterings were generated as explained in Section IV-C. We experimented with three different ensemble sizes R: 20, 40, and 60. The results in terms of ARI are shown in Table IV. Each ARI value reported is the average of 100 independent runs. In each row, the values in boldface are the ARI values that are statistically significantly better than the remaining ones.

TABLE IV. COMPARISON AGAINST EXISTING METHODS (ARI values for each data set and ensemble size R, for the base clusterings (min, max, mean), CSPA, MCLA, HBGF, BCE, WCC, ECMC, and the WOEC methods; the average ARI of k-means is given in parentheses after each data set name)

We also ran k-means 100 times for baseline comparison; its average ARI is given in Table IV in parentheses, following the name of each data set. For each independent run, we record the minimum, maximum, and mean ARI over all the base clusterings. The corresponding averages are also reported in the table. We can see that the mean ARI of the base clusterings is very close to the baseline for each data set. From Table IV we can see that at least one of the WOEC methods always achieves the best performance on all data sets, except for R = 20 and R = 40 on BreastTissue. To be more specific, WOMC gives the best performance on AustralianCredit, BalanceScale, and Satimage; the other two WOEC methods, WOSPA and WOHBGP, provide the best results on Glass, HayesRoth, and Spambase, and on the Iris data they achieve comparable performance and both outperform all other methods. As expected, the ensemble clustering methods achieve superior performance with respect to the baseline for most of the problems. In particular, on the Spambase data set the ensemble techniques provide a considerable improvement; the poor performance of the baseline may be due to the large dimensionality of the data in this case. There are exceptions to this general trend. For instance, most of the ensemble clustering methods are inferior to the baseline on BalanceScale, Glass, and HayesRoth. Nevertheless, at least one of the WOEC methods performs well also for these data sets, i.e., WOMC on BalanceScale, and the shifting-center based WOEC methods on Glass and HayesRoth. An interesting result is that on HayesRoth and Spambase the best WOEC results come close to the max index of the base clusterings, and on Satimage WOMC achieves ARI values that are even better than the max index.

As pointed out earlier, our WOMC technique is a variation of MCLA. We can see that these two methods have comparable performance on AustralianCredit, Glass, HayesRoth, and Spambase, while WOMC outperforms MCLA on the other four data sets. This shows the benefit of considering object weights when measuring the similarity of two meta-clusters in a meta clustering algorithm. Compared against the most recent ECMC technique, the WOEC methods achieve much better results in general. The main reason is that ECMC assigns a value of one to the entries of the co-association matrix that are larger than a threshold $d_1$, and a value of zero to the entries that are smaller than a threshold $d_0$. This leads to information loss, in particular the loss of information regarding the hard-to-cluster objects. In addition, the setting of thresholds such as $d_0$ and $d_1$ is always problematic; in our experiments we set $d_0 = 0.2$ and $d_1 = 0.8$, as suggested in [22]. ECMC also needs to choose a suitable value for a parameter C that minimizes $\mathbf{1}^T M \mathbf{1}$, where $M$ is a pairwise similarity matrix (details are in [22]), which is time consuming. In contrast, WOMC does not need to perform any parameter selection, while WOSPA and WOHBGP are robust to the setting of their parameter $t$, as the analysis in Section V-E shows.

C. Comparison against WSPA and WBPA

To perform comparisons against WSPA and WBPA, we generated the base clusterings using the LAC algorithm, as described in [3], [4]. LAC assigns weights to features, locally at each cluster.
Both WSPA and WBPA require such weights, while our WOEC techniques have a wider applicability: the weight vectors computed by the LAC algorithm are discarded when LAC is combined with the WOEC approaches. Results are shown in Table V. As before, statistically significant results are in boldface. We can see that the WOEC methods always achieve the best performance, while WSPA and WBPA give results comparable to the best only on the BreastTissue and Glass data sets.

D. Evaluation of WOKmeans

In Section V-A we discussed the superiority of WOKmeans with respect to k-means on the simulated data. To further investigate the behavior of WOKmeans, in this section we add comparisons on three data sets: BreastTissue, Glass, and Iris. The reported results are the average of 100 individual runs. In each run, the maximum number of iterations of both methods is set to 100. The number of clusters is set to the number of classes given by the ground-truth labels. The ensemble sizes are again 20, 40, and 60. Note that only WOKmeans is affected by the ensemble size. Table VI gives the results.

Fig. 4. Sensitivity analysis of parameter t: ARI as a function of t on (a) AustralianCredit, (b) HayesRoth, and (c) Spambase.

Fig. 5. Sensitivity analysis of ensemble size R: ARI of WOMC, WOSPA, and WOHBGP as a function of R on (a) AustralianCredit, (b) HayesRoth, and (c) Spambase.

TABLE V. COMPARISON AGAINST WSPA AND WBPA (ARI values of WSPA, WBPA, and the WOEC methods for each data set and ensemble size R)

We observe that WOKmeans significantly outperforms k-means on the three data sets in all cases. For the Iris data, WOKmeans not only achieves a better performance, but it is also more stable, with much smaller standard deviations.

TABLE VI. EXPERIMENTAL RESULTS OF k-MEANS AND WOKMEANS (ARI, mean ± standard deviation, on BreastTissue, Glass, and Iris for ensemble sizes 20, 40, and 60)

These results empirically demonstrate again the effect of Eq. (6) on the resulting clusters. As suggested by previous analyses [6], [7], [8], moving the centers towards the hard-to-cluster data can improve the clustering of those data.

E. Sensitivity Analysis of Parameter t

Here we test how sensitive the WOSPA and WOHBGP methods are with respect to the value of the parameter t in Eq. (7). We set the ensemble size R = 40 and vary the parameter t from 0.5 to 100. We use the AustralianCredit, HayesRoth, and Spambase data sets to observe the performance of the methods for different values of t. Fig. 4 shows the results. We can see that both WOSPA and WOHBGP tend to be stable when t > 5 on HayesRoth and Spambase, while the performance is almost the same for all values of t on AustralianCredit.

In fact, the probability of a cluster $C^r_k$ given an object $x_i$, computed as in Eq. (8), is essentially determined by the Euclidean distances between $x_i$ and all the cluster centers of clustering $C^r$. Different values of $t$ do not change the ordering of the probability values. This explains why both WOSPA and WOHBGP are robust to different values of the parameter $t$. Nevertheless, we should avoid setting $t$ to values that are too small or too large: in these two cases, in fact, $d(x_i, m^r_k) \to 0$ and $d(x_i, m^r_k) \to 1$ for all $k$, respectively, in Eq. (7), and this affects the consensus clustering process of WOSPA and WOHBGP. Therefore, the average of the squared Euclidean distances between the data and the weighted means is a reasonable default value of $t$ for each base clustering.

F. Sensitivity Analysis of Ensemble Size R

In this section we investigate the sensitivity of the three WOEC methods with respect to the ensemble size R. We vary R from 5 to 100, and we use the same three data sets: AustralianCredit, HayesRoth, and Spambase. Fig. 5 gives the results. In general, as the ensemble size increases, the performance of each WOEC method improves. WOEC tends to become stable when R is larger than 10, 60, and 30 on AustralianCredit, HayesRoth, and Spambase, respectively. On AustralianCredit, all three WOEC methods achieve a good performance already when R = 5. WOSPA and WOHBGP also achieve a stable performance on Spambase for small ensemble sizes. Each method gives its highest ARI on a different data set.

VI. CONCLUSIONS AND FUTURE WORK

We have introduced an approach called Weighted-Object Ensemble Clustering (WOEC). Three different consensus techniques have been proposed to leverage weighted objects. Experimental results indicate that each of the three techniques leads in performance on different data sets. This suggests that the three methods are most effective under different conditions; achieving a better understanding of such conditions is an interesting question we will investigate. Like most existing ensemble methods, the WOEC approaches are quadratic in the number of points and linear in the dimensionality. How to apply the WOEC methods effectively and efficiently to large-scale and high-dimensional data is also a challenging direction for future work.

VII. ACKNOWLEDGMENTS

This paper is partially supported by grants from the Natural Science Foundation of China, the Fundamental Research Funds for the Central Universities of China (Project Nos. XDJK21B2, XDJK213C123), the Doctoral Fund of Southwest University (Nos. SWU1163 and SWU11334), and the China Scholarship Council (CSC).

REFERENCES

[1] J. Ghosh and A. Acharya, Cluster ensembles, WIREs Data Mining and Knowledge Discovery, vol. 1, no. 4, 2011.
[2] A. Strehl and J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research, vol. 3, 2002.
[3] M. Al-Razgan and C. Domeniconi, Weighted clustering ensembles, in Proceedings of the 6th SIAM International Conference on Data Mining, April 2006.
[4] C. Domeniconi and M. Al-Razgan, Weighted cluster ensembles: Methods and analysis, ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 4, January 2009.
[5] T. Li, C. Ding, and M. I. Jordan, Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization, in Proceedings of the 7th IEEE International Conference on Data Mining, 2007.
[6] G. Hamerly and C. Elkan, Alternatives to the k-means algorithm that find better clusterings, in Proceedings of the 11th International Conference on Information and Knowledge Management, 2002.
[7] A. Topchy, B. Minaei-Bidgoli, A. K. Jain, and W. F. Punch, Adaptive clustering ensembles, in Proceedings of the 17th International Conference on Pattern Recognition, August 2004.
[8] B. Zhang, M. Hsu, and U. Dayal, K-harmonic means: a spatial clustering algorithm with boosting, in Temporal, Spatial, and Spatio-Temporal Data Mining, 2001.
[9] R. Nock and F. Nielsen, On weighting clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, August 2006.
[10] R. E. Schapire, The strength of weak learnability, Machine Learning, vol. 5, no. 2, 1990.
[11] Z. Zhou, Ensemble Methods: Foundations and Algorithms. Chapman & Hall, 2012.
[12] A. Fred and A. Jain, Data clustering using evidence accumulation, in Proceedings of the 16th International Conference on Pattern Recognition, 2002.
[13] A. Fred and A. K. Jain, Evidence accumulation clustering based on the k-means algorithm, in Proceedings of the Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition, 2002.
[14] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM Journal on Scientific Computing, vol. 20, no. 1, 1998.
[15] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, Multilevel hypergraph partitioning: Application in VLSI domain, in Proceedings of the Design Automation Conference, 1997.
[16] X. Z. Fern and C. E. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, in Proceedings of the 21st International Conference on Machine Learning, 2004.
[17] A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, in Advances in Neural Information Processing Systems, 2001.
[18] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, August 2000.
[19] H. Wang, H. Shan, and A. Banerjee, Bayesian cluster ensembles, in Proceedings of the 9th SIAM International Conference on Data Mining, 2009.
[20] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, Subspace clustering of high dimensional data, in Proceedings of the SIAM International Conference on Data Mining, April 2004.
[21] T. Li and C. Ding, Weighted consensus clustering, in Proceedings of the 8th SIAM International Conference on Data Mining, 2008.
[22] J. Yi, T. Yang, R. Jin, A. K. Jain, and M. Mahdavi, Robust ensemble clustering by matrix completion, in Proceedings of the 12th IEEE International Conference on Data Mining, December 2012.
[23] Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, vol. 55, 1997.
[24] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, An extensive comparative study of cluster validity indices, Pattern Recognition, vol. 46, no. 1, January 2013.
[25] C. Legány, S. Juhász, and A. Babos, Cluster validity measurement techniques, in Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, 2006.
[26] L. Hubert and P. Arabie, Comparing partitions, Journal of Classification, vol. 2, no. 1, 1985.


More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Sicheng Xiong for the degree of Master of Science in Computer Science presented on April 25, 2013. Title: Active Learning of Constraints for Semi-Supervised Clustering Abstract

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Rough Set based Cluster Ensemble Selection

Rough Set based Cluster Ensemble Selection Rough Set based Cluster Ensemble Selection Xueen Wang, Deqiang Han, Chongzhao Han Ministry of Education Key Lab for Intelligent Networks and Network Security (MOE KLINNS Lab), Institute of Integrated Automation,

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Meta-Clustering. Parasaran Raman PhD Candidate School of Computing

Meta-Clustering. Parasaran Raman PhD Candidate School of Computing Meta-Clustering Parasaran Raman PhD Candidate School of Computing What is Clustering? Goal: Group similar items together Unsupervised No labeling effort Popular choice for large-scale exploratory data

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will Lectures Recent advances in Metamodel of Optimal Prognosis Thomas Most & Johannes Will presented at the Weimar Optimization and Stochastic Days 2010 Source: www.dynardo.de/en/library Recent advances in

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

CS6716 Pattern Recognition

CS6716 Pattern Recognition CS6716 Pattern Recognition Prototype Methods Aaron Bobick School of Interactive Computing Administrivia Problem 2b was extended to March 25. Done? PS3 will be out this real soon (tonight) due April 10.

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Multiobjective Data Clustering

Multiobjective Data Clustering To appear in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Multiobjective Data Clustering Martin H. C. Law Alexander P. Topchy Anil K. Jain Department of Computer Science

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Refined Shared Nearest Neighbors Graph for Combining Multiple Data Clusterings

Refined Shared Nearest Neighbors Graph for Combining Multiple Data Clusterings Refined Shared Nearest Neighbors Graph for Combining Multiple Data Clusterings Hanan Ayad and Mohamed Kamel Pattern Analysis and Machine Intelligence Lab, Systems Design Engineering, University of Waterloo,

More information

Correspondence Clustering: An Approach to Cluster Multiple Related Spatial Datasets

Correspondence Clustering: An Approach to Cluster Multiple Related Spatial Datasets Correspondence Clustering: An Approach to Cluster Multiple Related Spatial Datasets Vadeerat Rinsurongkawong, and Christoph F. Eick Department of Computer Science, University of Houston, Houston, TX 77204-3010

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm

An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm T.Saranya Research Scholar Snr sons college Coimbatore, Tamilnadu saran2585@gmail.com Dr. K.Maheswari

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Petr Somol 1,2, Jana Novovičová 1,2, and Pavel Pudil 2,1 1 Dept. of Pattern Recognition, Institute of Information Theory and

More information

Scalable Clustering of Signed Networks Using Balance Normalized Cut

Scalable Clustering of Signed Networks Using Balance Normalized Cut Scalable Clustering of Signed Networks Using Balance Normalized Cut Kai-Yang Chiang,, Inderjit S. Dhillon The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) Oct.

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Accumulation. Instituto Superior Técnico / Instituto de Telecomunicações. Av. Rovisco Pais, Lisboa, Portugal.

Accumulation. Instituto Superior Técnico / Instituto de Telecomunicações. Av. Rovisco Pais, Lisboa, Portugal. Combining Multiple Clusterings Using Evidence Accumulation Ana L.N. Fred and Anil K. Jain + Instituto Superior Técnico / Instituto de Telecomunicações Av. Rovisco Pais, 149-1 Lisboa, Portugal email: afred@lx.it.pt

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Instance-Wise Weighted Nonnegative Matrix Factorization for Aggregating Partitions with Locally Reliable Clusters

Instance-Wise Weighted Nonnegative Matrix Factorization for Aggregating Partitions with Locally Reliable Clusters Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 15) Instance-Wise Weighted Nonnegative Matrix Factorization for Aggregating s with Locally Reliable Clusters

More information

Improving Image Segmentation Quality Via Graph Theory

Improving Image Segmentation Quality Via Graph Theory International Symposium on Computers & Informatics (ISCI 05) Improving Image Segmentation Quality Via Graph Theory Xiangxiang Li, Songhao Zhu School of Automatic, Nanjing University of Post and Telecommunications,

More information

The Projected Dip-means Clustering Algorithm

The Projected Dip-means Clustering Algorithm Theofilos Chamalis Department of Computer Science & Engineering University of Ioannina GR 45110, Ioannina, Greece thchama@cs.uoi.gr ABSTRACT One of the major research issues in data clustering concerns

More information

Adaptive Metric Nearest Neighbor Classification

Adaptive Metric Nearest Neighbor Classification Adaptive Metric Nearest Neighbor Classification Carlotta Domeniconi Jing Peng Dimitrios Gunopulos Computer Science Department Computer Science Department Computer Science Department University of California

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

An Improvement of Centroid-Based Classification Algorithm for Text Classification

An Improvement of Centroid-Based Classification Algorithm for Text Classification An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,

More information

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2 A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation Kwanyong Lee 1 and Hyeyoung Park 2 1. Department of Computer Science, Korea National Open

More information

Identifying Layout Classes for Mathematical Symbols Using Layout Context

Identifying Layout Classes for Mathematical Symbols Using Layout Context Rochester Institute of Technology RIT Scholar Works Articles 2009 Identifying Layout Classes for Mathematical Symbols Using Layout Context Ling Ouyang Rochester Institute of Technology Richard Zanibbi

More information

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Clustering of Data with Mixed Attributes based on Unified Similarity Metric Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1

More information

Journal of Statistical Software

Journal of Statistical Software JSS Journal of Statistical Software August 2010, Volume 36, Issue 9. http://www.jstatsoft.org/ LinkCluE: A MATLAB Package for Link-Based Cluster Ensembles Natthakan Iam-on Aberystwyth University Simon

More information

A Memetic Heuristic for the Co-clustering Problem

A Memetic Heuristic for the Co-clustering Problem A Memetic Heuristic for the Co-clustering Problem Mohammad Khoshneshin 1, Mahtab Ghazizadeh 2, W. Nick Street 1, and Jeffrey W. Ohlmann 1 1 The University of Iowa, Iowa City IA 52242, USA {mohammad-khoshneshin,nick-street,jeffrey-ohlmann}@uiowa.edu

More information

Automatic Group-Outlier Detection

Automatic Group-Outlier Detection Automatic Group-Outlier Detection Amine Chaibi and Mustapha Lebbah and Hanane Azzag LIPN-UMR 7030 Université Paris 13 - CNRS 99, av. J-B Clément - F-93430 Villetaneuse {firstname.secondname}@lipn.univ-paris13.fr

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Supplementary Figure 1. Decoding results broken down for different ROIs

Supplementary Figure 1. Decoding results broken down for different ROIs Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas

More information

Explore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan

Explore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan Explore Co-clustering on Job Applications Qingyun Wan SUNet ID:qywan 1 Introduction In the job marketplace, the supply side represents the job postings posted by job posters and the demand side presents

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 11, November 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Image Segmentation for Image Object Extraction

Image Segmentation for Image Object Extraction Image Segmentation for Image Object Extraction Rohit Kamble, Keshav Kaul # Computer Department, Vishwakarma Institute of Information Technology, Pune kamble.rohit@hotmail.com, kaul.keshav@gmail.com ABSTRACT

More information

Logical Templates for Feature Extraction in Fingerprint Images

Logical Templates for Feature Extraction in Fingerprint Images Logical Templates for Feature Extraction in Fingerprint Images Bir Bhanu, Michael Boshra and Xuejun Tan Center for Research in Intelligent Systems University of Califomia, Riverside, CA 9252 1, USA Email:

More information

Model-based Clustering With Probabilistic Constraints

Model-based Clustering With Probabilistic Constraints To appear in SIAM data mining Model-based Clustering With Probabilistic Constraints Martin H. C. Law Alexander Topchy Anil K. Jain Abstract The problem of clustering with constraints is receiving increasing

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

On Error Correlation and Accuracy of Nearest Neighbor Ensemble Classifiers

On Error Correlation and Accuracy of Nearest Neighbor Ensemble Classifiers On Error Correlation and Accuracy of Nearest Neighbor Ensemble Classifiers Carlotta Domeniconi and Bojun Yan Information and Software Engineering Department George Mason University carlotta@ise.gmu.edu

More information

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification Flora Yu-Hui Yeh and Marcus Gallagher School of Information Technology and Electrical Engineering University

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Including the Size of Regions in Image Segmentation by Region Based Graph

Including the Size of Regions in Image Segmentation by Region Based Graph International Journal of Emerging Engineering Research and Technology Volume 3, Issue 4, April 2015, PP 81-85 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) Including the Size of Regions in Image Segmentation

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

Lecture 10: Semantic Segmentation and Clustering

Lecture 10: Semantic Segmentation and Clustering Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305

More information

Cost-sensitive Boosting for Concept Drift

Cost-sensitive Boosting for Concept Drift Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information

Graph-based High Level Motion Segmentation using Normalized Cuts

Graph-based High Level Motion Segmentation using Normalized Cuts Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information