Incorporating Spatial Similarity into Ensemble Clustering


M. Hidayath Ansari, Dept. of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
Nathanael Fillmore, Dept. of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
Michael H. Coen, Dept. of Computer Sciences and Dept. of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA

ABSTRACT
This paper addresses a fundamental problem in ensemble clustering, namely, how should one compare the similarity of two clusterings? The vast majority of prior techniques for comparing clusterings are entirely partitional, i.e., they examine assignments of points in set-theoretic terms after they have been partitioned. In doing so, these methods ignore the spatial layout of the data, disregarding the fact that this information is responsible for generating the clusterings to begin with. In this paper, we demonstrate the importance of incorporating spatial information into forming ensemble clusterings. We investigate the use of a recently proposed measure, called CDistance, which uses both spatial and partitional information to compare clusterings. We demonstrate that CDistance can be applied in a well-motivated way to four areas fundamental to existing ensemble techniques: the correspondence problem, subsampling, stability analysis, and diversity detection.

Categories and Subject Descriptors
I.5.3 [Clustering]: Similarity Measures

Keywords
Ensemble methods, comparing clusterings, clustering stability, clustering robustness

1. INTRODUCTION
Many of the most exciting recent advances in clustering seek to take advantage of ensembles of outputs of clustering algorithms. Ensemble methods have the potential to provide several important benefits over single clustering algorithms. As discussed in [], advantages of ensemble techniques in clustering problems include improved robustness across datasets and domains, increased stability of clustering solutions, and, perhaps most importantly, the ability to construct clusterings not individually attainable by any single algorithm.

Ensemble methods generally rely, at least implicitly and often explicitly, on a notion of a distance or similarity between clusterings. This is not surprising; for example, if the stability of a clustering algorithm is taken to mean that the algorithm does not produce highly different outputs on very similar inputs, then in order to measure stability one must be able to determine exactly how similar such outputs are to each other. These inputs can be perturbed versions of each other, or one may be a subsample of the other. Likewise, in order to create an integrated ensemble clustering from several individual clusterings, information about which clusterings are more or less similar, or which parts of which clusterings are compatible with each other and can be combined, is frequently useful.
Previous approaches to ensemble clustering have generally used partitional techniques to measure similarity between clusterings, i.e., they do not take into account spatial information about the points in each clustering, but only their identities. Such techniques include the Hubert-Arabie index [14], the Rand index [19], van Dongen's metric [22], Variation of Information [17], and nearly all other previous work on comparing clusterings []. As Table 1 shows, it is in many contexts a major weakness to ignore spatial information in clustering comparisons, because ignoring the location of data points whose labels change across clusterings can lead to a misleading quantification of the difference between the clusterings.

The present paper proposes the use of a very recently introduced method for comparing clusterings, called CDistance [5], which incorporates spatial information about the points in each clustering. Specifically, as we show in Section 3 and in more detail in [5], CDistance gives us a measure of the extent to which the constituent clusters of each clustering overlap with those of the other. We have found that CDistance is an effective tool to measure the similarity of multiple clusterings. It should be stressed that the goal of this paper is not to introduce an end-to-end ensemble clustering framework, but rather to introduce more robust alternatives for steps in the generation, evaluation, and combination of clusterings commonly used in ensemble methods. We provide a brief theoretical treatment of selected issues arising in popular ensemble clustering techniques. Specifically, we identify four critical areas in which CDistance can be used in existing ensemble clustering techniques: the correspondence problem, subsampling, stability analysis, and detecting diversity.

In Section 2 we establish preliminary definitions and notation, and in Section 3 we construct CDistance. We give a short introduction to ensemble clustering in Section 4. In Section 5 we discuss how CDistance can be applied in the setting of ensemble clustering. We discuss related work in Section 6, and indicate directions for future work in Section 7.

2. PRELIMINARIES
As a brief introduction (further detail is provided in Section 3), CDistance works as follows. Given two clusterings, we first compute a pairwise distance matrix between the clusters in one and the clusters in the other. The distance function used takes into account the interpoint distances between elements of the two clusters being compared and is called the optimal transportation distance. The clusterings can then be thought of as two subsets (consisting of their clusters as elements) of a metric space with the optimal transportation distance as the metric, where we know the value of the metric at each point of interest (the entries of the matrix). We then compute the so-called similarity distance between these two collections; the similarity distance between two point sets in a metric space measures the degree to which they overlap in that space, in a sense we will make precise. In the following subsections, we construct the components required to compute CDistance. CDistance is defined in terms of the optimal transportation distance and the naive transportation distance, and we begin by briefly describing these.

2.1 Clustering
A (hard) clustering $\mathcal{A}$ is a partition of a dataset $D$ into sets $A_1, A_2, \ldots, A_K$ called clusters such that $A_i \cap A_j = \emptyset$ for all $i \neq j$, and $\bigcup_{k=1}^{K} A_k = D$. We assume $D$, and therefore $A_1, \ldots, A_K$, are finite subsets of a metric space $(\Omega, d_\Omega)$.

2.2 Optimal Transportation Distance
The optimal transportation problem [13, 18] asks: what is the cheapest way to move a set of masses from sources to sinks that are some distance away? Here, cost is defined as the total mass times the distance moved. To make the problem concrete, one can think of the sources as factories and the sinks as warehouses. We assume that the sources are shipping exactly as much mass as the sinks are expecting.

Formally, in the optimal transportation problem, we are given weighted point sets $(A, p)$ and $(B, q)$, where $A = \{a_1, \ldots, a_{|A|}\}$ is a finite subset of the metric space $(\Omega, d_\Omega)$, $p = (p_1, \ldots, p_{|A|})$ is a vector of associated nonnegative weights summing to one, and similar definitions hold for $B$ and $q$. The optimal transportation distance between $(A, p)$ and $(B, q)$ is defined as
$$d_{OT}(A, B; p, q, d_\Omega) = \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} f^*_{ij}\, d_\Omega(a_i, b_j),$$
where the optimal flow $F^* = (f^*_{ij})$ between $(A, p)$ and $(B, q)$ is the solution of the linear program
$$\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} f_{ij}\, d_\Omega(a_i, b_j) \quad \text{over } F = (f_{ij}) \\
\text{subject to} \quad & f_{ij} \geq 0, && 1 \leq i \leq |A|,\ 1 \leq j \leq |B|, \\
& \textstyle\sum_{j=1}^{|B|} f_{ij} = p_i, && 1 \leq i \leq |A|, \\
& \textstyle\sum_{i=1}^{|A|} f_{ij} = q_j, && 1 \leq j \leq |B|, \\
& \textstyle\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} f_{ij} = 1.
\end{aligned}$$
It is useful to view the optimal flow as a representation of the maximally cooperative way to transport masses between sources and sinks. Here, cooperative means that the sources are allowed to exchange delivery obligations in order to transport their masses with a globally minimal cost.
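To make the definition concrete, the transportation linear program above can be solved directly with an off-the-shelf LP solver. The following is a minimal sketch of our own (not the authors' implementation, which uses the transportation simplex method); it assumes NumPy and SciPy, a Euclidean ground distance, and hypothetical function names.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import coo_matrix
from scipy.spatial.distance import cdist

def transport_lp(C, p, q):
    """Solve the transportation LP for cost matrix C and marginals p, q.

    C[i, j] is the cost of moving one unit of mass from source i to sink j;
    p and q are nonnegative weight vectors, each summing to one.
    Returns the optimal objective value, i.e. sum_ij f*_ij C_ij.
    """
    n_a, n_b = C.shape
    # The flow matrix F is flattened row-major: variable k = i * n_b + j.
    # Each variable appears in one row constraint (sum_j f_ij = p_i) and
    # one column constraint (sum_i f_ij = q_j).
    i_idx, j_idx = np.divmod(np.arange(n_a * n_b), n_b)
    rows = np.concatenate([i_idx, n_a + j_idx])
    cols = np.concatenate([np.arange(n_a * n_b), np.arange(n_a * n_b)])
    A_eq = coo_matrix((np.ones(2 * n_a * n_b), (rows, cols)),
                      shape=(n_a + n_b, n_a * n_b))
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return float(res.fun)

def optimal_transport_distance(A, B, p, q):
    """d_OT((A, p), (B, q)) with Euclidean ground distance between points."""
    return transport_lp(cdist(A, B), p, q)
```

Since p and q each sum to one, the total-flow constraint from the definition is implied by the row and column constraints and is omitted in this sketch.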
2.3 Naive Transportation Distance
In contrast with the optimal solution described above, we define here a naive solution to the transportation problem. Here, the sources are each responsible for individually distributing their masses proportionally to the sinks. In this case, none of the sources cooperate, leading to inefficiency in shipping the overall mass to the sinks.

Formally, in the naive transportation problem, we are given weighted point sets $(A, p)$ and $(B, q)$ as above. We define the naive transportation distance between $(A, p)$ and $(B, q)$ as
$$d_{NT}(A, B; p, q, d_\Omega) = \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} p_i\, q_j\, d_\Omega(a_i, b_j).$$

2.4 Similarity Distance
Our next definition concerns the relationship between the optimal transportation distance and the naive transportation distance. Here we are interested in the degree to which cooperation reduces the cost of moving the source $A$ onto the sink $B$. Formally, we are given weighted point sets $(A, p)$ and $(B, q)$ as above. We define the similarity distance between $(A, p)$ and $(B, q)$ as the ratio
$$d_S(A, B; p, q, d_\Omega) = \frac{d_{OT}(A, B; p, q, d_\Omega)}{d_{NT}(A, B; p, q, d_\Omega)}.$$
When $A$ and $B$ perfectly overlap, there is no cost to moving $A$ onto $B$, so both the optimal transportation distance and the similarity distance between $A$ and $B$ are 0. On the other hand, when $A$ and $B$ are very distant from each other, each point in $A$ is much closer to all other points in $A$ than to any point in $B$, and vice versa. Thus, in this case, cooperation does not yield any significant benefit, so $d_{OT}$ is nearly as large as $d_{NT}$, and $d_S$ is close to 1. (This behavior is illustrated in Figure 1.) In this sense, the similarity distance between two point sets measures the degree to which they spatially overlap, or are similar to each other. (Note that the relationship is actually inverse: the similarity distance is the degree of spatial overlap subtracted from one.)

Another illustration of this behavior is given in Figure 2. Each panel in that figure shows two overlapping point sets to which we assign uniform weight distributions. The degree of spatial overlap of the two point sets in Example A is far higher than the degree of spatial overlap of the point sets in Example B, even though in absolute terms the amount of work required to optimally move one point set onto the other is much higher in Example A than in Example B.
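Continuing the sketch above (same assumptions and hypothetical names, reusing optimal_transport_distance), the naive transportation distance and the similarity distance follow directly from their definitions, and a tiny example illustrates the limiting behavior just described.

```python
import numpy as np
from scipy.spatial.distance import cdist

def naive_transport_distance(A, B, p, q):
    """d_NT((A, p), (B, q)): source i independently ships p_i * q_j to sink j."""
    return float(p @ cdist(A, B) @ q)

def similarity_distance(A, B, p, q):
    """d_S = d_OT / d_NT: near 0 for overlapping sets, near 1 for distant sets."""
    return optimal_transport_distance(A, B, p, q) / naive_transport_distance(A, B, p, q)

# Illustration of the limiting behavior described in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
w = np.full(60, 1 / 60)
print(similarity_distance(X, X + 0.05, w, w))   # nearly overlapping: close to 0
print(similarity_distance(X, X + 50.0, w, w))   # far apart: close to 1
```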

Table 1: Distances to Modified Clusterings [5]. Each row in Table 1(a) depicts a dataset with points colored according to a reference clustering R (left column) and two different modifications of this clustering (center and right columns). For each example, Table 1(b) presents the distance between the reference clustering and each modification for the indicated clustering comparison techniques. The column labeled "?" indicates whether the technique provides sensible output. In Examples 1-3, we only modify cluster assignments; the points are stationary. Since the pairwise relationships among the points change in the same way in each modification, only ADCO and CDistance detect that the modifications are not identical with respect to the reference. In Example 4, the reference clustering is modified by moving three points of the bottom-right cluster by small amounts. However, in Modification 1, the points do not move across a bin boundary, whereas in Modification 2, the points do move across a bin boundary. As a result, ADCO detects no change between Modification 1 and the reference clustering but detects a large change between Modification 2 and the reference clustering, even though the two modifications differ by only a small amount. CDistance correctly reports a similar change between Modification 1 and the reference and between Modification 2 and the reference. (Since the points actually move, other clustering comparison techniques are not applicable to this example.)

Table 1(a): datasets for Examples 1-4, each shown with its reference clustering R, Modification 1, and Modification 2 (graphical; not reproduced here).

Table 1(b): for each of Examples 1-4, the distances d(R,1) and d(R,2) and the sensibility indicator "?" for the techniques Hubert, Rand, van Dongen, VI, Zhou, ADCO, and CDistance (numeric entries not preserved in this transcription).

Figure 1: The plot on the right shows the similarity distance as a function of the separation distance between the two point sets displayed on the left. The similarity distance is zero when the point sets perfectly overlap and approaches one as the distance increases.

Figure 2: Similarity distance measures the degree of spatial overlap. We consider the two point sets in Example A to be far more similar to one another than those in Example B. This is the case even though they occupy far more area in absolute terms and would be deemed further apart by many distance metrics.

3. CDISTANCE
In the clustering distance problem, we are given two clusterings $\mathcal{A}$ and $\mathcal{B}$ of points in a metric space $(\Omega, d_\Omega)$. Our goal is to determine a distance between them that captures both spatial and partitional information. We do this using the framework defined above.

Conceptually, our approach is as follows. Given a pair of clusterings $\mathcal{A}$ and $\mathcal{B}$, we first construct a new metric space, distinct from the metric space in which the original data lie, which contains one distinct element for each cluster in $\mathcal{A}$ and $\mathcal{B}$. We define the distance between any two elements of this new space to be the optimal transportation distance between the corresponding clusters. The clusterings $\mathcal{A}$ and $\mathcal{B}$ can now be thought of as weighted point sets in this new space, and the degree of similarity between $\mathcal{A}$ and $\mathcal{B}$ can be thought of as the degree of spatial overlap between the corresponding weighted point sets in this new space.

Formally, we are given clusterings $\mathcal{A}$ and $\mathcal{B}$ of datasets $D$ and $E$ in the metric space $(\Omega, d_\Omega)$. We define the clustering distance (CDistance) between $\mathcal{A}$ and $\mathcal{B}$ as the quantity
$$d(\mathcal{A}, \mathcal{B}) = d_S(\mathcal{A}, \mathcal{B}; \pi, \rho, d_{OT}),$$
where the weights $\pi = (|A|/|D| : A \in \mathcal{A})$ and $\rho = (|B|/|E| : B \in \mathcal{B})$ are proportional to the number of points in the clusters, and where the distance $d_{OT}$ between clusters $A \in \mathcal{A}$ and $B \in \mathcal{B}$ is the optimal transportation distance $d_{OT}(A, B; p, q, d_\Omega)$ with uniform weights $p = (1/|A|, \ldots, 1/|A|)$ and $q = (1/|B|, \ldots, 1/|B|)$.

The Clustering Distance Algorithm proceeds in two steps:

Step 1. For every pair of clusters $(A, B) \in \mathcal{A} \times \mathcal{B}$, compute the optimal transportation distance $d_{OT}(A, B; p, q, d_\Omega)$ between $A$ and $B$, based on the distances according to $d_\Omega$ between the points in $A$ and $B$.

Step 2. Compute the similarity distance $d_S(\mathcal{A}, \mathcal{B}; \pi, \rho, d_{OT})$ between the clusterings $\mathcal{A}$ and $\mathcal{B}$, based on the optimal transportation distances between clusters in $\mathcal{A}$ and $\mathcal{B}$ that were computed in Step 1. Call this quantity the clustering distance $d(\mathcal{A}, \mathcal{B})$ between the clusterings $\mathcal{A}$ and $\mathcal{B}$.

The reader will notice that we use the optimal transportation distance to measure the distance between individual clusters (Step 1), while we use the similarity distance to measure the distance between the clusterings as a whole (Step 2). The reason for this difference is as follows. In Step 1, we are interested in the absolute distances between clusters: we want to know how much work is needed to move all the points in one cluster onto the other cluster. We use this as a measure of how far one cluster is from another. In contrast, in Step 2, we want to know the degree to which the clusters in one clustering spatially overlap with those in another clustering, using the distances derived in Step 1. Our choice to use uniform weights in Step 1 but proportional weights in Step 2 has a similar motivation.
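A compact sketch of the two-step algorithm, reusing transport_lp and optimal_transport_distance from the Section 2 sketches (again our own illustration with hypothetical names, not the authors' code); each clustering is represented as a list of arrays, one array of points per cluster.

```python
import numpy as np

def cdistance(clustering_a, clustering_b):
    """CDistance between two (hard) clusterings given as lists of point arrays."""
    n_a = sum(len(A) for A in clustering_a)
    n_b = sum(len(B) for B in clustering_b)

    # Step 1: pairwise optimal transportation distances between clusters,
    # with uniform weights on the points inside each cluster.
    M = np.array([[optimal_transport_distance(A, B,
                                               np.full(len(A), 1 / len(A)),
                                               np.full(len(B), 1 / len(B)))
                   for B in clustering_b]
                  for A in clustering_a])

    # Step 2: similarity distance between the clusterings as wholes. The
    # "points" are now the clusters, the ground distance is the matrix M,
    # and the weights are proportional to cluster sizes.
    pi = np.array([len(A) / n_a for A in clustering_a])
    rho = np.array([len(B) / n_b for B in clustering_b])
    return transport_lp(M, pi, rho) / float(pi @ M @ rho)
```

For clusterings supplied as label vectors over the same dataset, one would first group the points by label into such lists of arrays.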
In a (hard) clustering, each point in a given cluster contributes as much to that cluster as any other point in the cluster, so we weight points uniformly when comparing individual clusters. In contrast, Step 2 distributes the influence of each cluster in the overall computation of CDistance proportionally to its relative weight, as determined by the number of data points it contains. Thus CDistance captures both spatial and categorical information in a single measure. Figure 3 contains some comparisons between example clusterings and the associated CDistance values.

3.1 Computational Complexity
The complexity of the algorithm presented above for computing CDistance is dominated by the complexity of computing the optimal transportation distance. The optimal transportation distance between two point sets of cardinality at most n can be computed in worst case O(n^3 log n) time [20]. Recently, a number of linear or sublinear time approximation algorithms have been developed for several variants of the optimal transportation problem, e.g., [20, 2]. To verify that our method is efficient in practice, we tested our implementation for computing the optimal transportation distance, which uses the transportation simplex algorithm, on several hundred thousand pairs of large point sets sampled from a variety of Gaussian distributions. The average running times were well fit by a low-degree polynomial in n, where n is the cardinality of the larger of the two point sets being compared. For enormous point sets, one can employ standard binning techniques to further reduce the runtime []. In contrast, the only other spatially aware clustering comparison method that we know of requires explicitly enumerating the exponential space of all possible permutations of matches between clusters in A and clusters in B [3].
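The paper's timing experiment uses a transportation simplex implementation; as a much rougher, purely illustrative check of how the generic LP sketch above scales, one might time it on Gaussian samples of increasing size (the setup and the sizes below are our own choices, not the authors').

```python
import time
import numpy as np

rng = np.random.default_rng(1)
for n in (50, 100, 200):
    A = rng.normal(size=(n, 2))
    B = rng.normal(loc=1.0, size=(n, 2))
    w = np.full(n, 1 / n)
    start = time.perf_counter()
    optimal_transport_distance(A, B, w, w)
    print(f"n = {n}: {time.perf_counter() - start:.3f} s")
```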

Figure 3: (a) Identical clusterings and identical datasets; CDistance is 0. (b) Similar clusterings, but over slightly perturbed datasets. (c) Two different algorithms were used to cluster the same dataset; CDistance indicates a moderate mismatch of clusterings. (d) Two very different clusterings generated by spectral clustering over almost-identical datasets; CDistance is 0.90, suggesting instability in the clustering algorithm's output paired with this dataset. Note: in each subfigure, we are comparing the two clusterings, one of which has been translated for visualization purposes. For example, in (a), the two clusterings spatially overlap perfectly, making their CDistance value 0.

4. ENSEMBLE CLUSTERING
An ensemble clustering is a consensus clustering of a dataset D based on a collection of R clusterings. The goal is to combine aspects of each of these clusterings in a meaningful way so as to obtain a more robust and superior clustering that combines the contributions of each clustering. It is frequently the case that one set of algorithms can find a certain kind of structure or pattern within a dataset while another set of algorithms can find a different structure (for an example see Figure 7). An ensemble clustering framework tries to constructively combine inputs from different clusterings to generate a consensus clustering.

There are two main steps involved in any ensemble framework: (1) generation of a number of clusterings, and (2) their combination. An intermediate step may be to evaluate the diversity within the generated set of clusterings. Clusterings may be generated in a number of ways: by using different algorithms, by supplying different parameters to algorithms, by clustering subsets of the data, and by clustering the data with different feature sets [8]. The consensus clustering can also be obtained in a number of ways, such as linkage clustering on pairwise co-association matrices [11], graph-based methods [9], and maximum likelihood methods [1].

5. USING CDISTANCE
In this section, we show how CDistance can be used to replace parts of the combination step in ensemble clustering.

5.1 The Correspondence Problem
One of the major problems in the combination step of ensemble clustering, especially in algorithms that use co-association matrices, has been the correspondence problem: given a pair of clusterings A and B, for each cluster A_i in A, which cluster B_j in B corresponds to A_i? The reason this problem is difficult is that, unlike in classification, classes (clusters) in a clustering are unlabeled and unordered, so relationships between the clusters of one clustering and the clusters of another are not immediately obvious, nor is it even obvious whether a meaningful relationship exists at all. Current solutions to this problem rely on the degree of set overlap to determine matching clusters [10]. Between any two clusterings A and B, the pairwise association matrix contains at the (i, j)-th entry the number of elements common to cluster i of clustering A and cluster j of clustering B.
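For concreteness, the traditional set-overlap association matrix just described can be sketched as follows (an illustration with hypothetical names; it assumes both clusterings are given as label vectors over the same points, with labels 0, 1, ..., k-1).

```python
import numpy as np

def label_overlap_matrix(labels_a, labels_b):
    """Entry (i, j) counts points placed in cluster i by one clustering and
    in cluster j by the other; both label vectors index the same points."""
    k_a, k_b = labels_a.max() + 1, labels_b.max() + 1
    M = np.zeros((k_a, k_b), dtype=int)
    np.add.at(M, (labels_a, labels_b), 1)  # unbuffered counting per (i, j) pair
    return M
```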
An algorithm such as the Hungarian algorithm may be employed to determine the highest-weight matching (correspondence) between clusters. The correspondence problem becomes even more difficult when clustering A has a different number of clusters than clustering B. What are the cluster correspondences in that case? Some approaches (e.g., []) restrict all members of an ensemble to the same number of clusters. Others use the Hungarian algorithm on the association matrix followed by greedy matching to determine correspondences between clusters. A further complication arises when clusterings A and B are clusterings of different underlying data points. This frequently happens, for example, when members of an ensemble are generated by clustering subsamples of the original dataset. In this case also, the traditional method of solving the correspondence problem fails. We show how both problems can be approached neatly using CDistance in this section and in Section 5.2.

We formulate an elegant solution to the correspondence problem, one that also addresses these complications, using a variant of the first step of CDistance, as follows. Given clusterings A = {A_1, ..., A_n} and B = {B_1, ..., B_m}, we construct a pairwise association matrix M using the optimal transportation distance with equal, unnormalized weights on each point of both clusters. That is, we set the (i, j) entry
$$M_{ij} = d_{OT}(A_i, B_j; \mathbf{1}, \mathbf{1}, d_\Omega), \qquad \text{where } \mathbf{1} = (1, \ldots, 1),$$
for i = 1, ..., n and j = 1, ..., m. This is exactly what is needed to determine whether two clusters should share the same label. Figure 4 contains demonstrative examples of the behavior of this function. We can then proceed with the application of the Hungarian algorithm, as with other techniques, to determine the correspondences between clusters, or simply take the entry with the lowest value in each row and column as identifying the corresponding cluster. Since this can be done pairwise for a set of clusterings in an ensemble, a consensus function of choice may then be applied in order to generate the integrated clustering.
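A sketch of this CDistance-based correspondence step (our own illustration, with hypothetical names, reusing optimal_transport_distance from the Section 2 sketch). Note one simplification: the association matrix defined above uses equal, unnormalized weights on every point, i.e. an unbalanced transportation problem; for brevity this sketch uses uniform normalized weights instead, which preserves the spatial character of the matching but is not identical to the quantity in the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def correspond_clusters(clustering_a, clustering_b):
    """Match clusters across two clusterings by minimizing total spatial cost."""
    M = np.array([[optimal_transport_distance(A, B,
                                               np.full(len(A), 1 / len(A)),
                                               np.full(len(B), 1 / len(B)))
                   for B in clustering_b]
                  for A in clustering_a])
    # Hungarian algorithm on the spatial association matrix; it handles
    # rectangular matrices, so the clusterings may have different numbers
    # of clusters and need not share underlying data points.
    rows, cols = linear_sum_assignment(M)
    return list(zip(rows.tolist(), cols.tolist())), M
```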

The idea is that we can conjecture with more confidence that two clusters correspond to each other if we know they occupy similar regions in space than if we know only that their points share membership more than with other clusters. Looking at the actual data and the space it occupies is more informative than looking at the labels alone. Another advantage of this method is the absence of any requirement that the point sets be the same in the two clusterings. For label-based correspondence solutions, set arithmetic is performed on point labels, so the underlying points must stay the same in both clusterings. The generality of d_OT allows any two point sets to be given to it as input. Finally, yet another advantage of this technique is that it is able to determine matches between a whole cluster on one side and smaller clusters on the other side that, taken together, correspond to the first cluster. An illustration of this can be seen in Figures 5 and 6, where the criss-crossing lines indicate matches between clusters with zero or almost-zero d_OT value.

5.2 Subsampling
Many ensemble clustering methods use subsampling to generate multiple clusterings (e.g., [9]). A fraction of the data is sampled at random, with or without replacement, and clustered in each run. Comparison, if needed, is typically performed via one of a number of partitional methods. In this subsection, we illustrate how, instead of relying on a partitional method, CDistance can be used to compare clusterings from subsampled datasets. Figure 5 depicts an example in which we repeatedly cluster subsamples of a dataset with both Self-Tuning Spectral Clustering [23] and Affinity Propagation [7]. Although these algorithms rely on mathematically distinct properties, we see that their resultant clusterings on subsampled data agree to a surprising extent according to CDistance. This agreement suggests a phenomenon that a practitioner or algorithm can use to inform further investigation or inference.

5.3 Algorithm-Dataset Match and Stability Analysis
CDistance can provide guidance on the suitability of a particular algorithm family for a given dataset. Depending on the kind of structure one wishes to detect in the data, some clustering algorithms may not be suitable. In [4], the authors propose that the stability of a clustering can be determined by repeatedly clustering subsamples of its dataset. Finding consistently high similarity across clusterings indicates consistency in locating similar substructures in the dataset. This can increase confidence in the applicability of an algorithm to a particular distribution of data. In other words, by comparing the resultant clusterings, one can obtain a goodness-of-fit between a dataset and a clustering algorithm. The clustering comparisons in [4] were all via partitional methods.

Figure 3(d) shows a striking example of an algorithm-dataset mismatch. In the two panels shown, one dataset is a slightly modified version of the other, having had a small amount of noise added to it. A spectral clustering algorithm was used in both cases to cluster them, resulting in wildly variant clusterings (CDistance = 0.7). This is an indication that spectral clustering is not an appropriate algorithm to use on this dataset.
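The subsample-based stability check described in this subsection and the previous one can be sketched as follows, reusing the cdistance function from the Section 3 sketch. The function name, the subsample fraction, and the number of runs are placeholders of our own; cluster_fn is any user-supplied routine that maps a data matrix to a list of cluster point-arrays.

```python
import numpy as np

def subsample_stability(X, cluster_fn, n_runs=10, frac=0.8, seed=0):
    """Mean pairwise CDistance over clusterings of random subsamples of X.

    Consistently low values suggest the algorithm finds similar spatial
    structure across subsamples, i.e. a good algorithm-dataset match."""
    rng = np.random.default_rng(seed)
    clusterings = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        clusterings.append(cluster_fn(X[idx]))
    pairs = [(i, j) for i in range(n_runs) for j in range(i + 1, n_runs)]
    return float(np.mean([cdistance(clusterings[i], clusterings[j])
                          for i, j in pairs]))
```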
Figure 5: Comparing outputs of different clustering algorithms over randomly sampled subsets of a dataset. The low values of CDistance indicate relative similarity and stability between the clusterings. The numbers of clusters involved in each comparison are displayed beside the CDistance value.

Figure 6: Comparing outputs of different clustering algorithms over randomly sampled subsets of a dataset. The low values of CDistance indicate relative similarity and stability between the clusterings. The numbers of clusters involved in each comparison are displayed beside the CDistance value.

In Figures 5 and 6 we see multiple clusterings generated by clustering subsamples and by clustering data with small amounts of noise added, respectively. In both cases, CDistance maintains a relatively low value across instances, and we can conclude that the clusterings remain stable with this dataset and algorithm combination.

5.4 Detecting Diversity
Some amount of diversity among the clusterings constituting the ensemble is crucial to the improvement of the final generated clustering [12]. In the previous subsection we noted that CDistance was able to detect similarity between clusterings of subsets of a dataset. A corollary is that CDistance is also able to detect diversity among members of an ensemble. Previous methods have used the Adjusted Rand Index or other such metrics to quantify diversity. As we demonstrate in Section 6, these measures are surprisingly insensitive to large differences in clusterings. CDistance provides a smoother, spatially sensitive measure of the similarity or dissimilarity of two clusterings, leading to a more effective characterization of diversity within an ensemble. Measurement of diversity using CDistance can guide the generation step by indicating whether the members of an ensemble are sufficiently diverse or differ from each other only minutely.

5.5 Graph-Based Methods
Several ensemble clustering frameworks employ graph-based algorithms for the combination step [21, 8]. One popular idea is to construct a suitable graph taking into account the similarity between clusterings and to derive a graph cut or partition that optimizes some objective function. [9] describes three main graph-based methods: instance-based, cluster-based, and hypergraph partitioning, also discussed in [21]. The cluster-based formulation in particular requires a measure of similarity between clusters, originally the Jaccard index. The unsuitability of this index and other related indices is discussed in Section 6. CDistance can easily be substituted here as a more informative way of measuring similarity between clusterings, specifically because it exploits both spatial and partitional information about the data.

5.6 Example
As an example of CDistance being used in an ensemble clustering framework, we applied it to an image segmentation problem. Segmentations of images can be viewed as clusterings of their constituent pixels. We examined images from the Berkeley Segmentation Database [16]. Each image is accompanied by segmentations of unknown origin collected by the database aggregators. Examining these image segmentations reveals certain common features, e.g., sensitivity to edges, as well as extreme differences in the detail found in various image regions. We used CDistance to combine independent image segmentations, as shown in Figure 7. We computed correspondences between clusters in the two segmentations and identified clusters which have near-zero d_OT to more than one cluster from the other segmentation. The combined segmentation was generated by replacing larger clusters in one clustering with the smaller clusters from the other clustering that matched them.

Figure 7: Combining multiple image segmentations. The original image is shown in (a). Two segmentations, each providing differing regions of detail, are shown in (b) and (c). The CDistance between the clusterings in (b) and (c) is 0.09. We use CDistance to combine them in (d), rendering the false-color image in (e), which displays partitioned regions by color.
Note that we are able to combine image segments that are not spatially adjacent, as indicated by the colors in (e). This image and the two segmentations were chosen from among many for their relative compatibility (the CDistance between the two segmentations is 0.09), as well as for the fact that each of them provides detail in parts of the image where the other does not. Using CDistance enables us to ensure that only segmentations covering the same regions of image space are aggregated to yield a more informative segmentation. In other words, each segmentation is able to fill in gaps in the other.

6. RELATED WORK
In this section, we present simple examples that demonstrate the insensitivity of popular comparison techniques to spatial differences between clusterings. Our approach is straightforward: for each technique, we present a baseline reference clustering. We then compare it to two other clusterings that it cannot distinguish, i.e., the comparison method assigns the same distance from the baseline to each of these clusterings. However, in each case, the clusterings compared to the baseline are significantly different from one another, both from a spatial, intuitive perspective and according to our measure CDistance. Our intent is to demonstrate that these methods are unable to detect change even when it appears clear that a non-trivial change has occurred. These examples are intended to be didactic and, as such, are composed of only enough points to be illustrative. Because of limitations in most of these approaches, we can only consider clusterings over identical sets of points; they are unable to tolerate noise in the data.

Figure 4: The behavior of d_OT with weight vectors. In the four figures to the left, the cluster of red points is the same in each figure. In the first and third figures from the left, where one cluster almost perfectly overlaps with the other, d_OT is very small in both cases when compared, for example, to d_OT between the red cluster and the blue cluster in the fourth figure from the left. This is because we would like our distance measure between clusters to focus heavily on spatial locality and to be close to zero when one cluster occupies partly or wholly the same region of space as another. In the fourth figure from the left, the two smaller clusters (red and blue) each have near-zero d_OT to the larger green cluster. In the second figure from the left, the green cluster is a slightly translated version of the red one but is still very close to it spatially; d_OT in this case is still very small. The fifth figure from the left shows two point sets shaped like arcs that overlap to a large degree; in this case d_OT is small, indicating that the point sets are very close to each other. The significance of d_OT is apparent when dealing with data in high dimensions, where complete visualization is not possible.

Among the earliest known methods for comparing clusterings is the Jaccard index []. It measures the fraction of assignments on which different partitionings agree. The Rand index [19], among the best known of these techniques, is based on changes in point assignments. It is calculated as the fraction of points, taken pairwise, whose assignments are consistent between two clusterings. This approach has been built upon by many others. For example, [14] addresses the Rand index's well-known problem of overestimating similarity on randomly clustered datasets. These three measures are part of a general class of clustering comparison techniques based on tuple counting or set-based membership. Other measures in this category include the Mirkin, van Dongen, Fowlkes-Mallows, and Wallace indices, a discussion of which can be found in [14, 17].

The Variation of Information approach was introduced by [17], who defined an information-theoretic metric between clusterings. It measures the information lost and gained in moving from one clustering to another over the same set of points. Perhaps its most important property is convex additivity, which ensures locality in the metric; namely, changes within a cluster (e.g., refinement) cannot affect the similarity of the rest of the clustering. Normalized Mutual Information, used in [21], shares similar properties, also being based on information theory.

The work presented in [24] shares our motivation of incorporating some notion of distance into comparing clusterings. However, the distance is computed over a space of indicator vectors for each cluster, representing whether each data point is a member of that cluster. Thus, similarity over clusters is measured by their sharing of points and does not take into account the locations of these points in a metric space.

Finally, we examine [3], who present the only other method that directly employs spatial information about the points. Their approach first bins all points being clustered along each attribute. They then determine the density of each cluster over the bins.
Finally, the distance between two clusterings is defined as the minimal sum of pairwise cluster-density dot products (derived from the binning), taken over all possible permutations matching the clusters. We note that this is in general not a feasible computation. The number of bins grows exponentially with the dimensionality of the space and, more importantly, examining all matchings between clusters requires O(n!) time, where n is the number of clusters.

7. DISCUSSION AND FUTURE WORK
We have shown how the clustering distance (CDistance) measure introduced in [5] can be used in the ensemble methods framework for several applications. In future work we will incorporate CDistance into existing ensemble clustering algorithms and evaluate its ability to improve upon current methodologies. We are also continuing to study the theoretical properties of CDistance, with applications in ensemble clustering and unsupervised data mining.

8. ACKNOWLEDGEMENTS
This work was supported by the School of Medicine and Public Health, the Wisconsin Alumni Research Foundation, the Department of Biostatistics and Medical Informatics, and the Department of Computer Sciences at the University of Wisconsin-Madison.

9. REFERENCES
[1] A. Topchy, A. K. Jain, and W. Punch. A mixture model for clustering ensembles. In Proceedings of the SIAM International Conference on Data Mining (SDM), pages 379-390, 2004.
[2] K. D. Ba, H. L. Nguyen, H. N. Nguyen, and R. Rubinfeld. Sublinear time algorithms for earth mover's distance. arXiv abs/0904.0292, 2009.
[3] E. Bae, J. Bailey, and G. Dong. Clustering similarity comparison using density profiles. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence. Springer LNCS, 2006.

[4] A. Ben-Hur, A. Elisseeff, and I. Guyon. A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pages 6-17, 2002.
[5] M. H. Coen, M. H. Ansari, and N. Fillmore. Comparing clusterings in space. In ICML 2010: Proceedings of the 27th International Conference on Machine Learning (to appear), 2010.
[6] S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9):1090-1099, 2003.
[7] D. Dueck and B. J. Frey. Non-metric affinity propagation for unsupervised image categorization. In IEEE International Conference on Computer Vision, 2007.
[8] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning, pages 186-193, 2003.
[9] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, USA, 2004. ACM.
[10] A. Fred. Finding consistent clusters in data partitions. In Proceedings of the 2nd International Workshop on Multiple Classifier Systems. Springer, 2001.
[11] A. L. N. Fred and A. K. Jain. Data clustering using evidence accumulation. In ICPR, pages 276-280, 2002.
[12] S. T. Hadjitodorov, L. I. Kuncheva, and L. P. Todorova. Moderate diversity for better cluster ensembles. Information Fusion, 7(3):264-275, 2006.
[13] F. Hillier and G. Lieberman. Introduction to Mathematical Programming. McGraw-Hill, 1995.
[14] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193-218, 1985.
[15] E. Levina and P. Bickel. The earth mover's distance is the Mallows distance: Some insights from statistics. In ICCV, 2001.
[16] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 8th International Conference on Computer Vision, volume 2, pages 416-423, July 2001.
[17] M. Meilă. Comparing clusterings: an axiomatic view. In ICML, 2005.
[18] S. T. Rachev and L. Ruschendorf. Mass Transportation Problems: Volume I: Theory (Probability and its Applications). Springer, 1998.
[19] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846-850, 1971.
[20] S. Shirdhonkar and D. Jacobs. Approximate earth mover's distance in linear time. In CVPR, 2008.
[21] A. Strehl and J. Ghosh. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583-617, 2002.
[22] S. van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Technical report, National Research Institute for Mathematics and Computer Science in the Netherlands, 2000.
[23] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.
[24] D. Zhou, J. Li, and H. Zha. A new Mallows distance based metric for comparing clusterings. In Proceedings of the 22nd International Conference on Machine Learning, pages 1028-1035, 2005.


More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Image Segmentation. Shengnan Wang

Image Segmentation. Shengnan Wang Image Segmentation Shengnan Wang shengnan@cs.wisc.edu Contents I. Introduction to Segmentation II. Mean Shift Theory 1. What is Mean Shift? 2. Density Estimation Methods 3. Deriving the Mean Shift 4. Mean

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

On Order-Constrained Transitive Distance

On Order-Constrained Transitive Distance On Order-Constrained Transitive Distance Author: Zhiding Yu and B. V. K. Vijaya Kumar, et al. Dept of Electrical and Computer Engineering Carnegie Mellon University 1 Clustering Problem Important Issues:

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Fast trajectory matching using small binary images

Fast trajectory matching using small binary images Title Fast trajectory matching using small binary images Author(s) Zhuo, W; Schnieders, D; Wong, KKY Citation The 3rd International Conference on Multimedia Technology (ICMT 2013), Guangzhou, China, 29

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE Sinu T S 1, Mr.Joseph George 1,2 Computer Science and Engineering, Adi Shankara Institute of Engineering

More information

Mixture Models and EM

Mixture Models and EM Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering

More information

A Virtual Laboratory for Study of Algorithms

A Virtual Laboratory for Study of Algorithms A Virtual Laboratory for Study of Algorithms Thomas E. O'Neil and Scott Kerlin Computer Science Department University of North Dakota Grand Forks, ND 58202-9015 oneil@cs.und.edu Abstract Empirical studies

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

Normalized Texture Motifs and Their Application to Statistical Object Modeling

Normalized Texture Motifs and Their Application to Statistical Object Modeling Normalized Texture Motifs and Their Application to Statistical Obect Modeling S. D. Newsam B. S. Manunath Center for Applied Scientific Computing Electrical and Computer Engineering Lawrence Livermore

More information

Exploring the Landscape of Clusterings

Exploring the Landscape of Clusterings Exploring the Landscape of Clusterings Advisor: Suresh Venkatasubramanian Clustering Lattice... in the current form the work is extremely theoretical... unclear whether your distance function is meaningful

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Theoretical Foundations of Clustering. Margareta Ackerman

Theoretical Foundations of Clustering. Margareta Ackerman Theoretical Foundations of Clustering Margareta Ackerman The Theory-Practice Gap Clustering is one of the most widely used tools for exploratory data analysis. Identifying target markets Constructing phylogenetic

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

A Generalized Method to Solve Text-Based CAPTCHAs

A Generalized Method to Solve Text-Based CAPTCHAs A Generalized Method to Solve Text-Based CAPTCHAs Jason Ma, Bilal Badaoui, Emile Chamoun December 11, 2009 1 Abstract We present work in progress on the automated solving of text-based CAPTCHAs. Our method

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal

More information

Cluster Ensembles for High Dimensional Clustering: An Empirical Study

Cluster Ensembles for High Dimensional Clustering: An Empirical Study Cluster Ensembles for High Dimensional Clustering: An Empirical Study Xiaoli Z. Fern xz@ecn.purdue.edu School of Electrical and Computer Engineering, Purdue University, W. Lafayette, IN 47907, USA Carla

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

Image Mining: frameworks and techniques

Image Mining: frameworks and techniques Image Mining: frameworks and techniques Madhumathi.k 1, Dr.Antony Selvadoss Thanamani 2 M.Phil, Department of computer science, NGM College, Pollachi, Coimbatore, India 1 HOD Department of Computer Science,

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Consistency and Set Intersection

Consistency and Set Intersection Consistency and Set Intersection Yuanlin Zhang and Roland H.C. Yap National University of Singapore 3 Science Drive 2, Singapore {zhangyl,ryap}@comp.nus.edu.sg Abstract We propose a new framework to study

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Discrete Optimization. Lecture Notes 2

Discrete Optimization. Lecture Notes 2 Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The

More information

Multiobjective Data Clustering

Multiobjective Data Clustering To appear in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Multiobjective Data Clustering Martin H. C. Law Alexander P. Topchy Anil K. Jain Department of Computer Science

More information