A Study of K-Means-Based Algorithms for Constrained Clustering


Thiago F. Covões, Eduardo R. Hruschka, Joydeep Ghosh
University of São Paulo (USP) at São Carlos, Brazil
University of Texas (UT) at Austin, USA

Abstract

The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have (partially) compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely the Constrained Vector Quantization Error (CVQE) algorithm, its variant LCVQE, and the Metric Pairwise Constrained K-Means (MPCK-Means) algorithm, are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has shown to be competitive with CVQE, while violating fewer constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy than MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of (more specific) new experimental findings are discussed in the paper, e.g., that deduced constraints usually do not help finding better data partitions.

1 Introduction

Constrained clustering arises from the need to incorporate known information about the desired data partitions into the process of clustering data [4]. There are several types of constraints, e.g., about pairs of objects [0], clusters [], and partitions []. Among these, the most usual ones are constraints on pairs of objects, specifically the Must-Link (ML) and Cannot-Link (CL) constraints [0,, 4, 7, 5, 7, 6]. Considering a set X = {x_i}_{i=1}^N of N objects, each one represented by a vector x_i ∈ R^M, a ML constraint c_=(i, j) indicates that the objects x_i and x_j should lie in the same cluster, whereas a CL constraint c_≠(i, j) indicates that x_i and x_j should lie in different clusters. From a set of ML constraints, new constraints can be deduced by using the transitivity property. Although CL constraints do not have this property, the combination of ML and CL constraints can yield new constraints; e.g., given the constraints c_=(i, j) and c_≠(j, k), the constraint c_≠(i, k) can be deduced. Some algorithms for clustering with constraints [0,, 7] do not allow any violation of the constraints in the process of clustering, i.e., in every iteration of the algorithm the resulting partition must satisfy all the constraints. While this may be interesting in some circumstances, it is important to keep in mind that, in many practical applications, the constraints are usually provided by users who are unaware of the spatial disposition of the data.
Therefore, the need to satisfy all the constraints can make the clustering process intractable, and an empty partition is often returned by such an algorithm. To illustrate this, consider Figure 1, which depicts the petal and sepal areas for the well-known Iris dataset. The three classes are represented by different markers, and the centroid of each class is represented by a dot. The markers enclosed by rectangles correspond to some objects whose nearest centroids are not those of their respective classes. Thus, assuming that every class corresponds to a different cluster, a constrained clustering algorithm based on k-means will not be able to satisfy some pairwise constraints derived from the classes. To overcome this limitation, more flexible algorithms have been developed [5, 7, 6]. These algorithms seek to minimize the number of violated constraints. This way, a partition that agrees as much as possible with the user's constraints can be found. For this reason, such constraints are sometimes

called soft constraints.

Figure 1. Iris data (sepal area vs. petal area): rectangles highlight some objects whose nearest centroids are not those of their respective classes. The classes Setosa, Versicolor, and Virginica and their centroids are shown with different markers.

Despite the increasing number of studies on clustering with constraints, there is a lack of studies providing empirical comparisons among algorithms. Moreover, the adopted experimental methodology frequently differs from one paper to another, e.g., due to the use of different performance measures and numbers of constraints, which makes the (indirect) comparison between algorithms described in different papers virtually impossible. Also, the identification of classes of problems for which a particular algorithm could be preferred is very difficult. This paper presents an extensive comparative analysis of three well-known k-means-based algorithms for constrained clustering, namely the Constrained Vector Quantization Error (CVQE) algorithm [7], its variant called LCVQE [6], and the MPCK-Means algorithm [5]. We performed computational complexity analyses not reported in the original references. From the experimental point of view, three criteria were used in our study: Adjusted Rand Index, Normalized Mutual Information, and number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. The statistical significance of the obtained results was also addressed. It is worth noting that the authors of CVQE, LCVQE, and MPCK-Means employed a relatively limited number of datasets (five on average) and/or sets of constraints (typically three). The accuracies of the obtained partitions were only assessed in [6] and [5]. We also analyzed the robustness of the algorithms with respect to noisy constraints. This issue was briefly analyzed in [6] by considering the case when noisy constraints are rare. Also, the effect of noisy constraints on probabilistic models was studied in [5]. In our experiments, we assess the robustness of each algorithm by considering different proportions of noisy constraints. To sum up, we provide a richer evaluation of all these aspects than previous studies. Finally, the compared algorithms are based on the well-known k-means algorithm [8, ], which is widely used in practice. Thus, by performing experiments with k-means-based algorithms, interesting conclusions can be drawn for a large audience, e.g., k-means users and data mining practitioners potentially interested in this kind of algorithm.

The remainder of this paper is organized as follows. In the next section the algorithms under study are briefly described. Section 3 addresses the methodology adopted to perform comparisons among the algorithms and presents our experimental results. Finally, Section 4 concludes our work.

Notation. A hard partition of the data is a collection P = {C_i}_{i=1}^k of k disjoint clusters such that ∪_{i=1}^k C_i = X, C_i ∩ C_j = ∅ for i ≠ j, and |C_i| ≠ 0 for all i, where |C_i| denotes the number of objects in cluster C_i. Each cluster C_i is represented by a prototype μ_i. The distance between an object x_i and a prototype μ_j is calculated using the squared Euclidean distance, i.e., ||x_i − μ_j||² = (x_i − μ_j)^T (x_i − μ_j), where T denotes transposition. It is assumed that each algorithm takes a set M of ML constraints and a set C of CL constraints.
Using o_M^1(l) and o_M^2(l) to denote the functions that return the first and the second object of the l-th ML constraint, it is possible to define the functions g_M^1(l) and g_M^2(l) that return, respectively, the indices of the clusters to which the first and the second object of the l-th ML constraint belong, i.e., g_M^1(l) = {j | o_M^1(l) ∈ C_j} and g_M^2(l) = {t | o_M^2(l) ∈ C_t}. Similarly, the functions o_C^1(l), o_C^2(l), g_C^1(l), and g_C^2(l) can be defined for CL constraints. The set of violated ML constraints is defined as V_M = {i ∈ M | g_M^1(i) ≠ g_M^2(i)} and, similarly, the set of violated CL constraints is defined as V_C = {i ∈ C | g_C^1(i) = g_C^2(i)}. Finally, 1[Condition] is an indicator function that equals one when the condition is satisfied and zero otherwise.

2 Algorithms

2.1 CVQE

The Constrained Vector Quantization Error (CVQE) algorithm [7] employs the objective function of k-means augmented by two terms that account for the costs of violating the constraints. The costs of violating ML and CL constraints are computed from distances between prototypes. For a ML constraint, the cost is the distance between the prototypes of the clusters that contain the objects that should be in the same cluster. For a CL constraint, the cost is the distance between the prototype of the cluster in which both objects lie and its nearest neighbor prototype (second-closest cluster). More formally, the objective function is defined as:

J_{CVQE} = \sum_{j=1}^{k} J_{CVQE_j},    (1)

J_{CVQE_j} = \sum_{x_i \in C_j} \|\mu_j - x_i\|^2 + \sum_{l \in V_M : g_M^1(l)=j} \|\mu_j - \mu_{g_M^2(l)}\|^2 + \sum_{l \in V_C : g_C^1(l)=j} \|\mu_j - \mu_{h(g_C^1(l))}\|^2,    (2)

where h(i) returns the index of the cluster whose prototype is the nearest to the prototype of the i-th cluster. The CVQE algorithm assigns objects to clusters as follows: (i) objects that are not involved in any constraint are assigned to the closest cluster; (ii) pairs of objects involved in ML and CL constraints are assigned to the clusters that minimize the objective function in Eq. (1). To do so, all possible assignment combinations are verified. The prototypes (μ_j, j = 1, ..., k) are updated according to:

\mu_j = y_j / z_j,    (3)

y_j = \sum_{x_i \in C_j} x_i + \sum_{l \in V_M : g_M^1(l)=j} \mu_{g_M^2(l)} + \sum_{l \in V_C : g_C^1(l)=j} \mu_{h(g_C^1(l))},    (4)

z_j = |C_j| + \sum_{l \in V_M} 1[g_M^1(l)=j] + \sum_{l \in V_C} 1[g_C^1(l)=j].    (5)

This update procedure can be interpreted as follows [7]: if a ML constraint is violated, the prototype of the cluster that contains the first object of the constraint is moved towards the prototype of the cluster containing the second object of the constraint. In case of violation of a CL constraint, the prototype of the cluster containing the two objects in question is moved towards the nearest neighbor prototype of the second object of the constraint. As discussed in [6], CVQE has some drawbacks. First, the algorithm is sensitive to the order of the objects in each constraint. This can be readily seen from the prototype update rule for the case of a ML violation, because only the prototype of the cluster in which the first object lies is affected. Second, checking all possible assignment combinations leads to O(k^2) calculations, which is computationally demanding for applications in which the number of clusters is high. Third, in case of constraint violations, only the distances between prototypes are considered in the penalization, i.e., the positions of the objects relative to these prototypes are ignored. Aimed at circumventing these limitations, a variant of this algorithm, called Linear CVQE (LCVQE), has been presented in [6].
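As an illustration of the update step in Eqs. (3)-(5), the following Python sketch computes the CVQE prototypes for a fixed assignment of objects and given constraint lists. It is only a sketch of the update rule (not of the full assignment procedure), and all function and variable names are ours rather than from [7].

```python
import numpy as np

def cvqe_update(X, labels, mu, ml, cl):
    """One CVQE prototype update (Eqs. 3-5), as a sketch.

    X:      (N, M) data matrix
    labels: (N,) current cluster index of each object
    mu:     (k, M) current prototypes
    ml, cl: lists of (i, j) object-index pairs for ML and CL constraints
    """
    k = mu.shape[0]

    def nearest_other_prototype(j):
        # h(j): index of the prototype closest to prototype j (excluding j itself)
        d = np.sum((mu - mu[j]) ** 2, axis=1)
        d[j] = np.inf
        return int(np.argmin(d))

    y = np.zeros_like(mu, dtype=float)
    z = np.zeros(k)
    for j in range(k):                        # contribution of the cluster members
        members = X[labels == j]
        y[j] = members.sum(axis=0)
        z[j] = len(members)
    for (a, b) in ml:                         # violated ML: move mu[g1] towards mu[g2]
        g1, g2 = labels[a], labels[b]
        if g1 != g2:
            y[g1] += mu[g2]
            z[g1] += 1
    for (a, b) in cl:                         # violated CL: move the shared prototype towards its nearest neighbor prototype
        g1, g2 = labels[a], labels[b]
        if g1 == g2:
            y[g1] += mu[nearest_other_prototype(g1)]
            z[g1] += 1
    z = np.maximum(z, 1)                      # guard against empty clusters in this sketch
    return y / z[:, None]
```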

2.2 LCVQE

The LCVQE algorithm [6] uses a modified version of the objective function in (2). The cost of violating a ML constraint is now based on distances between objects and prototypes; these distances are computed by considering the object of one cluster and the prototype of the other cluster. For a CL constraint, the object of the pair that is farthest from the prototype is determined, and the distance between this object and its nearest neighbor prototype (second-closest cluster) is used as the violation cost. More precisely, the objective function J_{LCVQE} is defined as:

J_{LCVQE} = \sum_{j=1}^{k} J_{LCVQE_j},    (6)

J_{LCVQE_j} = \sum_{x_i \in C_j} \|\mu_j - x_i\|^2 + \sum_{l \in V_M : g_M^1(l)=j} \|\mu_j - o_M^2(l)\|^2 + \sum_{l \in V_M : g_M^2(l)=j} \|\mu_j - o_M^1(l)\|^2 + \sum_{l \in V_C : V(l)=j} \|\mu_j - R_{g_C^1(l)}(l)\|^2.    (7)

The auxiliary functions R_j(l) and V(l) are defined in Equations (8) and (9), respectively. Intuitively, R_j(l) returns the object of the l-th CL constraint that is farthest from μ_j, while V(l) returns the index of the nearest neighbor prototype of the object R_{g_C^1(l)}(l), which is the object of the l-th CL constraint farthest from the prototype of its cluster.

R_j(l) = o_C^1(l) if \|o_C^1(l) - \mu_j\|^2 > \|o_C^2(l) - \mu_j\|^2, and o_C^2(l) otherwise,    (8)

V(l) = \arg\min_{m \in \{1,\dots,k\} \setminus \{g_C^1(l)\}} \|R_{g_C^1(l)}(l) - \mu_m\|^2.    (9)

The assignment of objects to clusters also differs from CVQE. First, every object is assigned to the closest cluster. For each ML constraint being violated, only three assignment possibilities are examined: (i) maintain the violation; (ii) assign the two objects to the cluster whose prototype is the nearest to the first object (o_M^1(l)); (iii) assign the two objects to the cluster whose prototype is the nearest to the second object (o_M^2(l)). For each CL constraint being violated, only two cases are checked: (i) maintain the violation; (ii) keep the object that is closest to the cluster prototype where it is, and assign the farthest object (R_{g_C^1(l)}(l)) to the cluster with the second-closest prototype (V(l)). More formally, the prototypes are updated as:

\mu_j = y_j / z_j,    (10)

y_j = \sum_{x_i \in C_j} x_i + \sum_{l \in V_M : g_M^1(l)=j} o_M^2(l) + \sum_{l \in V_M : g_M^2(l)=j} o_M^1(l) + \sum_{l \in V_C : V(l)=j} R_{g_C^1(l)}(l),    (11)

z_j = |C_j| + \sum_{l \in V_M} (1[g_M^1(l)=j] + 1[g_M^2(l)=j]) + \sum_{l \in V_C} 1[V(l)=j].    (12)

The update rule can be interpreted as follows. Let l be a ML constraint that is being violated, i.e., o_M^1(l) ∈ C_j and o_M^2(l) ∈ C_n with j ≠ n. Then, the prototype μ_j is moved towards the object o_M^2(l) and the prototype μ_n is moved towards the object o_M^1(l). Now consider the case of a CL constraint being violated, i.e., o_C^1(l) ∈ C_j and o_C^2(l) ∈ C_j. Consider also that ||μ_j − o_C^1(l)||² > ||μ_j − o_C^2(l)||² and that μ_n is the second-closest prototype of o_C^1(l). Then, μ_n is moved towards o_C^1(l).
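For concreteness, the sketch below evaluates the violation terms of the LCVQE objective in Eq. (7) for a given partition, using the rules for R and V in Eqs. (8) and (9). Again, this is only an illustrative sketch under our own naming and based on our reading of the reconstructed equations; it does not implement the assignment or update steps.

```python
import numpy as np

def lcvqe_violation_costs(X, labels, mu, ml, cl):
    """Sum of the ML and CL violation terms of Eq. (7) for a fixed partition."""
    def sqdist(a, b):
        return float(np.sum((a - b) ** 2))

    cost = 0.0
    for (a, b) in ml:
        g1, g2 = labels[a], labels[b]
        if g1 != g2:                           # violated ML constraint
            # each prototype is penalized by its distance to the object in the other cluster
            cost += sqdist(mu[g1], X[b]) + sqdist(mu[g2], X[a])
    for (a, b) in cl:
        g1 = labels[a]
        if g1 == labels[b]:                    # violated CL constraint
            # R: the object of the pair that is farthest from the shared prototype
            r = a if sqdist(X[a], mu[g1]) > sqdist(X[b], mu[g1]) else b
            # V: the second-closest prototype to that object
            d = np.sum((mu - X[r]) ** 2, axis=1)
            d[g1] = np.inf
            v = int(np.argmin(d))
            cost += sqdist(mu[v], X[r])
    return cost
```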
2.3 MPCK-Means

The Metric Pairwise Constrained K-Means (MPCK-Means) algorithm [5] seeks to learn a distance metric that best fits the constraints. Specifically, a positive semi-definite matrix A_j is learned for each cluster C_j by parameterizing the Euclidean distance as ||x_i − μ_j||²_{A_j} = (x_i − μ_j)^T A_j (x_i − μ_j). There are other algorithms capable of learning distance metrics from constraints [3, 3]. These algorithms, however, do not perform clustering; instead, a single metric is learned for all clusters, which limits them to clusters with similar shapes [5]. The MPCK-Means objective function is:

J_{mpckm} = \sum_{j=1}^{k} \sum_{x_i \in C_j} [ \|x_i - \mu_j\|^2_{A_j} - \log(\det(A_j)) ] + \sum_{l \in V_M} w_l f_{ML}(l) + \sum_{l \in V_C} \bar{w}_l f_{CL}(l),    (13)

f_{ML}(l) = \|o_M^1(l) - o_M^2(l)\|^2_{A_{g_M^1(l)}} + \|o_M^1(l) - o_M^2(l)\|^2_{A_{g_M^2(l)}},    (14)

f_{CL}(l) = \|x'_{g_C^1(l)} - x''_{g_C^1(l)}\|^2_{A_{g_C^1(l)}} - \|o_C^1(l) - o_C^2(l)\|^2_{A_{g_C^1(l)}},    (15)

where log(det(A_j)) arises from the normalizing constant of a more general k-means model (the k-means algorithm can be seen as a particular case of the well-known EM algorithm for learning Gaussian Mixture Models; in this case, the Gaussian representing a cluster C_j has a covariance matrix A_j), and w_l and \bar{w}_l are user-defined weights that penalize the violation of the l-th ML constraint and the l-th CL constraint, respectively. As in [5], we set w_l = \bar{w}_l = 1. The norm \|x'_{g_C^1(l)} - x''_{g_C^1(l)}\|^2_{A_{g_C^1(l)}} represents the maximum distance between two objects under the distance metric of cluster C_{g_C^1(l)}. The auxiliary functions f_{ML} and f_{CL} compute the penalties for violating the l-th ML and CL constraints, respectively. The former, f_{ML}, penalizes the violation of a ML constraint proportionally to the distance between the objects. As a violation of this type involves two clusters, the distance between the two objects in question is computed under two distance metrics, one from each cluster. The latter, f_{CL}, penalizes the violation of a CL constraint inversely proportionally to the distance between the two objects in question.

The MPCK-Means algorithm uses a heuristic method for initializing the prototypes from the provided constraints. Initially, it deduces all the possible ML and CL constraints from the original set of constraints. To simplify the notation, the set M will here refer to the set of ML constraints, both provided and deduced; similarly, the set C will represent CL constraints. Then, the connected components of a graph, obtained by considering the objects as vertices and each ML constraint as an edge, are found. Such connected components form the set of neighborhoods, Λ. If |Λ| ≤ k, |Λ| prototypes are initialized as the centroids of the objects that form each neighborhood λ_i, and the k − |Λ| remaining prototypes are initialized as the overall data mean plus random noise. If |Λ| > k, k neighborhoods are chosen by using a weighted variant of the farthest-first algorithm [], whose weights correspond to the number of objects in each neighborhood. Thus, this initialization procedure is biased towards distant neighborhoods representing a large number of objects [5].
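The sketch below illustrates the spirit of this initialization: neighborhoods are obtained as connected components of the ML graph (here via a small union-find), and prototypes are taken as their centroids, with the remaining prototypes placed at the overall mean plus noise. For simplicity, the weighted farthest-first selection used when |Λ| > k is replaced by simply keeping the k largest neighborhoods; all names are ours, and this is not the implementation of [5].

```python
import numpy as np

def ml_neighborhoods(n_objects, ml):
    """Neighborhoods = connected components of the ML graph (union-find sketch)."""
    parent = list(range(n_objects))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for (a, b) in ml:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n_objects):
        groups.setdefault(find(i), []).append(i)
    # keep only components actually tied together by at least one ML constraint
    return [idx for idx in groups.values() if len(idx) > 1]

def init_prototypes(X, ml, k, rng=np.random.default_rng(0)):
    """Neighborhood-centroid initialization; the weighted farthest-first step is
    replaced here by keeping the k largest neighborhoods (a simplification)."""
    hoods = sorted(ml_neighborhoods(len(X), ml), key=len, reverse=True)[:k]
    protos = [X[idx].mean(axis=0) for idx in hoods]
    while len(protos) < k:                   # remaining prototypes: global mean plus noise
        protos.append(X.mean(axis=0) + 0.01 * rng.standard_normal(X.shape[1]))
    return np.vstack(protos)
```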

In order to assign objects to clusters, MPCK-Means uses a strategy that is sensitive to the data presentation order. In brief, objects are assigned to the clusters that minimize the cost in (13). Once an object x_i has been assigned to a given cluster, this choice is not revised in the respective iteration, i.e., it cannot be changed, and it is only taken into account when checking for possible violations involving other objects and when assessing the impact of subsequent assignments on the objective function. To alleviate this problem, the objects are processed in a random order at each iteration. The prototypes are updated according to the k-means rule, and each matrix A_j is updated according to (16), for which additional care must be taken [5]. In particular, it is initially necessary to check whether the sum of the covariance matrices on the right-hand side of (16) is singular. If so, a fraction of the trace is added to the main diagonal, i.e., A_j = A_j + ε tr(A_j) I. If the matrix A_j resulting from the inversion is not positive semi-definite, it is necessary to project it onto the set of positive semi-definite matrices to ensure that it can parameterize a distance metric [5]. This is accomplished by the procedure described in [3], which consists of decomposing the matrix A_j = X^T Λ X, where X is the matrix of eigenvectors of A_j and Λ is the diagonal matrix with the eigenvalues of A_j. After the decomposition, the eigenvalues in Λ that are smaller than zero are replaced by 0, and then A_j is reconstructed (a small sketch of this repair step is given at the end of Section 2.4).

A_j = |C_j| ( \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^T + \sum_{l \in V_M : g_M^1(l)=j} w_l (o_M^1(l) - o_M^2(l))(o_M^1(l) - o_M^2(l))^T + \sum_{l \in V_M : g_M^2(l)=j} w_l (o_M^1(l) - o_M^2(l))(o_M^1(l) - o_M^2(l))^T + \sum_{l \in V_C : g_C^1(l)=j} \bar{w}_l [ (x'_{g_C^1(l)} - x''_{g_C^1(l)})(x'_{g_C^1(l)} - x''_{g_C^1(l)})^T - (o_C^1(l) - o_C^2(l))(o_C^1(l) - o_C^2(l))^T ] )^{-1}    (16)

2.4 Computational Complexity

We studied the computational costs (per iteration) of CVQE, LCVQE, and MPCK-Means; here we provide a summary of our analyses. Recall that k, N, and M are the numbers of clusters, objects, and attributes, respectively, and that M and C are the sets of ML and CL constraints. The computational complexity of CVQE is O(kM(N + |M| + |C|) + k^2(M + |M| + |C|)). Thus, like k-means, it has linear complexity in the numbers of objects and attributes. However, differently from k-means, CVQE has quadratic complexity in k. The computational complexity of LCVQE is O(kM(N + |M| + |C|)), so it is more computationally efficient than CVQE with regard to k. Finally, the MPCK-Means complexity is O(kM(|M| + |C|) + M^2(|M| + |C|) + kM^3 + kNM^2 + N^2 M), where the cubic complexity in the number of attributes comes from the computation of determinants and the eigendecomposition of each matrix A_j, and the quadratic complexity in N is due to the updates of the pair of farthest objects for each metric.
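As mentioned at the end of Section 2.3, the learned matrices may need to be repaired before they can parameterize a metric. The sketch below illustrates the two repairs (trace regularization of a singular matrix and projection onto the positive semi-definite cone via eigendecomposition); it is a schematic illustration with our own naming and a default ε, not the exact procedure of [5].

```python
import numpy as np

def make_valid_metric(A, eps=1e-3):
    """Repair a symmetric candidate metric matrix so it can parameterize a distance."""
    # If A is singular, add a fraction of its trace to the main diagonal.
    if np.linalg.matrix_rank(A) < A.shape[0]:
        A = A + eps * np.trace(A) * np.eye(A.shape[0])
    # Project onto the PSD cone: eigendecompose, clamp negative eigenvalues to zero, rebuild.
    w, V = np.linalg.eigh(A)          # A = V diag(w) V^T for symmetric A
    w = np.clip(w, 0.0, None)
    return V @ np.diag(w) @ V.T
```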
3 Empirical Evaluation

3.1 Methodology

In order to compare the algorithms CVQE, LCVQE, and MPCK-Means, experiments were performed on two sets of datasets, as reported in Table 1. The first set is composed of 10 datasets commonly used as benchmarks in the literature. Most of them are available at the UCI Repository []. In addition, we have used the 9Gauss dataset [6], which is formed by nine balanced clusters arranged according to Gaussian distributions with some degree of overlap, as well as the Protein dataset [5]. Following [5], for the Letters dataset only classes I, J, and L were considered and, for Pendigits, only classes 3, 8, and 9 were considered; according to [5], these classes represent difficult classification problems. The second set, denoted Bioinformatics for simplicity, is composed of 10 datasets that are benchmarks for the Cancer Gene Expression Data domain [8]. We selected only datasets containing more than 100 objects, so that a considerable number of constraints could be generated. For the experiments with gene expression datasets, the objects were normalized using z-scores [4]. The main characteristics of the employed datasets are given in Table 1.

The quality of the obtained partitions is assessed by means of two well-known external validity criteria, namely the Adjusted Rand Index (ARI) [3] and the Normalized Mutual Information (NMI) [9]. All algorithms were executed with the a priori known number of clusters. For the UCI datasets, it is assumed that classes correspond to clusters. Sets of constraints with eight different sizes were generated so that |M| + |C| ∈ R = {25, 50, 75, 100, 125, 150, 175, 200}. Each constraint was generated as follows: let i and j be two distinct integers randomly chosen from the range [1, N]. The labels of the objects x_i and x_j are verified; if they are the same, a ML constraint is added, otherwise a CL constraint is added. If the constraint has already been added (possibly with the reversed order of objects), it is discarded. CVQE and LCVQE were implemented in Matlab, whereas a Java implementation available at ml/risc/code/ was used for running the experiments with MPCK-Means.
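A possible reading of this constraint-generation procedure is sketched below in Python; the function name and the use of a fixed random seed are ours.

```python
import numpy as np

def random_constraints(y, n_constraints, rng=np.random.default_rng(0)):
    """Draw distinct object pairs from class labels y: a pair becomes a ML
    constraint if the labels agree and a CL constraint otherwise; pairs already
    generated (in either order) are discarded."""
    ml, cl, seen = [], [], set()
    n = len(y)
    while len(ml) + len(cl) < n_constraints:
        i, j = rng.choice(n, size=2, replace=False)
        key = frozenset((int(i), int(j)))
        if key in seen:
            continue                      # duplicate pair, possibly with reversed order
        seen.add(key)
        (ml if y[i] == y[j] else cl).append((int(i), int(j)))
    return ml, cl
```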

Table 1. Datasets used in the experiments (for each dataset: number of objects N, number of attributes M, and number of clusters k). UCI set: Breast Cancer, E-coli, Ionosphere, Iris, Letters, Pendigits, Pima, Wine, 9Gauss, and Protein. Bioinformatics set: Bhattacharjee, Chen, Chowdary, Gordon, Lapointe-2004, Ramaswamy, Singh, Tomlins, Yeoh-2002-v1, and Yeoh-2002-v2.

For each number of constraints, T = 100 repetitions were performed, thus totalling 800 sets of constraints for each dataset. All the studied algorithms are based on k-means, which is known to be sensitive to initialization. Therefore, each algorithm was run five times. For both CVQE and LCVQE, the initializations differ in the objects randomly selected to be the initial prototypes, whereas for MPCK-Means the initializations differ in the point coordinates obtained through the addition of random noise. The best partition obtained from the different initializations is determined by the value of the objective function of each algorithm, thus simulating a practical application of the algorithms being compared. We have used the standard k-means algorithm as a baseline for assessing the impact of incorporating constraints into the clustering process.

Original and deduced constraints were considered when assessing the performance of CVQE and LCVQE. If only the provided constraints are considered, the respective results are referenced by the algorithm names. If the original set of provided constraints is augmented by deduced constraints, the corresponding results are referenced as CVQE-A and LCVQE-A. By comparing these different approaches, it is possible to assess to what extent the deduced constraints can improve the quality of the obtained partitions. MPCK-Means uses the deduced constraints as part of its initialization; as a consequence, both provided and deduced constraints are used when running this algorithm. We also investigate whether learning a particular distance metric per cluster provides better results than learning a single metric for all clusters. The algorithm that learns multiple metrics is denoted by MPCK-M; analogously, MPCK-S denotes the algorithm that learns a single metric for all clusters.

In order to provide some reassurance about the validity and non-randomness of the obtained results, we present the results of statistical tests following the approach proposed by Demšar [9]. In brief, this approach is aimed at comparing multiple algorithms on multiple datasets, and it is based on the well-known Friedman test with a corresponding post-hoc test. The Friedman test is a non-parametric counterpart of the well-known ANOVA. If the null hypothesis, which states that the algorithms under study have similar performances, is rejected, then we proceed with the Nemenyi post-hoc test for pair-wise comparisons between algorithms. Statistical tests were performed individually for every investigated criterion, namely ARI, NMI, and number of violated constraints (VIO). The tests were performed separately for each number of constraints in R = {25, 50, 75, 100, 125, 150, 175, 200}, by averaging the values obtained over the T = 100 repetitions.

3.2 Results on UCI Datasets

The results obtained for the ARI and NMI criteria are summarized by means of the percentage of wins/ties/losses for each pair of algorithms in Tables 2 and 3, respectively (considering every analyzed case).
The overall number of cases is 8,000 (10 · |R| · T). For example, the value 45.6/0.5/43.9 in the 2nd row and 3rd column of Table 2 means that CVQE presented ARI values greater than those obtained by LCVQE in 45.6% of the cases, equal values in 0.5% of the cases, and smaller values in 43.9% of the cases. From these tables, one can see that the use of deduced constraints by CVQE and LCVQE resulted in a similar number of gains and losses compared to using only the provided, original constraints. For example, considering the ARI, CVQE-A showed better results than CVQE in 39.4% of the cases and worse results in 37.3% of the cases. This and other similar results show that increasing the number of constraints (especially if these are deduced from known constraints) does not necessarily yield better data partitions. Another interesting observation is that MPCK-S performed equal to or better than CVQE-A and LCVQE-A in more than 55% of the cases, while MPCK-M has not presented better results than LCVQE-A. Comparing CVQE to LCVQE, it can be noted that CVQE obtained better results in a slightly larger number of cases. Considering the two versions of MPCK-Means, learning a single metric for all clusters (MPCK-S) showed equal or better results than learning a particular metric per cluster (MPCK-M) in 66% of the cases. As expected, all algorithms provided better results than k-means. Table 4 summarizes the percentage of wins/ties/losses for the number of violated constraints (VIO) in the partitions found by each algorithm. The number of constraints violated by MPCK-Means is smaller than or equal to the number violated by CVQE-A and LCVQE-A in more

7 Table. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column ARI for UCI Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 45.6 / 0.5 / / 3.3 / / 0.0 / /.4 / / 0. / / 0.5 / 30.8 LCVQE 43.9 / 0.5 / / 9.5 / / 5.3 / /.4 / / 0. / / 0.5 / 4.7 CVQE-A 39.4 / 3.3 / / 9.5 / / 0. / /.3 / / 0.3 / / 0.5 / 3.7 LCVQE-A 44.9 / 0.0 / / 5.3 / / 0. / /.3 / / 0.3 / / 0.6 / 4. MPCK-S 58.6 /.4 / /.4 / /.3 / /.3 / / 4.4 / / 0.3 / 8.8 MPCK-M 5.8 / 0. / / 0. / / 0.3 / / 0.3 / / 4.4 / / 0.0 / 36.5 k-means 30.8 / 0.5 / / 0.5 / / 0.5 / / 0.6 / / 0.3 / / 0.0 / 63.4 Table 3. Win/Tie/Loss (%) w.r.t. Algorithm in st Column NMI for UCI Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 45.3 / 0.5 / / 3.4 / / 0.0 / /.4 / / 0. / / 0.5 / 4.6 LCVQE 44. / 0.5 / / 9.5 / / 5.4 / /.4 / / 0. / / 0.5 / 35.3 CVQE-A 38.7 / 3.4 / / 9.5 / / 0. / /.3 / / 0.3 / / 0.5 / 43.0 LCVQE-A 45. / 0.0 / / 5.4 / / 0. / /.3 / / 0.3 / / 0.6 / 35. MPCK-S 6. /.4 / /.4 / /.3 / /.3 / / 4.4 / / 0.3 / 30.6 MPCK-M 48.7 / 0. / / 0. / / 0.3 / / 0.3 / / 4.4 / / 0.0 / 47.6 k-means 4.6 / 0.5 / / 0.5 / / 0.5 / / 0.6 / / 0.3 / / 0.0 / 5.4 Table 4. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column VIO for UCI Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 8.5 / 7.7 / / 37.8 / / 6.0 / /. / / 9.3 / /.9 /.0 LCVQE 63.8 / 7.7 / / 5.7 / / 45.0 / / 33.8 / / 3. / /.7 / 0.0 CVQE-A 3.4 / 37.8 / / 5.7 / / 6.7 / / 4. / /.0 / /.4 /.9 LCVQE-A 54.4 / 6.0 / / 45.0 / / 6.7 / / 38.0 / / 35.0 / /.0 / 3.6 MPCK-S 74.8 /. / / 33.8 / / 4. / / 38.0 / / 80.0 / /.0 / 0.0 MPCK-M 73.6 / 9.3 / / 3. / /.0 / / 35.0 / / 80.0 / /.3 / 0.7 k-means.0 /.9 / /.7 / /.4 / /.0 / /.0 / /.3 / 97.0 Table 5. Average Number of Violated Constraints (VIO) UCI Datasets. Size of set Type Provided Deduced CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M ML 6.0 (3.) 0. (0.) 0.7 (0.8) 0. (0.) 0.7 (0.9) 0. (0.) 0.0 (0.0) 0.0 (0.0) CL 9.0 (3.).4 (.0) 0.8 (0.9) 0. (0.) 0.9 (.0) 0. (0.) 0.0 (0.0) 0.0 (0.0) ML.0 (6.3) 0.4 (0.5).6 (.7) 0.6 (0.6).7 (.9) 0.6 (0.6) 0. (0.) 0. (0.) CL 38.0 (6.3) 5.8 (4.6).7 (.0) 0.4 (0.5).3 (.4) 0.6 (0.8) 0.0 (0.0) 0.0 (0.0) ML 8. (9.6). (.3).9 (.9). (.3) 3. (3.5).3 (.5) 0. (0.) 0. (0.) CL 56.8 (9.6) 3.6 (0.8) 3. (3.0) 0.8 (0.9) 4.5 (4.4).5 (.9) 0.0 (0.0) 0. (0.) ML 4.0 (.8).0 (.3) 4. (4.4). (.) 4.7 (5.4). (.5) 0. (0.) 0.3 (0.) CL 76.0 (.8) 4.9 (0.) 4.4 (4.0).5 (.7) 7. (6.8) 3. (4.) 0. (0.) 0. (0.) ML 30. (6.6) 3. (3.9) 5.8 (6.0) 3. (3.4) 6.9 (8.) 3.5 (4.4) 0.3 (0.5) 0.5 (0.4) CL 94.9 (6.6) 39.5 (3.7) 5.8 (5.5). (.4) 0.9 (0.) 4.9 (6.9) 0. (0.4) 0. (0.3) ML 36.0 (9.6) 4.8 (5.7) 7.3 (7.5) 4. (4.4) 9. (0.4) 4.8 (6.0) 0.4 (0.7) 0.7 (0.7) CL 4.0 (9.6) 57.9 (49.0) 7. (6.8).9 (3.3) 5.6 (4.6) 7.5 (0.) 0.3 (0.8) 0.5 (0.6) ML 4. (3.) 6.6 (7.9) 9. (9.6) 5.5 (6.0).7 (3.4) 6.6 (8.0) 0.5 (0.8) 0.8 (0.9) CL 3.8 (3.) 8.5 (7.) 8.8 (8.) 3.6 (4.).4 (.4) 0.6 (4.6) 0.5 (.) 0.8 (.) ML 48. (6.0) 8.7 (0.3) 0.7 (.) 6.7 (7.4) 4. (6.) 8.7 (0.7) 0.8 (.).3 (.4) CL 5.8 (6.0) 09. (96.5) 0. (9.7) 4.5 (5.) 7. (7.0) 5.0 (.) 0.9 (.).8 (.8) 7

than 96% of the cases, thus suggesting that MPCK-Means can indeed learn a metric suitable for satisfying the constraints, but that this property is not enough for generalization purposes with respect to the objects not directly affected by the constraints. It can also be observed from Table 4 that LCVQE violated a number of constraints smaller than or equal to the number violated by CVQE in 90% of the cases, suggesting that LCVQE's procedure for updating prototypes is better than the one adopted in CVQE. As expected, all algorithms performed better than k-means for the VIO criterion. Table 5 summarizes the average numbers of violated constraints for each algorithm (standard deviations in parentheses). The averages were computed for both the provided sets of constraints (randomly generated according to the procedure previously described) and the deduced sets of constraints. One can see that MPCK-Means presented the best results. Moreover, as the number of constraints increases, the use of a single metric for all clusters presented better results than the use of a particular metric per cluster. Also, LCVQE provided better results than CVQE for both ML and CL constraints.

Table 6 summarizes the results of the significance tests. Only the pairs of algorithms for which statistically significant differences (α = 5%) were found are listed. The first column indicates the investigated criteria (ARI, NMI, and VIO), the second column lists the pairs of algorithms in which the first listed algorithm obtained better results than the second one, and the third column presents the respective sizes of the constraint sets. For instance, MPCK-M obtained better results than CVQE-A with respect to VIO for constraint sets of sizes {5, 50, 00, 5, 50, 75, 00} (see Table 6). For ARI and NMI, statistically significant differences were not observed. For VIO, CVQE has not shown significantly better results than k-means for any number of constraints. It is interesting to note that MPCK-Means obtained significantly better results than CVQE-A in most of the scenarios. However, the same does not hold with respect to LCVQE, which is more computationally efficient than MPCK-Means. Thus, by taking into account all the assessed criteria, LCVQE has shown the best trade-off solutions.

Table 6. Significant differences, Friedman/Nemenyi tests (α = 5%), on UCI datasets (pair in which the first algorithm is better; sizes of constraint sets). No significant differences were found for ARI & NMI. For VIO:
<LCVQE, CVQE-A>: { 75, 00 }
<LCVQE, k-means>: { 5,..., 00 }
<LCVQE-A, k-means>: { 5,..., 50 }
<MPCK-S, CVQE>: { 50, 75, 00, 5 }
<MPCK-S, CVQE-A>: { 5,..., 00 }
<MPCK-S, k-means>: { 5,..., 00 }
<MPCK-M, CVQE>: { 5 }
<MPCK-M, CVQE-A>: { 5, 50, 00,..., 00 }
<MPCK-M, k-means>: { 5,..., 00 }

3.3 Results on Bioinformatics Datasets

As in the previous section, we assessed the studied algorithms according to three criteria: ARI, NMI, and VIO. Tables 7-9 present the relative percentages of wins/ties/losses for each pair of algorithms. For both ARI and NMI, LCVQE obtained equal or better results than CVQE in more than 74% of the cases. LCVQE also violated fewer constraints than CVQE in 55% of the cases. In this sense, note from Table 10 that the performance differences are particularly favorable to LCVQE for CL constraints. The use of a particular metric per cluster (MPCK-M) provided worse results than the use of a single metric for all clusters in about 65% of the cases (with respect to ARI and NMI). However, the use of multiple metrics allowed violating less or the same quantity of constraints in 8% of the cases.
In essence, these results are in accordance with those reported for the UCI datasets, thus suggesting that, for the employed Bioinformatics data, violating fewer constraints does not necessarily lead to more accurate clusterings. Surprisingly, even the standard k-means presented better partitions than MPCK-Means in more than 63% of the cases (considering ARI and NMI). Another interesting observation is that the use of a specific metric per cluster (MPCK-M) resulted in particularly worse results when CL constraints are taken into account (see Table 10). Table 11 presents a summary of the outcomes of the performed statistical tests. For compactness, we only present the pairs of algorithms for which significant differences were observed for both accuracy measures (ARI and NMI). One can see that LCVQE and LCVQE-A presented significantly better results than MPCK-Means, especially when the number of provided constraints is small. Taking the number of violated constraints (VIO) into account, only CVQE has not presented significantly better results than k-means, and LCVQE violated fewer constraints than CVQE-A in most of the cases. Although CVQE and LCVQE have, in general, shown better results than MPCK-Means in our experiments, it is important to keep in mind that, for particular datasets, learning distance metrics can indeed be advantageous. In order to illustrate this aspect, consider Figure 2, which shows the average values of ARI obtained by each algorithm, for different sets of constraints, on the Singh-2002 dataset. It can be seen that MPCK-Means scores well above CVQE and LCVQE, especially when more than 75 constraints were provided. This result indicates that the Euclidean distance is not a good metric for this dataset. As a consequence, CVQE and LCVQE cannot satisfy the constraints while improving the data partitions. This fact becomes even clearer when

9 Table 7. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column ARI for Bioinformatics Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 5.9 /. / / 7.3 / / 0.4 / /. / /. / /.9 / 3.4 LCVQE 5.9 /. / /.3 / / 8.6 / / 3. /.9 8. /. / / 3.0 / 9.8 CVQE-A 38.8 / 7.3 / /.3 / /.9 / /.3 / /. / / 3.3 / 3.4 LCVQE-A 5. / 0.4 / / 8.6 / /.9 / / 3.9 / /.3 / /.9 / 3.4 MPCK-S 3.9 /. / / 3. / /.3 / / 3.9 / / 6.9 / / 0. / 63.4 MPCK-M 4.0 /. / /. / /. / /.3 / / 6.9 / / 0.3 / 7.6 k-means 3.4 /.9 / / 3.0 / / 3.3 / /.9 / / 0. / / 0.3 / 8. Table 8. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column NMI for Bioinformatics Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 3.4 /. / / 7.3 / / 0.4 / /. / /. / /.9 / 39.7 LCVQE 54.3 /. / /.3 / / 8.6 / / 3. / /. / / 3.0 / 5.6 CVQE-A 39. / 7.3 / /.3 / /.9 / /.3 / /. / / 3.3 / 38.8 LCVQE-A 54.8 / 0.4 / / 8.6 / /.9 / / 3.9 / /.3 / /.9 / 7.3 MPCK-S 35.4 /. / / 3. / /.3 / / 3.9 / / 6.9 / / 0. / 64. MPCK-M 4.6 /. / /. / /. / /.3 / / 6.9 / / 0.3 / 74.5 k-means 39.7 /.9 / / 3.0 / / 3.3 / /.9 / / 0. / / 0.3 / 5. Table 9. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column VIO for Bioinformatics Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 0.0 / 34.7 / / 3.4 / / 30.7 / / 5.5 / /.7 / / 3.7 / 5. LCVQE 55.3 / 34.7 / / 30.4 / / 4. / / 34. / / 6.9 / / 3.3 / 0.3 CVQE-A 5.9 / 3.4 / / 30.4 / / 3.5 / / 7.3 / / 3.6 / / 4. / 6.9 LCVQE-A 40.4 / 30.7 / / 4. / / 3.5 / / 38.8 / / 30.5 / / 3.5 / 4.3 MPCK-S 68. / 5.5 / / 34. / / 7.3 / / 38.8 / / 77.5 / / 3. / 0.8 MPCK-M 59.5 /.7 / / 6.9 / / 3.6 / / 30.5 / / 77.5 / / 3.4 / 8.4 k-means 5. / 3.7 / / 3.3 / / 4. / / 3.5 / / 3. / / 3.4 / 88. Table 0. Average Number of Violated Constraints (VIO) Bioinformatics Datasets. Size of set Type Provided Deduced CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M ML. (3.9) 0.4 (0.4) 0.6 (0.7) 0.3 (0.5) 0.6 (0.8) 0.3 (0.4) 0.0 (0.0) 0.0 (0.) CL 3.9 (3.9).3 (.0) 0.8 (.6) 0. (0.3). (.) 0.3 (0.6) 0.0 (0.0) 0.4 (.0) ML.4 (7.9).9 (.9).7 (.9). (.6). (.7).3 (.9) 0. (0.) 0. (0.) CL 7.6 (7.9) 9.8 (8.4).0 (3.6) 0.6 (.0) 4.4 (8.5).6 (3.4) 0. (0.4) 0.5 (.0) ML 33.3 (.6) 4.5 (4.6) 3.3 (3.7).4 (3.) 4. (5.5).8 (4.3) 0. (0.4) 0.5 (0.8) CL 4.7 (.6) 3.7 (0.4) 3. (4.9). (.) 9.3 (7.8) 4. (8.6) 0.6 (.4).7 (.9) ML 44.6 (5.6) 8.6 (9.0) 4.8 (5.6) 3.7 (5.) 6. (7.7) 4.7 (7.) 0.4 (0.5) 0.9 (.4) CL 55.4 (5.6) 44.4 (39.7) 4.9 (6.8).0 (3.) 7.9 (35.4) 8. (7.) 0.9 (.6) 3. (6.5) ML 55.7 (9.6) 3.9 (4.9) 6.7 (7.7) 4.9 (6.4) 9. (0.8) 7.8 (3.0) 0.3 (0.5). (.3) CL 69.3 (9.6) 73.3 (67.) 6.5 (9.).7 (4.3) 8.9 (56.7) 6. (37.) 0.7 (.5) 5.4 (3.3) ML 66.9 (3.9).6 (3.5) 8.7 (0.0) 6.3 (8.).6 (4.9).5 (9.) 0.3 (0.6). (.9) CL 83. (3.9).4 (04.) 8.0 (0.9) 3.5 (5.5) 43.7 (84.0) 6.8 (6.) 0.7 (.8) 8.0 (0.5) ML 78. (7.9) 3.0 (33.8) 0.9 (.8) 7.8 (0.3) 6.8 (0.9) 5. (5.6) 0.3 (0.4).4 (3.) CL 96.9 (7.9) 6.8 (53.0) 9.7 (3.6) 4.4 (6.9) 58. (.3) 39.3 (89.8) 0.5 (.) 6.0 (46.7) ML 89.3 (3.6) 4.3 (45.).6 (4.7) 9. (.9). (8.5) 0. (33.6) 0. (0.).7 (4.4) CL 0.7 (3.6).9 (.6).4 (5.4) 5. (8.) 78.8 (53.9) 56.6 (3.3) 0. (0.3).5 (6.7) 9

Table 11. Statistically significant differences (α = 5%), Bioinformatics datasets (criterion; pair in which the first algorithm is better; sizes of constraint sets).
ARI & NMI: <LCVQE, MPCK-S>: { 5, 50, 75, 00 }; <LCVQE, MPCK-M>: { 5,..., 75 }; <LCVQE-A, MPCK-S>: { 5,..., 5 }; <LCVQE-A, MPCK-M>: { 5,..., 75 }; <LCVQE-A, CVQE>: { 5 }.
VIO: <LCVQE, k-means>: { 5,..., 00 }; <LCVQE, CVQE-A>: { 50,..., 5, 75 }; <LCVQE-A, k-means>: { 5, 50, 75, 00 }; <MPCK-S, CVQE>: { 5, 50, 75, 00 }; <MPCK-S, CVQE-A>: { 5,..., 00 }; <MPCK-S, k-means>: { 5,..., 00 }; <MPCK-M, CVQE-A>: { 5, 50 }; <MPCK-M, k-means>: { 5, 50 }.

Figure 2. ARI values obtained by CVQE, CVQE-A, LCVQE, LCVQE-A, MPCK-S, MPCK-M, and k-means for different numbers of constraints on the Singh-2002 dataset.

VIO is analyzed, e.g., for 200 constraints the average number of violations by CVQE, LCVQE, and MPCK-Means is 86, 57, and 0, respectively. Another interesting observation is that the results of MPCK-S were equal to or better than those obtained by MPCK-M for any number of constraints, again suggesting that the additional computational cost of learning a distance metric for each cluster is unnecessary for these datasets.

3.4 Noisy Constraints

We also assessed the performance of the algorithms in scenarios where some of the constraints are noisy. A related study was performed in [6], where only very small fractions of noisy constraints were considered for a few datasets. Our experimental setting is more representative: we performed experiments on twenty datasets (Table 1) for different fractions of noisy constraints. More specifically, we considered scenarios where 200 constraints were provided. From these constraints, different proportions {5%, 10%, 15%, 20%, 25%, 30%} of noisy constraints were generated by randomly selecting a constraint and changing its type, i.e., turning a ML constraint into a CL one, and vice-versa. This procedure is inspired by the noisy-edge noise model [0]. We ran experiments by taking into account the same (100) trials described in Section 3.1, i.e., for each trial the 200 initial (non-noisy) constraints are exactly as described in the previous experiments.

For the sake of illustration, we show the NMI values obtained for the Iris dataset in Figure 3. The trends depicted in this graph can be considered typical of our experiments with the other datasets. One can observe from this figure that CVQE shows the best behavior. This is expected from the trend observed in our experiments reported in Sections 3.2 and 3.3, which do not involve noisy constraints, namely that it tends to violate more constraints than LCVQE. Also, the algorithms using the augmented set of constraints (CVQE-A, LCVQE-A, and MPCK-Means) presented a more pronounced decrease in performance. Since the augmented set may contain constraints that were derived from noisy ones, this is also expected. An interesting observation is that CVQE-A has shown more robustness to noisy constraints than LCVQE. Finally, Figure 3 also shows that MPCK-Means is less robust with respect to noisy constraints. This can be explained by the fact that it seeks to adapt a distance metric, which in this case can be seen as a projection [5] in a wrong direction, thus resulting in low NMI values.

Figure 3. NMI values obtained by CVQE, CVQE-A, LCVQE, LCVQE-A, MPCK-S, and MPCK-M on the Iris dataset as the percentage of noisy constraints increases.
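The noise-injection procedure described above (randomly flipping the type of a chosen fraction of the provided constraints) can be sketched as follows; the function name and argument conventions are ours.

```python
import numpy as np

def add_constraint_noise(ml, cl, noise_fraction, rng=np.random.default_rng(0)):
    """Flip a random fraction of the constraints: ML becomes CL and vice-versa.
    The flipped constraints are the noisy ones."""
    pool = [('ml', p) for p in ml] + [('cl', p) for p in cl]
    n_noisy = int(round(noise_fraction * len(pool)))
    flip = set(rng.choice(len(pool), size=n_noisy, replace=False).tolist())
    new_ml, new_cl = [], []
    for idx, (kind, pair) in enumerate(pool):
        if idx in flip:
            kind = 'cl' if kind == 'ml' else 'ml'
        (new_ml if kind == 'ml' else new_cl).append(pair)
    return new_ml, new_cl
```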
For a more general perspective, there are two important aspects to be evaluated in our experimental setting. The first one is to identify the algorithms that are more robust to noise. We evaluate this aspect by means of the relative difference between the NMI values obtained with and without noisy

constraints (NMI and ARI presented highly correlated values in our experiments; for this reason, only NMI values are reported here). More precisely, we use the measure in Eq. (17), which captures the average difference, over the T trials performed, between the NMI values obtained by a particular algorithm i, for a given amount of noise n, on dataset d. We use NMI_noisy and NMI to denote the NMI values obtained when comparing the reference partition with the partition obtained with and without noisy constraints, respectively:

\Delta NMI(i, n, d) = \frac{1}{T} \sum_{t=1}^{T} ( NMI(i, t, d) - NMI_{noisy}(i, t, d, n) ).    (17)

The average rankings of the NMI differences are summarized in Table 12, where, for example, the value of 2.00 in the 4th row and 4th column indicates that, after sorting the differences obtained by each algorithm (with 5% of the constraints being noisy), CVQE-A is the second best algorithm (on average). Note that the NMI values are always computed with respect to a reference partition, given by the classes. Thus, it is expected that the greater the impact of the noisy constraints, the higher the NMI difference. From these results, one can observe that CVQE is the most robust.

Table 12. Average rankings of the relative difference of NMI values between partitions obtained with and without noisy constraints, Eq. (17) (columns: % noise, CVQE, LCVQE, CVQE-A, LCVQE-A, MPCK-S, MPCK-M).

The second aspect to be analyzed refers to the algorithms' capability of correctly violating noisy constraints while respecting the remaining ones. To that end, two fractions are added up, namely: (i) the fraction of constraints that were correctly violated (i.e., those that are indeed noisy) and (ii) the fraction of constraints that were correctly satisfied (i.e., those that are not noisy). The resulting measure is analogous to the well-known accuracy. More precisely, consider that algorithm i was run on dataset d, which has a particular proportion of noisy constraints n. The accuracy is then measured by means of Eq. (18), where F, V, and N denote the full set of constraints (F = M ∪ C), the full set of violated constraints (V = V_M ∪ V_C), and the set of noisy constraints, respectively. A superscript t on a set denotes the set of the t-th trial:

Accuracy(i, n, d) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|F^t|} ( \sum_{l \in F^t} 1[l \in V^t] 1[l \in N^t] + \sum_{l \in F^t} 1[l \notin V^t] 1[l \notin N^t] ).    (18)

Table 13 presents the average rankings obtained from the accuracies. One can observe that, considering 5-15% of noisy constraints, MPCK-S is the best algorithm (on average), whereas CVQE is the best one for 20-30% of noisy constraints. The accuracies obtained by MPCK-S are explained by the observation that the constraint sets are imbalanced. Recall from Sections 3.2 and 3.3 that MPCK-Means usually violates a very small (close to zero) number of constraints. Thus, for small proportions of noisy constraints, it is advantageous (from the accuracy viewpoint) to satisfy most of them (no matter whether they are noisy or not). When such an imbalance is lessened (i.e., the proportion of noisy constraints is increased), its accuracies decrease. As expected, the accuracy of LCVQE also deteriorates with the increasing number of noisy constraints, but for a small amount of noise (5%) it is still competitive with CVQE.

Table 13. Average rankings for the accuracies over constraints, Eq. (18) (columns: % noise, CVQE, LCVQE, CVQE-A, LCVQE-A, MPCK-S, MPCK-M).

In summary, LCVQE seems to be the most appropriate algorithm for scenarios where the expected amount of noise in the constraints is low. For noisier sets of constraints, the more computationally demanding CVQE tends to provide better results.
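For a single trial, the accuracy over constraints in Eq. (18) can be computed as in the following sketch, where constraints are identified by their indices; the naming is ours and follows our reconstruction of the equation.

```python
def constraint_accuracy(violated, noisy, all_constraints):
    """Fraction of constraints that are either noisy and violated (correctly
    rejected) or clean and satisfied (correctly kept). Arguments are sets of
    constraint indices."""
    correct = sum(1 for l in all_constraints
                  if (l in violated) == (l in noisy))
    return correct / len(all_constraints)
```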
4 Conclusions

In this study, three well-known algorithms for k-means-based clustering with constraints, namely the Constrained Vector Quantization Error (CVQE) algorithm [7], its variant named LCVQE [6], and the Metric Pairwise Constrained K-Means (MPCK-Means) algorithm [5], were systematically compared. These algorithms are designed to minimize the number of violated constraints in the process of clustering. In addition, MPCK-Means also seeks to learn a distance metric that maximizes the agreement with the constraints. Besides evaluating the algorithms with respect to the number

of violated constraints, two criteria that capture the accuracy of the obtained data partitions have been adopted: the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI). Experiments were performed on 20 datasets of different characteristics. For each dataset, 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance were presented. Computational complexity analyses not available in the original references have also been presented. Thus, we provided a far richer evaluation of all these aspects compared to previous studies.

In order to analyze the obtained results, we grouped the datasets into two subsets. The first subset is formed by UCI datasets, whereas the second contains Bioinformatics data (more precisely, benchmarks for Cancer Gene Expression Data). For both UCI and Bioinformatics datasets, LCVQE has shown to be competitive with CVQE in terms of accuracy, while violating fewer constraints and being more computationally efficient. From this viewpoint, our work confirms the claims made from a more limited set of experiments in [6]. Also, in most of the cases both CVQE and LCVQE presented better accuracy than MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. These results suggest that the Euclidean distance is appropriate for computing dissimilarities between objects for most of the investigated datasets (we normalized the bioinformatics datasets by means of z-scores, thus making the Euclidean distance consistent with the Pearson correlation, which is widely used for such datasets). Although CVQE and LCVQE have, in general, shown better results than MPCK-Means in our experiments, it is important to keep in mind that for particular datasets learning distance metrics can indeed be advantageous. As in [5], our more extensive study corroborates the conclusion that MPCK-Means presents better results when more constraints are available. We shall also observe that essentially the same conclusions can be drawn if one considers the 20 datasets altogether, namely: in most of the cases LCVQE provides the best clusterings and MPCK-Means violates fewer constraints. Overall, by taking into account all the assessed criteria, LCVQE has shown the best trade-off solutions.

Some useful observations can also be made from our experiments with noisy constraints. For instance, CVQE has shown to be robust even in very noisy scenarios (e.g., 30% of noisy constraints). For more well-behaved scenarios (e.g., 5% of noisy constraints), its more computationally efficient counterpart LCVQE has presented a reasonable trade-off between efficiency and accuracy. Finally, a variety of (more specific) new experimental findings emerged from our study. For instance, our results suggest that deduced constraints usually do not help finding better data partitions. In addition, we have observed that ML constraints are harder to satisfy, especially by the algorithms that have found better clusterings. We shall stress that our study was focused on clustering problems, for which the underlying assumption is that both ML and CL constraints should help finding better data partitions. However, from a classification point of view, ML constraints may be misleading.
For instance, if a particular class is formed by more than one cluster, then satisfying ML constraints may imply less accurate classifiers. Therefore, from the different viewpoint of building classifiers from constrained clustering algorithms (e.g., in semi-supervised learning settings), CL constraints tend to be more useful. Also, for these (and related) application scenarios, one should not assume that the number of clusters is equal to the number of classes. Instead, the number of clusters should be optimized from the data, in such a way that ML constraints could also be used to induce better models. From these observations, we believe that there is still room for investigating how algorithms for constrained clustering can help semi-supervised learning.

Acknowledgments

This work has been supported by NSF Grants (IIS and IIS-0664) and by the Brazilian research agencies FAPESP and CNPq.

References

[1] A. Asuncion and D. Newman. UCI machine learning repository. mlearn/MLRepository.html, 2007.
[2] A. Banerjee and J. Ghosh. Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery, 13(3):365-395, November 2006.
[3] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6:937-965, 2005.
[4] S. Basu, I. Davidson, and K. Wagstaff. Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC, 2008.
[5] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), New York, NY, USA, 2004. ACM.
[6] R. J. G. B. Campello, E. R. Hruschka, and V. S. Alves. On the efficiency of evolutionary fuzzy clustering. Journal of Heuristics, 15:43-75, 2009.
[7] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In 5th SIAM International Conference on Data Mining, SIAM, 2005.


More information

Global Metric Learning by Gradient Descent

Global Metric Learning by Gradient Descent Global Metric Learning by Gradient Descent Jens Hocke and Thomas Martinetz University of Lübeck - Institute for Neuro- and Bioinformatics Ratzeburger Allee 160, 23538 Lübeck, Germany hocke@inb.uni-luebeck.de

More information

Relative Constraints as Features

Relative Constraints as Features Relative Constraints as Features Piotr Lasek 1 and Krzysztof Lasek 2 1 Chair of Computer Science, University of Rzeszow, ul. Prof. Pigonia 1, 35-510 Rzeszow, Poland, lasek@ur.edu.pl 2 Institute of Computer

More information

The Projected Dip-means Clustering Algorithm

The Projected Dip-means Clustering Algorithm Theofilos Chamalis Department of Computer Science & Engineering University of Ioannina GR 45110, Ioannina, Greece thchama@cs.uoi.gr ABSTRACT One of the major research issues in data clustering concerns

More information

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Sicheng Xiong for the degree of Master of Science in Computer Science presented on April 25, 2013. Title: Active Learning of Constraints for Semi-Supervised Clustering Abstract

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Semi-supervised Clustering

Semi-supervised Clustering Semi-supervised lustering BY: $\ S - MAI AMLT - 2016/2017 (S - MAI) Semi-supervised lustering AMLT - 2016/2017 1 / 26 Outline 1 Semisupervised lustering 2 Semisupervised lustering/labeled Examples 3 Semisupervised

More information

Fuzzy Segmentation. Chapter Introduction. 4.2 Unsupervised Clustering.

Fuzzy Segmentation. Chapter Introduction. 4.2 Unsupervised Clustering. Chapter 4 Fuzzy Segmentation 4. Introduction. The segmentation of objects whose color-composition is not common represents a difficult task, due to the illumination and the appropriate threshold selection

More information

Statistical Methods in AI

Statistical Methods in AI Statistical Methods in AI Distance Based and Linear Classifiers Shrenik Lad, 200901097 INTRODUCTION : The aim of the project was to understand different types of classification algorithms by implementing

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, Behrouz Minaei Nourabad Mamasani Branch Islamic Azad University Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA)

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) 1 S. ADAEKALAVAN, 2 DR. C. CHANDRASEKAR 1 Assistant Professor, Department of Information Technology, J.J. College of Arts and Science, Pudukkottai,

More information

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Binoda Nand Prasad*, Mohit Rathore**, Geeta Gupta***, Tarandeep Singh**** *Guru Gobind Singh Indraprastha University,

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION

FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION 1 ZUHERMAN RUSTAM, 2 AINI SURI TALITA 1 Senior Lecturer, Department of Mathematics, Faculty of Mathematics and Natural

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Intractability and Clustering with Constraints

Intractability and Clustering with Constraints Ian Davidson davidson@cs.albany.edu S.S. Ravi ravi@cs.albany.edu Department of Computer Science, State University of New York, 1400 Washington Ave, Albany, NY 12222 Abstract Clustering with constraints

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Remco R. Bouckaert 1,2 and Eibe Frank 2 1 Xtal Mountain Information Technology 215 Three Oaks Drive, Dairy Flat, Auckland,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Measuring Constraint-Set Utility for Partitional Clustering Algorithms

Measuring Constraint-Set Utility for Partitional Clustering Algorithms Measuring Constraint-Set Utility for Partitional Clustering Algorithms Ian Davidson 1, Kiri L. Wagstaff 2, and Sugato Basu 3 1 State University of New York, Albany, NY 12222, davidson@cs.albany.edu 2 Jet

More information

Semi-supervised learning

Semi-supervised learning Semi-supervised Learning COMP 790-90 Seminar Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Overview 2 Semi-supervised learning Semi-supervised classification Semi-supervised clustering Semi-supervised

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering

Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining Systems, pp.42-49, Washington DC, August, 2003 Comparing and Unifying Search-Based

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Processing Missing Values with Self-Organized Maps

Processing Missing Values with Self-Organized Maps Processing Missing Values with Self-Organized Maps David Sommer, Tobias Grimm, Martin Golz University of Applied Sciences Schmalkalden Department of Computer Science D-98574 Schmalkalden, Germany Phone:

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

Detecting Clusters and Outliers for Multidimensional

Detecting Clusters and Outliers for Multidimensional Kennesaw State University DigitalCommons@Kennesaw State University Faculty Publications 2008 Detecting Clusters and Outliers for Multidimensional Data Yong Shi Kennesaw State University, yshi5@kennesaw.edu

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Training Digital Circuits with Hamming Clustering

Training Digital Circuits with Hamming Clustering IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 47, NO. 4, APRIL 2000 513 Training Digital Circuits with Hamming Clustering Marco Muselli, Member, IEEE, and Diego

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies. CSE 547: Machine Learning for Big Data Spring 2019 Problem Set 2 Please read the homework submission policies. 1 Principal Component Analysis and Reconstruction (25 points) Let s do PCA and reconstruct

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Approach to Evaluate Clustering using Classification Labelled Data

Approach to Evaluate Clustering using Classification Labelled Data Approach to Evaluate Clustering using Classification Labelled Data by Tuong Luu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, and Behrouz Minaei Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Annotated multitree output

Annotated multitree output Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA

More information

An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm

An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm T.Saranya Research Scholar Snr sons college Coimbatore, Tamilnadu saran2585@gmail.com Dr. K.Maheswari

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION Helena Aidos, Robert P.W. Duin and Ana Fred Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal Pattern Recognition

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

Network community detection with edge classifiers trained on LFR graphs

Network community detection with edge classifiers trained on LFR graphs Network community detection with edge classifiers trained on LFR graphs Twan van Laarhoven and Elena Marchiori Department of Computer Science, Radboud University Nijmegen, The Netherlands Abstract. Graphs

More information

Shared Kernel Models for Class Conditional Density Estimation

Shared Kernel Models for Class Conditional Density Estimation IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 987 Shared Kernel Models for Class Conditional Density Estimation Michalis K. Titsias and Aristidis C. Likas, Member, IEEE Abstract

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information