A Study of K-Means-Based Algorithms for Constrained Clustering


Thiago F. Covões, Eduardo R. Hruschka, Joydeep Ghosh
University of São Paulo (USP) at São Carlos, Brazil
University of Texas (UT) at Austin, USA

Abstract

The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have (partially) compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely the Constrained Vector Quantization Error (CVQE) algorithm, its variant LCVQE, and the Metric Pairwise Constrained K-Means (MPCK-Means) algorithm, are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has shown to be competitive with CVQE, while violating fewer constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy than MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of (more specific) new experimental findings are discussed in the paper, e.g., that deduced constraints usually do not help finding better data partitions.

1 Introduction

Constrained clustering arises from the need to incorporate known information about the desired data partitions into the process of clustering data [4]. There are several types of constraints, e.g., about pairs of objects [0], clusters [], and partitions []. Among these, the most usual ones are constraints on pairs of objects, specifically the Must-Link (ML) and Cannot-Link (CL) constraints [0,, 4, 7, 5, 7, 6]. Considering a set X = {x_i}_{i=1}^N of N objects, each one represented by a vector x_i ∈ R^M, a ML constraint c_=(i, j) indicates that the objects x_i and x_j should lie in the same cluster, whereas a CL constraint c_≠(i, j) indicates that x_i and x_j should lie in different clusters. From a set of ML constraints, new constraints can be deduced by using the transitivity property. Although CL constraints do not have this property, the combination of ML and CL constraints can yield new constraints; e.g., given the constraints c_=(i, j) and c_≠(j, k), the constraint c_≠(i, k) can be deduced. Some algorithms for clustering with constraints [0,, 7] do not allow any violation of the constraints in the process of clustering, i.e., in every iteration of the algorithm the resulting partition must satisfy all the constraints. While this may be interesting in some circumstances, it is important to keep in mind that, in many practical applications, the constraints are usually provided by users who are unaware of the spatial disposition of the data.
Therefore, the need to satisfy all the constraints can make the clustering process intractable, and an empty partition is often returned by such an algorithm. To illustrate this, consider Figure 1, which depicts the petal and sepal areas for the well-known Iris dataset. The three classes are represented by different markers, and the centroid of each class is represented by a dot. The markers enclosed by rectangles correspond to some objects whose nearest centroids are not those of their respective classes. Thus, assuming that every class corresponds to a different cluster, a constrained clustering algorithm based on k-means will not be able to satisfy some pairwise constraints derived from the classes. To overcome this limitation, more flexible algorithms have been developed [5, 7, 6]. These algorithms seek to minimize the number of violated constraints. This way, a partition that agrees as much as possible with the user's constraints can be found. For this reason, such constraints are sometimes

called soft constraints.

Figure 1. Iris data (sepal area vs. petal area): rectangles highlight some objects whose nearest centroids are not those of their respective classes. The classes Setosa, Versicolor, and Virginica and their centroids are shown with different markers.

Despite the increasing number of studies on clustering with constraints, there is a lack of studies providing empirical comparisons among algorithms. Moreover, the adopted experimental methodology frequently differs from one paper to another, e.g., due to the use of different performance measures and numbers of constraints, which makes the (indirect) comparison between algorithms described in different papers virtually impossible. Also, the identification of classes of problems for which a particular algorithm could be preferred is very difficult. This paper presents an extensive comparative analysis of three well-known k-means-based algorithms for constrained clustering, namely the Constrained Vector Quantization Error (CVQE) algorithm [7], its variant called LCVQE [6], and the MPCK-Means algorithm [5]. We performed computational complexity analyses not reported in the original references. From the experimental point of view, three criteria were used in our study: Adjusted Rand Index, Normalized Mutual Information, and number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. The statistical significance of the obtained results was also addressed. It is worth noting that the authors of CVQE, LCVQE, and MPCK-Means employed a relatively limited number of datasets (five on average) and/or sets of constraints (typically three). The accuracies of the obtained partitions were only assessed in [6] and [5]. We also analyzed the robustness of the algorithms with respect to noisy constraints. This issue was briefly analyzed in [6] by considering the case when noisy constraints are rare. Also, the effect of noisy constraints on probabilistic models was studied in [5]. In our experiments, we assess the robustness of each algorithm by considering different proportions of noisy constraints. To sum up, we provide a richer evaluation of all these aspects than previous studies. Finally, the compared algorithms are based on the well-known k-means algorithm [8, ], which is widely used in practice. Thus, by performing experiments with k-means-based algorithms, interesting conclusions can be drawn for a large audience, e.g., k-means users and data mining practitioners potentially interested in this kind of algorithm.

The remainder of this paper is organized as follows. In the next section the algorithms under study are briefly described. Section 3 addresses the methodology adopted to perform comparisons among the algorithms and presents our experimental results. Finally, Section 4 concludes our work.

Notation. A hard partition of the data is a collection P = {C_i}_{i=1}^k of k disjoint clusters such that ∪_{i=1}^k C_i = X, C_i ∩ C_j = ∅ for i ≠ j, and |C_i| ≠ 0 for all i, where |C_i| denotes the number of objects in cluster C_i. Each cluster C_i is represented by a prototype μ_i. The distance between an object x_i and a prototype μ_j is calculated using the squared Euclidean distance, i.e., ||x_i − μ_j||² = (x_i − μ_j)^T (x_i − μ_j), where T denotes transposition. It is assumed that each algorithm takes a set M of ML constraints and a set C of CL constraints.
Using o_M^1(l) and o_M^2(l) to denote the functions that return the first and the second object of the l-th ML constraint, it is possible to define the functions g_M^1(l) and g_M^2(l) that return, respectively, the indices of the clusters to which the first and the second object of the l-th ML constraint belong, i.e., g_M^1(l) = {j | o_M^1(l) ∈ C_j} and g_M^2(l) = {t | o_M^2(l) ∈ C_t}. Similarly, the functions o_C^1(l), o_C^2(l), g_C^1(l), and g_C^2(l) can be defined for CL constraints. The set of violated ML constraints is defined as V_M = {i ∈ M | g_M^1(i) ≠ g_M^2(i)} and, similarly, the set of violated CL constraints is defined as V_C = {i ∈ C | g_C^1(i) = g_C^2(i)}. Finally, 1[Condition] is an indicator function that equals one when the condition is satisfied and zero otherwise.

2 Algorithms

2.1 CVQE

The Constrained Vector Quantization Error (CVQE) algorithm [7] employs the objective function of k-means augmented by two terms that account for the costs of violating the constraints. The costs of violating ML and CL constraints are computed from distances between prototypes. For a ML constraint, the cost is the distance between the prototypes of the clusters that contain the objects that should be in the same cluster. For a CL constraint, the cost is the distance between the prototype of the cluster in which both objects lie and its nearest neighbor prototype (second-closest cluster). More formally, the objective function is defined as:

J_{CVQE} = \sum_{j=1}^{k} J_{CVQE_j},    (1)

J_{CVQE_j} = \sum_{x_i \in C_j} \|\mu_j - x_i\|^2 + \sum_{l \in V_M : g_M^1(l)=j} \|\mu_j - \mu_{g_M^2(l)}\|^2 + \sum_{l \in V_C : g_C^1(l)=j} \|\mu_j - \mu_{h(g_C^1(l))}\|^2,    (2)

where h(i) returns the index of the cluster whose prototype is the nearest to the prototype of the i-th cluster. The CVQE algorithm assigns objects to clusters as follows: (i) objects that are not involved in any constraint are assigned to the closest cluster; (ii) pairs of objects involved in ML and CL constraints are assigned to the clusters that minimize the objective function in Eq. (1). To do so, all possible assignment combinations are verified. The prototypes (μ_j, j = 1, ..., k) are updated according to:

\mu_j = y_j / z_j,    (3)

y_j = \sum_{x_i \in C_j} x_i + \sum_{l \in V_M : g_M^1(l)=j} \mu_{g_M^2(l)} + \sum_{l \in V_C : g_C^1(l)=j} \mu_{h(g_C^1(l))},    (4)

z_j = |C_j| + \sum_{l \in V_M} 1[g_M^1(l)=j] + \sum_{l \in V_C} 1[g_C^1(l)=j].    (5)

This update procedure can be interpreted as follows [7]: if a ML constraint is violated, the prototype of the cluster that contains the first object of the constraint is moved towards the prototype of the cluster containing the second object of the constraint. In case of violation of a CL constraint, the prototype of the cluster containing the two objects in question is moved towards the nearest neighbor prototype of the second object of the constraint. As discussed in [6], CVQE has some drawbacks. First, the algorithm is sensitive to the order of the objects in each constraint. This can be readily seen from the prototype update rule for the case of a ML violation, because only the prototype of the cluster in which the first object lies is affected. Second, checking all possible assignment combinations leads to O(k^2) calculations, which is computationally demanding for applications in which the number of clusters is high. Third, in case of constraint violations, only the distances between prototypes are considered in the penalization, i.e., the positions of the objects relative to these prototypes are ignored. Aimed at circumventing these limitations, a variant of this algorithm, called Linear CVQE (LCVQE), has been presented in [6].
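As an illustration of the update step in Eqs. (3)-(5), the following Python sketch computes the CVQE prototypes for a fixed assignment of objects and given constraint lists. It is only a sketch of the update rule (not of the full assignment procedure), and all function and variable names are ours rather than from [7].

```python
import numpy as np

def cvqe_update(X, labels, mu, ml, cl):
    """One CVQE prototype update (Eqs. 3-5), as a sketch.

    X:      (N, M) data matrix
    labels: (N,) current cluster index of each object
    mu:     (k, M) current prototypes
    ml, cl: lists of (i, j) object-index pairs for ML and CL constraints
    """
    k = mu.shape[0]

    def nearest_other_prototype(j):
        # h(j): index of the prototype closest to prototype j (excluding j itself)
        d = np.sum((mu - mu[j]) ** 2, axis=1)
        d[j] = np.inf
        return int(np.argmin(d))

    y = np.zeros_like(mu, dtype=float)
    z = np.zeros(k)
    for j in range(k):                        # contribution of the cluster members
        members = X[labels == j]
        y[j] = members.sum(axis=0)
        z[j] = len(members)
    for (a, b) in ml:                         # violated ML: move mu[g1] towards mu[g2]
        g1, g2 = labels[a], labels[b]
        if g1 != g2:
            y[g1] += mu[g2]
            z[g1] += 1
    for (a, b) in cl:                         # violated CL: move the shared prototype towards its nearest neighbor prototype
        g1, g2 = labels[a], labels[b]
        if g1 == g2:
            y[g1] += mu[nearest_other_prototype(g1)]
            z[g1] += 1
    z = np.maximum(z, 1)                      # guard against empty clusters in this sketch
    return y / z[:, None]
```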

2.2 LCVQE

The LCVQE algorithm [6] uses a modified version of the objective function in (2). The cost of violating a ML constraint is now based on distances between objects and prototypes; these distances are computed by considering the object of one cluster and the prototype of the other cluster. For a CL constraint, the object of the pair that is farthest from the prototype is determined, and the distance between this object and its nearest neighbor prototype (second-closest cluster) is used as the violation cost. More precisely, the objective function J_{LCVQE} is defined as:

J_{LCVQE} = \sum_{j=1}^{k} J_{LCVQE_j},    (6)

J_{LCVQE_j} = \sum_{x_i \in C_j} \|\mu_j - x_i\|^2 + \sum_{l \in V_M : g_M^1(l)=j} \|\mu_j - o_M^2(l)\|^2 + \sum_{l \in V_M : g_M^2(l)=j} \|\mu_j - o_M^1(l)\|^2 + \sum_{l \in V_C : V(l)=j} \|\mu_j - R_{g_C^1(l)}(l)\|^2.    (7)

The auxiliary functions R_j(l) and V(l) are defined in Equations (8) and (9), respectively. Intuitively, R_j(l) returns the object of the l-th CL constraint that is farthest from μ_j, while V(l) returns the index of the nearest neighbor prototype of the object R_{g_C^1(l)}(l), which is the object of the l-th CL constraint farthest from the prototype of its cluster.

R_j(l) = o_C^1(l) if \|o_C^1(l) - \mu_j\|^2 > \|o_C^2(l) - \mu_j\|^2, and o_C^2(l) otherwise,    (8)

V(l) = \arg\min_{m \in \{1,\dots,k\} \setminus \{g_C^1(l)\}} \|R_{g_C^1(l)}(l) - \mu_m\|^2.    (9)

The assignment of objects to clusters also differs from CVQE. First, every object is assigned to the closest cluster. For each ML constraint being violated, only three assignment possibilities are examined: (i) maintain the violation; (ii) assign the two objects to the cluster whose prototype is the nearest to the first object (o_M^1(l)); (iii) assign the two objects to the cluster whose prototype is the nearest to the second object (o_M^2(l)). For each CL constraint being violated, only two cases are checked: (i) maintain the violation; (ii) keep the object that is closest to the cluster prototype where it is, and assign the farthest object (R_{g_C^1(l)}(l)) to the cluster with the second-closest prototype (V(l)). More formally, the prototypes are updated as:

\mu_j = y_j / z_j,    (10)

y_j = \sum_{x_i \in C_j} x_i + \sum_{l \in V_M : g_M^1(l)=j} o_M^2(l) + \sum_{l \in V_M : g_M^2(l)=j} o_M^1(l) + \sum_{l \in V_C : V(l)=j} R_{g_C^1(l)}(l),    (11)

z_j = |C_j| + \sum_{l \in V_M} (1[g_M^1(l)=j] + 1[g_M^2(l)=j]) + \sum_{l \in V_C} 1[V(l)=j].    (12)

The update rule can be interpreted as follows. Let l be a ML constraint that is being violated, i.e., o_M^1(l) ∈ C_j and o_M^2(l) ∈ C_n with j ≠ n. Then, the prototype μ_j is moved towards the object o_M^2(l) and the prototype μ_n is moved towards the object o_M^1(l). Now consider the case of a CL constraint being violated, i.e., o_C^1(l) ∈ C_j and o_C^2(l) ∈ C_j. Consider also that ||μ_j − o_C^1(l)||² > ||μ_j − o_C^2(l)||² and that μ_n is the second-closest prototype of o_C^1(l). Then, μ_n is moved towards o_C^1(l).
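For concreteness, the sketch below evaluates the violation terms of the LCVQE objective in Eq. (7) for a given partition, using the rules for R and V in Eqs. (8) and (9). Again, this is only an illustrative sketch under our own naming and based on our reading of the reconstructed equations; it does not implement the assignment or update steps.

```python
import numpy as np

def lcvqe_violation_costs(X, labels, mu, ml, cl):
    """Sum of the ML and CL violation terms of Eq. (7) for a fixed partition."""
    def sqdist(a, b):
        return float(np.sum((a - b) ** 2))

    cost = 0.0
    for (a, b) in ml:
        g1, g2 = labels[a], labels[b]
        if g1 != g2:                           # violated ML constraint
            # each prototype is penalized by its distance to the object in the other cluster
            cost += sqdist(mu[g1], X[b]) + sqdist(mu[g2], X[a])
    for (a, b) in cl:
        g1 = labels[a]
        if g1 == labels[b]:                    # violated CL constraint
            # R: the object of the pair that is farthest from the shared prototype
            r = a if sqdist(X[a], mu[g1]) > sqdist(X[b], mu[g1]) else b
            # V: the second-closest prototype to that object
            d = np.sum((mu - X[r]) ** 2, axis=1)
            d[g1] = np.inf
            v = int(np.argmin(d))
            cost += sqdist(mu[v], X[r])
    return cost
```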
2.3 MPCK-Means

The Metric Pairwise Constrained K-Means (MPCK-Means) algorithm [5] seeks to learn a distance metric that best fits the constraints. Specifically, a positive semi-definite matrix A_j is learned for each cluster C_j by parameterizing the Euclidean distance as ||x_i − μ_j||²_{A_j} = (x_i − μ_j)^T A_j (x_i − μ_j). There are other algorithms capable of learning distance metrics from constraints [3, 3]. These algorithms, however, do not perform clustering; instead, a single metric is learned for all clusters, which limits them to clusters with similar shapes [5]. The MPCK-Means objective function is:

J_{mpckm} = \sum_{j=1}^{k} \sum_{x_i \in C_j} [ \|x_i - \mu_j\|^2_{A_j} - \log(\det(A_j)) ] + \sum_{l \in V_M} w_l f_{ML}(l) + \sum_{l \in V_C} \bar{w}_l f_{CL}(l),    (13)

f_{ML}(l) = \|o_M^1(l) - o_M^2(l)\|^2_{A_{g_M^1(l)}} + \|o_M^1(l) - o_M^2(l)\|^2_{A_{g_M^2(l)}},    (14)

f_{CL}(l) = \|x'_{g_C^1(l)} - x''_{g_C^1(l)}\|^2_{A_{g_C^1(l)}} - \|o_C^1(l) - o_C^2(l)\|^2_{A_{g_C^1(l)}},    (15)

where log(det(A_j)) arises from the normalizing constant of a more general k-means model (the k-means algorithm can be seen as a particular case of the well-known EM algorithm for learning Gaussian Mixture Models; in this case, the Gaussian representing a cluster C_j has a covariance matrix A_j), and w_l and \bar{w}_l are user-defined weights that penalize the violation of the l-th ML constraint and the l-th CL constraint, respectively. As in [5], we set w_l = \bar{w}_l = 1. The norm \|x'_{g_C^1(l)} - x''_{g_C^1(l)}\|^2_{A_{g_C^1(l)}} represents the maximum distance between two objects under the distance metric of cluster C_{g_C^1(l)}. The auxiliary functions f_{ML} and f_{CL} compute the penalties for violating the l-th ML and CL constraints, respectively. The former, f_{ML}, penalizes the violation of a ML constraint proportionally to the distance between the objects. As a violation of this type involves two clusters, the distance between the two objects in question is computed under two distance metrics, one from each cluster. The latter, f_{CL}, penalizes the violation of a CL constraint inversely proportionally to the distance between the two objects in question.

The MPCK-Means algorithm uses a heuristic method for initializing the prototypes from the provided constraints. Initially, it deduces all the possible ML and CL constraints from the original set of constraints. To simplify the notation, the set M will here refer to the set of ML constraints, both provided and deduced; similarly, the set C will represent CL constraints. Then, the connected components of a graph, obtained by considering the objects as vertices and each ML constraint as an edge, are found. Such connected components form the set of neighborhoods, Λ. If |Λ| ≤ k, |Λ| prototypes are initialized as the centroids of the objects that form each neighborhood λ_i, and the k − |Λ| remaining prototypes are initialized as the overall data mean plus random noise. If |Λ| > k, k neighborhoods are chosen by using a weighted variant of the farthest-first algorithm [], whose weights correspond to the number of objects in each neighborhood. Thus, this initialization procedure is biased towards distant neighborhoods representing a large number of objects [5].
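The sketch below illustrates the spirit of this initialization: neighborhoods are obtained as connected components of the ML graph (here via a small union-find), and prototypes are taken as their centroids, with the remaining prototypes placed at the overall mean plus noise. For simplicity, the weighted farthest-first selection used when |Λ| > k is replaced by simply keeping the k largest neighborhoods; all names are ours, and this is not the implementation of [5].

```python
import numpy as np

def ml_neighborhoods(n_objects, ml):
    """Neighborhoods = connected components of the ML graph (union-find sketch)."""
    parent = list(range(n_objects))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for (a, b) in ml:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n_objects):
        groups.setdefault(find(i), []).append(i)
    # keep only components actually tied together by at least one ML constraint
    return [idx for idx in groups.values() if len(idx) > 1]

def init_prototypes(X, ml, k, rng=np.random.default_rng(0)):
    """Neighborhood-centroid initialization; the weighted farthest-first step is
    replaced here by keeping the k largest neighborhoods (a simplification)."""
    hoods = sorted(ml_neighborhoods(len(X), ml), key=len, reverse=True)[:k]
    protos = [X[idx].mean(axis=0) for idx in hoods]
    while len(protos) < k:                   # remaining prototypes: global mean plus noise
        protos.append(X.mean(axis=0) + 0.01 * rng.standard_normal(X.shape[1]))
    return np.vstack(protos)
```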

In order to assign objects to clusters, MPCK-Means uses a strategy that is sensitive to the data presentation order. In brief, objects are assigned to the clusters that minimize the cost in (13). Once an object x_i has been assigned to a given cluster, this choice is not revised in the respective iteration, i.e., it cannot be changed, and it is only taken into account when checking for possible violations involving other objects and when assessing the impact of subsequent assignments on the objective function. To alleviate this problem, the objects are processed in a random order at each iteration. The prototypes are updated according to the k-means rule, and each matrix A_j is updated according to (16), for which additional care must be taken [5]. In particular, it is initially necessary to check whether the sum of the covariance matrices on the right-hand side of (16) is singular. If so, a fraction of the trace is added to the main diagonal, i.e., A_j = A_j + ε tr(A_j) I. If the matrix A_j resulting from the inversion is not positive semi-definite, it is necessary to project it onto the set of positive semi-definite matrices to ensure that it can parameterize a distance metric [5]. This is accomplished by the procedure described in [3], which consists of decomposing the matrix A_j = X^T Λ X, where X is the matrix of eigenvectors of A_j and Λ is the diagonal matrix with the eigenvalues of A_j. After the decomposition, the eigenvalues in Λ that are smaller than zero are replaced by 0, and then A_j is reconstructed (a small sketch of this repair step is given at the end of Section 2.4).

A_j = |C_j| ( \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^T + \sum_{l \in V_M : g_M^1(l)=j} w_l (o_M^1(l) - o_M^2(l))(o_M^1(l) - o_M^2(l))^T + \sum_{l \in V_M : g_M^2(l)=j} w_l (o_M^1(l) - o_M^2(l))(o_M^1(l) - o_M^2(l))^T + \sum_{l \in V_C : g_C^1(l)=j} \bar{w}_l [ (x'_{g_C^1(l)} - x''_{g_C^1(l)})(x'_{g_C^1(l)} - x''_{g_C^1(l)})^T - (o_C^1(l) - o_C^2(l))(o_C^1(l) - o_C^2(l))^T ] )^{-1}    (16)

2.4 Computational Complexity

We studied the computational costs (per iteration) of CVQE, LCVQE, and MPCK-Means; here we provide a summary of our analyses. Recall that k, N, and M are the numbers of clusters, objects, and attributes, respectively, and that M and C are the sets of ML and CL constraints. The computational complexity of CVQE is O(kM(N + |M| + |C|) + k^2(M + |M| + |C|)). Thus, like k-means, it has linear complexity in the numbers of objects and attributes. However, differently from k-means, CVQE has quadratic complexity in k. The computational complexity of LCVQE is O(kM(N + |M| + |C|)), so it is more computationally efficient than CVQE with regard to k. Finally, the MPCK-Means complexity is O(kM(|M| + |C|) + M^2(|M| + |C|) + kM^3 + kNM^2 + N^2 M), where the cubic complexity in the number of attributes comes from the computation of determinants and the eigendecomposition of each matrix A_j, and the quadratic complexity in N is due to the updates of the pair of farthest objects for each metric.
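As mentioned at the end of Section 2.3, the learned matrices may need to be repaired before they can parameterize a metric. The sketch below illustrates the two repairs (trace regularization of a singular matrix and projection onto the positive semi-definite cone via eigendecomposition); it is a schematic illustration with our own naming and a default ε, not the exact procedure of [5].

```python
import numpy as np

def make_valid_metric(A, eps=1e-3):
    """Repair a symmetric candidate metric matrix so it can parameterize a distance."""
    # If A is singular, add a fraction of its trace to the main diagonal.
    if np.linalg.matrix_rank(A) < A.shape[0]:
        A = A + eps * np.trace(A) * np.eye(A.shape[0])
    # Project onto the PSD cone: eigendecompose, clamp negative eigenvalues to zero, rebuild.
    w, V = np.linalg.eigh(A)          # A = V diag(w) V^T for symmetric A
    w = np.clip(w, 0.0, None)
    return V @ np.diag(w) @ V.T
```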
3 Empirical Evaluation

3.1 Methodology

In order to compare the algorithms CVQE, LCVQE, and MPCK-Means, experiments were performed on two sets of datasets, as reported in Table 1. The first set is composed of 10 datasets commonly used as benchmarks in the literature. Most of them are available at the UCI Repository []. In addition, we have used the 9Gauss dataset [6], which is formed by nine balanced clusters arranged according to Gaussian distributions with some degree of overlap, as well as the Protein dataset [5]. Following [5], for the Letters dataset only classes I, J, and L were considered and, for Pendigits, only classes 3, 8, and 9 were considered; according to [5], these classes represent difficult classification problems. The second set, denoted Bioinformatics for simplicity, is composed of 10 datasets that are benchmarks for the Cancer Gene Expression Data domain [8]. We selected only datasets containing more than 100 objects, so that a considerable number of constraints could be generated. For the experiments with gene expression datasets, the objects were normalized using z-scores [4]. The main characteristics of the employed datasets are given in Table 1.

The quality of the obtained partitions is assessed by means of two well-known external validity criteria, namely the Adjusted Rand Index (ARI) [3] and the Normalized Mutual Information (NMI) [9]. All algorithms were executed with the a priori known number of clusters. For the UCI datasets, it is assumed that classes correspond to clusters. Sets of constraints with eight different sizes were generated so that |M| + |C| ∈ R = {25, 50, 75, 100, 125, 150, 175, 200}. Each constraint was generated as follows: let i and j be two distinct integers randomly chosen from the range [1, N]. The labels of the objects x_i and x_j are verified; if they are the same, a ML constraint is added, otherwise a CL constraint is added. If the constraint has already been added (possibly with the reversed order of objects), it is discarded. CVQE and LCVQE were implemented in Matlab, whereas a Java implementation available at ml/risc/code/ was used for running the experiments with MPCK-Means.
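A possible reading of this constraint-generation procedure is sketched below in Python; the function name and the use of a fixed random seed are ours.

```python
import numpy as np

def random_constraints(y, n_constraints, rng=np.random.default_rng(0)):
    """Draw distinct object pairs from class labels y: a pair becomes a ML
    constraint if the labels agree and a CL constraint otherwise; pairs already
    generated (in either order) are discarded."""
    ml, cl, seen = [], [], set()
    n = len(y)
    while len(ml) + len(cl) < n_constraints:
        i, j = rng.choice(n, size=2, replace=False)
        key = frozenset((int(i), int(j)))
        if key in seen:
            continue                      # duplicate pair, possibly with reversed order
        seen.add(key)
        (ml if y[i] == y[j] else cl).append((int(i), int(j)))
    return ml, cl
```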

Table 1. Datasets used in the experiments (for each dataset: number of objects N, number of attributes M, and number of clusters k). UCI set: Breast Cancer, E-coli, Ionosphere, Iris, Letters, Pendigits, Pima, Wine, 9Gauss, and Protein. Bioinformatics set: Bhattacharjee, Chen, Chowdary, Gordon, Lapointe-2004, Ramaswamy, Singh, Tomlins, Yeoh-2002-v1, and Yeoh-2002-v2.

For each number of constraints, T = 100 repetitions were performed, thus totalling 800 sets of constraints for each dataset. All the studied algorithms are based on k-means, which is known to be sensitive to initialization. Therefore, each algorithm was run five times. For both CVQE and LCVQE, the initializations differ in the objects randomly selected to be the initial prototypes, whereas for MPCK-Means the initializations differ in the point coordinates obtained through the addition of random noise. The best partition obtained from the different initializations is determined by the value of the objective function of each algorithm, thus simulating a practical application of the algorithms being compared. We have used the standard k-means algorithm as a baseline for assessing the impact of incorporating constraints into the clustering process.

Original and deduced constraints were considered when assessing the performance of CVQE and LCVQE. If only the provided constraints are considered, the respective results are referenced by the algorithm names. If the original set of provided constraints is augmented by deduced constraints, the corresponding results are referenced as CVQE-A and LCVQE-A. By comparing these different approaches, it is possible to assess to what extent the deduced constraints can improve the quality of the obtained partitions. MPCK-Means uses the deduced constraints as part of its initialization; as a consequence, both provided and deduced constraints are used when running this algorithm. We also investigate whether learning a particular distance metric per cluster provides better results than learning a single metric for all clusters. The algorithm that learns multiple metrics is denoted by MPCK-M; analogously, MPCK-S denotes the algorithm that learns a single metric for all clusters.

In order to provide some reassurance about the validity and non-randomness of the obtained results, we present the results of statistical tests following the approach proposed by Demšar [9]. In brief, this approach is aimed at comparing multiple algorithms on multiple datasets, and it is based on the well-known Friedman test with a corresponding post-hoc test. The Friedman test is a non-parametric counterpart of the well-known ANOVA. If the null hypothesis, which states that the algorithms under study have similar performances, is rejected, then we proceed with the Nemenyi post-hoc test for pair-wise comparisons between algorithms. Statistical tests were performed individually for every investigated criterion, namely ARI, NMI, and number of violated constraints (VIO). The tests were performed separately for each number of constraints in R = {25, 50, 75, 100, 125, 150, 175, 200}, by averaging the values obtained over the T = 100 repetitions.

3.2 Results on UCI Datasets

The results obtained for the ARI and NMI criteria are summarized by means of the percentage of wins/ties/losses for each pair of algorithms in Tables 2 and 3, respectively (considering every analyzed case).
The overall number of cases is 8,000 (10 · |R| · T). For example, the value 45.6/0.5/43.9 in the 2nd row and 3rd column of Table 2 means that CVQE presented ARI values greater than those obtained by LCVQE in 45.6% of the cases, equal values in 0.5% of the cases, and smaller values in 43.9% of the cases. From these tables, one can see that the use of deduced constraints by CVQE and LCVQE resulted in a similar number of gains and losses compared to using only the provided, original constraints. For example, considering the ARI, CVQE-A showed better results than CVQE in 39.4% of the cases and worse results in 37.3% of the cases. This and other similar results show that increasing the number of constraints (especially if these are deduced from known constraints) does not necessarily yield better data partitions. Another interesting observation is that MPCK-S performed equal to or better than CVQE-A and LCVQE-A in more than 55% of the cases, while MPCK-M has not presented better results than LCVQE-A. Comparing CVQE to LCVQE, it can be noted that CVQE obtained better results in a slightly larger number of cases. Considering the two versions of MPCK-Means, learning a single metric for all clusters (MPCK-S) showed equal or better results than learning a particular metric per cluster (MPCK-M) in 66% of the cases. As expected, all algorithms provided better results than k-means. Table 4 summarizes the percentage of wins/ties/losses for the number of violated constraints (VIO) in the partitions found by each algorithm. The number of constraints violated by MPCK-Means is smaller than or equal to the number violated by CVQE-A and LCVQE-A in more

7 Table. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column ARI for UCI Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 45.6 / 0.5 / / 3.3 / / 0.0 / /.4 / / 0. / / 0.5 / 30.8 LCVQE 43.9 / 0.5 / / 9.5 / / 5.3 / /.4 / / 0. / / 0.5 / 4.7 CVQE-A 39.4 / 3.3 / / 9.5 / / 0. / /.3 / / 0.3 / / 0.5 / 3.7 LCVQE-A 44.9 / 0.0 / / 5.3 / / 0. / /.3 / / 0.3 / / 0.6 / 4. MPCK-S 58.6 /.4 / /.4 / /.3 / /.3 / / 4.4 / / 0.3 / 8.8 MPCK-M 5.8 / 0. / / 0. / / 0.3 / / 0.3 / / 4.4 / / 0.0 / 36.5 k-means 30.8 / 0.5 / / 0.5 / / 0.5 / / 0.6 / / 0.3 / / 0.0 / 63.4 Table 3. Win/Tie/Loss (%) w.r.t. Algorithm in st Column NMI for UCI Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 45.3 / 0.5 / / 3.4 / / 0.0 / /.4 / / 0. / / 0.5 / 4.6 LCVQE 44. / 0.5 / / 9.5 / / 5.4 / /.4 / / 0. / / 0.5 / 35.3 CVQE-A 38.7 / 3.4 / / 9.5 / / 0. / /.3 / / 0.3 / / 0.5 / 43.0 LCVQE-A 45. / 0.0 / / 5.4 / / 0. / /.3 / / 0.3 / / 0.6 / 35. MPCK-S 6. /.4 / /.4 / /.3 / /.3 / / 4.4 / / 0.3 / 30.6 MPCK-M 48.7 / 0. / / 0. / / 0.3 / / 0.3 / / 4.4 / / 0.0 / 47.6 k-means 4.6 / 0.5 / / 0.5 / / 0.5 / / 0.6 / / 0.3 / / 0.0 / 5.4 Table 4. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column VIO for UCI Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 8.5 / 7.7 / / 37.8 / / 6.0 / /. / / 9.3 / /.9 /.0 LCVQE 63.8 / 7.7 / / 5.7 / / 45.0 / / 33.8 / / 3. / /.7 / 0.0 CVQE-A 3.4 / 37.8 / / 5.7 / / 6.7 / / 4. / /.0 / /.4 /.9 LCVQE-A 54.4 / 6.0 / / 45.0 / / 6.7 / / 38.0 / / 35.0 / /.0 / 3.6 MPCK-S 74.8 /. / / 33.8 / / 4. / / 38.0 / / 80.0 / /.0 / 0.0 MPCK-M 73.6 / 9.3 / / 3. / /.0 / / 35.0 / / 80.0 / /.3 / 0.7 k-means.0 /.9 / /.7 / /.4 / /.0 / /.0 / /.3 / 97.0 Table 5. Average Number of Violated Constraints (VIO) UCI Datasets. Size of set Type Provided Deduced CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M ML 6.0 (3.) 0. (0.) 0.7 (0.8) 0. (0.) 0.7 (0.9) 0. (0.) 0.0 (0.0) 0.0 (0.0) CL 9.0 (3.).4 (.0) 0.8 (0.9) 0. (0.) 0.9 (.0) 0. (0.) 0.0 (0.0) 0.0 (0.0) ML.0 (6.3) 0.4 (0.5).6 (.7) 0.6 (0.6).7 (.9) 0.6 (0.6) 0. (0.) 0. (0.) CL 38.0 (6.3) 5.8 (4.6).7 (.0) 0.4 (0.5).3 (.4) 0.6 (0.8) 0.0 (0.0) 0.0 (0.0) ML 8. (9.6). (.3).9 (.9). (.3) 3. (3.5).3 (.5) 0. (0.) 0. (0.) CL 56.8 (9.6) 3.6 (0.8) 3. (3.0) 0.8 (0.9) 4.5 (4.4).5 (.9) 0.0 (0.0) 0. (0.) ML 4.0 (.8).0 (.3) 4. (4.4). (.) 4.7 (5.4). (.5) 0. (0.) 0.3 (0.) CL 76.0 (.8) 4.9 (0.) 4.4 (4.0).5 (.7) 7. (6.8) 3. (4.) 0. (0.) 0. (0.) ML 30. (6.6) 3. (3.9) 5.8 (6.0) 3. (3.4) 6.9 (8.) 3.5 (4.4) 0.3 (0.5) 0.5 (0.4) CL 94.9 (6.6) 39.5 (3.7) 5.8 (5.5). (.4) 0.9 (0.) 4.9 (6.9) 0. (0.4) 0. (0.3) ML 36.0 (9.6) 4.8 (5.7) 7.3 (7.5) 4. (4.4) 9. (0.4) 4.8 (6.0) 0.4 (0.7) 0.7 (0.7) CL 4.0 (9.6) 57.9 (49.0) 7. (6.8).9 (3.3) 5.6 (4.6) 7.5 (0.) 0.3 (0.8) 0.5 (0.6) ML 4. (3.) 6.6 (7.9) 9. (9.6) 5.5 (6.0).7 (3.4) 6.6 (8.0) 0.5 (0.8) 0.8 (0.9) CL 3.8 (3.) 8.5 (7.) 8.8 (8.) 3.6 (4.).4 (.4) 0.6 (4.6) 0.5 (.) 0.8 (.) ML 48. (6.0) 8.7 (0.3) 0.7 (.) 6.7 (7.4) 4. (6.) 8.7 (0.7) 0.8 (.).3 (.4) CL 5.8 (6.0) 09. (96.5) 0. (9.7) 4.5 (5.) 7. (7.0) 5.0 (.) 0.9 (.).8 (.8) 7

than 96% of the cases, thus suggesting that MPCK-Means can indeed learn a metric suitable for satisfying the constraints, but that this property is not enough for generalization purposes with respect to the objects not directly affected by the constraints. It can also be observed from Table 4 that LCVQE violated a number of constraints smaller than or equal to the number violated by CVQE in 90% of the cases, suggesting that LCVQE's procedure for updating prototypes is better than the one adopted in CVQE. As expected, all algorithms performed better than k-means for the VIO criterion. Table 5 summarizes the average numbers of violated constraints for each algorithm (standard deviations in parentheses). The averages were computed for both the provided sets of constraints (randomly generated according to the procedure previously described) and the deduced sets of constraints. One can see that MPCK-Means presented the best results. Moreover, as the number of constraints increases, the use of a single metric for all clusters presented better results than the use of a particular metric per cluster. Also, LCVQE provided better results than CVQE for both ML and CL constraints.

Table 6 summarizes the results of the significance tests. Only the pairs of algorithms for which statistically significant differences (α = 5%) were found are listed. The first column indicates the investigated criteria (ARI, NMI, and VIO), the second column lists the pairs of algorithms in which the first listed algorithm obtained better results than the second one, and the third column presents the respective sizes of the constraint sets. For instance, MPCK-M obtained better results than CVQE-A with respect to VIO for constraint sets of sizes {5, 50, 00, 5, 50, 75, 00} (see Table 6). For ARI and NMI, statistically significant differences were not observed. For VIO, CVQE has not shown significantly better results than k-means for any number of constraints. It is interesting to note that MPCK-Means obtained significantly better results than CVQE-A in most of the scenarios. However, the same does not hold with respect to LCVQE, which is more computationally efficient than MPCK-Means. Thus, by taking into account all the assessed criteria, LCVQE has shown the best trade-off solutions.

Table 6. Significant differences, Friedman/Nemenyi tests (α = 5%), on UCI datasets (pair in which the first algorithm is better; sizes of constraint sets). No significant differences were found for ARI & NMI. For VIO:
<LCVQE, CVQE-A>: { 75, 00 }
<LCVQE, k-means>: { 5,..., 00 }
<LCVQE-A, k-means>: { 5,..., 50 }
<MPCK-S, CVQE>: { 50, 75, 00, 5 }
<MPCK-S, CVQE-A>: { 5,..., 00 }
<MPCK-S, k-means>: { 5,..., 00 }
<MPCK-M, CVQE>: { 5 }
<MPCK-M, CVQE-A>: { 5, 50, 00,..., 00 }
<MPCK-M, k-means>: { 5,..., 00 }

3.3 Results on Bioinformatics Datasets

As in the previous section, we assessed the studied algorithms according to three criteria: ARI, NMI, and VIO. Tables 7-9 present the relative percentages of wins/ties/losses for each pair of algorithms. For both ARI and NMI, LCVQE obtained equal or better results than CVQE in more than 74% of the cases. LCVQE also violated fewer constraints than CVQE in 55% of the cases. In this sense, note from Table 10 that the performance differences are particularly favorable to LCVQE for CL constraints. The use of a particular metric per cluster (MPCK-M) provided worse results than the use of a single metric for all clusters in about 65% of the cases (with respect to ARI and NMI). However, the use of multiple metrics allowed violating less or the same quantity of constraints in 8% of the cases.
In essence, these results are in accordance with those reported for the UCI datasets, thus suggesting that, for the employed Bioinformatics data, violating fewer constraints does not necessarily lead to more accurate clusterings. Surprisingly, even the standard k-means presented better partitions than MPCK-Means in more than 63% of the cases (considering ARI and NMI). Another interesting observation is that the use of a specific metric per cluster (MPCK-M) resulted in particularly worse results when CL constraints are taken into account (see Table 10). Table 11 presents a summary of the outcomes of the performed statistical tests. For compactness, we only present the pairs of algorithms for which significant differences were observed for both accuracy measures (ARI and NMI). One can see that LCVQE and LCVQE-A presented significantly better results than MPCK-Means, especially when the number of provided constraints is small. Taking the number of violated constraints (VIO) into account, only CVQE has not presented significantly better results than k-means, and LCVQE violated fewer constraints than CVQE-A in most of the cases. Although CVQE and LCVQE have, in general, shown better results than MPCK-Means in our experiments, it is important to keep in mind that, for particular datasets, learning distance metrics can indeed be advantageous. In order to illustrate this aspect, consider Figure 2, which shows the average values of ARI obtained by each algorithm, for different sets of constraints, on the Singh-2002 dataset. It can be seen that MPCK-Means scores well above CVQE and LCVQE, especially when more than 75 constraints were provided. This result indicates that the Euclidean distance is not a good metric for this dataset. As a consequence, CVQE and LCVQE cannot satisfy the constraints while improving the data partitions. This fact becomes even clearer when

9 Table 7. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column ARI for Bioinformatics Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 5.9 /. / / 7.3 / / 0.4 / /. / /. / /.9 / 3.4 LCVQE 5.9 /. / /.3 / / 8.6 / / 3. /.9 8. /. / / 3.0 / 9.8 CVQE-A 38.8 / 7.3 / /.3 / /.9 / /.3 / /. / / 3.3 / 3.4 LCVQE-A 5. / 0.4 / / 8.6 / /.9 / / 3.9 / /.3 / /.9 / 3.4 MPCK-S 3.9 /. / / 3. / /.3 / / 3.9 / / 6.9 / / 0. / 63.4 MPCK-M 4.0 /. / /. / /. / /.3 / / 6.9 / / 0.3 / 7.6 k-means 3.4 /.9 / / 3.0 / / 3.3 / /.9 / / 0. / / 0.3 / 8. Table 8. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column NMI for Bioinformatics Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 3.4 /. / / 7.3 / / 0.4 / /. / /. / /.9 / 39.7 LCVQE 54.3 /. / /.3 / / 8.6 / / 3. / /. / / 3.0 / 5.6 CVQE-A 39. / 7.3 / /.3 / /.9 / /.3 / /. / / 3.3 / 38.8 LCVQE-A 54.8 / 0.4 / / 8.6 / /.9 / / 3.9 / /.3 / /.9 / 7.3 MPCK-S 35.4 /. / / 3. / /.3 / / 3.9 / / 6.9 / / 0. / 64. MPCK-M 4.6 /. / /. / /. / /.3 / / 6.9 / / 0.3 / 74.5 k-means 39.7 /.9 / / 3.0 / / 3.3 / /.9 / / 0. / / 0.3 / 5. Table 9. Win/Tie/Loss (%) w.r.t. Algorithm in the st Column VIO for Bioinformatics Datasets. Algorithm CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M k-means CVQE 0.0 / 34.7 / / 3.4 / / 30.7 / / 5.5 / /.7 / / 3.7 / 5. LCVQE 55.3 / 34.7 / / 30.4 / / 4. / / 34. / / 6.9 / / 3.3 / 0.3 CVQE-A 5.9 / 3.4 / / 30.4 / / 3.5 / / 7.3 / / 3.6 / / 4. / 6.9 LCVQE-A 40.4 / 30.7 / / 4. / / 3.5 / / 38.8 / / 30.5 / / 3.5 / 4.3 MPCK-S 68. / 5.5 / / 34. / / 7.3 / / 38.8 / / 77.5 / / 3. / 0.8 MPCK-M 59.5 /.7 / / 6.9 / / 3.6 / / 30.5 / / 77.5 / / 3.4 / 8.4 k-means 5. / 3.7 / / 3.3 / / 4. / / 3.5 / / 3. / / 3.4 / 88. Table 0. Average Number of Violated Constraints (VIO) Bioinformatics Datasets. Size of set Type Provided Deduced CVQE LCVQE CVQE-A LCVQE-A MPCK-S MPCK-M ML. (3.9) 0.4 (0.4) 0.6 (0.7) 0.3 (0.5) 0.6 (0.8) 0.3 (0.4) 0.0 (0.0) 0.0 (0.) CL 3.9 (3.9).3 (.0) 0.8 (.6) 0. (0.3). (.) 0.3 (0.6) 0.0 (0.0) 0.4 (.0) ML.4 (7.9).9 (.9).7 (.9). (.6). (.7).3 (.9) 0. (0.) 0. (0.) CL 7.6 (7.9) 9.8 (8.4).0 (3.6) 0.6 (.0) 4.4 (8.5).6 (3.4) 0. (0.4) 0.5 (.0) ML 33.3 (.6) 4.5 (4.6) 3.3 (3.7).4 (3.) 4. (5.5).8 (4.3) 0. (0.4) 0.5 (0.8) CL 4.7 (.6) 3.7 (0.4) 3. (4.9). (.) 9.3 (7.8) 4. (8.6) 0.6 (.4).7 (.9) ML 44.6 (5.6) 8.6 (9.0) 4.8 (5.6) 3.7 (5.) 6. (7.7) 4.7 (7.) 0.4 (0.5) 0.9 (.4) CL 55.4 (5.6) 44.4 (39.7) 4.9 (6.8).0 (3.) 7.9 (35.4) 8. (7.) 0.9 (.6) 3. (6.5) ML 55.7 (9.6) 3.9 (4.9) 6.7 (7.7) 4.9 (6.4) 9. (0.8) 7.8 (3.0) 0.3 (0.5). (.3) CL 69.3 (9.6) 73.3 (67.) 6.5 (9.).7 (4.3) 8.9 (56.7) 6. (37.) 0.7 (.5) 5.4 (3.3) ML 66.9 (3.9).6 (3.5) 8.7 (0.0) 6.3 (8.).6 (4.9).5 (9.) 0.3 (0.6). (.9) CL 83. (3.9).4 (04.) 8.0 (0.9) 3.5 (5.5) 43.7 (84.0) 6.8 (6.) 0.7 (.8) 8.0 (0.5) ML 78. (7.9) 3.0 (33.8) 0.9 (.8) 7.8 (0.3) 6.8 (0.9) 5. (5.6) 0.3 (0.4).4 (3.) CL 96.9 (7.9) 6.8 (53.0) 9.7 (3.6) 4.4 (6.9) 58. (.3) 39.3 (89.8) 0.5 (.) 6.0 (46.7) ML 89.3 (3.6) 4.3 (45.).6 (4.7) 9. (.9). (8.5) 0. (33.6) 0. (0.).7 (4.4) CL 0.7 (3.6).9 (.6).4 (5.4) 5. (8.) 78.8 (53.9) 56.6 (3.3) 0. (0.3).5 (6.7) 9

Table 11. Statistically significant differences (α = 5%), Bioinformatics datasets (criterion; pair in which the first algorithm is better; sizes of constraint sets).
ARI & NMI: <LCVQE, MPCK-S>: { 5, 50, 75, 00 }; <LCVQE, MPCK-M>: { 5,..., 75 }; <LCVQE-A, MPCK-S>: { 5,..., 5 }; <LCVQE-A, MPCK-M>: { 5,..., 75 }; <LCVQE-A, CVQE>: { 5 }.
VIO: <LCVQE, k-means>: { 5,..., 00 }; <LCVQE, CVQE-A>: { 50,..., 5, 75 }; <LCVQE-A, k-means>: { 5, 50, 75, 00 }; <MPCK-S, CVQE>: { 5, 50, 75, 00 }; <MPCK-S, CVQE-A>: { 5,..., 00 }; <MPCK-S, k-means>: { 5,..., 00 }; <MPCK-M, CVQE-A>: { 5, 50 }; <MPCK-M, k-means>: { 5, 50 }.

Figure 2. ARI values obtained by CVQE, CVQE-A, LCVQE, LCVQE-A, MPCK-S, MPCK-M, and k-means for different numbers of constraints on the Singh-2002 dataset.

VIO is analyzed, e.g., for 200 constraints the average number of violations by CVQE, LCVQE, and MPCK-Means is 86, 57, and 0, respectively. Another interesting observation is that the results of MPCK-S were equal to or better than those obtained by MPCK-M for any number of constraints, again suggesting that the additional computational cost of learning a distance metric for each cluster is unnecessary for these datasets.

3.4 Noisy Constraints

We also assessed the performance of the algorithms in scenarios where some of the constraints are noisy. A related study was performed in [6], where only very small fractions of noisy constraints were considered for a few datasets. Our experimental setting is more representative: we performed experiments on twenty datasets (Table 1) for different fractions of noisy constraints. More specifically, we considered scenarios where 200 constraints were provided. From these constraints, different proportions {5%, 10%, 15%, 20%, 25%, 30%} of noisy constraints were generated by randomly selecting a constraint and changing its type, i.e., turning a ML constraint into a CL one, and vice-versa. This procedure is inspired by the noisy-edge noise model [0]. We ran experiments by taking into account the same (100) trials described in Section 3.1, i.e., for each trial the 200 initial (non-noisy) constraints are exactly as described in the previous experiments.

For the sake of illustration, we show the NMI values obtained for the Iris dataset in Figure 3. The trends depicted in this graph can be considered typical of our experiments with the other datasets. One can observe from this figure that CVQE shows the best behavior. This is expected from the trend observed in our experiments reported in Sections 3.2 and 3.3, which do not involve noisy constraints, namely that it tends to violate more constraints than LCVQE. Also, the algorithms using the augmented set of constraints (CVQE-A, LCVQE-A, and MPCK-Means) presented a more pronounced decrease in performance. Since the augmented set may contain constraints that were derived from noisy ones, this is also expected. An interesting observation is that CVQE-A has shown more robustness to noisy constraints than LCVQE. Finally, Figure 3 also shows that MPCK-Means is less robust with respect to noisy constraints. This can be explained by the fact that it seeks to adapt a distance metric, which in this case can be seen as a projection [5] in a wrong direction, thus resulting in low NMI values.

Figure 3. NMI values obtained by CVQE, CVQE-A, LCVQE, LCVQE-A, MPCK-S, and MPCK-M on the Iris dataset as the percentage of noisy constraints increases.
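The noise-injection procedure described above (randomly flipping the type of a chosen fraction of the provided constraints) can be sketched as follows; the function name and argument conventions are ours.

```python
import numpy as np

def add_constraint_noise(ml, cl, noise_fraction, rng=np.random.default_rng(0)):
    """Flip a random fraction of the constraints: ML becomes CL and vice-versa.
    The flipped constraints are the noisy ones."""
    pool = [('ml', p) for p in ml] + [('cl', p) for p in cl]
    n_noisy = int(round(noise_fraction * len(pool)))
    flip = set(rng.choice(len(pool), size=n_noisy, replace=False).tolist())
    new_ml, new_cl = [], []
    for idx, (kind, pair) in enumerate(pool):
        if idx in flip:
            kind = 'cl' if kind == 'ml' else 'ml'
        (new_ml if kind == 'ml' else new_cl).append(pair)
    return new_ml, new_cl
```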
For a more general perspective, there are two important aspects to be evaluated in our experimental setting. The first one is to identify the algorithms that are more robust to noise. We evaluate this aspect by means of the relative difference between the NMI values obtained with and without noisy

constraints (NMI and ARI presented highly correlated values in our experiments; for this reason, only NMI values are reported here). More precisely, we use the measure in Eq. (17), which captures the average difference, over the T trials performed, between the NMI values obtained by a particular algorithm i, for a given amount of noise n, on dataset d. We use NMI_noisy and NMI to denote the NMI values obtained when comparing the reference partition with the partition obtained with and without noisy constraints, respectively:

\Delta NMI(i, n, d) = \frac{1}{T} \sum_{t=1}^{T} ( NMI(i, t, d) - NMI_{noisy}(i, t, d, n) ).    (17)

The average rankings of the NMI differences are summarized in Table 12, where, for example, the value of 2.00 in the 4th row and 4th column indicates that, after sorting the differences obtained by each algorithm (with 5% of the constraints being noisy), CVQE-A is the second best algorithm (on average). Note that the NMI values are always computed with respect to a reference partition, given by the classes. Thus, it is expected that the greater the impact of the noisy constraints, the higher the NMI difference. From these results, one can observe that CVQE is the most robust.

Table 12. Average rankings of the relative difference of NMI values between partitions obtained with and without noisy constraints, Eq. (17) (columns: % noise, CVQE, LCVQE, CVQE-A, LCVQE-A, MPCK-S, MPCK-M).

The second aspect to be analyzed refers to the algorithms' capability of correctly violating noisy constraints while respecting the remaining ones. To that end, two fractions are added up, namely: (i) the fraction of constraints that were correctly violated (i.e., those that are indeed noisy) and (ii) the fraction of constraints that were correctly satisfied (i.e., those that are not noisy). The resulting measure is analogous to the well-known accuracy. More precisely, consider that algorithm i was run on dataset d, which has a particular proportion of noisy constraints n. The accuracy is then measured by means of Eq. (18), where F, V, and N denote the full set of constraints (F = M ∪ C), the full set of violated constraints (V = V_M ∪ V_C), and the set of noisy constraints, respectively. A superscript t on a set denotes the set of the t-th trial:

Accuracy(i, n, d) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|F^t|} ( \sum_{l \in F^t} 1[l \in V^t] 1[l \in N^t] + \sum_{l \in F^t} 1[l \notin V^t] 1[l \notin N^t] ).    (18)

Table 13 presents the average rankings obtained from the accuracies. One can observe that, considering 5-15% of noisy constraints, MPCK-S is the best algorithm (on average), whereas CVQE is the best one for 20-30% of noisy constraints. The accuracies obtained by MPCK-S are explained by the observation that the constraint sets are imbalanced. Recall from Sections 3.2 and 3.3 that MPCK-Means usually violates a very small (close to zero) number of constraints. Thus, for small proportions of noisy constraints, it is advantageous (from the accuracy viewpoint) to satisfy most of them (no matter whether they are noisy or not). When such an imbalance is lessened (i.e., the proportion of noisy constraints is increased), its accuracies decrease. As expected, the accuracy of LCVQE also deteriorates with the increasing number of noisy constraints, but for a small amount of noise (5%) it is still competitive with CVQE.

Table 13. Average rankings for the accuracies over constraints, Eq. (18) (columns: % noise, CVQE, LCVQE, CVQE-A, LCVQE-A, MPCK-S, MPCK-M).

In summary, LCVQE seems to be the most appropriate algorithm for scenarios where the expected amount of noise in the constraints is low. For noisier sets of constraints, the more computationally demanding CVQE tends to provide better results.
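For a single trial, the accuracy over constraints in Eq. (18) can be computed as in the following sketch, where constraints are identified by their indices; the naming is ours and follows our reconstruction of the equation.

```python
def constraint_accuracy(violated, noisy, all_constraints):
    """Fraction of constraints that are either noisy and violated (correctly
    rejected) or clean and satisfied (correctly kept). Arguments are sets of
    constraint indices."""
    correct = sum(1 for l in all_constraints
                  if (l in violated) == (l in noisy))
    return correct / len(all_constraints)
```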
4 Conclusions

In this study, three well-known algorithms for k-means-based clustering with constraints, namely the Constrained Vector Quantization Error (CVQE) algorithm [7], its variant named LCVQE [6], and the Metric Pairwise Constrained K-Means (MPCK-Means) algorithm [5], were systematically compared. These algorithms are designed to minimize the number of violated constraints in the process of clustering. In addition, MPCK-Means also seeks to learn a distance metric that maximizes the agreement with the constraints. Besides evaluating the algorithms with respect to the number

of violated constraints, two criteria that capture the accuracy of the obtained data partitions have been adopted: the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI). Experiments were performed on 20 datasets of different characteristics. For each dataset, 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance were presented. Computational complexity analyses not available in the original references have also been presented. Thus, we provided a far richer evaluation of all these aspects compared to previous studies.

In order to analyze the obtained results, we grouped the datasets into two subsets. The first subset is formed by UCI datasets, whereas the second contains Bioinformatics data (more precisely, benchmarks for Cancer Gene Expression Data). For both UCI and Bioinformatics datasets, LCVQE has shown to be competitive with CVQE in terms of accuracy, while violating fewer constraints and being more computationally efficient. From this viewpoint, our work confirms the claims made from a more limited set of experiments in [6]. Also, in most of the cases both CVQE and LCVQE presented better accuracy than MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. These results suggest that the Euclidean distance is appropriate for computing dissimilarities between objects for most of the investigated datasets (we normalized the bioinformatics datasets by means of z-scores, thus making the Euclidean distance consistent with the Pearson correlation, which is widely used for such datasets). Although CVQE and LCVQE have, in general, shown better results than MPCK-Means in our experiments, it is important to keep in mind that for particular datasets learning distance metrics can indeed be advantageous. As in [5], our more extensive study corroborates the conclusion that MPCK-Means presents better results when more constraints are available. We shall also observe that essentially the same conclusions can be drawn if one considers the 20 datasets altogether, namely: in most of the cases LCVQE provides the best clusterings and MPCK-Means violates fewer constraints. Overall, by taking into account all the assessed criteria, LCVQE has shown the best trade-off solutions.

Some useful observations can also be made from our experiments with noisy constraints. For instance, CVQE has shown to be robust even in very noisy scenarios (e.g., 30% of noisy constraints). For more well-behaved scenarios (e.g., 5% of noisy constraints), its more computationally efficient counterpart LCVQE has presented a reasonable trade-off between efficiency and accuracy. Finally, a variety of (more specific) new experimental findings emerged from our study. For instance, our results suggest that deduced constraints usually do not help finding better data partitions. In addition, we have observed that ML constraints are harder to satisfy, especially by the algorithms that have found better clusterings. We shall stress that our study was focused on clustering problems, for which the underlying assumption is that both ML and CL constraints should help finding better data partitions. However, from a classification point of view, ML constraints may be misleading.
For instance, if a particular class is formed by more than one cluster, then satisfying ML constraints may imply less accurate classifiers. Therefore, from the different viewpoint of building classifiers from constrained clustering algorithms (e.g., in semi-supervised learning settings), CL constraints tend to be more useful. Also, for these (and related) application scenarios, one should not assume that the number of clusters is equal to the number of classes. Instead, the number of clusters should be optimized from the data, in such a way that ML constraints could also be used to induce better models. From these observations, we believe that there is still room for investigating how algorithms for constrained clustering can help semi-supervised learning.

Acknowledgments

This work has been supported by NSF Grants (IIS and IIS-0664) and by the Brazilian research agencies FAPESP and CNPq.

References

[1] A. Asuncion and D. Newman. UCI machine learning repository. mlearn/MLRepository.html, 2007.
[2] A. Banerjee and J. Ghosh. Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery, 13(3):365-395, November 2006.
[3] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6:937-965, 2005.
[4] S. Basu, I. Davidson, and K. Wagstaff. Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC, 2008.
[5] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), New York, NY, USA, 2004. ACM.
[6] R. J. G. B. Campello, E. R. Hruschka, and V. S. Alves. On the efficiency of evolutionary fuzzy clustering. Journal of Heuristics, 15:43-75, 2009.
[7] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In 5th SIAM International Conference on Data Mining, SIAM, 2005.


More information

Global Metric Learning by Gradient Descent

Global Metric Learning by Gradient Descent Global Metric Learning by Gradient Descent Jens Hocke and Thomas Martinetz University of Lübeck - Institute for Neuro- and Bioinformatics Ratzeburger Allee 160, 23538 Lübeck, Germany hocke@inb.uni-luebeck.de

More information

Relative Constraints as Features

Relative Constraints as Features Relative Constraints as Features Piotr Lasek 1 and Krzysztof Lasek 2 1 Chair of Computer Science, University of Rzeszow, ul. Prof. Pigonia 1, 35-510 Rzeszow, Poland, lasek@ur.edu.pl 2 Institute of Computer

More information

The Projected Dip-means Clustering Algorithm

The Projected Dip-means Clustering Algorithm Theofilos Chamalis Department of Computer Science & Engineering University of Ioannina GR 45110, Ioannina, Greece thchama@cs.uoi.gr ABSTRACT One of the major research issues in data clustering concerns

More information

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Sicheng Xiong for the degree of Master of Science in Computer Science presented on April 25, 2013. Title: Active Learning of Constraints for Semi-Supervised Clustering Abstract

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Semi-supervised Clustering

Semi-supervised Clustering Semi-supervised lustering BY: $\ S - MAI AMLT - 2016/2017 (S - MAI) Semi-supervised lustering AMLT - 2016/2017 1 / 26 Outline 1 Semisupervised lustering 2 Semisupervised lustering/labeled Examples 3 Semisupervised

More information

Fuzzy Segmentation. Chapter Introduction. 4.2 Unsupervised Clustering.

Fuzzy Segmentation. Chapter Introduction. 4.2 Unsupervised Clustering. Chapter 4 Fuzzy Segmentation 4. Introduction. The segmentation of objects whose color-composition is not common represents a difficult task, due to the illumination and the appropriate threshold selection

More information

Statistical Methods in AI

Statistical Methods in AI Statistical Methods in AI Distance Based and Linear Classifiers Shrenik Lad, 200901097 INTRODUCTION : The aim of the project was to understand different types of classification algorithms by implementing

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, Behrouz Minaei Nourabad Mamasani Branch Islamic Azad University Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA)

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) 1 S. ADAEKALAVAN, 2 DR. C. CHANDRASEKAR 1 Assistant Professor, Department of Information Technology, J.J. College of Arts and Science, Pudukkottai,

More information

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Binoda Nand Prasad*, Mohit Rathore**, Geeta Gupta***, Tarandeep Singh**** *Guru Gobind Singh Indraprastha University,

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION

FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION FUZZY KERNEL K-MEDOIDS ALGORITHM FOR MULTICLASS MULTIDIMENSIONAL DATA CLASSIFICATION 1 ZUHERMAN RUSTAM, 2 AINI SURI TALITA 1 Senior Lecturer, Department of Mathematics, Faculty of Mathematics and Natural

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Intractability and Clustering with Constraints

Intractability and Clustering with Constraints Ian Davidson davidson@cs.albany.edu S.S. Ravi ravi@cs.albany.edu Department of Computer Science, State University of New York, 1400 Washington Ave, Albany, NY 12222 Abstract Clustering with constraints

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Remco R. Bouckaert 1,2 and Eibe Frank 2 1 Xtal Mountain Information Technology 215 Three Oaks Drive, Dairy Flat, Auckland,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Measuring Constraint-Set Utility for Partitional Clustering Algorithms

Measuring Constraint-Set Utility for Partitional Clustering Algorithms Measuring Constraint-Set Utility for Partitional Clustering Algorithms Ian Davidson 1, Kiri L. Wagstaff 2, and Sugato Basu 3 1 State University of New York, Albany, NY 12222, davidson@cs.albany.edu 2 Jet

More information

Semi-supervised learning

Semi-supervised learning Semi-supervised Learning COMP 790-90 Seminar Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Overview 2 Semi-supervised learning Semi-supervised classification Semi-supervised clustering Semi-supervised

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering

Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining Systems, pp.42-49, Washington DC, August, 2003 Comparing and Unifying Search-Based

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Processing Missing Values with Self-Organized Maps

Processing Missing Values with Self-Organized Maps Processing Missing Values with Self-Organized Maps David Sommer, Tobias Grimm, Martin Golz University of Applied Sciences Schmalkalden Department of Computer Science D-98574 Schmalkalden, Germany Phone:

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

Detecting Clusters and Outliers for Multidimensional

Detecting Clusters and Outliers for Multidimensional Kennesaw State University DigitalCommons@Kennesaw State University Faculty Publications 2008 Detecting Clusters and Outliers for Multidimensional Data Yong Shi Kennesaw State University, yshi5@kennesaw.edu

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Training Digital Circuits with Hamming Clustering

Training Digital Circuits with Hamming Clustering IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 47, NO. 4, APRIL 2000 513 Training Digital Circuits with Hamming Clustering Marco Muselli, Member, IEEE, and Diego

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies. CSE 547: Machine Learning for Big Data Spring 2019 Problem Set 2 Please read the homework submission policies. 1 Principal Component Analysis and Reconstruction (25 points) Let s do PCA and reconstruct

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Approach to Evaluate Clustering using Classification Labelled Data

Approach to Evaluate Clustering using Classification Labelled Data Approach to Evaluate Clustering using Classification Labelled Data by Tuong Luu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, and Behrouz Minaei Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Annotated multitree output

Annotated multitree output Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA

More information

An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm

An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm T.Saranya Research Scholar Snr sons college Coimbatore, Tamilnadu saran2585@gmail.com Dr. K.Maheswari

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION

THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION THE AREA UNDER THE ROC CURVE AS A CRITERION FOR CLUSTERING EVALUATION Helena Aidos, Robert P.W. Duin and Ana Fred Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal Pattern Recognition

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

Network community detection with edge classifiers trained on LFR graphs

Network community detection with edge classifiers trained on LFR graphs Network community detection with edge classifiers trained on LFR graphs Twan van Laarhoven and Elena Marchiori Department of Computer Science, Radboud University Nijmegen, The Netherlands Abstract. Graphs

More information

Shared Kernel Models for Class Conditional Density Estimation

Shared Kernel Models for Class Conditional Density Estimation IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 987 Shared Kernel Models for Class Conditional Density Estimation Michalis K. Titsias and Aristidis C. Likas, Member, IEEE Abstract

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information