Integrating Constraints and Metric Learning in Semi-Supervised Clustering


Mikhail Bilenko, Sugato Basu, Raymond J. Mooney
Department of Computer Sciences, University of Texas at Austin, Austin, TX USA

Abstract

Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches and presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.

1. Introduction

In many learning tasks, unlabeled data is plentiful but labeled data is limited and expensive to generate. Consequently, semi-supervised learning, which employs both labeled and unlabeled data, has become a topic of significant interest. More specifically, semi-supervised clustering, the use of class labels or pairwise constraints on some examples to aid unsupervised clustering, has been the focus of several recent projects (Wagstaff et al., 2001; Basu et al., 2002; Klein et al., 2002; Xing et al., 2003; Bar-Hillel et al., 2003; Segal et al., 2003).

Existing methods for semi-supervised clustering fall into two general approaches that we call constraint-based and metric-based. In constraint-based approaches, the clustering algorithm itself is modified so that user-provided labels or pairwise constraints are used to guide the algorithm towards a more appropriate data partitioning.
This is done by modifying the clustering objective function so that it includes satisfaction of constraints (Demiriz et al., 1999), enforcing constraints during the clustering process (Wagstaff et al., 2001), or initializing and constraining clustering based on labeled examples (Basu et al., 2002).

In metric-based approaches, an existing clustering algorithm that uses a distance metric is employed; however, the metric is first trained to satisfy the labels or constraints in the supervised data. Several distance measures have been used for metric-based semi-supervised clustering, including Euclidean distance trained by a shortest-path algorithm (Klein et al., 2002), string-edit distance learned using Expectation Maximization (EM) (Bilenko & Mooney, 2003), KL divergence adapted using gradient descent (Cohn et al., 2003), and Mahalanobis distances trained using convex optimization (Xing et al., 2003; Bar-Hillel et al., 2003).

Previous metric-based semi-supervised clustering algorithms exclude unlabeled data from the metric training step and separate metric learning from the clustering process. Also, existing metric-based methods use a single distance metric for all clusters, forcing them to have similar shapes.

We propose a new semi-supervised clustering algorithm derived from K-Means, MPCK-MEANS, that incorporates both metric learning and the use of pairwise constraints in a principled manner. MPCK-MEANS performs distance-metric training with each clustering iteration, utilizing both unlabeled data and pairwise constraints. The algorithm is able to learn individual metrics for each cluster, which permits clusters of different shapes. MPCK-MEANS also allows violation of constraints if it leads to a more cohesive clustering, whereas earlier constraint-based methods forced satisfaction of all constraints, leaving them vulnerable to noisy supervision.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada. Copyright 2004 by the authors.
By ablating the metric-based and constraint-based components of our unified method, we present experimental results comparing and combining the two approaches on multiple datasets. The two methods for semi-supervision individually improve clustering accuracy, and our unified approach integrates their strengths. Finally, we demonstrate that the semi-supervised metric learning in our approach outperforms previously proposed methods that learn metrics prior to clustering, and that learning multiple cluster-specific metrics can lead to better results.

2. Problem Formulation

2.1. Clustering with K-Means

K-Means is a clustering algorithm based on iterative relocation that partitions a dataset into K clusters, locally minimizing the total squared Euclidean distance between the data points and the cluster centroids. Let $X = \{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^m$, be a set of data points, $x_{id}$ be the d-th component of $x_i$, $\{\mu_h\}_{h=1}^{K}$ represent the K cluster centroids, and $l_i$ be the cluster assignment of a point $x_i$, where $l_i \in \{1, \dots, K\}$. The Euclidean K-Means algorithm creates a K-partitioning $\{X_h\}_{h=1}^{K}$ of X so that the objective function $\sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2$ is locally minimized.

It can be shown that the K-Means algorithm is essentially an EM algorithm on a mixture of K Gaussians under assumptions of identity covariance of the Gaussians, uniform mixture component priors, and expectation under a particular type of conditional distribution (Basu et al., 2002). In the Euclidean K-Means formulation, the squared L2-norm $\|x_i - \mu_{l_i}\|^2 = (x_i - \mu_{l_i})^T (x_i - \mu_{l_i})$ between a point $x_i$ and its corresponding cluster centroid $\mu_{l_i}$ is used as the distance measure, which is a direct consequence of the identity covariance assumption of the underlying Gaussians.

2.2. Semi-supervised Clustering with Constraints

In semi-supervised clustering, a small amount of labeled data is available to aid the clustering process. Our framework uses both must-link and cannot-link constraints between pairs of instances (Wagstaff et al., 2001), with an associated cost for violating each constraint. In many unsupervised-learning applications, e.g., clustering for speaker identification in a conversation (Bar-Hillel et al., 2003), or clustering GPS data for lane-finding (Wagstaff et al., 2001), considering supervision in the form of constraints is more realistic than providing class labels. While class labels may be unknown, a user can still specify whether pairs of points belong to the same or different clusters.
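As a minimal sketch of the K-Means objective from Section 2.1 (not the authors' implementation), the following pure-Python code evaluates the objective and performs one assignment and re-estimation pass. The function names `kmeans_objective` and `kmeans_step`, the zero-based cluster indices, and the toy data are illustrative assumptions.

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_objective(points, centroids, labels):
    """Sum over points of the squared distance to the assigned centroid."""
    return sum(sq_euclidean(x, centroids[l]) for x, l in zip(points, labels))


def kmeans_step(points, centroids):
    """One iteration: assign each point to its nearest centroid, then recompute means."""
    labels = [min(range(len(centroids)), key=lambda h: sq_euclidean(x, centroids[h]))
              for x in points]
    new_centroids = []
    for h, old in enumerate(centroids):
        members = [x for x, l in zip(points, labels) if l == h]
        if not members:
            # Assumption: an empty cluster simply keeps its previous centroid.
            new_centroids.append(list(old))
            continue
        new_centroids.append([sum(x[d] for x in members) / len(members)
                              for d in range(len(old))])
    return labels, new_centroids


# Toy data (hypothetical): two well-separated groups in the plane.
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]]
centroids = [[0.0, 0.0], [4.0, 4.0]]
labels, centroids = kmeans_step(points, centroids)
objective = kmeans_objective(points, centroids, labels)
```

Repeating `kmeans_step` until the labels stop changing gives the iterative relocation scheme; each pass can only lower or preserve the objective.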
Constraint-based supervision is also more general than class labels: a set of classified points implies an equivalent set of pairwise constraints, but not vice versa.

Since K-Means cannot directly handle pairwise constraints, we formulate the goal of pairwise constrained clustering as minimizing a combined objective function, defined as the sum of the total squared distances between the points and their cluster centroids, and the cost incurred by violating any pairwise constraints. Let M be a set of must-link pairs where $(x_i, x_j) \in M$ implies $x_i$ and $x_j$ should be in the same cluster, and C be a set of cannot-link pairs where $(x_i, x_j) \in C$ implies $x_i$ and $x_j$ should be in different clusters. Let $W = \{w_{ij}\}$ and $\bar{W} = \{\bar{w}_{ij}\}$ be penalty costs for violating the constraints in M and C respectively. Therefore, the goal of pairwise constrained K-Means is to minimize the following objective function, where point $x_i$ is assigned to the partition $X_{l_i}$ with centroid $\mu_{l_i}$:

$$J_{pckmeans} = \sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2 + \sum_{(x_i, x_j) \in M} w_{ij}\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, \mathbb{1}[l_i = l_j] \tag{1}$$

where $\mathbb{1}$ is the indicator function, $\mathbb{1}[\text{true}] = 1$ and $\mathbb{1}[\text{false}] = 0$. This mathematical formulation is motivated by the metric labeling problem with the generalized Potts model (Kleinberg & Tardos, 1999).

2.3. Semi-supervised Clustering with Metric Learning

While pairwise constraints can guide a clustering algorithm towards a better grouping, they can also be used to adapt the underlying distance metric. Pairwise constraints effectively represent the user's view of similarity in the domain. Since the original data representation may not specify a space where clusters are sufficiently separated, modifying the distance metric warps the space to minimize distances between same-cluster objects, while maximizing distances between different-cluster objects. As a result, clusters discovered using learned metrics adhere more closely to the notion of similarity embodied in the supervision.
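As a concrete reading of Eq. (1) from Section 2.2, the sketch below adds violation penalties to the K-Means dispersion term. It assumes uniform costs `w` and `w_bar` (the formulation allows per-pair costs) and represents constraints as index pairs; the function name and the toy data are hypothetical.

```python
def pckmeans_objective(points, centroids, labels, must_link, cannot_link,
                       w=1.0, w_bar=1.0):
    """Eq. (1): cluster dispersion plus penalties for violated constraints.

    must_link and cannot_link are lists of index pairs (i, j). The uniform
    costs w and w_bar are an assumption made for this sketch.
    """
    dispersion = sum(
        sum((a - b) ** 2 for a, b in zip(x, centroids[l]))
        for x, l in zip(points, labels)
    )
    # A must-link is violated when its two points sit in different clusters.
    ml_penalty = sum(w for i, j in must_link if labels[i] != labels[j])
    # A cannot-link is violated when its two points share a cluster.
    cl_penalty = sum(w_bar for i, j in cannot_link if labels[i] == labels[j])
    return dispersion + ml_penalty + cl_penalty


points = [[0.0], [1.0], [5.0], [6.0]]
centroids = [[0.5], [5.5]]
labels = [0, 0, 1, 1]
satisfied = pckmeans_objective(points, centroids, labels, [(0, 1)], [(1, 2)])
violated = pckmeans_objective(points, centroids, labels, [(0, 2)], [(2, 3)])
```

With the same partition, the objective grows by one unit cost per violated constraint, which is what lets the algorithm trade constraint satisfaction against cluster cohesion.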
We parameterize Euclidean distance using a symmetric positive-definite matrix A as follows: $\|x_i - x_j\|_A = \sqrt{(x_i - x_j)^T A (x_i - x_j)}$; the same parameterization was previously used by Xing et al. (2003) and Bar-Hillel et al. (2003). If A is restricted to a diagonal matrix, it scales each dimension by a different weight and corresponds to feature weighting; otherwise new features are created that are linear combinations of the original ones.

In previous work on adaptive metrics for clustering (Cohn et al., 2003; Xing et al., 2003; Bar-Hillel et al., 2003), metric weights are trained to simultaneously minimize the distance between must-linked instances and maximize the distance between cannot-linked instances. A fundamental limitation of these approaches is that they assume a single metric for all clusters, preventing them from having different shapes.

We allow a separate weight matrix for each cluster, denoted $A_h$ for cluster h. This is equivalent to a generalized version of the K-Means model described in Section 2.1, where cluster h is generated by a Gaussian with covariance matrix $A_h^{-1}$ (Bilmes, 1997). It can be shown that maximizing the complete data log-likelihood under this generalized K-Means model is equivalent to minimizing the objective function:

$$J_{mkmeans} = \sum_{x_i \in X} \left( \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det(A_{l_i})) \right) \tag{2}$$

where the second term arises due to the normalizing constant of the $l_i$-th Gaussian with covariance matrix $A_{l_i}^{-1}$.
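To make the parameterization concrete, here is a small sketch of the squared distance under a matrix A and of one point's contribution to Eq. (2). The helper names `sq_dist_A` and `mkmeans_term` are hypothetical, and the log-determinant shortcut assumes a diagonal A, for which the determinant is the product of the diagonal entries.

```python
import math


def sq_dist_A(x, y, A):
    """Squared distance (x - y)^T A (x - y) for a full matrix A given as a list of rows."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[r] * A[r][c] * diff[c] for r in range(n) for c in range(n))


def mkmeans_term(x, mu, diag_A):
    """Contribution of one point to Eq. (2), assuming a diagonal metric.

    diag_A holds the diagonal of A, so log(det(A)) is the sum of the logs.
    """
    n = len(diag_A)
    A = [[diag_A[r] if r == c else 0.0 for c in range(n)] for r in range(n)]
    log_det = sum(math.log(a) for a in diag_A)
    return sq_dist_A(x, mu, A) - log_det


identity = [[1.0, 0.0], [0.0, 1.0]]
# With A = I the parameterized distance reduces to the squared L2-norm.
euclid = sq_dist_A([3.0, 4.0], [0.0, 0.0], identity)
# A diagonal A reweights each feature separately.
weighted = sq_dist_A([3.0, 4.0], [0.0, 0.0], [[2.0, 0.0], [0.0, 0.5]])
term = mkmeans_term([3.0, 4.0], [0.0, 0.0], [2.0, 0.5])
```

The log-determinant term keeps the weights from collapsing to zero: shrinking every entry of A would reduce the distance term but is penalized by the growing value of -log(det(A)).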

2.4. Integrating Constraints and Metric Learning

Combining Eqns. (1) and (2) leads to the following objective function that minimizes cluster dispersion under the learned metrics while reducing constraint violations:

$$J_{combined} = \sum_{x_i \in X} \left( \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det(A_{l_i})) \right) + \sum_{(x_i, x_j) \in M} w_{ij}\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, \mathbb{1}[l_i = l_j] \tag{3}$$

If we assume uniform constraint costs $w_{ij}$ and $\bar{w}_{ij}$, all constraint violations are treated equally. However, the penalty for violating a must-link constraint between distant points should be higher than that between nearby points. Intuitively, this captures the fact that if two must-linked points are far apart according to the current metric, the metric is grossly inadequate and needs severe modification. Since two clusters are involved in a must-link violation, the corresponding penalty should affect the metrics for both clusters. This can be accomplished by multiplying the penalty in the second summation of Eqn. (3) by the following function:

$$f_M(x_i, x_j) = \frac{1}{2} \|x_i - x_j\|^2_{A_{l_i}} + \frac{1}{2} \|x_i - x_j\|^2_{A_{l_j}} \tag{4}$$

Analogously, the penalty for violating a cannot-link constraint between two points that are nearby according to the current metric should be higher than for two distant points. To reflect this intuition, the following penalty term can be used with violated cannot-link constraints, whose points are assigned to the same cluster ($l_i = l_j$):

$$f_C(x_i, x_j) = \|x'_{l_i} - x''_{l_i}\|^2_{A_{l_i}} - \|x_i - x_j\|^2_{A_{l_i}} \tag{5}$$

where $(x'_{l_i}, x''_{l_i})$ is the maximally separated pair of points in the dataset according to the $l_i$-th metric. This form of $f_C$ ensures that the penalty for violating a cannot-link constraint remains non-negative, since the second term is never greater than the first. The combined objective function then becomes:

$$J_{mpckm} = \sum_{x_i \in X} \left( \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det(A_{l_i})) \right) + \sum_{(x_i, x_j) \in M} w_{ij}\, f_M(x_i, x_j)\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, f_C(x_i, x_j)\, \mathbb{1}[l_i = l_j] \tag{6}$$

Costs $w_{ij}$ and $\bar{w}_{ij}$ provide a way of specifying the relative importance of the labeled versus unlabeled data while allowing individual constraint weights.
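The two penalty functions in Eqns. (4) and (5) can be sketched as follows. For brevity the sketch assumes diagonal metrics stored as weight lists and finds the maximally separated pair by brute force; the names `sq_dist_diag`, `f_M`, `f_C`, and `farthest_pair` are illustrative choices, not part of the original presentation.

```python
def sq_dist_diag(x, y, a):
    """Squared distance under a diagonal metric with per-feature weights a."""
    return sum(w * (p - q) ** 2 for w, p, q in zip(a, x, y))


def f_M(xi, xj, a_li, a_lj):
    """Eq. (4): average of the pair's squared distance under both cluster metrics."""
    return 0.5 * sq_dist_diag(xi, xj, a_li) + 0.5 * sq_dist_diag(xi, xj, a_lj)


def f_C(xi, xj, a_li, far_pair):
    """Eq. (5): gap between the widest pair under the metric and this pair."""
    x1, x2 = far_pair
    return sq_dist_diag(x1, x2, a_li) - sq_dist_diag(xi, xj, a_li)


def farthest_pair(points, a):
    """Brute-force search for the maximally separated pair under metric a."""
    return max(((p, q) for i, p in enumerate(points) for q in points[i + 1:]),
               key=lambda pq: sq_dist_diag(pq[0], pq[1], a))


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 3.0]]
a0, a1 = [1.0, 1.0], [2.0, 0.5]
pair0 = farthest_pair(points, a0)
near = f_C(points[0], points[1], a0, pair0)  # nearby pair in one cluster: large penalty
far = f_C(points[0], points[3], a0, pair0)   # the widest pair itself: zero penalty
ml = f_M(points[0], points[3], a0, a1)       # distant must-linked pair: large penalty
```

Because the widest pair is computed under the same metric as the violated pair, `f_C` cannot go negative, matching the non-negativity argument given for Eq. (5).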
The following section describes how $J_{mpckm}$ can be greedily optimized by our proposed metric pairwise constrained K-Means (MPCK-MEANS) algorithm.

3. MPCK-MEANS Algorithm

Given a set of data points X, a set of must-link constraints M, a set of cannot-link constraints C, corresponding cost sets W and $\bar{W}$, and the desired number of clusters K, MPCK-MEANS finds a disjoint K-partitioning $\{X_h\}_{h=1}^{K}$ of X (with each cluster having a centroid $\mu_h$ and a local weight matrix $A_h$) such that $J_{mpckm}$ is (locally) minimized.

The algorithm integrates the use of constraints and metric learning. Constraints are utilized during cluster initialization and when assigning points to clusters, and the distance metric is adapted by re-estimating the weight matrices $A_h$ during each iteration based on the current cluster assignments and constraint violations. Pseudocode for the algorithm is presented in Fig. 1.

Algorithm: MPCK-MEANS
Input: Set of data points $X = \{x_i\}_{i=1}^{N}$, set of must-link constraints $M = \{(x_i, x_j)\}$, set of cannot-link constraints $C = \{(x_i, x_j)\}$, number of clusters K, sets of constraint costs W and $\bar{W}$.
Output: Disjoint K-partitioning $\{X_h\}_{h=1}^{K}$ of X such that objective function $J_{mpckm}$ is (locally) minimized.
Method:
1. Initialize clusters:
   1a. create the λ neighborhoods $\{N_p\}_{p=1}^{\lambda}$ from M and C
   1b. if λ ≥ K
          initialize $\{\mu_h^{(0)}\}_{h=1}^{K}$ using weighted farthest-first traversal starting from the largest $N_p$
       else if λ < K
          initialize $\{\mu_h^{(0)}\}_{h=1}^{\lambda}$ with centroids of $\{N_p\}_{p=1}^{\lambda}$
          initialize remaining clusters at random
2. Repeat until convergence
   2a. assign cluster: Assign each data point $x_i$ to cluster $h^*$ (i.e. set $X_{h^*}^{(t+1)}$), for
       $h^* = \arg\min_h \Big( \|x_i - \mu_h^{(t)}\|^2_{A_h} - \log(\det(A_h)) + \sum_{(x_i, x_j) \in M} w_{ij} f_M(x_i, x_j)\, \mathbb{1}[h \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} f_C(x_i, x_j)\, \mathbb{1}[h = l_j] \Big)$
   2b. estimate means: $\{\mu_h^{(t+1)}\}_{h=1}^{K} \leftarrow \Big\{ \frac{1}{|X_h^{(t+1)}|} \sum_{x \in X_h^{(t+1)}} x \Big\}_{h=1}^{K}$
   2c. update metrics:
       $A_h = |X_h| \Big( \sum_{x_i \in X_h} (x_i - \mu_h)(x_i - \mu_h)^T + \sum_{(x_i, x_j) \in M_h} \frac{1}{2} w_{ij} (x_i - x_j)(x_i - x_j)^T\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C_h} \bar{w}_{ij} \big( (x'_h - x''_h)(x'_h - x''_h)^T - (x_i - x_j)(x_i - x_j)^T \big)\, \mathbb{1}[l_i = l_j] \Big)^{-1}$
   2d. $t \leftarrow (t + 1)$

Figure 1. MPCK-MEANS algorithm

3.1. Initialization

Good initial centroids are critical to the success of greedy clustering algorithms such as K-Means. To infer the initial clusters from the constraints, we take the transitive closure of the must-link constraints and augment the set M with these entailed constraints (assuming consistency of the constraints). Let λ be the number of connected components in the augmented set M. These connected components are used to create λ neighborhood sets $\{N_p\}_{p=1}^{\lambda}$, where each neighborhood consists of points connected by must-links. For every pair of neighborhoods $N_p$ and $N_{p'}$ that have at least one cannot-link between them, we add cannot-link constraints between every pair of points in $N_p$ and $N_{p'}$ and augment the cannot-link set C with these entailed constraints. We will overload notation from this point and refer to the augmented must-link and cannot-link sets as M and C respectively.

After this preprocessing step, we get λ neighborhood sets $\{N_p\}_{p=1}^{\lambda}$. These neighborhoods provide initial clusters for the MPCK-MEANS algorithm. If λ ≤ K, we initialize λ cluster centers with the centroids of all the λ neighborhood sets. If λ < K, we initialize the remaining K − λ clusters with points obtained by random perturbations of the global centroid of X. If λ > K, we select K neighborhood sets using a weighted variant of the farthest-first algorithm, which is a good heuristic for initialization in centroid-based clustering algorithms like K-Means.

In weighted farthest-first traversal, the goal is to find K points which are maximally separated from each other in terms of a weighted distance. In our case, the points are the centroids of the λ neighborhoods, and the weight of each centroid is the size of its corresponding neighborhood. Thus, we bias farthest-first to select centroids which are relatively far apart but also represent large neighborhoods, in order to obtain good initial clusters. In weighted farthest-first traversal, we maintain a set of traversed points at every step, and pick the following point having the farthest weighted distance from the traversed set (using the standard notion of distance from a set: $d(x, S) = \min_{y \in S} d(x, y)$), and so on. Finally, we initialize the K cluster centers with the centroids of the K neighborhoods chosen by weighted farthest-first traversal.

3.2. E-step

MPCK-MEANS alternates between cluster assignment in the E-step, and centroid estimation and metric learning in the M-step (see Step 2 in Fig. 1). In the E-step, every point x is assigned to the cluster that minimizes the sum of the distance of x to the cluster centroid according to the local metric and the cost of any constraint violations incurred by this cluster assignment.
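The per-point assignment rule of the E-step can be sketched as below. This is a simplified illustration under several assumptions: metrics are diagonal weight lists, costs are uniform, the squared distance of the maximally separated pair under each metric is precomputed in `max_sep`, and unassigned points carry the label `None`. All names are hypothetical.

```python
import math


def sq_dist_diag(x, y, a):
    """Squared distance under a diagonal metric with per-feature weights a."""
    return sum(w * (p - q) ** 2 for w, p, q in zip(a, x, y))


def assignment_cost(i, h, points, centroids, metrics, labels,
                    must_link, cannot_link, max_sep, w=1.0, w_bar=1.0):
    """Cost of placing point i in cluster h: distance, log-det, and violations."""
    x = points[i]
    cost = sq_dist_diag(x, centroids[h], metrics[h])
    cost -= sum(math.log(a) for a in metrics[h])  # -log(det(A_h)) for diagonal A_h
    for p, q in must_link:
        if i in (p, q):
            j = q if p == i else p
            lj = labels[j]
            if lj is not None and lj != h:  # must-link would be violated
                f_m = (0.5 * sq_dist_diag(x, points[j], metrics[h])
                       + 0.5 * sq_dist_diag(x, points[j], metrics[lj]))
                cost += w * f_m
    for p, q in cannot_link:
        if i in (p, q):
            j = q if p == i else p
            if labels[j] == h:  # cannot-link would be violated
                cost += w_bar * (max_sep[h] - sq_dist_diag(x, points[j], metrics[h]))
    return cost


def assign_point(i, points, centroids, metrics, labels,
                 must_link, cannot_link, max_sep):
    """Return the index of the cluster with the lowest assignment cost for point i."""
    costs = [assignment_cost(i, h, points, centroids, metrics, labels,
                             must_link, cannot_link, max_sep)
             for h in range(len(centroids))]
    return min(range(len(costs)), key=costs.__getitem__)


points = [[0.0], [1.0], [2.6], [5.0]]
centroids = [[0.5], [5.0]]
metrics = [[1.0], [1.0]]
labels = [0, 0, None, 1]       # point 2 is the one being assigned
max_sep = [25.0, 25.0]         # squared distance between points 0 and 3
unconstrained = assign_point(2, points, centroids, metrics, labels, [], [], max_sep)
with_must_link = assign_point(2, points, centroids, metrics, labels, [(2, 3)], [], max_sep)
with_cannot_link = assign_point(2, points, centroids, metrics, labels, [], [(1, 2)], max_sep)
```

In this toy case the point is slightly closer to the first centroid, but either a must-link to a point in the second cluster or a cannot-link to a point in the first cluster is enough to move it.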
Points are randomly re-ordered for each assignment sequence, and once a point x is assigned to a cluster, the subsequent points in the random ordering use the current cluster assignment of x to calculate possible constraint violations. Note that this assignment step is order-dependent, since the subsets of M and C relevant to each cluster may change with the assignment of a point. We experimented with random ordering as well as a greedy strategy that first assigned instances that are closest to the cluster centroid and involved in a minimal number of constraints. These experiments showed that the order of assignment does not result in statistically significant differences in clustering quality; therefore, we used random ordering in our evaluation.

In the E-step, each point moves to a new cluster only if the component of $J_{mpckm}$ contributed by this point decreases. So when all points are given their new assignment, $J_{mpckm}$ will decrease or remain the same.

3.3. M-step

In the M-step, every cluster centroid $\mu_h$ is first re-estimated using the points in the corresponding $X_h$. As a result, the contribution of each cluster to $J_{mpckm}$ is minimized. The pairwise constraints do not take part in this centroid re-estimation step because the constraint violations only depend on cluster assignments, which do not change in this step. Thus, only the first term (the distance component) of $J_{mpckm}$ is minimized. The centroid re-estimation step effectively remains the same as in K-Means.

The second part of the M-step performs metric learning, where the matrices $\{A_h\}_{h=1}^{K}$ are re-estimated to decrease the objective function $J_{mpckm}$. Each updated matrix of local weights $A_h$ is obtained by taking the partial derivative $\frac{\partial J_{mpckm}}{\partial A_h}$ and setting it to zero, resulting in:

$$A_h = |X_h| \Bigg( \sum_{x_i \in X_h} (x_i - \mu_h)(x_i - \mu_h)^T + \sum_{(x_i, x_j) \in M_h} \frac{1}{2} w_{ij} (x_i - x_j)(x_i - x_j)^T\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C_h} \bar{w}_{ij} \Big( (x'_h - x''_h)(x'_h - x''_h)^T - (x_i - x_j)(x_i - x_j)^T \Big)\, \mathbb{1}[l_i = l_j] \Bigg)^{-1} \tag{7}$$

where $M_h$ and $C_h$ are subsets of must-link and cannot-link constraints respectively that contain points currently assigned to the h-th cluster.
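The step from the objective to Eq. (7) can be sketched as follows. The sketch relies on the standard matrix identities $\partial (v^T A v) / \partial A = v v^T$ and $\partial \log\det A / \partial A = A^{-1}$ for symmetric positive-definite A, and it treats the maximally separated pair $(x'_h, x''_h)$ as fixed while differentiating, which is an assumption of this sketch. Keeping only the terms of $J_{mpckm}$ that involve $A_h$:

```latex
\begin{align*}
\frac{\partial J_{mpckm}}{\partial A_h}
  &= \sum_{x_i \in X_h} (x_i - \mu_h)(x_i - \mu_h)^T \;-\; |X_h|\, A_h^{-1} \\
  &\quad + \sum_{(x_i, x_j) \in M_h} \tfrac{1}{2}\, w_{ij}\, (x_i - x_j)(x_i - x_j)^T\, \mathbb{1}[l_i \neq l_j] \\
  &\quad + \sum_{(x_i, x_j) \in C_h} \bar{w}_{ij} \Big( (x'_h - x''_h)(x'_h - x''_h)^T - (x_i - x_j)(x_i - x_j)^T \Big)\, \mathbb{1}[l_i = l_j]
  \;=\; 0 .
\end{align*}
```

The $-|X_h| A_h^{-1}$ term comes from the $-\log(\det(A_h))$ summand, which appears once for each of the $|X_h|$ points in the cluster. Moving it to the other side and inverting gives $A_h = |X_h|\,(\cdot)^{-1}$, where the parenthesized sum is exactly the one in Eq. (7).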
Since each $A_h$ is obtained by inverting the summation of covariance matrices in Eqn. (7), $A_h^{-1}$, that summation must not be singular. If any of the obtained $A_h^{-1}$ are singular, they can be conditioned by adding the identity matrix multiplied by a small fraction of the trace of $A_h^{-1}$: $A_h^{-1} = A_h^{-1} + \epsilon\, \mathrm{tr}(A_h^{-1})\, I$ (Saul & Roweis, 2003). If the $A_h$ resulting from the inversion is negative definite, it is mended by projecting on the set $C = \{A : A \succeq 0\}$ of positive semi-definite matrices as described by Xing et al. (2003) to ensure that it parameterizes a distance metric.

For high-dimensional or large datasets, estimating the full matrix $A_h$ can be computationally expensive. In such cases diagonal weight matrices can be used, which is equivalent to feature weighting, while using the full matrix corresponds to feature generation. In the case of diagonal $A_h$, the d-th diagonal element, $a^{(h)}_{dd}$, corresponds to the weight of the d-th feature for the h-th cluster metric:

$$a^{(h)}_{dd} = |X_h| \Bigg( \sum_{x_i \in X_h} (x_{id} - \mu_{hd})^2 + \sum_{(x_i, x_j) \in M_h} \frac{1}{2} w_{ij} (x_{id} - x_{jd})^2\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C_h} \bar{w}_{ij} \Big( (x'_{hd} - x''_{hd})^2 - (x_{id} - x_{jd})^2 \Big)\, \mathbb{1}[l_i = l_j] \Bigg)^{-1} \tag{8}$$
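A minimal sketch of the diagonal update in Eq. (8) for a single cluster is given below. It assumes uniform costs and takes the violated constraint pairs as explicit lists of point pairs. The `eps` floor is a crude stand-in, chosen here only for illustration, for the trace-based conditioning and the projection step described above; the function name and arguments are hypothetical.

```python
def diagonal_metric_update(members, centroid, ml_violations, cl_violations,
                           far_pair, w=1.0, w_bar=1.0, eps=1e-9):
    """Eq. (8) for one cluster: return the list of diagonal weights a_dd.

    members: points assigned to the cluster.
    ml_violations / cl_violations: (x_i, x_j) pairs of violated constraints
    that touch this cluster.
    far_pair: maximally separated pair under the current metric.
    eps: floor for non-positive totals (an assumption of this sketch).
    """
    n = len(members)
    x1, x2 = far_pair
    weights = []
    for d in range(len(centroid)):
        # Dispersion of the cluster along feature d.
        total = sum((x[d] - centroid[d]) ** 2 for x in members)
        # Violated must-links add their separation along feature d.
        total += sum(0.5 * w * (xi[d] - xj[d]) ** 2 for xi, xj in ml_violations)
        # Violated cannot-links add the gap to the widest pair along feature d.
        total += sum(w_bar * ((x1[d] - x2[d]) ** 2 - (xi[d] - xj[d]) ** 2)
                     for xi, xj in cl_violations)
        weights.append(n / max(total, eps))
    return weights


members = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]]
centroid = [1.0, 2.0]
far_pair = ([0.0, 0.0], [2.0, 4.0])
plain = diagonal_metric_update(members, centroid, [], [], far_pair)
with_ml = diagonal_metric_update(members, centroid,
                                 [([0.0, 0.0], [6.0, 0.0])], [], far_pair)
```

Without constraints each weight is the inverse of the feature's variance within the cluster. A violated must-link whose points differ along the first feature lowers that feature's weight, which shrinks the distance between the two points under the updated metric.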

Intuitively, the first term in the sum, $\sum_{x_i \in X_h} (x_{id} - \mu_{hd})^2$, scales the weight of each feature proportionately to the feature's contribution to the overall cluster dispersion, analogously to scaling performed when computing unsupervised Mahalanobis distance. The last two terms, which depend on constraint violations, stretch each dimension in an attempt to mend the current violations. Thus, the metric weights are adjusted at each iteration in such a way that the contribution of different attributes to distance is variance-normalized, while constraint violations are minimized.

Instead of multiple metrics $\{A_h\}_{h=1}^{K}$, the algorithm can use a single metric A for all clusters. The metric would be used and updated similarly to the description above, except that summations in Eqns. (7) and (8) would be over X, M, and C instead of $X_h$, $M_h$, and $C_h$ respectively.

The objective function decreases after every cluster assignment, centroid re-estimation and metric learning step until convergence, implying that the MPCK-MEANS algorithm will converge to a local minimum of $J_{mpckm}$ as long as matrices $\{A_h\}_{h=1}^{K}$ are obtained directly from Eqn. (7). If any $A_h^{-1}$ is conditioned as described above to make it positive definite, or if the maximally separated points $\{(x'_h, x''_h)\}_{h=1}^{K}$ change between iterations, convergence is no longer guaranteed theoretically; however, empirically this has not been a problem in our experience.

4. Experiments

4.1. Methodology and Datasets

Experiments were conducted on three datasets from the UCI repository: Iris, Wine, and Ionosphere (Blake & Merz, 1998); the Protein dataset used by Xing et al. (2003) and Bar-Hillel et al. (2003); and randomly sampled subsets from the Digits and Letters handwritten character recognition datasets, also from the UCI repository. For Digits and Letters, we chose two sets of three classes: {I, J, L} from Letters and {3, 8, 9} from Digits, sampling 10% of the data points from the original datasets randomly. These classes were chosen since they represent difficult visual discrimination problems.
Table 1 summarizes the properties of the datasets: the number of instances N, the number of dimensions D, and the number of classes K.

Table 1. Datasets used in experimental evaluation

      Iris   Wine   Ionosphere   Protein   Letters   Digits
N
D
K

We have used pairwise F-Measure to evaluate the clustering results based on the underlying classes. Pairwise F-Measure relies on the traditional information retrieval measures, adapted for evaluating clustering by considering same-cluster pairs:

Precision = #PairsCorrectlyPredictedInSameCluster / #TotalPairsPredictedInSameCluster
Recall = #PairsCorrectlyPredictedInSameCluster / #TotalPairsInSameCluster
F-Measure = (2 × Precision × Recall) / (Precision + Recall)

We generated learning curves with 5-fold cross-validation for each dataset to determine the effect of utilizing the pairwise constraints. Each point on the learning curve represents a particular number of randomly selected pairwise constraints given as input to the algorithm. Unit constraint costs W and $\bar{W}$ were used for all constraints, original and inferred, since the datasets did not provide individual weights for the constraints. The clustering algorithm was run on the whole dataset, but the pairwise F-Measure was calculated only on the test set. Results were averaged over 50 runs of 5 folds.

4.2. Results and Discussion

First, we compared constraint-based and metric-based semi-supervised clustering with the integrated framework as well as purely unsupervised and supervised approaches. Figs. 2-7 show learning curves for the six datasets.
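The pairwise F-Measure defined in Section 4.1 can be sketched as follows. The function name is hypothetical, and returning zero when a denominator is empty is a convention chosen for this sketch rather than something the definitions specify.

```python
from itertools import combinations


def pairwise_f_measure(predicted, truth):
    """Pairwise precision, recall and F-measure over all pairs of points."""
    same_pred = same_true = both = 0
    for i, j in combinations(range(len(predicted)), 2):
        p = predicted[i] == predicted[j]   # pair predicted in the same cluster
        t = truth[i] == truth[j]           # pair truly in the same class
        same_pred += p
        same_true += t
        both += p and t
    precision = both / same_pred if same_pred else 0.0
    recall = both / same_true if same_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): five points, two clusters, one point misplaced.
precision, recall, f = pairwise_f_measure([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

Because the measure counts pairs rather than matching cluster identifiers to class labels, it is unaffected by how the clusters happen to be numbered.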
For each dataset, we compared five clustering schemes:

- MPCK-MEANS clustering, which involves both seeding and metric learning in the unified framework described in Section 2.4; a single metric parameterized by a diagonal matrix is used for all clusters;
- MK-MEANS, which is K-Means clustering with the metric learning component described in Section 3.3, without utilizing constraints for initialization; a single metric parameterized by a diagonal matrix is used for all clusters;
- PCK-MEANS clustering, which utilizes constraints for seeding the initial clusters and directs the cluster assignments to respect the constraints without doing any metric learning, as outlined in Section 2.2;
- K-MEANS unsupervised clustering;
- SUPERVISED-MEANS, which performs assignment of points to nearest cluster centroids inferred from constraints, as described in Section 3.1. This algorithm provides a baseline for performance of pure supervised learning based on constraints.

On the presented datasets, the unified approach (MPCK-MEANS) outperforms individual seeding (PCK-MEANS) and metric learning (MK-MEANS). Superiority of semi-supervised over unsupervised clustering illustrates that providing pairwise constraints is beneficial to clustering quality. Improvements of semi-supervised clustering over SUPERVISED-MEANS indicate that iterative refinement of centroids using both constraints and unlabeled data outperforms purely supervised assignment based on neighborhoods inferred from constraints (for Ionosphere, MPCK-MEANS requires either the full weight matrix or individual cluster metrics to outperform SUPERVISED-MEANS; results for these experiments are shown in Fig. 11).

Figure 2. Iris: ablations
Figure 3. Wine: ablations
Figure 4. Protein: ablations
Figure 5. Ionosphere: ablations
Figure 6. Digits-389: ablations
Figure 7. Letters-IJL: ablations

For the Wine, Protein, and Letters-IJL datasets, the difference between methods that utilize metric learning (MPCK-MEANS and MK-MEANS) and those that do not (PCK-MEANS and regular K-Means) with no pairwise constraints indicates that even in the absence of constraints, weighting features by their variance (essentially using unsupervised Mahalanobis distance) improves clustering accuracy. For the Wine dataset, additional constraints provide only an insubstantial improvement in cluster quality, which shows that meaningful feature weights are obtained from scaling by variance using just the unlabeled data.

Some of the metric learning curves display a characteristic dip, where clustering accuracy decreases when initial constraints are provided, but after a certain point starts to increase and eventually rises above the initial point on the learning curve. We conjecture that this phenomenon is due to the fact that metric parameters learned using few constraints are unreliable, and a significant number of constraints is required by the metric learning mechanism to estimate parameters accurately. On the other hand, seeding the clusters with a small number of pairwise constraints has an immediate positive effect on the final cluster quality, while providing more pairwise constraints has diminishing returns, i.e., PCK-MEANS learning curves rise slowly.
When both seeding and metric learning are utilized, the unified approach benefits from the individual strengths of the two methods, as can be seen from the MPCK-MEANS results.

In another set of experiments, we evaluated the utility of using individual metrics for each cluster and the usefulness of learning a full weight matrix A (feature generation) as opposed to a diagonal matrix (feature weighting). We have also compared our methods with RCA, a semi-supervised clustering algorithm that performs metric learning separately from the clustering process (Bar-Hillel et al., 2003), and that has been shown to outperform a similar approach by Xing et al. (2003). Figs. 8-13 show learning curves for the six datasets on the following clustering schemes:

- MPCK-MEANS-S-D, which is the same as MPCK-MEANS in Figs. 2-7 and involves both seeding and metric learning; a single metric (S) parameterized by a diagonal matrix (D) is used for all clusters;
- MPCK-MEANS-M-D, which involves both seeding and metric learning; multiple metrics (M) parameterized by diagonal matrices (D) are used;
- MPCK-MEANS-S-F, which involves both seeding and metric learning; a single metric (S) parameterized by a full matrix (F) is used for all clusters;
- MPCK-MEANS-M-F, which involves both seeding and metric learning; multiple metrics (M) parameterized by full matrices (F) are used;
- RCA clustering, which uses distance metric learning described in (Bar-Hillel et al., 2003) and initialization inferred from constraints as described in Section 3.1.

Figure 8. Iris: metric learning
Figure 9. Wine: metric learning
Figure 10. Protein: metric learning
Figure 11. Ionosphere: metric learning
Figure 12. Digits-389: metric learning
Figure 13. Letters-IJL: metric learning

As can be seen from the results, both full matrix parameterization and individual metrics for each cluster can lead to significant improvements in clustering quality. However, the relative usefulness of these two techniques varies between the datasets, e.g., multiple metrics are particularly beneficial for the Protein and Digits datasets, while switching from a diagonal to a full weight matrix leads to large improvements on Wine, Ionosphere, and Letters. These results can be explained by the fact that the relative success of the two techniques depends on the properties of a particular dataset: using a full weight matrix helps when the attributes are highly correlated, while multiple metrics lead to improvements when clusters in the dataset are of different shapes or lie in different subspaces of the original space. A combination of the two techniques is most helpful when both of these requirements are satisfied, as for Iris and Digits, which was observed by visualizing these datasets. For other datasets, either multiple metrics or a full weight matrix leads to maximum performance in isolation.

Comparing the performance of different variants of MPCK-MEANS with RCA, we can see that early on the learning curves, where few pairwise constraints are available, RCA leads to better metrics than MPCK-MEANS. However, as more training data is provided, the ability of MPCK-MEANS to learn from both supervised and unsupervised data as well as use individual metrics allows MPCK-MEANS to produce better clustering.
Overall, our results indicate that the integrated approach to utilizing pairwise constraints in clustering with individual metrics outperforms seeding and metric learning individually and leads to improvements in cluster quality. Extending the basic approach with a full parameterization matrix and individual metrics for each cluster can lead to significant improvements over the basic method.

5. Related work

In previous work on constrained pairwise clustering, Wagstaff et al. (2001) proposed the COP-KMeans algorithm that has a heuristically motivated objective function. Our formulation, on the other hand, has an underlying generative model based on Hidden Markov Random Fields (see (Basu et al., 2004) for a detailed analysis). Bansal et al. (2002) also proposed a framework for pairwise constrained clustering, but their model performs clustering using only the constraints, whereas our formulation uses both constraints and an underlying distance metric between the points for clustering.

Schultz and Joachims (2004) recently introduced a method for learning distance metric parameters based on relative comparisons. In unsupervised clustering, Domeniconi (2002) proposed a variant of K-Means that incorporated learning individual Euclidean metric weights for each cluster; our approach is more general since it allows metric learning to utilize pairwise constraints along with unlabeled data.

In recent work on semi-supervised clustering with pairwise constraints, Cohn et al. (2003) used gradient descent for weighted Jensen-Shannon divergence in the context of EM clustering. Xing et al. (2003) utilized a combination of gradient descent and iterative projections to learn a Mahalanobis metric for K-Means clustering. Also, Bar-Hillel et al. (2003) proposed the Relevant Component Analysis (RCA) algorithm that uses only must-link constraints to learn a Mahalanobis metric using convex optimization. All these metric learning techniques for clustering train a single metric first using only supervised data, and then perform clustering on the unsupervised data. In contrast, our method integrates distance metric learning with the clustering process and utilizes both supervised and unsupervised data to learn multiple metrics, which experimentally leads to improved results. Finally, a unified objective function for semi-supervised clustering with constraints was recently proposed by Segal et al. (2003); however, it did not incorporate distance metric learning.

6. Conclusions and Future Work

This paper has presented MPCK-MEANS, a new approach to semi-supervised clustering that unifies the previous constraint-based and metric-based methods. It is based on a variation of the standard K-Means clustering algorithm and uses pairwise constraints along with unlabeled data for constraining the clustering and learning distance metrics. In contrast to previously proposed semi-supervised clustering algorithms, MPCK-MEANS also allows clusters to lie in different subspaces and have different shapes.

By ablating the individual components of our integrated approach, we have experimentally compared metric learning and constraints in isolation with the combined algorithm. Our results have shown that by unifying the advantages of both techniques, the integrated approach outperforms the two techniques individually.
We have shown that using individual metrics for different clusters, as well as performing feature generation via a full weight matrix in contrast to feature weighting with a diagonal weight matrix, can lead to improvements over our basic algorithm. Extending our approach to high-dimensional datasets, where Euclidean distance performs poorly, is the primary avenue for future research. Other interesting topics for future work include selection of the most informative pairwise constraints that would facilitate accurate metric learning and obtaining good initial centroids, as well as methodology for handling noisy constraints and cluster initialization sensitive to constraint costs.

7. Acknowledgments

We would like to thank the anonymous reviewers and Joel Tropp for insightful comments. This research was supported in part by NSF grants IIS and IIS , and by a Faculty Fellowship from IBM Corp.

References

Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science (FOCS-02) (pp ).
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. Proceedings of the 20th International Conference on Machine Learning (ICML-2003) (pp ).
Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. Proceedings of the 19th International Conference on Machine Learning (ICML-2002) (pp ).
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In submission, available at http:// ml/publication.
Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003) (pp ).
Bilmes, J. (1997). A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models (Tech. Report ICSI-TR ). ICSI.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http:// mlearn/mlrepository.html.
Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Tech. Report TR ). Cornell University.
Demiriz, A., Bennett, K. P., & Embrechts, M. J. (1999). Semi-supervised clustering using genetic algorithms. Artificial Neural Networks in Engineering (ANNIE-99) (pp ).
Domeniconi, C. (2002). Locally adaptive techniques for pattern classification. Doctoral dissertation, University of California, Riverside.
Klein, D., Kamvar, S. D., & Manning, C. (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002) (pp ).
Kleinberg, J., & Tardos, E. (1999). Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (FOCS-99) (pp ).
Saul, L., & Roweis, S. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4,
Segal, E., Wang, H., & Koller, D. (2003). Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19, i264-i272.
Schultz, M., & Joachims, T. (2004). Learning a distance metric from relative comparisons. Advances in Neural Information Processing Systems 16.
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained clustering with background knowledge. Proceedings of the 18th International Conference on Machine Learning (ICML-2001) (pp ).
Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems 15 (pp ).
