A Unified Framework to Integrate Supervision and Metric Learning into Clustering

Xin Li and Dan Roth
Department of Computer Science, University of Illinois, Urbana, IL 61801
(xli1,danr)@uiuc.edu
December 1, 200

Abstract

In this paper, we propose a unified framework for applying supervision to discriminative clustering, which: (1) explicitly formalizes the problem of training a partition function as a supervised metric-learning process; (2) learns a partition function that optimizes a supervised clustering error; and (3) flexibly learns a distance metric with respect to any clustering algorithm. Moreover, we develop a general gradient-descent learning algorithm that trains a distance metric under this framework. The convergence of this algorithm is guaranteed for some restricted cases. Our experimental study shows a significant performance improvement when supervision and distance metric learning are integrated into clustering and trained in our new framework.

1 Introduction

Many learning tasks aim at partitioning data into disjoint sets (Kamvar, Klein, and Manning, 2002; Bach and Jordan, 2003; Bilenko, Basu, and Mooney, 2004). The hope is to find a partitioning algorithm whose output matches the standard designed by a human. When labeled data is limited and expensive to obtain, clustering is typically adopted as an unsupervised approach to these tasks. The typical clustering scenario is: given a set of unlabeled data points and a standard distance metric that measures the distance between any two data points, a specific clustering algorithm optimizes a quality function defined over the data set. Clustering quality is typically measured without any supervision; note that, used this way, clustering involves no learning.

(We would like to thank Vasin Punyakanok and Steve Hanneke for discussions and suggestions on this work.)

There are many problems with this framework. First, without any supervision, clustering results can diverge from a human's intention. For example, to partition the data points in Figure 1 into k groups, different clustering algorithms, with different quality functions and distance metrics, can partition them in many different ways, such as horizontally or vertically. Second, this framework lacks flexibility. By using a fixed distance metric and a given clustering algorithm, we impose many strict assumptions on the data set and the task itself (Kamvar, Klein, and Manning, 2002). For instance, the K-means clustering algorithm works only when the data points are generated from Gaussian distributions with identical variances in the given metric space; even a minor transformation of the metric space or of the generative models will lead to clustering failure.

In order to solve these problems, researchers have started to extend clustering in two directions: (1) exploiting supervision to help clustering, and (2) incorporating distance metric learning to increase flexibility. These two directions are closely related in many works. (Bach and Jordan, 2003) attempts to learn a distance metric in an unsupervised setting by optimizing clustering quality. Some works that incorporate partial supervision with distance metric learning include (Bar-Hillel et al., 2003; Schultz and Joachims, 2003; Bilenko, Basu, and Mooney, 2004), but supervision appears only as constraints when they optimize an unsupervised quality function. In these works, supervision is not treated as a direct evaluation and target of clustering, and there is no explicit training stage. (Xing et al., 2002) and (Mochihashi, Kikui, and Kita, 2004) train a distance metric by optimizing a quality function over a labeled data set; however, their distance metrics are trained independently of any clustering algorithm. Other works that incorporate supervision into discriminative clustering include (Bilenko et al., 2003; Cohen, Ravikumar, and Fienberg, 2003), which train a similarity metric through a pairwise classification task that decides whether a pair of data points belongs to the same class.

Figure 1: An Example of Unsupervised Clustering. Panels (a) and (b) denote the correct labeling given by a human and the partitioning produced by the K-means clustering algorithm, respectively. Data points of different classes are denoted by different shapes.

In this paper, we develop an integrated framework for supervised discriminative clustering (as shown in Figure 2). In this framework, a clustering task is explicitly split into training and application stages. A partition function, parameterized by a distance metric, is trained in a supervised setting. The distinguishing properties of this framework are: (1) The training stage is formalized as an optimization problem to

learn a partition function that can minimize a clustering error. (2) The clustering error is well defined and supervised, in the sense that clustering results are compared with a human's supervision. (3) The goals of learning and application are inherently integrated; that is, we seek to minimize the clustering error both in training and in application. (4) The partition function, or the distance metric, can be learned flexibly with respect to any clustering algorithm. (5) Kernels can be integrated with the distance metric in this model to allow a more complex feature space.

Figure 2: Supervised Discriminative Clustering Framework (SDC). Training stage: given a labeled data set S, a supervised learner seeks $h^* = \operatorname{argmin}_h err_S(h, p)$, where the partition function $h(S) = \mathcal{A}(d, S)$ pairs a distance metric d with a clustering algorithm $\mathcal{A}$. Application stage: h is applied to an unlabeled data set S' to produce a partition h(S').

Moreover, we develop a general gradient-descent learning algorithm that can train a distance metric under this framework for any clustering algorithm. In the rest of this paper, we first introduce our supervised discriminative clustering framework (SDC) and a general learner that trains a distance metric in this framework. After that, we present preliminary experimental results that exhibit significant performance improvements of our approach over the K-means clustering algorithm.

2 Supervised Discriminative Clustering (SDC)

2.1 Problem Definition

Partitioning is the task of splitting a set of elements into non-overlapping subsets. Given a set of elements $S \subseteq X$, a partition $P(S) = \{S_1, S_2, \ldots, S_K\}$ of S is a disjoint decomposition of S. We associate with a partition P a partition function p that maps any set $S \subseteq X$ to a set of indices $\{1, 2, \ldots, K\}$. Notice that, unlike a classifier, the image $p_S(x)$ of $x \in S$ under the partition function p depends on the set S. We omit the subscript S in $p_S(x)$ when it is clear from the context.

Let p be the target partition over X, and let h be any partition function over X. In principle, one can measure the deviation of h from the target partition p on $S \subseteq X$ using a loss function $\ell_S(h, p) \in [0, +\infty)$ defined as $\ell_S(h, p) \equiv \frac{1}{|S|} \big|\{x \in S : p_S(x) \neq h_S(x)\}\big|$. However, in clustering we seldom know the mapping from the index of each

element h(x) to p(x), so a more general loss function, namely the (weighted) clustering error given a distance metric d, is defined as follows:

$$err_S(h, p) \equiv \frac{1}{|S|^2} \sum_{ij} \Big[ d(x_i, x_j) \cdot A_{ij} + \big(D - d(x_i, x_j)\big) \cdot B_{ij} \Big] \qquad (1)$$

where $A_{ij} = I[\,p(x_i) = p(x_j) \;\&\; h(x_i) \neq h(x_j)\,]$, $B_{ij} = I[\,p(x_i) \neq p(x_j) \;\&\; h(x_i) = h(x_j)\,]$, $D = \max_{x, x' \in S} d(x, x')$ is the maximum distance between any two points in S, and I is the indicator function.

When the target partition p is unknown, we can still evaluate the quality of a partition. Given a data set $S \subseteq X$ and any partition function h, a quality function is a mapping $q_S(h) \in (-\infty, +\infty)$. Different clustering algorithms optimize different quality functions. For instance, the quality defined in the K-means algorithm is $q_S(h) \equiv \sum_{x_i \in S} [d(x_i, \mu_i)]^2$, where $\mu_i$ is the mean of the data points in the cluster of $x_i$ under h(S). There is a natural loss function for any quality function $q_S(h)$ once the supervision p(S) is given: $\ell_S(h) = q_S(p) - q_S(h)$. Notice the difference between $\ell$, err, and q: a quality function q is unsupervised, since p is unknown.

In the training stage, the goal is to learn a good partition function from a set of observed data. Depending on whether the data is labeled or unlabeled, we can further define the supervised and unsupervised training problems:

Definition 2.1 Supervised Training (SDC): Given an annotated data set S, a set of partition functions H, and the error function $err_S(h, p)$ for $h \in H$, the problem is to find an optimal partition function $h^*$ s.t. $h^* = \operatorname{argmin}_{h \in H} err_S(h, p)$.

Definition 2.2 Unsupervised Training: Given an unannotated data set S, a set of possible partition functions H, and a quality function $q_S(h)$ for $h \in H$, the problem is to find an optimal partition function $h^*$ s.t. $h^* = \operatorname{argmax}_{h \in H} q_S(h)$.

With this formalization, our supervised clustering work can be clearly distinguished from (1) unsupervised clustering approaches, which typically attempt to learn a metric that optimizes q, and (2) related works that exploit partial supervision in metric learning: in the training stage of our framework, supervision is directly integrated into the error function, and the goal is to find a function in H that minimizes the clustering error rather than one that optimizes clustering quality under constraints. Consequently, given new data S' at the application stage, under some assumptions (not stated in this paper), the hope is that the learned partition function generalizes well and achieves a small $err_{S'}(h, p)$.
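
To make the error of Equation (1) concrete, the following is a minimal sketch of how it could be computed for a candidate partition against the target partition. The function name and the NumPy-based implementation are ours, not part of the original report; any distance function d can be plugged in.

```python
import numpy as np

def clustering_error(X, target, predicted, d):
    """Weighted clustering error of Equation (1).

    X         : (n, m) array of feature vectors for the set S
    target    : length-n array of target cluster indices p(x_i)
    predicted : length-n array of predicted cluster indices h(x_i)
    d         : callable d(x, x') returning a non-negative distance
    """
    n = len(X)
    # Pairwise distances and the maximum distance D over S.
    dist = np.array([[d(X[i], X[j]) for j in range(n)] for i in range(n)])
    D = dist.max()

    err = 0.0
    for i in range(n):
        for j in range(n):
            same_p = target[i] == target[j]
            same_h = predicted[i] == predicted[j]
            if same_p and not same_h:       # A_ij: together under p, split by h
                err += dist[i, j]
            elif not same_p and same_h:     # B_ij: apart under p, merged by h
                err += D - dist[i, j]
    return err / n ** 2
```

For instance, with the weighted metric $d_1$ of Equation (2) below and a weight vector w, one could pass `d = lambda x, y: np.sum(w * np.abs(x - y))`.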

2.2 Parameterizing a Partition Function

In reality, a partition function h in clustering is a pairing of a clustering algorithm $\mathcal{A}$ and a distance metric d; that is, $h(S) \equiv \mathcal{A}(d, S)$. Either component can influence the clustering results, so h can be parameterized by $\mathcal{A}$, by d, or by both. In this preliminary work, we study only one of the three options: we fix the clustering algorithm (for example, K-means) and learn a distance metric that minimizes the clustering error. The error function $err_S(h, p)$, as defined in Equ. 1, is adopted in the rest of this paper.

Suppose each element $x \in X$ is represented as a feature vector $x = \langle f_1, f_2, \ldots, f_m \rangle$, and a distance metric $d(x, x') \in [0, +\infty)$ is defined as a function $d(x, x') \equiv g(\langle f_1, \ldots, f_m \rangle, \langle f'_1, \ldots, f'_m \rangle)$. The parameters of g are denoted $\theta \in \Theta$. Given real-valued features, one example of d is

$$d_1(x, x') \equiv \sum_{k=1}^{m} w_k\, |f_k - f'_k| \qquad (2)$$

where $d_1$ is parameterized by $\theta = \{w_k\}$.

Definition 2.3 Supervised Metric Learning: Given an annotated data set S, a family of distance metrics parameterized by $\theta \in \Theta$, a specific clustering algorithm $\mathcal{A}$, and $err_S(h, p)$ as defined in Equ. 1, the problem is to seek an optimal partition function $h^*$ that satisfies $\theta^* = \operatorname{argmin}_{\theta} err_S(h, p)$, where $h^*(S) \equiv \mathcal{A}(d_{\theta^*}, S)$.

Note that in this framework a distance metric is learned with respect to a given clustering algorithm. No universal distance metric can satisfy the assumptions of every clustering algorithm; it is necessary to learn a specific distance metric for a given algorithm so that, in the learned metric space, the assumptions of that algorithm can be satisfied.

2.3 A General Learner for SDC

In addition to the general framework of supervised discriminative clustering, we describe a general learning algorithm, based on gradient descent, that can learn a metric with respect to any clustering algorithm (shown in Figure 3). The gradient-descent algorithm converges to a local optimum when one exists. When the error function err is linear in $\theta$ — that is, when $\theta = \{w_k\}$ and $d(x, x') \equiv \sum_k w_k f_k(x, x')$, where $f_k(x, x')$ is a feature defined over the pair $(x, x')$ — the above algorithm becomes a perceptron-like algorithm. If we further define a global feature over S as $\psi_k(h, S) = \frac{1}{|S|^2} \sum_{ij} f_k(x_i, x_j) \cdot I[h(x_i) = h(x_j)]$, then the update rule in Step 2(c) becomes

$$w_k^t = w_k^{t-1} + \psi_k(p, S) - \psi_k(h^{t-1}, S). \qquad (3)$$

The convergence of this algorithm can be derived from Michael Collins's mistake-bound theorem for the generalized perceptron algorithm (Theorem 1 in (Collins, 2002)): the algorithm converges to the global optimum when there exists $\theta = \{w_k\}$ ($w_k \geq 0$) that achieves zero error.
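
As a rough illustration of this perceptron-like special case, the sketch below assumes the pairwise features are per-coordinate absolute differences, $f_k(x, x') = |f_k - f'_k|$, so that the metric is exactly $d_1$ of Equation (2); it then computes the global features $\psi_k$ and the additive update of Equation (3). Function names and the NumPy implementation are ours.

```python
import numpy as np

def pair_features(x, y):
    # f_k(x, x') = |x_k - x'_k|, so d(x, x') = sum_k w_k * f_k(x, x') is d_1 of Eq. (2)
    return np.abs(x - y)

def global_features(X, labels):
    """psi_k(h, S): average of f_k over all pairs placed in the same cluster by h."""
    n = len(X)
    psi = np.zeros(X.shape[1])
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                psi += pair_features(X[i], X[j])
    return psi / n ** 2

def perceptron_update(w, X, target, predicted):
    """One perceptron-like step of Eq. (3): w <- w + psi(p, S) - psi(h, S)."""
    return w + global_features(X, target) - global_features(X, predicted)
```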

Algorithm: SDC-Learner
Input: S and p: the annotated data set. $\mathcal{A}$: the clustering algorithm. $err_S(h, p)$: the clustering error function.
Output: $\theta$.
1. In the initial (I-) step, we randomly choose $\theta^0$ for d. After this step we have the initial $d^0$ and $h^0$.
2. Then we iterate over t (t = 1, 2, ...):
   (a) Partition S using $h^{t-1}(S) \equiv \mathcal{A}(d^{t-1}, S)$;
   (b) Compute $err_S(h^{t-1}, p)$;
   (c) Update $\theta$ using the formula $\theta^t = \theta^{t-1} + \frac{\partial\, err_S(h^{t-1}, p)}{\partial \theta^{t-1}}$.
3. Stopping criterion: if $err_S(h^t, p) \geq err_S(h^{t-1}, p)$ or t > T, the algorithm exits.

Figure 3: A General Training Algorithm for SDC

3 Experimental Study

3.1 Simulations

In this simulation, we randomly generate two data sets. Each data set contains 160 data points, generated uniformly from four classes on a plane; each data point has two features (x, y). The K-means algorithm is applied in the training stage of SDC to learn a linear distance metric, defined as $d_1$ in Equ. 2 and parameterized by $\theta = \{w_x, w_y\}$. The perceptron-like algorithm of Figure 3, with the update rule in Equ. 3, is used to train the distance metric on the two labeled data sets, respectively. In the initialization step, $\theta$ is set to (0, 1), which focuses on the vertical distance between data points.

In Figure 4, we can observe that the effect of training on the two data sets is significant. Without any supervision, the K-means algorithm always partitions the data points along the rows, while the true partition is along the columns. After training, the same K-means algorithm with the learned distance metric partitions the data points correctly after 20 iteration steps. For the first data set, the learned metric parameters become (0.86, 0.1) after training, focusing on the horizontal distance.

Figure 4: Two Simple Examples of Training a Linear Distance Metric. Different colors represent different classes. The first sub-figure in each row denotes the true partition of the points in each data set; the second shows the unsupervised K-means clustering results with the initial distance metric; and the third shows the clustering results using the learned distance metric after training.

3.2 Real Data Sets

In the following preliminary experiments, we only evaluate the SDC framework on training a good distance metric with respect to the K-means clustering algorithm. The performance is then compared with that of an unsupervised K-means algorithm (with $L_1$ as the metric). For this purpose, we collected four data sets from the UCI repository (Blake and Merz, 1998). The K-means algorithm is applied both in the training stage of SDC, to learn a linear distance metric defined as $d_1$ ($\theta = \{w_k\}$) in Equ. 2, and in the test stage, where it is combined with the learned distance metric to cluster the test data. The initial weights are set to 1. The perceptron-like algorithm of Figure 3, with the update rule of Equ. 3, is applied in training. Moreover, the initial center of each cluster is chosen randomly for the K-means algorithm. Results are averaged over 20 runs of two-fold cross-validation; the unsupervised K-means is run only on the test sets. The performance of both approaches is evaluated in the standard way, computing precision, recall, and $F_1$ over all pairs of elements, relative to the target partition. Not surprisingly, by learning a better metric for the data, the SDC framework significantly outperforms unsupervised K-means clustering on these data sets.

No. | Data Set      | # of elements | # of features | # of classes | K-Means | SDC
1   | iris plants   | 150           | 4             | 3            | 82.9    | 91.21
2   | breast cancer | 69            | 1             | 2            | 78.     | 91.6
3   | balance       | 625           | 4             | 3            | 6.9     | 2.
4   | wine          | 178           | 13            | 3            | 6.26    | 88.11

Table 1: Comparison of SDC with the unsupervised K-means ($F_1$). Four UCI data sets are used in our experiments; the size of each data set and its numbers of features and classes are summarized. Results are averaged over 20 runs of two-fold cross-validation.
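
To tie the pieces together, here is a rough end-to-end sketch of the training loop of Figure 3 instantiated with a K-means-style clusterer under the weighted metric of Equation (2), reusing clustering_error and perceptron_update from the earlier sketches. The weighted_kmeans routine and all names are our own illustrative stand-ins (in particular, refitting centers by coordinate-wise means under an L1-style metric is a simplification), not the exact procedure used in the experiments.

```python
import numpy as np

def weighted_kmeans(X, w, k, n_iter=50, seed=0):
    """K-means-style clustering under d_1(x, x') = sum_k w_k |x_k - x'_k|.
    Centers are refit as coordinate-wise means, which is a simplification."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each point to the nearest center under the weighted metric.
        dist = np.array([[np.sum(w * np.abs(x - c)) for c in centers] for x in X])
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def metric(w):
    return lambda x, y: np.sum(w * np.abs(x - y))     # d_1 of Eq. (2)

def sdc_train(X, target, k, max_iter=100):
    """SDC-Learner sketch (Figure 3): alternate clustering and the perceptron-like
    update of Eq. (3); stop when the clustering error no longer decreases."""
    w = np.ones(X.shape[1])                           # initial weights, as in Section 3.2
    best_err = clustering_error(X, target, weighted_kmeans(X, w, k), metric(w))
    for _ in range(max_iter):
        predicted = weighted_kmeans(X, w, k)
        w_new = perceptron_update(w, X, target, predicted)
        err = clustering_error(X, target, weighted_kmeans(X, w_new, k), metric(w_new))
        if err >= best_err:                           # stopping criterion of Figure 3
            break
        w, best_err = w_new, err
    return w
```

On a labeled training fold, sdc_train returns learned weights w, which can then be used with the same clustering routine on the test fold, mirroring the two-fold setup described above.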

4 Conclusion

In this paper, we propose a unified framework for applying supervision to discriminative clustering, which can be used to learn a distance metric, for any given clustering algorithm, that minimizes a supervised clustering error. Our preliminary experimental study shows its performance improvement and greater flexibility over unsupervised clustering.

References

Bach, F. R. and M. I. Jordan. 2003. Learning spectral clustering. In Proc. of the Conference on Advances in Neural Information Processing Systems.

Bar-Hillel, A., T. Hertz, N. Shental, and D. Weinshall. 2003. Learning distance functions using equivalence relations. In Proc. of the International Conference on Machine Learning.

Bilenko, M., S. Basu, and R. J. Mooney. 2004. Integrating constraints and metric learning in semi-supervised clustering. In Proc. of the International Conference on Machine Learning.

Bilenko, M., R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. 2003. Adaptive name matching in information integration. IEEE Intelligent Systems.

Blake, C. L. and C. J. Merz. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Cohen, W., P. Ravikumar, and S. Fienberg. 2003. A comparison of string metrics for name-matching tasks. In IIWeb Workshop 2003.

Collins, M. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP.

Kamvar, S., D. Klein, and C. Manning. 2002. Interpreting and extending classical agglomerative clustering algorithms using a model-based approach. In Proc. of the International Conference on Machine Learning.

Mochihashi, D., G. Kikui, and K. Kita. 2004. Learning nonstructural distance metric by minimum cluster distortions. In Proc. of COLING.

Schultz, M. and T. Joachims. 2003. Learning a distance metric from relative comparisons. In Proc. of the Conference on Advances in Neural Information Processing Systems.

Xing, E. P., A. Y. Ng, M. I. Jordan, and S. Russell. 2002. Distance metric learning, with application to clustering with side-information. In Proc. of the Conference on Advances in Neural Information Processing Systems.