Efficient Algorithm for Distance Metric Learning


Yipei Wang
Language Technologies Institute, Carnegie Mellon University
Pittsburgh, PA
yipeiw@cmu.edu

Abstract

Distance metric learning provides an approach to transfer knowledge from sparsely labeled data to unlabeled data. The learned metric is better suited to measuring the semantic similarity among instances. The main idea is to build an objective function from equivalence and in-equivalence constraints and to pose metric learning as an optimization problem. In this paper, we propose to unify different metric learning algorithms into a semidefinite programming (SDP) framework. Classical semidefinite programming algorithms are extremely expensive on larger problems, so we discuss efficient algorithms for large-scale metric learning. We investigate a recently proposed algorithm derived from the Frank-Wolfe algorithm and propose novel acceleration strategies based on the special structure of the problem. We compare different algorithms on 3 UCI datasets in a clustering problem.

1 Introduction

A proper distance metric has a crucial effect on the performance of distance-based supervised and unsupervised learning. For instance, the performance of the K-means clustering algorithm, KNN classifiers and SVM classifiers depends critically on the metric. Metric learning provides approaches to transfer knowledge learned from sparsely labeled data to unlabeled data. Recently this problem has been actively studied [1][2][3][4][5][16]. These methods have been applied to many real-world problems such as image retrieval [7], face verification [6] and bioinformatics [8].

Previous works often use different formulations and provide specific optimization techniques to solve the problem. For example, Xing et al. [1] pose the problem as a convex optimization problem and design an iterative gradient descent algorithm to solve it. In [2], the authors incorporate the idea of a margin into the cost function and minimize the cost function through an alternating projection algorithm. In [4], the authors learn the Mahalanobis distance metric by directly maximizing a stochastic variant of the leave-one-out KNN score on the training set. That function is not convex, and they adopt gradient search to find a maximum.

Though the distance metric can be a general function, the prevalent form of the distance function is $(x - y)^T A (x - y)$ with $A \succeq 0$. This corresponds to a linear transformation of the data; nonlinear transformations can be obtained by using kernels. Inspired by the form of the distance function, can we derive a unified semidefinite programming framework? In the following, we discuss the reformulation of two previous algorithms into standard semidefinite programming (SDP) form, and we implement the algorithms with the open source SDP solver SeDuMi [9].

Another problem is the efficiency of the algorithm on large-scale problems, with either high dimension or large data size. Recent works have studied multiple techniques for large-scale semidefinite programming [11]. The work mainly falls into two research directions. One direction is to develop first-order methods designed for solving generic optimization problems [12][13]. They provide approximation approaches to reduce the iteration cost. The second direction is to design algorithms that exploit the special structure of the problem [11][14]. These algorithms include the Frank-Wolfe algorithm, block coordinate descent methods, cutting-plane methods, etc. Recently, the development of subsampling techniques has also led to some efficient algorithms [15] for large-scale problems. Here we follow a recently proposed approach [5], which combines the two main directions. The authors first reformulate the problem as an eigenvalue optimization problem and then design an efficient algorithm by combining a smoothing technique with the Frank-Wolfe algorithm. By investigating the sparse structure of the problem, we further propose several acceleration strategies to improve the efficiency.

2 Related Work

We mainly focus on two metric learning methods in the rest of the paper. Both learn the metric through convex optimization, and one of them (LMNN) achieves state-of-the-art performance on multiple datasets.

2.1 Review of method by Xing

Problem Formulation

In Xing's method, we are given pairs of equivalence constraints:

$S: (x_i, x_j) \in S$ if $x_i$ and $x_j$ are similar.

The criterion for the desired metric is that pairs of points in $S$ should have small distance. Writing $\|x_i - x_j\|_A = \sqrt{(x_i - x_j)^T A (x_i - x_j)}$, this is cast as the convex optimization problem below:

$$\min_A \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad \sum_{(x_i,x_j)\in D} \|x_i - x_j\|_A \ge 1,\; A \succeq 0 \tag{1}$$

The in-equivalence constraints are used as the condition. Here, $D$ can be a set of pairs of points known to be dissimilar if such information is explicitly available; otherwise, we just take all pairs not in $S$. Without this condition, the problem is solved trivially by $A = 0$, which is not useful. As mentioned in the paper, the constraint is not formulated as $\sum_{(x_i,x_j)\in D} \|x_i - x_j\|_A^2 \ge 1$ because that would always result in $A$ being rank 1.

Optimization

We derive the optimization steps in order to analyze the computational cost. Details can be found in the appendix.

1. Newton method for diagonal $A$.
Xing proved that the original optimization problem is equivalent to minimizing the function

$$g(A) = \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 - C \log\Big(\sum_{(x_i,x_j)\in D} \|x_i - x_j\|_A\Big)$$

When $A$ is diagonal, the computational cost is $O(n)$. But when $A$ is a full matrix, there are $n^2$ parameters and inverting the Hessian requires $O(n^6)$ time, which is too expensive to be acceptable.

2. Projected gradient search

For a full matrix $A$, Xing proposed an iterative projected gradient search to solve the optimization problem efficiently. The problem is posed in the equivalent form below:

$$\max_A \; g(A) = \sum_{(x_i,x_j)\in D} \|x_i - x_j\|_A \quad \text{s.t.} \quad f(A) = \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 \le 1,\; A \succeq 0 \tag{2}$$

The algorithm takes a gradient step on $g(A)$ and then repeatedly projects onto the sets $C_1 = \{A : \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 \le 1\}$ and $C_2 = \{A : A \succeq 0\}$. The algorithm is shown below:

Iterate
    Iterate
        $A := P_{C_1}(A)$
        $A := P_{C_2}(A)$
    until $A$ converges
    $A := A + t\,(\nabla_A g(A))_{\perp \nabla_A f(A)}$
until convergence

Here the gradient step is taken along the component of $\nabla_A g$ orthogonal to $\nabla_A f$, and $t$ is the step size.

The projection onto the set $C_1$ can be solved analytically. Writing $A_0$ for the matrix before projection and $X_s = \sum_{(x_i,x_j)\in S} (x_i - x_j)(x_i - x_j)^T$, so that $\langle X_s, A\rangle = \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2$, the projection of an infeasible $A_0$ is

$$A = \frac{1 - \langle X_s, A_0\rangle}{\|X_s\|_F^2}\, X_s + A_0$$

The projection onto the set $C_2$ is computed through an eigendecomposition of the matrix:

$$A = U^T \Sigma U, \qquad \Sigma_+ = \max(0, \Sigma), \qquad A' = U^T \Sigma_+ U$$

2.2 Review of LMNN method

This work aims to learn a Mahalanobis matrix for kNN classification. Compared to Xing's method, we are given more information (the class label of each point in the training set) than just the equivalence constraints. Here, we use $y_{ij} \in \{0, 1\}$ to indicate whether or not the class labels $y_i$ and $y_j$ match. We use $\eta_{ij} \in \{0, 1\}$ to indicate whether input $x_j$ is a k-nearest neighbor of input $x_i$. The criterion is that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. The cost function is given by:

$$\text{cost}(A) = \sum_{i,j} \eta_{ij} \|x_i - x_j\|_A^2 + c \sum_{i,j,l} \eta_{ij} (1 - y_{il}) \max\big(1 + \|x_i - x_j\|_A^2 - \|x_i - x_l\|_A^2,\; 0\big)$$

Minimizing the cost function is equivalent to the problem:

$$\min_{A,\epsilon} \sum_{i,j} \eta_{ij} \|x_i - x_j\|_A^2 + c \sum_{i,j,l} \eta_{ij} (1 - y_{il})\, \epsilon_{ijl} \quad \text{s.t.} \quad \|x_i - x_l\|_A^2 - \|x_i - x_j\|_A^2 \ge 1 - \epsilon_{ijl},\; \epsilon_{ijl} \ge 0,\; A \succeq 0 \tag{3}$$

Most of the slack variables $\epsilon_{ijl}$ never attain positive values. The method is based on a combination of sub-gradient descent in the matrices $L$ and $M$ (with $M = L^T L$ the Mahalanobis matrix, denoted $A$ above), the latter mainly to verify that the global minimum has been reached. The alternating projection algorithm is proved to converge [17].

3 The unified semidefinite programming framework

We will discuss how to transform the two metric learning algorithms into standard semidefinite programming problems.
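As a concrete reading of the LMNN cost function from Section 2.2, the following NumPy sketch evaluates the pull term and the hinge (push) term for a given Mahalanobis matrix. It is only an illustration under stated assumptions: the function name `lmnn_cost`, the argument layout, and the choice to pass the neighbor indicator `eta` in as a precomputed 0/1 matrix are hypothetical and not taken from the LMNN implementation.

```python
import numpy as np

def lmnn_cost(X, y, eta, A, c=1.0):
    """Evaluate the LMNN cost for Mahalanobis matrix A (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    eta = np.asarray(eta, dtype=float)
    A = np.asarray(A, dtype=float)
    diff = X[:, None, :] - X[None, :, :]                # diff[i, j] = x_i - x_j
    dist = np.einsum("ijd,de,ije->ij", diff, A, diff)   # squared distances under A
    y_match = (y[:, None] == y[None, :]).astype(float)  # y_match[i, l] = 1 if labels agree
    # Pull term: sum over neighbor pairs of the squared distance.
    pull = np.sum(eta * dist)
    # hinge[i, j, l] = max(1 + ||x_i - x_j||_A^2 - ||x_i - x_l||_A^2, 0)
    hinge = np.maximum(1.0 + dist[:, :, None] - dist[:, None, :], 0.0)
    # Push term: only differently labeled points l contribute.
    push = np.sum(eta[:, :, None] * (1.0 - y_match)[:, None, :] * hinge)
    return pull + c * push

# Toy data: points 0 and 1 share a class, point 2 is an impostor candidate.
X = np.array([[0.0], [1.0], [10.0]])
y = np.array([0, 0, 1])
eta = np.zeros((3, 3))
eta[0, 1] = eta[1, 0] = 1.0
cost_identity = lmnn_cost(X, y, eta, np.eye(1))
cost_shrunk = lmnn_cost(X, y, eta, 0.01 * np.eye(1))
```

With the identity metric the impostor is far outside the margin, so only the pull term contributes. Shrinking the metric lowers the pull term but activates the hinge term, which is the trade-off that the constant `c` controls.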

3.1 Xing's method

We use the notation $X_s = \sum_{(x_i,x_j)\in S} (x_i - x_j)(x_i - x_j)^T$ and $X_\alpha = (x_i - x_j)(x_i - x_j)^T$ for the $\alpha$-th pair $(x_i, x_j) \in D$. Then $\sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2$ can be written as $\langle X_s, A\rangle$. We modify the original problem as below so that it is easy to reformulate as an SDP problem:

$$\max_A \min_{(x_i,x_j)\in D} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 \le 1,\; A \succeq 0 \tag{4}$$

The problem can be rewritten as:

$$\max_{A,\,t} \; t \quad \text{s.t.} \quad \langle X_\alpha, A\rangle \ge t,\; \alpha = 1, \dots, |D|; \quad \langle X_s, A\rangle \le 1; \quad A \succeq 0 \tag{5}$$

The problem can be transformed into standard form as below:

$$\max_X \langle C, X\rangle \quad \text{s.t.} \quad \langle MD_\alpha, X\rangle = 0,\; \alpha = 1, \dots, |D|; \quad \langle MS, X\rangle = 1; \quad X \succeq 0 \tag{6}$$

where

$$X = \begin{pmatrix} A & & \\ & t & \\ & & d_{(|D|+1)\times(|D|+1)} \end{pmatrix}, \qquad C = \begin{pmatrix} 0_{n\times n} & & \\ & 1 & \\ & & 0_{(|D|+1)\times(|D|+1)} \end{pmatrix}$$

Here $d$ is a diagonal matrix whose diagonal elements are the slack variables that turn the inequalities into equalities. The constraint matrices are

$$MD_\alpha = \begin{pmatrix} X_\alpha & & & \\ & -1 & & \\ & & -E_\alpha & \\ & & & 0 \end{pmatrix}, \qquad MS = \begin{pmatrix} X_s & & & \\ & 0 & & \\ & & 0_{|D|\times|D|} & \\ & & & 1 \end{pmatrix}$$

For the matrix $E_\alpha \in R^{|D|\times|D|}$, $(E_\alpha)_{\alpha\alpha} = 1$ and all other elements equal 0.

3.2 LMNN method

We use the notation $C_0 = \sum_{i,j} \eta_{ij} (x_i - x_j)(x_i - x_j)^T$. The original problem defined in Section 2.2 can be transformed into SDP form as below:

$$\min_X \langle C, X\rangle \quad \text{s.t.} \quad \langle B_{ijl}, X\rangle \ge 1 \text{ for all } ijl; \quad X \succeq 0 \tag{7}$$

where

$$X = \begin{pmatrix} A & \\ & \epsilon \end{pmatrix}, \qquad C = \begin{pmatrix} C_0 & \\ & c\,Y_N \end{pmatrix}$$

Here $\epsilon$ is the diagonal matrix of all the slack variables $\epsilon_{ijl}$ and $Y_N = \text{diag}([\eta_{ij}(1 - y_{il}), \dots])$. The index order should be consistent with the $ijl$ index order in $\epsilon$. The constraint matrices are

$$B_{ijl} = \begin{pmatrix} (x_i - x_l)(x_i - x_l)^T - (x_i - x_j)(x_i - x_j)^T & \\ & E_{ijl} \end{pmatrix}$$

Here $E_{ijl}$ is a diagonal matrix in which only the diagonal element corresponding to $ijl$ equals 1 and all others equal 0. The problem can easily be transformed into standard form by adding slack variables to the inequality constraints.

4 Large-scale problem

4.1 Algorithms for large-scale problems and the DML-eig method

Most semidefinite programming solvers are based on interior point methods, and computing the Hessian becomes very hard on larger problems. A series of algorithms have been proposed to address this. A lot of effort has focused on exploiting structural properties of the problem, and the proper algorithm depends on the type of problem. A more general approach is first-order methods, which seek to significantly reduce the per-iteration complexity of the optimization algorithm rather than the total computational cost. Another recent trend is to use subsampling to reduce the computational cost of each iteration.

Recently Ying [5] proposed a method derived from a structure-based method, the Frank-Wolfe algorithm. They modify the algorithm with a smoothing technique, which allows gradient search on the approximated function instead of a subgradient method on the initial problem. Here we briefly review their method.

Ying proved the following theorem. Assume that $X_S$ is invertible and, for any $\tau \in D$, let $\tilde{X}_\tau = X_S^{-1/2} X_\tau X_S^{-1/2}$. Then problem (4) is equivalent to the following problem:

$$\max_{S \in P} \min_{u \in \Delta} \sum_{\tau \in D} u_\tau \langle \tilde{X}_\tau, S\rangle$$

where $\Delta = \{u \in R^D : u_\tau \ge 0, \sum_{\tau \in D} u_\tau = 1\}$ and $P = \{M \in S_+^d : Tr(M) = 1\}$.

Ying further proposes an efficient algorithm for DML-eig, a new first-order method that combines the Frank-Wolfe algorithm with a smoothing technique. Let

$$f_\mu(S) = \min_{u \in \Delta} \sum_{\tau \in D} u_\tau \langle \tilde{X}_\tau, S\rangle + \mu \sum_{\tau \in D} u_\tau \ln u_\tau$$

where $\mu > 0$ is the smoothing parameter and the second term is the entropy smoothing term used in [5].

Algorithm: Approximate Frank-Wolfe Algorithm for DML-eig

Parameters: smoothing parameter $\mu > 0$, tolerance value tol, step sizes $\alpha_t \in (0, 1)$, $t \in N$
Initialization: set $S_1^\mu \in S_+^d$ with $Tr(S_1^\mu) = 1$
for t = 1, 2, 3, ...
    $Z_t^\mu = \arg\max\{f_\mu(S_t^\mu) + \langle Z, \nabla f_\mu(S_t^\mu)\rangle : Z \in S_+^d,\; Tr(Z) = 1\}$, that is, $Z_t^\mu = v v^T$ where $v$ is the leading eigenvector of $\nabla f_\mu(S_t^\mu)$
    $S_{t+1}^\mu = (1 - \alpha_t) S_t^\mu + \alpha_t Z_t^\mu$
    if $|f_\mu(S_{t+1}^\mu) - f_\mu(S_t^\mu)| < tol$ then break

The step sizes need to satisfy $\sum_{t \in N} \alpha_t = \infty$ and $\lim_{t \to \infty} \alpha_t = 0$.

4.2 Acceleration Strategies

We observe that the dense part of the matrix is the Gram matrix of the samples, so the complexity of the problem depends on the feature dimension. The DML-eig algorithm has reduced the computational cost to $O(d^2)$, because it computes only the leading eigenvector instead of a full decomposition of the matrix. The constraints all come from the in-equivalence constraints, and the computational cost is also proportional to the number of in-equivalence constraints. However, only a few of them should be active under our formulation. Therefore, we can use the Euclidean distance to prefilter the in-equivalence constraints before applying the optimization algorithms, keeping fewer of them so that the optimization is accelerated.

Another idea for accelerating the DML-eig algorithm is to use a better initialization that is cheap to compute. Relevant Component Analysis (RCA) [16] is a metric learning method that considers only the equivalence constraints and has low computational cost. So we explored using the result of RCA as the initialization.

5 Experiments

5.1 Dataset and Evaluation Criteria

We experiment with 3 UCI datasets: iris, wine and protein. The number of classes and the feature dimension of each dataset are:

iris: classes 3, d=4
wine: classes 3, d=12
protein: classes 6, d=20

We follow the criterion used by Xing to evaluate the quality of the learned metrics in a clustering application (we use K-means with the learned metric here). Let $c_i$ be the true cluster label and $\hat{c}_i$ the label assigned by an automatic clustering algorithm. Then

$$\text{Accuracy} = \frac{\sum_{i>j} 1\{1\{c_i = c_j\} = 1\{\hat{c}_i = \hat{c}_j\}\}}{0.5\, m (m - 1)}$$

where $1\{\cdot\}$ is the indicator function and $m$ is the number of points. All the experiment code has been released.

5.2 Comparison of different optimization technologies

Here we compare the performance of different optimization algorithms for learning the metric. Xing's method is run from his released implementation [10]. We implemented SDP using the open source SeDuMi solver [9]. We implemented the DML-eig algorithm in MATLAB. The baseline uses the Euclidean distance. From the results, we can see that both SDP and DML-eig achieve better performance than Xing's optimization algorithm.
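The pairwise accuracy criterion from Section 5.1 can be sketched as follows. This is a minimal NumPy illustration, not the released experiment code; the function name `pairwise_accuracy` is an assumption made here. It counts, over all pairs of points, how often the predicted clustering agrees with the true one about whether the two points belong together.

```python
import numpy as np

def pairwise_accuracy(true_labels, pred_labels):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    c = np.asarray(true_labels)
    c_hat = np.asarray(pred_labels)
    m = c.shape[0]
    same_true = c[:, None] == c[None, :]          # 1{c_i = c_j}
    same_pred = c_hat[:, None] == c_hat[None, :]  # 1{c_hat_i = c_hat_j}
    agree = same_true == same_pred
    upper = np.triu_indices(m, k=1)               # each unordered pair counted once
    return agree[upper].sum() / (0.5 * m * (m - 1))

# Relabeling the clusters does not change the score.
acc_perfect = pairwise_accuracy([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every true cluster scores low.
acc_poor = pairwise_accuracy([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the measure compares pair memberships rather than label values, it is invariant to how the clustering algorithm happens to name its clusters.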


Figure 1: Accuracy (ratio=0.9) for different optimization technologies on Iris, Wine and Protein. Methods compared: Baseline (Euclidean), Newton, Iterative projected gradient, SDP, DEig (DML-eig).

To better visualize the result, we also show the distance matrix on the protein data. The data actually contains 6 clusters. This is not clear in the Euclidean distance matrix, but a clearer pattern appears with the learned metric.

Figure 2: Distance matrix over different distance functions (panels: Euclidean, Newton, IPG, DML-eig).

5.3 Results for acceleration strategies

1. We explored using RCA as the initialization for the DML-eig algorithm. Unfortunately, this did not reduce the number of iterations.

2. We explored filtering the negative constraints by selecting those with smaller Euclidean distance. The result is shown in Figure 3. Both the semidefinite programming algorithm and the DML-eig algorithm converge in fewer iterations, while the performance is only slightly affected.
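The constraint filtering described in item 2 can be sketched as below. This is an illustrative NumPy version under assumptions: the function name `prefilter_dissimilar_pairs` and the `keep_ratio` parameter are hypothetical, and the paper does not state how many constraints were kept. The sketch simply retains the dissimilar pairs that are closest in Euclidean distance, since those are the ones most likely to be active.

```python
import numpy as np

def prefilter_dissimilar_pairs(X, pairs, keep_ratio=0.5):
    """Keep the fraction of dissimilar pairs with the smallest Euclidean distance."""
    X = np.asarray(X, dtype=float)
    pairs = np.asarray(pairs, dtype=int)
    # Euclidean distance of every candidate pair (x_i, x_j) in D.
    distances = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
    n_keep = max(1, int(np.ceil(keep_ratio * len(pairs))))
    order = np.argsort(distances, kind="stable")  # closest pairs first
    return pairs[order[:n_keep]]

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
D = np.array([[0, 3], [0, 1], [1, 2], [0, 2]])  # pair distances: 10, 1, 4, 5
kept = prefilter_dissimilar_pairs(X, D, keep_ratio=0.5)
```

The filtered set can then be passed to either solver in place of the full set of in-equivalence constraints, which shrinks the number of constraint matrices the optimizer has to handle.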

Figure 3: Iteration number using sampling strategy. Left panel (SDP): iteration number of SDP (tolerance = 1e-6) on iris, wine and protein, original vs. sampled constraints. Right panel (DML-eig): iteration number of DML-eig on wine data against tolerance, original vs. sampled constraints.

References

[1] Xing, Eric P., et al. Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems. 2002.
[2] Kilian Q. Weinberger, Lawrence K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10, 2009.
[3] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, pages 209-216, 2007.
[4] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. In Advances in Neural Information Processing Systems 17, 2004.
[5] Yiming Ying, Peng Li. Distance Metric Learning with Eigenvalue Optimization. Journal of Machine Learning Research, 2012.
[6] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively with application to face verification. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[7] S. C. H. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma. Learning distance metrics with contextual constraints for image retrieval. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.
[8] T. Kato and N. Nagano. Metric learning for enzyme active-site search. Bioinformatics, 26, 2010.
[9] Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12. Special issue on Interior Point Methods.
[10] epxing/papers/old_papers/code_metric_online.tar.gz
[11] Alexandre d'Aspremont (2012). Tutorial: Algorithms for Large-Scale Semidefinite Programming.
[12] Olivier Devolder, François Glineur, Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Math. Program., Ser. A. DOI 10.1007/s
[13] d'Aspremont, A., Banerjee, O., and El Ghaoui, L. (2008). First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and its Applications, 30(1):56-66.
[14] Steven J. Benson, Yinyu Ye, and Xiong Zhang. Solving large-scale sparse semidefinite programs for combinatorial optimization. SIAM J. Optim., 10(2) (19 pages).
[15] Alexandre d'Aspremont (2011). Subsampling algorithms for Semidefinite Programming. Technical report, arXiv:0803.1990v6.
[16] Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall. Learning a Mahalanobis Metric from Equivalence Constraints. Journal of Machine Learning Research 6 (2005).
[17] Lieven Vandenberghe, Stephen P. Boyd. Semidefinite Programming. SIAM Review, 38(1):49-95, March

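Appendix

Newton method by Xing

Xing proved that the original optimization problem is equivalent to minimizing the function

$$g(A) = \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 - C \log\Big(\sum_{(x_i,x_j)\in D} \|x_i - x_j\|_A\Big)$$

When $A$ is a diagonal matrix, Xing points out that the problem can be solved cheaply by Newton's method. We derive the gradient and the Hessian matrix in order to assess the computational complexity. We define $a = [A_{11}, A_{22}, \dots, A_{nn}]^T$ and write $(x_i - x_j)^{\circ 2}$ for the vector of elementwise squared differences. Then

$$dist(x_i, x_j) = \sqrt{(x_i - x_j)^T \text{diag}(a)\, (x_i - x_j)}$$

$$\nabla g = \frac{\partial g(A_{11}, A_{22}, \dots, A_{nn})}{\partial a}, \qquad H = \frac{\partial^2 g(A_{11}, A_{22}, \dots, A_{nn})}{\partial a^2}$$

With the intermediate quantities

$$distderive1(x_i, x_j) = \frac{0.5\, (x_i - x_j)^{\circ 2}}{dist(x_i, x_j)}, \qquad distderive2(x_i, x_j) = \frac{-0.25\, (x_i - x_j)^{\circ 2} \big((x_i - x_j)^{\circ 2}\big)^T}{dist(x_i, x_j)^3}$$

$$sumddist = \sum_{(x_i,x_j)\in D} dist(x_i, x_j), \quad sumdderive1 = \sum_{(x_i,x_j)\in D} distderive1(x_i, x_j), \quad sumdderive2 = \sum_{(x_i,x_j)\in D} distderive2(x_i, x_j)$$

the gradient and Hessian are

$$\nabla g = \sum_{(x_i,x_j)\in S} (x_i - x_j)^{\circ 2} - C\, \frac{sumdderive1}{sumddist}$$

$$H = -C \left[ \frac{sumdderive2}{sumddist} - \frac{sumdderive1\; sumdderive1^T}{sumddist^2} \right]$$

The update step is

$$a = a - t\, H^{-1} \nabla g, \quad t \text{ is the step size} \tag{8}$$

There are $n$ parameters in $a$ and they are separable, so the computational cost is around $O(n)$. But when $A$ is a full matrix, the $n^2$ parameters require $O(n^6)$ time to invert the Hessian matrix, which is too expensive to be acceptable.

Projected gradient search by Xing

For a full matrix $A$, Xing proposed an iterative projected gradient search to solve the optimization problem efficiently. The problem is posed in the equivalent form below:

$$\max_A \; g(A) = \sum_{(x_i,x_j)\in D} \|x_i - x_j\|_A \quad \text{s.t.} \quad f(A) = \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 \le 1,\; A \succeq 0 \tag{9}$$

The algorithm takes a gradient step on $g(A)$ and then repeatedly projects onto the sets $C_1 = \{A : \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 \le 1\}$ and $C_2 = \{A : A \succeq 0\}$. The algorithm is shown below:

Iterate
    Iterate
        $A := P_{C_1}(A)$
        $A := P_{C_2}(A)$
    until $A$ converges
    $A := A + t\,(\nabla_A g(A))_{\perp \nabla_A f(A)}$
until convergence

The projection of a matrix $A_0$ onto the set $C_1$ is the solution of the optimization problem:

$$\min_A \|A - A_0\|_F^2 \quad \text{s.t.} \quad \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 \le 1 \tag{10}$$

We consider the dual problem. We define the matrix $X_s = \sum_{(x_i,x_j)\in S} (x_i - x_j)(x_i - x_j)^T$ and denote the inner product of matrices by $\langle A, B\rangle = Tr(A^T B)$, so that $\langle X_s, A\rangle = \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2$.

The Lagrangian function is $L(A, u) = \|A - A_0\|_F^2 + u(\langle X_s, A\rangle - 1)$. Setting its gradient with respect to $A$ to zero gives

$$\nabla_A L(A, u) = 2(A - A_0) + u X_s = 0 \;\Rightarrow\; A = -0.5\, u X_s + A_0 \tag{11}$$

$$g(u) = \min_A L(A, u) = \|0.5\, u X_s\|_F^2 + u(\langle X_s, A_0 - 0.5\, u X_s\rangle - 1) = -0.25\, u^2\, Tr(X_s^T X_s) + u\,(Tr(X_s^T A_0) - 1) \tag{12}$$

The dual problem is $\max_u g(u)$, $u \ge 0$. Setting $\nabla g(u) = 0$, we get:

$$u = \frac{2\,(Tr(X_s^T A_0) - 1)}{Tr(X_s^T X_s)} \tag{13}$$

If $A_0$ already satisfies the constraint, then $u = 0$ and $A = A_0$. Otherwise, by the KKT conditions together with (11) and (13), we get:

$$A = \frac{1 - \langle X_s, A_0\rangle}{\|X_s\|_F^2}\, X_s + A_0$$

The projection onto the set $C_2$ is:

$$A = U^T \Sigma U, \qquad \Sigma_+ = \max(0, \Sigma), \qquad A' = U^T \Sigma_+ U$$

Summary

From the derivation of the projection steps, we can see that the projection onto the set $C_1$ has an analytical solution. $X_s$ can be pre-computed and stored. The only cost is a matrix multiplication, so this projection step is cheap. The main cost of the projection onto $C_2$ is the matrix decomposition, which usually has $O(n^3)$ time complexity.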
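The two projections summarized above can be sketched numerically as follows. This is a minimal NumPy illustration, not Xing's released implementation; the function names and the stopping rule of the alternating loop are assumptions made here. Note that `numpy.linalg.eigh` returns the decomposition as $V\,\text{diag}(w)\,V^T$, which corresponds to $U = V^T$ in the notation of the appendix.

```python
import numpy as np

def constraint_matrix(X, pairs):
    """X_s = sum over pairs of (x_i - x_j)(x_i - x_j)^T; can be precomputed once."""
    diffs = X[pairs[:, 0]] - X[pairs[:, 1]]
    return diffs.T @ diffs

def project_c1(A0, X_s):
    """Closed-form projection onto C1 = {A : <X_s, A> <= 1}."""
    violation = np.sum(X_s * A0) - 1.0       # <X_s, A0> - 1
    if violation <= 0.0:
        return A0.copy()                     # already feasible, multiplier u = 0
    return A0 - (violation / np.sum(X_s * X_s)) * X_s

def project_c2(A):
    """Projection onto the PSD cone by clipping negative eigenvalues."""
    A_sym = 0.5 * (A + A.T)                  # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(A_sym)
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T

def alternating_projection(A, X_s, tol=1e-8, max_iter=1000):
    """Alternate the two projections until the iterate stops changing."""
    for _ in range(max_iter):
        A_next = project_c2(project_c1(A, X_s))
        if np.linalg.norm(A_next - A) <= tol:
            return A_next
        A = A_next
    return A

X = np.array([[0.0, 0.0], [2.0, 0.0]])
S = np.array([[0, 1]])                       # one similar pair
X_s = constraint_matrix(X, S)                # [[4, 0], [0, 0]]
A_c1 = project_c1(np.eye(2), X_s)
A_c2 = project_c2(np.diag([1.0, -2.0]))
A_feasible = alternating_projection(np.eye(2), X_s)
```

In this toy case the identity matrix violates the similarity constraint, so the $C_1$ projection shrinks only the direction spanned by the similar pair and leaves the orthogonal direction untouched.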
