THE UNIVERSITY OF CHICAGO JOINTLY LEARNING MULTIPLE SIMILARITY METRICS FROM TRIPLET CONSTRAINTS A DISSERTATION SUBMITTED TO

Size: px

Start display at page:

Download "THE UNIVERSITY OF CHICAGO JOINTLY LEARNING MULTIPLE SIMILARITY METRICS FROM TRIPLET CONSTRAINTS A DISSERTATION SUBMITTED TO"

Winifred Francis
5 years ago
Views:

1 THE UNIVERSITY OF CHICAGO JOINTLY LEARNING MULTIPLE SIMILARITY METRICS FROM TRIPLET CONSTRAINTS A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN CANDIDACY FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT OF COMPUTER SCIENCE BY LIWEN ZHANG CHICAGO, ILLINOIS WINTER, 2015

3 ABSTRACT The measure of similarity plays a crucial role in many applications such as content-based recommendation, image search and speech recognition. Similarity between objects is multifaceted and it is easier to judge similarity when the focus is on a specific aspect. We consider the problem of mapping objects into view specific embeddings where the distance between them is consistent with the similarity comparisons of the form from the t-th perspective, object A is more similar to B than to C. We propose a framework to jointly learn multiple views by deploying different similarity metrics in a unified embedding space. Our approach is a natural extension of the view-wise independent approach and is capable of exploiting the correlation between the views if it exists. Experiments on a number of datasets, including a large dataset of multi-view crowdsourced comparison on bird images, showed the proposed method achieved lower triplet generalization error and better grouping of classes in most cases, when compared to learning embeddings independently for each view or learning a single embedding from triplets collected on all views. More specifically, on datasets where correlation between views is strong, the proposed method is able to achieve significant improvement, while on datasets with limited view correlation, it still performs no worse than its independent learning counterpart. iii

4 ACKNOWLEDGEMENTS This thesis is based on a joint work with Ryota Tomioka, my thesis advisor, and Subhransu Maji, who is now a faculty member in UMass Amherst. I am grateful to them for their collaboration and help on many technical details. I also wish to express my sincere thanks to Ryota for his guidance throughout the completion of this thesis. Thanks go to Min Xu at the Statistics Department for pointing out the idea of using Kendall s τ in the analysis. iv

5 TABLE OF CONTENTS ABSTRACT ACKNOWLEDGEMENTS iii iv 1 INTRODUCTION 1 2 LEARNING SIMILARITY FROM RELATIVE COMPARISON Learning with Relative Similarity Measure Triplet Embedding JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES Formulation Relation with Learning Multiple Independent Similarities Influence of Parameters Alternative Formulations and Computational Complexity EXPERIMENTS Experimental Setup Datasets Experimental Results Learning a New View (Zero-Shot Learning) Discussion Influence of Dimension Correlation Between Views RELATED WORK Multitask Metric Learning with Low-Rank Tensors Relation between Triplet Embedding and Metric Learning CONCLUSION AND FUTURE WORK 38 A VIEWS OF POSES OF AIRPLANES DATASET 40 B ATTRIBUTES OF PUBLIC FIGURES FACE DATASET 42 C CORRELATION BETWEEN SIMILARITY METRICS ON MULTIPLE DATASETS 43 REFERENCES 46 v

6 LIST OF FIGURES 1.1 Illustration of ambiguity in similarity View specific similarities between poses of planes View specific similarities between birds Experimental results on synthetic data with clusters Experimental results on uniformly distributed synthetic data The global view of embeddings of poses of planes Experimental results on poses of planes dataset Illustration of public figures face data embedded under the metric of the first view Experimental results on public figure faces dataset Illustration of CUB-200 birds data Experimental results on CUB-200 birds dataset Learning a new view on CUB-200 birds dataset Relation between triplet consistency and correlation of pairwise-distances Performance gain and correlation of views A.1 Landmarks illustrated on the several planes vi

7 LIST OF TABLES 4.1 Measure of similarity correlation and performance gain of using joint learning. Entries marked with are values estimated from independent embeddings B.1 List of Pubfig attributes that were used in this work C.1 Consistency of triplet constraints between different views in synthetic data (clustered) C.2 Consistency of triplet constraints between different views in synthetic data (uniformly distributed) C.3 Consistency of triplet constraints between different views in poses of planes dataset estimated from independent embedding C.4 Consistency of triplet constraints between different views in public figures dataset. 44 C.5 Consistency of triplet constraints between different views in CUB-200 dataset estimated from independent embedding C.6 Distance correlation between different views in synthetic data (clustered) C.7 Distance correlation between different views in synthetic data (uniformly distributed). 44 C.8 Distance correlation between different views in poses of planes dataset estimated from independent embedding C.9 Distance correlation between different views in public figures dataset C.10 Distance correlation between different views in CUB-200 dataset estimated from independent embedding vii

8 CHAPTER 1 INTRODUCTION Measure of similarity plays an important role in applications such as content-based recommendation, image search and speech recognition, and various techniques to learn such a measure from data have been proposed (Jain et al., 2008; Weinberger and Saul, 2009; McFee and Lanckriet, 2011). While the concept of similarity between objects can be abstract, it is typically captured by embedding objects into a vector space equipped with a distance metric that conforms to the similarity observation. Similarity comparison of the form object A is more similar to B than to C is a commonly used form of supervision for learning such embeddings (Agarwal et al., 2007; Tamuz et al., 2011; van der Maaten and Weinberger, 2012). Although judging similarity comparisons is easier than rating similarity on an absolute scale (Kendall and Gibbons, 1990), sometimes there is ambiguity on how similarity is measured. Consider the problem of comparing three birds (see Fig. 1.1), the answer to question: is bird A more similar to B or to C could depend on what aspect of the bird one is interested in. Most annotators will say that the head of bird A is more similar to the head of B while the back of A is more similar to C. With the presence of multiple notions of similarity, it is natural to treat observation from different perspectives separately. To address this issue, we study the setting where the similarity observation takes the form from the t-th perspective, A is more similar to B than to C. These view specific similarities can be seen as an extension of attributes that can be collected without crisply defining attribute vocabularies, and they can encode continuous and multidimensional structures, such as, color and pose. Similarity in such cases can be expressed as a view-specific low dimensional embedding, e.g., color as a (red, green, blue) vector. In addition to making the annotation task simpler, multiple perceptual similarities can also enable precise feedback for human in the loop tasks, such as, content-based image retrieval system (Cox et al., 2000) and interactive fine-grained recognition (Wah et al., 2014a). The main drawback of learning view specific embeddings independently is that one needs similarity comparisons for each view. Consider the problem of simply learning a single embedding of N objects which may require O(N 3 ) triplet comparisons for each view. This can be expensive to obtain especially when experts are involved. 1

9 2 A B C head back Is A more similar to B than C? Figure 1.1: Ambiguity in similarity. A is more similar to C than B when focussing on the back (middle row), but is more similar to B than C when focussing on the head (bottom row).

10 3 We propose a framework to learn embeddings jointly that addresses this drawback. The key intuition is that while different notions of similarity may produce contradicting triplet relations between some instances, they could be related for most of them. For instance, photos of a car taken from different angles are correlated, as they can be considered as projections of a 3D model onto different planes. In a symphony, musics played by individual instruments in the orchestra could be related, as they may share some of the melody or rhythm. Our method exploits such correlation between views by constructing a unified embedding space where different notions of similarity are treated as its subspaces. We perform experiments on a synthetic dataset and three real datasets from different domains, namely, poses of airplanes, features from face images (PubFig dataset, Kumar et al., 2009), and crowd-sourced similarities collected on different body parts of birds (CUB dataset, Welinder et al., 2010). For a given amount of training data per view, the proposed joint learning approach obtains lower triplet generalization error compared to the naive independent learning approach on most datasets. The proposed joint learning approach also tends to obtain better cluster structures at the category level, which are measured through the leave-one-out classification error. On dataset where different similarity metrics are highly related, the joint embeddings are significantly better. The rest of this thesis is structured as follows: in Chapter 2, we review the literature of learning perceptual similarity from relative comparisons. We formulate the problem of multiple-metric triplet embedding as an extension to learning a single similarity metric in Chapter 3. Our algorithm to solve it is presented in the same chapter. Experiments on both synthetic and real datasets are presented in Chapter 4. Chapter 5 lists some other work related to our problem. Discussion and future directions are presented in Chapter 6.

11 CHAPTER 2 LEARNING SIMILARITY FROM RELATIVE COMPARISON While the concept of similarity between objects can be abstract, it is typically captured by embedding objects into a vector space equipped with a distance metric that conforms to the similarity observation. Suppose there are N objects represented as D dimensional vectors {x 1, x 2,..., x N } R D and we would like to learn a distance mapping d : R D R D R + 0 which satisfies the properties: 1. d(x i, x i ) 0 (non-negativity). 2. d(x i, x j ) = d(x j, x i ) (symmetry). 3. d(x i, x j ) + d(x j, x k ) d(x i, x k ) (triangular inequality). Notice that d here isn t necessarily a metric because it doesn t have to satisfy the following property d(x i, x j ) = 0 if and only if x i = x j (distinguishability). If a distance mapping satisfies the condition (1), (2) and (3) above, but not distinguishability, it is called a pseudometric. Throughout this work, we consider the learning of pseudometrics. A family of metrics that is commonly used in the literature is Mahalanobis metrics which defines distance metric as d(x i, x j ) = (x i x j ) M(x i x j ) (2.1) where M is a D D positive semi-definite matrix. In a learning problem, when x i s are given, we may learn a positive definite matrix M. It is common to think of it as learning a linear transformation L and let d(x i, x j ) = (x i x j ) L L(x i x j ), since L L is guaranteed to be positive semidefinite. When x i s are not given, we may take x i = e i R N and learn a rank D positive semi-definite matrix M R N N. Because M can be decomposed as M = L L, for some L R D N and Lx i is simply the i-th column of L in this case, learning a rank D metric M and fixing x i s is equivalent to taking M to be the identy matrix I D and learning the embedding x i s in R D directly. 4

12 5 2.1 Learning with Relative Similarity Measure Similarity knowledge can be described in different forms. In this work, we focus on problems where the similarity observation takes the form of similarity triplet: A is more similar to B than to C, which is a special case of similarity quadruples A is more similar to B than C is to D. In this section, we review different approaches that people have taken to learn similarity with this kind of pairwise comparison. A formal description of learning from similarity triplets will be presented in the next section. Different techniques have been studied for embedding data points based on similarity triplets. Agarwal et al. (2007) seeks to find an embedding where inter-point Euclidean distances have the same ordering as a given set of dissimilarities. Tamuz et al. (2011) proposed a method to learn an embedding from crowd sourced data alone where the training set is consist of triplet relations in the form object A is more similar to B than to C. van der Maaten and Weinberger (2012) proposed a technique called t-distributed Stochastic Embedding (t-ste) which fulfill a similar task but using Student-t kernel to model the similarity so that the resulting embedding is more compact and clustered. In the work of McFee and Lanckriet (2009), the authors consider the problem of learning an embedding subject to a set of constraints in the form objects i and j are more similar than k and l. They encode such ordinal constraints on a graph and exploit the global structure of the graph to get an efficient embedding algorithm. McFee and Lanckriet (2011) formulated a novel multiple kernel learning technique for integrating heterogeneous data into a single unified similarity space. They also cope with similarity constraints in the form of pair-wise comparison and use them as side information. Pairwise similarity measurement can sometimes be inferred from other knowledge. For example, supervision of class labels can be treated as similarity information, as it is natural to consider that two objects belong to the same class can be considered more similar than two objects from different classes. For instance, Cui et al. (2013) developed a pairwise-constrained multiple metric learning algorithm for face recognition where similarity constraints are derived from class labels. Parameswaran and Weinberger (2010) consider the case of supervised learning and train multiple k-nn classifiers jointly by learning multiple metrics in the feature spaces. They use triplet similarity constraints inferred from supervision information and showed that the jointly learned metrics achieved higher classification accuracies than their single metric counterpart.

13 6 2.2 Triplet Embedding In this section, we assume that the embeddings are learned purely from similarity triplets. More specifically, given a set of triplets S = {(i, j, k) z i is more similar to z j than z k }, we aim to find an embedding {x 1, x 2,..., x N } R D, for some D N, such that the pair-wise comparison of Euclidean distances agrees with S, i.e., (i, j, k) S = x i x j 2 2 < x i x k 2 2. This is the setting studied by van der Maaten and Weinberger (2012) and Tamuz et al. (2011). It is also a special case of the paired comparison setting proposed in Agarwal et al. (2007), which originates from the work of Shepard (1962a,b); Kruskal (1964a,b). We give a brief review of these methods here. Generalized Non-Metric Multidimensional Scaling (GNMDS) GNMDS (Agarwal et al., 2007) seeks a low-rank kernel matrix K = X X such that distances constraints are satisfied with a large margin. The trace-norm of K is minimized in order to approximately minimize its rank. Although GNMDS was raised for solving problems where similarity constraints are posed as i and j are more alike than k and l, it could be easily adapted for triplets constraints, as illustrated in van der Maaten and Weinberger (2012), and solve a problem of the form: min K trace(k) + C (i,j,l) S ξ ijl, s.t. k jj 2k ij k ll + 2k il 1 + ξ ijl, ξ ijl 0, K 0. where C is a regularization parameter. After the optimal K is learned, the embedding X is then obtained via an SVD of K. Here, ξ ijl s are slack variables. Notice that this optimization problem is equivalent to using a hinge-loss to penalize the violation of similarity constraints: min K trace(k) + C s.t. K 0. (i,j,l) S max {0, 1 + k jj 2k ij k ll + 2k il },

14 7 Crowd Kernel Learning (CKL) Tamuz et al. (2011) proposed the Crowd Kernel Learning method to deal with training triplets (i, j, l) which are collected from annotators without domain expertise. They consider receiving the feedback, i is more similar to j than to l, from a random crowd member as an random event which occurs with certain probability. In the learning stage, the goal is to minimize the empirical log-loss log(1/p ijl ) where a higher probability p ijl means the triplet (i, j, l) is modelled better. They take a scale-invariant model and define the probability p ijl as p ijl = x i x l µ x i x j x i x l µ = k ii + k ll 2k il + µ (k ii + k jj 2k ij ) + (k ii + k ll 2k il ) + 2µ where µ is added to avoid numerical instability, k ij s are entries of the kernel matrix K. Their optimization problem can be written as: min K 1 S (i,j,l) S log p ijl, s.t. k ii = 1, i = 1, 2,..., N, K 0. Stochastic Triplet Embedding Based on the idea of CKL, van der Maaten and Weinberger (2012) developed the method of Stochastic Triplet Embedding. They proposed two ways to define the probability p ijk : 1. STE p ijl = exp( x i x j 2 2) exp( x i x j 2 2) + exp( x i x l 2 2) 2. t-ste p ijl = ( ( 1 + x i x j 2 2 α 1 + x i x j 2 2 α ( ) α ) α x i x k 2 2 α ) α+1 2 They claimed that, during the iterations for numerically solving the optimization problem, using STE and t-ste tends to penalize/reward triplets that are violated/satisfied more locally and the influence of a triplet becomes very small when the triplet constraint is strongly violated. Thus,

15 8 it is much better at revealing the underlying data structure and their experiments showed that STE and t-ste produce more clustered embeddings. See (van der Maaten and Weinberger, 2012) for details. From the methods reviewed above, we observe that a commonly employed strategy is to define a loss function l that measures how well the embedding models a triplet (i, j, k) and solve a minimization problem of the form min C l( x i x j 2, x i x k 2 ) + X 2 X R N D F, (2.2) (i,j,k) S where F is the Frobenius norm and C > 0 is a regularization parameter. Since the squared Euclidean distance x i x j 2 can be expressed as x i x j 2 = k ii 2k ij + k jj using a positive semidefinite Gram matrix K 0, minimization problem (2.2) can be equivalently rewritten as follows: min C l(d K (i, j), d K (i, k)) + tr(k), (2.3) K R N N (i,j,k) S s.t. K 0, where d K (i, j) = k ii 2k ij + k jj. Penalizing the trace of the Gram matrix can be seen as a convex surrogate for penalizing the rank of the Gram matrix (Agarwal et al. (2007); Fazel et al. (2001)). After the optimal K is learned, we recover the embedding X = [x 1,..., x N ] by the decomposition K = X X. Since optimization problems (2.2) and (2.3) are equivalent up to rotation, we will interchangingly say learning the embedding X and learning the metric K. However, optimization problem (2.3) is convex if the loss function l is convex, while this is not true for (2.2).

16 CHAPTER 3 JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES While previous studies deal with the situation where there is one underlying similarity measure, we would like to consider the case that objects are compared from a few different aspects, i.e., there exists multiple measures of similarity. The most related work would be from McFee and Lanckriet (2011), Parameswaran and Weinberger (2010) and Cui et al. (2013). However, the problem we will address has a more general setting and is different from work mentioned previous in two folds: 1) Existing work that deals with similarity triplets learns a single similarity metric. Yet, we consider the case where there exists multiple similarity measures and we aim to learn embeddings under those measure simultaneously. 2) algorithms given by Parameswaran and Weinberger (2010) and Cui et al. (2013) requires supervision and their target is to minimize classification error. But we consider an embedding task that is purely based on constraints in the form of similarity triplets. Finally, our approach is complementary to methods that seek to minimize user effort by actively collecting triplet comparisons (Jamieson and Nowak, 2011; Tamuz et al., 2011) and better interfaces for triplet collection (Wilber et al., 2014). In this chapter, we will first give a formal formulation for the problem of jointly learning multiple similarities and introduce our approach to solve this problem. We will discuss issues such as influence of parameters and computational complexity later. 3.1 Formulation Assume that T sets of triplets S 1,..., S T are obtained by asking labelers to focus on a specific aspect when making pair-wise comparisons. In addition, a set of triplets S 0 corresponding to global (unspecific) notion of similarity may be available. Our goal is to obtain an embedding for the global notion of similarity as well as embeddings corresponding to local views. To this end, we take a hybrid approach that combines (2.2) and (2.3) and use explicit parameterization (2.2) for the global ebmedding of each object and Gram-matrix-based parametrization (2.3) for modeling each view. 9

17 Let x 1,..., x N R D be vectors corresponding to the global embedding of the objects as above. Each local view is associated with a Gram matrix M t and the underlying metric is defined as a (squared) Mahalanobis distance d M t (i, j) := (x i x j ) M t (x i x j ). We formulate the learning problem as follows: min x 1,...,x N, M 1,...,M T l ( x i x j 2, x i x k 2) (i,j,k) S T l (d M t (x i, x j ), d M t (x i, x k )) t=1 (i,j,k) S t 10 T γ tr(m t ) + β X 2 F, (3.1) t=1 s.t. M t 0 (t = 1,..., T ). where l is a loss function that measure how well the embedding models a triplet. We use the hinge loss in the experiments in our work, but the proposed framework readily generalizes to other loss functions proposed in literature such as Tamuz et al. (2011); van der Maaten and Weinberger (2012). We employ regularization terms for both the local metric M t and the global embedding vectors x i in (3.1). Constraints on tr(m t ) are added in the optimization objective, as the trace norm is known to be effective in producing low-rank solutions (Agarwal et al., 2007; Fazel et al., 2001). The norm of x i is necessary because, without this term, we may reduce the trace of M t by scaling down M t while scaling up x i s simultaneously, which significantly weakens the effect of the trace term. Although objective function (3.1) is non-convex, if we fix the value of x i s and choose a convex loss function, e.g.hinge-loss, then it becomes a convex problem with respect to M t s and M t s can be learned independently since they appear in disjoint terms. To minimize the objective, we update M t s and x i s alternately. When M t s are fixed, x i s are updated directly via gradient descent (Algorithm 1). When x i s are fixed, M t s can be updated independently via projected gradient descent, i.e., by iteratively taking a sub-gradient step and projecting the resulting M t onto the positive semi-definite cone (Algorithm 2). The number of dimension D is left as a hyper parameter. The algorithm is summarized in Algorithm 3.

18 11 Algorithm 1: Update x i s Input: x i, i = 1, 2,..., N; M t, t = 1, 2,..., T ; β. Output: x i, i = 1, 2,..., N. initialization: choose a proper initial step size η 0 ; set M 0 := I D ; set η η 0 ; set iteration counter m = 1; while not converged do For n = 1,..., N, let Update g n =2βx n + T t=0 (i,j,k) S t: n (i,j,k) xn l(d M t (x i, x j ), d M t (x i, x k )) x n x n ηg n (n = 1,..., N); Update counter: m m + 1 ; Update step size: η η 0 / m ; end Algorithm 2: Update M t Input: x i, i = 1, 2,..., N; M t ; γ. Output: M t initialization: choose a proper initial step size η 0 ; set η η 0 ; set iteration counter m = 1; while not converged do Take a step in the direction of negative sub-gradient: G M t l(d M t (x i, x j ), d M t (x i, x k )) + γi D, (i,j,k) S t M t M t ηg; Find the eigenvalue decomposition M t = V ΛV ; Project M t to PSD cone: Update counter: m m + 1; Update step size: η η 0 / m; end M t V max(λ, 0)V ;

19 12 Algorithm 3: Multiple-metric Learning Input: N, the number of objects; D, dimension of embedding; S t, t = 0, 1, 2,..., T, triplet constraints. Regularization parameters β, γ. Output: Embedding {x i } N i=1, PSD matrices {M t} T t=1 Initialization: initialize x i s randomly, initialize M t as identity matricies; while not converged do Update x i s by using Algorithm 1 ; for t {1, 2,..., T } do Update M t s by using Algorithm 2 ; end end 3.2 Relation with Learning Multiple Independent Similarities Consider the problem of embedding N objects into D dimensional spaces from T sets of similarity triplets S 1, S 2,..., S T. By independent learning or independent embedding, we mean independently learning T embeddings X t, t = 1, 2,..., T. Each X t is learned from S t only. In contrast, our proposed model learns a global embedding X and T view specific metrics M t s and all S t s are used simultaneously. A simple parameter counting argument tells us that independent learning requires to fit O(N DT ) parameters. On the other hand, our joint learning model has O(ND + D 2 T ) parameters which is fewer when D < N. By comparing the family of embeddings that can be represented by each of these two models, it is not difficult to see that independently embedding into T spaces each of which has dimension D is a special case of jointly embedding in a T D dimensional space. Analytically, solving optimization problem (2.2) independently on T views can be expressed as min x 1,...,x N R T D ind T l (d M t (x i, x j ), d M t (x i, x k )) + β X 2 F, t=1 (i,j,k) S t 1, p = q {(t 1)D ind + 1, (t 1)D ind + 2,..., td ind }, where M t (p, q) = 0, otherwise. which is a special case of the joint learning problem (3.1). The same idea can be reached by approaching from the other direction. Consider a multiview embedding represented by the joint embedding parameterized by a X R D N and M t R D D, t = 1, 2,..., T. If we constrain the local metrics M t to be pair-wise orthogonal, i.e., M i, M j = 0 for all i j, then each M t is associated with a subspace W t R D. Two W t s associated with

20 different M t s are mutually orthogonal. Any x R D can be decomposed as x = T t=1 y t + z where x W t and z ( W t ). In this special case, finding a global embedding X reduces to seeking for embeddings in W t s independently. As a conclusion, jontly embedding in a D dimensional space subsumes indepdent learning in T spaces with dimensionsl D t s as a special case, if T t=1 D t D Influence of Parameters The objective function in (3.1) has two parameters β and γ in the regularization term. For all the loss functions mentioned in Sec 2.2, we can see that the value of these function only depends on the distance d M t (x i, x j ) and d M t (x i, x k ). For any scalar α > 0, by the definition of Mahalanobis distance (2.1), it is easy to see that d M t/α 2(αx i, αx j ) = d M t (x i, x j ) for all x i and x j. This implies that, if we substitute x i = αx i and M t = 1 α 2 M t for x i and M t in the objective function of (3.1), we can alter the value on regularization term while keeping the same loss. Meanwhile, due to arithmetic-mean-geometric-mean inequality, the value of regularization term is lower bounded by βα 2 X 2 F + T γ α tr(m t) 2 βγ X 2 2 F and the lower bound can be reached when α takes the value: t=1 α = ( γ T t=1 tr(m t) β X 2 F T tr(m t ) (3.2) On the other hand, if M t s and X are the optimal solution of (3.1), they must reach the lower bound above. (Otherwise, we can scale them by using the α that gives the lower bound and reduce the value of objective function further.) Thereby, although we have two hyper parameters β and γ, the effective regularization only depends on the value of βγ. Since Algorithm 3 updates x i s and M t s alternately, the lower bound in (3.2) may not be reached numerically. As a heuristic, we could add an optional step in the learning algorithm. That is, after one update of x i s and M t s, we could scale them simultaneously to reach the lower bound on regularization term. See Algorithm 4. ) 1 4. t=1

21 14 Algorithm 4: Multiple-metric Learning with Scaling Input: N, the number of objects; D, dimension of embedding; S t, t = 0, 1, 2,..., T, triplet constraints. Regularization parameters β, γ. Output: Embedding {x i } N i=1, PSD matrices {M t} T t=1 Initialization: initialize x i s randomly, initialize M t as identity matricies; while not converged do Update x i s by using Algorithm 1 ; for t {1, 2,..., T } do Update M t s by using Algorithm 2 ; end Scale x i s and M t s: end α ( γ T X αx, t=1 tr(m t) β X 2 F ) 1 4, M t M t /α 2 (t = 1,..., T ). 3.4 Alternative Formulations and Computational Complexity The joint learning model (3.1) parametrizes similarity metrics with a single global embedding x s and multiple local metrics M t s. Alternatively, we could model the global embedding implicitly by using a positive semi-definite kernel matrix K which can be considered as the Gram matrix and K = Φ Φ for some Φ R D N could be regarded as objects representations in an D dimensional space. Then, we can model the embedding of t-th view with a linear transformation L t R D D and L t Φ would be the embedding for t-th view. If we choose regularization for L t s appropriately, by the generalized representer theorem (Schölkopf et al., 2001), L t can be expressed as L t = W t Φ for some W t R D N. Now, the Gram matrix on the t-th view becomes K t = (( W t Φ ) Φ ) ( W t Φ ) Φ = KW t W t K, and distance between object i and j can be expressed as d t (i, j) = K t (i, i) 2K t (i, j) + K t (j, j) = K(i, :)W t W t K(:, i) The problem of learning multiple similarity metrics can be formulated as

22 15 min W 1,...,W t,k l ( K(i, :) K(j, :) 2, K(i, :) K(j, :) 2) (i,j,k) S 0 + s.t. K 0. T l (d t (i, j), d t (i, k)) + t=1 (i,j,k) S t T γ W t 2 F + β K 2 F, t=1 In this formulation, the objective function is neither convex in K nor in W t s. Since W t s only appear as inner products, we can substitute M t = W t W t but M t will have the size N N. In either way, solving the optimization problem would involve eigen decomposition of the N-by-N matrix K, which could be computationally expensive. On the contrary, formulation (3.1) is convex in M t s and involves eigen decomposition of D-by-D matrices M t s, which is less expensive when D < N. However, compared with independent learning, solving joint learning problem (3.1) with Algorithm 4 is empirically much slower, speculatively because it involves alternate updates in M t s and x i s.

23 CHAPTER 4 EXPERIMENTS In this section, we test our algorithm on both synthetic data and real datasets. One of the real dataset consists of images of airplanes with different poses where similarity between two poses is defined rigorously. Another real dataset uses triplets sampled from a few kernel matrices which are computed from attributes of facial images. The other dataset contains images of 200 species of birds where similarity triplets among the birds are crowd-sourced. We first give a brief explanation on the design of the experiments and introduce the datasets. Experimental results follow. 4.1 Experimental Setup On each dataset, we inspect the quality of embeddings learned from training sets with increasing sizes. We draw about equal numbers of training triplets from all views and check the quality of embedding. In addition, we inspect how the similarity knowledge on existing views could be transferred to a new view where the number of similarity comparisons is small. We did this by conduction an experiment in which we draw a small set of training triplets from one view but use large numbers of training triplets from the rest views. The quality of embeddings is measured from the following aspects: 1 Triplet generalization error. We split triplets randomly into a training set and a test set. Triplet generalization error is defined as the percentage of test triplets whose triplet relations are not correctly modelled by the learned embedding. 2 Leave-one-out classification error. We held out all information about objects class label during the training stage. Only triplet constraints are used for learning the embedding. Then, at the test stage, we choose one embedded object as target and predict its label by revealing the labels of its neighbours. We do this prediction for every objects in turn. The leave-oneout classification error is the percentage of objects whose labels are not correctly predicted. Throughout the experiments, we use a 3-nearest-neighbour classifier to test classification error. 16

24 17 van der Maaten and Weinberger (2012) showed that reducing triplet generalization error often leads to small nearest-neighbour classification error, although small triplet generalization doesn t necessarily guarantee low classification error. We also inspect embeddings obtained by simultaneously learning multiple metrics from this aspect. Throughout the experiments, we use hinge loss as the loss function. Our joint learning method is compared with two baselines, independent learning and pooled single view embedding. In independent learning, we conduct triplet embedding on every view as if they are independent tasks. In pooled single view embedding, triplets collected from all views are merged together to learn a single embedding, as if they are from the same similarity metric. When learning a single embedding, we adopt the problem (2.2) as the objective and solve it by using the package provided by the authors of van der Maaten and Weinberger (2012). To tune the regularization parameters, we further split the training sets for a 5-fold cross-validation and swept over {10 5, 10 4,..., 10 5 } for all parameters. Here are the details of the datasets: 4.2 Datasets Synthetic Data Two synthetic datasets are generated. One consists of 200 points uniformly sampled from a 20 dimensional unit hypercube, while the other dataset is generated in a 10 dimensional space and have 200 objects from a mixture of four Gaussians each of which has variance 1 and is randomly centered in a hypercube with side length 10. Six views are generated on each dataset. Every view is produced by projecting data points onto a random five dimensional subspace. Training and test triplets (i, j, k) are randomly sampled from all possible triplets on every views. Poses of Airplanes This dataset is constructed from 200 images of airplanes from the PASCAL VOC dataset (Everingham et al., 2010) which are annotated with 16 landmarks such as nose tip, wing tips, etc (Bourdev et al., 2010). We use these landmarks to construct a pose-based similarity. Given two planes and the positions of landmarks in these images, pose similarity is defined as the residual error of alignment between the two sets of landmarks under scaling and translation. We generated 5 views each

18 view 1 view 2 view 3 view 4 view 5 Figure 4.

scaling and translation. For e.g.view 1 consists of all landmarks. of which is associated with a subset of these landmarks as seen in Fig. 4.1 which shows three annotated images from the set.

However, there is a strong correlation between the views because the underlying set of landmarks are shared.

25 18 view 1 view 2 view 3 view 4 view 5 Figure 4.1: View specific similarities between poses of planes are obtained by considering subsets of landmarks shown by different colored rectangles and measuring their similarity in configuration up to a scaling and translation. For e.g.view 1 consists of all landmarks. of which is associated with a subset of these landmarks as seen in Fig. 4.1 which shows three annotated images from the set. The planes are highly diverse ranging from passenger planes to fighter jets, varying in size and form which results in a slightly different similarity between instances for each view. However, there is a strong correlation between the views because the underlying set of landmarks are shared. Additionally, we categorize the planes into five classes: left-facing, right-facing, pointingup, pointing-down and facing out or facing away to evaluate classification error. This produces classes with unbalanced number of members. About 80% of the images belongs to one of these three classes: left-facing, right-facing and facing out or facing away. Public Figures Face Data Public Figures Face Database is created by Kumar et al. (2009). It consists of 58,797 images of 200 people. Every image is characterized by 75 attributes which are real valued and describe the appearance of the person in the image. We selected 39 of the attributes and categorized them into 5 groups according to the aspects they describe: hair, age, accessory, shape and ethnicity. We randomly selected ten people and drew 20 images for each of them to create a dataset with 200

19 view 1 view 2 view 3 view 4 view 5 view 6 Figure 4.

(views 2, 3,..., 6). Images were taken from the CUB dataset Welinder et al. (2010) containing 200 species of birds. images.

those in the group. We describe the details of these attributes in the Appendix.

CUB-200 Birds Data We use the birds dataset introduced by Welinder et al. (2010) which contains images of birds categorized into 200 species.

Similarity triplets among species were collected in a crowd-sourced manner: every time, a user is asked to judge the similarity between an image

(2014), where K is the set of all 200 species.

target and K dissim having the ones considered dissimilar.

26 19 view 1 view 2 view 3 view 4 view 5 view 6 Figure 4.2: Perceptual similarities between bird species were collected by showing users either the full image (view 1), or crops around various parts (views 2, 3,..., 6). Images were taken from the CUB dataset Welinder et al. (2010) containing 200 species of birds. images. Similarity between instances for a given group is equal to the dot product between their attribute vectors where the attributes are restricted to those in the group. We describe the details of these attributes in the Appendix. Each group is considered as a local view and identities of the people in the images are considered as class labels. CUB-200 Birds Data We use the birds dataset introduced by Welinder et al. (2010) which contains images of birds categorized into 200 species. We consider the problem of embedding these species based on similarity feedback from human users. Similarity triplets among species were collected in a crowd-sourced manner: every time, a user is asked to judge the similarity between an image of a bird from the target specie z i and nine images of birds of different species {z k } k K using the interface of Wilber et al. (2014), where K is the set of all 200 species. For each display, the user partition these nine images into two sets, K sim and K dissim, with K sim containing birds considered similar to the target and K dissim having the ones considered dissimilar. Such a partition is broadcast to an equivalent set of triplet constraints on associated species, {(i, j, l) j K sim, l K dissim }. Therefore, for each user response, K sim K dissim triplet constraints are acquired. In the setting of multiple metric embedding, different views of birds are obtained cropping regions around various parts of the bird as shown in Fig. 4.2, and then using the same procedure as before to collect triplet comparisons. In this dataset, there are about 100,000 triplets cast on similarities from comparisons made on the whole birds, while there are other 5 views each of which

27 20 is cast on localized region of the birds (e.g.beak, breast, wing). Number of triplets obtained from these views range from about 4,000 to 7,000. For testing classification error, we use a taxonomy of the bird species provided by Welinder et al. (2010). To make the number of objects in all classes balanced, we manually grouped some of the classes to get 6 super classes in total. We note that these global and local similarity triplets were used by Wah et al. (2014a) and Wah et al. (2014b) as a way to incorporate human feedback during recognition. However, our focus is to learn better embeddings of the data itself by combining information across different views. 4.3 Experimental Results Synthetic data We first conduct experiments on synthetic dataset. Synthetic data with cluster is embedded in 5 and 10 dimensional spaces. The uniformly distributed synthetic data is embedded in 10 and 20 dimensional spaces. Triplet generalization errors and leave-one-out 3-nearest-neighbour classification errors are plotted in Fig. 4.3 and Fig The small plots on the right illustrate the view-wise test errors while the big plot on the left shows the average across all views. The dataset with cluster was sampled from a 10 dimensional space and each local view lies in a 5 dimensional space. Our algorithm achieves significant improvement on small trianing samples and it continues to perform better than the baselines except that independent learning obtains a lower triplet generalization error than joint learning on training sets with nearly 40,000 triplets. On the dataset that is uniformly sampled from a 20 dimensional space, joint embedding reduces triplet generalization error faster than independent embedding and pooled embedding as the number of training triplets increases until around 20,000 triplets. At this point, joint embedding in a 10 dimensional space is outperformed by its independent learning counterpart. However, joint embedding in a 20 dimensional space has a lower error rate than the baselines in all cases. Poses of Airplanes The airplanes are embedded into a 10 dimensional space based on a training setting which consists of 3,000 triplets from every view. As an illustration, we project the learned global view of the objects onto their first two principle dimensions via SVD and illustrated the embedding in Fig The visualization shows that objects roughly lies on a circle. Meanwhile, the same figure shows a clear clustered structure and objects belonging to each of the three dominant classes are clustered

28 21 triplet generalization error LOO 3-NN classification error joint-5d ind-5d pooled-5d joint-10d ind-10d pooled-10d number of training triplets across all views joint-5d ind-5d pooled-5d joint-10d ind-10d pooled-10d number of training triplets across all views Figure 4.3: Experimental results on synthetic data with clusters. Top: Triplet generalization error. Bottom: Leave-one-out 3-nearest-neighbor classification error.

29 triplet generalization error joint-10d ind-10d pooled-10d joint-20d ind-20d pooled-20d number of training triplets across all views 22 Figure 4.4: Triplet generalization errors on uniformly distributed synthetic data. together, except that members in the class facing out or facing away are separated into two clusters. Fig. 4.6 shows the triplet generalization errors and classification errors of the learned embeddings. Errors on local views are shown in small plots while the large plots show the average across views. Results of independent learning is included in the same figure. It shows that our algorithm almost always performs better in both predicting triplets and classes when using less than 10,000 training triplets. However, this advantage disappears when training set becomes larger. Public Figures Face Dataset The 200 images are embedded into a 5, 10 and 20 dimensional spaces. We draw triplets randomly from the ground truth similarity measure to form training and test sets. See Fig. 4.7 for a visualization of data embedded under the metric of the first view. Triplet generalization errors and classification errors are shown in Fig In terms of triplet generalization errors of 5 and 10 dimensional embeddings, we find that joint learning performs better on small training triplets while its error rate surpasses independent learning as training set get larger. Joint learning and independent learning have comparable triplet generalization error when the embedding is conducted in a 20 dimensional space. Pooled embedding has the highest triplet generalization errors among the three methods. In terms of leave-one-out classification error, joint learning has a lower error rate than independent learning and pooled embedding while pooled embeddings performs the best on training sets with more than about 10,000 triplets.

30 Joint Embedding (Global) Pointing up Pointing down Left facing Right facing Facing out/away Figure 4.5: The global view of embeddings of poses of planes. CUB-200 Birds Data Student Version of MATLAB The bird data set is more challenging in a sense that it emerges in a more real situation where triplet relations among all bird species are not all available because it is crowdsourced. We learn the embedding both in a 10 dimensional space and a 60 dimensional space. An illustration of embedding learned in 60 dimensional space can be found in Fig In the birds dataset, number of available triplets from the first view is more than the available triplets from other views. During the experiment, we first sample equal numbers of triplets from each view to form up the training set. We keep adding triplets to the training set until it has 3,000 triplets from each view. Then, we add triplets only from the first view and test the embedding in all views. See Fig for triplet generalization error and leave-one-out 3-nearest-neighbour classification errors. The solid vertical line shows the moment that we start to add training triplets only to the first view. We found that, different from the results of previous data sets, our algorithm doesn t work better on this data set in terms of triplet generalization error. However, the triplet generalization error of learning multiple views jointly in a 60 dimensional space is comparable to the ones get from embedding every views independently. In terms of leave-one-out classification error, our method obtains lower error on all views except for the first view.

31 24 triplet generalization error LOO 3-NN classification error joint-3d ind-3d pooled-3d joint-10d ind-10d pooled-10d number of training triplets across all views joint-3d ind-3d pooled-3d joint-10d ind-10d pooled-10d number of training triplets across all views Figure 4.6: Results on poses of planes dataset. Embedding is learned in 3 and 10 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures shows errors on individual views and the large figures show the average.

32 25 Independent Embedding Joint Embedding Figure 4.7: Illustration of public figures Student face Version of MATLAB data embedded under the metric of the Student Version first of MATLAB view: hair. Points with the same color belong to the same person. Embeddings are learned in a 10 dimensional space using 10 5 triplets and then further embedded in a 2 dimensional plane by using tsne (van der Maaten and Hinton, 2008) for the purpose of visualization. Left: embeddings learned independently. Right: embeddings learned jointly. 4.4 Learning a New View (Zero-Shot Learning) On the CUB-200 birds dataset, we simulated a setting that is similar to zero-shot learning (Palatucci et al., 2009). We draw a training set contains 100 triplets from the 2nd local view and 3,000 triplets from all other 5 views. And we investigate how does joint learning help create an embedding on the 2nd local view. The learned embedding is shown in Fig It is evident that, embedding learned purely out of 100 triplets from that view has objects from different classes completely mixed together, while the local view from the embedding learned jointly group objects from some classes better. For example, members in the class Passeriformes(Emberizidae) is more separated from members in Passeriformes(Icteridae). Meanwhile, the last plot in the same figure shows the change in triplet generalization error as we add more training triplets from the 2nd local view. We can see that joint learning has lower triplet generalization error when the number of training triplets is small but as is expected it is matched and then outperformed by independent learning soon as the training set gets larger. 4.5 Discussion We have shown through experiments that our algorithm achieves lower triplet generalization errors on synthetic data, airplane poses data and public figure faces data when the training set is relatively small. Since in some real applications, similarity triplets can be expensive to obtain, jointly learning similarity metrics is preferable as it can recover the underlying structure using relatively small

33 26 triplet generalization error LOO 3-NN classification error joint-5d ind-5d pooled-5d joint-10d ind-10d pooled-10d joint-20d ind-20d pooled-20d number of training triplets across all views joint-5d ind-5d pooled-5d joint-10d ind-10d pooled-10d joint-20d ind-20d pooled-20d number of training triplets across all views Figure 4.8: Results on public figures faces dataset. Embeddings are learned in 5, 10 and 20 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures shows errors on individual views and the large figures show the average.

34 27 Independent Embedding (View #1) Joint Embedding (View #1) Charadriiformes Passeriformes (Emberizidae) Passeriformes (Icteridae) Passeriformes (Other) Passeriformes (Parulidae) birds (other) Student Version of MATLAB Charadriiformes Passeriformes (Emberizidae) Passeriformes (Icteridae) Passeriformes (Other) Passeriformes (Parulidae) birds (other) Figure 4.9: Illustration of CUB-200 birds data. The figure shows data s embedding under the metric of the first view. Embeddings are learned in a 60 dimensional space using 18,000 triplets and then further embedded in a 2 dimensional plane by using tsne (van der Maaten and Hinton, 2008) for the purpose of visualization. Top: embeddings learned independently. Bottom: embeddings learned jointly. Student Version of MATLAB

35 Figure 4.10: Results on CUB-200 birds dataset. Top: triplet generalization error. Bottom: leaveone-out 3-nearest-neighbor classification error. The small figures shows errors on individual views and the large figures show the average. 28

36 Independent Embedding (View #2) Joint Embedding (View #2) triplet generalization error joint-10d ind-10d Charadriiformes Passeriformes (Emberizidae) Passeriformes (Icteridae) Passeriformes (Other) Passeriformes (Parulidae) birds (other) Charadriiformes Passeriformes (Emberizidae) Passeriformes (Icteridae) Passeriformes (Other) Passeriformes (Parulidae) birds (other) number of training triplets from the local view Student Version of MATLAB Student Version of MATLAB Figure 4.11: Learning a new view on CUB-200 birds dataset. Training data contains 100 triplets from the second local view and 3,000 triplets from other 5 views. Embeddings are learned in a 10 dimensional space and then further embedded in a 2 dimensional plane by using tsne (van der Maaten and Hinton, 2008) for the purpose of visualization. Left: triplet generalization error on the second local view. Middle: embedding learned independently. Right: embedding learned jointly. number of training data. Experiments showed that jointly learning multiple metrics performs better in terms of 3-nearestneighbour classification error almost in all cases, which implies that it has a potential to recover the category level structure of the data. From the experimental results presented above, the gain in performance when using joint learning varies among dataset. On synthetic data with clusters and poses of planes dataset, the benefit of using joint learning is evident and it is especially large on small number of training triplets. Its performance gain on uniformly distributed synthetic data and public figure faces dataset is relatively small and very limited on CUB-200 birds dataset. Influence may come from multiple aspects. We discuss two different yet related factors: (1) choice of dimensions and (2) correlation underlying different views Influence of Dimension From the results on public figure faces and CUB-200 (Fig. 4.8 and 4.10), we see that dimensions of embedding spaces affect the triplet generalization errors of embeddings learned jointly especially when the training set is large. For example, on public figure dataset (Fig. 4.8), joint learning reduces the error faster than independent learning up to around 10,000 triplets. When there are more than 10,000 triplets, errors of joint embedding reduce monotonically as number of dimensions increases. Meanwhile, when we learn a 10 dimensional joint embedding for the uniform synthetic dataset sampled from a 20 dimensional space, we see that joint learning obtains lower triplet generalization error on small up to around 20,000 triplets but was catched up by independent learning on larger training sets (Fig. 4.4). This can be understood as a bias induced by the joint learning.

37 30 As analysed in Sec. 3.2, joint learning has a lower complexity when embedding dimension is chosen the same as independent learning. However, experimental results have shown that in many cases joint learning has a large performance gain even when using the same embedding dimension as independent learning. For example, when public figures data was embedded in a 20 dimensional space, joint embedding continues to perform better than independent learning. Meanwhile, from the results of synthetic data (clustered) and poses of planes data (Fig. 4.3 and 4.6), joint learning is able to achieve lower errors. We interpret this as that when the model has sufficient amount of complexity to capture the inherent structure of data, joint embedding makes use of triplets from all views simultaneously while independent learning only employ triplets from individual views. Next, we consider one particular aspect of the underlying structure, which is the correlation between different views Correlation Between Views Another question of particular interest is how the underlying similarities affect the performance of proposed joint learning algorithm. Speculatively, joint learning would be affected by the correlation between similarity metrics of different views. Here, we examine such correlation from two aspects: consistency of triplet constraints and correlation of distances. Consistency of Triplet Constraints We assume that there is a ground truth embedding for each similarity metric. Since there are 3! ( N 3 ) triplets on N objects and any similarity metric would satisfy exactly half of them 1, a similarity metric induces m(n) = 3! ( N 3 ) /2 triplet constraints. Triplet constraint consistency between two views is defined as the ratio of number of triplet constraints that are shared between both of the views to m(n). In our joint learning algorithm, the global embedding X is trained to conform to triplet constraints over all views. Therefore, if there exists a pair of conflicting triplet constraints, it is possible that joint learning might not be able to model both triplets well. We examine consistency of triplet constraints on a dataset in the following way. On datasets where we have the ground truth similarity metrics in the form of true embedding (e.g., synthetic data) or true similarity kernel (e.g., public figure data), we draw 100,000 triplets at random and check if each triplet is satisfied or violated on each view. Triplet constraint consistency between two similarity metrics is estimated as the percentage of triplets which received the 1 It is because if the triplet constraint (i, j, k) is satisfied under a metric, then the constraint (i, k, j) must be violated by the same metric.

38 31 same judgement under two metrics, i.e., satisfied or violated by both metrics. On poses of planes data and CUB-200 dataset, since we do not have access to all triplet constraints, we estimate the triplet consistency from embeddings learned independently with the largest number of triplets. The triplet constraint consistency for datasets used in this work is list in the Appendix C. The average of triplet constraint consistency for each dataset is summarized in Table 4.1. Correlation of Distances A metric over N objects can be fully captured by a distance matrix D whose (i, j)-th entry is the distance between object i and j. D can be computed from the kernel matrix K by D(i, j) = K(i, i) K(i, j) K(j, i) + K(j, j). Suppose D 1 and D 2 are two distance matrices. Since a (pseudo) metric is symmetric and so is the distance matrix, to compare two distance matrices, we may restrict to the lower-triangle of them and define their correlation as Corr(D 1, D 2 ) = N 1 i<j N (D 1(i, j) µ 1 ) (D 2 (i, j) µ 2 ) 1 i<j N (D 1(i, j) µ 1 ) 2 N 1 i<j N (D 2(i, j) µ 2 ) 2 where µ t = 2 N(N 1) 1 i<j N D t (i, j), t = 1, 2. On synthetic data and public figure faces dataset, we are able to either compute the distance matrix or have access to the true kernel matrix. Therefore we can compute the distance correlation between every pair of local metrics for each of them. For poses of planes and CUB-200 data, we estimate the correlation of distances from embeddings learned independently with the largest number of triplets. Distance correlations between every pair of views on each dataset are shown in Appendix C. The average of distances correlation for each dataset is summarized in Table 4.1. Relation between triplet consistency and correlation of distances. Since we aim to recover the underlying similarity metric from triplet constraints, a natural question would be: how far could two metrics which satisfy the same set of triplet constraints be possibly away from each other. Since triplet consistency and correlation of distances are related to the two sides of this problem, we inspect the relation between these them empirically, to obtain some clues to the problem. A rigorous analysis is left as a future work. Here, an approximate relation between the two quantities are shown first. Its intuition is given

39 sin(0.5 π (2 TripletConsist 1)) correlation syn (cluster) syn (uniform) pubfig bird poses of planes Figure 4.12: Relation between triplet consistency and correlation of distances. subsequently. Let δ(d 1, D 2 ) denote the triplet consistency between two distance metrics. We illustrated the relation between δ(d p, D q ) and Corr(D p, D q ) in Fig It shows that the two quantities roughly follow the relation ( π ) sin 2 (2δ(D p, D q ) 1) = Corr(D p, D q ). (4.1) The intuition behind this relation unfolds as follows. Given two distance matrices D 1 and D 2 over N objects, consider the following generative model: 1. Let (I, J) be a random variable that takes value from the space of all un-ordered pairs among these N objects with equal change. 2. Define Y 1 := D 1 (I, J), Y 2 := D 2 (I, J). By this, Corr(D 1, D 2 ) defined previously is exactly the correlation of Y 1 and Y 2. On the other {( ) } hand, given a sample y (t) 1, y (t) 2 : t = 1,..., m draw from the generative model defined above,

40 33 the Kendall s τ (Kruskal, 1958) is defined by τ = 2 m(m 1) 1 t<t m [( ) ( )] sign y (t) 1 y (t ) 1 y (t) 2 y (t ) 2 and, as stated by Liu et al. (2012), its population version is given by ( ( ) ( )) τ := Corr sign Y 1 Ỹ1, sign Y 2 Ỹ2 where Ỹt is i.i.d. to Y t, t = 1, 2. And, ( ) ( ) τ = P (Y 1 Ỹ1)(Y 2 Ỹ2) > 0 P (Y 1 Ỹ1)(Y 2 Ỹ2) < 0 ( Loosely speaking, we have τ = 2P (Y 1 Ỹ1)(Y 2 Ỹ2) > 0 ( ) which is approximately 2P (Y 1 Ỹ1)(Y 2 Ỹ2) > 0 ) ( ) +P (Y 1 Ỹ1)(Y 2 Ỹ2) = 0 1, 1 when the number of objects N is large. As we define Y 1 and Y 2 to be the distances of a random pair of objects under metrics D 1 and D 2, ( ) P (Y 1 Ỹ1)(Y 2 Ỹ2) > 0 can be regarded as the consistency of quadruple relations induced from these two metrics. Each quadruple relation is written as (A, B, C, D) which means A is more similar to B than C is to D. And it subsumes triplet constraints as special cases. If Y 1 and Y 2 are jointly Gaussian or belong to a nonparanormal family, it is known that sin( π 2 τ) = ρ Y 1,Y 2 (4.2) where ρ Y1,Y 2 is the correlation between Y 1 and Y 2. See (Kruskal, 1958; Liu et al., 2012). This gives the intuition of using (4.1) to approximate the relation between triplet consistency and correlation of pair-wise distances. Note that, here, we use triplet consistency as an estimation of quadruple consistency. Besides, Y 1 and Y 2 obtained from the generative model do not necessarily resemble Gaussian. But, empirically, it works on the dataset used in this work quite well, as shown in Fig Relating performance gain with view correlation. The performance gain is measured by the difference between the area under the triplet generalization errors normalized by that of the independent learning, in the case of 10 dimensional embedding. Relating the performance gain with correlation among similarities (Table 4.1, Fig. 4.13), we see that on synthetic data (clustered) where joint learning obtained a significant performance gain, triplet consistency and distance correlations

41 dataset average triplet consistency average distance correlation performance gain (%) synthetic (clustered) synthetic (uniform) poses of planes public figure CUB Table 4.1: Measure of similarity correlation and performance gain of using joint learning. Entries marked with are values estimated from independent embeddings syn (clustered) syn (clustered) 40 performance gain (%) syn (uniform) pubfig poses of planes syn (uniform) pubfig poses of planes 0 CUB-200 CUB triplet consistency correlation of distances Figure 4.13: Performance gain and correlation of views. are high. On the other hand, synthetic data (uniform), poses of planes and public figure faces dataset have relatively mild view correlations. Performance gains on these dataset are not very significant. In contrast, triplet consistency of CUB-200 is close to random (0.5). That is probably the reason that performance gain on this dataset is very limited.

42 CHAPTER 5 RELATED WORK Previously, we have covered some literature that is closely related to our formulation of jointly learning multiple metrics from triplet constraints. In this chapter, we would like to mention some other work that could be related to the problem of learning multiple similarities from pairwise comparisons and take some glimpses at this problem from different angles. 5.1 Multitask Metric Learning with Low-Rank Tensors Tensors are widely used in the context of multi-task learning (Evgeniou and Pontil, 2007; Romera- Paredes et al., 2013). Learning a low-rank tensor is a problem that has been studied and used in many applications (Tomioka et al., 2011; Signoretto et al., 2014; Liu et al., 2013; Blankertz et al., 2008). Since T similarity metrics over a given set of N obejets can be represented as kernel matrices K t, t = 1,..., T, we may consider them as slices of a N N T tensor K where the (i, j, t) entry of K reflects the similarity between objects i and j under the t-th view. Meanwhile, under the assumption that there exists correlation among different kernel matrices, we may seek for a K with low rank. See (Kolda and Bader, 2009) for the definition of tensor rank. Let (i,j,t) R N N T be a tensor with entries (i,j,t) (i, i, t) = (i,j,t) (j, j, t) = 1, (i,j,t) (i, j, t) = (i,j,t) (j, i, t) = 1 and 0 everywhere else. Then its inner product with K is euqal to the (squared) distance under t-th similarity metric, i.e., (i,j,t), K = K i,i,t K i,j,t K j,i,t + K j,j,t By this, each triplet constraint can be expressed as a linear constraint on K as (i, j, k) S t = (i,j,t), K < (i,k,t), K. 35

43 36 The problem of learning multiple similarity metrics can be formulated with a tensor as: min K T l ( (i,j,t), K, (i,k,t), K ) + Ω(K), (5.1) t=1 (i,j,k) S t s.t. K(:, :, t) 0 (t = 1,..., T ). where Ω(K) is a regularization term which induces low-rank tensor. See (Tomioka et al., 2010) for example choices of such regularizations. Comparing (5.1) with (3.1), we can see that (3.1) can also be cast into a tensor learning problem but the learning space is restricted to tensors with a special form of factorization K = M 1 X 2 X. Here, M is a D D T tensor whose t-th slices is required to be a positive semi-definite matrix and corresponds to the M t representing the metric under the t-th view while X R D N represents the global embedding in (3.1). 5.2 Relation between Triplet Embedding and Metric Learning In the literature, embedding with pairwise comparisons is sometimes called ordinal embedding (Alon et al., 2008; von Luxburg et al., 2014; Terada and Luxburg, 2014) or partial order embedding (McFee and Lanckriet, 2009). von Luxburg et al. (2014) consider the problem of recovering the embeddings of n objects from pairwise comparisons. They proved that, asymptitocally, if we have the knowledge of the ordinal relationships for all quadruples (i, j, k, l), as n, the set of embedded points always converges to the set of original points, up to similarity transformations such as rotations, translations, rescalings and reflections. Developed on that, a further work of Terada and Luxburg (2014) investigates the consistency of local ordinal embedding (LOE) which only uses triplet constraints that reflect k-nearest-neighbour information. More specifically, in the setting of LOE, a triplet constraints (i, j, l) implies that j is a k-nearest-neighbour of i but l is not. The authors proved that, under certain conditions, it is possible to reconstruct the point set x 1,..., x n asymptotically, if we just know the k-nearest neighbours of each point. They also raised a Soft Ordinal Embedding algorithm, not only to recover the ordinal constraints, but the density structure underlying data set. Jamieson and Nowak (2011) derived a lower bound for minimum number of queries of triplet relations (i, j, k) that is needed to determine the embedding. They proved that at least Ω(dn log n) such comparisons is needed to determine the embedding of n objects into a d dimensional space and such lower bound cannot be achieved by randomly choosing pairwise comparisons. The work

44 37 of McFee and Lanckriet (2009) suggested that it might be hard to find the minimal dimension needed to produce an embedding that satisfies all the constraints. They showed that it is NP- Complete to decide if a given set of oridnal constraints C can be satisfied in R 1. Alon et al. (2008) studied the problem of minimum-relaxation ordinal embedding. Roughly speaking, given a distance function D(i, j), an ordinal embedding with relaxation α is an embedding that satisfies D(i, j) > αd(p, q) = d(x i, x j ) > d(x p, x q ) where x i s are the learned embeddings and d(, ) is the metric function of the embedding space. They also established that the problem of ordinal embedding has many qualitative differences from metric embedding which aims to minimize distortion.

45 CHAPTER 6 CONCLUSION AND FUTURE WORK In this work, we introduced our formulation on the problem of jointly learning multiple similarity metrics which is a natural extension of the conventional independent learning model. The proposed model consists of a global view, which represents each object as a fixed dimensional vector, and local views, which specifies each view-specific Mahalanobis metric as a positive semi-definite matrix. Its performance is studied empirically on both synthetic dataset and dataset from the real world. Experimental results have demonstrated that, on dataset with a higher correlation between similarity metrics, joint learning has a better performance than independent learning in terms of triplet generalization error, especially for the cases where the number of training triplets is small. This suggests that joint learning would be preferable for applications where triplet relations can be expensive to obtain. Although the results are presented for the hinge loss (Agarwal et al., 2007), the proposed algorithm easily generalize to other loss functions, e.g., CKL loss (Tamuz et al., 2011), t-ste loss (van der Maaten and Weinberger, 2012). Meanwhile, joint learning tends to have a lower leave-one-out classification error in most situations which implies it could support supervised learning tasks such as learning a classifier. As a future work, we aim to study how to use embeddings learned from similarity comparisons for classification tasks where data is partially labelled. Some preliminary analysis shows that the underlying correlation among similarity metrics would notably affect the performance of joint learning. Based on our empirical study, to achieve performance comparable to independent learning, joint learning would require a higher dimensional embedding space on datasets where views are less correlated. It would be interesting to find a way of estimating the amount of view correlation and further provide guidance for choosing the dimension of embedding space. Moreover, as mentioned in Section 5.1, multiple metrics learning from triplet constraints could be formulated as a problem of learning a low-rank tensor under linear constraints. We may explore this approach and use low-rank tensors as the tool to model correlated similarity metrics. Resort to existing studies on tensors, we might get some insight on learning multiple metric learning under triplet constraints. 38

46 39 On the other hand, embeddings learned from joint learning can be used in higher level machine learning tasks, such as fast retrieval and recognition tasks. We may integrate our joint learning framework with those tasks and study the performance of such systems in real application, such as music recommendation (McFee et al., 2012), word embedding in language (Mikolov et al., 2013) and interactive fine-grained recognition (Wah et al., 2014a). Since triplet relations might be expensive to obtain in many applications, when building such a system, it would also be interesting to develop an active triplet sampling algorithm, as an extension to (Jamieson and Nowak, 2011), so that the queried triplets are the most informative ones.

47 APPENDIX A VIEWS OF POSES OF AIRPLANES DATASET Each of the 200 airplanes were annotated with 16 landmarks namely, 01. Top Rudder 05. L WingTip 09. Nose Bottom 13. Left Engine Back 02. Bot Rudder 06. R WingTip 10. Left Wing Base 14. Right Engine Front 03. L Stabilizer 07. NoseTip 11. Right Wing Base 15. Right Engine Back 04. R Stabilizer 08. Nose Top 12. Left Engine Front 16. Bot Rudder Front This is also illustrated in the Figure A.1. The five different views are defined by considering different subsets of landmarks as follows: 1. all {1, 2,..., 16} 2. back {1, 2, 3, 4, 16} 3. nose {7, 8, 9} 4. back+wings {1, 2,..., 6, 10, 11,..., 16} 5. nose+wings {5, 6,..., 15} For triplet (A, B, C) we compute similarity s i (A, B) and s i (A, C) by aligning the subset i of landmarks of B and C to A under a translation and scaling that minimizes the sum of squared error after alignment. The similarity is inversely proportional to the residual error. This is also known as procrustes analysis commonly used for matching shapes. 40

48 Figure A.1: Landmarks illustrated on the several planes 41

Learning Models of Similarity: Metric and Kernel Learning. Eric Heim, University of Pittsburgh

Learning Models of Similarity: Metric and Kernel Learning Eric Heim, University of Pittsburgh Standard Machine Learning Pipeline Manually-Tuned Features Machine Learning Model Desired Output for Task Features