THE UNIVERSITY OF CHICAGO JOINTLY LEARNING MULTIPLE SIMILARITY METRICS FROM TRIPLET CONSTRAINTS A DISSERTATION SUBMITTED TO


THE UNIVERSITY OF CHICAGO
JOINTLY LEARNING MULTIPLE SIMILARITY METRICS FROM TRIPLET CONSTRAINTS
A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN CANDIDACY FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
BY LIWEN ZHANG
CHICAGO, ILLINOIS
WINTER, 2015


ABSTRACT
The measure of similarity plays a crucial role in many applications such as content-based recommendation, image search and speech recognition. Similarity between objects is multifaceted, and it is easier to judge similarity when the focus is on a specific aspect. We consider the problem of mapping objects into view-specific embeddings where the distance between them is consistent with similarity comparisons of the form "from the t-th perspective, object A is more similar to B than to C". We propose a framework to jointly learn multiple views by deploying different similarity metrics in a unified embedding space. Our approach is a natural extension of the view-wise independent approach and is capable of exploiting the correlation between the views if it exists. Experiments on a number of datasets, including a large dataset of multi-view crowdsourced comparisons on bird images, show that the proposed method achieves lower triplet generalization error and better grouping of classes in most cases, compared to learning embeddings independently for each view or learning a single embedding from triplets collected on all views. More specifically, on datasets where correlation between views is strong, the proposed method achieves significant improvement, while on datasets with limited view correlation, it still performs no worse than its independent learning counterpart.

ACKNOWLEDGEMENTS
This thesis is based on joint work with Ryota Tomioka, my thesis advisor, and Subhransu Maji, who is now a faculty member at UMass Amherst. I am grateful to them for their collaboration and help on many technical details. I also wish to express my sincere thanks to Ryota for his guidance throughout the completion of this thesis. Thanks go to Min Xu at the Statistics Department for pointing out the idea of using Kendall's τ in the analysis.

TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
1 INTRODUCTION
2 LEARNING SIMILARITY FROM RELATIVE COMPARISON
    Learning with Relative Similarity Measure
    Triplet Embedding
3 JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES
    Formulation
    Relation with Learning Multiple Independent Similarities
    Influence of Parameters
    Alternative Formulations and Computational Complexity
4 EXPERIMENTS
    Experimental Setup
    Datasets
    Experimental Results
    Learning a New View (Zero-Shot Learning)
    Discussion
    Influence of Dimension
    Correlation Between Views
5 RELATED WORK
    Multitask Metric Learning with Low-Rank Tensors
    Relation between Triplet Embedding and Metric Learning
6 CONCLUSION AND FUTURE WORK
A VIEWS OF POSES OF AIRPLANES DATASET
B ATTRIBUTES OF PUBLIC FIGURES FACE DATASET
C CORRELATION BETWEEN SIMILARITY METRICS ON MULTIPLE DATASETS
REFERENCES

LIST OF FIGURES
1.1 Illustration of ambiguity in similarity
View specific similarities between poses of planes
View specific similarities between birds
Experimental results on synthetic data with clusters
Experimental results on uniformly distributed synthetic data
The global view of embeddings of poses of planes
Experimental results on poses of planes dataset
Illustration of public figures face data embedded under the metric of the first view
Experimental results on public figure faces dataset
Illustration of CUB-200 birds data
Experimental results on CUB-200 birds dataset
Learning a new view on CUB-200 birds dataset
Relation between triplet consistency and correlation of pairwise-distances
Performance gain and correlation of views
A.1 Landmarks illustrated on the several planes

LIST OF TABLES
4.1 Measure of similarity correlation and performance gain of using joint learning. Marked entries are values estimated from independent embeddings
B.1 List of Pubfig attributes that were used in this work
C.1 Consistency of triplet constraints between different views in synthetic data (clustered)
C.2 Consistency of triplet constraints between different views in synthetic data (uniformly distributed)
C.3 Consistency of triplet constraints between different views in poses of planes dataset estimated from independent embedding
C.4 Consistency of triplet constraints between different views in public figures dataset
C.5 Consistency of triplet constraints between different views in CUB-200 dataset estimated from independent embedding
C.6 Distance correlation between different views in synthetic data (clustered)
C.7 Distance correlation between different views in synthetic data (uniformly distributed)
C.8 Distance correlation between different views in poses of planes dataset estimated from independent embedding
C.9 Distance correlation between different views in public figures dataset
C.10 Distance correlation between different views in CUB-200 dataset estimated from independent embedding

CHAPTER 1 INTRODUCTION
The measure of similarity plays an important role in applications such as content-based recommendation, image search and speech recognition, and various techniques to learn such a measure from data have been proposed (Jain et al., 2008; Weinberger and Saul, 2009; McFee and Lanckriet, 2011). While the concept of similarity between objects can be abstract, it is typically captured by embedding objects into a vector space equipped with a distance metric that conforms to the similarity observations. Similarity comparisons of the form "object A is more similar to B than to C" are a commonly used form of supervision for learning such embeddings (Agarwal et al., 2007; Tamuz et al., 2011; van der Maaten and Weinberger, 2012). Although judging similarity comparisons is easier than rating similarity on an absolute scale (Kendall and Gibbons, 1990), there is sometimes ambiguity in how similarity is measured. Consider the problem of comparing three birds (see Fig. 1.1): the answer to the question "is bird A more similar to B or to C?" could depend on which aspect of the bird one is interested in. Most annotators will say that the head of bird A is more similar to the head of B, while the back of A is more similar to that of C. In the presence of multiple notions of similarity, it is natural to treat observations from different perspectives separately. To address this issue, we study the setting where the similarity observation takes the form "from the t-th perspective, A is more similar to B than to C". These view-specific similarities can be seen as an extension of attributes that can be collected without crisply defining attribute vocabularies, and they can encode continuous and multidimensional structures, such as color and pose. Similarity in such cases can be expressed as a view-specific low-dimensional embedding, e.g., color as a (red, green, blue) vector.
In addition to making the annotation task simpler, multiple perceptual similarities can also enable precise feedback for human-in-the-loop tasks, such as content-based image retrieval (Cox et al., 2000) and interactive fine-grained recognition (Wah et al., 2014a). The main drawback of learning view-specific embeddings independently is that one needs similarity comparisons for each view. Learning even a single embedding of $N$ objects may require $O(N^3)$ triplet comparisons, and independent learning incurs this cost for every view. This can be expensive to obtain, especially when experts are involved.

[Figure 1.1 image: birds A, B and C, with rows labelled "back" and "head", under the question "Is A more similar to B than C?"]
Figure 1.1: Ambiguity in similarity. A is more similar to C than B when focussing on the back (middle row), but is more similar to B than C when focussing on the head (bottom row).

We propose a framework to learn embeddings jointly that addresses this drawback. The key intuition is that while different notions of similarity may produce contradicting triplet relations between some instances, they could be related for most of them. For instance, photos of a car taken from different angles are correlated, as they can be considered as projections of a 3D model onto different planes. In a symphony, the music played by individual instruments in the orchestra could be related, as the parts may share some of the melody or rhythm. Our method exploits such correlation between views by constructing a unified embedding space where different notions of similarity are treated as its subspaces. We perform experiments on a synthetic dataset and three real datasets from different domains, namely, poses of airplanes, features from face images (PubFig dataset, Kumar et al., 2009), and crowd-sourced similarities collected on different body parts of birds (CUB dataset, Welinder et al., 2010). For a given amount of training data per view, the proposed joint learning approach obtains lower triplet generalization error than the naive independent learning approach on most datasets. The proposed joint learning approach also tends to obtain better cluster structures at the category level, which are measured through the leave-one-out classification error. On datasets where different similarity metrics are highly related, the joint embeddings are significantly better. The rest of this thesis is structured as follows: in Chapter 2, we review the literature on learning perceptual similarity from relative comparisons. In Chapter 3, we formulate the problem of multiple-metric triplet embedding as an extension of learning a single similarity metric, and present our algorithm for solving it. Experiments on both synthetic and real datasets are presented in Chapter 4. Chapter 5 lists some other work related to our problem.
Discussion and future directions are presented in Chapter 6.

CHAPTER 2 LEARNING SIMILARITY FROM RELATIVE COMPARISON
While the concept of similarity between objects can be abstract, it is typically captured by embedding objects into a vector space equipped with a distance metric that conforms to the similarity observations. Suppose there are $N$ objects represented as $D$-dimensional vectors $\{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$ and we would like to learn a distance mapping $d : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}_0^+$ which satisfies the properties:
1. $d(x_i, x_j) \ge 0$ (non-negativity).
2. $d(x_i, x_j) = d(x_j, x_i)$ (symmetry).
3. $d(x_i, x_j) + d(x_j, x_k) \ge d(x_i, x_k)$ (triangle inequality).
Notice that $d$ here isn't necessarily a metric because it doesn't have to satisfy the following property: $d(x_i, x_j) = 0$ if and only if $x_i = x_j$ (distinguishability). If a distance mapping satisfies conditions (1), (2) and (3) above, but not distinguishability, it is called a pseudometric. Throughout this work, we consider the learning of pseudometrics. A family of metrics that is commonly used in the literature is the Mahalanobis metrics, which define the distance as
$$d(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) \qquad (2.1)$$
where $M$ is a $D \times D$ positive semidefinite matrix. In a learning problem, when the $x_i$'s are given, we may learn a positive semidefinite matrix $M$. It is common to think of this as learning a linear transformation $L$ and letting $d(x_i, x_j) = (x_i - x_j)^\top L^\top L (x_i - x_j)$, since $L^\top L$ is guaranteed to be positive semidefinite. When the $x_i$'s are not given, we may take $x_i = e_i \in \mathbb{R}^N$ and learn a rank-$D$ positive semidefinite matrix $M \in \mathbb{R}^{N \times N}$. Because $M$ can be decomposed as $M = L^\top L$ for some $L \in \mathbb{R}^{D \times N}$, and $L x_i$ is simply the $i$-th column of $L$ in this case, learning a rank-$D$ metric $M$ with the $x_i$'s fixed is equivalent to taking $M$ to be the identity matrix $I_D$ and learning the embedding $x_i$'s in $\mathbb{R}^D$ directly.
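The relation between a Mahalanobis matrix $M = L^\top L$ and the linear map $L$ can be checked numerically. The following is a minimal sketch, not code from the thesis: the helper name `mahalanobis_sq`, the sizes and the random data are assumptions chosen for illustration. It also shows why a rank-deficient $M$ yields only a pseudometric.

```python
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Distance of the form (2.1): (x_i - x_j)^T M (x_i - x_j)."""
    diff = xi - xj
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
D, d_low = 5, 2                       # illustrative sizes (assumed)
L = rng.standard_normal((d_low, D))   # linear map L: R^D -> R^d_low
M = L.T @ L                           # PSD by construction, rank <= d_low

xi, xj = rng.standard_normal(D), rng.standard_normal(D)

# d(x_i, x_j) under M equals the squared Euclidean distance after mapping by L.
lhs = mahalanobis_sq(xi, xj, M)
rhs = float(np.sum((L @ xi - L @ xj) ** 2))

# M is rank-deficient, so two distinct points that differ only along its
# null space are at distance zero: distinguishability fails.
null_dir = np.linalg.svd(L)[2][-1]    # unit vector v with L @ v close to 0
pseudo = mahalanobis_sq(xi, xi + null_dir, M)

print(abs(lhs - rhs) < 1e-9, pseudo < 1e-12)
```

Parameterizing by $L$ rather than $M$ keeps the matrix positive semidefinite without any explicit constraint, at the price of a non-convex problem in $L$.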

2.1 Learning with Relative Similarity Measure
Similarity knowledge can be described in different forms. In this work, we focus on problems where the similarity observation takes the form of a similarity triplet, "A is more similar to B than to C", which is a special case of the similarity quadruple "A is more similar to B than C is to D". In this section, we review different approaches that have been taken to learn similarity from this kind of pairwise comparison. A formal description of learning from similarity triplets is presented in the next section. Different techniques have been studied for embedding data points based on similarity triplets. Agarwal et al. (2007) seek an embedding where inter-point Euclidean distances have the same ordering as a given set of dissimilarities. Tamuz et al. (2011) proposed a method to learn an embedding from crowd-sourced data alone, where the training set consists of triplet relations of the form "object A is more similar to B than to C". van der Maaten and Weinberger (2012) proposed a technique called t-distributed Stochastic Triplet Embedding (t-STE), which fulfills a similar task but uses a Student-t kernel to model the similarity so that the resulting embedding is more compact and clustered. McFee and Lanckriet (2009) consider the problem of learning an embedding subject to a set of constraints of the form "objects i and j are more similar than k and l". They encode such ordinal constraints on a graph and exploit the global structure of the graph to obtain an efficient embedding algorithm. McFee and Lanckriet (2011) formulated a novel multiple kernel learning technique for integrating heterogeneous data into a single unified similarity space. They also cope with similarity constraints in the form of pairwise comparisons and use them as side information. Pairwise similarity measurements can sometimes be inferred from other knowledge.
For example, supervision in the form of class labels can be treated as similarity information, as it is natural to consider two objects that belong to the same class to be more similar than two objects from different classes. For instance, Cui et al. (2013) developed a pairwise-constrained multiple metric learning algorithm for face recognition where similarity constraints are derived from class labels. Parameswaran and Weinberger (2010) consider the case of supervised learning and train multiple k-NN classifiers jointly by learning multiple metrics in the feature spaces. They use triplet similarity constraints inferred from supervision information and showed that the jointly learned metrics achieve higher classification accuracies than their single-metric counterpart.

2.2 Triplet Embedding
In this section, we assume that the embeddings are learned purely from similarity triplets. More specifically, given a set of triplets $S = \{(i, j, k) \mid z_i \text{ is more similar to } z_j \text{ than } z_k\}$, we aim to find an embedding $\{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$, for some $D \le N$, such that the pairwise comparison of Euclidean distances agrees with $S$, i.e.,
$$(i, j, k) \in S \implies \|x_i - x_j\|_2^2 < \|x_i - x_k\|_2^2.$$
This is the setting studied by van der Maaten and Weinberger (2012) and Tamuz et al. (2011). It is also a special case of the paired comparison setting proposed in Agarwal et al. (2007), which originates from the work of Shepard (1962a,b) and Kruskal (1964a,b). We give a brief review of these methods here.
Generalized Non-Metric Multidimensional Scaling (GNMDS)
GNMDS (Agarwal et al., 2007) seeks a low-rank kernel matrix $K = X^\top X$ such that the distance constraints are satisfied with a large margin. The trace norm of $K$ is minimized in order to approximately minimize its rank. Although GNMDS was originally proposed for problems where similarity constraints are posed as "i and j are more alike than k and l", it can easily be adapted to triplet constraints, as illustrated in van der Maaten and Weinberger (2012), and solves a problem of the form:
$$\min_{K} \; \mathrm{trace}(K) + C \sum_{(i,j,l) \in S} \xi_{ijl}, \quad \text{s.t.} \quad k_{jj} - 2k_{ij} - k_{ll} + 2k_{il} \le -1 + \xi_{ijl}, \quad \xi_{ijl} \ge 0, \quad K \succeq 0,$$
where $C$ is a regularization parameter and the $\xi_{ijl}$'s are slack variables. After the optimal $K$ is learned, the embedding $X$ is obtained via an SVD of $K$. Notice that this optimization problem is equivalent to using a hinge loss to penalize the violation of similarity constraints:
$$\min_{K} \; \mathrm{trace}(K) + C \sum_{(i,j,l) \in S} \max\{0,\; 1 + k_{jj} - 2k_{ij} - k_{ll} + 2k_{il}\}, \quad \text{s.t.} \quad K \succeq 0.$$
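The hinge-loss form of the GNMDS objective depends on the embedding only through entries of the Gram matrix. The sketch below illustrates this on a toy example; it is not the GNMDS solver itself. The function names are hypothetical, and the embedding is recovered with a symmetric eigendecomposition, which for a positive semidefinite $K$ plays the role of the SVD mentioned above.

```python
import numpy as np

def gram_sq_dist(K, i, j):
    """Squared distance between objects i and j read off a Gram matrix K."""
    return K[i, i] - 2.0 * K[i, j] + K[j, j]

def hinge_triplet_loss(K, triplets):
    """Sum of max(0, 1 + d_K(i, j) - d_K(i, l)) over triplets (i, j, l)."""
    return sum(max(0.0, 1.0 + gram_sq_dist(K, i, j) - gram_sq_dist(K, i, l))
               for i, j, l in triplets)

def embedding_from_gram(K, dim):
    """Recover a dim x N matrix X with X^T X close to K (top eigenpairs)."""
    w, V = np.linalg.eigh(K)                  # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:dim]
    return (V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))).T

# Toy example: three points on a line at 0, 1 and 5 (stored as columns).
X = np.array([[0.0, 1.0, 5.0]])
K = X.T @ X
satisfied = hinge_triplet_loss(K, [(0, 1, 2)])   # object 0 is closer to 1
violated = hinge_triplet_loss(K, [(0, 2, 1)])    # the reversed triplet
X_rec = embedding_from_gram(K, dim=1)            # equals X up to sign
```

Here the first triplet holds with a margin larger than 1 and incurs no loss, while the reversed triplet is penalized by $1 + 25 - 1 = 25$.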

Crowd Kernel Learning (CKL)
Tamuz et al. (2011) proposed the Crowd Kernel Learning method to deal with training triplets $(i, j, l)$ which are collected from annotators without domain expertise. They consider receiving the feedback "i is more similar to j than to l" from a random crowd member as a random event which occurs with a certain probability. In the learning stage, the goal is to minimize the empirical log-loss $\log(1/p_{ijl})$, where a higher probability $p_{ijl}$ means the triplet $(i, j, l)$ is modelled better. They take a scale-invariant model and define the probability $p_{ijl}$ as
$$p_{ijl} = \frac{\|x_i - x_l\|^2 + \mu}{\|x_i - x_j\|^2 + \|x_i - x_l\|^2 + 2\mu} = \frac{k_{ii} + k_{ll} - 2k_{il} + \mu}{(k_{ii} + k_{jj} - 2k_{ij}) + (k_{ii} + k_{ll} - 2k_{il}) + 2\mu},$$
where $\mu$ is added to avoid numerical instability and the $k_{ij}$'s are entries of the kernel matrix $K$. Their optimization problem can be written as:
$$\min_{K} \; -\frac{1}{|S|} \sum_{(i,j,l) \in S} \log p_{ijl}, \quad \text{s.t.} \quad k_{ii} = 1, \; i = 1, 2, \ldots, N, \quad K \succeq 0.$$
Stochastic Triplet Embedding
Based on the idea of CKL, van der Maaten and Weinberger (2012) developed the method of Stochastic Triplet Embedding. They proposed two ways to define the probability $p_{ijl}$:
1. STE
$$p_{ijl} = \frac{\exp(-\|x_i - x_j\|_2^2)}{\exp(-\|x_i - x_j\|_2^2) + \exp(-\|x_i - x_l\|_2^2)}$$
2. t-STE
$$p_{ijl} = \frac{\left(1 + \frac{\|x_i - x_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\left(1 + \frac{\|x_i - x_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}} + \left(1 + \frac{\|x_i - x_l\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}$$
They claimed that, during the iterations for numerically solving the optimization problem, using STE and t-STE tends to penalize/reward triplets that are violated/satisfied more locally and the influence of a triplet becomes very small when the triplet constraint is strongly violated. Thus,

it is much better at revealing the underlying data structure and their experiments showed that STE and t-STE produce more clustered embeddings. See van der Maaten and Weinberger (2012) for details. From the methods reviewed above, we observe that a commonly employed strategy is to define a loss function $\ell$ that measures how well the embedding models a triplet $(i, j, k)$ and solve a minimization problem of the form
$$\min_{X \in \mathbb{R}^{D \times N}} \; C \sum_{(i,j,k) \in S} \ell\left(\|x_i - x_j\|^2, \|x_i - x_k\|^2\right) + \|X\|_F^2, \qquad (2.2)$$
where $\|\cdot\|_F$ is the Frobenius norm and $C > 0$ is a regularization parameter. Since the squared Euclidean distance $\|x_i - x_j\|^2$ can be expressed as $\|x_i - x_j\|^2 = k_{ii} - 2k_{ij} + k_{jj}$ using a positive semidefinite Gram matrix $K \succeq 0$, minimization problem (2.2) can be equivalently rewritten as follows:
$$\min_{K \in \mathbb{R}^{N \times N}} \; C \sum_{(i,j,k) \in S} \ell\left(d_K(i, j), d_K(i, k)\right) + \mathrm{tr}(K), \quad \text{s.t.} \quad K \succeq 0, \qquad (2.3)$$
where $d_K(i, j) = k_{ii} - 2k_{ij} + k_{jj}$. Penalizing the trace of the Gram matrix can be seen as a convex surrogate for penalizing the rank of the Gram matrix (Agarwal et al., 2007; Fazel et al., 2001). After the optimal $K$ is learned, we recover the embedding $X = [x_1, \ldots, x_N]$ by the decomposition $K = X^\top X$. Since optimization problems (2.2) and (2.3) are equivalent up to rotation, we will interchangeably say "learning the embedding $X$" and "learning the metric $K$". However, optimization problem (2.3) is convex if the loss function $\ell$ is convex, while this is not true for (2.2).
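The triplet probability models reviewed in this section (CKL, STE and t-STE) can be compared side by side on a single triplet. This is a minimal pure-Python sketch under stated assumptions: the function names are hypothetical, and the default values `mu=0.05` and `alpha=1.0` are illustrative choices rather than settings taken from the cited papers.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def p_ckl(xi, xj, xl, mu=0.05):
    """CKL probability that i is judged closer to j than to l."""
    d_ij, d_il = sq_dist(xi, xj), sq_dist(xi, xl)
    return (d_il + mu) / (d_ij + d_il + 2.0 * mu)

def p_ste(xi, xj, xl):
    """STE probability with a Gaussian kernel."""
    a, b = math.exp(-sq_dist(xi, xj)), math.exp(-sq_dist(xi, xl))
    return a / (a + b)

def p_tste(xi, xj, xl, alpha=1.0):
    """t-STE probability with a heavy-tailed Student-t kernel."""
    def kernel(d):
        return (1.0 + d / alpha) ** (-(alpha + 1.0) / 2.0)
    a, b = kernel(sq_dist(xi, xj)), kernel(sq_dist(xi, xl))
    return a / (a + b)

# One triplet where j is clearly closer to i than l is.
xi, xj, xl = (0.0, 0.0), (1.0, 0.0), (3.0, 0.0)
for model in (p_ckl, p_ste, p_tste):
    p = model(xi, xj, xl)
    print(model.__name__, "p =", round(p, 4), "log-loss =", round(-math.log(p), 4))
```

In every model, swapping the roles of $j$ and $l$ gives the complementary probability, so each triplet is treated as a binary outcome.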

CHAPTER 3 JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES
While previous studies deal with the situation where there is one underlying similarity measure, we would like to consider the case where objects are compared from a few different aspects, i.e., there exist multiple measures of similarity. The most closely related work is that of McFee and Lanckriet (2011), Parameswaran and Weinberger (2010) and Cui et al. (2013). However, the problem we address has a more general setting and differs from the work mentioned above in two ways: 1) Existing work that deals with similarity triplets learns a single similarity metric, whereas we consider the case where there exist multiple similarity measures and aim to learn embeddings under those measures simultaneously. 2) The algorithms given by Parameswaran and Weinberger (2010) and Cui et al. (2013) require supervision and their target is to minimize classification error, whereas we consider an embedding task that is purely based on constraints in the form of similarity triplets. Finally, our approach is complementary to methods that seek to minimize user effort by actively collecting triplet comparisons (Jamieson and Nowak, 2011; Tamuz et al., 2011) and to better interfaces for triplet collection (Wilber et al., 2014). In this chapter, we first give a formal formulation of the problem of jointly learning multiple similarities and introduce our approach to solving it. We then discuss issues such as the influence of parameters and computational complexity.
3.1 Formulation
Assume that $T$ sets of triplets $S_1, \ldots, S_T$ are obtained by asking labelers to focus on a specific aspect when making pairwise comparisons. In addition, a set of triplets $S_0$ corresponding to a global (unspecific) notion of similarity may be available. Our goal is to obtain an embedding for the global notion of similarity as well as embeddings corresponding to local views.
To this end, we take a hybrid approach that combines (2.2) and (2.3): we use the explicit parameterization (2.2) for the global embedding of each object and the Gram-matrix-based parameterization (2.3) for modeling each view.

Let $x_1, \ldots, x_N \in \mathbb{R}^D$ be vectors corresponding to the global embedding of the objects as above. Each local view is associated with a Gram matrix $M_t$ and the underlying metric is defined as a (squared) Mahalanobis distance $d_{M_t}(x_i, x_j) := (x_i - x_j)^\top M_t (x_i - x_j)$. We formulate the learning problem as follows:
$$\min_{x_1, \ldots, x_N,\; M_1, \ldots, M_T} \; \sum_{(i,j,k) \in S_0} \ell\left(\|x_i - x_j\|^2, \|x_i - x_k\|^2\right) + \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell\left(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)\right) + \gamma \sum_{t=1}^{T} \mathrm{tr}(M_t) + \beta \|X\|_F^2, \qquad (3.1)$$
$$\text{s.t.} \quad M_t \succeq 0 \quad (t = 1, \ldots, T),$$
where $\ell$ is a loss function that measures how well the embedding models a triplet. We use the hinge loss in the experiments in this work, but the proposed framework readily generalizes to other loss functions proposed in the literature, such as those of Tamuz et al. (2011) and van der Maaten and Weinberger (2012). We employ regularization terms for both the local metrics $M_t$ and the global embedding vectors $x_i$ in (3.1). Penalties on $\mathrm{tr}(M_t)$ are added to the optimization objective, as the trace norm is known to be effective in producing low-rank solutions (Agarwal et al., 2007; Fazel et al., 2001). The penalty on the norm of the $x_i$'s is necessary because, without this term, we may reduce the trace of $M_t$ by scaling down $M_t$ while scaling up the $x_i$'s simultaneously, which significantly weakens the effect of the trace term. Although objective function (3.1) is non-convex, if we fix the values of the $x_i$'s and choose a convex loss function, e.g., the hinge loss, then it becomes a convex problem with respect to the $M_t$'s, and the $M_t$'s can be learned independently since they appear in disjoint terms. To minimize the objective, we update the $M_t$'s and $x_i$'s alternately. When the $M_t$'s are fixed, the $x_i$'s are updated directly via gradient descent (Algorithm 1). When the $x_i$'s are fixed, the $M_t$'s can be updated independently via projected gradient descent, i.e., by iteratively taking a sub-gradient step and projecting the resulting $M_t$ onto the positive semidefinite cone (Algorithm 2). The number of dimensions $D$ is left as a hyperparameter.
The algorithm is summarized in Algorithm 3.
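To make the structure of objective (3.1) concrete, the following sketch evaluates it with the hinge loss on a toy problem. It is an illustration under assumptions, not the thesis implementation: the function names, the column-wise layout of `X` and the toy data are all hypothetical. The global view is handled by treating $M_0$ as the identity, as Algorithm 1 does.

```python
import numpy as np

def hinge(d_near, d_far):
    """Hinge loss on a triplet: zero once d_far exceeds d_near by a margin of 1."""
    return max(0.0, 1.0 + d_near - d_far)

def joint_objective(X, Ms, triplets_by_view, beta, gamma):
    """Objective (3.1) with hinge loss.

    X: D x N global embedding (columns are objects).
    Ms: list of T PSD D x D view metrics M_1, ..., M_T.
    triplets_by_view: list of T + 1 triplet lists; index 0 is the global set S_0.
    """
    D = X.shape[0]
    metrics = [np.eye(D)] + list(Ms)          # M_0 := I_D for the global view
    loss = 0.0
    for M, triplets in zip(metrics, triplets_by_view):
        for i, j, k in triplets:
            dij = X[:, i] - X[:, j]
            dik = X[:, i] - X[:, k]
            loss += hinge(dij @ M @ dij, dik @ M @ dik)
    reg = gamma * sum(np.trace(M) for M in Ms) + beta * np.sum(X ** 2)
    return loss + reg

# Two views that each look at one coordinate of a shared 2-D embedding.
X = np.array([[0.0, 1.0, 5.0],
              [0.0, 4.0, 0.0]])
Ms = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
# View 1 says object 0 is closer to 1; view 2 says object 0 is closer to 2.
triplets_by_view = [[], [(0, 1, 2)], [(0, 2, 1)]]
value = joint_objective(X, Ms, triplets_by_view, beta=0.1, gamma=0.5)
```

In this toy case the two views impose contradicting triplets on the same objects, yet both are satisfied with zero hinge loss because each metric $M_t$ selects a different subspace of the shared embedding; only the regularization terms contribute to `value`.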

Algorithm 1: Update $x_i$'s
Input: $x_i$, $i = 1, 2, \ldots, N$; $M_t$, $t = 1, 2, \ldots, T$; $\beta$.
Output: $x_i$, $i = 1, 2, \ldots, N$.
Initialization: choose a proper initial step size $\eta_0$; set $M_0 := I_D$; set $\eta \leftarrow \eta_0$; set iteration counter $m = 1$.
while not converged do
    For $n = 1, \ldots, N$, let $g_n = 2\beta x_n + \sum_{t=0}^{T} \sum_{(i,j,k) \in S_t :\, n \in (i,j,k)} \nabla_{x_n} \ell\left(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)\right)$;
    Update $x_n \leftarrow x_n - \eta g_n$ $(n = 1, \ldots, N)$;
    Update counter: $m \leftarrow m + 1$;
    Update step size: $\eta \leftarrow \eta_0 / m$;
end

Algorithm 2: Update $M_t$
Input: $x_i$, $i = 1, 2, \ldots, N$; $M_t$; $\gamma$.
Output: $M_t$.
Initialization: choose a proper initial step size $\eta_0$; set $\eta \leftarrow \eta_0$; set iteration counter $m = 1$.
while not converged do
    Take a step in the direction of the negative sub-gradient: $G \leftarrow \sum_{(i,j,k) \in S_t} \nabla_{M_t} \ell\left(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)\right) + \gamma I_D$, $M_t \leftarrow M_t - \eta G$;
    Find the eigenvalue decomposition $M_t = V \Lambda V^\top$;
    Project $M_t$ onto the PSD cone: $M_t \leftarrow V \max(\Lambda, 0) V^\top$;
    Update counter: $m \leftarrow m + 1$;
    Update step size: $\eta \leftarrow \eta_0 / m$;
end
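The metric update of Algorithm 2 can be sketched in a few lines of NumPy. This is a hedged illustration rather than the author's code: the function names are hypothetical, the loss is assumed to be the hinge loss, a fixed iteration count `n_iter` stands in for the convergence test, and the step size follows the $\eta_0 / m$ schedule as printed in the algorithm.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    M = (M + M.T) / 2.0                       # guard against round-off asymmetry
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def metric_subgradient(X, M, triplets, gamma):
    """Sub-gradient of the view-t terms of (3.1) w.r.t. M under hinge loss."""
    G = gamma * np.eye(X.shape[0])            # gradient of gamma * tr(M)
    for i, j, k in triplets:
        dij = X[:, i] - X[:, j]
        dik = X[:, i] - X[:, k]
        if 1.0 + dij @ M @ dij - dik @ M @ dik > 0.0:   # hinge is active
            G += np.outer(dij, dij) - np.outer(dik, dik)
    return G

def update_metric(X, M, triplets, gamma, eta0=0.1, n_iter=100):
    """Projected sub-gradient descent on M (fixed iteration budget, assumed)."""
    for m in range(1, n_iter + 1):
        G = metric_subgradient(X, M, triplets, gamma)
        M = project_psd(M - (eta0 / m) * G)
    return M

# Toy view: object 0 should be closer to object 2 than to object 1,
# which the identity metric violates on this embedding.
X = np.array([[0.0, 1.0, 5.0],
              [0.0, 4.0, 0.0]])
M_new = update_metric(X, np.eye(2), [(0, 2, 1)], gamma=0.01)
d02 = (X[:, 0] - X[:, 2]) @ M_new @ (X[:, 0] - X[:, 2])
d01 = (X[:, 0] - X[:, 1]) @ M_new @ (X[:, 0] - X[:, 1])
```

On this example the learned metric down-weights the first coordinate, so `d02` ends up below `d01`. Each projection costs one eigendecomposition of a $D \times D$ matrix.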

Algorithm 3: Multiple-metric Learning
Input: $N$, the number of objects; $D$, dimension of embedding; $S_t$, $t = 0, 1, 2, \ldots, T$, triplet constraints; regularization parameters $\beta$, $\gamma$.
Output: Embedding $\{x_i\}_{i=1}^{N}$, PSD matrices $\{M_t\}_{t=1}^{T}$.
Initialization: initialize the $x_i$'s randomly; initialize the $M_t$'s as identity matrices.
while not converged do
    Update the $x_i$'s by using Algorithm 1;
    for $t \in \{1, 2, \ldots, T\}$ do
        Update $M_t$ by using Algorithm 2;
    end
end

3.2 Relation with Learning Multiple Independent Similarities
Consider the problem of embedding $N$ objects into $D$-dimensional spaces from $T$ sets of similarity triplets $S_1, S_2, \ldots, S_T$. By independent learning or independent embedding, we mean independently learning $T$ embeddings $X_t$, $t = 1, 2, \ldots, T$. Each $X_t$ is learned from $S_t$ only. In contrast, our proposed model learns a global embedding $X$ and $T$ view-specific metrics $M_t$, and all $S_t$'s are used simultaneously. A simple parameter counting argument tells us that independent learning requires fitting $O(NDT)$ parameters. On the other hand, our joint learning model has $O(ND + D^2 T)$ parameters, which is fewer when $D < N$. By comparing the families of embeddings that can be represented by each of these two models, it is not difficult to see that independently embedding into $T$ spaces, each of which has dimension $D$, is a special case of jointly embedding in a $TD$-dimensional space. Analytically, solving optimization problem (2.2) independently on $T$ views can be expressed as
$$\min_{x_1, \ldots, x_N \in \mathbb{R}^{T D_{\mathrm{ind}}}} \; \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell\left(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)\right) + \beta \|X\|_F^2,$$
$$\text{where} \quad M_t(p, q) = \begin{cases} 1, & p = q \in \{(t-1) D_{\mathrm{ind}} + 1, (t-1) D_{\mathrm{ind}} + 2, \ldots, t D_{\mathrm{ind}}\}, \\ 0, & \text{otherwise,} \end{cases}$$
which is a special case of the joint learning problem (3.1). The same idea can be reached by approaching from the other direction. Consider a multiview embedding represented by the joint embedding parameterized by $X \in \mathbb{R}^{D \times N}$ and $M_t \in \mathbb{R}^{D \times D}$, $t = 1, 2, \ldots, T$.
If we constrain the local metrics $M_t$ to be pairwise orthogonal, i.e., $\langle M_i, M_j \rangle = 0$ for all $i \ne j$, then each $M_t$ is associated with a subspace $W_t \subseteq \mathbb{R}^D$. Two $W_t$'s associated with

different $M_t$'s are mutually orthogonal. Any $x \in \mathbb{R}^D$ can be decomposed as $x = \sum_{t=1}^{T} y_t + z$ where $y_t \in W_t$ and $z \in \left(\bigoplus_{t} W_t\right)^{\perp}$. In this special case, finding a global embedding $X$ reduces to seeking embeddings in the $W_t$'s independently. In conclusion, jointly embedding in a $D$-dimensional space subsumes independent learning in $T$ spaces with dimensions $D_t$ as a special case, if $\sum_{t=1}^{T} D_t \le D$.
3.3 Influence of Parameters
The objective function in (3.1) has two parameters $\beta$ and $\gamma$ in the regularization terms. For all the loss functions mentioned in Sec. 2.2, the value of the loss depends only on the distances $d_{M_t}(x_i, x_j)$ and $d_{M_t}(x_i, x_k)$. For any scalar $\alpha > 0$, by the definition of the Mahalanobis distance (2.1), it is easy to see that $d_{M_t/\alpha^2}(\alpha x_i, \alpha x_j) = d_{M_t}(x_i, x_j)$ for all $x_i$ and $x_j$. This implies that, if we substitute $x_i' = \alpha x_i$ and $M_t' = \frac{1}{\alpha^2} M_t$ for $x_i$ and $M_t$ in the objective function of (3.1), we can alter the value of the regularization term while keeping the same loss. Meanwhile, due to the arithmetic-mean-geometric-mean inequality, the value of the regularization term is lower bounded by
$$\beta \alpha^2 \|X\|_F^2 + \frac{\gamma}{\alpha^2} \sum_{t=1}^{T} \mathrm{tr}(M_t) \ge 2 \sqrt{\beta \gamma \, \|X\|_F^2 \sum_{t=1}^{T} \mathrm{tr}(M_t)}$$
and the lower bound is reached when $\alpha$ takes the value
$$\alpha = \left( \frac{\gamma \sum_{t=1}^{T} \mathrm{tr}(M_t)}{\beta \|X\|_F^2} \right)^{1/4}. \qquad (3.2)$$
On the other hand, if the $M_t$'s and $X$ are the optimal solution of (3.1), they must reach the lower bound above. (Otherwise, we could scale them by the $\alpha$ that gives the lower bound and reduce the value of the objective function further.) Therefore, although we have two hyperparameters $\beta$ and $\gamma$, the effective regularization depends only on the value of $\beta\gamma$. Since Algorithm 3 updates the $x_i$'s and $M_t$'s alternately, the lower bound attained at (3.2) may not be reached numerically. As a heuristic, we could add an optional step to the learning algorithm: after one update of the $x_i$'s and $M_t$'s, we scale them simultaneously to reach the lower bound on the regularization term. See Algorithm 4.
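The scaling argument can be verified numerically. The sketch below is illustrative only: the helper names `regularizer` and `balance`, the random data and the values of `beta` and `gamma` are assumptions. It rescales $X$ and the $M_t$'s by the $\alpha$ of (3.2) and checks that the Mahalanobis distances are unchanged while the regularization term drops to the AM-GM lower bound.

```python
import numpy as np

def regularizer(X, Ms, beta, gamma):
    """Regularization term of (3.1): beta * ||X||_F^2 + gamma * sum_t tr(M_t)."""
    return beta * np.sum(X ** 2) + gamma * sum(np.trace(M) for M in Ms)

def balance(X, Ms, beta, gamma):
    """Rescale X and the M_t's by the alpha of (3.2); distances are unchanged."""
    total_trace = sum(np.trace(M) for M in Ms)
    alpha = (gamma * total_trace / (beta * np.sum(X ** 2))) ** 0.25
    return alpha * X, [M / alpha ** 2 for M in Ms], alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))               # D = 3, N = 6 (illustrative)
A = rng.standard_normal((3, 3))
Ms = [A @ A.T, np.eye(3)]                     # two PSD view metrics
beta, gamma = 2.0, 0.5

X2, Ms2, alpha = balance(X, Ms, beta, gamma)

# The distance between objects 0 and 1 under the first view is preserved.
diff, diff2 = X[:, 0] - X[:, 1], X2[:, 0] - X2[:, 1]
d_before = diff @ Ms[0] @ diff
d_after = diff2 @ Ms2[0] @ diff2

# After balancing, the regularizer equals the AM-GM lower bound.
bound = 2.0 * np.sqrt(beta * gamma * np.sum(X ** 2) * sum(np.trace(M) for M in Ms))
print(np.isclose(d_before, d_after), np.isclose(regularizer(X2, Ms2, beta, gamma), bound))
```

Because the bound depends on $\beta$ and $\gamma$ only through their product, a one-dimensional search over $\beta\gamma$ would in principle suffice once this rescaling step is applied.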

Algorithm 4: Multiple-metric Learning with Scaling
Input: $N$, the number of objects; $D$, dimension of embedding; $S_t$, $t = 0, 1, 2, \ldots, T$, triplet constraints; regularization parameters $\beta$, $\gamma$.
Output: Embedding $\{x_i\}_{i=1}^{N}$, PSD matrices $\{M_t\}_{t=1}^{T}$.
Initialization: initialize the $x_i$'s randomly; initialize the $M_t$'s as identity matrices.
while not converged do
    Update the $x_i$'s by using Algorithm 1;
    for $t \in \{1, 2, \ldots, T\}$ do
        Update $M_t$ by using Algorithm 2;
    end
    Scale the $x_i$'s and $M_t$'s: $\alpha \leftarrow \left( \frac{\gamma \sum_{t=1}^{T} \mathrm{tr}(M_t)}{\beta \|X\|_F^2} \right)^{1/4}$, $X \leftarrow \alpha X$, $M_t \leftarrow M_t / \alpha^2$ $(t = 1, \ldots, T)$;
end

3.4 Alternative Formulations and Computational Complexity
The joint learning model (3.1) parametrizes similarity metrics with a single global embedding $X$ and multiple local metrics $M_t$. Alternatively, we could model the global embedding implicitly by using a positive semidefinite kernel matrix $K$, which can be considered as the Gram matrix: $K = \Phi^\top \Phi$ for some $\Phi \in \mathbb{R}^{D \times N}$, whose columns can be regarded as the objects' representations in a $D$-dimensional space. Then, we can model the embedding of the $t$-th view with a linear transformation $L_t \in \mathbb{R}^{D \times D}$, so that $L_t \Phi$ is the embedding for the $t$-th view. If we choose the regularization for the $L_t$'s appropriately, by the generalized representer theorem (Schölkopf et al., 2001), $L_t$ can be expressed as $L_t = W_t \Phi^\top$ for some $W_t \in \mathbb{R}^{D \times N}$. Now, the Gram matrix of the $t$-th view becomes
$$K_t = \left( (W_t \Phi^\top) \Phi \right)^\top (W_t \Phi^\top) \Phi = K W_t^\top W_t K,$$
and the distance between objects $i$ and $j$ can be expressed as
$$d_t(i, j) = K_t(i, i) - 2 K_t(i, j) + K_t(j, j) = \left( K(i, :) - K(j, :) \right) W_t^\top W_t \left( K(:, i) - K(:, j) \right).$$
The problem of learning multiple similarity metrics can be formulated as

$$\min_{W_1, \ldots, W_T,\; K} \; \sum_{(i,j,k) \in S_0} \ell\left( \|K(i, :) - K(j, :)\|^2, \|K(i, :) - K(k, :)\|^2 \right) + \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell\left( d_t(i, j), d_t(i, k) \right) + \gamma \sum_{t=1}^{T} \|W_t\|_F^2 + \beta \|K\|_F^2, \quad \text{s.t.} \quad K \succeq 0.$$
In this formulation, the objective function is convex neither in $K$ nor in the $W_t$'s. Since the $W_t$'s only appear as inner products, we can substitute $M_t = W_t^\top W_t$, but $M_t$ will then have size $N \times N$. Either way, solving the optimization problem would involve eigendecomposition of the $N \times N$ matrix $K$, which could be computationally expensive. In contrast, formulation (3.1) is convex in the $M_t$'s and involves eigendecomposition of the $D \times D$ matrices $M_t$, which is less expensive when $D < N$. However, compared with independent learning, solving the joint learning problem (3.1) with Algorithm 4 is empirically much slower, presumably because it involves alternating updates of the $M_t$'s and $x_i$'s.
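The identities behind the kernel-based alternative can be checked on random data. The sketch below is a numerical sanity check under assumptions (random `Phi` and `W`, illustrative sizes), not part of the thesis: it confirms that the view embedding $L_t \Phi$ with $L_t = W_t \Phi^\top$ has Gram matrix $K W_t^\top W_t K$, and that view distances can be computed from rows of $K$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 5                              # illustrative sizes (assumed)
Phi = rng.standard_normal((D, N))        # implicit global representation
W = rng.standard_normal((D, N))          # representer coefficients of one view
K = Phi.T @ Phi                          # N x N global Gram matrix

L = W @ Phi.T                            # D x D view transformation L_t
Z = L @ Phi                              # view-specific embedding L_t Phi
K_view = K @ W.T @ W @ K                 # Gram matrix of the view

i, j = 0, 3
d_embed = np.sum((Z[:, i] - Z[:, j]) ** 2)                    # from the embedding
d_gram = K_view[i, i] - 2 * K_view[i, j] + K_view[j, j]       # from K_t
row_diff = K[i, :] - K[j, :]
d_rows = row_diff @ W.T @ W @ row_diff                        # from rows of K

print(np.allclose(Z.T @ Z, K_view), np.isclose(d_embed, d_gram), np.isclose(d_gram, d_rows))
```

The check also makes the cost difference visible: everything in this formulation is expressed through the $N \times N$ matrix `K`, whereas formulation (3.1) only ever decomposes $D \times D$ matrices.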

CHAPTER 4 EXPERIMENTS
In this chapter, we test our algorithm on both synthetic data and real datasets. One of the real datasets consists of images of airplanes with different poses, where similarity between two poses is defined rigorously. Another real dataset uses triplets sampled from a few kernel matrices which are computed from attributes of facial images. The third real dataset contains images of 200 species of birds, where similarity triplets among the birds are crowd-sourced. We first give a brief explanation of the design of the experiments and introduce the datasets. Experimental results follow.
4.1 Experimental Setup
On each dataset, we inspect the quality of embeddings learned from training sets of increasing size. We draw approximately equal numbers of training triplets from all views and check the quality of the embedding. In addition, we inspect how similarity knowledge on existing views can be transferred to a new view where the number of similarity comparisons is small. We do this by conducting an experiment in which we draw a small set of training triplets from one view but use large numbers of training triplets from the remaining views. The quality of embeddings is measured from the following aspects:
1. Triplet generalization error. We split triplets randomly into a training set and a test set. Triplet generalization error is defined as the percentage of test triplets whose triplet relations are not correctly modelled by the learned embedding.
2. Leave-one-out classification error. We hold out all information about the objects' class labels during the training stage. Only triplet constraints are used for learning the embedding. Then, at the test stage, we choose one embedded object as the target and predict its label by revealing the labels of its neighbours. We do this prediction for every object in turn. The leave-one-out classification error is the percentage of objects whose labels are not correctly predicted.
Throughout the experiments, we use a 3-nearest-neighbour classifier to test classification error. 16
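The two quality measures above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the experiments. The function names, the optional Mahalanobis matrix `M` and the tie-breaking rule of the classifier are assumptions made for the example.

```python
def sq_dist(x, y, M=None):
    """Squared distance: Euclidean if M is None, else (x - y)^T M (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    if M is None:
        return sum(v * v for v in d)
    n = len(d)
    return sum(d[p] * M[p][q] * d[q] for p in range(n) for q in range(n))


def triplet_error(X, triplets, M=None):
    """Fraction of triplets (i, j, k) for which i is NOT closer to j than to k."""
    wrong = sum(1 for i, j, k in triplets
                if sq_dist(X[i], X[j], M) >= sq_dist(X[i], X[k], M))
    return wrong / len(triplets)


def loo_knn_error(X, labels, k=3, M=None):
    """Leave-one-out k-nearest-neighbour classification error."""
    n = len(X)
    wrong = 0
    for i in range(n):
        # Rank all other objects by distance to the held-out object i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(X[i], X[j], M))
        votes = {}
        for j in others[:k]:
            votes[labels[j]] = votes.get(labels[j], 0) + 1
        # Majority vote; on a tie the label of the nearer neighbour wins
        # (an assumption of this sketch, the text does not specify it).
        predicted = max(votes, key=votes.get)
        if predicted != labels[i]:
            wrong += 1
    return wrong / n


# Toy example: two well-separated groups of three points each.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = ["a", "a", "a", "b", "b", "b"]
test_triplets = [(0, 1, 3), (0, 3, 1)]  # the second triplet is violated
print(triplet_error(X, test_triplets), loo_knn_error(X, labels, k=3))
```

Both functions return fractions. Multiply by 100 to obtain the percentages reported in the text.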

van der Maaten and Weinberger (2012) showed that reducing triplet generalization error often leads to small nearest-neighbour classification error, although a small triplet generalization error does not necessarily guarantee low classification error. We also inspect embeddings obtained by simultaneously learning multiple metrics from this aspect. Throughout the experiments, we use the hinge loss as the loss function. Our joint learning method is compared with two baselines: independent learning and pooled single view embedding. In independent learning, we conduct triplet embedding on every view as if the views were independent tasks. In pooled single view embedding, triplets collected from all views are merged together to learn a single embedding, as if they came from the same similarity metric. When learning a single embedding, we adopt problem (2.2) as the objective and solve it using the package provided by the authors of van der Maaten and Weinberger (2012). To tune the regularization parameters, we further split the training sets for a 5-fold cross-validation and sweep over $\{10^{-5}, 10^{-4}, \dots, 10^{5}\}$ for all parameters. The datasets are described next.

4.2 Datasets

Synthetic Data

Two synthetic datasets are generated. One consists of 200 points uniformly sampled from a 20 dimensional unit hypercube. The other is generated in a 10 dimensional space and has 200 objects drawn from a mixture of four Gaussians, each of which has variance 1 and is randomly centered in a hypercube with side length 10. Six views are generated on each dataset. Every view is produced by projecting the data points onto a random five dimensional subspace. Training and test triplets (i, j, k) are randomly sampled from all possible triplets on every view.

Poses of Airplanes

This dataset is constructed from 200 images of airplanes from the PASCAL VOC dataset (Everingham et al., 2010), which are annotated with 16 landmarks such as nose tip, wing tips, etc. (Bourdev et al., 2010).
We use these landmarks to construct a pose-based similarity. Given two planes and the positions of landmarks in these images, pose similarity is defined as the residual error of alignment between the two sets of landmarks under scaling and translation. We generated 5 views each

of which is associated with a subset of these landmarks, as seen in Fig. 4.1, which shows three annotated images from the set. The planes are highly diverse, ranging from passenger planes to fighter jets and varying in size and form, which results in a slightly different similarity between instances for each view. However, there is a strong correlation between the views because the underlying set of landmarks is shared.

Figure 4.1: View specific similarities between poses of planes are obtained by considering subsets of landmarks, shown by different colored rectangles, and measuring their similarity in configuration up to a scaling and translation. For example, view 1 consists of all landmarks. (Panels: view 1 to view 5.)

Additionally, we categorize the planes into five classes (left-facing, right-facing, pointing-up, pointing-down and facing out or facing away) to evaluate classification error. This produces classes with unbalanced numbers of members. About 80% of the images belong to one of these three classes: left-facing, right-facing and facing out or facing away.

Public Figures Face Data

The Public Figures Face Database was created by Kumar et al. (2009). It consists of 58,797 images of 200 people. Every image is characterized by 75 real-valued attributes that describe the appearance of the person in the image. We selected 39 of the attributes and categorized them into 5 groups according to the aspects they describe: hair, age, accessory, shape and ethnicity. We randomly selected ten people and drew 20 images for each of them to create a dataset with 200

images. Similarity between instances for a given group is equal to the dot product between their attribute vectors, where the attributes are restricted to those in the group. We describe the details of these attributes in the Appendix. Each group is considered as a local view, and the identities of the people in the images are considered as class labels.

Figure 4.2: Perceptual similarities between bird species were collected by showing users either the full image (view 1), or crops around various parts (views 2, 3, ..., 6). Images were taken from the CUB dataset (Welinder et al., 2010) containing 200 species of birds.

CUB-200 Birds Data

We use the birds dataset introduced by Welinder et al. (2010), which contains images of birds categorized into 200 species. We consider the problem of embedding these species based on similarity feedback from human users. Similarity triplets among species were collected in a crowd-sourced manner: each time, a user is asked to judge the similarity between an image of a bird from the target species $z_i$ and nine images of birds of different species $\{z_k\}_{k \in \mathcal{K}}$ using the interface of Wilber et al. (2014), where $\mathcal{K}$ is the set of all 200 species. For each display, the user partitions these nine images into two sets, $K_{sim}$ and $K_{dissim}$, with $K_{sim}$ containing birds considered similar to the target and $K_{dissim}$ containing the ones considered dissimilar. Such a partition is expanded into an equivalent set of triplet constraints on the associated species, $\{(i, j, l) \mid j \in K_{sim},\ l \in K_{dissim}\}$. Therefore, for each user response, $|K_{sim}| \cdot |K_{dissim}|$ triplet constraints are acquired. In the setting of multiple metric embedding, different views of birds are obtained by cropping regions around various parts of the bird, as shown in Fig. 4.2, and then using the same procedure as before to collect triplet comparisons.
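The expansion of one user response into triplet constraints can be sketched as below. The function name and the integer species indices are hypothetical; only the rule "every similar species paired with every dissimilar species" comes from the text.

```python
from itertools import product


def partition_to_triplets(target, similar, dissimilar):
    """Expand one response into triplets (i, j, l): i is more similar to j than to l."""
    return [(target, j, l) for j, l in product(similar, dissimilar)]


# One display: target species 0 and nine candidate species,
# two judged similar and seven judged dissimilar.
triplets = partition_to_triplets(0, [3, 7], [1, 2, 4, 5, 6, 8, 9])
print(len(triplets))  # |K_sim| * |K_dissim| = 2 * 7 = 14
```

A response that places all nine images in one set yields no triplets, since the product with an empty set is empty.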
In this dataset, about 100,000 triplets come from comparisons made on the whole birds, while there are 5 other views, each of which

is cast on a localized region of the birds (e.g., beak, breast, wing). The number of triplets obtained from these views ranges from about 4,000 to 7,000. For testing classification error, we use a taxonomy of the bird species provided by Welinder et al. (2010). To balance the number of objects across classes, we manually grouped some of the classes to get 6 super classes in total. We note that these global and local similarity triplets were used by Wah et al. (2014a) and Wah et al. (2014b) as a way to incorporate human feedback during recognition. However, our focus is to learn better embeddings of the data itself by combining information across different views.

4.3 Experimental Results

Synthetic Data

We first conduct experiments on the synthetic datasets. The synthetic data with clusters is embedded in 5 and 10 dimensional spaces. The uniformly distributed synthetic data is embedded in 10 and 20 dimensional spaces. Triplet generalization errors and leave-one-out 3-nearest-neighbour classification errors are plotted in Fig. 4.3 and Fig. 4.4. The small plots on the right illustrate the view-wise test errors while the big plot on the left shows the average across all views. The dataset with clusters was sampled from a 10 dimensional space and each local view lies in a 5 dimensional space. Our algorithm achieves significant improvement on small training samples and continues to perform better than the baselines, except that independent learning obtains a lower triplet generalization error than joint learning on training sets with nearly 40,000 triplets. On the dataset that is uniformly sampled from a 20 dimensional space, joint embedding reduces triplet generalization error faster than independent embedding and pooled embedding as the number of training triplets increases, up to around 20,000 triplets. At this point, joint embedding in a 10 dimensional space is outperformed by its independent learning counterpart.
However, joint embedding in a 20 dimensional space has a lower error rate than the baselines in all cases.

Poses of Airplanes

The airplanes are embedded into a 10 dimensional space based on a training set which consists of 3,000 triplets from every view. As an illustration, we project the learned global view of the objects onto their first two principal dimensions via SVD and show the embedding in Fig. 4.5. The visualization shows that the objects roughly lie on a circle. Meanwhile, the same figure shows a clear clustered structure, and objects belonging to each of the three dominant classes are clustered

Figure 4.3: Experimental results on synthetic data with clusters. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. (Curves: joint, ind and pooled, each in 5d and 10d; horizontal axis: number of training triplets across all views.)

together, except that members of the class facing out or facing away are separated into two clusters. Fig. 4.6 shows the triplet generalization errors and classification errors of the learned embeddings. Errors on local views are shown in small plots while the large plots show the average across views. Results of independent learning are included in the same figure. It shows that our algorithm almost always performs better in predicting both triplets and classes when using fewer than 10,000 training triplets. However, this advantage disappears when the training set becomes larger.

Figure 4.4: Triplet generalization errors on uniformly distributed synthetic data. (Curves: joint, ind and pooled, each in 10d and 20d; horizontal axis: number of training triplets across all views.)

Public Figures Face Dataset

The 200 images are embedded into 5, 10 and 20 dimensional spaces. We draw triplets randomly from the ground truth similarity measure to form training and test sets. See Fig. 4.7 for a visualization of the data embedded under the metric of the first view. Triplet generalization errors and classification errors are shown in Fig. 4.8. In terms of triplet generalization errors of 5 and 10 dimensional embeddings, we find that joint learning performs better on small training sets, while its error rate exceeds that of independent learning as the training set gets larger. Joint learning and independent learning have comparable triplet generalization error when the embedding is conducted in a 20 dimensional space. Pooled embedding has the highest triplet generalization error among the three methods. In terms of leave-one-out classification error, joint learning generally has a lower error rate than both baselines, except that pooled embedding performs best on training sets with more than about 10,000 triplets.

Figure 4.5: The global view of embeddings of poses of planes (joint embedding). (Legend: pointing up, pointing down, left facing, right facing, facing out/away.)

CUB-200 Birds Data

The bird dataset is more challenging in the sense that it arises in a more realistic setting: because it is crowd-sourced, triplet relations among the bird species are not all available. We learn the embedding both in a 10 dimensional space and in a 60 dimensional space. An illustration of the embedding learned in the 60 dimensional space can be found in Fig. 4.9. In the birds dataset, more triplets are available from the first view than from the other views. During the experiment, we first sample equal numbers of triplets from each view to form the training set. We keep adding triplets to the training set until it has 3,000 triplets from each view. Then, we add triplets only from the first view and test the embedding in all views. See Fig. 4.10 for triplet generalization errors and leave-one-out 3-nearest-neighbour classification errors. The solid vertical line marks the point at which we start to add training triplets only to the first view. We found that, unlike on the previous datasets, our algorithm does not work better on this dataset in terms of triplet generalization error. However, the triplet generalization error of learning multiple views jointly in a 60 dimensional space is comparable to that obtained by embedding every view independently. In terms of leave-one-out classification error, our method obtains lower error on all views except the first.

Figure 4.6: Results on poses of planes dataset. Embedding is learned in 3 and 10 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average. (Curves: joint, ind and pooled, each in 3d and 10d; horizontal axis: number of training triplets across all views.)

Figure 4.7: Illustration of public figures face data embedded under the metric of the first view: hair. Points with the same color belong to the same person. Embeddings are learned in a 10 dimensional space using $10^5$ triplets and then further embedded in a 2 dimensional plane using t-SNE (van der Maaten and Hinton, 2008) for the purpose of visualization. Left: embeddings learned independently. Right: embeddings learned jointly.

4.4 Learning a New View (Zero-Shot Learning)

On the CUB-200 birds dataset, we simulated a setting that is similar to zero-shot learning (Palatucci et al., 2009). We draw a training set containing 100 triplets from the 2nd local view and 3,000 triplets from the other 5 views, and we investigate how joint learning helps create an embedding on the 2nd local view. The learned embedding is shown in Fig. 4.11. It is evident that the embedding learned purely from 100 triplets of that view has objects from different classes completely mixed together, while the local view from the jointly learned embedding groups objects from some classes better. For example, members of the class Passeriformes (Emberizidae) are more clearly separated from members of Passeriformes (Icteridae). Meanwhile, the last plot in the same figure shows the change in triplet generalization error as we add more training triplets from the 2nd local view. We can see that joint learning has lower triplet generalization error when the number of training triplets is small but, as expected, it is soon matched and then outperformed by independent learning as the training set gets larger.

4.5 Discussion

We have shown through experiments that our algorithm achieves lower triplet generalization errors on synthetic data, airplane poses data and public figure faces data when the training set is relatively small.
Since in some real applications, similarity triplets can be expensive to obtain, jointly learning similarity metrics is preferable as it can recover the underlying structure using relatively small

Figure 4.8: Results on public figures faces dataset. Embeddings are learned in 5, 10 and 20 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average. (Curves: joint, ind and pooled, each in 5d, 10d and 20d; horizontal axis: number of training triplets across all views.)

Figure 4.9: Illustration of CUB-200 birds data. The figure shows the data's embedding under the metric of the first view. Embeddings are learned in a 60 dimensional space using 18,000 triplets and then further embedded in a 2 dimensional plane using t-SNE (van der Maaten and Hinton, 2008) for the purpose of visualization. Top: embeddings learned independently. Bottom: embeddings learned jointly. (Legend: Charadriiformes, Passeriformes (Emberizidae), Passeriformes (Icteridae), Passeriformes (Other), Passeriformes (Parulidae), birds (other).)

Figure 4.10: Results on CUB-200 birds dataset. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.

amount of training data. Experiments showed that jointly learning multiple metrics performs better in terms of 3-nearest-neighbour classification error in almost all cases, which implies that it has the potential to recover the category level structure of the data.

Figure 4.11: Learning a new view on CUB-200 birds dataset. Training data contains 100 triplets from the second local view and 3,000 triplets from the other 5 views. Embeddings are learned in a 10 dimensional space and then further embedded in a 2 dimensional plane using t-SNE (van der Maaten and Hinton, 2008) for the purpose of visualization. Left: triplet generalization error on the second local view (curves: joint-10d and ind-10d; horizontal axis: number of training triplets from the local view). Middle: embedding learned independently. Right: embedding learned jointly.

From the experimental results presented above, the gain in performance when using joint learning varies among datasets. On synthetic data with clusters and on the poses of planes dataset, the benefit of using joint learning is evident, and it is especially large for small numbers of training triplets. Its performance gain on uniformly distributed synthetic data and on the public figure faces dataset is relatively small, and it is very limited on the CUB-200 birds dataset. Several factors may contribute to this variation. We discuss two different yet related factors: (1) choice of dimensions and (2) correlation underlying different views.

Influence of Dimension

From the results on public figure faces and CUB-200 (Fig.
4.8 and 4.10), we see that the dimension of the embedding space affects the triplet generalization errors of embeddings learned jointly, especially when the training set is large. For example, on the public figure dataset (Fig. 4.8), joint learning reduces the error faster than independent learning up to around 10,000 triplets. When there are more than 10,000 triplets, the errors of joint embedding decrease monotonically as the number of dimensions increases. Meanwhile, when we learn a 10 dimensional joint embedding for the uniform synthetic dataset sampled from a 20 dimensional space, we see that joint learning obtains lower triplet generalization error on training sets of up to around 20,000 triplets but is overtaken by independent learning on larger training sets (Fig. 4.4). This can be understood as a bias induced by the joint learning.

As analysed in Sec. 3.2, joint learning has a lower complexity when the embedding dimension is chosen the same as in independent learning. However, experimental results have shown that in many cases joint learning has a large performance gain even when using the same embedding dimension as independent learning. For example, when the public figures data was embedded in a 20 dimensional space, joint embedding continues to perform better than independent learning. Meanwhile, from the results on synthetic data (clustered) and poses of planes data (Fig. 4.3 and 4.6), joint learning is able to achieve lower errors. Our interpretation is that, when the model has sufficient complexity to capture the inherent structure of the data, joint embedding makes use of triplets from all views simultaneously, while independent learning only employs triplets from individual views. Next, we consider one particular aspect of the underlying structure, which is the correlation between different views.

Correlation Between Views

Another question of particular interest is how the underlying similarities affect the performance of the proposed joint learning algorithm. Presumably, joint learning is affected by the correlation between the similarity metrics of different views. Here, we examine such correlation from two aspects: consistency of triplet constraints and correlation of distances.

Consistency of Triplet Constraints

We assume that there is a ground truth embedding for each similarity metric. Since there are $3!\binom{N}{3}$ triplets on $N$ objects and any similarity metric would satisfy exactly half of them (footnote 1), a similarity metric induces $m(N) = 3!\binom{N}{3}/2$ triplet constraints. Triplet constraint consistency between two views is defined as the ratio of the number of triplet constraints shared by both views to $m(N)$. In our joint learning algorithm, the global embedding $X$ is trained to conform to triplet constraints over all views.
Therefore, if there exists a pair of conflicting triplet constraints, joint learning might not be able to model both triplets well.

Footnote 1: This is because if the triplet constraint (i, j, k) is satisfied under a metric, then the constraint (i, k, j) must be violated by the same metric.

We examine consistency of triplet constraints on a dataset in the following way. On datasets where we have the ground truth similarity metrics in the form of a true embedding (e.g., synthetic data) or a true similarity kernel (e.g., public figure data), we draw 100,000 triplets at random and check whether each triplet is satisfied or violated on each view. Triplet constraint consistency between two similarity metrics is estimated as the percentage of triplets which received the

same judgement under two metrics, i.e., satisfied or violated by both metrics. On poses of planes data and the CUB-200 dataset, since we do not have access to all triplet constraints, we estimate the triplet consistency from embeddings learned independently with the largest number of triplets. The triplet constraint consistency for the datasets used in this work is listed in Appendix C. The average triplet constraint consistency for each dataset is summarized in Table 4.1.

Correlation of Distances

A metric over $N$ objects can be fully captured by a distance matrix $D$ whose $(i, j)$-th entry is the distance between objects $i$ and $j$. $D$ can be computed from the kernel matrix $K$ by
\[
D(i, j) = K(i, i) - K(i, j) - K(j, i) + K(j, j).
\]
Suppose $D_1$ and $D_2$ are two distance matrices. Since a (pseudo) metric is symmetric and so is the distance matrix, to compare two distance matrices we may restrict to one triangle of each (the entries with $i < j$) and define their correlation as
\[
\mathrm{Corr}(D_1, D_2) = \frac{\sum_{1 \le i<j \le N} (D_1(i,j)-\mu_1)\,(D_2(i,j)-\mu_2)}{\sqrt{\sum_{1 \le i<j \le N} (D_1(i,j)-\mu_1)^2}\;\sqrt{\sum_{1 \le i<j \le N} (D_2(i,j)-\mu_2)^2}},
\]
where $\mu_t = \frac{2}{N(N-1)} \sum_{1 \le i<j \le N} D_t(i,j)$, $t = 1, 2$. On synthetic data and the public figure faces dataset, we are able to either compute the distance matrix or have access to the true kernel matrix. Therefore we can compute the distance correlation between every pair of local metrics for each of them. For poses of planes and CUB-200 data, we estimate the correlation of distances from embeddings learned independently with the largest number of triplets. Distance correlations between every pair of views on each dataset are shown in Appendix C. The average distance correlation for each dataset is summarized in Table 4.1.

Relation between triplet consistency and correlation of distances. Since we aim to recover the underlying similarity metric from triplet constraints, a natural question is: how far apart can two metrics that satisfy the same set of triplet constraints be?
Since triplet consistency and correlation of distances are related to the two sides of this problem, we inspect the relation between them empirically to obtain some clues. A rigorous analysis is left as future work. Here, an approximate relation between the two quantities is shown first. Its intuition is given

subsequently.

Figure 4.12: Relation between triplet consistency and correlation of distances. (Axes: $\sin(0.5\pi(2 \cdot \mathrm{TripletConsist} - 1))$ and correlation; points: syn (cluster), syn (uniform), pubfig, bird, poses of planes.)

Let $\delta(D_1, D_2)$ denote the triplet consistency between two distance metrics. We illustrate the relation between $\delta(D_p, D_q)$ and $\mathrm{Corr}(D_p, D_q)$ in Fig. 4.12. It shows that the two quantities roughly follow the relation
\[
\sin\left(\frac{\pi}{2}\,\big(2\delta(D_p, D_q) - 1\big)\right) = \mathrm{Corr}(D_p, D_q). \tag{4.1}
\]
The intuition behind this relation unfolds as follows. Given two distance matrices $D_1$ and $D_2$ over $N$ objects, consider the following generative model:

1. Let $(I, J)$ be a random variable that takes values in the space of all unordered pairs among these $N$ objects with equal chance.

2. Define $Y_1 := D_1(I, J)$, $Y_2 := D_2(I, J)$.

With this definition, $\mathrm{Corr}(D_1, D_2)$ defined previously is exactly the correlation of $Y_1$ and $Y_2$. On the other hand, given a sample $\{(y_1^{(t)}, y_2^{(t)}) : t = 1, \dots, m\}$ drawn from the generative model defined above,

the Kendall's $\tau$ (Kruskal, 1958) is defined by
\[
\hat{\tau} = \frac{2}{m(m-1)} \sum_{1 \le t < t' \le m} \mathrm{sign}\left[\big(y_1^{(t)} - y_1^{(t')}\big)\big(y_2^{(t)} - y_2^{(t')}\big)\right]
\]
and, as stated by Liu et al. (2012), its population version is given by
\[
\tau := \mathrm{Corr}\left(\mathrm{sign}\big(Y_1 - \tilde{Y}_1\big),\ \mathrm{sign}\big(Y_2 - \tilde{Y}_2\big)\right),
\]
where $\tilde{Y}_t$ is an independent copy of $Y_t$, $t = 1, 2$. Moreover,
\[
\tau = P\big((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\big) - P\big((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) < 0\big).
\]
Loosely speaking, we have
\[
\tau = 2P\big((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\big) + P\big((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) = 0\big) - 1,
\]
which is approximately $2P\big((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\big) - 1$ when the number of objects $N$ is large. As we define $Y_1$ and $Y_2$ to be the distances of a random pair of objects under metrics $D_1$ and $D_2$, $P\big((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\big)$ can be regarded as the consistency of the quadruple relations induced by these two metrics. Each quadruple relation is written as (A, B, C, D), which means A is more similar to B than C is to D, and it subsumes triplet constraints as special cases. If $Y_1$ and $Y_2$ are jointly Gaussian or belong to a nonparanormal family, it is known that
\[
\sin\left(\frac{\pi}{2}\,\tau\right) = \rho_{Y_1, Y_2}, \tag{4.2}
\]
where $\rho_{Y_1, Y_2}$ is the correlation between $Y_1$ and $Y_2$; see (Kruskal, 1958; Liu et al., 2012). This gives the intuition for using (4.1) to approximate the relation between triplet consistency and correlation of pair-wise distances. Note that, here, we use triplet consistency as an estimate of quadruple consistency. Besides, $Y_1$ and $Y_2$ obtained from the generative model do not necessarily resemble Gaussians. Empirically, however, the relation holds quite well on the datasets used in this work, as shown in Fig. 4.12.

Relating performance gain with view correlation. The performance gain is measured by the difference between the areas under the triplet generalization error curves of independent and joint learning, normalized by that of independent learning, in the case of 10 dimensional embedding. Relating the performance gain with the correlation among similarities (Table 4.1, Fig. 4.13), we see that on synthetic data (clustered), where joint learning obtained a significant performance gain, triplet consistency and distance correlations

are high. On the other hand, synthetic data (uniform), poses of planes and the public figure faces dataset have relatively mild view correlations. Performance gains on these datasets are not very significant. In contrast, the triplet consistency of CUB-200 is close to random (0.5). That is probably the reason that the performance gain on this dataset is very limited.

Table 4.1: Measure of similarity correlation and performance gain of using joint learning. Columns: dataset, average triplet consistency, average distance correlation, performance gain (%). Rows: synthetic (clustered), synthetic (uniform), poses of planes, public figure, CUB-200. Marked entries are values estimated from independent embeddings.

Figure 4.13: Performance gain and correlation of views. (Vertical axis: performance gain (%); horizontal axes: triplet consistency and correlation of distances; points: syn (clustered), syn (uniform), pubfig, poses of planes, CUB-200.)
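The two view-correlation measures discussed in this section, and the approximate relation (4.1) between them, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the procedure used to produce Table 4.1. The function names are hypothetical, and the two toy views are built from overlapping coordinate subsets rather than random subspace projections. Ties between distances are counted as violations.

```python
import math
import random


def sq_dist_matrix(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    n = len(X)
    return [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]


def triplet_consistency(D1, D2, n_samples=100000, seed=0):
    """Fraction of random triplets (i, j, k) judged the same way by both metrics."""
    rng = random.Random(seed)
    n = len(D1)
    agree = 0
    for _ in range(n_samples):
        i, j, k = rng.sample(range(n), 3)
        if (D1[i][j] < D1[i][k]) == (D2[i][j] < D2[i][k]):
            agree += 1
    return agree / n_samples


def distance_correlation(D1, D2):
    """Pearson correlation of the entries with i < j, as in Corr(D1, D2)."""
    n = len(D1)
    a = [D1[i][j] for i in range(n) for j in range(i + 1, n)]
    b = [D2[i][j] for i in range(n) for j in range(i + 1, n)]
    mu_a, mu_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mu_a) * (v - mu_b) for u, v in zip(a, b))
    var_a = sum((u - mu_a) ** 2 for u in a)
    var_b = sum((v - mu_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def predicted_correlation(delta):
    """Left-hand side of relation (4.1) for a given triplet consistency."""
    return math.sin(0.5 * math.pi * (2.0 * delta - 1.0))


# Toy data: 50 points in 6 dimensions, two views sharing two coordinates.
rng = random.Random(1)
X = [[rng.random() for _ in range(6)] for _ in range(50)]
Da = sq_dist_matrix([x[:4] for x in X])
Db = sq_dist_matrix([x[2:] for x in X])

delta = triplet_consistency(Da, Db, n_samples=20000)
corr = distance_correlation(Da, Db)
print(delta, corr, predicted_correlation(delta))
```

A consistency of 0.5 maps to a predicted correlation of 0 and a consistency of 1 maps to 1. How closely the measured correlation tracks the prediction on the toy data depends on the sample and is not guaranteed.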

CHAPTER 5 RELATED WORK

Previously, we have covered some literature that is closely related to our formulation of jointly learning multiple metrics from triplet constraints. In this chapter, we mention some other work that could be related to the problem of learning multiple similarities from pairwise comparisons and look at this problem from different angles.

5.1 Multitask Metric Learning with Low-Rank Tensors

Tensors are widely used in the context of multi-task learning (Evgeniou and Pontil, 2007; Romera-Paredes et al., 2013). Learning a low-rank tensor is a problem that has been studied and used in many applications (Tomioka et al., 2011; Signoretto et al., 2014; Liu et al., 2013; Blankertz et al., 2008). Since $T$ similarity metrics over a given set of $N$ objects can be represented as kernel matrices $K_t$, $t = 1, \dots, T$, we may consider them as slices of an $N \times N \times T$ tensor $\mathcal{K}$, where the $(i, j, t)$ entry of $\mathcal{K}$ reflects the similarity between objects $i$ and $j$ under the $t$-th view. Meanwhile, under the assumption that there exists correlation among different kernel matrices, we may seek a $\mathcal{K}$ with low rank. See (Kolda and Bader, 2009) for the definition of tensor rank. Let $\Delta^{(i,j,t)} \in \mathbb{R}^{N \times N \times T}$ be a tensor with entries $\Delta^{(i,j,t)}(i, i, t) = \Delta^{(i,j,t)}(j, j, t) = 1$, $\Delta^{(i,j,t)}(i, j, t) = \Delta^{(i,j,t)}(j, i, t) = -1$ and 0 everywhere else. Then its inner product with $\mathcal{K}$ is equal to the (squared) distance under the $t$-th similarity metric, i.e.,
\[
\langle \Delta^{(i,j,t)}, \mathcal{K} \rangle = \mathcal{K}_{i,i,t} - \mathcal{K}_{i,j,t} - \mathcal{K}_{j,i,t} + \mathcal{K}_{j,j,t}.
\]
With this, each triplet constraint can be expressed as a linear constraint on $\mathcal{K}$:
\[
(i, j, k) \in S_t \implies \langle \Delta^{(i,j,t)}, \mathcal{K} \rangle < \langle \Delta^{(i,k,t)}, \mathcal{K} \rangle.
\]
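As a small sanity check of the identity above, the following plain-Python sketch builds the indicator tensor and a kernel tensor from toy per-view points. It then compares the inner product with the squared distance. All names, and the nested-list tensor layout `A[i][j][t]`, are assumptions made for illustration.

```python
def indicator_tensor(i, j, t, N, T):
    """N x N x T tensor: +1 at (i,i,t) and (j,j,t), -1 at (i,j,t) and (j,i,t)."""
    D = [[[0.0] * T for _ in range(N)] for _ in range(N)]
    D[i][i][t] += 1.0
    D[j][j][t] += 1.0
    D[i][j][t] -= 1.0
    D[j][i][t] -= 1.0
    return D


def inner(A, B):
    """Entrywise inner product of two tensors stored as nested lists."""
    return sum(a * b
               for plane_a, plane_b in zip(A, B)
               for row_a, row_b in zip(plane_a, plane_b)
               for a, b in zip(row_a, row_b))


def kernel_tensor(views):
    """K[i][j][t] = <x_i, x_j> computed from the points of view t."""
    T, N = len(views), len(views[0])
    return [[[sum(a * b for a, b in zip(views[t][i], views[t][j]))
              for t in range(T)] for j in range(N)] for i in range(N)]


# Three objects under two toy views (view 0 is 2-D, view 1 is 1-D).
views = [[[0, 0], [3, 4], [1, 0]],
         [[0], [1], [5]]]
K = kernel_tensor(views)

d01_view0 = inner(indicator_tensor(0, 1, 0, 3, 2), K)  # |(0,0) - (3,4)|^2
d02_view0 = inner(indicator_tensor(0, 2, 0, 3, 2), K)  # |(0,0) - (1,0)|^2
d12_view1 = inner(indicator_tensor(1, 2, 1, 3, 2), K)  # (1 - 5)^2
print(d01_view0, d02_view0, d12_view1)
# Triplet (0, 2, 1) in view 0 is the linear constraint d02_view0 < d01_view0.
print(d02_view0 < d01_view0)
```

Since the constraint is linear in the kernel tensor, any convex loss applied to the two inner products remains convex in that tensor.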

43 36 The problem of learning multiple similarity metrics can be formulated with a tensor as: min K T l ( (i,j,t), K, (i,k,t), K ) + Ω(K), (5.1) t=1 (i,j,k) S t s.t. K(:, :, t) 0 (t = 1,..., T ). where Ω(K) is a regularization term which induces low-rank tensor. See (Tomioka et al., 2010) for example choices of such regularizations. Comparing (5.1) with (3.1), we can see that (3.1) can also be cast into a tensor learning problem but the learning space is restricted to tensors with a special form of factorization K = M 1 X 2 X. Here, M is a D D T tensor whose t-th slices is required to be a positive semi-definite matrix and corresponds to the M t representing the metric under the t-th view while X R D N represents the global embedding in (3.1). 5.2 Relation between Triplet Embedding and Metric Learning In the literature, embedding with pairwise comparisons is sometimes called ordinal embedding (Alon et al., 2008; von Luxburg et al., 2014; Terada and Luxburg, 2014) or partial order embedding (McFee and Lanckriet, 2009). von Luxburg et al. (2014) consider the problem of recovering the embeddings of n objects from pairwise comparisons. They proved that, asymptitocally, if we have the knowledge of the ordinal relationships for all quadruples (i, j, k, l), as n, the set of embedded points always converges to the set of original points, up to similarity transformations such as rotations, translations, rescalings and reflections. Developed on that, a further work of Terada and Luxburg (2014) investigates the consistency of local ordinal embedding (LOE) which only uses triplet constraints that reflect k-nearest-neighbour information. More specifically, in the setting of LOE, a triplet constraints (i, j, l) implies that j is a k-nearest-neighbour of i but l is not. The authors proved that, under certain conditions, it is possible to reconstruct the point set x 1,..., x n asymptotically, if we just know the k-nearest neighbours of each point. 
They also proposed a Soft Ordinal Embedding algorithm, which recovers not only the ordinal constraints but also the density structure underlying the data set. Jamieson and Nowak (2011) derived a lower bound on the minimum number of queries of triplet relations $(i, j, k)$ needed to determine the embedding. They proved that at least $\Omega(dn \log n)$ such comparisons are needed to determine the embedding of $n$ objects into a $d$-dimensional space, and that this lower bound cannot be achieved by choosing pairwise comparisons at random. The work of McFee and Lanckriet (2009) suggested that it might be hard to find the minimal dimension needed to produce an embedding that satisfies all the constraints. They showed that it is NP-complete to decide whether a given set of ordinal constraints $C$ can be satisfied in $\mathbb{R}^1$. Alon et al. (2008) studied the problem of minimum-relaxation ordinal embedding. Roughly speaking, given a distance function $D(i, j)$, an ordinal embedding with relaxation $\alpha$ is an embedding that satisfies

$$D(i, j) > \alpha D(p, q) \implies d(x_i, x_j) > d(x_p, x_q),$$

where the $x_i$ are the learned embeddings and $d(\cdot, \cdot)$ is the metric of the embedding space. They also established that the problem of ordinal embedding has many qualitative differences from metric embedding, which aims to minimize distortion.
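The relaxation condition of Alon et al. (2008), as stated above, can be tested by brute force on small examples. The sketch below is only an illustration under assumptions: the function name `is_ordinal_embedding` is hypothetical, the embedding metric $d$ is taken to be Euclidean, and all pairs of distinct object pairs are enumerated, which costs $O(n^4)$ and is not meant for large $n$.

```python
import itertools
import numpy as np

def is_ordinal_embedding(D, X, alpha=1.0):
    """Check that D[i,j] > alpha * D[p,q] implies d(x_i,x_j) > d(x_p,x_q).

    D is an (n, n) matrix of input distances; X is an (n, dim) array whose
    rows are the embedded points; d is the Euclidean distance (assumption).
    """
    n = D.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pairs = list(itertools.combinations(range(n), 2))
    for (i, j), (p, q) in itertools.permutations(pairs, 2):
        if D[i, j] > alpha * D[p, q] and not d[i, j] > d[p, q]:
            return False
    return True

# Three points on a line at positions 0, 1, 3 define the input distances.
points = np.array([[0.0], [1.0], [3.0]])
D = np.abs(points - points.T)

exact = points                             # preserves every comparison
distorted = np.array([[0.0], [2.0], [3.0]])  # swaps the order of two distances

print(is_ordinal_embedding(D, exact, alpha=1.0))
print(is_ordinal_embedding(D, distorted, alpha=1.0))
print(is_ordinal_embedding(D, distorted, alpha=2.0))
```

A larger $\alpha$ activates fewer constraints, so an embedding that violates some comparisons at $\alpha = 1$ may still be valid at a larger relaxation.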

CHAPTER 6 CONCLUSION AND FUTURE WORK

In this work, we introduced a formulation of the problem of jointly learning multiple similarity metrics, which is a natural extension of the conventional independent learning model. The proposed model consists of a global view, which represents each object as a fixed-dimensional vector, and local views, which specify each view-specific Mahalanobis metric as a positive semi-definite matrix. Its performance is studied empirically on both synthetic and real-world datasets. Experimental results demonstrate that, on datasets with a higher correlation between similarity metrics, joint learning performs better than independent learning in terms of triplet generalization error, especially when the number of training triplets is small. This suggests that joint learning would be preferable for applications where triplet relations are expensive to obtain. Although the results are presented for the hinge loss (Agarwal et al., 2007), the proposed algorithm easily generalizes to other loss functions, e.g., the CKL loss (Tamuz et al., 2011) and the t-STE loss (van der Maaten and Weinberger, 2012). Meanwhile, joint learning tends to have a lower leave-one-out classification error in most situations, which implies that it could support supervised learning tasks such as learning a classifier. As future work, we aim to study how to use embeddings learned from similarity comparisons for classification tasks where the data is partially labelled. Some preliminary analysis shows that the underlying correlation among similarity metrics would notably affect the performance of joint learning. Based on our empirical study, to achieve performance comparable to independent learning, joint learning would require a higher-dimensional embedding space on datasets where views are less correlated.
It would be interesting to find a way of estimating the amount of view correlation and, from that, to provide guidance for choosing the dimension of the embedding space. Moreover, as mentioned in Section 5.1, learning multiple metrics from triplet constraints can be formulated as a problem of learning a low-rank tensor under linear constraints. We may explore this approach and use low-rank tensors as the tool to model correlated similarity metrics. By drawing on existing studies of tensors, we might gain some insight into learning multiple metrics under triplet constraints.
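To make the link to Section 5.1 concrete, the sketch below builds a kernel tensor from a shared embedding and view-specific metrics and checks two of its properties. It is an illustration under assumptions rather than part of the proposed algorithm: the slice-wise form $K(:, :, t) = X^\top M_t X$ is assumed, the metrics are random positive semi-definite matrices, the array `M` is stored with the view index first (shape `(T, D, D)`) for convenience, and the helper name `mahalanobis_sq` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 6, 2, 3

# Global embedding: one D-dimensional column per object.
X = rng.normal(size=(D, N))

# View-specific metrics M_t = A_t A_t^T, positive semi-definite by construction.
A = rng.normal(size=(T, D, D))
M = np.einsum("tab,tcb->tac", A, A)

# Kernel tensor with entries K[i, j, t] = x_i^T M_t x_j.
K = np.einsum("ai,tab,bj->ijt", X, M, X)

def mahalanobis_sq(X, M_t, i, j):
    """Squared distance (x_i - x_j)^T M_t (x_i - x_j)."""
    diff = X[:, i] - X[:, j]
    return float(diff @ M_t @ diff)

i, j, t = 0, 3, 1
from_tensor = K[i, i, t] - K[i, j, t] - K[j, i, t] + K[j, j, t]
print(np.isclose(from_tensor, mahalanobis_sq(X, M[t], i, j)))

# Every slice has rank at most D, however large N is.
print([int(np.linalg.matrix_rank(K[:, :, s])) for s in range(T)])
```

All slices share the same row and column space, spanned by the rows of $X$, which is the sense in which the factorized model ties the views together.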

On the other hand, embeddings learned from joint learning can be used in higher-level machine learning tasks, such as fast retrieval and recognition. We may integrate our joint learning framework with those tasks and study the performance of such systems in real applications, such as music recommendation (McFee et al., 2012), word embedding in language (Mikolov et al., 2013) and interactive fine-grained recognition (Wah et al., 2014a). Since triplet relations might be expensive to obtain in many applications, when building such a system it would also be interesting to develop an active triplet sampling algorithm, as an extension to (Jamieson and Nowak, 2011), so that the queried triplets are the most informative ones.

APPENDIX A VIEWS OF POSES OF AIRPLANES DATASET

Each of the 200 airplanes was annotated with 16 landmarks, namely:

01. Top Rudder      05. L WingTip      09. Nose Bottom        13. Left Engine Back
02. Bot Rudder      06. R WingTip      10. Left Wing Base     14. Right Engine Front
03. L Stabilizer    07. NoseTip        11. Right Wing Base    15. Right Engine Back
04. R Stabilizer    08. Nose Top       12. Left Engine Front  16. Bot Rudder Front

This is also illustrated in Figure A.1. The five different views are defined by considering different subsets of landmarks as follows:

1. all {1, 2, ..., 16}
2. back {1, 2, 3, 4, 16}
3. nose {7, 8, 9}
4. back+wings {1, 2, ..., 6, 10, 11, ..., 16}
5. nose+wings {5, 6, ..., 15}

For a triplet (A, B, C), we compute the similarities $s_i(A, B)$ and $s_i(A, C)$ by aligning the subset $i$ of landmarks of B and C to A under a translation and scaling that minimizes the sum of squared errors after alignment. The similarity is inversely proportional to the residual error. This is also known as Procrustes analysis, which is commonly used for matching shapes.
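The alignment step described above has a closed-form least-squares solution, sketched below. This is an illustrative reading of the procedure, not the code used to build the dataset: landmarks are assumed to be 2-D image coordinates stored as a 16 x 2 array, the scale is left unconstrained, the names `VIEWS`, `alignment_residual` and `more_similar` are hypothetical, and residuals are compared directly instead of being converted to a similarity score.

```python
import numpy as np

# Landmark subsets per view, using the 1-based numbering of the appendix.
VIEWS = {
    "all": list(range(1, 17)),
    "back": [1, 2, 3, 4, 16],
    "nose": [7, 8, 9],
    "back+wings": list(range(1, 7)) + list(range(10, 17)),
    "nose+wings": list(range(5, 16)),
}

def alignment_residual(A, B, landmarks):
    """Sum of squared errors after aligning B to A by translation and scaling.

    Minimizes sum_l ||s * b_l + shift - a_l||^2 over the scalar s and the
    vector shift. Centering both point sets removes the translation, and the
    optimal scale is then <a_c, b_c> / <b_c, b_c>.
    """
    idx = np.asarray(landmarks) - 1          # convert to 0-based rows
    a_c = A[idx] - A[idx].mean(axis=0)
    b_c = B[idx] - B[idx].mean(axis=0)
    denom = np.sum(b_c ** 2)
    scale = np.sum(a_c * b_c) / denom if denom > 0 else 0.0
    return float(np.sum((scale * b_c - a_c) ** 2))

def more_similar(A, B, C, view):
    """True if, under the given view, A is more similar to B than to C."""
    landmarks = VIEWS[view]
    return alignment_residual(A, B, landmarks) < alignment_residual(A, C, landmarks)

# Toy example: B is a shifted and rescaled copy of A, C is a noisy copy.
rng = np.random.default_rng(0)
A = rng.normal(size=(16, 2))
B = 2.5 * A + np.array([3.0, -1.0])
C = A + rng.normal(scale=0.5, size=(16, 2))

print(more_similar(A, B, C, "nose+wings"))
```

Because the similarity is a decreasing function of the residual, comparing residuals yields the same triplet answer and avoids dividing by a residual of zero.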

Figure A.1: Landmarks illustrated on several planes.

Learning Models of Similarity: Metric and Kernel Learning. Eric Heim, University of Pittsburgh

Learning Models of Similarity: Metric and Kernel Learning. Eric Heim, University of Pittsburgh Learning Models of Similarity: Metric and Kernel Learning Eric Heim, University of Pittsburgh Standard Machine Learning Pipeline Manually-Tuned Features Machine Learning Model Desired Output for Task Features

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013 Machine Learning Topic 5: Linear Discriminants Bryan Pardo, EECS 349 Machine Learning, 2013 Thanks to Mark Cartwright for his extensive contributions to these slides Thanks to Alpaydin, Bishop, and Duda/Hart/Stork

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

A Unified Framework to Integrate Supervision and Metric Learning into Clustering

A Unified Framework to Integrate Supervision and Metric Learning into Clustering A Unified Framework to Integrate Supervision and Metric Learning into Clustering Xin Li and Dan Roth Department of Computer Science University of Illinois, Urbana, IL 61801 (xli1,danr)@uiuc.edu December

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Kernels and Clustering

Kernels and Clustering Kernels and Clustering Robert Platt Northeastern University All slides in this file are adapted from CS188 UC Berkeley Case-Based Learning Non-Separable Data Case-Based Reasoning Classification from similarity

More information

Classification: Feature Vectors

Classification: Feature Vectors Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Supervised Learning: K-Nearest Neighbors and Decision Trees

Supervised Learning: K-Nearest Neighbors and Decision Trees Supervised Learning: K-Nearest Neighbors and Decision Trees Piyush Rai CS5350/6350: Machine Learning August 25, 2011 (CS5350/6350) K-NN and DT August 25, 2011 1 / 20 Supervised Learning Given training

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CX 4242 DVA March 6, 2014 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Analyze! Limited memory size! Data may not be fitted to the memory of your machine! Slow computation!

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

570 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 2, FEBRUARY 2011

570 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 2, FEBRUARY 2011 570 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 2, FEBRUARY 2011 Contextual Object Localization With Multiple Kernel Nearest Neighbor Brian McFee, Student Member, IEEE, Carolina Galleguillos, Student

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Kernels and Clustering Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.

More information

Regularized Large Margin Distance Metric Learning

Regularized Large Margin Distance Metric Learning 2016 IEEE 16th International Conference on Data Mining Regularized Large Margin Distance Metric Learning Ya Li, Xinmei Tian CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

CS 534: Computer Vision Segmentation and Perceptual Grouping

CS 534: Computer Vision Segmentation and Perceptual Grouping CS 534: Computer Vision Segmentation and Perceptual Grouping Ahmed Elgammal Dept of Computer Science CS 534 Segmentation - 1 Outlines Mid-level vision What is segmentation Perceptual Grouping Segmentation

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

Partial order embedding with multiple kernels

Partial order embedding with multiple kernels Partial order embedding with multiple kernels Brian McFee and Gert Lanckriet University of California, San Diego Goal Embed a set of objects into a Euclidean space such that: 1. Distances conform to human

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision Object Recognition Using Pictorial Structures Daniel Huttenlocher Computer Science Department Joint work with Pedro Felzenszwalb, MIT AI Lab In This Talk Object recognition in computer vision Brief definition

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Multiresponse Sparse Regression with Application to Multidimensional Scaling

Multiresponse Sparse Regression with Application to Multidimensional Scaling Multiresponse Sparse Regression with Application to Multidimensional Scaling Timo Similä and Jarkko Tikka Helsinki University of Technology, Laboratory of Computer and Information Science P.O. Box 54,

More information

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies. CSE 547: Machine Learning for Big Data Spring 2019 Problem Set 2 Please read the homework submission policies. 1 Principal Component Analysis and Reconstruction (25 points) Let s do PCA and reconstruct

More information

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4

More information

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron CS 188: Artificial Intelligence Spring 2010 Lecture 24: Perceptrons and More! 4/20/2010 Announcements W7 due Thursday [that s your last written for the semester!] Project 5 out Thursday Contest running

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION

CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION 122 CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION 5.1 INTRODUCTION Face recognition, means checking for the presence of a face from a database that contains many faces and could be performed

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Lecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering

Lecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering CSE 291: Unsupervised learning Spring 2008 Lecture 5 Finding meaningful clusters in data So far we ve been in the vector quantization mindset, where we want to approximate a data set by a small number

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points] CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.

More information

Facial Expression Recognition Using Non-negative Matrix Factorization

Facial Expression Recognition Using Non-negative Matrix Factorization Facial Expression Recognition Using Non-negative Matrix Factorization Symeon Nikitidis, Anastasios Tefas and Ioannis Pitas Artificial Intelligence & Information Analysis Lab Department of Informatics Aristotle,

More information

Constrained Clustering with Interactive Similarity Learning

Constrained Clustering with Interactive Similarity Learning SCIS & ISIS 2010, Dec. 8-12, 2010, Okayama Convention Center, Okayama, Japan Constrained Clustering with Interactive Similarity Learning Masayuki Okabe Toyohashi University of Technology Tenpaku 1-1, Toyohashi,

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked Plotting Menu: QCExpert Plotting Module graphs offers various tools for visualization of uni- and multivariate data. Settings and options in different types of graphs allow for modifications and customizations

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Visualizing pairwise similarity via semidefinite programming

Visualizing pairwise similarity via semidefinite programming Visualizing pairwise similarity via semidefinite programming Amir Globerson Computer Science and Artificial Intelligence Laboratory MIT Cambridge, MA 02139 gamir@csail.mit.edu Sam Roweis Department of

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Neural Computation : Lecture 14 John A. Bullinaria, 2015 1. The RBF Mapping 2. The RBF Network Architecture 3. Computational Power of RBF Networks 4. Training

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Jian Guo, Debadyuti Roy, Jing Wang University of Michigan, Department of Statistics Introduction In this report we propose robust

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

CSE 546 Machine Learning, Autumn 2013 Homework 2

CSE 546 Machine Learning, Autumn 2013 Homework 2 CSE 546 Machine Learning, Autumn 2013 Homework 2 Due: Monday, October 28, beginning of class 1 Boosting [30 Points] We learned about boosting in lecture and the topic is covered in Murphy 16.4. On page

More information

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601 Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network Nathan Sun CIS601 Introduction Face ID is complicated by alterations to an individual s appearance Beard,

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

Clustering will not be satisfactory if:

Clustering will not be satisfactory if: Clustering will not be satisfactory if: -- in the input space the clusters are not linearly separable; -- the distance measure is not adequate; -- the assumptions limit the shape or the number of the clusters.

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Final Exam Study Guide

Final Exam Study Guide Final Exam Study Guide Exam Window: 28th April, 12:00am EST to 30th April, 11:59pm EST Description As indicated in class the goal of the exam is to encourage you to review the material from the course.

More information

All lecture slides will be available at CSC2515_Winter15.html

All lecture slides will be available at  CSC2515_Winter15.html CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Unsupervised Learning: Clustering Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com (Some material

More information

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary

More information

arXiv:1605.00572v1 [cs.CV] 2 May 2016

16-811 Math Fundamentals for Robotics Comparison of Optimization Methods in Optical Flow Estimation Final Report, Fall 2015 Contents Noranart Vesdapunt Master of Computer

Markov Networks in Computer Vision

Sargur Srihari srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Some applications: 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy; curse of dimensionality and simple feature selection; kNN classification; K-means clustering

Unsupervised Learning and Clustering

Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 © 2009, Selim Aksoy (Bilkent University)

Application of Support Vector Machine Algorithm in E-Mail Spam Filtering

Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

Natural Language Processing

Classification III Dan Klein UC Berkeley 1 Classification 2 Linear Models: Perceptron The perceptron algorithm Iteratively processes the training set, reacting to training errors

Distance metric learning: A two-phase approach

ESANN 07 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 6-8 April 07, i6doc.com publ., ISBN 978-8758709-. Distance metric

Lecture 2 September 3

EE 381V: Large Scale Optimization Fall 2012 Lecture 2 September 3 Lecturer: Caramanis & Sanghavi Scribe: Hongbo Si, Qiaoyang Ye 2.1 Overview of the last Lecture The focus of the last lecture was to give

6 Randomized rounding of semidefinite programs

We now turn to a new tool which gives substantially improved performance guarantees for some problems. We now show how nonlinear programming relaxations can

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

Chapter DM:II. II. Cluster Analysis

Cluster Analysis Basics; Hierarchical Cluster Analysis; Iterative Cluster Analysis; Density-Based Cluster Analysis; Cluster Evaluation; Constrained Cluster Analysis

Using Machine Learning to Optimize Storage Systems

Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

Markov Networks in Computer Vision. Sargur Srihari

srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Important application area for MNs 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction

1 Homophily and assortative mixing

Networks, and particularly social networks, often exhibit a property called homophily or assortative mixing, which simply means that the attributes of vertices correlate

ECG782: Multidimensional Digital Signal Processing

Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

Divide and Conquer Kernel Ridge Regression

Yuchen Zhang, John Duchi, Martin Wainwright University of California, Berkeley COLT 2013 Problem

Content-based image and video analysis. Machine learning

Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

Artificial Intelligence. Programming Styles

Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

Factorization with Missing and Noisy Data

Carme Julià, Angel Sappa, Felipe Lumbreras, Joan Serrat, and Antonio López Computer Vision Center and Computer Science Department, Universitat Autònoma de Barcelona,

On Order-Constrained Transitive Distance

Author: Zhiding Yu and B. V. K. Vijaya Kumar, et al. Dept of Electrical and Computer Engineering Carnegie Mellon University 1 Clustering Problem Important Issues:

Computational Statistics and Mathematics for Cyber Security

David J. Marchette Sept, 0 Acknowledgment: This work funded in part by the NSWC In-House Laboratory Independent Research (ILIR) program. NSWCDD-PN--00 Topics NSWCDD-PN--00

Software Documentation of the Potential Support Vector Machine

Tilman Knebel and Sepp Hochreiter Department of Electrical Engineering and Computer Science Technische Universität Berlin 10587 Berlin, Germany

Accelerometer Gesture Recognition

Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

Metric Learning for Large-Scale Image Classification:

Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka

E-Companion: On Styles in Product Design: An Analysis of US Design Patents

1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing