Selective Pooling Vector for Fine-grained Recognition


Guang Chen    Jianchao Yang    Hailin Jin    Eli Shechtman    Jonathan Brandt    Tony X. Han
Adobe Research, San Jose, CA, USA        University of Missouri, Columbia, MO, USA
{jiayang, hljin, elishe,

Abstract

We propose a new framework for image recognition by selectively pooling local visual descriptors, and show its superior discriminative power on fine-grained image classification tasks. The representation is based on selecting the most confident local descriptors for nonlinear function learning using a linear approximation in an embedded higher-dimensional space. The advantage of our Selective Pooling Vector over the previous state-of-the-art Super Vector and Fisher Vector representations is that it ensures a more accurate learning function, which proves to be important for classifying details in fine-grained image recognition. Our experimental results corroborate this claim: with a simple linear SVM as the classifier, the Selective Pooling Vector achieves significant performance gains on standard benchmark datasets for various fine-grained tasks, namely the CMU Multi-PIE dataset for face recognition, and the Caltech-UCSD Birds dataset and the Stanford Dogs dataset for fine-grained object categorization. On all datasets we outperform the state of the art and boost the recognition rates to 96.4%, 48.9%, and 52.0%, respectively.

1. Introduction

Image classification is the task of assigning a predefined category label to an input image, which is a fundamental building block for intelligent image content analysis. Even though it has been studied for many years, image classification remains a major challenge. Perhaps one of the most significant developments of the last decade in image recognition is the use of local image features, including the introduction of the Bag-of-Visual-Words (BOV) model and its variants [25, 21, 11, 29, 19], which inspired a large body of research. The BOV model treats an image as a collection of unordered local visual descriptors extracted from small patches, quantizes them into discrete visual words, and then computes a compact histogram representation for image recognition. However, the BOV model discards the spatial order of local descriptors, which limits its descriptive power. To overcome this problem, one particularly popular extension of the BOV model uses spatial pyramids to take the global image structure into account [15], and is now an important component of many state-of-the-art systems. Other vector representations of local image descriptors [11, 29, 19] extend the BOV model and build richer and more discriminative image representations for classification and retrieval tasks.

Besides generic image categorization, there has recently been a growing interest in fine-grained image classification. Even though the aforementioned algorithms perform well on general object categorization tasks, they may be suboptimal for distinguishing finer details. Specific algorithms have been developed over the last several years to tackle the fine-grained recognition problem from various angles. Yao et al. [27] introduced a very high-dimensional histogram over color and gradient pixel values to alleviate the quantization problem. Yang et al. [26] constructed kernel descriptors based on shape, texture and color information for template learning in fine-grained recognition. Chai et al. [7, 6] used Fisher vectors to learn global-level and object-part-level representations.
Another line of research on fine-grained recognition focuses on image alignment by segmenting or detecting object parts before classification. Gavves et al. [9] localized distinctive details by roughly aligning the objects using an ellipse fit to the shape and achieved convincing performance. Chai et al. [5, 7, 6] demonstrated how co-segmentation can be employed to increase recognition accuracy. Angelova et al. [3] proposed a joint framework of detection and segmentation to localize discriminative parts.

Compared with generic image categorization, fine-grained recognition relies on identifying subtle differences in the appearance of specific object parts. To tackle this problem, we propose a new image feature representation that we call the Selective Pooling Vector (SPV). It is derived from learning a Lipschitz-smooth nonlinear classification function in the local descriptor space using a linear approximation in a higher-dimensional embedded space [29]. The selective pooling procedure rejects local descriptors that do not contribute to the function learning, which results in better function learning and improved classification performance on fine-grained recognition tasks.

Figure 1. Framework of our Selective Pooling Vector. (a) Input image. (b) Dense local descriptor extraction and GMM encoding. (c) For each GMM component, we selectively pool the most representative local descriptors. (d) We concatenate the selective pooling vectors from all Gaussian mixtures as the final image representation for a linear classifier. In (c), we show some pooled local parts with circles; the color of each circle denotes the SVM classifier energy associated with that part. As we can see, our algorithm learns the parts that are most discriminative for the fine-grained recognition task.

In brief, to build our Selective Pooling Vector image representation, we first use a Gaussian Mixture Model (GMM) to encode the local descriptors densely extracted from the input image. Then, for each Gaussian mixture, we conduct selective pooling to find the most representative local descriptors, and we concatenate the pooling vectors from all the mixtures to form the final image representation. Simple and grounded in function learning theory, our feature representation turns out to be very effective on fine-grained recognition tasks. Figure 1 illustrates the framework of our Selective Pooling Vector.

It is worth noting that our Selective Pooling Vector shares a similar feature representation form with the Super Vector [29] and the Fisher Vector [19]. These representations are based on aggregation through averaging of all local image descriptors, which works well for coarse-grained image categorization. However, for fine-grained recognition, where the task is to distinguish fine differences between subcategories, including local descriptors far away from the cluster centers might harm the learning of the classification function. Intuitively, the weighted averaging pooling step in the Super Vector and Fisher Vector smears the fine image structures that are important for fine-grained recognition. In contrast, our selective pooling chooses only a few (often only a single) representative local features per mixture component, thus avoiding the excessive averaging and better preserving the fine visual patterns in the original images. We investigate this distinction between our Selective Pooling Vector and the Super Vector and Fisher Vector based methods on several fine-grained recognition tasks.

To demonstrate the effectiveness of the proposed algorithm, we test it on two different fine-grained image classification tasks: face recognition and fine-grained object categorization. Both tasks require distinguishing subtle differences in the appearance of specific object parts. For the face recognition task, we test on the CMU Multi-PIE dataset [10] and achieve a state-of-the-art average accuracy of 96.4% over all three test sessions. For fine-grained object categorization, we test on two popular benchmark datasets, the Caltech-UCSD Birds 2010 dataset [22] and the Stanford Dogs dataset [14], and achieve state-of-the-art classification accuracies of 48.9% and 52.0%, respectively.

2. Selective Pooling Vector Encoding

In this section, we describe the rationale behind our Selective Pooling Vector (SPV) as a new image feature representation.
The image feature construction is inspired by the fact that a nonlinear function in the original space can be learned as a linear function in a high-dimensional embedded space using a first-order approximation [29]. To ensure accurate function learning, we propose a selective pooling procedure that selects the most significant local descriptors, from which we derive our new image feature representation.

2.1. Image Recognition as Nonlinear Function Learning

For image recognition, we represent each image as a bag of local descriptors $I = \{z_1, z_2, \ldots, z_n\}$, where $z_i$ is the $i$-th local descriptor (e.g., SIFT [17] or LBP [2]). For the sake of simplicity, we discuss the two-class problem $c \in \{-1, +1\}$. Assuming that these local descriptors are i.i.d., we look at the log odds ratio for classification,

$$\log \frac{p(I \mid c=+1)}{p(I \mid c=-1)} = \log \prod_{i=1}^{n} \frac{p(z_i \mid c=+1)}{p(z_i \mid c=-1)} = \log \frac{\exp\!\left(\sum_{i=1}^{n} g(z_i, c=+1)\right)}{\exp\!\left(\sum_{i=1}^{n} g(z_i, c=-1)\right)} = \sum_{i=1}^{n} \left\{ g(z_i, c=+1) - g(z_i, c=-1) \right\}, \quad (1)$$

where $g(z_i, c)$ is the potential function that determines the likelihood of $z_i$ belonging to class $c$. Letting $f(z_i) = g(z_i, c=+1) - g(z_i, c=-1)$, the above equation translates into

$$\log \frac{p(I \mid c=+1)}{p(I \mid c=-1)} = \sum_{i=1}^{n} f(z_i). \quad (2)$$

Therefore, if we know the function $f$ in the local image descriptor space, we can classify image $I$ as $c = +1$ if $\sum_{i=1}^{n} f(z_i) > 0$ and $c = -1$ otherwise.
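As a toy illustration of the decision rule in Eq. (2) (this sketch is ours, not code from the paper), the following Python snippet classifies a bag of local descriptors by the sign of the summed per-descriptor scores; the linear score function in the usage example is a hypothetical stand-in for the approximation of $f$ developed in the rest of this section.

```python
import numpy as np

def classify_image(descriptors, score_fn):
    """Classify a bag of local descriptors by the sign of the summed scores.

    descriptors: (n, p) array, one local descriptor per row.
    score_fn:    callable returning per-descriptor scores f(z_i) as an (n,) array.
    Implements Eq. (2): predict c = +1 if sum_i f(z_i) > 0, else c = -1.
    """
    scores = score_fn(descriptors)
    return 1 if scores.sum() > 0 else -1

# Toy usage with a hypothetical linear scoring function w^T z + b.
rng = np.random.default_rng(0)
w, b = rng.normal(size=128), 0.1
descriptors = rng.normal(size=(500, 128))      # 500 SIFT-like descriptors
print(classify_image(descriptors, lambda Z: Z @ w + b))
```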

2.2. Nonlinear Function Learning

As shown in [29], the nonlinear function $f$ can be approximated by locally linear functions if it is sufficiently smooth. Let $D = \{d_1, d_2, \ldots, d_K\} \subset \mathbb{R}^p$ denote a set of anchor points in the local descriptor space, which we call a codebook. For a data sample $z$, denote by $d^*(z) \in D$ its closest anchor point, or codebook item. According to the Taylor expansion, we have

$$f(z) \approx f(d^*(z)) + \nabla f(d^*(z))^T (z - d^*(z)), \quad (3)$$

where the quality of approximating $f(z)$ by $f(d^*(z)) + \nabla f(d^*(z))^T (z - d^*(z))$ is bounded by how close $z$ is to $d^*(z)$. By reformulating Eq. (3) as in [29], we have

$$f(z) \approx \sum_{k=1}^{K} w_k^T \phi_k(z), \quad (4)$$

where

$$\phi_k(z) = r_k(z) \left[ 1, (z - d_k)^T \right]^T, \quad (5)$$
$$w_k = \left[ f(d_k), \nabla f(d_k)^T \right]^T. \quad (6)$$

Here $r_k(z)$ is the vector quantization encoding coefficient of $z$ w.r.t. the codebook $D$, defined as

$$r_k(z) = \begin{cases} 1, & \text{if } k = \arg\min_{j \in \{1,\ldots,K\}} \| z - d_j \|^2, \\ 0, & \text{otherwise.} \end{cases} \quad (7)$$

We denote the concatenations of $\phi_k$ and $w_k$ by $\phi$ and $w$ as follows:

$$\phi(z) = \left[ \phi_k(z) \right]_{k \in \{1,\ldots,K\}}, \quad (8)$$
$$w = \left[ w_k \right]_{k \in \{1,\ldots,K\}}. \quad (9)$$

This is referred to as super-vector coding in [29]. The classification decision function in Eq. (2) can then be expressed as

$$\sum_{i=1}^{n} f(z_i) \approx w^T \sum_{i=1}^{n} \phi(z_i). \quad (10)$$

Given the codebook $D$, it is easy to compute $\sum_{i=1}^{n} \phi(z_i)$, which we denote by $\psi(I)$. However, the function values at the anchor points in $D$, i.e., $w$, are still unknown. Note that if we regard $\psi(I)$ as the image feature, $w$ is essentially a linear classifier, which can be learned from the labeled training data.

2.3. Selective Pooling Vector

According to Eq. (3), the linear approximation accuracy of the function $f$ is bounded by the quantization error $\| z - d^*(z) \|_2^2$. Therefore, we can improve the function approximation accuracy by learning the codebook $D$ to minimize the quantization error. One simple way to learn such a codebook is the K-means algorithm:

$$D = \arg\min_{D} \sum_{z} \min_{d \in D} \| z - d \|^2. \quad (11)$$

However, as the dimension of the local descriptor space is usually high (e.g., SIFT has 128 dimensions and LBP has 59 dimensions), a limited number of anchor points is not sufficient to model the entire space well. As a result, there will always be local descriptors with large quantization errors w.r.t. the codebook $D$. Including local descriptors that are far away from the set of anchor points $D$ in Eq. (2) will result in poor learning of $w$. Therefore, rather than using all local descriptors in the image, we compute $\psi(I)$ by choosing only local descriptors that are sufficiently close to our codebook $D$. Specifically, for each local descriptor $z_i$, we measure the distance to its closest anchor point, $\| z_i - d^*(z_i) \|_2^2$, and select it only when the quantization error is smaller than a predefined threshold $\epsilon$. We define a descriptor encoding matrix $A \in \mathbb{R}^{K \times n}$, where $K$ is the number of anchor points and $n$ is the number of local descriptors in the input image, by

$$A(k, i) = \begin{cases} 1, & \text{if } k = \arg\min_{j \in \{1,\ldots,K\}} \| z_i - d_j \|_2^2 \text{ and } \| z_i - d_k \|_2^2 \leq \epsilon, \\ 0, & \text{otherwise.} \end{cases} \quad (12)$$

We then encode each local descriptor as

$$\phi(z_i) = \left[ A(k, i), \; A(k, i)(z_i - d_k)^T \right]^T_{k \in \{1,\ldots,K\}}, \quad (13)$$

and the image feature representation is again computed as $\psi(I) = \sum_i \phi(z_i)$. As each encoded local feature has dimension $K(p+1)$, where $K$ is the number of anchor points and $p$ is the dimension of the local descriptor, the final image feature has the high dimension $K(p+1)$.
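For illustration only, the following numpy sketch builds the hard-assignment feature of Eqs. (11)-(13): each descriptor is vector-quantized against a given codebook, descriptors whose quantization error exceeds a threshold are discarded, and the surviving descriptors are accumulated into $\psi(I)$. The codebook, the threshold value, and all variable names are placeholders of this sketch rather than the paper's implementation.

```python
import numpy as np

def spv_hard(Z, D, epsilon):
    """Hard-assignment Selective Pooling Vector following Eqs. (11)-(13).

    Z: (n, p) local descriptors.  D: (K, p) codebook of anchor points (e.g., from K-means).
    Returns psi(I) with dimension K * (p + 1).
    """
    n, p = Z.shape
    K = D.shape[0]
    d2 = ((Z[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)   # squared distances, (n, K)
    nearest = d2.argmin(axis=1)                               # closest anchor per descriptor
    keep = d2[np.arange(n), nearest] <= epsilon               # selective pooling: drop large errors

    psi = np.zeros((K, p + 1))
    for i in np.flatnonzero(keep):
        k = nearest[i]
        psi[k, 0] += 1.0                                      # A(k, i)
        psi[k, 1:] += Z[i] - D[k]                             # A(k, i) * (z_i - d_k)
    return psi.ravel()

# Illustrative usage with random data and a random codebook.
rng = np.random.default_rng(1)
Z = rng.normal(size=(1000, 64))
D = rng.normal(size=(32, 64))
print(spv_hard(Z, D, epsilon=150.0).shape)                    # (32 * 65,) = (2080,)
```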
Note that the matrix A is a binary matrix encoding which descriptors are selected with respect to each anchor point, i.e., not all local descriptors are used to construct our image feature.

2.4. Refined Selective Pooling Vector

The aforementioned feature embedding scheme uses binary hard assignment, or selection, for the encoding matrix A; it does not take into account the fact that local descriptors are typically distributed in a non-uniform way in the space. Soft assignment with a Gaussian Mixture Model has been shown to be superior to hard assignment with K-means in previous bag-of-features based recognition work [16]. Accordingly, we refine our feature representation by incorporating the properties of a GMM based on the above theory.

From the training images, we first sample a subset of the local descriptors to train a Gaussian Mixture Model with the standard EM algorithm. We denote the learned GMM by $\sum_{k=1}^{K} v_k \, \mathcal{N}(\mu_k, \Sigma_k)$. Rather than using binary assignment for selective pooling, we define the encoding matrix A by the posterior probabilities of the local descriptors belonging to each Gaussian mixture:

$$A(k, i) = \frac{v_k \, \mathcal{N}(z_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} v_j \, \mathcal{N}(z_i; \mu_j, \Sigma_j)}. \quad (14)$$

Each row of the matrix A indicates which descriptors are softly selected for the corresponding mixture or anchor point, while each column represents the soft vector quantization encoding coefficients of a local descriptor with respect to all Gaussian mixtures.
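The encoding matrix of Eq. (14) is a standard GMM responsibility computation; below is a minimal sketch that assumes diagonal covariances (a choice of this sketch, not something the paper mandates) and works in the log domain for numerical stability. In practice, the GMM parameters would come from EM on a subsample of training descriptors, as described above.

```python
import numpy as np

def gmm_posteriors(Z, weights, means, variances):
    """Soft-assignment encoding matrix A of Eq. (14), assuming diagonal covariances.

    Z:         (n, p) local descriptors.
    weights:   (K,)   mixture weights v_k.
    means:     (K, p) component means mu_k.
    variances: (K, p) per-dimension variances (the diagonal of Sigma_k).
    Returns A with shape (K, n); each column sums to 1.
    """
    diff = Z[None, :, :] - means[:, None, :]                      # (K, n, p)
    # log of v_k * N(z_i; mu_k, Sigma_k), dropping the shared (2*pi)^(-p/2) constant
    log_lik = (np.log(weights)[:, None]
               - 0.5 * np.log(variances).sum(axis=1)[:, None]
               - 0.5 * (diff ** 2 / variances[:, None, :]).sum(axis=2))
    log_lik -= log_lik.max(axis=0, keepdims=True)                 # stabilize the normalization
    A = np.exp(log_lik)
    return A / A.sum(axis=0, keepdims=True)
```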

With the newly defined encoding matrix A, we can define different procedures of selective pooling. (Note that pooling here is used slightly differently from its traditional sense: we pool not only the encoding coefficients but also their corresponding local descriptors.)

Radius pooling: set the elements of A to zero if the Mahalanobis distance between a descriptor and a GMM center exceeds a threshold $\tau$:
$$B(k, i) = \begin{cases} A(k, i), & (z_i - \mu_k)^T \Sigma_k^{-1} (z_i - \mu_k) < \tau, \\ 0, & \text{otherwise.} \end{cases} \quad (15)$$

Posterior thresholding: instead of inspecting the Mahalanobis distances directly, a simple approximation is to set the elements of A to zero if they are smaller than some threshold $\sigma$:
$$B(k, i) = \begin{cases} A(k, i), & A(k, i) > \sigma, \\ 0, & \text{otherwise.} \end{cases} \quad (16)$$

k-nearest neighbor pooling: the problem with radius pooling under a fixed threshold is that it does not adapt well to the local density of the feature space, and thus is typically inferior to a k-nearest neighbor scheme. Therefore, as an approximation, we use k-nearest neighbor pooling by retaining the k largest values of each row of A and setting the rest to zero.

Max pooling: in the extreme case, we can do 1-nearest neighbor pooling by keeping only the largest value in each row of A and setting all others to zero, which we call max pooling:
$$B(k, i) = \begin{cases} A(k, i), & A(k, i) > A(k, j) \;\; \forall j \neq i, \\ 0, & \text{otherwise.} \end{cases} \quad (17)$$

As we will see in the experiment section, max pooling works very well in general for our SPV, echoing the recent success of max pooling in image recognition [24, 21].

Based on Eq. (13), we then encode each local descriptor $z_i$ using the new encoding matrix B:
$$\phi(z_i) = \left[ B(k, i), \; B(k, i)(z_i - \mu_k)^T \right]^T_{k \in \{1,\ldots,K\}}. \quad (18)$$

Inspired by previous work on the Super Vector and Fisher Vector image representations, we normalize the feature representation in order to make learning the linear classifier easier. Specifically, we modify the local descriptor embedding step by incorporating Gaussian covariance normalization and feature cardinality normalization as below:
$$\phi(z_i) = \left[ \tilde{B}(k, i), \; \tilde{B}(k, i)\left[ \Sigma_k^{-1/2}(z_i - \mu_k) \right]^T \right]^T_{k \in \{1,\ldots,K\}}, \quad (19)$$
where $\tilde{B}(k, i) = B(k, i) / \| B(k, :) \|_1$ and $\| B(k, :) \|_1$ is the sum of the k-th row of B. Note that the covariance normalization corresponds to feature whitening within each Gaussian mixture, which evenly spreads the feature energy and has been shown to be effective for training linear classifiers.
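The following sketch (ours, with diagonal covariances and illustrative names) applies k-nearest neighbor pooling to the posterior matrix A of Eq. (14), with k = 1 giving max pooling, and then aggregates the normalized embedding of Eq. (19) over all descriptors of an image. Setting k to the total number of descriptors keeps every entry of A, so the feature degenerates toward the weighted average used by the Super Vector, which is exactly the relationship discussed next.

```python
import numpy as np

def knn_pool(A, k=1):
    """Keep the k largest entries of each row of A and zero out the rest (k = 1 is max pooling)."""
    B = np.zeros_like(A)
    top = np.argsort(A, axis=1)[:, -k:]                 # indices of the k largest values per row
    rows = np.repeat(np.arange(A.shape[0]), k)
    B[rows, top.ravel()] = A[rows, top.ravel()]
    return B

def spv_encode(Z, A, means, variances, k=1):
    """Refined Selective Pooling Vector of Eq. (19), aggregated over all descriptors of an image.

    Z: (n, p) descriptors; A: (K, n) posteriors from Eq. (14);
    means, variances: (K, p) GMM parameters (diagonal covariances assumed).
    """
    B = knn_pool(A, k)
    row_sum = B.sum(axis=1, keepdims=True)              # cardinality normalization ||B(k, :)||_1
    Bn = np.divide(B, row_sum, out=np.zeros_like(B), where=row_sum > 0)
    K, p = means.shape
    psi = np.zeros((K, p + 1))
    psi[:, 0] = Bn.sum(axis=1)
    for j in range(K):
        # Whitened residuals Sigma_k^{-1/2} (z_i - mu_k), weighted by the pooled coefficients.
        resid = (Z - means[j]) / np.sqrt(variances[j])  # (n, p)
        psi[j, 1:] = Bn[j] @ resid
    return psi.ravel()
```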
2.5. Relationship to Previous Work

Our new feature representation shares a similar form with several previous works, such as Super Vector coding [29], the Fisher Vector [19] and VLAD [11]. Specifically, in the extreme case where we set k in k-nearest neighbor pooling to the number of all local descriptors in the input image, we have
$$\phi(z_i) = \left[ \tilde{A}(k, i), \; \tilde{A}(k, i)\left[ \Sigma_k^{-1/2}(z_i - \mu_k) \right]^T \right]^T_{k \in \{1,\ldots,K\}}, \quad (20)$$
where $\tilde{A}(k, i) = A(k, i) / \sum_{k,i} A(k, i)$, and our SPV becomes equivalent to the Super Vector image representation, which is in turn similar to the Fisher Vector and VLAD. In contrast to our SPV feature, these previous works use all available local descriptors from the input image to construct their image features. Using all local descriptors for weighted averaging can suppress the intra-class variance of the local descriptors, which is desirable for coarse-grained image classification. However, for fine-grained image classification, which is more sensitive to quantization errors of the local descriptors, keeping the intra-class variance is important for distinguishing different subcategories.

Average pooling in the Super Vector and Fisher Vector tends to smear the local object parts that are important for recognition. Although the GMM itself performs a certain degree of selective pooling by assigning lower weights to descriptors far away from the mixture centers, the fact that the GMM is a generative model for the entire space makes the exponential weight decay not fast enough for selective pooling. Therefore, some amount of averaging effect still exists in the Super Vector and Fisher Vector. Figure 2 visualizes the feature differences between our Selective Pooling Vector and the Super Vector, using the gradient map as an approximation of the SIFT descriptor (since SIFT is hard to visualize). The area in the red circle in (b) marks the most confident local descriptor for a particular Gaussian component; (c) shows the local descriptor pooled by our SPV, while (d) shows the descriptor pooled by the Super Vector. As we can see, Super Vector coding blurs the fine local details that could be important for fine-grained recognition, even though its feature construction is based on a weighted average.

It is also worth noting that sparsification is a common practice used with the Fisher Vector to speed up computation; it is typically done by setting A(k, i) to zero for very small values. However, the motivation for that sparsification is mainly speed, which is very different from our selective pooling. In particular, our selective pooling is much more aggressive in order to ensure accurate function learning for fine-grained recognition tasks, and in the extreme case we select only a single local descriptor for each Gaussian mixture.

The extreme case of the Selective Pooling Vector using max pooling (with no feature averaging) is interesting, as it makes our image feature representation only remotely similar to the Super Vector or Fisher Vector. As we will show in the experiment section, SPV with max pooling usually gives the best performance. Besides, our algorithm may shed light on the understanding of max pooling from a function learning perspective, beyond the traditional intuitive explanation of achieving local translation invariance.

Figure 2. Visualization of the feature space for the Selective Pooling Vector and the Super Vector. (a) Input image. (b) Gradient feature map, with the circled area marking the pooled local descriptor for a Gaussian mixture. (c) The gradient feature pooled by our SPV. (d) The gradient feature pooled by the Super Vector. Super Vector coding blurs the fine local details that could be important for fine-grained recognition, even though its feature construction is based on a weighted average. Since we cannot easily visualize SIFT descriptors, we use the gradient map as an approximation of SIFT for illustration purposes.

2.6. Encoding Spatial Information

To incorporate discriminative spatial information for image recognition, we can apply an idea similar to spatial pyramid matching [15], where each image is partitioned into blocks of different sizes (e.g., 1×1, 4×1) at different spatial scales. Alternatively, we can follow the rough part alignment framework [9] to segment the object and divide it into different subregions. We then extract an SPV from each of the spatial blocks or subregions. The final image feature representation is obtained by concatenating all Selective Pooling Vectors.

3. Experimental Results

In this section, we apply the proposed Selective Pooling Vector (SPV) to fine-grained recognition tasks, including face recognition and fine-grained object recognition. Extensive experiments have been carried out on several standard benchmark datasets. We show that our algorithm outperforms both the Super Vector and Fisher Vector representations on these fine-grained problems, and favorable comparisons with state-of-the-art fine-grained recognition methods demonstrate the effectiveness of our new image feature. In our experiments, we find that k-nearest neighbor pooling typically works better than radius pooling or posterior thresholding, as the latter two are more sensitive to parameter tuning. Therefore, in the following experiments, we only report results for SPV with k-nearest neighbor pooling.

3.1. Face recognition

The standard CMU Multi-PIE face dataset [10] is used as the benchmark to compare the proposed algorithm with the state of the art. The database contains 337 subjects with a spectrum of variations caused by different poses, expressions, and illumination conditions. The dataset is challenging due to the large number of subjects and the large heterogeneous appearance variations. We evaluate the algorithms with the standard experimental settings [25, 23]. Among the 337 subjects, the 249 subjects in Session 1 are used for training; Sessions 2, 3 and 4 are used for testing. For each subject in the training set, 7 frontal face images with neutral expression taken under extreme illumination conditions are included. For the testing set, all images taken under 20 illumination conditions are used. We report the recognition accuracy for each session separately.

For all of the experiments on the CMU Multi-PIE dataset, we first resize the images to 80. We then densely extract SIFT descriptors [17] and LBP descriptors [2] on a grid with a step of 3 pixels at different scales (8×8, 12×12, 16×16, 24×24, 32×32), and reduce the feature dimension to 80 through PCA. A GMM with 512 components is learned, and we build a three-level spatial pyramid (1×1, 2×2, 3×1) to incorporate the spatial information. Finally, we learn a linear SVM classifier for classification.
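To make the spatial encoding concrete, here is a hedged sketch (not the authors' code) that splits an image's descriptors into the cells of a 1×1, 2×2, 3×1 pyramid and concatenates one SPV per cell; encode_fn stands for any per-cell encoder, for instance the spv_encode sketch above with the GMM parameters bound in advance.

```python
import numpy as np

def spatial_pyramid_spv(Z, xy, image_size, encode_fn, levels=((1, 1), (2, 2), (3, 1))):
    """Concatenate one SPV per cell of a spatial pyramid (1x1 + 2x2 + 3x1 = 8 cells by default).

    Z:          (n, p) local descriptors.
    xy:         (n, 2) descriptor locations as (x, y) pixel coordinates.
    image_size: (width, height) of the image.
    encode_fn:  callable mapping an (m, p) descriptor subset to a 1-D SPV vector,
                e.g. the spv_encode sketch above with fixed GMM parameters.
    """
    w, h = image_size
    blocks = []
    for nx, ny in levels:
        # Assign each descriptor to a cell of the nx-by-ny grid.
        cx = np.minimum((xy[:, 0] * nx / w).astype(int), nx - 1)
        cy = np.minimum((xy[:, 1] * ny / h).astype(int), ny - 1)
        for ix in range(nx):
            for iy in range(ny):
                in_cell = (cx == ix) & (cy == iy)
                blocks.append(encode_fn(Z[in_cell]))
    return np.concatenate(blocks)
```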

We first evaluate the effect of k in k-nearest neighbor selective pooling. One extreme case is to keep only the largest value in each row of the encoding matrix A, which corresponds to max pooling. The max pooling approach can be interpreted as finding the most confident local descriptor for each GMM component for the final classification. The other extreme case is to keep all the values, so that a weighted average of local descriptors is computed for each GMM component. In this case, the proposed pooling feature degenerates to the Super Vector [29], which bears a large similarity to the Fisher Vector [19]. We vary the value of k and study the corresponding performance changes, as shown in Table 1. We find that keeping a small number of local descriptors for each component gives superior results. For k = 1, the recognition accuracies are already quite high for all three sessions: 96.3%, 96.2% and 96.7%. For k = 2 and k = 3, the performance is similar. However, the performance tends to drop as k gets larger. If we keep all the local descriptors (k = 1578), which degenerates our feature to the Super Vector, the performance drops significantly to 92.0%, 92.4% and 92.7% on the three sessions, respectively. This performance change is well explained by the selective pooling analysis discussed in Section 2: local descriptors with low posterior probabilities have large quantization errors that are destructive to learning the classification function.

Table 1. Recognition accuracy of our Selective Pooling Vector on CMU Multi-PIE for different values of k in k-NN pooling.

k-NN pooling      session 2   session 3   session 4
k = 1             96.3%       96.2%       96.7%
k = 2             -           96.3%       96.6%
k = 3             -           96.1%       96.4%
-                 -           94.9%       94.7%
-                 -           93.6%       93.8%
-                 -           92.5%       92.7%
k = 1578 (all)    92.0%       92.4%       92.7%

Although tuning the number of neighbors k for pooling might increase the performance (e.g., there is a gain on Session 3), we use max pooling from now on for its simplicity, efficiency and effectiveness.

Comparison with the state of the art. We compare the proposed local feature embedding algorithm with several state-of-the-art face recognition algorithms, including face recognition using sparse representation [23], supervised sparse coding [25], and the recent structured sparse coding [12]. The face recognition comparisons are shown in Table 2. The proposed Selective Pooling Vector outperforms the latest work [12] by 1% to 3% on the three sessions. We achieve the highest recognition rates, 96.3%, 96.2% and 96.7%, on all three sessions.

Table 2. Comparison with the state of the art on CMU Multi-PIE for face recognition.

Algorithms               session 2   session 3   session 4
SRC [23]                 91.4%       90.3%       90.2%
USC [25]                 94.6%       91.0%       92.5%
SSC [25]                 95.2%       93.4%       95.1%
Struct. Sparsity [12]    95.7%       94.9%       93.7%
SPV (SPM)                96.3%       96.2%       96.7%

3.2. Fine-grained recognition

Recently, there has been a growing interest in fine-grained recognition problems. Many powerful algorithms have been proposed in the last several years, including high-throughput template matching [27], unsupervised template learning [26], segmentation-based alignment [9], part localization [6], and different flavors of feature encoding and learning algorithms (e.g., the Fisher vector [19, 9], LLC [21, 28], and POOF [4]). We evaluate the effectiveness of the proposed Selective Pooling Vector by comparing its performance with the aforementioned state-of-the-art algorithms on two challenging benchmark fine-grained datasets: Caltech-UCSD Birds 2010 [22] and the Stanford Dogs dataset [14]. The Caltech-UCSD Birds 2010 dataset contains 6,044 images from 200 bird species; some of the species have very subtle inter-class differences. We adopt the standard training/testing split [22] on the Bird dataset, i.e., around 15 training and 15 test images per category. The Stanford Dogs dataset [14] is another popular benchmark containing 20,580 images of 120 breeds of dogs; it is a carefully selected subset of ImageNet [1].

For the experiments on these two datasets, we follow the standard evaluation protocol [27, 26, 6]: we augment the training set by mirroring the training images so that it is doubled, we use the labeled bounding boxes to normalize the images, and we evaluate performance by the category-normalized mean accuracy. We densely extract SIFT descriptors [17] from the opponent color space [20] and LBP descriptors [2] on a grid with a step of 3 pixels at five scales (16×16, 24×24, 32×32, 40×40, 48×48). The dimension of the local descriptors is then reduced by PCA, and a GMM with K components is learned. Finally, the Selective Pooling Vector representation is fed to a linear SVM classifier.

Gavves et al. [9] have shown that a rough part-level alignment with spatial information encoding can improve the recognition accuracy significantly. Accordingly, we report fine-grained object recognition results with two different spatial information encoding methods. The first is the traditional spatial pyramid matching algorithm with three layers (1×1, 2×2, 4×1). The second is the spatial encoding algorithm introduced by Gavves et al. [9]. First, we use GrabCut [18] on the labeled bounding box to compute an accurate foreground segmentation. Second, we compute the mean and covariance of the pixels in the segmentation mask and accordingly fit an ellipse to these pixels. Third, we divide the principal axis of the ellipse equally into four segments, and define the regions falling into each segment as object parts. Finally, for each segment region we extract our Selective Pooling Vector, and we concatenate all the vectors to form the final object representation.
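The part-level alignment described above can be approximated with the following sketch (an illustration under our own assumptions, not the authors' implementation): given a foreground mask, we take the mean and covariance of the foreground pixel coordinates, use the principal axis of the resulting ellipse, and split descriptors into four parts along that axis; computing the mask itself, e.g., with GrabCut on the bounding box, is assumed to happen elsewhere. An SPV can then be extracted per part with the earlier sketches and the four vectors concatenated into the final object representation.

```python
import numpy as np

def part_labels_from_mask(mask, xy, n_parts=4):
    """Assign descriptors to object parts by splitting the foreground's principal axis.

    mask: (H, W) boolean foreground segmentation (e.g., obtained with GrabCut on the box).
    xy:   (n, 2) descriptor locations as (x, y) pixel coordinates.
    Returns an integer part index in [0, n_parts) for every descriptor.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)       # foreground pixels as (x, y)
    mean = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)                      # 2x2 covariance of the mask pixels
    eigvals, eigvecs = np.linalg.eigh(cov)
    axis = eigvecs[:, np.argmax(eigvals)]                # principal axis of the fitted ellipse

    proj_fg = (pts - mean) @ axis                        # project mask pixels onto the axis
    proj_z = (xy - mean) @ axis                          # project descriptor locations
    cuts = np.linspace(proj_fg.min(), proj_fg.max(), n_parts + 1)[1:-1]
    return np.clip(np.digitize(proj_z, cuts), 0, n_parts - 1)
```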

3.2.1. Caltech-UCSD Birds 2010 Dataset

For the fine-grained recognition experiments on the Bird dataset [22], we first compare our Selective Pooling Vector with state-of-the-art feature coding and learning algorithms, i.e., LLC [21], multiple kernel learning [28], and the Fisher Vector [19], under the same settings (same local descriptors and same number of Gaussian mixtures). To encode the spatial information, we first use the traditional 3-layer spatial pyramid for all algorithms. The comparison results are shown in Table 3. We observe a much higher accuracy than LLC [28] on the Bird dataset: a significant performance leap from 18% to 46.7%. Compared with the state-of-the-art Fisher Vector algorithm [19], our algorithm still works much better, outperforming it by more than 5%. Since LLC only uses the pooling coefficients for classification, these coefficients are too coarse to distinguish the subtle inter-class differences in fine-grained recognition tasks.

Table 3. Comparison with popular feature learning algorithms on the Caltech-UCSD Birds dataset.

Algorithms                        Accuracy
LLC [28]                          18.0%
Multiple Kernel Learning [28]     19.0%
Fisher Vector [19]                41.1%
SPV (SPM)                         46.7%

The Fisher Vector algorithm and our algorithm both preserve the local descriptor information, which helps to differentiate the subtle differences between fine-grained object categories. However, the Fisher Vector uses all local descriptors to construct the feature representation (i.e., average pooling), while our feature discards local descriptors that are far away from the Gaussian mixture centers and makes use of only the most confident local descriptors for classification. Therefore, the function learning underlying our new feature can be more accurate, and as a result we achieve better performance.

Comparisons between our algorithm and many state-of-the-art algorithms reported on this bird dataset [22] are shown in Table 4. In this case, we use the segmentation alignment algorithm [9] to encode the spatial information, which increases our performance by 2.2% compared with the SPM result in Table 3. As we can see from Table 4, the proposed Selective Pooling Vector clearly outperforms all the state of the art. Compared with the prior art [6], which is based on an elegant joint framework of a deformable parts model [8] and a segmentation algorithm [18] built on the Fisher Vector, our algorithm improves the accuracy from 47.3% to 48.9%, with a much simpler learning and testing scheme.

Table 4. Comparison with the state of the art on the Caltech-UCSD Birds dataset.

Algorithms                               Accuracy
Co-Segmentation [5]                      23.3%
Discriminative color descriptors [13]    26.7%
Unsupervised template learning [26]      28.2%
Detection+Segmentation [3]               30.2%
DPM+Segmentation+Fisher vector [6]       47.3%
SPV (Alignment)                          48.9%

3.2.2. Stanford Dogs Dataset

Compared with the Bird dataset [22], the Stanford Dogs dataset [14] contains more images and has even larger shape and pose variations. We again first report comparisons with LLC coding [13] and Fisher Vector coding [19] under the same experimental setup with a spatial pyramid. From Table 5, we again observe a big performance improvement over LLC, from 14.5% [13] to 47.2%. Compared with the Fisher Vector under the same settings, our algorithm again performs much better, around 6% higher. These results are consistent with our observations on the Bird dataset.

Table 5. Comparison with popular feature learning algorithms on the Stanford Dogs dataset.

Algorithms             Accuracy
LLC [13]               14.5%
Fisher Vector [19]     41.0%
SPV (SPM)              47.2%

We then report comparisons between our algorithm and state-of-the-art algorithms on this dog dataset in Table 6. Again, we use the spatial alignment algorithm of [9] to encode the spatial information. This time, it increases our performance from 47.2% with SPM to 52.0%, a larger leap than what we observe on the bird dataset: due to the larger shape and pose variations in the Stanford Dogs dataset, spatial alignment helps more. On this dataset, the unsupervised template learning algorithm [26] achieved a recognition accuracy of 38.0%, and the segmentation-based frameworks [6, 9] showed great success, achieving 45.6% and 50.1%, respectively. With the spatial alignment algorithm introduced by [9], we achieve an accuracy of 52.0%, outperforming the DPM-and-segmentation algorithm [6] by 6.4% and the prior best result [9] by 1.9%. Note that the difference between our algorithm and that of [9] is the use of the Selective Pooling Vector rather than the Fisher Vector.

Table 6. Comparison with the state of the art on the Stanford Dogs dataset.

Algorithms                               Accuracy
TriCoS [7]                               26.9%
Discriminative color descriptors [13]    28.1%
Unsupervised template learning [26]      38.0%
DPM+Segmentation+Fisher vector [6]       45.6%
Alignment+Fisher vector [9]              50.1%
SPV (Alignment)                          52.0%

3.3. Discussion

We have shown the superior performance of our SPV over state-of-the-art algorithms on several fine-grained recognition tasks. In particular, we compared with the similar feature representations of the Super Vector and Fisher Vector in the frameworks of the spatial pyramid and of spatial alignment [9]. In both cases, our SPV outperforms them significantly. One interesting observation is that our SPV brings larger improvements over the Super Vector when objects are not very well aligned (e.g., when using the spatial pyramid in Tables 3 and 5), indicating that our selective pooling is more robust than the average pooling used in the Super Vector and Fisher Vector on fine-grained recognition tasks.

4. Conclusion

In this paper, we propose a novel image feature representation called the Selective Pooling Vector.

The new image feature is derived from nonlinear function learning by linear approximation in an embedded high-dimensional space. Different from previous work, we ensure the function learning accuracy by selecting only local descriptors that are confident. We apply our algorithm to the CMU Multi-PIE dataset for face recognition and to fine-grained recognition tasks on the Caltech-UCSD Birds 2010 dataset and the Stanford Dogs dataset, in all cases outperforming the state-of-the-art handcrafted features.

References

[1] The ImageNet dataset. http://www.image-net.org/.
[2] T. Ahonen, A. Hadid, and M. Pietikäinen. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2006.
[3] A. Angelova and S. Zhu. Efficient object detection and segmentation for fine-grained recognition. In CVPR, 2013.
[4] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, 2013.
[5] Y. Chai, V. Lempitsky, and A. Zisserman. BiCoS: A bi-level co-segmentation method for image classification. In ICCV, 2011.
[6] Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic segmentation and part localization for fine-grained categorization. In ICCV, 2013.
[7] Y. Chai, E. Rahtu, V. Lempitsky, L. Van Gool, and A. Zisserman. TriCoS: A tri-level class-discriminative co-segmentation method for image classification. In ECCV, 2012.
[8] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 2010.
[9] E. Gavves, B. Fernando, C. Snoek, A. Smeulders, and T. Tuytelaars. Fine-grained categorization by alignments. In ICCV, December 2013.
[10] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image Vision Comput., 2010.
[11] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[12] K. Jia, T.-H. Chan, and Y. Ma. Robust and practical face recognition via structured sparsity. In ECCV, 2012.
[13] R. Khan, J. van de Weijer, F. S. Khan, D. Muselet, C. Ducottet, and C. Barat. Discriminative color descriptors. In CVPR, 2013.
[14] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[16] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In ICCV, 2011.
[17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[18] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graph., 2004.
[19] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 2013.
[20] K. van de Sande, T. Gevers, and C. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2010.
[21] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[22] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
[23] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 2009.
[24] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[25] J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. In CVPR, 2010.
[26] S. Yang, L. Bo, J. Wang, and L. G. Shapiro. Unsupervised template learning for fine-grained object recognition. In NIPS, 2012.
[27] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
[28] B. Yao, A. Khosla, and L. Fei-Fei. Combining randomization and discrimination for fine-grained image categorization. In CVPR, 2011.
[29] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification using super-vector coding of local image descriptors. In ECCV, 2010.


More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Beyond Bags of features Spatial information & Shape models

Beyond Bags of features Spatial information & Shape models Beyond Bags of features Spatial information & Shape models Jana Kosecka Many slides adapted from S. Lazebnik, FeiFei Li, Rob Fergus, and Antonio Torralba Detection, recognition (so far )! Bags of features

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

arxiv: v1 [cs.cv] 15 Mar 2014

arxiv: v1 [cs.cv] 15 Mar 2014 Geometric VLAD for Large Scale Image Search Zixuan Wang, Wei Di, Anurag Bhardwaj, Vignesh Jagadeesh, Robinson Piramuthu Dept.of Electrical Engineering, Stanford University, CA 94305 ebay Research Labs,

More information

A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization

A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization Bangpeng Yao 1 Gary Bradski 2 Li Fei-Fei 1 1 Computer Science Department, Stanford University, Stanford, CA 2 Industrial

More information

Class 5: Attributes and Semantic Features

Class 5: Attributes and Semantic Features Class 5: Attributes and Semantic Features Rogerio Feris, Feb 21, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Project

More information

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj Proceedings of the 5th International Symposium on Communications, Control and Signal Processing, ISCCSP 2012, Rome, Italy, 2-4 May 2012 ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING

More information

Patch Descriptors. EE/CSE 576 Linda Shapiro

Patch Descriptors. EE/CSE 576 Linda Shapiro Patch Descriptors EE/CSE 576 Linda Shapiro 1 How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

Fuzzy based Multiple Dictionary Bag of Words for Image Classification

Fuzzy based Multiple Dictionary Bag of Words for Image Classification Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2196 2206 International Conference on Modeling Optimisation and Computing Fuzzy based Multiple Dictionary Bag of Words for Image

More information

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009 Analysis: TextonBoost and Semantic Texton Forests Daniel Munoz 16-721 Februrary 9, 2009 Papers [shotton-eccv-06] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context

More information

Exemplar-specific Patch Features for Fine-grained Recognition

Exemplar-specific Patch Features for Fine-grained Recognition Exemplar-specific Patch Features for Fine-grained Recognition Alexander Freytag 1, Erik Rodner 1, Trevor Darrell 2, and Joachim Denzler 1 1 Computer Vision Group, Friedrich Schiller University Jena, Germany

More information

Fisher and VLAD with FLAIR

Fisher and VLAD with FLAIR Fisher and VLAD with FLAIR Koen E. A. van de Sande 1 Cees G. M. Snoek 1 Arnold W. M. Smeulders 12 1 ISLA, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands 2 Centrum Wiskunde &

More information

An Associate-Predict Model for Face Recognition FIPA Seminar WS 2011/2012

An Associate-Predict Model for Face Recognition FIPA Seminar WS 2011/2012 An Associate-Predict Model for Face Recognition FIPA Seminar WS 2011/2012, 19.01.2012 INSTITUTE FOR ANTHROPOMATICS, FACIAL IMAGE PROCESSING AND ANALYSIS YIG University of the State of Baden-Wuerttemberg

More information

Experiments of Image Retrieval Using Weak Attributes

Experiments of Image Retrieval Using Weak Attributes Columbia University Computer Science Department Technical Report # CUCS 005-12 (2012) Experiments of Image Retrieval Using Weak Attributes Felix X. Yu, Rongrong Ji, Ming-Hen Tsai, Guangnan Ye, Shih-Fu

More information

Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014

Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014 Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014 Problem Definition Viewpoint estimation: Given an image, predicting viewpoint for object of

More information

Multiple cosegmentation

Multiple cosegmentation Armand Joulin, Francis Bach and Jean Ponce. INRIA -Ecole Normale Supérieure April 25, 2012 Segmentation Introduction Segmentation Supervised and weakly-supervised segmentation Cosegmentation Segmentation

More information

Patch Descriptors. CSE 455 Linda Shapiro

Patch Descriptors. CSE 455 Linda Shapiro Patch Descriptors CSE 455 Linda Shapiro How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar

More information

Multipath Sparse Coding Using Hierarchical Matching Pursuit

Multipath Sparse Coding Using Hierarchical Matching Pursuit Multipath Sparse Coding Using Hierarchical Matching Pursuit Liefeng Bo, Xiaofeng Ren ISTC Pervasive Computing, Intel Labs Seattle WA 98195, USA {liefeng.bo,xiaofeng.ren}@intel.com Dieter Fox University

More information

Metric learning approaches! for image annotation! and face recognition!

Metric learning approaches! for image annotation! and face recognition! Metric learning approaches! for image annotation! and face recognition! Jakob Verbeek" LEAR Team, INRIA Grenoble, France! Joint work with :"!Matthieu Guillaumin"!!Thomas Mensink"!!!Cordelia Schmid! 1 2

More information

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION Maral Mesmakhosroshahi, Joohee Kim Department of Electrical and Computer Engineering Illinois Institute

More information

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects

More information

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce Object Recognition Computer Vision Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce How many visual object categories are there? Biederman 1987 ANIMALS PLANTS OBJECTS

More information

Robust Scene Classification with Cross-level LLC Coding on CNN Features

Robust Scene Classification with Cross-level LLC Coding on CNN Features Robust Scene Classification with Cross-level LLC Coding on CNN Features Zequn Jie 1, Shuicheng Yan 2 1 Keio-NUS CUTE Center, National University of Singapore, Singapore 2 Department of Electrical and Computer

More information

Computer Vision. Exercise Session 10 Image Categorization

Computer Vision. Exercise Session 10 Image Categorization Computer Vision Exercise Session 10 Image Categorization Object Categorization Task Description Given a small number of training images of a category, recognize a-priori unknown instances of that category

More information

Evaluation and comparison of interest points/regions

Evaluation and comparison of interest points/regions Introduction Evaluation and comparison of interest points/regions Quantitative evaluation of interest point/region detectors points / regions at the same relative location and area Repeatability rate :

More information

Normalized Texture Motifs and Their Application to Statistical Object Modeling

Normalized Texture Motifs and Their Application to Statistical Object Modeling Normalized Texture Motifs and Their Application to Statistical Obect Modeling S. D. Newsam B. S. Manunath Center for Applied Scientific Computing Electrical and Computer Engineering Lawrence Livermore

More information

Multiple VLAD encoding of CNNs for image classification

Multiple VLAD encoding of CNNs for image classification Multiple VLAD encoding of CNNs for image classification Qing Li, Qiang Peng, Chuan Yan 1 arxiv:1707.00058v1 [cs.cv] 30 Jun 2017 Abstract Despite the effectiveness of convolutional neural networks (CNNs)

More information

Metric Learning for Large-Scale Image Classification:

Metric Learning for Large-Scale Image Classification: Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka

More information

Supervised learning. y = f(x) function

Supervised learning. y = f(x) function Supervised learning y = f(x) output prediction function Image feature Training: given a training set of labeled examples {(x 1,y 1 ),, (x N,y N )}, estimate the prediction function f by minimizing the

More information

Object Detection Using Segmented Images

Object Detection Using Segmented Images Object Detection Using Segmented Images Naran Bayanbat Stanford University Palo Alto, CA naranb@stanford.edu Jason Chen Stanford University Palo Alto, CA jasonch@stanford.edu Abstract Object detection

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

Visual Object Recognition

Visual Object Recognition Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, 14.07.2008 & Kristen Grauman Department

More information

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14 Announcements Computer Vision I CSE 152 Lecture 14 Homework 3 is due May 18, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying Images Chapter 17: Detecting Objects in Images Given

More information

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Anand K. Hase, Baisa L. Gunjal Abstract In the real world applications such as landmark search, copy protection, fake image

More information