Selective Pooling Vector for Fine-grained Recognition


Guang Chen    Jianchao Yang    Hailin Jin    Eli Shechtman    Jonathan Brandt    Tony X. Han
Adobe Research, San Jose, CA, USA        University of Missouri, Columbia, MO, USA
{jiayang, hljin, elishe,

Abstract

We propose a new framework for image recognition by selectively pooling local visual descriptors, and show its superior discriminative power on fine-grained image classification tasks. The representation is based on selecting the most confident local descriptors for nonlinear function learning using a linear approximation in an embedded higher-dimensional space. The advantage of our Selective Pooling Vector over the previous state-of-the-art Super Vector and Fisher Vector representations is that it ensures a more accurate learning function, which proves to be important for classifying details in fine-grained image recognition. Our experimental results corroborate this claim: with a simple linear SVM as the classifier, the Selective Pooling Vector achieves significant performance gains on standard benchmark datasets for various fine-grained tasks, namely the CMU Multi-PIE dataset for face recognition, and the Caltech-UCSD Birds dataset and the Stanford Dogs dataset for fine-grained object categorization. On all datasets we outperform the state of the art and boost the recognition rates to 96.4%, 48.9%, and 52.0%, respectively.

1. Introduction

Image classification is the task of assigning a predefined category label to an input image, which is a fundamental building block for intelligent image content analysis. Even though it has been studied for many years, image classification remains a major challenge. Perhaps one of the most significant developments of the last decade in image recognition is the use of local image features, including the introduction of the Bag-of-Visual-Words (BOV) model and its variants [25, 21, 11, 29, 19], which inspired a large body of research. The BOV model treats an image as a collection of unordered local visual descriptors extracted from small patches, quantizes them into discrete visual words, and then computes a compact histogram representation for image recognition. However, the BOV model discards the spatial order of local descriptors, which limits its descriptive power. To overcome this problem, one particularly popular extension of the BOV model uses spatial pyramids to take the global image structure into account [15], and is now an important component of many state-of-the-art systems. Other vector representations of local image descriptors [11, 29, 19] extend the BOV model and build richer and more discriminative image representations for classification and retrieval tasks.

Besides generic image categorization, there has recently been a growing interest in fine-grained image classification. Even though the aforementioned algorithms perform well on general object categorization tasks, they may be suboptimal for distinguishing finer details. Specific algorithms have been developed over the last several years to tackle the fine-grained recognition problem from various angles. Yao et al. [27] introduced a very high-dimensional histogram over color and gradient pixel values to alleviate the quantization problem. Yang et al. [26] constructed kernel descriptors based on shape, texture and color information for template learning in fine-grained recognition. Chai et al. [7, 6] used Fisher vectors to learn global-level and object-part-level representations.
Another line of research on fine-grained recognition focuses on image alignment by segmenting or detecting object parts before classification. Gavves et al. [9] localized distinctive details by roughly aligning the objects using an ellipse fit to the shape and achieved convincing performance. Chai et al. [5, 7, 6] demonstrated how co-segmentation can be employed to increase recognition accuracy. Angelova et al. [3] proposed a joint framework of detection and segmentation to localize discriminative parts.

Compared with generic image categorization, fine-grained recognition relies on identifying subtle differences in the appearance of specific object parts. To tackle this problem, we propose a new image feature representation that we call the Selective Pooling Vector (SPV). It is derived from learning a Lipschitz-smooth nonlinear classification function in the local descriptor space using a linear approximation in a higher-dimensional embedded space [29]. The selective pooling procedure rejects local descriptors that do not contribute to the function learning, which results in better function learning and improved classification performance on fine-grained recognition tasks.

Figure 1. Framework of our Selective Pooling Vector. (a) Input image. (b) Dense local descriptor extraction and GMM encoding. (c) For each GMM component, we selectively pool the most representative local descriptors. (d) We concatenate the selective pooling vectors from all Gaussian mixtures as the final image representation for a linear classifier. In (c), we show some pooled local parts with circles; the color of each circle denotes the SVM classifier energy associated with that part. As we can see, our algorithm learns the parts that are most discriminative for the fine-grained recognition task.

In brief, to build our Selective Pooling Vector image representation, we first use a Gaussian Mixture Model (GMM) to encode the local descriptors densely extracted from the input image. Then, for each Gaussian mixture, we conduct selective pooling to find the most representative local descriptors, and we concatenate the pooling vectors from all the mixtures to form the final image representation. Simple and grounded in function learning theory, our feature representation turns out to be very effective on fine-grained recognition tasks. Figure 1 illustrates the framework of our Selective Pooling Vector.

It is worth noting that our Selective Pooling Vector shares a similar feature representation form with the Super Vector [29] and the Fisher Vector [19]. These representations are based on aggregation through averaging of all local image descriptors, which works well for coarse-grained image categorization. However, for fine-grained recognition, where the task is to distinguish fine differences between subcategories, including local descriptors far away from the cluster centers might harm the learning of the classification function. Intuitively, the weighted averaging pooling step in the Super Vector and Fisher Vector smears the fine image structures that are important for fine-grained recognition. In contrast, our selective pooling chooses only a few (often only a single) representative local features per mixture component, thus avoiding the excessive averaging and better preserving the fine visual patterns in the original images. We investigate this distinction between our Selective Pooling Vector and the Super Vector and Fisher Vector based methods on several fine-grained recognition tasks.

To demonstrate the effectiveness of the proposed algorithm, we test it on two different fine-grained image classification tasks: face recognition and fine-grained object categorization. Both tasks require distinguishing subtle differences in the appearance of specific object parts. For the face recognition task, we test on the CMU Multi-PIE dataset [10] and achieve a state-of-the-art average accuracy of 96.4% over all three test sessions. For fine-grained object categorization, we test on two popular benchmark datasets, the Caltech-UCSD Birds 2010 dataset [22] and the Stanford Dogs dataset [14], and achieve state-of-the-art classification accuracies of 48.9% and 52.0%, respectively.

2. Selective Pooling Vector Encoding

In this section, we describe the rationale behind our Selective Pooling Vector (SPV) as a new image feature representation.
The image feature construction is inspired by the fact that a nonlinear function in the original space can be learned as a linear function in a high-dimensional embedded space using a first-order approximation [29]. To ensure accurate function learning, we propose a selective pooling procedure that selects the most significant local descriptors, from which we derive our new image feature representation.

2.1. Image Recognition as Nonlinear Function Learning

For image recognition, we represent each image as a bag of local descriptors $I = \{z_1, z_2, \ldots, z_n\}$, where $z_i$ is the $i$-th local descriptor (e.g., SIFT [17] or LBP [2]). For the sake of simplicity, we discuss the two-class problem $c \in \{-1, +1\}$. Assuming that these local descriptors are i.i.d., we look at the log odds ratio for classification,

$$\log \frac{p(I \mid c=+1)}{p(I \mid c=-1)} = \log \prod_{i=1}^{n} \frac{p(z_i \mid c=+1)}{p(z_i \mid c=-1)} = \log \frac{\exp\!\left(\sum_{i=1}^{n} g(z_i, c=+1)\right)}{\exp\!\left(\sum_{i=1}^{n} g(z_i, c=-1)\right)} = \sum_{i=1}^{n} \left\{ g(z_i, c=+1) - g(z_i, c=-1) \right\}, \quad (1)$$

where $g(z_i, c)$ is the potential function that determines the likelihood of $z_i$ belonging to class $c$. Letting $f(z_i) = g(z_i, c=+1) - g(z_i, c=-1)$, the above equation translates into

$$\log \frac{p(I \mid c=+1)}{p(I \mid c=-1)} = \sum_{i=1}^{n} f(z_i). \quad (2)$$

Therefore, if we know the function $f$ in the local image descriptor space, we can classify image $I$ as $c = +1$ if $\sum_{i=1}^{n} f(z_i) > 0$ and $c = -1$ otherwise.
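As a toy illustration of the decision rule in Eq. (2) (this sketch is ours, not code from the paper), the following Python snippet classifies a bag of local descriptors by the sign of the summed per-descriptor scores; the linear score function in the usage example is a hypothetical stand-in for the approximation of $f$ developed in the rest of this section.

```python
import numpy as np

def classify_image(descriptors, score_fn):
    """Classify a bag of local descriptors by the sign of the summed scores.

    descriptors: (n, p) array, one local descriptor per row.
    score_fn:    callable returning per-descriptor scores f(z_i) as an (n,) array.
    Implements Eq. (2): predict c = +1 if sum_i f(z_i) > 0, else c = -1.
    """
    scores = score_fn(descriptors)
    return 1 if scores.sum() > 0 else -1

# Toy usage with a hypothetical linear scoring function w^T z + b.
rng = np.random.default_rng(0)
w, b = rng.normal(size=128), 0.1
descriptors = rng.normal(size=(500, 128))      # 500 SIFT-like descriptors
print(classify_image(descriptors, lambda Z: Z @ w + b))
```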

2.2. Nonlinear Function Learning

As shown in [29], the nonlinear function $f$ can be approximated by locally linear functions if it is sufficiently smooth. Let $D = \{d_1, d_2, \ldots, d_K\} \subset \mathbb{R}^p$ denote a set of anchor points in the local descriptor space, which we call a codebook. For a data sample $z$, denote by $d^*(z) \in D$ its closest anchor point, or codebook item. According to the Taylor expansion, we have

$$f(z) \approx f(d^*(z)) + \nabla f(d^*(z))^T (z - d^*(z)), \quad (3)$$

where the quality of approximating $f(z)$ by $f(d^*(z)) + \nabla f(d^*(z))^T (z - d^*(z))$ is bounded by how close $z$ is to $d^*(z)$. By reformulating Eq. (3) as in [29], we have

$$f(z) \approx \sum_{k=1}^{K} w_k^T \phi_k(z), \quad (4)$$

where

$$\phi_k(z) = r_k(z) \left[ 1, (z - d_k)^T \right]^T, \quad (5)$$
$$w_k = \left[ f(d_k), \nabla f(d_k)^T \right]^T. \quad (6)$$

Here $r_k(z)$ is the vector quantization encoding coefficient of $z$ w.r.t. the codebook $D$, defined as

$$r_k(z) = \begin{cases} 1, & \text{if } k = \arg\min_{j \in \{1,\ldots,K\}} \| z - d_j \|^2, \\ 0, & \text{otherwise.} \end{cases} \quad (7)$$

We denote the concatenations of $\phi_k$ and $w_k$ by $\phi$ and $w$ as follows:

$$\phi(z) = \left[ \phi_k(z) \right]_{k \in \{1,\ldots,K\}}, \quad (8)$$
$$w = \left[ w_k \right]_{k \in \{1,\ldots,K\}}. \quad (9)$$

This is referred to as super-vector coding in [29]. The classification decision function in Eq. (2) can then be expressed as

$$\sum_{i=1}^{n} f(z_i) \approx w^T \sum_{i=1}^{n} \phi(z_i). \quad (10)$$

Given the codebook $D$, it is easy to compute $\sum_{i=1}^{n} \phi(z_i)$, which we denote by $\psi(I)$. However, the function values at the anchor points in $D$, i.e., $w$, are still unknown. Note that if we regard $\psi(I)$ as the image feature, $w$ is essentially a linear classifier, which can be learned from the labeled training data.

2.3. Selective Pooling Vector

According to Eq. (3), the linear approximation accuracy of the function $f$ is bounded by the quantization error $\| z - d^*(z) \|_2^2$. Therefore, we can improve the function approximation accuracy by learning the codebook $D$ to minimize the quantization error. One simple way to learn such a codebook is the K-means algorithm:

$$D = \arg\min_{D} \sum_{z} \min_{d \in D} \| z - d \|^2. \quad (11)$$

However, as the dimension of the local descriptor space is usually high (e.g., SIFT has 128 dimensions and LBP has 59 dimensions), a limited number of anchor points is not sufficient to model the entire space well. As a result, there will always be local descriptors with large quantization errors w.r.t. the codebook $D$. Including local descriptors that are far away from the set of anchor points $D$ in Eq. (2) will result in poor learning of $w$. Therefore, rather than using all local descriptors in the image, we compute $\psi(I)$ by choosing only local descriptors that are sufficiently close to our codebook $D$. Specifically, for each local descriptor $z_i$, we measure the distance to its closest anchor point, $\| z_i - d^*(z_i) \|_2^2$, and select it only when the quantization error is smaller than a predefined threshold $\epsilon$. We define a descriptor encoding matrix $A \in \mathbb{R}^{K \times n}$, where $K$ is the number of anchor points and $n$ is the number of local descriptors in the input image, by

$$A(k, i) = \begin{cases} 1, & \text{if } k = \arg\min_{j \in \{1,\ldots,K\}} \| z_i - d_j \|_2^2 \text{ and } \| z_i - d_k \|_2^2 \leq \epsilon, \\ 0, & \text{otherwise.} \end{cases} \quad (12)$$

We then encode each local descriptor as

$$\phi(z_i) = \left[ A(k, i), \; A(k, i)(z_i - d_k)^T \right]^T_{k \in \{1,\ldots,K\}}, \quad (13)$$

and the image feature representation is again computed as $\psi(I) = \sum_i \phi(z_i)$. As each encoded local feature has dimension $K(p+1)$, where $K$ is the number of anchor points and $p$ is the dimension of the local descriptor, the final image feature has the high dimension $K(p+1)$.
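For illustration only, the following numpy sketch builds the hard-assignment feature of Eqs. (11)-(13): each descriptor is vector-quantized against a given codebook, descriptors whose quantization error exceeds a threshold are discarded, and the surviving descriptors are accumulated into $\psi(I)$. The codebook, the threshold value, and all variable names are placeholders of this sketch rather than the paper's implementation.

```python
import numpy as np

def spv_hard(Z, D, epsilon):
    """Hard-assignment Selective Pooling Vector following Eqs. (11)-(13).

    Z: (n, p) local descriptors.  D: (K, p) codebook of anchor points (e.g., from K-means).
    Returns psi(I) with dimension K * (p + 1).
    """
    n, p = Z.shape
    K = D.shape[0]
    d2 = ((Z[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)   # squared distances, (n, K)
    nearest = d2.argmin(axis=1)                               # closest anchor per descriptor
    keep = d2[np.arange(n), nearest] <= epsilon               # selective pooling: drop large errors

    psi = np.zeros((K, p + 1))
    for i in np.flatnonzero(keep):
        k = nearest[i]
        psi[k, 0] += 1.0                                      # A(k, i)
        psi[k, 1:] += Z[i] - D[k]                             # A(k, i) * (z_i - d_k)
    return psi.ravel()

# Illustrative usage with random data and a random codebook.
rng = np.random.default_rng(1)
Z = rng.normal(size=(1000, 64))
D = rng.normal(size=(32, 64))
print(spv_hard(Z, D, epsilon=150.0).shape)                    # (32 * 65,) = (2080,)
```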
Note that the matrix A is a binary matrix encoding which descriptors are selected with respect to each anchor point, i.e., not all local descriptors are used to construct our image feature.

2.4. Refined Selective Pooling Vector

The aforementioned feature embedding scheme uses binary hard assignment, or selection, for the encoding matrix A; it does not take into account the fact that local descriptors are typically distributed in a non-uniform way in the space. Soft assignment with a Gaussian Mixture Model has been shown to be superior to hard assignment with K-means in previous bag-of-features based recognition work [16]. Accordingly, we refine our feature representation by incorporating the properties of a GMM based on the above theory.

From the training images, we first sample a subset of the local descriptors to train a Gaussian Mixture Model with the standard EM algorithm. We denote the learned GMM by $\sum_{k=1}^{K} v_k \, \mathcal{N}(\mu_k, \Sigma_k)$. Rather than using binary assignment for selective pooling, we define the encoding matrix A by the posterior probabilities of the local descriptors belonging to each Gaussian mixture:

$$A(k, i) = \frac{v_k \, \mathcal{N}(z_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} v_j \, \mathcal{N}(z_i; \mu_j, \Sigma_j)}. \quad (14)$$

Each row of the matrix A indicates which descriptors are softly selected for the corresponding mixture or anchor point, while each column represents the soft vector quantization encoding coefficients of a local descriptor with respect to all Gaussian mixtures.
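The encoding matrix of Eq. (14) is a standard GMM responsibility computation; below is a minimal sketch that assumes diagonal covariances (a choice of this sketch, not something the paper mandates) and works in the log domain for numerical stability. In practice, the GMM parameters would come from EM on a subsample of training descriptors, as described above.

```python
import numpy as np

def gmm_posteriors(Z, weights, means, variances):
    """Soft-assignment encoding matrix A of Eq. (14), assuming diagonal covariances.

    Z:         (n, p) local descriptors.
    weights:   (K,)   mixture weights v_k.
    means:     (K, p) component means mu_k.
    variances: (K, p) per-dimension variances (the diagonal of Sigma_k).
    Returns A with shape (K, n); each column sums to 1.
    """
    diff = Z[None, :, :] - means[:, None, :]                      # (K, n, p)
    # log of v_k * N(z_i; mu_k, Sigma_k), dropping the shared (2*pi)^(-p/2) constant
    log_lik = (np.log(weights)[:, None]
               - 0.5 * np.log(variances).sum(axis=1)[:, None]
               - 0.5 * (diff ** 2 / variances[:, None, :]).sum(axis=2))
    log_lik -= log_lik.max(axis=0, keepdims=True)                 # stabilize the normalization
    A = np.exp(log_lik)
    return A / A.sum(axis=0, keepdims=True)
```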

With the newly defined encoding matrix A, we can define different procedures of selective pooling. (Note that pooling here is used slightly differently from its traditional sense: we pool not only the encoding coefficients but also their corresponding local descriptors.)

Radius pooling: set the elements of A to zero if the Mahalanobis distance between a descriptor and a GMM center exceeds a threshold $\tau$:
$$B(k, i) = \begin{cases} A(k, i), & (z_i - \mu_k)^T \Sigma_k^{-1} (z_i - \mu_k) < \tau, \\ 0, & \text{otherwise.} \end{cases} \quad (15)$$

Posterior thresholding: instead of inspecting the Mahalanobis distances directly, a simple approximation is to set the elements of A to zero if they are smaller than some threshold $\sigma$:
$$B(k, i) = \begin{cases} A(k, i), & A(k, i) > \sigma, \\ 0, & \text{otherwise.} \end{cases} \quad (16)$$

k-nearest neighbor pooling: the problem with radius pooling under a fixed threshold is that it does not adapt well to the local density of the feature space, and thus is typically inferior to a k-nearest neighbor scheme. Therefore, as an approximation, we use k-nearest neighbor pooling by retaining the k largest values of each row of A and setting the rest to zero.

Max pooling: in the extreme case, we can do 1-nearest neighbor pooling by keeping only the largest value in each row of A and setting all others to zero, which we call max pooling:
$$B(k, i) = \begin{cases} A(k, i), & A(k, i) > A(k, j) \;\; \forall j \neq i, \\ 0, & \text{otherwise.} \end{cases} \quad (17)$$

As we will see in the experiment section, max pooling works very well in general for our SPV, echoing the recent success of max pooling in image recognition [24, 21].

Based on Eq. (13), we then encode each local descriptor $z_i$ using the new encoding matrix B:
$$\phi(z_i) = \left[ B(k, i), \; B(k, i)(z_i - \mu_k)^T \right]^T_{k \in \{1,\ldots,K\}}. \quad (18)$$

Inspired by previous work on the Super Vector and Fisher Vector image representations, we normalize the feature representation in order to make learning the linear classifier easier. Specifically, we modify the local descriptor embedding step by incorporating Gaussian covariance normalization and feature cardinality normalization as below:
$$\phi(z_i) = \left[ \tilde{B}(k, i), \; \tilde{B}(k, i)\left[ \Sigma_k^{-1/2}(z_i - \mu_k) \right]^T \right]^T_{k \in \{1,\ldots,K\}}, \quad (19)$$
where $\tilde{B}(k, i) = B(k, i) / \| B(k, :) \|_1$ and $\| B(k, :) \|_1$ is the sum of the k-th row of B. Note that the covariance normalization corresponds to feature whitening within each Gaussian mixture, which evenly spreads the feature energy and has been shown to be effective for training linear classifiers.
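The following sketch (ours, with diagonal covariances and illustrative names) applies k-nearest neighbor pooling to the posterior matrix A of Eq. (14), with k = 1 giving max pooling, and then aggregates the normalized embedding of Eq. (19) over all descriptors of an image. Setting k to the total number of descriptors keeps every entry of A, so the feature degenerates toward the weighted average used by the Super Vector, which is exactly the relationship discussed next.

```python
import numpy as np

def knn_pool(A, k=1):
    """Keep the k largest entries of each row of A and zero out the rest (k = 1 is max pooling)."""
    B = np.zeros_like(A)
    top = np.argsort(A, axis=1)[:, -k:]                 # indices of the k largest values per row
    rows = np.repeat(np.arange(A.shape[0]), k)
    B[rows, top.ravel()] = A[rows, top.ravel()]
    return B

def spv_encode(Z, A, means, variances, k=1):
    """Refined Selective Pooling Vector of Eq. (19), aggregated over all descriptors of an image.

    Z: (n, p) descriptors; A: (K, n) posteriors from Eq. (14);
    means, variances: (K, p) GMM parameters (diagonal covariances assumed).
    """
    B = knn_pool(A, k)
    row_sum = B.sum(axis=1, keepdims=True)              # cardinality normalization ||B(k, :)||_1
    Bn = np.divide(B, row_sum, out=np.zeros_like(B), where=row_sum > 0)
    K, p = means.shape
    psi = np.zeros((K, p + 1))
    psi[:, 0] = Bn.sum(axis=1)
    for j in range(K):
        # Whitened residuals Sigma_k^{-1/2} (z_i - mu_k), weighted by the pooled coefficients.
        resid = (Z - means[j]) / np.sqrt(variances[j])  # (n, p)
        psi[j, 1:] = Bn[j] @ resid
    return psi.ravel()
```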
2.5. Relationship to Previous Work

Our new feature representation shares a similar form with several previous works, such as Super Vector coding [29], the Fisher Vector [19] and VLAD [11]. Specifically, in the extreme case where we set k in k-nearest neighbor pooling to the number of all local descriptors in the input image, we have
$$\phi(z_i) = \left[ \tilde{A}(k, i), \; \tilde{A}(k, i)\left[ \Sigma_k^{-1/2}(z_i - \mu_k) \right]^T \right]^T_{k \in \{1,\ldots,K\}}, \quad (20)$$
where $\tilde{A}(k, i) = A(k, i) / \sum_{k,i} A(k, i)$, and our SPV becomes equivalent to the Super Vector image representation, which is in turn similar to the Fisher Vector and VLAD. In contrast to our SPV feature, these previous works use all available local descriptors from the input image to construct their image features. Using all local descriptors for weighted averaging can suppress the intra-class variance of the local descriptors, which is desirable for coarse-grained image classification. However, for fine-grained image classification, which is more sensitive to quantization errors of the local descriptors, keeping the intra-class variance is important for distinguishing different subcategories.

Average pooling in the Super Vector and Fisher Vector tends to smear the local object parts that are important for recognition. Although the GMM itself performs a certain degree of selective pooling by assigning lower weights to descriptors far away from the mixture centers, the fact that the GMM is a generative model for the entire space makes the exponential weight decay not fast enough for selective pooling. Therefore, some amount of averaging effect still exists in the Super Vector and Fisher Vector. Figure 2 visualizes the feature differences between our Selective Pooling Vector and the Super Vector, using the gradient map as an approximation of the SIFT descriptor (since SIFT is hard to visualize). The area in the red circle in (b) marks the most confident local descriptor for a particular Gaussian component; (c) shows the local descriptor pooled by our SPV, while (d) shows the descriptor pooled by the Super Vector. As we can see, Super Vector coding blurs the fine local details that could be important for fine-grained recognition, even though its feature construction is based on a weighted average.

It is also worth noting that sparsification is a common practice used with the Fisher Vector to speed up computation; it is typically done by setting A(k, i) to zero for very small values. However, the motivation for that sparsification is mainly speed, which is very different from our selective pooling. In particular, our selective pooling is much more aggressive in order to ensure accurate function learning for fine-grained recognition tasks, and in the extreme case we select only a single local descriptor for each Gaussian mixture.

The extreme case of the Selective Pooling Vector using max pooling (with no feature averaging) is interesting, as it makes our image feature representation only remotely similar to the Super Vector or Fisher Vector. As we will show in the experiment section, SPV with max pooling usually gives the best performance. Besides, our algorithm may shed light on the understanding of max pooling from a function learning perspective, beyond the traditional intuitive explanation of achieving local translation invariance.

Figure 2. Visualization of the feature space for the Selective Pooling Vector and the Super Vector. (a) Input image. (b) Gradient feature map, with the circled area marking the pooled local descriptor for a Gaussian mixture. (c) The gradient feature pooled by our SPV. (d) The gradient feature pooled by the Super Vector. Super Vector coding blurs the fine local details that could be important for fine-grained recognition, even though its feature construction is based on a weighted average. Since we cannot easily visualize SIFT descriptors, we use the gradient map as an approximation of SIFT for illustration purposes.

2.6. Encoding Spatial Information

To incorporate discriminative spatial information for image recognition, we can apply an idea similar to spatial pyramid matching [15], where each image is partitioned into blocks of different sizes (e.g., 1×1, 4×1) at different spatial scales. Alternatively, we can follow the rough part alignment framework [9] to segment the object and divide it into different subregions. We then extract an SPV from each of the spatial blocks or subregions. The final image feature representation is obtained by concatenating all Selective Pooling Vectors.

3. Experimental Results

In this section, we apply the proposed Selective Pooling Vector (SPV) to fine-grained recognition tasks, including face recognition and fine-grained object recognition. Extensive experiments have been carried out on several standard benchmark datasets. We show that our algorithm outperforms both the Super Vector and Fisher Vector representations on these fine-grained problems, and favorable comparisons with state-of-the-art fine-grained recognition methods demonstrate the effectiveness of our new image feature. In our experiments, we find that k-nearest neighbor pooling typically works better than radius pooling or posterior thresholding, as the latter two are more sensitive to parameter tuning. Therefore, in the following experiments, we only report results for SPV with k-nearest neighbor pooling.

3.1. Face recognition

The standard CMU Multi-PIE face dataset [10] is used as the benchmark to compare the proposed algorithm with the state of the art. The database contains 337 subjects with a spectrum of variations caused by different poses, expressions, and illumination conditions. The dataset is challenging due to the large number of subjects and the large heterogeneous appearance variations. We evaluate the algorithms with the standard experimental settings [25, 23]. Among the 337 subjects, the 249 subjects in Session 1 are used for training; Sessions 2, 3 and 4 are used for testing. For each subject in the training set, 7 frontal face images with neutral expression taken under extreme illumination conditions are included. For the testing set, all images taken under 20 illumination conditions are used. We report the recognition accuracy for each session separately.

For all of the experiments on the CMU Multi-PIE dataset, we first resize the images to 80. We then densely extract SIFT descriptors [17] and LBP descriptors [2] on a grid with a step of 3 pixels at different scales (8×8, 12×12, 16×16, 24×24, 32×32), and reduce the feature dimension to 80 through PCA. A GMM with 512 components is learned, and we build a three-level spatial pyramid (1×1, 2×2, 3×1) to incorporate the spatial information. Finally, we learn a linear SVM classifier for classification.
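To make the spatial encoding concrete, here is a hedged sketch (not the authors' code) that splits an image's descriptors into the cells of a 1×1, 2×2, 3×1 pyramid and concatenates one SPV per cell; encode_fn stands for any per-cell encoder, for instance the spv_encode sketch above with the GMM parameters bound in advance.

```python
import numpy as np

def spatial_pyramid_spv(Z, xy, image_size, encode_fn, levels=((1, 1), (2, 2), (3, 1))):
    """Concatenate one SPV per cell of a spatial pyramid (1x1 + 2x2 + 3x1 = 8 cells by default).

    Z:          (n, p) local descriptors.
    xy:         (n, 2) descriptor locations as (x, y) pixel coordinates.
    image_size: (width, height) of the image.
    encode_fn:  callable mapping an (m, p) descriptor subset to a 1-D SPV vector,
                e.g. the spv_encode sketch above with fixed GMM parameters.
    """
    w, h = image_size
    blocks = []
    for nx, ny in levels:
        # Assign each descriptor to a cell of the nx-by-ny grid.
        cx = np.minimum((xy[:, 0] * nx / w).astype(int), nx - 1)
        cy = np.minimum((xy[:, 1] * ny / h).astype(int), ny - 1)
        for ix in range(nx):
            for iy in range(ny):
                in_cell = (cx == ix) & (cy == iy)
                blocks.append(encode_fn(Z[in_cell]))
    return np.concatenate(blocks)
```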

We first evaluate the effect of k in k-nearest neighbor selective pooling. One extreme case is to keep only the largest value in each row of the encoding matrix A, which corresponds to max pooling. The max pooling approach can be interpreted as finding the most confident local descriptor for each GMM component for the final classification. The other extreme case is to keep all the values, so that a weighted average of local descriptors is computed for each GMM component. In this case, the proposed pooling feature degenerates to the Super Vector [29], which bears a large similarity to the Fisher Vector [19]. We vary the value of k and study the corresponding performance changes, as shown in Table 1. We find that keeping a small number of local descriptors for each component gives superior results. For k = 1, the recognition accuracies are already quite high for all three sessions: 96.3%, 96.2% and 96.7%. For k = 2 and k = 3, the performance is similar. However, the performance tends to drop as k gets larger. If we keep all the local descriptors (k = 1578), which degenerates our feature to the Super Vector, the performance drops significantly to 92.0%, 92.4% and 92.7% on the three sessions, respectively. This performance change is well explained by the selective pooling analysis discussed in Section 2: local descriptors with low posterior probabilities have large quantization errors that are destructive to learning the classification function.

Table 1. Recognition accuracy of our Selective Pooling Vector on CMU Multi-PIE for different values of k in k-NN pooling.

k-NN pooling      session 2   session 3   session 4
k = 1             96.3%       96.2%       96.7%
k = 2             -           96.3%       96.6%
k = 3             -           96.1%       96.4%
-                 -           94.9%       94.7%
-                 -           93.6%       93.8%
-                 -           92.5%       92.7%
k = 1578 (all)    92.0%       92.4%       92.7%

Although tuning the number of neighbors k for pooling might increase the performance (e.g., there is a gain on Session 3), we use max pooling from now on for its simplicity, efficiency and effectiveness.

Comparison with the state of the art. We compare the proposed local feature embedding algorithm with several state-of-the-art face recognition algorithms, including face recognition using sparse representation [23], supervised sparse coding [25], and the recent structured sparse coding [12]. The face recognition comparisons are shown in Table 2. The proposed Selective Pooling Vector outperforms the latest work [12] by 1% to 3% on the three sessions. We achieve the highest recognition rates, 96.3%, 96.2% and 96.7%, on all three sessions.

Table 2. Comparison with the state of the art on CMU Multi-PIE for face recognition.

Algorithms               session 2   session 3   session 4
SRC [23]                 91.4%       90.3%       90.2%
USC [25]                 94.6%       91.0%       92.5%
SSC [25]                 95.2%       93.4%       95.1%
Struct. Sparsity [12]    95.7%       94.9%       93.7%
SPV (SPM)                96.3%       96.2%       96.7%

3.2. Fine-grained recognition

Recently, there has been a growing interest in fine-grained recognition problems. Many powerful algorithms have been proposed in the last several years, including high-throughput template matching [27], unsupervised template learning [26], segmentation-based alignment [9], part localization [6], and different flavors of feature encoding and learning algorithms (e.g., the Fisher vector [19, 9], LLC [21, 28], and POOF [4]). We evaluate the effectiveness of the proposed Selective Pooling Vector by comparing its performance with the aforementioned state-of-the-art algorithms on two challenging benchmark fine-grained datasets: Caltech-UCSD Birds 2010 [22] and the Stanford Dogs dataset [14]. The Caltech-UCSD Birds 2010 dataset contains 6,044 images from 200 bird species; some of the species have very subtle inter-class differences. We adopt the standard training/testing split [22] on the Bird dataset, i.e., around 15 training and 15 test images per category. The Stanford Dogs dataset [14] is another popular benchmark containing 20,580 images of 120 breeds of dogs; it is a carefully selected subset of ImageNet [1].

For the experiments on these two datasets, we follow the standard evaluation protocol [27, 26, 6]: we augment the training set by mirroring the training images so that it is doubled, we use the labeled bounding boxes to normalize the images, and we evaluate performance by the category-normalized mean accuracy. We densely extract SIFT descriptors [17] from the opponent color space [20] and LBP descriptors [2] on a grid with a step of 3 pixels at five scales (16×16, 24×24, 32×32, 40×40, 48×48). The dimension of the local descriptors is then reduced by PCA, and a GMM with K components is learned. Finally, the Selective Pooling Vector representation is fed to a linear SVM classifier.

Gavves et al. [9] have shown that a rough part-level alignment with spatial information encoding can improve the recognition accuracy significantly. Accordingly, we report fine-grained object recognition results with two different spatial information encoding methods. The first is the traditional spatial pyramid matching algorithm with three layers (1×1, 2×2, 4×1). The second is the spatial encoding algorithm introduced by Gavves et al. [9]. First, we use GrabCut [18] on the labeled bounding box to compute an accurate foreground segmentation. Second, we compute the mean and covariance of the pixels in the segmentation mask and accordingly fit an ellipse to these pixels. Third, we divide the principal axis of the ellipse equally into four segments, and define the regions falling into each segment as object parts. Finally, for each segment region we extract our Selective Pooling Vector, and we concatenate all the vectors to form the final object representation.
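The part-level alignment described above can be approximated with the following sketch (an illustration under our own assumptions, not the authors' implementation): given a foreground mask, we take the mean and covariance of the foreground pixel coordinates, use the principal axis of the resulting ellipse, and split descriptors into four parts along that axis; computing the mask itself, e.g., with GrabCut on the bounding box, is assumed to happen elsewhere. An SPV can then be extracted per part with the earlier sketches and the four vectors concatenated into the final object representation.

```python
import numpy as np

def part_labels_from_mask(mask, xy, n_parts=4):
    """Assign descriptors to object parts by splitting the foreground's principal axis.

    mask: (H, W) boolean foreground segmentation (e.g., obtained with GrabCut on the box).
    xy:   (n, 2) descriptor locations as (x, y) pixel coordinates.
    Returns an integer part index in [0, n_parts) for every descriptor.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)       # foreground pixels as (x, y)
    mean = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)                      # 2x2 covariance of the mask pixels
    eigvals, eigvecs = np.linalg.eigh(cov)
    axis = eigvecs[:, np.argmax(eigvals)]                # principal axis of the fitted ellipse

    proj_fg = (pts - mean) @ axis                        # project mask pixels onto the axis
    proj_z = (xy - mean) @ axis                          # project descriptor locations
    cuts = np.linspace(proj_fg.min(), proj_fg.max(), n_parts + 1)[1:-1]
    return np.clip(np.digitize(proj_z, cuts), 0, n_parts - 1)
```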

3.2.1. Caltech-UCSD Birds 2010 Dataset

For the fine-grained recognition experiments on the Bird dataset [22], we first compare our Selective Pooling Vector with state-of-the-art feature coding and learning algorithms, i.e., LLC [21], multiple kernel learning [28], and the Fisher Vector [19], under the same settings (same local descriptors and same number of Gaussian mixtures). To encode the spatial information, we first use the traditional 3-layer spatial pyramid for all algorithms. The comparison results are shown in Table 3. We observe a much higher accuracy than LLC [28] on the Bird dataset: a significant performance leap from 18% to 46.7%. Compared with the state-of-the-art Fisher Vector algorithm [19], our algorithm still works much better, outperforming it by more than 5%. Since LLC only uses the pooling coefficients for classification, these coefficients are too coarse to distinguish the subtle inter-class differences in fine-grained recognition tasks.

Table 3. Comparison with popular feature learning algorithms on the Caltech-UCSD Birds dataset.

Algorithms                        Accuracy
LLC [28]                          18.0%
Multiple Kernel Learning [28]     19.0%
Fisher Vector [19]                41.1%
SPV (SPM)                         46.7%

The Fisher Vector algorithm and our algorithm both preserve the local descriptor information, which helps to differentiate the subtle differences between fine-grained object categories. However, the Fisher Vector uses all local descriptors to construct the feature representation (i.e., average pooling), while our feature discards local descriptors that are far away from the Gaussian mixture centers and makes use of only the most confident local descriptors for classification. Therefore, the function learning underlying our new feature can be more accurate, and as a result we achieve better performance.

Comparisons between our algorithm and many state-of-the-art algorithms reported on this bird dataset [22] are shown in Table 4. In this case, we use the segmentation alignment algorithm [9] to encode the spatial information, which increases our performance by 2.2% compared with the SPM result in Table 3. As we can see from Table 4, the proposed Selective Pooling Vector clearly outperforms all the state of the art. Compared with the prior art [6], which is based on an elegant joint framework of a deformable parts model [8] and a segmentation algorithm [18] built on the Fisher Vector, our algorithm improves the accuracy from 47.3% to 48.9%, with a much simpler learning and testing scheme.

Table 4. Comparison with the state of the art on the Caltech-UCSD Birds dataset.

Algorithms                               Accuracy
Co-Segmentation [5]                      23.3%
Discriminative color descriptors [13]    26.7%
Unsupervised template learning [26]      28.2%
Detection+Segmentation [3]               30.2%
DPM+Segmentation+Fisher vector [6]       47.3%
SPV (Alignment)                          48.9%

3.2.2. Stanford Dogs Dataset

Compared with the Bird dataset [22], the Stanford Dogs dataset [14] contains more images and has even larger shape and pose variations. We again first report comparisons with LLC coding [13] and Fisher Vector coding [19] under the same experimental setup with a spatial pyramid. From Table 5, we again observe a big performance improvement over LLC, from 14.5% [13] to 47.2%. Compared with the Fisher Vector under the same settings, our algorithm again performs much better, around 6% higher. These results are consistent with our observations on the Bird dataset.

Table 5. Comparison with popular feature learning algorithms on the Stanford Dogs dataset.

Algorithms             Accuracy
LLC [13]               14.5%
Fisher Vector [19]     41.0%
SPV (SPM)              47.2%

We then report comparisons between our algorithm and state-of-the-art algorithms on this dog dataset in Table 6. Again, we use the spatial alignment algorithm of [9] to encode the spatial information. This time, it increases our performance from 47.2% with SPM to 52.0%, a larger leap than what we observe on the bird dataset: due to the larger shape and pose variations in the Stanford Dogs dataset, spatial alignment helps more. On this dataset, the unsupervised template learning algorithm [26] achieved a recognition accuracy of 38.0%, and the segmentation-based frameworks [6, 9] showed great success, achieving 45.6% and 50.1%, respectively. With the spatial alignment algorithm introduced by [9], we achieve an accuracy of 52.0%, outperforming the DPM-and-segmentation algorithm [6] by 6.4% and the prior best result [9] by 1.9%. Note that the difference between our algorithm and that of [9] is the use of the Selective Pooling Vector rather than the Fisher Vector.

Table 6. Comparison with the state of the art on the Stanford Dogs dataset.

Algorithms                               Accuracy
TriCoS [7]                               26.9%
Discriminative color descriptors [13]    28.1%
Unsupervised template learning [26]      38.0%
DPM+Segmentation+Fisher vector [6]       45.6%
Alignment+Fisher vector [9]              50.1%
SPV (Alignment)                          52.0%

3.3. Discussion

We have shown the superior performance of our SPV over state-of-the-art algorithms on several fine-grained recognition tasks. In particular, we compared with the similar feature representations of the Super Vector and Fisher Vector in the frameworks of the spatial pyramid and of spatial alignment [9]. In both cases, our SPV outperforms them significantly. One interesting observation is that our SPV brings larger improvements over the Super Vector when objects are not very well aligned (e.g., when using the spatial pyramid in Tables 3 and 5), indicating that our selective pooling is more robust than the average pooling used in the Super Vector and Fisher Vector on fine-grained recognition tasks.

4. Conclusion

In this paper, we propose a novel image feature representation called the Selective Pooling Vector.

The new image feature is derived from nonlinear function learning by linear approximation in an embedded high-dimensional space. Different from previous work, we ensure the function learning accuracy by selecting only local descriptors that are confident. We apply our algorithm to the CMU Multi-PIE dataset for face recognition and to fine-grained recognition tasks on the Caltech-UCSD Birds 2010 dataset and the Stanford Dogs dataset, in all cases outperforming the state-of-the-art handcrafted features.

References

[1] The ImageNet dataset. http://www.image-net.org/.
[2] T. Ahonen, A. Hadid, and M. Pietikäinen. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2006.
[3] A. Angelova and S. Zhu. Efficient object detection and segmentation for fine-grained recognition. In CVPR, 2013.
[4] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, 2013.
[5] Y. Chai, V. Lempitsky, and A. Zisserman. BiCoS: A bi-level co-segmentation method for image classification. In ICCV, 2011.
[6] Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic segmentation and part localization for fine-grained categorization. In ICCV, 2013.
[7] Y. Chai, E. Rahtu, V. Lempitsky, L. Van Gool, and A. Zisserman. TriCoS: A tri-level class-discriminative co-segmentation method for image classification. In ECCV, 2012.
[8] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 2010.
[9] E. Gavves, B. Fernando, C. Snoek, A. Smeulders, and T. Tuytelaars. Fine-grained categorization by alignments. In ICCV, December 2013.
[10] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image Vision Comput., 2010.
[11] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[12] K. Jia, T.-H. Chan, and Y. Ma. Robust and practical face recognition via structured sparsity. In ECCV, 2012.
[13] R. Khan, J. van de Weijer, F. S. Khan, D. Muselet, C. Ducottet, and C. Barat. Discriminative color descriptors. In CVPR, 2013.
[14] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[16] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In ICCV, 2011.
[17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[18] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graph., 2004.
[19] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 2013.
[20] K. van de Sande, T. Gevers, and C. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2010.
[21] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[22] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
[23] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 2009.
[24] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[25] J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. In CVPR, 2010.
[26] S. Yang, L. Bo, J. Wang, and L. G. Shapiro. Unsupervised template learning for fine-grained object recognition. In NIPS, 2012.
[27] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
[28] B. Yao, A. Khosla, and L. Fei-Fei. Combining randomization and discrimination for fine-grained image categorization. In CVPR, 2011.
[29] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification using super-vector coding of local image descriptors. In ECCV, 2010.


More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Beyond Bags of features Spatial information & Shape models

Beyond Bags of features Spatial information & Shape models Beyond Bags of features Spatial information & Shape models Jana Kosecka Many slides adapted from S. Lazebnik, FeiFei Li, Rob Fergus, and Antonio Torralba Detection, recognition (so far )! Bags of features

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

arxiv: v1 [cs.cv] 15 Mar 2014

arxiv: v1 [cs.cv] 15 Mar 2014 Geometric VLAD for Large Scale Image Search Zixuan Wang, Wei Di, Anurag Bhardwaj, Vignesh Jagadeesh, Robinson Piramuthu Dept.of Electrical Engineering, Stanford University, CA 94305 ebay Research Labs,

More information

A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization

A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization Bangpeng Yao 1 Gary Bradski 2 Li Fei-Fei 1 1 Computer Science Department, Stanford University, Stanford, CA 2 Industrial

More information

Class 5: Attributes and Semantic Features

Class 5: Attributes and Semantic Features Class 5: Attributes and Semantic Features Rogerio Feris, Feb 21, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Project

More information

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj Proceedings of the 5th International Symposium on Communications, Control and Signal Processing, ISCCSP 2012, Rome, Italy, 2-4 May 2012 ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING

More information

Patch Descriptors. EE/CSE 576 Linda Shapiro

Patch Descriptors. EE/CSE 576 Linda Shapiro Patch Descriptors EE/CSE 576 Linda Shapiro 1 How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

Fuzzy based Multiple Dictionary Bag of Words for Image Classification

Fuzzy based Multiple Dictionary Bag of Words for Image Classification Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2196 2206 International Conference on Modeling Optimisation and Computing Fuzzy based Multiple Dictionary Bag of Words for Image

More information

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009 Analysis: TextonBoost and Semantic Texton Forests Daniel Munoz 16-721 Februrary 9, 2009 Papers [shotton-eccv-06] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context

More information

Exemplar-specific Patch Features for Fine-grained Recognition

Exemplar-specific Patch Features for Fine-grained Recognition Exemplar-specific Patch Features for Fine-grained Recognition Alexander Freytag 1, Erik Rodner 1, Trevor Darrell 2, and Joachim Denzler 1 1 Computer Vision Group, Friedrich Schiller University Jena, Germany

More information

Fisher and VLAD with FLAIR

Fisher and VLAD with FLAIR Fisher and VLAD with FLAIR Koen E. A. van de Sande 1 Cees G. M. Snoek 1 Arnold W. M. Smeulders 12 1 ISLA, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands 2 Centrum Wiskunde &

More information

An Associate-Predict Model for Face Recognition FIPA Seminar WS 2011/2012

An Associate-Predict Model for Face Recognition FIPA Seminar WS 2011/2012 An Associate-Predict Model for Face Recognition FIPA Seminar WS 2011/2012, 19.01.2012 INSTITUTE FOR ANTHROPOMATICS, FACIAL IMAGE PROCESSING AND ANALYSIS YIG University of the State of Baden-Wuerttemberg

More information

Experiments of Image Retrieval Using Weak Attributes

Experiments of Image Retrieval Using Weak Attributes Columbia University Computer Science Department Technical Report # CUCS 005-12 (2012) Experiments of Image Retrieval Using Weak Attributes Felix X. Yu, Rongrong Ji, Ming-Hen Tsai, Guangnan Ye, Shih-Fu

More information

Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014

Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014 Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014 Problem Definition Viewpoint estimation: Given an image, predicting viewpoint for object of

More information

Multiple cosegmentation

Multiple cosegmentation Armand Joulin, Francis Bach and Jean Ponce. INRIA -Ecole Normale Supérieure April 25, 2012 Segmentation Introduction Segmentation Supervised and weakly-supervised segmentation Cosegmentation Segmentation

More information

Patch Descriptors. CSE 455 Linda Shapiro

Patch Descriptors. CSE 455 Linda Shapiro Patch Descriptors CSE 455 Linda Shapiro How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar

More information

Multipath Sparse Coding Using Hierarchical Matching Pursuit

Multipath Sparse Coding Using Hierarchical Matching Pursuit Multipath Sparse Coding Using Hierarchical Matching Pursuit Liefeng Bo, Xiaofeng Ren ISTC Pervasive Computing, Intel Labs Seattle WA 98195, USA {liefeng.bo,xiaofeng.ren}@intel.com Dieter Fox University

More information

Metric learning approaches! for image annotation! and face recognition!

Metric learning approaches! for image annotation! and face recognition! Metric learning approaches! for image annotation! and face recognition! Jakob Verbeek" LEAR Team, INRIA Grenoble, France! Joint work with :"!Matthieu Guillaumin"!!Thomas Mensink"!!!Cordelia Schmid! 1 2

More information

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION Maral Mesmakhosroshahi, Joohee Kim Department of Electrical and Computer Engineering Illinois Institute

More information

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects

More information

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce Object Recognition Computer Vision Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce How many visual object categories are there? Biederman 1987 ANIMALS PLANTS OBJECTS

More information

Robust Scene Classification with Cross-level LLC Coding on CNN Features

Robust Scene Classification with Cross-level LLC Coding on CNN Features Robust Scene Classification with Cross-level LLC Coding on CNN Features Zequn Jie 1, Shuicheng Yan 2 1 Keio-NUS CUTE Center, National University of Singapore, Singapore 2 Department of Electrical and Computer

More information

Computer Vision. Exercise Session 10 Image Categorization

Computer Vision. Exercise Session 10 Image Categorization Computer Vision Exercise Session 10 Image Categorization Object Categorization Task Description Given a small number of training images of a category, recognize a-priori unknown instances of that category

More information

Evaluation and comparison of interest points/regions

Evaluation and comparison of interest points/regions Introduction Evaluation and comparison of interest points/regions Quantitative evaluation of interest point/region detectors points / regions at the same relative location and area Repeatability rate :

More information

Normalized Texture Motifs and Their Application to Statistical Object Modeling

Normalized Texture Motifs and Their Application to Statistical Object Modeling Normalized Texture Motifs and Their Application to Statistical Obect Modeling S. D. Newsam B. S. Manunath Center for Applied Scientific Computing Electrical and Computer Engineering Lawrence Livermore

More information

Multiple VLAD encoding of CNNs for image classification

Multiple VLAD encoding of CNNs for image classification Multiple VLAD encoding of CNNs for image classification Qing Li, Qiang Peng, Chuan Yan 1 arxiv:1707.00058v1 [cs.cv] 30 Jun 2017 Abstract Despite the effectiveness of convolutional neural networks (CNNs)

More information

Metric Learning for Large-Scale Image Classification:

Metric Learning for Large-Scale Image Classification: Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka

More information

Supervised learning. y = f(x) function

Supervised learning. y = f(x) function Supervised learning y = f(x) output prediction function Image feature Training: given a training set of labeled examples {(x 1,y 1 ),, (x N,y N )}, estimate the prediction function f by minimizing the

More information

Object Detection Using Segmented Images

Object Detection Using Segmented Images Object Detection Using Segmented Images Naran Bayanbat Stanford University Palo Alto, CA naranb@stanford.edu Jason Chen Stanford University Palo Alto, CA jasonch@stanford.edu Abstract Object detection

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

Visual Object Recognition

Visual Object Recognition Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, 14.07.2008 & Kristen Grauman Department

More information

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14 Announcements Computer Vision I CSE 152 Lecture 14 Homework 3 is due May 18, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying Images Chapter 17: Detecting Objects in Images Given

More information

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Anand K. Hase, Baisa L. Gunjal Abstract In the real world applications such as landmark search, copy protection, fake image

More information