Image classification based on improved VLAD

DOI 10.1007/s11042-015-2524-6

Xianzhong Long, Hongtao Lu, Yong Peng, Xianzhong Wang, Shaokun Feng
Received: 25 August 2014 / Revised: 22 December 2014 / Accepted: 18 February 2015
© Springer Science+Business Media New York 2015

Abstract Recently, a coding scheme called vector of locally aggregated descriptors (VLAD) has achieved tremendous success in large-scale image retrieval due to the efficiency of its compact representation. VLAD employs only the nearest visual word in the dictionary to aggregate each descriptor, giving fast retrieval speed and high retrieval accuracy under small dictionary sizes. In this paper, we present three improved VLAD variants for image classification: first, similar to the bag of words (BoW) model, we count the number of descriptors belonging to each cluster center and append it to VLAD; second, in order to magnify the impact of residuals, squared residuals are taken into account; third, instead of one nearest visual word, we look for the two nearest visual words when aggregating each descriptor. Experimental results on the UIUC Sports Event, Corel 10 and 15 Scenes datasets show that the proposed methods outperform several state-of-the-art coding schemes in terms of classification accuracy and computation speed.

X. Long (corresponding author): School of Computer Science & Technology, School of Software, Nanjing University of Posts and Telecommunications, Nanjing 210023, China. e-mail: lxz@njupt.edu.cn
H. Lu, Y. Peng, X. Wang, S. Feng: Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China. e-mail: htlu@sjtu.edu.cn, pengyong851012@sjtu.edu.cn, wxz2453@sjtu.edu.cn, superkkking@sjtu.edu.cn

Keywords: Image classification · Scale-invariant feature transform · Vector of locally aggregated descriptors · K-means clustering algorithm

1 Introduction

As one of the most important and challenging tasks in computer vision and pattern recognition, image classification has recently received much attention. Several benchmark datasets are used to evaluate the performance of image classification algorithms, for example UIUC Sports Event [23], Corel 10 [26], 15 Scenes [21], Caltech 101 [10] and Caltech 256 [14]. Many image classification models have recently been proposed, such as generative models [2, 22, 33], discriminative models [9, 18, 27, 39] and hybrid generative/discriminative models [3]. A generative model classifies images from the viewpoint of probability; it depends only on the data themselves and does not require training or learning parameters. In contrast, a discriminative model solves the classification problem from a non-probabilistic perspective and needs to train or learn the parameters appearing in the classifier. Here, we only consider image classification based on discriminative models. Among discriminative models, the earliest bag of words (BoW) technique [35] won the greatest popularity and has been widely applied in image retrieval [31], video event detection [37] and image classification [6, 13]. However, the BoW representation lacks descriptive capability because it is merely a histogram of the number of image descriptors assigned to each visual word, and it ignores the spatial information of the image. To address this problem, the Spatial Pyramid Matching (SPM) model was put forward in [21], which takes the spatial information of the image into account. SPM is an extension of the BoW model and has been shown to achieve better image classification accuracy than BoW [15, 36, 38].
In image classification based on the SPM model, there are five steps: local descriptor extraction, dictionary learning, feature coding, spatial pooling and classifier selection. The commonly used local descriptors include the Scale-Invariant Feature Transform (SIFT) [25], Histogram of Oriented Gradients (HoG) [7], Affine Scale-Invariant Feature Transform (ASIFT) [28] and Oriented FAST and Rotated BRIEF (ORB) [34]. After all image descriptors are extracted, vector quantization [21] or sparse coding [38] is used to train a dictionary. In the feature coding phase, each image's descriptor matrix is converted into a coefficient matrix by the chosen coding strategy. It is necessary to describe spatial pooling clearly because it dominates the whole SPM-based image classification framework. During spatial pooling, an image is divided into increasingly finer subregions over L layers, with 2^l × 2^l subregions at layer l, l = 0, 1, ..., L−1. A typical partition has three layers, i.e., L = 3: at layer 0 the image is taken as a whole; at layer 1 it is divided into four regions; and at layer 2 each subregion of layer 1 is further divided into 4, resulting in 16 smaller subregions. This process generates a spatial pyramid of three layers with a total of 21 subregions. The spatial pyramid is then combined with the feature coding process, and different pooling functions are exploited, i.e., sum pooling [21] or max pooling [36, 38]. Finally, the feature vectors of the 21 subregions are concatenated into one long feature vector for the whole image. The process above is the spatial pyramid representation of the image; the dimensionality of the new representation for each image is 21P, where P is the dictionary size. It is noteworthy that when l = 0, SPM reduces to the original BoW model.

In the last step, a classifier such as a Support Vector Machine (SVM) [5] or Adaptive Boosting (AdaBoost) [11] is applied to classify the images. Over the past several years, a number of dictionary learning methods and feature coding strategies have been brought forward for image classification. In [6], the K-means clustering algorithm, a vector quantization (VQ) technique, was used to generate the dictionary; during the feature coding phase, each local descriptor was given a binary value specifying the cluster center to which it belonged. This process is called BoW, and it produces a histogram representation of the visual words. However, this approach is likely to yield large reconstruction error because it limits the ability to represent descriptors. To address this problem, SPM based on sparse coding (ScSPM) was proposed in [38], which employed an L1-norm-based sparse coding scheme in place of K-means clustering, generating the dictionary by learning from randomly sampled SIFT feature vectors; during the feature coding period, ScSPM used a sparse coding strategy to encode each local descriptor. However, the computation speed of ScSPM becomes very slow when the dictionary size grows large. In order to accelerate the computation while maintaining high classification accuracy, locality-constrained linear coding (LLC) was put forward in [36], which gives an analytical solution for feature coding. Furthermore, several improved image classification schemes based on SPM have also been suggested recently, such as spatial pyramid matching using Laplacian sparse coding [12], discriminative spatial pyramid [15], discriminative affine sparse codes [20] and nearest neighbor basis vectors spatial pyramid matching (NNBVSPM) [24]. Finding efficient feature coding strategies has become an urgent research direction.
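The spatial pyramid arithmetic described above (2^l × 2^l subregions per layer, 21 subregions for L = 3, a 21P-dimensional concatenated vector) can be sketched as follows; P = 400 is an illustrative dictionary size, not a value from the paper's experiments.

```python
# Spatial pyramid layout used by SPM: at layer l the image is split
# into 2^l x 2^l subregions, so layer l contributes 4^l subregions.
L = 3          # number of pyramid layers
P = 400        # dictionary size (illustrative value)

subregions = [4 ** l for l in range(L)]   # regions per layer
total = sum(subregions)                   # 1 + 4 + 16 = 21 subregions
dimensionality = total * P                # concatenated vector is 21P-dim

print(subregions, total, dimensionality)  # [1, 4, 16] 21 8400
```

With L = 1 (layer 0 only), `total` is 1 and the representation collapses to the P-dimensional BoW histogram, which is the reduction noted above.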
In the field of pattern recognition, the Fisher vector (FV) technique has been used for image classification [4, 19, 29, 30]. FV is a strong framework that combines the advantages of generative and discriminative approaches: the key idea is to represent a signal using a gradient vector derived from a generative probability model and then feed this representation to a discriminative classifier. FV can therefore be seen as a hybrid generative/discriminative model. The vector of locally aggregated descriptors (VLAD) can be viewed as a non-probabilistic version of the FV in which the gradient is associated only with the mean and Gaussian mixture model (GMM) clustering is replaced by K-means. VLAD has been successfully applied to image retrieval [1, 8, 16, 17]. When higher-order statistics are considered, two further coding methods arise: vectors of locally aggregated tensors (VLAT) [32] and super-vector (SV) coding [41]. The dimensionality of VLAT is P(D + D^2), where D is the dimension of each descriptor; this high-dimensional representation can result in very large computation times. SV, for its part, takes a probabilistic viewpoint and is still a generative model. Therefore, we do not consider the VLAT and SV feature coding algorithms. In this paper, we concentrate on image classification methods based on discriminative models; BoW, ScSPM, LLC and VLAD are selected for comparison with our improved VLAD methods. In order to obtain stronger coding ability and improve the classification rate or speed, three improved VLAD versions for image classification are given in this paper. First, similar to the bag of words (BoW) model, we count the number of descriptors belonging to each cluster center and append it to VLAD; in this way, our improved VLAD method inherits the characteristics of BoW. Second, in order to magnify the impact of residuals, squared residuals are added to the original VLAD; this makes the dimension of the new representation twice that of the original. Third, some descriptors have nearly the same

distance to more than one visual word, so assigning such descriptors only to the single nearest visual word, as in the original VLAD, is not appropriate. Instead of one nearest visual word, we look for the two nearest visual words when aggregating each descriptor. The remainder of the paper is organized as follows: Section 2 introduces the basic ideas of existing schemes. Our improved VLAD methods are presented in Section 3. In Section 4, comparative image classification results on three widely used datasets are reported. Finally, conclusions are drawn and some future research issues are discussed in Section 5.

2 Related work

Let V be a set of D-dimensional local descriptors extracted from an image, i.e., V = [v_1, v_2, ..., v_M] ∈ R^{D×M}. Given a dictionary with P entries, W = [w_1, w_2, ..., w_P] ∈ R^{D×P}, different feature coding schemes convert each descriptor into a P-dimensional code to generate the final image representation coefficient matrix H, i.e., H = [h_1, h_2, ..., h_M] ∈ R^{P×M}. Each column of V is a local descriptor corresponding to one coefficient vector, i.e., one column of H.

2.1 Bag of words (BoW)

The BoW representation groups local descriptors. It first generates a dictionary W of P visual words, usually obtained with the K-means clustering algorithm. Each D-dimensional local descriptor from an image is then assigned to the closest center, and the BoW representation is the histogram of the assignments of all image descriptors to visual words. It therefore produces a P-dimensional vector whose entries sum to the number of descriptors in the image. However, the BoW model does not consider the spatial structure of the image and has large reconstruction error, so its ability to classify images is restricted [6].
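The BoW assignment-and-histogram step can be sketched as follows; the descriptor matrix V and dictionary W are random toy data standing in for SIFT descriptors and K-means centers, not data from the paper's experiments.

```python
import numpy as np

# Toy stand-ins for a D x M descriptor matrix and a D x P dictionary.
rng = np.random.default_rng(0)
D, M, P = 128, 50, 10
V = rng.standard_normal((D, M))
W = rng.standard_normal((D, P))

# Assign each descriptor to its nearest visual word ...
dists = np.linalg.norm(V[:, :, None] - W[:, None, :], axis=0)  # M x P
nearest = dists.argmin(axis=1)

# ... and build the histogram of assignments (the BoW vector).
hist = np.bincount(nearest, minlength=P)
assert hist.sum() == M  # entries sum to the number of descriptors
```

Note how the code makes the paper's point concrete: the P-dimensional histogram records only counts, discarding both the residual v_i − w_j and any spatial layout.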
2.2 Sparse coding spatial pyramid matching (ScSPM)

In ScSPM [38], by using sparse coding in place of vector quantization, followed by multi-layer spatial max pooling, the authors developed an extension of the traditional SPM method [21] and presented a linear SPM kernel based on SIFT sparse coding. ScSPM solves the following optimization problem:

min_{W,H} Σ_{i=1}^{M} ||v_i − W h_i||_2^2 + λ ||h_i||_1    (1)

where ||·||_2 denotes the L2 norm of a vector (the square root of the sum of its squared entries) and ||·||_1 is the L1 norm (the sum of the absolute values of its entries). The parameter λ controls the sparsity of the solution of formula (1): the larger λ is, the sparser the solution. Experimental results in [38] demonstrated that linear SPM based on sparse coding of SIFT descriptors significantly outperformed the linear SPM kernel on histograms and was even better than the nonlinear SPM kernels. Nevertheless, using sparse coding to learn the dictionary and to encode features is time-consuming, especially for large-scale image datasets or large dictionaries.

2.3 Locality-constrained linear coding (LLC)

In LLC [36], inspired by the viewpoint of [40] that locality is more important than sparsity, the authors generalized sparse coding to locality-constrained linear coding, replacing the sparsity constraint in formula (1) with a locality constraint. LLC solves the following optimization problem:

min_{H} Σ_{i=1}^{M} ||v_i − W h_i||_2^2 + λ ||d_i ⊙ h_i||_2^2   s.t. 1^T h_i = 1, ∀i    (2)

where 1 = (1, 1, ..., 1)^T, ⊙ denotes element-wise multiplication and d_i ∈ R^P is a weight vector. In addition, each coefficient vector h_i is normalized so that 1^T h_i = 1. Experimental results in [36] showed that LLC outperformed ScSPM on several benchmark datasets due to its excellent properties: better reconstruction, local smooth sparsity and an analytical solution.

2.4 Vector of locally aggregated descriptors (VLAD)

The VLAD representation was proposed in [16] for image retrieval. Let V = [v_1, v_2, ..., v_M] ∈ R^{D×M} be the descriptor set extracted from an image. As in BoW, a dictionary W = [w_1, w_2, ..., w_P] ∈ R^{D×P} is first learned using K-means. Then, for each local descriptor v_i, we look for its nearest visual word NN(v_i) in the dictionary. Finally, for each visual word w_j, the differences v_i − w_j of the vectors v_i assigned to w_j are accumulated. C = [c_1^T, c_2^T, ..., c_P^T]^T ∈ R^{PD} (c_j ∈ R^D, j = 1, 2, ..., P) is the final VLAD vector representation, obtained according to the following formula:

c_j = Σ_{i: NN(v_i) = w_j} (v_i − w_j)    (3)

The VLAD representation is the concatenation of the D-dimensional vectors c_j and therefore has dimension PD, where P is the dictionary size. Algorithm 1 gives the VLAD coding process.
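Eq. (3), together with the power- and L2-normalization used throughout the paper (α = 0.5, i.e., signed square root), can be sketched as follows; the descriptors and K-means dictionary are random toy data for illustration.

```python
import numpy as np

# Toy stand-ins for descriptors (D x M) and a K-means dictionary (D x P).
rng = np.random.default_rng(0)
D, M, P = 128, 50, 10
V = rng.standard_normal((D, M))
W = rng.standard_normal((D, P))

# NN(v_i): nearest visual word for each descriptor.
dists = np.linalg.norm(V[:, :, None] - W[:, None, :], axis=0)  # M x P
nearest = dists.argmin(axis=1)

# Eq. (3): accumulate residuals v_i - w_j per visual word.
C = np.zeros((P, D))
for j in range(P):
    assigned = V[:, nearest == j]          # descriptors with NN = w_j
    if assigned.size:
        C[j] = (assigned - W[:, [j]]).sum(axis=1)

c = C.ravel()                              # PD-dimensional VLAD vector
c = np.sign(c) * np.abs(c) ** 0.5          # power normalization, alpha = 0.5
c /= np.linalg.norm(c)                     # L2 normalization
assert c.shape == (P * D,)
```

The concatenated vector has dimension PD (here 10 × 128 = 1280), matching the dimensionality stated above.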
Like the Fisher vector, the VLAD can then be power- and L2-normalized sequentially, where the power-normalization parameter α is empirically set to 0.5. It is worth noting that there is no SPM or pooling process in the VLAD coding algorithm. Existing experiments have shown that VLAD is an efficient feature coding method under small dictionary sizes.

3 Improved VLAD

In this section, three improved VLAD methods are presented, named VLAD based on BoW, Magnified VLAD and Two Nearest Neighbor VLAD, respectively. As with VLAD, the improved VLAD representations can also be power- and L2-normalized, where the parameter α is empirically set to 0.5.

3.1 VLAD based on BoW

Inspired by BoW, we count the number of descriptors belonging to each cluster w_j (j = 1, ..., P) and append it to VLAD. This improved method is called VLAD based on BoW (abbreviated VLAD+BoW). The dimensionality of the VLAD+BoW representation is therefore P(D + 1), with the extra dimension per word used to store the BoW counts. By integrating the histogram information of the visual words into VLAD, we expect VLAD+BoW to inherit the characteristics of BoW and improve classification performance. VLAD+BoW is presented in Algorithm 2.

3.2 Magnified VLAD

In order to magnify the impact of residuals, squared residuals are taken into account. This improved version is called Magnified VLAD (abbreviated MVLAD) and its dimension is 2PD. The computation of MVLAD is given in Algorithm 3.
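Starting from the unnormalized VLAD residuals of Eq. (3), the first two variants can be sketched as below. Algorithms 2 and 3 are not reproduced in this excerpt, so the exact aggregation of squared residuals in MVLAD (here a per-word sum, mirroring Eq. (3)) is an assumption; the data are toy values.

```python
import numpy as np

# Toy descriptors (D x M) and dictionary (D x P), as in Section 2.4.
rng = np.random.default_rng(0)
D, M, P = 128, 50, 10
V = rng.standard_normal((D, M))
W = rng.standard_normal((D, P))

nearest = np.linalg.norm(V[:, :, None] - W[:, None, :], axis=0).argmin(axis=1)
C = np.zeros((P, D))
for j in range(P):
    assigned = V[:, nearest == j]
    if assigned.size:
        C[j] = (assigned - W[:, [j]]).sum(axis=1)     # Eq. (3) residuals

# VLAD+BoW: append the per-word descriptor counts -> P(D + 1) dimensions.
counts = np.bincount(nearest, minlength=P).astype(float)
vlad_bow = np.hstack([C, counts[:, None]]).ravel()
assert vlad_bow.shape == (P * (D + 1),)

# MVLAD: append aggregated squared residuals (assumed per-word sum)
# -> 2PD dimensions.
sq = np.zeros((P, D))
for j in range(P):
    assigned = V[:, nearest == j]
    if assigned.size:
        sq[j] = ((assigned - W[:, [j]]) ** 2).sum(axis=1)
mvlad = np.hstack([C, sq]).ravel()
assert mvlad.shape == (2 * P * D,)
```

Both vectors would then be power- and L2-normalized with α = 0.5, exactly as for the original VLAD.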

3.3 Two nearest neighbor VLAD

In addition to the nearest neighbor center, we seek a second nearest neighbor center for each descriptor. This method is referred to as two nearest neighbor VLAD (abbreviated TNNVLAD). The dimension of the TNNVLAD representation is still PD. TNNVLAD is a kind of soft coding method and can reduce the representation error. The specific details are shown in Algorithm 4: if d_1 > βd_2, where d_1 and d_2 are the distances to the nearest and second nearest centers, the differences between v_i and its two nearest centers are each accumulated with a weight of 0.5. The value of β is chosen according to our experiments.

4 Experimental results

This section begins with a description of our experimental setting, followed by comparisons between our schemes and other prominent methods on three datasets: UIUC Sports Event, Corel 10 and 15 Scenes. Figure 1 shows example images from these datasets.
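The two-nearest-neighbor aggregation of Section 3.3 can be sketched as follows. Algorithm 4 itself is not reproduced in this excerpt, so the sketch follows only the textual rule: with d1 ≤ d2, when d1 > βd2 (the two centers are comparably close) each residual is accumulated with weight 0.5, otherwise only the nearest word is used; the fallback branch and the data are assumptions.

```python
import numpy as np

# Toy descriptors (D x M) and dictionary (D x P); beta = 0.8 as in the paper.
rng = np.random.default_rng(0)
D, M, P, beta = 128, 50, 10, 0.8
V = rng.standard_normal((D, M))
W = rng.standard_normal((D, P))

C = np.zeros((P, D))
for i in range(M):
    d = np.linalg.norm(W - V[:, [i]], axis=0)   # distances to all words
    j1, j2 = np.argsort(d)[:2]                  # two nearest words, d[j1] <= d[j2]
    if d[j1] > beta * d[j2]:                    # comparably close: soft assignment
        C[j1] += 0.5 * (V[:, i] - W[:, j1])
        C[j2] += 0.5 * (V[:, i] - W[:, j2])
    else:                                       # clear winner: hard assignment
        C[j1] += V[:, i] - W[:, j1]

tnnvlad = C.ravel()                             # still PD-dimensional
assert tnnvlad.shape == (P * D,)
```

Unlike VLAD+BoW and MVLAD, this variant changes only the assignment rule, so the output dimensionality remains PD.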

4.1 Experimental setting

A typical experimental setting for classifying images contains four steps. First, we adopt the widely used SIFT descriptor [25] due to its good performance in image classification reported in [12, 21, 36, 38]; SIFT features are invariant to image scale and rotation and robust across a substantial range of affine distortion, addition of noise and changes in illumination. To be consistent with previous work, we use the same setting to extract SIFT descriptors: 128-dimensional SIFT descriptors are densely extracted from image patches on a grid with a step size of 8 pixels and a single patch size of 16 × 16. We resize the maximum side (length or width) of each image to 300 pixels, except for UIUC Sports Event, where we resize the maximum side to 400 pixels because of the high resolution of the original images. Next, about twenty descriptors from each image are chosen at random to form a new matrix, which is taken as the input of the K-means clustering or sparse coding algorithm, and a dictionary of the specified size is learned. In the third step, we exploit the BoW, sparse coding, LLC, VLAD and improved VLAD schemes to encode the descriptors and produce each image's new representation. For the BoW model, the dimensionality of the new representation is the dictionary size P. For ScSPM and LLC, we combine a three-layer spatial pyramid matching model (21 subregions) with the max pooling function, so the dimension of the new representation is 21P. The dimensionalities of the VLAD and improved VLAD methods can be found in Algorithms 1-4. In the final step, we apply a linear SVM classifier to the new representations, randomly selecting some columns per class for training and some other columns per class for testing. We obtain the classification accuracy for each category by comparing the predicted labels of the test set with the ground-truth labels, and we average the per-category accuracies to get the overall classification accuracy. All results are obtained by repeating five independent experiments; the average classification accuracy and the standard deviation over the five runs are reported. All experiments are conducted in MATLAB on a server with an Intel X5650 CPU (2.66 GHz, 12 cores) and 32 GB RAM.

Fig. 1 Image examples of the datasets: UIUC Sports Event (the left four), Corel 10 (the middle four) and 15 Scenes (the right four)

For the TNNVLAD algorithm, Fig. 2 shows the choice of the parameter β on the three datasets. Specifically, Fig. 2 plots the classification accuracy of TNNVLAD as β varies over the interval [0.1, 1] with the dictionary size fixed at 130. The results indicate that β = 0.8 is the best choice for TNNVLAD, so we fix β = 0.8 in all our experiments.

Fig. 2 Classification accuracy of our TNNVLAD algorithm under different β on the UIUC Sports Event, Corel 10 and 15 Scenes datasets

4.2 UIUC Sports Event dataset

UIUC Sports Event [23] contains 8 categories and 1579 images in total, with the number of images per category ranging from 137 to 250: badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. To compare with other methods, we randomly select 70 images per class as training data and 60 images per class as test data. We compare the classification accuracy of our three improved VLAD schemes with the other four methods under different dictionary sizes in Fig. 3, where the dictionary size ranges from 10 to 420 with a step length of 10.
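The dense sampling grid of Section 4.1 can be sketched as follows; the patch coordinates are taken as top-left corners and the image size is an illustrative resized shape, both assumptions rather than details from the paper.

```python
import numpy as np

# Dense grid from Section 4.1: 16x16 patches, step 8 pixels, after
# resizing the longer image side to 300 pixels (400 for UIUC Sports Event).
step, patch = 8, 16
h, w = 300, 225                 # illustrative resized image shape

ys = np.arange(0, h - patch + 1, step)   # valid top-left rows
xs = np.arange(0, w - patch + 1, step)   # valid top-left columns
grid = [(y, x) for y in ys for x in xs]

# Each grid location yields one 128-dimensional SIFT descriptor.
print(len(grid))
```

For this 300 × 225 example the grid yields 36 × 27 = 972 patch locations, which illustrates why dense sampling produces thousands of descriptors per image.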

Fig. 3 Classification accuracy comparisons of various coding methods under different dictionary size on the UIUC Sports Event dataset

From the results presented in Fig. 3, we notice that the classification accuracy of our methods surpasses all the other algorithms when the dictionary size is small, and is comparable to the existing schemes when the dictionary size becomes large. This may be explained by the fact that the goal of VLAD is to aggregate local image descriptors into compact codes, so VLAD obtains good performance with small dictionary sizes. Furthermore, Fig. 3 shows that the performance of BoW is the lowest, ScSPM is better than BoW, and the classification accuracy of LLC is better still; these observations are consistent with reports in the existing literature. Based on Fig. 3, we list the best classification accuracy of the various approaches in Table 1, where the average classification accuracy, standard deviation and corresponding dictionary size are given.

Table 1 The best classification accuracy comparisons on the UIUC Sports Event dataset (mean ± std-dev %)

Algorithm    Classification accuracy (dictionary size)
BoW [6]      73.38 ± 0.85 (390)
ScSPM [38]   83.71 ± 2.20 (400)
LLC [36]     84.17 ± 1.36 (330)
VLAD [17]    84.38 ± 2.67 (220)
VLAD+BoW     85.29 ± 0.87 (210)
MVLAD        84.75 ± 1.85 (220)
TNNVLAD      85.25 ± 1.26 (220)

From Table 1, we can conclude that the best classification accuracies of our three improved methods are better than those of the other four schemes on the UIUC Sports Event dataset. Our VLAD+BoW and TNNVLAD methods achieve more than 1 % higher accuracy than LLC, which is the state of the art among the SPM-based methods. Furthermore, the original and improved VLAD achieve their best classification accuracy with small dictionary sizes, whereas BoW, ScSPM and LLC need large dictionaries to reach their highest accuracy.

The confusion matrices of our algorithms on the UIUC Sports Event dataset are shown in Fig. 4; to obtain them, the dictionary size is set to 130 for our three improved VLAD methods. In a confusion matrix, the element in the i-th row and j-th column (i ≠ j) is the percentage of images from class i that are misidentified as class j, and the average classification accuracies over five independent experiments for the individual classes are listed along the main diagonal. Figure 4 shows the classification and misclassification status of each individual class. Our algorithms perform well on the badminton and rock climbing classes. We also notice that the bocce and croquet classes have a high percentage of misclassification, which may result from their visual similarity: balls in the bocce and croquet classes have very similar appearance.

Fig. 4 Confusion Matrices of our algorithms (VLAD+BoW, MVLAD and TNNVLAD) on the UIUC Sports Event dataset

To further demonstrate the superiority of our methods in running speed, computation time comparisons of the various approaches under different dictionary sizes on the UIUC Sports Event dataset are reported in Fig. 5. The computation time of each method is the total time of five independent experiments, in seconds. From Fig. 5, the BoW method is the fastest due to its low-dimensional representation. Meanwhile, we also observe that the ScSPM algorithm is the slowest.
This is because ScSPM uses a sparse coding strategy both to learn the dictionary and to encode features, and solving the L1-norm minimization problem is very time-consuming. The computation times of VLAD and of our three improved VLAD methods are smaller than that of LLC. These experimental results show that our algorithms have a clear advantage in computation time.

Fig. 5 Computation time comparisons of various coding methods under different dictionary sizes on the UIUC Sports Event dataset

4.3 Corel 10 dataset

Corel 10 [26] contains 10 categories with 100 images per category: beach, buildings, elephants, flowers, food, horses, mountains, owls, skiing and tigers. Following the setting of [12, 26], we randomly select 50 images from each class as training data and use the remaining 50 images per class as test data. Classification accuracy comparisons of the various coding methods under different dictionary sizes on the Corel 10 dataset are shown in Fig. 6. We again see that our improved VLAD algorithms obtain good performance when the dictionary size is small.

Fig. 6 Classification accuracy comparisons of various coding methods under different dictionary sizes on the Corel 10 dataset

Based on Fig. 6, the best classification accuracy of each algorithm is reported in Table 2. From the results, we can see that the best classification accuracies of our three improved VLAD algorithms are better than those of the other four schemes on the Corel 10 dataset.

Table 2 The best classification accuracy comparisons on the Corel 10 dataset, (mean ± std-dev)%

Algorithm     Classification Accuracy (Dictionary Size)
BoW [6]       67.44 ± 0.91 (340)
ScSPM [38]    75.24 ± 1.24 (340)
LLC [36]      79.20 ± 1.66 (380)
VLAD [17]     78.76 ± 1.47 (110)
VLAD+BoW      79.88 ± 0.48 (130)
MVLAD         79.96 ± (280)
TNNVLAD       81.32 ± 1.45 (130)
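A recurring observation in these comparisons is that VLAD-based methods peak at small dictionary sizes while BoW needs a large one. One intuition is dimensionality: a BoW histogram has only K dimensions, whereas a VLAD vector has K × d dimensions (d being the local descriptor dimension), so even a small codebook yields a rich representation. The following sketch is our own minimal illustration, not the authors' exact pipeline; the signed-square-root and L2 normalization follow common VLAD practice.

```python
import numpy as np

def bow_encode(descriptors, centroids):
    """Bag-of-words: a K-dim histogram of nearest-centroid assignments."""
    k = centroids.shape[0]
    # squared distance from every descriptor to every centroid
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    hist = np.bincount(assign, minlength=k).astype(float)
    return hist / max(hist.sum(), 1.0)          # K dimensions

def vlad_encode(descriptors, centroids):
    """VLAD: accumulate residuals to the nearest centroid -> K*d dims."""
    k, d = centroids.shape
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    v = np.zeros((k, d))
    for i, a in enumerate(assign):
        v[a] += descriptors[i] - centroids[a]   # residual aggregation
    v = v.flatten()
    v = np.sign(v) * np.sqrt(np.abs(v))         # signed square-root (power) norm
    n = np.linalg.norm(v)
    return v / n if n > 0 else v                # K*d dimensions, L2-normalized

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))                 # e.g. 200 SIFT descriptors
C = rng.normal(size=(30, 128))                  # dictionary of K = 30 words
print(bow_encode(X, C).shape)                   # (30,)
print(vlad_encode(X, C).shape)                  # (3840,) = 30 * 128
```

With K = 130 and 128-dimensional SIFT, a VLAD vector already has 16 640 dimensions, which helps explain why the VLAD-based methods saturate at dictionary sizes where BoW is still underpowered.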

Fig. 7 Confusion matrices of our algorithms (VLAD+BoW, MVLAD and TNNVLAD) on the Corel 10 dataset (%)

Moreover, all the VLAD-based algorithms obtain their best classification accuracy with a small dictionary size, whereas BoW, ScSPM and LLC need a large dictionary size to reach theirs. Our TNNVLAD method is about two percentage points higher than the next best method, LLC. The confusion matrices for the Corel 10 dataset are given in Fig. 7. Our algorithms perform well on the flower and horse classes and poorly on the mountain class. Figure 8 gives the computation time comparisons of the various coding methods under different dictionary sizes on the Corel 10 dataset. The ScSPM algorithm requires the most time of the seven algorithms. Although MVLAD needs more time than BoW and LLC, it still requires far less than ScSPM.

4.4 15 Scenes dataset

The 15 Scenes dataset [21] contains 15 categories and 4485 images in total, with the number of images per category ranging from 200 to 400. The 15 categories are bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office and store. The image content is diverse, containing not only indoor scenes, such as living room and store, but also outdoor scenes, such as coast and forest. For comparison with other methods, we randomly select 100 images per class as training data and use the rest as test data. Figure 9 gives the classification accuracy comparisons of the various coding methods under different dictionary sizes on the 15 Scenes dataset. Algorithms based on VLAD get better performance than ScSPM and LLC when the dictionary size is small, but become slightly worse than LLC as the dictionary size increases.

Fig. 8 Computation time comparisons of various coding methods under different dictionary sizes on the Corel 10 dataset
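The confusion matrices discussed throughout this section are row-normalized, so each row sums to 100 % and the main diagonal carries the per-class accuracies. A minimal sketch of this bookkeeping (our own illustration, with made-up labels rather than the paper's data):

```python
import numpy as np

def confusion_percent(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix in percent: entry (i, j) is the
    percentage of class-i test images labeled as class j, so the main
    diagonal holds the per-class classification accuracies."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row = cm.sum(axis=1, keepdims=True)
    row[row == 0] = 1                      # guard against empty classes
    return 100.0 * cm / row

# toy example: two classes, four test images each
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]
cm = confusion_percent(y_true, y_pred, 2)
print(cm)   # [[75. 25.]
            #  [25. 75.]]
```

Averaging such matrices over the five independent runs gives the diagonal accuracies reported in Figs. 4, 7 and 10.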

Fig. 9 Classification accuracy comparisons of various coding methods under different dictionary sizes on the 15 Scenes dataset

Based on the data in Fig. 9, the best classification accuracies are presented in Table 3. On the 15 Scenes dataset, the best performance of our improved VLAD algorithms is comparable with, or slightly lower than, that of LLC and ScSPM. The confusion matrices for the 15 Scenes dataset are shown in Fig. 10. Our algorithms perform well on the calsuburb and forest classes. Besides, the bedroom and living room classes have a high misclassification rate, as do the kitchen and living room classes; this may be because they are visually similar to each other. Figure 11 reports the computation time comparisons of the various coding methods under different dictionary sizes on the 15 Scenes dataset. The ScSPM algorithm requires the most time of the seven algorithms.

Table 3 The best classification accuracy comparisons on the 15 Scenes dataset, (mean ± std-dev)%

Algorithm     Classification Accuracy (Dictionary Size)
BoW [6]       65.87 ± 0.61 (90)
ScSPM [38]    79.54 ± 0.70 (420)
LLC [36]      80.85 ± 1.02 (420)
VLAD [17]     77.35 ± 0.50 (400)
VLAD+BoW      80.09 ± 0.51 (280)
MVLAD         78.82 ± 0.50 (280)
TNNVLAD       79.23 ± 0.62 (400)
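Every accuracy in Tables 1–3 is a mean ± standard deviation over five independent experiments, with the best-performing dictionary size reported in parentheses. That bookkeeping can be sketched as follows; the numbers here are purely illustrative, not values from the paper.

```python
import numpy as np

# accuracies (%) from five independent runs at each dictionary size
# (illustrative numbers only)
runs = {
    100: [76.1, 77.0, 76.5, 75.8, 76.9],
    200: [78.2, 79.0, 78.6, 78.1, 78.8],
    400: [79.1, 79.4, 78.9, 79.7, 79.2],
}

def summarize(runs):
    """Mean and sample std-dev per dictionary size, plus the best size."""
    stats = {k: (np.mean(v), np.std(v, ddof=1)) for k, v in runs.items()}
    best = max(stats, key=lambda k: stats[k][0])
    return stats, best

stats, best = summarize(runs)
for k, (m, s) in sorted(stats.items()):
    print(f"K={k}: {m:.2f} +/- {s:.2f}")
print("best dictionary size:", best)   # best dictionary size: 400
```

The tables then report the (mean, std-dev) pair at the winning dictionary size for each method.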

Fig. 10 Confusion matrices of our algorithms (VLAD+BoW, MVLAD and TNNVLAD) on the 15 Scenes dataset (%)

Fig. 11 Computation time comparisons of various coding methods under different dictionary sizes on the 15 Scenes dataset

5 Conclusion and future work

In this paper, three feature coding schemes based on VLAD are proposed for image classification. We compare our schemes with several state-of-the-art methods, including BoW, ScSPM, LLC and VLAD. Experiments on three different kinds of datasets (the UIUC Sports Event, Corel 10 and 15 Scenes datasets) demonstrate that the classification accuracy of our improved VLAD coding strategies is better than that of the four classical methods under a small dictionary size. It is also noteworthy that our schemes are much faster than ScSPM, because the ScSPM algorithm needs more time to learn the dictionary and encode features with its sparse coding strategy; in many applications, classification accuracy and classification speed must be considered simultaneously. In the future, we will try to find more efficient feature coding strategies and apply them to large-scale image datasets.

Acknowledgments This work is sponsored by NUPTSF (Grant No. NY214168), the National Natural Science Foundation of China (Grant Nos. 61300164, 61272247), the Shanghai Science and Technology Committee (Grant No. 13511500200) and the European Union Seventh Framework Programme (Grant No. 247619).

References

1. Arandjelovic R, Zisserman A (2013) All about VLAD. In: IEEE conference on computer vision and pattern recognition, pp 1578–1585
2. Boiman O, Shechtman E, Irani M (2008) In defense of nearest-neighbor based image classification. In: IEEE conference on computer vision and pattern recognition, pp 1–8
3. Bosch A, Zisserman A, Muñoz X (2008) Scene classification using a hybrid generative/discriminative approach. IEEE Trans Pattern Anal Mach Intell 30(4):712–727
4. Cinbis RG, Verbeek J, Schmid C (2012) Image categorization using Fisher kernels of non-iid image models. In: IEEE conference on computer vision and pattern recognition, pp 2184–2191
5. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

6. Csurka G, Dance CR, Fan LX, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV, vol 1, p 22
7. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE conference on computer vision and pattern recognition, vol 1, pp 886–893
8. Delhumeau J, Gosselin PH, Jégou H, Pérez P (2013) Revisiting the VLAD image representation. In: ACM international conference on multimedia, pp 653–656
9. Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 15(12):3736–3745
10. Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comput Vis Image Underst 106(1):59–70
11. Freund Y, Schapire R (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational learning theory, pp 23–37
12. Gao SH, Tsang IWH, Chia LT, Zhao PL (2010) Local features are not lonely: Laplacian sparse coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3555–3561
13. Grauman K, Darrell T (2005) The pyramid match kernel: discriminative classification with sets of image features. In: International conference on computer vision, vol 2, pp 1458–1465
14. Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset
15. Harada T, Ushiku Y, Yamashita Y, Kuniyoshi Y (2011) Discriminative spatial pyramid. In: IEEE conference on computer vision and pattern recognition, pp 1617–1624
16. Jégou H, Douze M, Schmid C, Pérez P (2010) Aggregating local descriptors into a compact image representation. In: IEEE conference on computer vision and pattern recognition, pp 3304–3311
17. Jégou H, Perronnin F, Douze M, Sánchez J, Pérez P, Schmid C (2012) Aggregating local image descriptors into compact codes. IEEE Trans Pattern Anal Mach Intell 34(9):1704–1716
18. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: International conference on computer vision, vol 1, pp 604–610
19. Krapac J, Verbeek J, Jurie F (2011) Modeling spatial layout with Fisher vectors for image categorization. In: IEEE international conference on computer vision, pp 1487–1494
20. Kulkarni N, Li BX (2011) Discriminative affine sparse codes for image classification. In: IEEE conference on computer vision and pattern recognition, pp 1609–1616
21. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol 2, pp 2169–2178
22. Li FF, Pietro P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol 2, pp 524–531
23. Li LJ, Li FF (2007) What, where and who? Classifying events by scene and object recognition. In: International conference on computer vision, pp 1–8
24. Long X, Lu H, Li W (2012) Image classification based on nearest neighbor basis vectors. Multimed Tools Appl:1–18
25. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
26. Lu Z, Ip HHS (2009) Image categorization with spatial mismatch kernels. In: IEEE conference on computer vision and pattern recognition, pp 397–404
27. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. In: Advances in neural information processing systems 19
28. Morel J, Yu G (2009) ASIFT: a new framework for fully affine invariant image comparison. SIAM J Imaging Sci 2(2):438–469
29. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: IEEE conference on computer vision and pattern recognition, pp 1–8
30. Perronnin F, Sánchez J, Mensink T (2010) Improving the Fisher kernel for large-scale image classification. In: European conference on computer vision, pp 143–156
31. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: IEEE conference on computer vision and pattern recognition, pp 1–8
32. Picard D, Gosselin PH (2011) Improving image similarity with vectors of locally aggregated tensors. In: IEEE international conference on image processing, pp 669–672
33. Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T, Van Gool L (2005) Modeling scenes with local descriptors and latent aspects. In: International conference on computer vision, vol 1, pp 883–890

34. Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB: an efficient alternative to SIFT or SURF. In: International conference on computer vision
35. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: International conference on computer vision, pp 1470–1477
36. Wang JJ, Yang JC, Yu K, Lv FJ, Huang T, Gong YH (2010) Locality-constrained linear coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3360–3367
37. Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. IEEE Trans Pattern Anal Mach Intell 30(11):1985–1997
38. Yang JC, Yu K, Gong YH, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 1794–1801
39. Yang L, Jin R, Sukthankar R, Jurie F (2008) Unifying discriminative visual codebook generation with classifier training for object category recognition. In: IEEE conference on computer vision and pattern recognition, pp 1–8
40. Yu K, Zhang T, Gong YH (2009) Nonlinear learning using local coordinate coding. Adv Neural Inf Process Syst 22:2223–2231
41. Zhou X, Yu K, Zhang T, Huang TS (2010) Image classification using super-vector coding of local image descriptors. In: European conference on computer vision, pp 141–154

Xianzhong Long obtained his Ph.D. degree from Shanghai Jiao Tong University in June 2014. He received his B.S. degree from Henan Polytechnic University in 2007 and his M.S. degree from Xihua University in 2010, both in computer science. He is now an assistant professor at Nanjing University of Posts and Telecommunications. His research interests are computer vision, machine learning and image processing, specifically image classification, object recognition and clustering.

Hongtao Lu received his Ph.D. degree in Electronic Engineering from Southeast University, Nanjing, in 1997. After graduation he became a postdoctoral fellow in the Department of Computer Science, Fudan University, Shanghai, China, where he spent two years. In 1999, he joined the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, where he is now a professor. His research interests include machine learning, computer vision and pattern recognition, and information hiding. He has published more than sixty papers in international journals, such as IEEE Transactions and Neural Networks, and in international conferences; his papers have received more than 400 citations from other researchers. Yong Peng received the B.S. degree in computer science from Hefei New Star Research Institute of Applied Technology and the M.S. degree from the Graduate University of the Chinese Academy of Sciences. He is now working towards his Ph.D. degree at Shanghai Jiao Tong University. His research interests include machine learning, pattern recognition and evolutionary computation.

Xianzhong Wang received the B.S. degree in computer science from Anhui University of Technology. He is now a Master's candidate in the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research interests include machine learning and human action recognition. Shaokun Feng received the B.S. degree in information science from the University of Shanghai for Science and Technology. He is now working towards his M.S. degree at Shanghai Jiao Tong University. His research interests include machine learning, pattern recognition and deep learning.