Locality-constrained and Spatially Regularized Coding for Scene Categorization


Aymen Shabou, Hervé Le Borgne
CEA, LIST, Vision & Content Engineering Laboratory, Gif-sur-Yvette, France

Abstract

Improving coding and spatial pooling for bag-of-words based feature design has gained a lot of attention in recent works addressing object recognition and scene classification. Regarding the coding step in particular, properties such as sparsity, locality and saliency have been investigated. The main contribution of this work consists in taking into account the local spatial context of an image in the usual coding strategies proposed in the state-of-the-art. For this purpose, given an image, dense local features are extracted and structured in a lattice. The latter is endowed with a neighborhood system and pairwise interactions. We propose a new objective function to encode local features, which preserves locality constraints both in the feature space and in the spatial domain of the image. In addition, an appropriate efficient optimization algorithm is provided, inspired by the graph-cut framework. In conjunction with the max-pooling operation and the spatial pyramid matching, which reflects a global spatial layout, the proposed method improves the performance of several state-of-the-art coding schemes for scene classification on three publicly available benchmarks (UIUC 8-sport, Scene-15 and Caltech-101).

1. Introduction

In recent works addressing object recognition and scene classification tasks, the bag-of-words (BoW) model is one of the most popular approaches to feature design. Inspired by the seminal work of [26], different approaches have been proposed to improve both its generative property, i.e., its ability to describe images accurately, and its discriminative power for classification. Despite remarkable progress, challenges remain concerning the extraction of local descriptors, codebook design, local descriptor coding and pooling, the inclusion of a spatial layout in the final feature, and the final classification.

[Figure 1. Schematic comparison of basis selection methods to code dense descriptors: (1) locality-constrained assignment, as adopted by some recent coding approaches [30, 12, 20]; (2) locality-constrained and spatially regularized assignment, corresponding to the proposed LCSR method.]

Given a training dataset, the first step of the BoW method consists in extracting local features, such as SIFT [21], HOG [8] and SURF [1], from images. Then a codebook (or a dictionary), which is a set of visual words, is built to represent them. Initial methods are based on clustering techniques, such as K-means [26]. Despite their efficiency, the obtained codebooks suffer from several drawbacks such as distortion errors and low discriminative ability [14, 28]. A more appropriate unsupervised dictionary learning method is sparse coding, which aims to learn an over-complete codebook ensuring a sparse representation of local descriptors [23, 22]. However, this approach is computationally expensive, even if progress was made toward accelerating the process [16]. Other approaches have rather attempted to improve the discriminative power of the codebook while compacting it, relying on supervised methods [14, 22, 2].
However, recent works [6, 24] show that, for the recognition task, codebook design is less critical than the subsequent stages (coding, pooling and spatial layout).
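In line with this observation, the experiments reported below simply build the codebook with K-means on a random subset of descriptors. The following is a minimal sketch, assuming SIFT descriptors have already been extracted; the array shapes, the scikit-learn API and the parameter names are the only assumptions made here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, K=1024, n_samples=100_000, seed=0):
    """Cluster a random subset of local descriptors into K visual words.

    descriptors: (N, d) array of local features (e.g., 128-D SIFT).
    Returns B, a (K, d) array whose rows are the basis vectors b_i.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=min(n_samples, len(descriptors)), replace=False)
    kmeans = KMeans(n_clusters=K, n_init=4, random_state=seed).fit(descriptors[idx])
    return kmeans.cluster_centers_

# Example: B = build_codebook(all_training_sifts)   # B.shape == (1024, 128)
```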

Coding consists in decomposing local features over a codebook in order to satisfy some desirable properties. Various strategies have been proposed in the literature. The earliest one is hard coding [26], a voting scheme that is simple yet highly sensitive to reconstruction errors induced by the codebook. A more robust voting approach is soft coding [28], which assigns a descriptor to all the visual words according to their distances. Sparse coding is an alternative [32] that is time consuming and, moreover, inconsistent when encoding similar descriptors [30, 11]. The authors of [33] introduced another coding property, called locality, that ensures sparsity while remaining efficient. Several implementations have been proposed by [30, 20], where each descriptor is coded on locally selected bases. Note also that in [12], the authors give another explanation for the success of locality coding, namely saliency: for a given descriptor and its corresponding local bases, the closer the nearest visual word is to the descriptor in comparison to the remaining local bases, the stronger its coding response should be.

The next step of BoW design is pooling the obtained codes into a compact signature. Usually, the max-pooling operation is used, leading to signatures that are appropriate for linear classifiers [32, 3, 2, 20]. Finally, the Spatial Pyramid Matching (SPM) step, proposed in [15], is usually exploited to include some spatial layout information in the BoW. Such fixed-size vectors can then feed a machine learning algorithm such as an SVM [7] or Boosting [25].

In the current work, the local feature coding step is investigated. While several techniques have outperformed the classic hard assignment by introducing either the locality or the similarity constraint in the feature space [30, 20, 12, 11], we propose a new formalism that implicitly preserves these properties while adding local contextual information from the spatial domain of the image. Figure 1 shows a schematic comparison. The proposed coding approach is divided into two steps.

1. The first step is an optimal basis selection for each local feature, formulated as a labeling problem. For this purpose, we introduce a novel objective function that includes locality and similarity (or coherency) constraints in both the feature space and the spatial domain of the image. Furthermore, we provide an appropriate efficient optimization algorithm, called α_knn-expansion, which is inspired by the fast optimization tools dedicated to Markov Random Field (MRF) based energy minimization [4].

2. The second step consists in assigning responses (or values) to the selected optimal bases.

This new approach enriches the BoW signature, leading to more accurate features for classification than the state-of-the-art methods. Furthermore, it is generic and can thus be added to several recent coding strategies.

The remainder of this paper is organized as follows. Section 2 discusses related work on the coding step within the BoW feature generation framework. The new coding strategy is introduced in section 3. Section 4 reports experimental studies and results on the following benchmarks: UIUC 8-sport [17] and 15-natural scenes [15] for event and scene classification respectively, and Caltech-101 [9] for object recognition.

2. Related work

Let us consider a codebook denoted by $B = \{b_i;\ b_i \in \mathbb{R}^d;\ i \in \mathcal{N}\}$, with $\mathcal{N} = \{1, \dots, K\}$, $K$ the size of the codebook and $d$ the dimensionality of a visual word (or basis vector).
The codebook is constructed on a subset of local descriptors $\{x_i;\ x_i \in \mathbb{R}^d;\ i = 1, \dots, N\}$ extracted from the training dataset. In the original BoW method [26], the coding of local descriptors is performed with hard assignment: each local descriptor is assigned to its nearest visual word, i.e.,

$$z_{i,j} = \begin{cases} 1 & \text{if } j = \operatorname*{argmin}_{j' = 1, \dots, K} \|x_i - b_{j'}\|_2^2, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

with $z_i$ the code of size $K$ associated with the descriptor $x_i$. As reported in [14, 22, 28], such coding has several limitations, mainly its sensitivity to the distortion errors of the codebook. Using sparse coding [32] as an alternative has significantly improved the robustness to these problems. Coding is then performed by solving the $\ell_1$-norm regularized approximation problem:

$$z_i = \operatorname*{argmin}_{z \in \mathbb{R}^K} \|x_i - Bz\|_2^2 + \lambda \|z\|_1, \quad \lambda \in \mathbb{R}. \qquad (2)$$

Nevertheless, this optimization problem is computationally expensive and leads to inconsistent encoding of similar descriptors [30, 11]. Indeed, it might select different bases for similar descriptors due to the over-completeness of the codebook, which results in large deviations in the representation of similar local features.
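For reference, problem (2) can be solved with off-the-shelf l1 solvers. A minimal sketch using scikit-learn's SparseCoder is given below; the atom normalization and the value of lambda are illustrative assumptions, and this is only meant to make (2) concrete, not to describe the coding advocated in this paper.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def sparse_codes(X, B, lam=0.15):
    """Approximately solve eq. (2) for each row of X (N, d) over the codebook B (K, d)."""
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)   # l2-normalize the atoms
    coder = SparseCoder(dictionary=B_unit,
                        transform_algorithm="lasso_lars",
                        transform_alpha=lam)
    return coder.transform(X)                               # (N, K) sparse codes
```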

Therefore, the authors of [30, 20] propose more efficient and consistent coding methods relying on the locality property introduced by [33]. Their hypothesis is that descriptors approximately reside on a lower-dimensional manifold in the ambient descriptor space, so that Euclidean distances between descriptors and visual words are only meaningful within a local region. Hence, local bases are selected to perform the coding. The generalized formulation of the locality-constrained coding (LCC) problem is:

$$z_i = \operatorname*{argmin}_{z \in \mathbb{R}^K} \|x_i - Bz\|_2^2 + \lambda \|d_i \odot z\|_2^2, \quad \text{s.t. } \mathbf{1}^T z = 1, \qquad (3)$$

with $\odot$ the element-wise product, $d_i = \exp\!\left(\frac{\mathrm{dist}(x_i, B)}{\sigma}\right)$, $\mathrm{dist}(x_i, B) = [\mathrm{dist}(x_i, b_1), \dots, \mathrm{dist}(x_i, b_K)]^T$ the Euclidean distances between $x_i$ and the basis vectors, and $\sigma$ a parameter controlling the weight decay speed of the locality.

An alternative to improve the consistency of sparse coding has been proposed by [11]. It consists in adding a Laplacian term to the objective function (2), to perform codebook learning as well as the coding of local features, i.e.,

$$\operatorname*{argmin}_{B, Z} \|X - BZ\|_2^2 + \lambda \sum_i \|z_i\|_1 + \beta\, \mathrm{tr}(Z L Z^T), \quad \text{s.t. } \|b_j\|_2 \le 1,\ \forall j \in \mathcal{N}, \qquad (4)$$

with $L = A - W$ the Laplacian matrix obtained from the similarity matrix $W$ encoding the relationships between local features, and $A_{m,m} = \sum_n W_{m,n}$. Due to the extremely high number of local descriptors in a dataset, constructing the Laplacian matrix and learning sparse codes simultaneously is computationally infeasible, so a restriction to a selected set of local features, called template features, was necessary. But this technique remains computationally expensive due to the search for the k-nearest template features followed by the similarity-constrained sparse coding.

Salient coding [12] is another alternative that has shown interesting results while remaining efficient. It exploits the locality constraint to replace the conventional hard assignment (1) with a saliency degree of the nearest bases $b_j$ to $x_i$, defined as:

$$\psi_{i,j} = \phi\!\left( \frac{\|x_i - b_j\|_2^2}{\frac{1}{k-1} \sum_{m \neq j} \|x_i - \hat{b}_m\|_2^2} \right), \qquad (5)$$

where $\phi(\cdot)$ is a monotonically decreasing function and $\{\hat{b}_m\}_{m = 1, \dots, k}$ is the set of the k nearest bases to $x_i$. Salient coding improves the consistency of locality coding when it is performed on a small number of local bases compared to the dimension of the descriptors.

All the aforementioned coding schemes are applied to local features independently, except for Laplacian sparse coding, where a global similarity between local features is used to constrain sparsity. However, dense local features share contextual information locally in the spatial domain of a given image. This can be seen simply by computing a local pairwise similarity map between the local features extracted from an image, which shows local correlations in some regions of the image (see figure 4). Discarding this contextual information in the coding step induces codes that are inconsistent in terms of spatial context, and that are also less reliable when the spatial pooling operation is conducted to design the final signature. In the same figure, we also show the indices of the bases assigned to local features (as colored maps) using locality-constrained hard and soft coding strategies; considering the local spatial information improves the consistency of the coding with respect to the image context. To our knowledge, introducing the local contextual information in the coding step of the BoW approach has not been proposed in any previous work. In the next section, we propose a robust and fast method to achieve this type of coding.
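To make the schemes reviewed above concrete, here is a minimal sketch of hard assignment (1), localized soft assignment in the spirit of [20], and a salient response in the spirit of (5), for one descriptor over its k nearest visual words. The function names, the inverse-temperature parameter beta_s and the choice phi(t) = 1 - t are illustrative assumptions, not values taken from the cited papers.

```python
import numpy as np

def knn_bases(x, B, k=10):
    """Indices of the k nearest visual words (rows of B) to descriptor x, plus distances."""
    d2 = np.sum((B - x) ** 2, axis=1)            # squared Euclidean distances to all K words
    idx = np.argsort(d2)[:k]
    return idx, d2[idx]

def hard_code(x, B):
    """Eq. (1): a single 1 at the nearest visual word."""
    z = np.zeros(len(B))
    z[np.argmin(np.sum((B - x) ** 2, axis=1))] = 1.0
    return z

def localized_soft_code(x, B, k=10, beta_s=10.0):
    """Soft assignment restricted to the k nearest words (in the spirit of [20])."""
    idx, d2 = knn_bases(x, B, k)
    w = np.exp(-beta_s * d2)
    z = np.zeros(len(B))
    z[idx] = w / w.sum()
    return z

def salient_code(x, B, k=10):
    """Saliency response in the spirit of eq. (5): the nearest word receives a response
    measuring how much closer it is than the remaining k-1 local bases
    (phi(t) = 1 - t is one admissible monotonically decreasing choice)."""
    idx, d2 = knn_bases(x, B, k)
    z = np.zeros(len(B))
    z[idx[0]] = max(0.0, 1.0 - d2[0] / (d2[1:].mean() + 1e-12))
    return z
```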
3. Locality-constrained and spatially regularized coding

We propose here an energy-based formulation to achieve a robust basis selection, required to encode the dense local features of a given image. Then, a fast resolution algorithm is provided, relying on the graph-cut framework. Finally, we present the additional operations required to generate the final BoW signature.

3.1. Energy model

Under a Markovian assumption on a given image I, we can assume that neighboring patches within a constant or smooth region of the image should be coded on shared, or at least similar, bases from the codebook. In contrast, for neighboring patches within a discontinuous region, the corresponding local bases can differ, depending on the local descriptor information. Such an assumption leads to some interesting properties of the final coding: the assignment is less noisy, since the local spatial context is taken into account; the saliency property is extended to the spatial neighborhood; and the coding of similar descriptors is consistent with the Markovian prior assumed on images.

In order to obtain a robust basis selection following the proposed spatial assumption as well as the locality assumption stated in [33], we reformulate the problem as a labeling one. Formally, let us consider an image I. We denote by $\mathcal{P} = \{1, \dots, N_I\}$ the set of indices of the dense patches (or, more generally, sites) in I. A set of local features $X = \{x_p;\ x_p \in \mathbb{R}^d;\ p \in \mathcal{P}\}$ is extracted from I at all sites. Given a codebook $B = \{b_i;\ b_i \in \mathbb{R}^d;\ i \in \mathcal{N}\}$, we consider that each local feature is assigned to a subset of basis vectors of cardinality m belonging to the codebook. For simplicity of notation, a local feature $x_p$ is assigned to a set $y_p$ of indices of bases in B. Therefore, $\mathbf{Y} = \{y_p;\ y_p \in \mathcal{N}^m;\ p \in \mathcal{P}\}$ denotes the assignment of all the local features of the image I. In the LCC case for instance, each vector $y_p$ contains the indices of the m nearest visual words to $x_p$. The set of basis vectors related to the indices in $y_p$ is denoted $\hat{B}_p = \{\hat{b}_{p,i};\ i = 1, \dots, m\}$, and we define $\hat{B} = \{\hat{B}_p;\ p \in \mathcal{P}\}$. We also denote by $L_p = \{l_p^1, l_p^2, \dots, l_p^k\}$ the set of indices of the k nearest visual words to the local feature $x_p$, and call it the set of possible labels that site p can take. Following the locality assumption, each local feature should be assigned to bases of cardinality m within the set of its k nearest visual words in the codebook (k is set large enough to cover a large neighborhood in the feature space, i.e., k > m).
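A minimal sketch of these data structures on a regular grid of sites is given below, assuming the descriptors are laid out row by row; the 4-nearest-neighbor system and the m-nearest-word initialization follow the text, while the array layouts and function name are assumptions.

```python
import numpy as np

def build_lattice(X_grid, B, k=10, m=3):
    """X_grid: (H, W, d) dense descriptors on a regular grid; B: (K, d) codebook.
    Returns:
      L     - (H*W, k) indices of the k nearest visual words per site (the label sets L_p),
      Y0    - (H*W, m) initial assignment Y^(0): the m nearest words per site,
      edges - list of (p, q) pairs for the 4-neighbor system on the grid.
    """
    H, W, d = X_grid.shape
    X = X_grid.reshape(-1, d)
    # squared Euclidean distances between every site descriptor and every visual word
    d2 = (X ** 2).sum(1, keepdims=True) - 2.0 * X @ B.T + (B ** 2).sum(1)
    L = np.argsort(d2, axis=1)[:, :k]
    Y0 = L[:, :m].copy()
    edges = []
    for r in range(H):
        for c in range(W):
            p = r * W + c
            if c + 1 < W: edges.append((p, p + 1))    # right neighbor
            if r + 1 < H: edges.append((p, p + W))    # bottom neighbor
    return L, Y0, edges
```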

The labeling problem we consider consists in retrieving the optimal bases for each local feature among its k nearest ones, under the spatial contextual constraint. As a result, the locality assumption is enforced both in the feature space and in the spatial domain of the image. We introduce the following energy function to model this problem:

$$E(\mathbf{Y}) = \underbrace{\sum_{p \in \mathcal{P}} \underbrace{f_{\mathrm{data}}(x_p, \hat{B}_p)}_{E_p(y_p)}}_{E_{\mathrm{data}}} \;+\; \beta \underbrace{\sum_{p \sim q} \underbrace{w_{p,q}\, f_{\mathrm{prior}}(\hat{B}_p, \hat{B}_q)}_{E_{p,q}(y_p, y_q)}}_{E_{\mathrm{prior}}}, \qquad (6)$$

where $f_{\mathrm{data}}(x_p, \hat{B}_p) = \sum_{i=1}^{m} \|x_p - \hat{b}_{p,i}\|_2^2$ is the total distance between a descriptor $x_p$ and its m selected bases; $p \sim q$ indicates the indices of two spatially neighboring patches under a fixed neighborhood system (the 4-nearest-neighbor grid for instance); $f_{\mathrm{prior}}(\hat{B}_p, \hat{B}_q) = \sum_{i=1}^{m} \|\hat{b}_{p,i} - \hat{b}_{q,i}\|$ is the sum of the distances between the bases assigned to the neighboring patches $x_p$ and $x_q$; and $w_{p,q}$ is a local regularization parameter corresponding to the similarity between the local patches $x_p$ and $x_q$: the more similar the local patches are, the more we regularize the basis selection. Among the existing similarity measures for local features, we use the histogram intersection kernel [31], denoted $K(\cdot, \cdot)$, since it has shown good performance on histogram-based local features. We set the local hyper-parameters as follows:

$$w_{p,q} = \begin{cases} K(x_p, x_q) & \text{if } K(x_p, x_q) \geq T, \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

On the one hand, this binary form of the local hyper-parameters ensures regularization only on similar neighboring patches, above a similarity threshold T. On the other hand, it reduces the sensitivity of the model to the global regularization parameter β. $E_{\mathrm{data}}$ is thus a likelihood term that penalizes assigning visual words far from the descriptors, whereas $E_{\mathrm{prior}}$ is a prior term that penalizes assigning different visual words to similar neighboring patches. Minimizing (6) leads to an optimal assignment configuration:

$$\tilde{\mathbf{Y}} = \operatorname*{argmin}_{\mathbf{Y}} E(\mathbf{Y} \mid \mathbf{X}, \hat{B}, W). \qquad (8)$$

We note that particular cases of the proposed energy function lead to the state-of-the-art basis selection schemes required for coding: β = 0 and m = 1 yields hard and salient coding [26, 15, 12]; β = 0 and m > 1 yields the approximate LCC as implemented in [30, 20].
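A minimal sketch of the energy (6) and of the weights (7) is given below, reusing the sites, codebook and 4-neighbor edges from the earlier snippet. Treating the histogram intersection of l1-normalized SIFT descriptors as the similarity K(., .) is an assumption made for illustration.

```python
import numpy as np

def intersection_kernel(xp, xq):
    """Histogram intersection of two l1-normalized descriptors."""
    return np.minimum(xp, xq).sum()

def pair_weights(X, edges, T=0.7):
    """Eq. (7): keep the similarity only above the threshold T."""
    w = {}
    for p, q in edges:
        k = intersection_kernel(X[p], X[q])
        w[(p, q)] = k if k >= T else 0.0
    return w

def energy(Y, X, B, edges, w, beta=1.0):
    """Eq. (6): data term plus spatially regularized prior term.
    Y: (P, m) selected word indices per site; X: (P, d); B: (K, d)."""
    e_data = sum(((X[p] - B[Y[p]]) ** 2).sum() for p in range(len(X)))
    e_prior = sum(w[(p, q)] * np.linalg.norm(B[Y[p]] - B[Y[q]], axis=1).sum()
                  for p, q in edges)
    return e_data + beta * e_prior
```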
3.2. Fast optimization algorithm

The proposed energy function (6) is non-convex for general forms of the distance functions $f_{\mathrm{prior}}$ and $f_{\mathrm{data}}$. Its minimization can nevertheless be performed efficiently, drawing on fast optimization tools dedicated to pairwise multi-label MRF energies [18]. In particular, the graph-cut approach has been successfully used to solve many labeling problems in computer vision [27]. One of the most popular iterative graph-cut based approximate multi-label optimization algorithms is the α-expansion [4], which relies on iterative binary moves of the current configuration [29]. At a given iteration (i), each site can either keep its current label or change it to a new one $\alpha^{(i)} \in \mathcal{L}$, with $\mathcal{L}$ a discrete set of labels. This binary move is performed optimally by building an appropriate graph on which a minimum-cut/maximum-flow is computed [10]. Binary partition moves are iterated until convergence to a local optimum of the energy; such large moves efficiently reach good local optima of non-convex energy functions. Besides its effectiveness, the algorithm can be applied to various labeling problems, even with unordered labels.

To solve the optimization problem (8), we propose an optimization algorithm extending the α-expansion in two directions. On the one hand, it deals with vectorial labels, i.e., a finite set of labels is assigned to each site. On the other hand, we constrain each site to take a label within an associated subset of labels, which can change from one site to another. The proposed optimization algorithm is called α_knn-expansion. At each iteration, it performs a binary expansion move toward a set of labels $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_m\} \in \mathcal{N}^m$, only for the subset of sites $S_\alpha \subset \mathcal{P}$ that have α within their set of k-nearest visual word indices, i.e., $S_\alpha = \{p \in \mathcal{P} \text{ such that } \alpha \subset L_p\}$. These sites are called active sites. Thereby, at each iteration of the optimization algorithm, a global binary move of all active sites is performed. Binary moves are iterated over a number of vectorial labels $\{\alpha^1, \alpha^2, \dots, \alpha^n\}$ within a cycle; if the energy decreases, a new cycle is started, until convergence to a local optimum (figure 2). We note that a possible initialization $\mathbf{Y}^{(0)}$ is given by the m nearest bases, as used for LCC.
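Figure 2 below gives the full pseudocode. As an illustration of the control flow only, here is a Python sketch in which the optimal graph-cut expansion move of step 2.a is replaced by a naive greedy stand-in; the names greedy_binary_move and n_moves are ours, not the paper's, and the snippet reuses energy, L, edges and w from the previous sketches.

```python
import numpy as np

def greedy_binary_move(Y, alpha, active, X, B, edges, w, beta):
    """Stand-in for the optimal graph-cut expansion move of step 2.a (figures 2-3):
    each active site keeps its labels or switches to alpha, whichever lowers the energy.
    The paper computes this move exactly with a min-cut/max-flow solver instead.
    This naive version is very slow and is meant only for illustration."""
    Y_new = Y.copy()
    for p in active:
        switch = Y_new.copy()
        switch[p] = alpha
        if energy(switch, X, B, edges, w, beta) < energy(Y_new, X, B, edges, w, beta):
            Y_new = switch
    return Y_new

def alpha_knn_expansion(Y0, X, B, L, edges, w, beta=1.0, n_moves=50, seed=0):
    """Outer loop of the alpha_knn-expansion (figure 2): cycles of binary moves toward
    randomly drawn vectorial labels alpha, restricted to the active sites."""
    rng = np.random.default_rng(seed)
    m = Y0.shape[1]
    Y, best = Y0.copy(), energy(Y0, X, B, edges, w, beta)
    improved = True
    while improved:                                    # one pass of this loop = one cycle
        improved = False
        for _ in range(n_moves):
            p = rng.integers(len(L))                   # draw a candidate vectorial label
            alpha = L[p, rng.choice(L.shape[1], size=m, replace=False)]
            active = [q for q in range(len(L)) if np.isin(alpha, L[q]).all()]
            Y_new = greedy_binary_move(Y, alpha, active, X, B, edges, w, beta)
            e_new = energy(Y_new, X, B, edges, w, beta)
            if e_new < best:
                Y, best, improved = Y_new, e_new, True
    return Y
```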

Figure 2. The α_knn-expansion based optimization algorithm:

  Input: Y^(0), W, B̂, X      Output: Ŷ
  For each cycle c do
    1. Select n vectorial labels {α^i}_{i=1..n} within the sets of k-nearest basis indices of the local features.
    2. For each iteration i ≤ n do
       (a) Perform an optimal binary expansion move toward α^i:
           Y^(i) = argmin_Y E(Y | X, B̂, W), such that y_p^(i) ∈ {y_p^(i-1), α^i}, ∀p ∈ P.
       (b) Ỹ := Y^(i)
    3. If E(Ỹ) < E(Y^(c-1)), then Y^(c) := Ỹ; else return Ỹ.

In order to achieve an optimal binary move (step 2.a in figure 2), a directed graph $G_\alpha = (\mathcal{A}_\alpha, \mathcal{E}_\alpha)$ is built, where $\mathcal{A}_\alpha$ is the set of nodes related to the active sites and $\mathcal{E}_\alpha$ is the set of oriented edges connecting neighboring nodes. Two auxiliary nodes s and t are added for the maximum-flow computation. Based on the efficient graph construction originally proposed in [13] for the α-expansion, figure 3 illustrates the graph topography of the proposed α_knn-expansion move as well as the capacities on its edges. Once the graph is constructed, the maximum flow is computed in polynomial time owing to the sub-modularity of the proposed energy function [13]. Indeed, for a binary expansion move, the sub-modularity constraint on the energy is satisfied for any likelihood term and any metric prior term [4]; this is the case of (6), since we use a metric prior to compute the distances between optimal bases. The efficient polynomial-time maximum-flow algorithm proposed by [13] (1) is well suited to grid-structured graphs, and thus to our problem. Additionally, minimizing (6) with the proposed α_knn-expansion algorithm is fast, since the expansion moves are restricted to the active sites only.

(1) http://pub.ist.ac.at/~vnk/software.html

[Figure 3. Illustrative example of the graph construction on an image with 4x4 sites: (a) graph topography for one optimal move of the α_knn-expansion algorithm for a given label α, where the active sites are connected to the source and sink nodes and the remaining sites are inactive at the current iteration; (b) detailed construction for the two possible neighboring configurations, active-active and active-inactive, with the corresponding edge capacities.]

3.3. Coding and pooling

Once the optimal bases are selected, assigning a response to each selected basis vector can be achieved with various strategies. We consider recent strategies from the literature, namely hard responses [15], salient responses [12] and soft responses, obtained either by solving a linear system [30] or by computing a posteriori probabilities of the local features belonging to the selected optimal bases [20]. For each image, the obtained codes are then aggregated with a max-pooling operation, following recent works [30, 12, 20], resulting in a unique and compact signature vector.
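A minimal sketch of the max-pooling and 3-level spatial pyramid used to turn per-patch codes into an image signature is given below. The grid geometry and equal layer weights follow the pipeline of section 4.1; the function name and the final l2 normalization are assumptions of ours.

```python
import numpy as np

def spm_signature(codes, H, W, levels=(1, 2, 4)):
    """Max-pool per-patch codes over a spatial pyramid.
    codes: (H*W, K) code vectors laid out row by row on the patch grid.
    Returns the concatenated signature of length K * sum(l*l for l in levels)."""
    grid = codes.reshape(H, W, -1)
    parts = []
    for l in levels:                                   # 1x1, 2x2 and 4x4 layers
        rs = np.linspace(0, H, l + 1).astype(int)
        cs = np.linspace(0, W, l + 1).astype(int)
        for i in range(l):
            for j in range(l):
                cell = grid[rs[i]:rs[i + 1], cs[j]:cs[j + 1]]
                parts.append(cell.reshape(-1, cell.shape[-1]).max(axis=0))   # max-pooling
    sig = np.concatenate(parts)
    return sig / (np.linalg.norm(sig) + 1e-12)         # l2-normalize before the linear SVM
```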
4. Experimental results

In this section, the proposed approach is tested on scene classification and object recognition tasks and evaluated on three well-known benchmarks intensively used in the literature [15, 30, 12, 20]: UIUC 8-sport [17], 15-natural scenes [15] and the Caltech-101 object categories [9].

4.1. Pipeline

In all the experiments we conduct, the same processing chain is used, following the literature settings to ensure consistency. The pipeline is as follows:

- dense local features (128-dimensional SIFTs) are extracted on a regular spatial grid, at a single scale, from downsized images (the maximum image size differs between 15-natural scenes and UIUC 8-sport); the step size is fixed to 4 pixels, with a fixed patch size;

- a codebook of size 1024 is created using K-means clustering on a randomly selected subset of SIFTs from the training dataset (about 10^5 SIFTs);

[Figure 4. Illustrative example of the local basis selection required for coding: (a) original image; (b) visualization of the extracted dense SIFTs using the method of [19]; (c) a pairwise local feature similarity map, for visualization only, obtained by setting the map elements {(w^h_{p,q})^2 + (w^v_{p,q})^2; p ~ q; (p, q) ∈ P^2} to zero under the threshold T = 0.7; (d) the conventional hard assignment (the color map corresponds to the indices of the selected visual words from the codebook); (e) the 3 nearest bases selected for each local feature as performed with LLC (an RGB color map visualizes the three visual word indices of each local patch); spatially regularized (f) hard assignment and (g) LLC-based assignment using the proposed approach.]

- for coding, we fix the step size for patch extraction to 4 pixels, with a fixed patch size. Even if the theoretical coding formalism of section 3 concerns patches surrounding all pixels, using a small step size does not reduce the consistency of the coding and accelerates its computation. Regarding the computation time required to retrieve the optimal bases, the choice of the two parameters m and k is crucial. In order to accelerate convergence, we set the number of optimal bases retrieved for each local feature to m = 3 among the k = 10 nearest visual words, which results in a reduced number of possible vectorial labels α^i ∈ N^3 to visit within each cycle of the optimization algorithm. We observed empirically that enlarging the set of retrieved optimal bases slightly improves the classification performance while increasing the computational time;

- the max-pooling operation is performed (even with hard assignment coding) and the SPM [15] with 3 levels (1x1, 2x2 and 4x4) is adopted, with the same weight at each layer;

- a one-vs-all linear SVM classifier is used, since it has shown good performance in categorization when paired with the max-pooling operation [32] (a skeleton of the full pipeline is sketched at the end of this subsection).

Our method is integrated into several coding schemes, namely hard assignment coding (HC) [15], locality-constrained linear coding (LLC) [30], salient coding (SC) [12] and localized soft assignment coding (LSC) [20], showing the improvement achieved when the local spatial context within images is included in the coding process. We have to mention that, for some of the state-of-the-art works, the results shown in the original papers are obtained with pipelines different from the one adopted here. For instance, sparse coding is used as an alternative to K-means for learning the codebook in [30, 11], dense SIFTs are extracted at three scales in [30, 12], a mix-order max-pooling operation is applied in [20], and the number of selected local bases varies from 5 to 10, etc. Therefore, in order to achieve fair comparisons with these works and obtain a coherent assessment, it was necessary to conduct all the experiments with the same pipeline and a common implementation. Since the source code of existing methods is not always available, we had to re-implement them and, as often happens, our engineering choices led to minor performance differences compared to the results reported in the original papers. Nevertheless, the performances we obtained remain fully consistent with existing works. Our experimental approach, consisting in using a common implementation and pipeline to ensure consistency, was also suggested by [5].
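A skeleton of this pipeline, reusing the sketches given earlier (build_codebook, build_lattice, pair_weights, alpha_knn_expansion, spm_signature), is shown below. The dense_sift extractor is a hypothetical placeholder, and hard responses on the selected bases are used for brevity, whereas the experiments also use soft and salient responses.

```python
import numpy as np
from sklearn.svm import LinearSVC

def image_signature(image, B, k=10, m=3, beta=1.0, T=0.7):
    """Dense descriptors -> LCSR basis selection -> codes -> max-pooled SPM signature."""
    X_grid = dense_sift(image)                       # (H, W, 128); hypothetical extractor
    H, W, _ = X_grid.shape
    L, Y0, edges = build_lattice(X_grid, B, k=k, m=m)
    X = X_grid.reshape(H * W, -1)
    w = pair_weights(X, edges, T=T)                  # eq. (7)
    Y = alpha_knn_expansion(Y0, X, B, L, edges, w, beta=beta)
    codes = np.zeros((H * W, len(B)))
    for p in range(H * W):                           # hard responses on the selected bases
        codes[p, Y[p]] = 1.0
    return spm_signature(codes, H, W)

def train_classifier(signatures, labels):
    """One-vs-all linear SVM on the pooled signatures, as in the experiments."""
    return LinearSVC(C=1.0).fit(np.vstack(signatures), labels)
```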
4.2. UIUC 8-sport

This dataset contains 8 sport categories for image-based event classification [17]. There are 1579 images, and each class has 137 to 250 images. Following the standard setting on this dataset, we use 10 random splits of the data; for each category, we randomly select 70 training images and 60 test images. Table 1 reports the classification accuracies obtained with the state-of-the-art coding approaches and with the proposed LCSR coding method. As we can see, for all the coding strategies, adding the contextual spatial information improves the classification accuracy significantly.
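For reference, a minimal sketch of this evaluation protocol (10 random splits with 70 training and 60 test images per class, accuracy averaged over splits) is given below; the function name, the sklearn classifier and the default parameters are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def evaluate(signatures, labels, n_train=70, n_test=60, n_splits=10, seed=0):
    """Mean and standard deviation of the accuracy (%) over random per-class splits."""
    signatures, labels = np.asarray(signatures), np.asarray(labels)
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            train_idx.extend(idx[:n_train])
            test_idx.extend(idx[n_train:n_train + n_test])
        clf = LinearSVC(C=1.0).fit(signatures[train_idx], labels[train_idx])
        accs.append(clf.score(signatures[test_idx], labels[test_idx]))
    return float(np.mean(accs)) * 100, float(np.std(accs)) * 100
```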

On the one hand, LCSR improves the classic hard assignment method (HC) by reducing assignment errors due to codebook distortions; we note that the classification accuracy of such a regularized hard assignment reaches the performance of recent locality-based coding methods. On the other hand, it improves the optimal basis selection step required by the locality-constrained coding strategies (LLC, LSC and SC). We note that for the saliency-based coding (SC), our approach extends the saliency property, originally considered in the feature space only, to the local spatial information. Hence, non-salient patches in the spatial domain of the image are discarded at the max-pooling step, yielding features that are more discriminative in terms of class-relative salient visual words and thus well suited to linear classifiers. We shall also indicate that the LCSR operation is computationally fast and therefore does not noticeably affect the computational time of the coding methods it is plugged into. As an order of magnitude, the time required to select the 3 optimal bases of each SIFT of an image, among its 10 nearest bases, is less than one second on a 2.66 GHz CPU. Compared to other recent works, we outperform the best classification rate of [11] (85.31%), which was obtained with a different and computationally far more demanding pipeline.

Table 1. Classification accuracies (%) on the UIUC 8-sport data set.
  Coding method   original approach   using the LCSR
  HC [15]                 ±               ± 1.68
  LLC [30]                ±               ± 1.52
  LSC [20]                ±               ± 1.56
  SC [12]                 ±               ± 1.14

[Figure 5. Confusion matrix for the UIUC 8-sport data set; per-class accuracies on the diagonal: Rock Climbing 95.8, Badminton 93.5, Bocce 64.8, Croquet 84.5, Polo 84.7, Rowing 90, Sailing 96.5, Snowboarding 88.]

4.3. 15-natural scenes

This dataset [15] contains 4485 images of 15 scene categories, each containing 200 to 400 images. Scenes vary from indoor to outdoor environments. Following the standard setup, we use 10 random splits of the data, considering 100 random images per class for training and the rest for testing. Table 2 provides the classification accuracies of the different approaches. Similarly to the previous dataset, adding the spatial context to local feature coding always improves the performance, by 2% on average.

Table 2. Classification accuracies (%) on the 15-natural scenes data set.
  Coding method   original approach   using the LCSR
  HC [15]                 ±               ± 0.5
  LLC [30]                ±               ± 0.36
  LSC [20]                ±               ± 0.51
  SC [12]                 ±               ± 0.5

4.4. Caltech-101

This dataset [9] has 101 object categories containing from 31 to 800 images each. We use 10 random splits of the data, considering 30 random images per class for training and the rest for testing, and provide the average classification rates in table 3. In contrast to the two previous datasets, which contain scene images, the current task deals with object recognition, for which the local spatial context is less relevant to image understanding. Nevertheless, it still leads to some improvement in classification accuracy. Indeed, the local spatial information may improve the coding step by reducing assignment errors that are mainly due to artifacts characterizing the images of this dataset (often resulting from transformations applied synthetically to the images to evaluate the robustness of classification to them).

Table 3. Classification accuracies (%) on the Caltech-101 data set.
  Coding method   original approach   using the LCSR
  HC [15]                 ±               ± 0.52
  LLC [30]                ±               ± 1.23
  LSC [20]                ±               ± 0.81
  SC [12]                 ±               ± 0.75
5. Conclusion

In this paper, we presented a promising local feature encoding method that exploits locality properties in both the feature space and the spatial domain of the image. Results show that our contribution improves state-of-the-art coding schemes, increasing the classification rates by 1% to 4% on UIUC 8-sport, 15-natural scenes and Caltech-101, using a standard experimental pipeline. Ongoing efforts are devoted to analyzing the proposed method in multi-label and multi-instance recognition tasks, since incorporating the local spatial information into features should be beneficial for recognizing multiple objects in a given image.

Acknowledgment: This work has been partially funded by I2S in the context of the Polinum project. We acknowledge support from the French ANR (Agence Nationale de la Recherche) through the YOJI (ANR-09-CORD-104) and PERIPLUS (ANR-10-CORD-026) projects.

References

[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV.
[2] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR.
[3] Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in vision algorithms. In ICML.
[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI.
[5] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC.
[6] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML.
[7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR.
[9] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVIU.
[10] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Classic Papers in Combinatorics.
[11] S. Gao, I. Tsang, L. Chia, and P. Zhao. Local features are not lonely - Laplacian sparse coding for image classification. In CVPR.
[12] Y. Huang, K. Huang, Y. Yu, and T. Tan. Salient coding for image classification. In CVPR.
[13] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? PAMI.
[14] S. Lazebnik and M. Raginsky. Supervised learning of quantizer codebooks by information loss minimization. PAMI.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR.
[16] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS.
[17] L. J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV.
[18] S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer-Verlag New York Inc.
[19] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: dense correspondence across scenes and its applications. PAMI.
[20] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In ICCV.
[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV.
[22] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In CVPR.
[23] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research.
[24] R. Rigamonti, M. A. Brown, and V. Lepetit. Are sparse representations really relevant for image classification? In CVPR.
[25] R. E. Schapire. The strength of weak learnability. Machine Learning.
[26] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In ICCV.
[27] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. PAMI.
[28] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek. Visual word ambiguity. PAMI.
[29] O. Veksler. Efficient Graph-Based Energy Minimization. PhD thesis, Cornell University.
[30] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR.
[31] J. Wu and J. M. Rehg. Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel. In CVPR.
[32] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR.
[33] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In NIPS.


Supervised learning. y = f(x) function

Supervised learning. y = f(x) function Supervised learning y = f(x) output prediction function Image feature Training: given a training set of labeled examples {(x 1,y 1 ),, (x N,y N )}, estimate the prediction function f by minimizing the

More information

Ask the locals: multi-way local pooling for image recognition

Ask the locals: multi-way local pooling for image recognition Ask the locals: multi-way local pooling for image recognition Y-Lan Boureau, Nicolas Le Roux, Francis Bach, Jean Ponce, Yann Lecun To cite this version: Y-Lan Boureau, Nicolas Le Roux, Francis Bach, Jean

More information

MULTI-REGION SEGMENTATION

MULTI-REGION SEGMENTATION MULTI-REGION SEGMENTATION USING GRAPH-CUTS Johannes Ulén Abstract This project deals with multi-region segmenation using graph-cuts and is mainly based on a paper by Delong and Boykov [1]. The difference

More information

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015 on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015 Vector visual representation Fixed-size image representation High-dim (100 100,000) Generic, unsupervised: BoW,

More information

Local Image Features

Local Image Features Local Image Features Ali Borji UWM Many slides from James Hayes, Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Overview of Keypoint Matching 1. Find a set of distinctive key- points A 1 A 2 A 3 B 3

More information

Multiple Kernel Learning for Emotion Recognition in the Wild

Multiple Kernel Learning for Emotion Recognition in the Wild Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,

More information

Tensor Decomposition of Dense SIFT Descriptors in Object Recognition

Tensor Decomposition of Dense SIFT Descriptors in Object Recognition Tensor Decomposition of Dense SIFT Descriptors in Object Recognition Tan Vo 1 and Dat Tran 1 and Wanli Ma 1 1- Faculty of Education, Science, Technology and Mathematics University of Canberra, Australia

More information

CELLULAR AUTOMATA BAG OF VISUAL WORDS FOR OBJECT RECOGNITION

CELLULAR AUTOMATA BAG OF VISUAL WORDS FOR OBJECT RECOGNITION U.P.B. Sci. Bull., Series C, Vol. 77, Iss. 4, 2015 ISSN 2286-3540 CELLULAR AUTOMATA BAG OF VISUAL WORDS FOR OBJECT RECOGNITION Ionuţ Mironică 1, Bogdan Ionescu 2, Radu Dogaru 3 In this paper we propose

More information

A Novel Extreme Point Selection Algorithm in SIFT

A Novel Extreme Point Selection Algorithm in SIFT A Novel Extreme Point Selection Algorithm in SIFT Ding Zuchun School of Electronic and Communication, South China University of Technolog Guangzhou, China zucding@gmail.com Abstract. This paper proposes

More information

Visual words. Map high-dimensional descriptors to tokens/words by quantizing the feature space.

Visual words. Map high-dimensional descriptors to tokens/words by quantizing the feature space. Visual words Map high-dimensional descriptors to tokens/words by quantizing the feature space. Quantize via clustering; cluster centers are the visual words Word #2 Descriptor feature space Assign word

More information

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Anand K. Hase, Baisa L. Gunjal Abstract In the real world applications such as landmark search, copy protection, fake image

More information

SPM-BP: Sped-up PatchMatch Belief Propagation for Continuous MRFs. Yu Li, Dongbo Min, Michael S. Brown, Minh N. Do, Jiangbo Lu

SPM-BP: Sped-up PatchMatch Belief Propagation for Continuous MRFs. Yu Li, Dongbo Min, Michael S. Brown, Minh N. Do, Jiangbo Lu SPM-BP: Sped-up PatchMatch Belief Propagation for Continuous MRFs Yu Li, Dongbo Min, Michael S. Brown, Minh N. Do, Jiangbo Lu Discrete Pixel-Labeling Optimization on MRF 2/37 Many computer vision tasks

More information

Efficient Kernels for Identifying Unbounded-Order Spatial Features

Efficient Kernels for Identifying Unbounded-Order Spatial Features Efficient Kernels for Identifying Unbounded-Order Spatial Features Yimeng Zhang Carnegie Mellon University yimengz@andrew.cmu.edu Tsuhan Chen Cornell University tsuhan@ece.cornell.edu Abstract Higher order

More information

arxiv: v1 [cs.cv] 2 Sep 2013

arxiv: v1 [cs.cv] 2 Sep 2013 A Study on Unsupervised Dictionary Learning and Feature Encoding for Action Classification Xiaojiang Peng 1,2, Qiang Peng 1, Yu Qiao 2, Junzhou Chen 1, and Mehtab Afzal 1 1 Southwest Jiaotong University,

More information

Markov Random Fields and Segmentation with Graph Cuts

Markov Random Fields and Segmentation with Graph Cuts Markov Random Fields and Segmentation with Graph Cuts Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project Proposal due Oct 27 (Thursday) HW 4 is out

More information

NTHU Rain Removal Project

NTHU Rain Removal Project People NTHU Rain Removal Project Networked Video Lab, National Tsing Hua University, Hsinchu, Taiwan Li-Wei Kang, Institute of Information Science, Academia Sinica, Taipei, Taiwan Chia-Wen Lin *, Department

More information

OBJECT CATEGORIZATION

OBJECT CATEGORIZATION OBJECT CATEGORIZATION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it Slides: Ing. Lamberto Ballan November 18th, 2009 What is an Object? Merriam-Webster Definition: Something material that may be

More information

UNSUPERVISED CO-SEGMENTATION BASED ON A NEW GLOBAL GMM CONSTRAINT IN MRF. Hongkai Yu, Min Xian, and Xiaojun Qi

UNSUPERVISED CO-SEGMENTATION BASED ON A NEW GLOBAL GMM CONSTRAINT IN MRF. Hongkai Yu, Min Xian, and Xiaojun Qi UNSUPERVISED CO-SEGMENTATION BASED ON A NEW GLOBAL GMM CONSTRAINT IN MRF Hongkai Yu, Min Xian, and Xiaojun Qi Department of Computer Science, Utah State University, Logan, UT 84322-4205 hongkai.yu@aggiemail.usu.edu,

More information

Enhanced Random Forest with Image/Patch-Level Learning for Image Understanding

Enhanced Random Forest with Image/Patch-Level Learning for Image Understanding 2014 22nd International Conference on Pattern Recognition Enhanced Random Forest with Image/Patch-Level Learning for Image Understanding Wai Lam Hoo, Tae-Kyun Kim, Yuru Pei and Chee Seng Chan Center of

More information

A Scalable Unsupervised Feature Merging Approach to Efficient Dimensionality Reduction of High-dimensional Visual Data

A Scalable Unsupervised Feature Merging Approach to Efficient Dimensionality Reduction of High-dimensional Visual Data 2013 IEEE International Conference on Computer Vision A Scalable Unsupervised Feature Merging Approach to Efficient Dimensionality Reduction of High-dimensional Visual Data Lingqiao Liu CECS, Australian

More information

Recent Progress on Object Classification and Detection

Recent Progress on Object Classification and Detection Recent Progress on Object Classification and Detection Tieniu Tan, Yongzhen Huang, and Junge Zhang Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition

More information

Visual Object Recognition

Visual Object Recognition Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, 14.07.2008 & Kristen Grauman Department

More information

Learning based face hallucination techniques: A survey

Learning based face hallucination techniques: A survey Vol. 3 (2014-15) pp. 37-45. : A survey Premitha Premnath K Department of Computer Science & Engineering Vidya Academy of Science & Technology Thrissur - 680501, Kerala, India (email: premithakpnath@gmail.com)

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Multiple VLAD encoding of CNNs for image classification

Multiple VLAD encoding of CNNs for image classification Multiple VLAD encoding of CNNs for image classification Qing Li, Qiang Peng, Chuan Yan 1 arxiv:1707.00058v1 [cs.cv] 30 Jun 2017 Abstract Despite the effectiveness of convolutional neural networks (CNNs)

More information

Modeling Visual Cortex V4 in Naturalistic Conditions with Invari. Representations

Modeling Visual Cortex V4 in Naturalistic Conditions with Invari. Representations Modeling Visual Cortex V4 in Naturalistic Conditions with Invariant and Sparse Image Representations Bin Yu Departments of Statistics and EECS University of California at Berkeley Rutgers University, May

More information

Ordinal Random Forests for Object Detection

Ordinal Random Forests for Object Detection Ordinal Random Forests for Object Detection Samuel Schulter, Peter M. Roth, Horst Bischof Institute for Computer Graphics and Vision Graz University of Technology, Austria {schulter,pmroth,bischof}@icg.tugraz.at

More information

Action Recognition Using Super Sparse Coding Vector with Spatio-Temporal Awareness

Action Recognition Using Super Sparse Coding Vector with Spatio-Temporal Awareness Action Recognition Using Super Sparse Coding Vector with Spatio-Temporal Awareness Xiaodong Yang and YingLi Tian Department of Electrical Engineering City College, City University of New York Abstract.

More information