Proceedings of the 5th International Symposium on Communications, Control and Signal Processing, ISCCSP 2012, Rome, Italy, 2-4 May 2012

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING

Wei Liu, Serkan Kiranyaz and Moncef Gabbouj
Department of Signal Processing, Tampere University of Technology, Tampere, Finland
wei.liu@tut.fi, serkan@cs.tut.fi, moncef.gabbouj@tut.fi

ABSTRACT

Natural scene recognition and classification have received considerable attention in the computer vision community due to their challenging nature. Significant intra-class variations have largely limited the accuracy of scene categorization: a holistic representation forces matching within strict spatial confinement, whereas a bag-of-features representation ignores the spatial layout of the scene entirely, resulting in a loss of scene logic. In this paper we present a novel method, called ARP (Angular Radial Partitioning) Gist, for scene classification. Experiments show that the proposed method improves recognition accuracy by better representing the structure in a scene and by striking a balance between spatial confinement and spatial freedom.

Index Terms: scene classification, angular radial partitioning, scene gist

1. INTRODUCTION

Recent advances in computer vision have split the approaches to understanding the semantics of natural scene images into two directions: global representation and the orderless bag-of-features (BOF) model [13]. The former, proposed in [10], attempts to capture the gist of a scene without object segmentation and recognition. The Gist descriptor is a low-dimensional representation of the attributes of a scene, namely naturalness, openness, roughness, expansion and ruggedness. The scene classification paradigm based on this holistic perspective was later compared to human performance in a rapid scene classification experiment [4], whose results provided evidence that representing the global attributes of a scene parallels the human visual and cognitive system. Further experiments in psychology and cognitive science [5] suggest that a mere 50 ms on average can be sufficient for scene recognition. These findings explain why the Gist descriptor performs remarkably well on scene recognition tasks (especially on outdoor categories), with applications extending to place recognition [12]. By dividing the image into an N-by-N grid, however, the Gist descriptor imposes strong constraints on spatial layout and yet fails to delineate the spatial structure within each block. Consequently, mismatches occur due to the averaging operation in individual blocks.

On the other end of the spectrum is the BOF model. Inspired by the bag-of-words model in text categorization, this paradigm represents each image as an occurrence histogram of visual words, which are local descriptors of regions or patches in the image. The SIFT descriptor [8] has been widely used as a local feature for the BOF model. The first stage in computing the SIFT (Scale Invariant Feature Transform) descriptor is to detect interest points that are repeatable under moderate local transformations [9]. A descriptor is then generated from an image patch around each interest point. This powerful descriptor is highly discriminative and invariant to scale, clutter, partial occlusion, and changes in illumination and viewpoint.
These patch features are extracted from images to form a codebook using the k-means algorithm, with the cluster centroids serving as the visual vocabulary. An image can then be represented by the occurrence counts of each visual word in the vocabulary, yielding an orderless representation of the scene. To ensure classification accuracy, the codebook must be large enough that each image can be properly represented by its histogram. Due to significant intra-class variations, such a requirement is not easily satisfied. Furthermore, the codebook-building process is often computationally intensive, which limits the efficiency of its application. The most prominent weakness, however, arises from the loss of scene logic caused by completely ignoring the spatial layout. We argue that the logic of a scene is essential to its recognition and classification, while the computational cost imposed by the BOF model is highly undesirable. In order to better capture the shape characteristics of objects and the spatial structure within a block of an image, we propose in this paper a novel algorithm that not only delineates the structures within a block, but also provides leeway for spatial freedom.
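For concreteness, the codebook pipeline described above can be sketched in a few lines. This is a minimal sketch, assuming scikit-learn's KMeans and pre-computed local descriptors; the function names and the vocabulary size are illustrative choices, not part of the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=200):
    # Cluster local descriptors; the centroids form the visual vocabulary.
    return KMeans(n_clusters=k, n_init=10).fit(descriptors).cluster_centers_

def bof_histogram(descriptors, codebook):
    # Assign each descriptor to its nearest visual word, then count
    # occurrences: the orderless BOF representation of the image.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```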

The paper is organized as follows. Section 2 briefly summarizes the algorithm of the original Gist descriptor and then introduces the proposed method, with the important elements discussed in detail. The parameters used in feature extraction and the classification experiments are described in Section 3. Section 4 reports experimental results with comparisons to other well-known implementations. Section 5 concludes the paper.

2. THE PROPOSED ALGORITHM

In this section we present the proposed ARP (Angular Radial Partitioning) Gist technique in detail, followed by explicit theoretical justifications.

2.1. Implementation Procedure

The proposed algorithm is built upon the implementation of the original Gist descriptor (code available at http://people.csail.mit.edu/torralba/code/spatialenvelope/), which is summarized in the following.

2.1.1. Original Gist Descriptor

First, a grayscale image is pre-processed by a whitening filter to preserve the dominant structural details and then normalized with respect to local contrast. The pre-processed image is passed through a bank of Gabor filters (Figure 1) at S scales with O orientations per scale. Each of the resulting S×O images (orientation maps), representing the original image at one orientation of one scale, is then divided into an N-by-N grid. Within each block of the grid, the average intensity is computed as the feature of that block. The final output is a concatenated feature vector of S×O×N×N dimensions.

Figure 1: Gabor filter bank (4 scales, 8 orientations per scale).

2.1.2. ARP Gist

Instead of taking the average value within each block of the N-by-N grid, we further partition each block into A bins using Angular Radial Partitioning (ARP) [2]. To avoid over-partitioning, only angular partitioning is considered; in other words, the number of radial partitions is set to 1 for all blocks. The average intensity level is then computed in each angular bin, followed by a 1-D discrete Fourier transform over the angular bins of each block; the magnitudes of the DFT coefficients are taken to achieve positional invariance. Finally, the feature vector is obtained by concatenating all the DFT-transformed bins across all orientations and scales, resulting in an S×O×N×N×A dimensional feature vector. Figure 2 shows the complete block diagram of the proposed method; note that the implementation procedure of the original Gist (circled in red) is also included in the figure.

Figure 2: Flowchart of the original Gist and ARP Gist.
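As a rough illustration of Sec. 2.1.1, the sketch below computes a bare-bones Gist-style vector from a grayscale image. This is a minimal sketch under stated assumptions: the whitening and contrast-normalization steps are omitted, and the Gabor kernel size, bandwidth and scale-to-frequency mapping are illustrative choices, not the parameters of the released MIT code.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, size=32, sigma=8.0):
    # Real part of a spatial-domain Gabor kernel at one scale/orientation.
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * freq * xr)

def gist_features(image, scales=4, orients=8, grid=4):
    # image: 2-D float array (grayscale, e.g. 256x256), assumed pre-processed.
    feats = []
    for s in range(scales):
        freq = 0.25 / (2 ** s)  # assumed scale-to-frequency mapping
        for o in range(orients):
            theta = o * np.pi / orients
            resp = np.abs(fftconvolve(image, gabor_kernel(freq, theta),
                                      mode='same'))
            h, w = resp.shape
            # Average the filter response over each cell of the N-by-N grid.
            for by in range(grid):
                for bx in range(grid):
                    cell = resp[by * h // grid:(by + 1) * h // grid,
                                bx * w // grid:(bx + 1) * w // grid]
                    feats.append(cell.mean())
    return np.array(feats)  # S*O*N*N = 4*8*4*4 = 512 dimensions
```

The ARP Gist variant of Sec. 2.1.2 replaces the single block mean with per-bin means followed by DFT magnitudes; a sketch of that step follows Eq. (1) below.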

2.2. Angular Radial Partitioning

ARP has been successfully applied in content-based image retrieval (CBIR), sketch-based image retrieval (SBIR) [3] and object recognition [1]. It employs both angular and radial partitioning, in a manner similar to a polar coordinate system. One main advantage of ARP is its ability to capture intricate structures in an angular-spatial manner, as opposed to the simple spatial distribution of a rectangular partitioning scheme. Figure 3 shows a typical ARP strategy.

Figure 3: Angular Radial Partitioning.

Spatial layout is an important part of a scene image, as it carries essential information about the image's category. In order to preserve the relative spatial layout of a scene image while allowing moderate intra-class variation within each class (e.g., a stove may appear in the middle of the image or at its left center), the Gist descriptor is computed on an N-by-N grid. Even though this coarse partitioning scheme has yielded significant success in terms of recognition accuracy, it fails to represent the spatial structures within a block efficiently: the averaging operator often renders different structures indistinguishable, resulting in mismatches among scene categories. Figure 4 shows an example of such a deficiency. Even though the two spatial structures are visually distinct to a human observer, the Gist feature vectors cannot discriminate between the two images.

Figure 4: Limitations of the Gist descriptor: a simple spatial structure in a block and the corresponding Gist feature vector; another spatial structure and the corresponding Gist feature vector.

For a better representation of the spatial structures of a scene image, we propose a strategy that builds on the success of the original Gist feature. In addition to the N-by-N rectangular partitioning (Figure 5), we further divide each block using ARP into A angular bins, which extracts not only the coarse spatial layout but also the finer angular layout of a scene image. To avoid coincidence with further rectangular partitioning, we use the upper-right diagonal as the starting boundary of ARP in each block, as illustrated in Figure 5.

Figure 5: Demonstration of rectangular partitioning and ARP: an image partitioned on a 4-by-4 grid, with angular partitioning in addition to the original rectangular partitioning (A=8).

Figure 6 shows the same example as Figure 4, but with the additional ARP. Since the two blocks are divided into 4 additional angular bins, the dissimilarity between the two resulting feature vectors becomes significant enough to distinguish the two structures.

Figure 6: ARP Gist in a block: a simple spatial structure in a block and the corresponding ARP Gist feature vector; another spatial structure and the corresponding ARP Gist feature vector.

2.3. Positional Invariance

Even though ARP can better delineate the spatial structure in a block, it carries the same risk as any other block scheme: over-partitioning. The idea of dividing an image into blocks is to preserve some spatial layout in the process of recognition or matching. Finer partitioning means stricter layout confinement, which does not hold across different scene images of the same category. This is the reason why the original Gist descriptor is computed on a 4-by-4 grid instead of an 8-by-8 one. Experiments (see Section 4 for results) have shown that over-partitioning does not improve classification accuracy and may even erode it. The same is true of ARP: further dividing the 4-by-4 grid can squander the leeway gained by better representing structure, since the same spatial structures in different scene images of the same category often enjoy spatial freedom within an area of the image, e.g., a computer can sit at different positions along the surface of a desk. In light of this dilemma, the proposed method utilizes the discrete Fourier transform to achieve rotational, or positional, invariance.

Let I denote an image block and A the number of angular partitions, so that each bin subtends an angle of \theta = 2\pi/A. The i-th element of the feature vector of the block is the average intensity over the i-th bin:

f(i) = \frac{1}{S} \sum_{(x,y) \in \mathrm{bin}_i} I(x,y), \qquad i = 0, 1, \ldots, A-1,   (1)

where \mathrm{bin}_i covers the angular range [i\theta, (i+1)\theta) and S is the total number of pixels that fall into the bin.
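The binning of Eq. (1) and the DFT-magnitude step, whose invariance is derived next, can be sketched as follows. This is a minimal illustration with assumed conventions: angles are measured around the block center, and the upper-right-diagonal starter of Figure 5 is approximated by a fixed offset of pi/4.

```python
import numpy as np

def arp_gist_block(block, A=4, start=np.pi / 4):
    # Eq. (1): average intensity over each of A angular bins around the
    # block center; 'start' offsets the first bin boundary (the paper
    # starts ARP at the upper-right diagonal).
    h, w = block.shape
    y, x = np.mgrid[0:h, 0:w]
    ang = (np.arctan2(y - (h - 1) / 2.0,
                      x - (w - 1) / 2.0) - start) % (2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi / A)).astype(int), A - 1)
    f = np.array([block[bins == i].mean() for i in range(A)])
    # Sec. 2.3: taking 1-D DFT magnitudes removes the dependence on
    # which bin a structure happens to fall into (Eq. (8)).
    return np.abs(np.fft.fft(f))

# Sanity check of Eq. (8): a circular shift of the bin features leaves
# the DFT magnitudes unchanged.
f = np.random.rand(6)
assert np.allclose(np.abs(np.fft.fft(f)),
                   np.abs(np.fft.fft(np.roll(f, 2))))
```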

If the block is rotated counterclockwise by \tau = 2\pi l / A radians (l = 0, 1, \ldots, A-1) around its center, the rotated block, denoted I_\tau, can be written in polar coordinates around the block center as

I_\tau(\rho, \phi) = I(\rho, \phi - \tau).   (2)

Through simple deduction, the relationship between f_\tau(i) and f(i) is a circular shift of the bin indices:

f_\tau(i) = f((i - l) \bmod A).   (3)

Clearly f_\tau(i) and f(i) are not the same, but after a 1-D discrete Fourier transform their similarity is easily observed. Applying the DFT to f(i) and f_\tau(i), we obtain

F(u) = \frac{1}{A} \sum_{i=0}^{A-1} f(i)\, e^{-j 2\pi u i / A},   (4)

F_\tau(u) = \frac{1}{A} \sum_{i=0}^{A-1} f_\tau(i)\, e^{-j 2\pi u i / A}   (5)

         = \frac{1}{A} \sum_{i=0}^{A-1} f((i - l) \bmod A)\, e^{-j 2\pi u i / A}   (6)

         = \frac{1}{A} \sum_{m=0}^{A-1} f(m)\, e^{-j 2\pi u (m + l) / A}   (7)

         = e^{-j 2\pi u l / A}\, F(u).   (8)

According to equation (8), the DFT of the rotated feature vector equals that of the original one multiplied by a unit-magnitude phase factor; in particular, the magnitudes are identical, i.e., |F_\tau(u)| = |F(u)|. Therefore, we use the magnitudes of the 1-D DFT coefficients to achieve rotational invariance, so that regardless of the angular position of the structural details, the same structures always yield the same feature vectors. Figure 7 shows a simple example of this transformation: (a) shows two identical spatial structures at different positions in two image blocks; without the DFT, the two ARP feature vectors shown in (b) are visibly distinct; after a 1-D DFT and taking the magnitudes of the coefficients (Figure 7(c)), the two feature vectors are virtually the same, demonstrating that our descriptor is position invariant.

Figure 7: An example of rotational invariance: (a) the same structure at different locations in two image blocks; (b) feature vectors without DFT; (c) feature vectors with DFT.

3. FEATURE EXTRACTION AND TEST SETTINGS

3.1. Image Normalization

Since the algorithm is based on the spatial structures within scene images, we consider only the luminance component, computed as the mean of the R, G and B channels. To ensure comparability, all images are resized to a resolution of 256×256 using bilinear interpolation; the aspect ratio of each image is therefore ignored. This is in line with the experimental setup used by Oliva et al. in their implementation.

3.2. Parameter Settings for Feature Extraction

The parameters for image pre-processing (whitening and local contrast normalization) are kept the same as in the original Gist, and so are the Gabor filter parameters: the images are filtered at 4 scales, with 8 orientation channels per scale. For the original Gist descriptor, each image is divided into N×N blocks (N = 4, 8) and the average is taken in each block, so the total dimension of the feature vector is 4 × 8 × N × N = 32N². ARP is applied to each block of a 4-by-4 grid, with the number of angular partitions A set to 3, 4, 5 and 6, respectively, to evaluate the performance of ARP Gist. In each angular bin we take the average value as the feature of that bin, resulting in a feature vector of size 4 × 8 × 4 × 4 × A = 512A.
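The normalization step of Sec. 3.1 is straightforward to reproduce; below is a minimal sketch using Pillow, with an illustrative function name.

```python
import numpy as np
from PIL import Image

def load_luminance(path, size=256):
    # Sec. 3.1: luminance as the mean of the R, G, B channels, resized
    # to size x size by bilinear interpolation (aspect ratio discarded).
    img = Image.open(path).convert('RGB').resize((size, size),
                                                 Image.BILINEAR)
    return np.asarray(img, dtype=np.float64).mean(axis=2)
```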

3.3. Classifier Training

We randomly select 100 images from each category for training and use the rest for testing. SVM training and testing are conducted 1000 times so that the reported results generalize, and the comparison between the proposed algorithm and the original Gist descriptor is based on the same 1000 training/testing splits. In our experiments we use the Gaussian radial basis function as the kernel to build one-versus-all classifiers:

K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right).   (9)

The scaling factor \gamma in equation (9) is defined in our experiments as

\gamma = \frac{1}{p f},   (10)

where p is the kernel parameter, set to 0.003 in all our experiments, and f is the number of dimensions of the feature vector. The confusion matrix of every training/testing split is recorded during each run. The final classification accuracy is the average over the runs of the mean of the confusion matrix diagonal.
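A minimal scikit-learn sketch of this classifier setup follows; the \gamma = 1/(pf) rule mirrors Eqs. (9)-(10) as reconstructed above, and all other SVC parameters are left at defaults as an assumption.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_scene_classifier(X_train, y_train, p=0.003):
    # Sec. 3.3: one-versus-all SVMs with a Gaussian RBF kernel (Eq. (9))
    # and gamma = 1 / (p * f), where f is the feature dimension (Eq. (10)).
    gamma = 1.0 / (p * X_train.shape[1])
    clf = OneVsRestClassifier(SVC(kernel='rbf', gamma=gamma))
    return clf.fit(X_train, y_train)
```

Each of the 1000 runs would draw 100 random training images per category, evaluate on the remainder, and accumulate the per-run confusion matrix.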

4. EXPERIMENTAL RESULTS

In this section we report the performance of ARP Gist under different configurations, in comparison with the original Gist descriptor with varying values of N, on two publicly available datasets: the MIT spatial envelope dataset [10] and the UIUC 15 scene category dataset [7]. The reason for varying N in the original Gist is to ensure that any gain in the performance of the proposed method is not simply a result of finer partitioning.

4.1. On the Spatial Envelope Dataset

The MIT spatial envelope dataset is the testbed of the original Gist descriptor. It consists of 8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. There are 2688 color images in total, with approximately 300 images per category; each image has the same resolution of 256×256. Figure 8 shows sample images from the dataset, one from each category.

Figure 8: Sample images from the MIT spatial envelope dataset.

Table 1: Comparison of classification accuracy (%) on the MIT dataset.

  Method          Configuration   Classification Accuracy
  Original Gist   N=4             83.2661 ± 0.7757
  Original Gist   N=8             83.2664 ± 0.7417
  ARP Gist        A=2             84.5626 ± 0.7358
  ARP Gist        A=3             84.7671 ± 0.7141
  ARP Gist        A=4             84.6186 ± 0.7097
  ARP Gist        A=5             84.2832 ± 0.6986
  ARP Gist        A=6             83.6655 ± 0.7177

As the results summarized in Table 1 indicate, the average classification accuracy obtained by the original Gist descriptor (N=4) is 83.2661%, with a standard deviation of 0.7757. This is slightly lower than the 83.7% reported by Oliva et al. because of different training configurations: we average over 1000 different training/testing splits. The proposed ARP Gist improves on the original, with the best configuration (A=3) yielding a classification accuracy of 84.7671%. To validate the improvement, we also tested the original Gist on an 8-by-8 grid (N=8), which produces a 2048-dimensional feature vector (equal in size to ARP Gist with A=4) and yields 83.2664% accuracy, i.e., a negligible improvement over the 4-by-4 grid. As Table 1 shows, the proposed algorithm is superior in terms of classification accuracy.

Figure 9: Sample images from the additional categories of the UIUC 15 scene category dataset.

4.2. On the 15 Scene Category Dataset

The UIUC 15 scene category dataset is an extension of the original spatial envelope dataset. It contains not only all the outdoor scene images shown in Figure 8 (presented in grayscale), but also additional indoor and outdoor categories: suburb, office, kitchen, living room, store, bedroom and industrial. Most of the additional images are grayscale, and their resolution and aspect ratio vary both within and across categories. Figure 9 shows sample images from the additional categories.

Table 2: Comparison of classification accuracy (%) on the 15 scene category dataset.

  Method          Configuration   Classification Accuracy
  Original Gist   N=4             72.6739 ± 0.7133
  Original Gist   N=8             72.4312 ± 0.7068
  ARP Gist        A=2             74.4612 ± 0.6864
  ARP Gist        A=3             75.0379 ± 0.6811
  ARP Gist        A=4             75.2474 ± 0.6717
  ARP Gist        A=5             74.8499 ± 0.6713
  ARP Gist        A=6             74.2130 ± 0.6777
  BOF             M=200           72.2 ± 0.6
  BOF             M=400           74.8 ± 0.3

On this dataset, the classification accuracy achieved by the original Gist is only 72.6739%, with a standard deviation of 0.7133. In contrast to the previous dataset, the over-partitioned original Gist (8-by-8 grid) suffers a slight accuracy erosion, with a classification rate of 72.4312%. The proposed ARP Gist, on the other hand, yields classification rates above 74%. The best result (75.2474%) is obtained when the number of angular partitions is set to 4; note that the feature vector dimension in this configuration is the same as that of the 8-by-8 original Gist. To show the significance of the improvement obtained by ARP Gist, Table 2 also lists the classification rates of the BOF algorithm [6]. The BOF feature is based on image patches on a densely sampled grid, without the usual interest point detection step; the SIFT descriptor is computed on each patch. The experiment is conducted with vocabulary sizes of 200 (M=200) and 400 (M=400). Evidently, even without building a codebook, and thus saving significant computational cost, ARP Gist remains superior to the BOF model. (Note that the images used in the BOF model are not normalized to 256×256 resolution; if normalized, the model suffers significant accuracy degradation [11].)

5. CONCLUSION

This paper presents a novel approach to scene representation. Built on the original Gist descriptor, the proposed ARP Gist descriptor exploits the effectiveness of angular partitioning to capture the finer details of scene images. By applying the DFT and taking the magnitudes of its coefficients, ARP Gist achieves positional invariance of scene structures within each rectangular block. The proposed method not only preserves the rough spatial layout, but also provides flexibility within each block, achieving a balance between spatial constraint and freedom. Experiments on two datasets show that the proposed method is superior to the original Gist and rivals the state-of-the-art BOF model in terms of classification accuracy and computational cost.

6. REFERENCES

[1] S. Belongie and J. Malik, "Shape Matching and Object Recognition Using Shape Contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509-522, 2002.
[2] A. Chalechale, A. Mertins and G. Naghdy, "Edge Image Description Using Angular Radial Partitioning," IEE Proceedings - Vision, Image and Signal Processing, 151(2):93-101, 2004.
[3] A. Chalechale, G. Naghdy and A. Mertins, "Sketch-based Image Matching Using Angular Partitioning," IEEE Transactions on Systems, Man, and Cybernetics, 35(1):28-41, 2005.
[4] M. R. Greene and A. Oliva, "Recognition of Natural Scenes from Global Properties: Seeing the Forest without Representing the Trees," Cognitive Psychology, 58(2):137-179, 2009.
[5] M. R. Greene and A. Oliva, "The Briefest of Glances: The Time Course of Natural Scene Understanding," Psychological Science, 20:464-472, 2009.
[6] S. Lazebnik, C. Schmid and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," IEEE Conference on Computer Vision and Pattern Recognition, 2:2169-2178, 2006.
[7] L. Fei-Fei and P. Perona, "A Bayesian Hierarchical Model for Learning Natural Scene Categories," IEEE Conference on Computer Vision and Pattern Recognition, 2:524-531, 2005.
[8] D. G. Lowe, "Distinctive Image Features from Scale-invariant Keypoints," International Journal of Computer Vision, 60(2):91-110, 2004.
[9] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir and L. Van Gool, "A Comparison of Affine Region Detectors," International Journal of Computer Vision, 65(1-2):43-72, 2005.
[10] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," International Journal of Computer Vision, 42(3):145-175, 2001.
[11] A. Quattoni and A. Torralba, "Recognizing Indoor Scenes," IEEE Conference on Computer Vision and Pattern Recognition, 413-420, 2009.
[12] C. Siagian and L. Itti, "Rapid Biologically-inspired Scene Classification Using Features Shared with Visual Attention," IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):300-312, 2007.
[13] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos," International Conference on Computer Vision, 2:1470-1477, 2003.