Mutual Information Based Codebooks Construction for Natural Scene Categorization


Chinese Journal of Electronics, Vol.20, No.3, July 2011

XIE Wenjie, XU De, TANG Yingjun, LIU Shuoyan and FENG Songhe
(Institute of Computer Science and Engineering, Beijing Jiaotong University, Beijing, China)

Manuscript received Apr. 2010; accepted Mar. This work is supported by the National Natural Science Foundation of China, the Fundamental Research Funds for the Central Universities (No.2009JBM024) and the China Postdoctoral Science Foundation.

Abstract  The codebook is an intermediate-level representation that has proven to be very powerful for addressing scene categorization problems. However, in most scene categorization methods a scene is characterized by a single histogram based on a sole universal codebook, which lacks sufficient discriminative ability to separate similar images of different categories and results in low classification accuracy. To solve this problem, we propose a novel scene categorization method that constructs class-specific codebooks based on a feature selection strategy. Specifically, the feature selection measure mutual information is adopted to estimate each visual word's contribution to each category and to construct class-specific codebooks. An image is then characterized by a set of combined histograms (one histogram per category), each of which is generated by concatenating the traditional histogram based on the universal codebook and the class-specific histogram grounded on the class-specific codebook with an adaptive weighting coefficient. The combined histogram provides a useful cue for overcoming the similarity of inter-class images. The proposed method is thoroughly evaluated on three well-known scene classification datasets, and experimental results show that it outperforms state-of-the-art approaches.

Key words  Scene categorization, Mutual information, Combined histogram, Class-specific codebook, Adaptive weight.

I. Introduction

Scene classification is an important problem in computer vision and has received considerable attention in recent years. Automatic methods for associating images with semantic labels have a high potential for improving the performance of other computer vision applications such as image browsing/retrieval [1-3], intelligent vehicle/robot navigation [4,5] and object recognition [6,7]. Since a scene composed of several entities is often organized in an unpredictable layout, as shown in Fig.1, scene classification is much more difficult than conventional object classification and remains a challenging problem.

Early work on scene categorization used low-level global features extracted from the whole image to classify images into a small number of categories [8-11]. The basic idea is to consider the image as a whole and extract low-level features, including color, texture, edge response, gradient, etc., to characterize the scene. This may be sufficient for separating scenes with significant differences in their global properties. However, if images of different categories (e.g. office vs. living room) have similar low-level global features, the global features may not be discriminative enough. Recently, the bag-of-words representation has received wide attention [12-14]; it is illustrated in Fig.2. Instead of taking the whole image as an entity, it uses feature points to model a scene as a collection of points labeled by a codebook, which is constructed by quantizing the points' local invariant features. The codebook provides a middle-level representation that helps to bridge the semantic gap between the low-level visual features extracted from an image and the high-level semantic concept to be categorized. Recent work has shown that the bag-of-words model is well suited to scene classification and achieves impressive levels of performance [12].

Fig. 1. Typical scene image containing diverse entities

This image representation using feature points is analogous to the bag-of-words representation of text documents in terms of form and semantics. However, the main difference between scene categorization and text categorization is that there is no given codebook for scene categorization: it has to be learnt from training images. The codebook in the bag-of-words model may be constructed in various ways. Sivic et al. [15] originally proposed to cluster the low-level visual features with the K-means algorithm to construct the codebook, where each centroid corresponds to a visual word. When building a histogram, each feature vector is assigned to its closest centroid. The codebook describes an image as a bag of discrete visual words, and the frequency distribution of visual words in an image allows classification. W.H. Hsu et al. [16] implemented visual cue cluster construction via the information bottleneck principle and kernel density estimation to automatically discover adequate mid-level features and obtain a more discriminative codebook.

F. Perronnin et al. [17] proposed to use a Gaussian mixture model to perform clustering and create specifically tuned codebooks for each image category. Gemert et al. [18] improved the codebook model by introducing uncertainty modeling, where the uncertainty is handled with techniques based on kernel density estimation. However, owing to its uncertainty and complexity, how to construct a reasonable and effective codebook is still a difficult problem.

Fig. 2. The flowchart of the bag-of-words model

Generally, an image is described by a single histogram under the bag-of-words model. This traditional histogram is generated from the sole universal codebook constructed by considering images of all categories, which means the histogram contains much redundant information and has limited discriminative ability to separate similar images of different categories. In this paper, we propose a novel framework that employs the feature selection measure Mutual information (MI) to build class-specific codebooks (one codebook per category) within the bag-of-words model. For a given category, MI can be used to estimate each visual word's classification contribution to that category, and the visual words with higher contribution are selected to form the class-specific codebook. A class-specific histogram is then generated by removing the corresponding bins of the traditional histogram according to the class-specific codebook. Finally, the traditional histogram based on the universal codebook and the class-specific histogram grounded on the class-specific codebook are concatenated with the proposed adaptive weighting coefficient. As a result, an image is represented by a set of combined histograms (one histogram per category), which not only retains the traditional histogram's discriminative ability but also improves it by separating, for each class, the relevant information from the irrelevant information. Experiments on three datasets containing 8, 13 and 15 scene categories show that our proposed method outperforms the state-of-the-art approaches.

The paper is organized as follows. Section II explains the proposed algorithm in detail. Section III presents the experiments and results. Conclusions are drawn in Section IV.

II. The Proposed Approach

1. Problem formulation

The scene categorization problem based on the bag-of-words representation can be formulated as follows: given an image I \in R^{m \times n} and a set of scene categories c = {c_1, c_2, ..., c_u}, where u is the number of image categories, the image I is first represented with a universal codebook V consisting of a set of visual words t = {t_1, t_2, ..., t_K}. This representation is denoted by R(I); it is a vector r = R(I), r \in R^K, that indicates the distribution or the presence of the visual words. The problem then becomes that of finding a projection

f : R(I) \rightarrow c    (1)

which maps the visual-word representation of the image to the scene category c_i, i = 1, ..., u, to which it belongs.

2. Overall framework

In this section, we introduce the framework for constructing class-specific codebooks based on the feature selection method. The overall framework is depicted in Fig.3.

Fig. 3. Framework of the proposed method
Firstly, for visual word creation, input images are decomposed into multiple layers (four layers in our experiments). The first layer is the original image, and each lower layer is obtained by taking every second pixel in each row and column of the layer above it. The dense Scale invariant feature transform (SIFT) [19] is adopted to describe the feature points on each layer of the image.

Secondly, the K-means algorithm is run over all training images to generate the universal codebook V. In the process of generating the traditional histogram H_t, there is an inherent weakness of the codebook model, namely the hard assignment of discrete visual words to continuous image features. To alleviate this problem, the soft-assignment method [13] is adopted to produce the traditional histogram. For each feature point in an image, instead of searching only for the nearest visual word, the top-N nearest visual words are selected to form H_t with appropriate weights. That is, given a universal codebook V, we use a K-dimensional vector W = {w_1, ..., w_k, ..., w_K} with each component w_k representing the weight of visual word k in an image, such that

w_k = \sum_{i=1}^{N} \sum_{j=1}^{M_i} \frac{1}{2^{i-1}} \, \mathrm{sim}(j, k)    (2)

where M_i is the number of feature points whose i-th nearest neighbor is visual word k, and sim(j, k) is the similarity between feature point j and visual word k. Note that in Eq.(2) the contribution of a feature point depends on its similarity to word k, weighted by 1/2^{i-1} when word k is its i-th nearest neighbor. In our experiments we empirically find N = 4 to be a reasonable setting. A small sketch of this soft-assignment weighting is given below.
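The following Python/NumPy sketch illustrates the weighting of Eq.(2). It is not the authors' implementation: the exact form of sim(j, k) is not specified here, so a Gaussian of the Euclidean distance is assumed, and the descriptor and codebook arrays are placeholders.

```python
import numpy as np

def soft_assign_histogram(descriptors, codebook, n_neighbors=4, sigma=100.0):
    """Soft-assignment histogram of Eq.(2): each descriptor votes for its
    top-N nearest visual words with weight sim(j, k) / 2**(i-1), where i is
    the rank of word k among that descriptor's nearest neighbours."""
    K = codebook.shape[0]
    hist = np.zeros(K)
    # Euclidean distances between every descriptor and every visual word
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    for d in dists:                                   # one descriptor at a time
        ranked = np.argsort(d)[:n_neighbors]          # its top-N nearest words
        for i, k in enumerate(ranked, start=1):       # i = 1 is the nearest word
            sim = np.exp(-d[k] ** 2 / (2.0 * sigma ** 2))   # assumed sim(j, k)
            hist[k] += sim / 2 ** (i - 1)             # Eq.(2) weighting
    return hist
```

With N = 4, as in the paper, a feature point therefore contributes factors of 1, 1/2, 1/4 and 1/8 of its similarity to its first, second, third and fourth nearest visual words.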

Thirdly, given a category c, the feature selection method is used to estimate each visual word's contribution to the classification of category c, and the visual words are sorted according to this contribution. Visual words with lower contribution are removed, and the remaining visual words are used to construct the codebook V_s for category c.

Finally, class-specific histograms H_s are generated by removing the corresponding bins of the traditional histogram according to the class-specific codebooks V_s, and the traditional histogram H_t and the class-specific histogram H_s are combined with an adaptive weighting coefficient. As a result, an image is characterized by a set of combined histograms H_c (one per category), each composed of the traditional histogram H_t and a class-specific histogram H_s. The combined histogram H_c can describe whether the image content is better modeled by the traditional codebook or by the corresponding class-specific codebook. To classify the combined histograms H_c, we use Support vector machine (SVM) classifiers (one SVM per category), each trained in a one-vs-all manner.

3. Feature selection

Feature selection is a process that chooses a subset of the original feature set according to some criterion. The selected features retain their original physical meaning and provide a better understanding of the data and the learning process. Feature selection methods have been successfully applied to text categorization [20], and preliminary investigations show that feature selection has a significant influence on bag-of-words image representations [14]. In this paper it is adopted to construct codebooks for each category in scene classification.

Since Mutual information (MI) measures the dependence between two random variables, it serves as our feature selection method. The MI between a visual word t and a category c is defined as

MI(t, c) = \sum_{i=1}^{u} P(t, c_i) \log \frac{P(t, c_i)}{P(t) P(c_i)}    (3)

where u is the number of image categories, P(t, c_i) is the joint probability of visual word t and the images belonging to category c_i, P(t) is the probability of visual word t, and P(c_i) is the probability of the images belonging to category c_i. As Eq.(3) shows, a higher MI value of a visual word for a specific class means a stronger relationship between this visual word and that class. Therefore, to enhance the discriminative ability of the histogram, it is reasonable to choose the visual words with higher MI values to form the class-specific codebook. In this way, a set of class-specific codebooks, one per category, can be constructed. Class-specific histograms are then generated by removing the corresponding bins of the traditional histogram according to the class-specific codebooks, as sketched below.
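As a structural illustration of this selection step (not the authors' code), the following Python/NumPy sketch builds the class-specific codebooks from a matrix of per-class MI scores, however they are estimated (Eq.(3) here, or the one-vs-rest form of Eq.(7) used in the experiments), and removes the corresponding bins from the traditional histogram; the codebook size n_keep is assumed to be given.

```python
import numpy as np

def build_class_codebooks(mi, n_keep):
    """mi: (n_classes, K) array holding the MI value of every visual word for
    every category. Returns, per class, the indices of the retained words."""
    return [np.argsort(row)[::-1][:n_keep] for row in mi]   # highest MI first

def class_specific_histogram(h_traditional, keep_idx):
    """H_s: the traditional histogram with the bins of discarded words removed."""
    return h_traditional[keep_idx]

# e.g. 15 categories, a 1100-word universal codebook and 800 words kept per class:
# codebooks = build_class_codebooks(mi, n_keep=800)
# h_s_list  = [class_specific_histogram(h_t, idx) for idx in codebooks]
```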
4. Combined histogram

After the universal histogram and the class-specific histograms are generated, for each class the universal histogram and the class-specific histogram are concatenated with an adaptive weighting coefficient obtained as follows:

H_c = [w_1 H_s;\; w_2 H_t]    (4)

w_1 = \frac{\mathrm{sum}^2(H_s)}{\mathrm{sum}^2(H_t) + \mathrm{sum}^2(H_s)}    (5)

w_2 = 1 - w_1    (6)

where w_1 is the weighting coefficient for the class-specific histogram H_s, w_2 is the weighting coefficient for the traditional histogram H_t, and sum(H) denotes the sum over all bins of histogram H; that is, each histogram corresponds to a multidimensional vector, and sum(H) equals the sum of the values of its elements. The combination process is shown in Fig.4.

In the traditional bag-of-words model, the histogram based on the universal codebook has only limited discriminative ability for the classification task. However, since the histogram grounded on the class-specific codebook is customized to a specific class and consists of visual words that carry plentiful visual information of that class, the class-specific histogram has higher discriminative ability for separating images of this class from images of other classes. Taking the office image in Fig.4 as an example, a histogram based only on the universal codebook can hardly distinguish it from a kitchen image because of the strong similarity between these two categories. The office-class codebook, however, consists of the visual words that are most strongly related to the office category, and the histogram grounded on the office-class codebook provides a useful cue for separating office images from the other categories. Furthermore, the weight is obtained adaptively from the bin sums of the traditional histogram and the office-class histogram, as shown in Eqs.(4)-(6). For images belonging to the office category, which contain abundant visual words carrying plentiful visual information of the office class, the office-class histogram retains many visual words and its weight takes a relatively large value; this means the office-class histogram gains a bigger share in describing the image than the traditional histogram, and vice versa. As a result, the combined histogram amplifies the difference between the office category and the other categories, and integrates the discriminative ability of the traditional histogram with that of the office-class histogram to obtain better classification results. A small sketch of this combination is given below.

Fig. 4. The construction of the combined histogram
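The adaptive combination of Eqs.(4)-(6) is compact enough to state directly as code. This is a minimal Python/NumPy sketch under the paper's definitions, not the authors' implementation.

```python
import numpy as np

def combined_histogram(h_s, h_t):
    """Combined histogram H_c of Eqs.(4)-(6): concatenate the class-specific
    histogram H_s and the traditional histogram H_t with adaptive weights."""
    s_s, s_t = h_s.sum(), h_t.sum()              # sum(H): sum over all bins
    w1 = s_s ** 2 / (s_s ** 2 + s_t ** 2)        # Eq.(5)
    w2 = 1.0 - w1                                # Eq.(6)
    return np.concatenate([w1 * h_s, w2 * h_t])  # Eq.(4)

# one combined histogram per category for a given image:
# h_c_list = [combined_histogram(h_s, h_t) for h_s in h_s_list]
```

Because w_1 grows with the mass retained by the class-specific histogram, images of class c push more weight onto the c-class histogram, which is exactly the competition described above.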

III. Experimental Results

This section reports the experimental setup and the results of the proposed method. Performance is evaluated on three datasets that have been widely used in previous work [21-24]. A brief introduction to the three datasets is given below.

Dataset 1 consists of 2688 images from 8 categories: 360 coast, 328 forest, 274 mountain, 410 open country, 260 highway, 308 inside city, 356 tall buildings, and 292 streets.

Dataset 2 contains 3759 images from 13 categories. It extends Dataset 1 with 5 new scene categories: 216 bedroom, 210 kitchen, 289 living room, 215 office, and 241 suburb.

Dataset 3 contains 4485 images from 15 categories. It further extends Dataset 2 with two new scene categories: 311 industrial and 315 store. To the best of our knowledge, this is the largest published dataset for scene categorization; example images are shown in Fig.5.

Fig. 5. Example images from Dataset 3. The variations in the content of the images within the same scene category can be observed

To remove the effect of color information, the gray versions of the images are used in our experiments. Each scene category is divided randomly into two separate sets of images: 100 images for training and the rest for testing. Experiments are run with Matlab 7.0 on a computer with a Pentium 4 3.0GHz processor. In each image, the dense SIFT feature [19] is computed over a regular grid of fixed-size patches; the grid begins at the top-left corner of the image and shifts 8 pixels at a time to the right or downward.

We first discuss how the classification performance is affected by the size of the codebook in the traditional bag-of-words model. As shown in previous work [12,14,21], a codebook of appropriate size is needed for a given image dataset. If the codebook is too large, each part of the image will match a single, unique visual word, which defeats the purpose of a codebook. On the other hand, if the codebook is too small, several different visual features will be represented by the same visual word. Thus, the size of the codebook strongly influences the generalization ability and the discriminative power of the method. The performance variations with different codebook sizes on the three datasets are shown in Fig.6. In the experiments below, the codebook sizes that obtain the highest accuracy are adopted to generate the traditional histogram H_t, i.e. 700 visual words for Dataset 1, 700 visual words for Dataset 2 and 1100 visual words for Dataset 3.

Fig. 6. Performance variation with different sizes of codebook on the three data sets

The main contribution of this paper is an investigation of using a feature selection method to construct class-specific codebooks. Mutual information (MI) is adopted to measure each visual word's contribution to class-specific classification. To estimate the contribution of a visual word t to a class c, we consider the images of class c as one group and the remaining images as the other group. The MI value between visual word t and class c is then defined as

MI(t, c) = P(t, c) \log \frac{P(t, c)}{P(t) P(c)} + P(t, \bar{c}) \log \frac{P(t, \bar{c})}{P(t) P(\bar{c})}    (7)

where P(t, c) is the joint probability of visual word t and the images belonging to category c, P(t) is the probability of visual word t, P(c) is the probability of the images belonging to category c, P(t, \bar{c}) is the joint probability of visual word t and the images not belonging to category c, and P(\bar{c}) is the probability of the images not belonging to category c. All visual words are then sorted according to their MI values, and the visual words with higher MI values are selected to construct the class-specific codebooks. A sketch of this one-vs-rest MI computation follows.
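The probabilities in Eq.(7) are not given an explicit estimator in the text; a natural assumption is to estimate them from binary word occurrence over the training images, as in the following illustrative Python/NumPy sketch.

```python
import numpy as np

def binary_mi(word_presence, positive):
    """One-vs-rest mutual information of Eq.(7) for every visual word.

    word_presence: (n_images, K) boolean array, True if the word occurs in the
                   image (assumed probability estimator; not stated in the paper)
    positive:      (n_images,) boolean array, True for training images of class c
    Returns a (K,) array with MI(t, c) for every word t."""
    p_t = word_presence.mean(axis=0)                            # P(t)
    p_c = positive.mean()                                       # P(c)
    p_tc = (word_presence & positive[:, None]).mean(axis=0)     # P(t, c)
    p_tnc = (word_presence & ~positive[:, None]).mean(axis=0)   # P(t, c-bar)
    mi = np.zeros(word_presence.shape[1])
    nz = p_tc > 0
    mi[nz] += p_tc[nz] * np.log(p_tc[nz] / (p_t[nz] * p_c))
    nz = p_tnc > 0
    mi[nz] += p_tnc[nz] * np.log(p_tnc[nz] / (p_t[nz] * (1.0 - p_c)))
    return mi

# e.g. ranking the coast category's words and keeping the top-scoring ones:
# scores = binary_mi(presence, labels == coast_id)
# keep   = np.argsort(scores)[::-1][:800]
```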
As described in Section II.4, the combined histogram H_c is generated by linearly concatenating the class-specific histogram H_s and the traditional histogram H_t, so the size of the class-specific codebook is another important parameter. Given the size of the traditional histogram that yields the highest accuracy, the performance variations with different sizes of class-specific histograms on the three datasets are presented in Fig.7, which shows that the highest accuracy is obtained when the sizes of the class-specific codebooks are set to 500 for Dataset 1, 600 for Dataset 2 and 800 for Dataset 3.

Fig. 7. Performance variation with different sizes of class-specific codebooks on the three data sets. The size of traditional histogram which generates the highest accuracy is selected

Taking the coast category in Dataset 3 as an example, we consider the coast images as positive examples and the images of the other 14 categories as negative examples. According to the definition of Eq.(7), an MI value can be obtained for each visual word of the universal codebook V. The 1100 visual words (the size of the traditional histogram that generated the highest accuracy) are sorted by their MI values. Then, according to Fig.7, the 300 visual words with the lowest MI values are removed and the remaining 800 visual words compose the codebook for the coast category. For images of all categories, the coast-category histogram is then generated by removing the corresponding bins of the traditional histogram according to the coast codebook. In this way, 15 class codebooks are constructed, and an image is described by one traditional 1100-bin histogram and 15 class-specific 800-bin histograms.

Additionally, we propose a practical approach to combining the traditional histogram and the class-specific histogram. The two kinds of histograms are linearly concatenated with the adaptive weight defined in Eqs.(4)-(6). With this weighting scheme, the traditional histogram and the class-specific histogram compete to characterize an image. If an image belongs to class c, the sum of the bins of the c-class histogram will be relatively high and the c-class histogram will be assigned a higher weight, which means the c-class histogram obtains more influence in describing images of this category. This weighting approach can therefore amplify the diversity of images of different categories.

To verify the adaptive weight acquisition approach, we compare the combination with a fixed weight against the combination with our proposed adaptive weight. The experimental results on the three datasets are shown in Table 1. On all three datasets the adaptive weight method outperforms the fixed weight method, by 3.262%, 3.529% and 4.073% respectively.

Table 1. Comparison between the adaptive weight method and the fixed weight method
                   Dataset 1    Dataset 2    Dataset 3
  Adaptive weight
  Fixed weight
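For completeness, the classification stage used throughout these experiments (one SVM per category over the combined histograms, as stated in Section II.2) can be sketched as follows. The paper only says that one SVM is trained per category in a one-vs-all manner; the use of scikit-learn's LinearSVC and the arg-max decision rule below are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC   # assumed stand-in; the paper only says "SVM"

def train_one_vs_all(h_combined, labels, n_classes):
    """Train one SVM per category.
    h_combined[c]: (n_images, D_c) matrix of combined histograms H_c built
    with the class-c codebook, one row per training image."""
    svms = []
    for c in range(n_classes):
        clf = LinearSVC()
        clf.fit(h_combined[c], (labels == c).astype(int))   # class c vs. the rest
        svms.append(clf)
    return svms

def predict(svms, h_combined_test):
    """Assumed decision rule: score each test image with every per-class SVM on
    the corresponding combined histogram and pick the largest decision value."""
    scores = np.stack([clf.decision_function(h)
                       for clf, h in zip(svms, h_combined_test)], axis=1)
    return np.argmax(scores, axis=1)
```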

Finally, extended experiments are carried out to compare the proposed method with several representative scene classification methods, including the gist feature based method [25,26], the probabilistic Latent semantic analysis (pLSA) model [23], and the Spatial pyramid model (SPM) [22], one of the best scene classification methods. The implementations of these methods for comparison are as follows.

GIST: the gist feature is implemented with the code provided by Oliva and Torralba. Four scale levels (1:256, 1:128, 1:64, 1:32) and four orientations (0°, 45°, 90°, 135°) are used. An SVM classifier with a linear kernel is used for classification.

pLSA model: the number of topics is set to 25 and the number of neighbors for the nearest-neighbor classifier is set to 10.

SPM: each image is segmented into 1×1, 2×2 and 3×3 grids of patches, and the histograms from the different segmentations are concatenated to form a high-dimensional vector.

Table 2. Performance comparison between different methods
              Our method    Gist      pLSA     SPM
  Dataset 1                77.79%    82.5%    88.19%
  Dataset 2                72.11%    74.3%    84.40%
  Dataset 3                67.85%    72.7%    83.30%

From Table 2, we can conclude that our proposed method outperforms the pLSA model and the SPM model by 8.258% and 2.568% respectively on Dataset 1, outperforms SPM by 4.725% on Dataset 2 and by 5.621% on Dataset 3, and also outperforms the gist feature based method on all three datasets. Although the Spatial pyramid method generates competitive accuracy on the different datasets, it makes use of absolute spatial information and lacks robustness with respect to partial occlusion, clutter, and changes in viewpoint and illumination. Besides, a large codebook or excessive segmentation may lead to the curse of dimensionality.

IV. Conclusions

In this paper, we propose a novel and practical framework for scene categorization, in which class-specific codebooks are constructed using the feature selection measure mutual information. According to the contribution of visual words to classification, the universal codebook is tailored to form a class-specific codebook for each category. An image is then characterized by a set of combined histograms (one histogram per category), each generated by concatenating the traditional histogram based on the universal codebook and the class-specific histogram grounded on the class-specific codebook. Additionally, we propose a practical adaptive weighting method that leads to competition between the traditional histogram and the class-specific histogram. For an image of a given category, the class-specific histogram obtains more influence in describing the image under the proposed weighting method. In this way, the proposed method provides more effective information to overcome the similarity of images of different categories and improves categorization performance. Finally, a comparative study of the proposed method with three state-of-the-art scene classification algorithms, i.e. the gist feature based method, the probabilistic Latent semantic analysis model and the Spatial pyramid model, shows the superiority of the proposed method.

References

[1] J.Z. Wang, L. Jia, G. Wiederhold, Simplicity: semantics-sensitive integrated matching for picture libraries, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.23, No.9.
[2] E. Chang, G. Kingshy, G. Sychay, W. Gang, Cbsa: content-based soft annotation for multimodal image retrieval using Bayes point machines, IEEE Transactions on Circuits and Systems for Video Technology, Vol.13, No.1, pp.26-38.
[3] A. Vailaya, M. Figueiredo, A. Jain, H.J. Zhang, Content-based hierarchical classification of vacation images, Proc. of IEEE International Conference on Multimedia Computing and Systems, Florence, Italy, Vol.1.
[4] C. Siagian, L. Itti, Gist: a mobile robotics application of context-based vision in outdoor environment, Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, California, Vol.3.
[5] R. Manduchi, A. Castano, A. Talukder, L. Matthies, Obstacle detection and terrain classification for autonomous off-road navigation, Autonomous Robots, Vol.18, No.1.
[6] A. Torralba, Contextual priming for object detection, International Journal of Computer Vision, Vol.53, No.2.
[7] A. Torralba, K.P. Murphy, W.T. Freeman, Contextual models for object detection using boosted random fields, Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA.
[8] A. Vailaya et al., On image classification: city vs landscapes, Pattern Recognition, Vol.31, No.12.
[9] A. Vailaya, M. Figueiredo, A. Jain, H. Zhang, Content-based hierarchical classification of vacation images, Proc. of IEEE International Conference on Multimedia Computing and Systems, Florence, Italy, Vol.1.
[10] A. Vailaya, A. Figueiredo, A. Jain, H. Zhang, Image classification for content-based indexing, IEEE Transactions on Image Processing, Vol.10.
[11] E. Chang, K. Goh, G. Sychay, G. Wu, Cbsa: Content-based soft annotation for multimodal image retrieval using bayes point machines, IEEE Transactions on Circuits and Systems for Video Technology Special Issue on Conceptual and Dynamical Aspects of Multimedia Content Description, Vol.13, No.1, pp.26-38.
[12] A. Bosch, X. Munoz and R. Martí, Which is the best way to organize/classify images by content?, Image and Vision Computing, Vol.25, No.6.
[13] Y.G. Jiang, C.W. Ngo and J. Yang, Towards optimal bag-of-features for object categorization and semantic video retrieval, Proc. of the 6th ACM International Conference on Image and Video Retrieval, New York, USA.
[14] J. Yang, Yugang Jiang et al., Evaluating bag-of-visual-words representations in scene classification, Proc. of the ACM International Workshop on Multimedia Information Retrieval, New York, USA.
[15] J.S. Sivic and A. Zisserman, Video google: A text retrieval approach to object matching in videos, Proc. of International Conference on Computer Vision, Nice, France, Vol.2.
[16] W.H. Hsu and S.F. Chang, Visual cue cluster construction via information bottleneck principle and kernel density estimation, Proc. of ACM Conference on Image and Video Retrieval, Singapore.
[17] F. Perronnin, Universal and adapted vocabularies for generic visual categorization, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.30, No.7.
[18] J.C. van Gemert, C.J. Veenman et al., Visual word ambiguity, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.32, No.7.
[19] David G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, Vol.60, No.2, 2004.

[20] Y. Yang and J. Pedersen, A comparative study on feature selection in text categorization, Proc. of 14th International Conference on Machine Learning.
[21] L. Feifei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, USA, Vol.2.
[22] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, USA, Vol.2.
[23] A. Bosch, A. Zisserman and X. Munoz, Scene classification using a hybrid generative/discriminative approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.30, No.4.
[24] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, International Journal of Computer Vision, Vol.42, No.3.
[25] C. Siagian, L. Itti, Gist: a mobile robotics application of context-based vision in outdoor environment, Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, California, Vol.3.
[26] C. Siagian, L. Itti, Rapid biologically-inspired scene classification using features shared with visual attention, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.29, No.2.

XIE Wenjie was born in Shandong Province, China. He is working towards the Ph.D. degree in the Institute of Computer Science and Engineering, Beijing Jiaotong University, Beijing. His current research interests include computer vision and pattern recognition. (Email: xiewenjiebj@126.com)

XU De was born in Jiangsu Province, China. He is now a professor in the Institute of Computer Science and Engineering, Beijing Jiaotong University, Beijing. His current research interests include database systems and multimedia processing. (Email: dxu@bjtu.edu.cn)

TANG Yingjun was born in Jiangxi Province, China. She is working towards the Ph.D. degree in the Institute of Computer Science and Engineering, Beijing Jiaotong University, Beijing. Her current research interests include computer vision and pattern recognition.

LIU Shuoyan was born in Shanxi Province, China. She is working towards the Ph.D. degree in the Institute of Computer Science and Engineering, Beijing Jiaotong University, Beijing. Her current research interests include computer vision and pattern recognition.

FENG Songhe was born in Jiangsu Province, China. He received the Ph.D. degree and is now an assistant professor in the Institute of Computer Science and Engineering, Beijing Jiaotong University, Beijing. His current research interests include image annotation and retrieval. (Email: songhe feng@163.com)


A Novel Algorithm for Color Image matching using Wavelet-SIFT International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 A Novel Algorithm for Color Image matching using Wavelet-SIFT Mupuri Prasanth Babu *, P. Ravi Shankar **

More information

Generic Face Alignment Using an Improved Active Shape Model

Generic Face Alignment Using an Improved Active Shape Model Generic Face Alignment Using an Improved Active Shape Model Liting Wang, Xiaoqing Ding, Chi Fang Electronic Engineering Department, Tsinghua University, Beijing, China {wanglt, dxq, fangchi} @ocrserv.ee.tsinghua.edu.cn

More information

Consistent Line Clusters for Building Recognition in CBIR

Consistent Line Clusters for Building Recognition in CBIR Consistent Line Clusters for Building Recognition in CBIR Yi Li and Linda G. Shapiro Department of Computer Science and Engineering University of Washington Seattle, WA 98195-250 shapiro,yi @cs.washington.edu

More information

Local Image Features

Local Image Features Local Image Features Ali Borji UWM Many slides from James Hayes, Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Overview of Keypoint Matching 1. Find a set of distinctive key- points A 1 A 2 A 3 B 3

More information

CS 231A Computer Vision (Fall 2011) Problem Set 4

CS 231A Computer Vision (Fall 2011) Problem Set 4 CS 231A Computer Vision (Fall 2011) Problem Set 4 Due: Nov. 30 th, 2011 (9:30am) 1 Part-based models for Object Recognition (50 points) One approach to object recognition is to use a deformable part-based

More information

TEXTURE CLASSIFICATION METHODS: A REVIEW

TEXTURE CLASSIFICATION METHODS: A REVIEW TEXTURE CLASSIFICATION METHODS: A REVIEW Ms. Sonal B. Bhandare Prof. Dr. S. M. Kamalapur M.E. Student Associate Professor Deparment of Computer Engineering, Deparment of Computer Engineering, K. K. Wagh

More information

Texton Clustering for Local Classification using Scene-Context Scale

Texton Clustering for Local Classification using Scene-Context Scale Texton Clustering for Local Classification using Scene-Context Scale Yousun Kang Tokyo Polytechnic University Atsugi, Kanakawa, Japan 243-0297 Email: yskang@cs.t-kougei.ac.jp Sugimoto Akihiro National

More information

SEMANTIC SEGMENTATION AS IMAGE REPRESENTATION FOR SCENE RECOGNITION. Ahmed Bassiouny, Motaz El-Saban. Microsoft Advanced Technology Labs, Cairo, Egypt

SEMANTIC SEGMENTATION AS IMAGE REPRESENTATION FOR SCENE RECOGNITION. Ahmed Bassiouny, Motaz El-Saban. Microsoft Advanced Technology Labs, Cairo, Egypt SEMANTIC SEGMENTATION AS IMAGE REPRESENTATION FOR SCENE RECOGNITION Ahmed Bassiouny, Motaz El-Saban Microsoft Advanced Technology Labs, Cairo, Egypt ABSTRACT We introduce a novel approach towards scene

More information

KNOWING Where am I has always being an important

KNOWING Where am I has always being an important CENTRIST: A VISUAL DESCRIPTOR FOR SCENE CATEGORIZATION 1 CENTRIST: A Visual Descriptor for Scene Categorization Jianxin Wu, Member, IEEE and James M. Rehg, Member, IEEE Abstract CENTRIST (CENsus TRansform

More information

TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation

TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, Cordelia Schmid LEAR team, INRIA Rhône-Alpes, Grenoble, France

More information

Evaluation of GIST descriptors for web scale image search

Evaluation of GIST descriptors for web scale image search Evaluation of GIST descriptors for web scale image search Matthijs Douze Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg and Cordelia Schmid INRIA Grenoble, France July 9, 2009 Evaluation of GIST for

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

arxiv: v1 [cs.lg] 20 Dec 2013

arxiv: v1 [cs.lg] 20 Dec 2013 Unsupervised Feature Learning by Deep Sparse Coding Yunlong He Koray Kavukcuoglu Yun Wang Arthur Szlam Yanjun Qi arxiv:1312.5783v1 [cs.lg] 20 Dec 2013 Abstract In this paper, we propose a new unsupervised

More information

Beyond bags of Features

Beyond bags of Features Beyond bags of Features Spatial Pyramid Matching for Recognizing Natural Scene Categories Camille Schreck, Romain Vavassori Ensimag December 14, 2012 Schreck, Vavassori (Ensimag) Beyond bags of Features

More information

Learning Compact Visual Attributes for Large-scale Image Classification

Learning Compact Visual Attributes for Large-scale Image Classification Learning Compact Visual Attributes for Large-scale Image Classification Yu Su and Frédéric Jurie GREYC CNRS UMR 6072, University of Caen Basse-Normandie, Caen, France {yu.su,frederic.jurie}@unicaen.fr

More information

Multiple Kernel Learning for Emotion Recognition in the Wild

Multiple Kernel Learning for Emotion Recognition in the Wild Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,

More information

Normalized Texture Motifs and Their Application to Statistical Object Modeling

Normalized Texture Motifs and Their Application to Statistical Object Modeling Normalized Texture Motifs and Their Application to Statistical Obect Modeling S. D. Newsam B. S. Manunath Center for Applied Scientific Computing Electrical and Computer Engineering Lawrence Livermore

More information

Exploring Bag of Words Architectures in the Facial Expression Domain

Exploring Bag of Words Architectures in the Facial Expression Domain Exploring Bag of Words Architectures in the Facial Expression Domain Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett Machine Perception Laboratory, University of California San Diego {ksikka,ting,josh,marni}@mplab.ucsd.edu

More information

Where am I: Place instance and category recognition using spatial PACT

Where am I: Place instance and category recognition using spatial PACT Where am I: Place instance and category recognition using spatial PACT Jianxin Wu James M. Rehg School of Interactive Computing, College of Computing, Georgia Institute of Technology {wujx,rehg}@cc.gatech.edu

More information

Aggregated Color Descriptors for Land Use Classification

Aggregated Color Descriptors for Land Use Classification Aggregated Color Descriptors for Land Use Classification Vedran Jovanović and Vladimir Risojević Abstract In this paper we propose and evaluate aggregated color descriptors for land use classification

More information