Improved Spatial Pyramid Matching for Image Classification

Mohammad Shahiduzzaman, Dengsheng Zhang, and Guojun Lu

Gippsland School of IT, Monash University, Australia
{Shahid.Zaman,Dengsheng.Zhang,Guojun.Lu}@monash.edu

Abstract. Spatial analysis of salient feature points has been shown to be promising in image analysis and classification. Spatial pyramid matching uses both salient feature points and spatial multiresolution blocks to match images. However, different images or blocks can still have similar features under spatial pyramid matching. Analysis and matching can be made more accurate in scale space. In this paper, we propose to perform spatial pyramid matching in scale space. Specifically, pyramid match histograms are computed at multiple scales to refine the kernel for support vector machine classification. We show that the combination of salient point features, scale space and spatial pyramid matching significantly improves on the original spatial pyramid matching.

1 Introduction

Image classification has attracted a large amount of research interest in the past few decades due to the ever increasing volume of digital image data generated around the world. Traditionally, images are represented and retrieved using low level features. Recently, machine learning tools have been widely used to classify images into semantic categories, so low level features can now be used more efficiently than ever. Image classification is an important application in computer vision. Our research goal is to improve methods for image classification, more specifically for natural scene images or images with some spatial configuration. We want to classify an image by the semantic category of its scene, such as forest, road or building. Our approach to whole image categorization employs two renowned techniques, namely Spatial Pyramid Matching (SPM) [1] and scale space theory. Our objective is to combine the power of these two methods.
In this paper, scene categorization is attempted using a global image representation developed from low level image properties. An alternative approach to this task is to obtain high level semantic attributes by segmenting the objects in the scene (such as a bed or a car) and to classify the scene accordingly. We believe scene classification can be done without extracting these high level object cues. This is inspired by [2], which showed that people can recognize natural scenes while overlooking most of their details (i.e. the constituent objects). It is also shown in [3] that global information is as important as local information for scene classification by human subjects.

R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part IV, LNCS 6495, pp. 449-459, 2011. © Springer-Verlag Berlin Heidelberg 2011
Scale is an important aspect of finding local features and detecting prominent cues in images. The most prominent example of using scale space and characteristic scale is the local invariant feature detector SIFT [4]. In SIFT, the maxima/minima across neighboring scales are used to find the interest points or key points of an image. Scene features like the sand on a beach or certain textures in the curtain of a room would be more evident at larger scales. Scale-space theory is a framework for multi-scale signal representation. It is a formal theory for handling image structures at different scales, by representing an image as a one-parameter family of smoothed images, the scale-space representation, parameterized by the size of the smoothing kernel used for suppressing fine-scale structures [5].

In recent years the bag-of-features (BoF) model has been extremely popular in image categorization. The method treats an image as a collection of unordered appearance descriptors extracted from local patches. The patches or descriptors are quantized into discrete visual words of a codebook dictionary, and the resulting image histograms are then compared and classified. The BoF approach discards the spatial order of local descriptors, which severely limits the descriptive power of the image representation. By overcoming this problem, one particular extension of the BoF model, called spatial pyramid matching (SPM) [1], has achieved remarkable success on a range of image classification benchmarks and was the major component of state-of-the-art systems, e.g., [6]. Our method is based on SPM. Like SPM, we use the subdivide and disorder principle. The essence of this principle is to partition the image into smaller blocks and calculate orderless statistics of low level image features.
Existing methods differ in the features used (such as pixel values, gradient orientations, and filter bank outputs) and in the subdivision method (regular grids, quad trees, and flexible image windows). Both SPM and our method are independent of the choice of features; any other type of feature can be plugged in to obtain a classification result. The authors of [7] offered an early insight into the subdivide and disorder principle by suggesting that locally orderless images play an important role in visual perception. While the SPM authors did not consider the Gaussian scale space of apertures described in [7], we integrate that idea into SPM. The importance of locally orderless statistics is also evident from a few recent publications. To summarize, our method provides a unified framework that combines the gains from the subdivide and disorder principle and the scale space aperture with a choice of low level features. It makes it possible to combine the locally orderless statistics from multiple scales and from different fixed hierarchies of rectangular windows to achieve the scene classification task.

2 Related Methods

In this work we combine the power of multiresolution histograms with spatial pyramid matching. Our method therefore consists of two concepts: multiresolution (scale space) analysis of images and spatial pyramid matching. In kernel based learning methods like the support vector machine (SVM), we need to provide a
kernel for learning and testing. There are many kernels, which vary in formulation. For example, the histogram intersection kernel is a kernel matrix built from histogram intersections. Essentially, it provides a pairwise similarity measure between the training and testing images.

A pyramid match kernel (PMK) [8] works with an unordered image representation. The idea of the method is to compute multiresolution histograms and find the histogram intersection at each resolution. In figure 1, for two different images X and Y, histograms and the corresponding histogram intersections are computed at three resolution levels (0, 1, 2). The bin size is doubled at each successive resolution level, while the number of bins is halved. The new matches found at each resolution are then weighted and summed to form the kernel value. PMK has the limitation of discarding all spatial information.

Fig. 1. Schematic illustration of Pyramid match kernel with two levels

Let us construct a sequence of grids at resolutions $0, 1, \ldots, L$ such that the grid at level $l$ has $2^l$ cells along each dimension. The number of matches $I^l$ at level $l$ is given by the histogram intersection function. Therefore, the number of new matches found at level $l$ is given by $I^l - I^{l+1}$ for $l = 0, 1, \ldots, L-1$. The weight associated with level $l$ is set to $\frac{1}{2^{L-l}}$.

Spatial pyramid matching (SPM) takes a different approach: it performs pyramid matching in the two-dimensional image space and uses traditional clustering techniques in feature space. In SPM the histogram computation is therefore done at a single resolution and at multiple pyramid levels within that resolution, whereas in PMK it is done at multiple resolutions. PMK does not employ any feature clustering; it maps features directly into multiresolution histogram bins.
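As a minimal sketch of the weighting just described, the following Python function adds the matches found at the finest level to the new matches $I^l - I^{l+1}$ found at each coarser level, each weighted by $1/2^{L-l}$. The function names and the toy one-dimensional histograms are illustrative assumptions, and the full-weight treatment of the finest level follows the formulation in [1] rather than anything stated explicitly in this section.

```python
def histogram_intersection(h1, h2):
    """Number of matches between two histograms that share the same bins."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pyramid_match(hists_x, hists_y):
    """Weighted sum of new matches over levels 0..L.

    hists_x[l] and hists_y[l] are the histograms of the two images at
    level l, where level L is the finest grid (assumed input layout).
    """
    L = len(hists_x) - 1
    # I[l]: matches found at level l.
    I = [histogram_intersection(hx, hy) for hx, hy in zip(hists_x, hists_y)]
    # Matches at the finest level count with full weight (as in [1]).
    score = float(I[L])
    for l in range(L):
        new_matches = I[l] - I[l + 1]      # matches first found at level l
        score += new_matches / 2 ** (L - l)
    return score


# Hypothetical toy histograms for levels 0, 1, 2 (1, 2 and 4 bins).
X = [[4], [2, 2], [2, 0, 1, 1]]
Y = [[4], [2, 2], [1, 1, 0, 2]]
score = pyramid_match(X, Y)
```

In this toy case two matches are found at the finest level and two further matches appear only at level 1, so the coarser matches contribute with half weight.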
On the other hand, SPM uses feature clustering during histogram computation to find the representative feature sets. In SPM, all feature vectors are first quantized into M discrete types (i.e. the total number of histogram indices is M). Figure 2 shows an example of constructing a three-level spatial pyramid. The image has three types of features, indicated by triangles, circles and stars. In the top row, the image is subdivided at three different levels of resolution. In the bottom row, the number of features that fall in each subregion is counted. The spatial histograms are weighted according to pyramid
match kernel.

Fig. 2. Three-level spatial pyramid example

During kernel computation, each feature type (channel) $m$ contributes two sets of two-dimensional vectors, $X_m$ and $Y_m$, representing the coordinates of features of type $m$ found in the respective images. The final kernel is then the sum of the separate channel kernels, where $\kappa^L$ denotes the pyramid match kernel of a single channel:

$$K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m) \tag{1}$$

This method reduces to a standard bag of features when there is only a single level ($L = 0$). Since the pyramid match kernel is simply a weighted sum of histogram intersections, and $c \min(a, b) = \min(ca, cb)$ for positive numbers, $K^L$ can be implemented as a single histogram intersection of long vectors formed by concatenating the appropriately weighted histograms of all channels at all resolutions. In other words, we weight the histograms before computing the histogram intersection for convenience, since weighting afterwards would yield the same result. For L levels, M channels and S scales, the resulting vector has dimensionality:

$$\left( M \sum_{l=0}^{L} 4^l \right) S = M \, \frac{1}{3} \left( 4^{L+1} - 1 \right) S \tag{2}$$

Several experiments reported in the results section use the settings M = 200, L = 3 and S = 3, resulting in (3 × 17000)-dimensional histogram intersections. However
these operations are efficient because the histogram vectors are extremely sparse; the computational complexity of the kernel is linear in the number of features. One important aspect of the training and test images is that we run the experiments only on gray level images; even when color images are available, we convert them to gray level images. This decision follows the finding of [9] that removing color information from images does not make the scene categorization task more attention demanding.

3 Proposed Method: Multi-scale SPM

SPM combines local salient features and their spatial relationships to provide robust feature matching. However, in many cases different images and blocks can have similar histograms, which degrades the performance of SPM. This drawback can be overcome by analyzing images in scale space, since such confusions can be resolved at different scales. For example, in figure 3, images (a) and (c) are artificially generated images with almost similar histograms; after Gaussian blurring, their histograms become more discriminative than the original histograms.

Fig. 3. (a) and (c) are different images with almost similar image histograms (b) and (d). (e) and (g) are the corresponding Gaussian blurred images; the previously small difference in histograms is now more prominent at higher scales (f and h).

For a given image $f(x, y)$, its linear (Gaussian) scale-space representation is a family of derived signals $L(x, y; t)$ defined by the convolution of $f(x, y)$ with the Gaussian kernel

$$g_t(x, y) = \frac{1}{2\pi t} \, e^{-\frac{x^2 + y^2}{2t}}$$

such that

$$L(x, y; t) = (g_t * f)(x, y) \tag{3}$$

Inspired by scale space theory, we propose a multi-scale spatial pyramid matching method. The key idea behind our method is the use of scale space to gain
more discriminative power in classification. The major steps of our algorithm are shown in figure 4 and described below.

Fig. 4. Block diagram of the proposed method

3.1 Feature Generation in Different Scales

First, SIFT features are generated from all the images at different scales on a regular grid. A dense feature representation is used to avoid the problems of superfluous data such as clutter and occlusion. 128-dimensional SIFT descriptors are calculated for all images at all scales on an 8*8 regular grid, using a 16*16 patch at each grid center. These features are saved into files for use in later steps.

3.2 Calculate Dictionary

The features are clustered according to the parameter M, which is the total number of bins of the computed histograms. It is often believed that increasing M will increase the classification accuracy. In our experiments, however, M = 200 gives accuracy comparable to M = 400 and M = 600. The dictionary is built from all images at all scales: for each scale, it is calculated by K-means clustering of all the SIFT features extracted at that scale, so separate dictionaries are obtained for separate scales. Figure 5 (left image) shows the histogram of the values of a dictionary of size 200. The dictionaries are used for histogram generation in later stages.
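The dictionary step can be illustrated with a small, self-contained sketch. It runs a plain Lloyd-style K-means on hypothetical 2-D toy descriptors (real SIFT descriptors are 128-dimensional) and builds one dictionary per scale, as described above. The function names, the random initialization and the iteration count are assumptions made for illustration; the clustering implementation actually used in the experiments is not specified in this paper.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, v):
    """Index of the dictionary entry (cluster center) closest to v."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(centers[k], v))


def build_dictionary(descriptors, M, iterations=20, seed=0):
    """Plain K-means returning M cluster centers (visual words)."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, M)]
    for _ in range(iterations):
        groups = [[] for _ in range(M)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        for k, members in enumerate(groups):
            if members:  # keep the previous center if a cluster is empty
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return centers


def word_histogram(descriptors, centers):
    """Count how many descriptors fall on each visual word."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(centers, d)] += 1
    return hist


# Hypothetical toy descriptors, grouped by scale level.
descriptors_by_scale = {
    1: [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)],
    2: [(1.0, 1.0), (1.2, 0.9), (9.0, 9.0), (8.8, 9.1)],
}
# One separate dictionary per scale.
dictionaries = {s: build_dictionary(d, M=2)
                for s, d in descriptors_by_scale.items()}
hist = word_histogram(descriptors_by_scale[1], dictionaries[1])
```

Each image would then be described, per scale, by histograms of the visual-word indices returned by `nearest`.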
Fig. 5. Histogram plot of the calculated dictionary (left) and combined pyramid histogram plot of all individual histograms in different levels (right)

3.3 Compile Pyramid Histogram

For all scales, the image is divided from coarse to finer resolutions, a histogram is computed in each area, and weights are assigned according to PMK. A match at a finer resolution is given more weight than a match at a coarse resolution. After these steps we have all the data required to build the pyramid histogram. Given the histograms of the different scale levels, we can either simply concatenate them into one long histogram or compute an inter-scale intersection/selection before the concatenation. We take the first approach in our method. Although this increases the size of the long histogram by a factor equal to the number of scales, it is not a problem performance-wise. In this research our focus is on increasing classification accuracy and on leveraging the currently available powerful hardware. In figure 5 (right image), one such combined pyramid histogram is shown. According to equation 2, the size of the histogram is 34000 for dictionary size 200, 3 pyramid levels and scale level 1.

3.4 Kernel Computation and SVM Classification

For SVM, we just need to build the histogram intersection kernel from the compiled pyramid histograms. As explained before, for the histogram intersection kernel computation we only need to find the intersections of the long concatenated histograms formed in the previous step. For the training kernel, the intersection is computed among the training histograms themselves, and for the testing kernel it is computed between the training histograms and the testing histograms. A gray scale image map of the training and testing kernels is shown in figure 6. For the training kernel, a white line is visible along the diagonal, as there is a perfect match for corresponding training pairs.
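The compilation of the long histogram can be sketched as follows. This is only an illustration under stated assumptions: features are given as hypothetical (x, y) coordinates with a visual-word index each, the per-level weights are the expanded form of the pyramid match weights used in [1] (level 0 weighted by $1/2^L$, level $l \geq 1$ by $1/2^{L-l+1}$), and all function names are invented for the example.

```python
def pyramid_histogram(points, words, M, L, width, height):
    """Concatenate weighted per-cell word histograms for levels 0..L."""
    long_hist = []
    for l in range(L + 1):
        cells = 2 ** l  # cells along each image dimension at this level
        # Assumed weights, following the expanded kernel in [1].
        weight = 1 / 2 ** L if l == 0 else 1 / 2 ** (L - l + 1)
        hist = [0.0] * (cells * cells * M)
        for (x, y), w in zip(points, words):
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hist[(cy * cells + cx) * M + w] += weight
        long_hist.extend(hist)
    return long_hist


def multiscale_histogram(per_scale_inputs, M, L, width, height):
    """Concatenate the pyramid histograms of all scales (first approach)."""
    long_hist = []
    for points, words in per_scale_inputs:
        long_hist.extend(pyramid_histogram(points, words, M, L, width, height))
    return long_hist


def expected_length(M, L, S):
    """Dimensionality according to equation 2."""
    return M * (4 ** (L + 1) - 1) // 3 * S


# Hypothetical toy input: three features, two visual words, a 4*4 image.
points = [(0.5, 0.5), (3.5, 0.5), (3.5, 3.5)]
words = [0, 1, 1]
single = pyramid_histogram(points, words, M=2, L=1, width=4, height=4)
combined = multiscale_histogram([(points, words), (points, words)],
                                M=2, L=1, width=4, height=4)
```

The length of `combined` grows linearly with the number of scales, which is the size increase discussed above.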
In the testing kernel the matches are scattered, as the training and testing sets are different. For SVM, we use a modified version of the libsvm library [10] which implements one vs. all classification.
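A minimal sketch of the kernel computation is given below, assuming each image is already represented by its weighted long histogram. The function name and the toy histograms are hypothetical. The resulting matrices are the kind of input accepted by SVM implementations that support precomputed kernels; the classifier itself is not shown.

```python
def intersection_kernel(rows, cols):
    """Histogram intersection between every row histogram and every
    column histogram, returned as a nested list (kernel matrix)."""
    return [[sum(min(a, b) for a, b in zip(r, c)) for c in cols]
            for r in rows]


# Hypothetical weighted long histograms (each sums to 1).
train = [[0.5, 0.0, 0.5],
         [0.0, 1.0, 0.0]]
test = [[0.25, 0.25, 0.5]]

# Training kernel: training histograms against themselves.
K_train = intersection_kernel(train, train)
# Testing kernel: testing histograms against training histograms.
K_test = intersection_kernel(test, train)

# Weighting before or after the intersection gives the same value,
# because c * min(a, b) == min(c * a, c * b) for positive c.
c, a, b = 0.25, 3.0, 5.0
same = (c * min(a, b) == min(c * a, c * b))
```

The diagonal of `K_train` holds the largest values in each row, which corresponds to the white diagonal line described for the training kernel.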
Fig. 6. Histogram intersection kernel as image for training images (left) and testing images (right)

4 Experimental Results

4.1 Test Dataset

We tested our method on the scene category dataset [1], Caltech-101 [11] and Caltech-256 [12]. A brief statistical comparison of these three datasets is given in table 1.

4.2 Performance Metric

Two separate performance metrics are used to measure the results: combined accuracy and average of per class accuracy. Per class accuracy (P) is defined as the ratio of correctly classified images in a class to the total number of images in that class. If the total number of image categories is N, then combined accuracy and average of per class accuracy are defined as:

$$\text{Average of per class accuracy} = \frac{\sum_{i=1}^{N} P_i}{N} \tag{4}$$

$$\text{Combined accuracy} = \frac{\text{Total number of correctly classified images}}{\text{Total number of images in the dataset}} \times 100 \tag{5}$$

Table 1. Statistical information of the image datasets used

Dataset          No. of categories   Total no. of images   Avg. image size   Max. no. of train/test images used
Scene category   15                  4485                  300*250           100/rest
Caltech-101      102                 9144                  300*200           30/300
Caltech-256      257                 30607                 351*300           60/300
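The two metrics of equations 4 and 5 can be computed as in the following sketch. The function names and the toy labels are hypothetical, and per class accuracy is expressed here as a percentage so that both metrics are on the same scale as the reported tables.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Percentage of correctly classified images for each class."""
    totals, correct = {}, {}
    for t, p in zip(true_labels, predicted_labels):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


def average_per_class_accuracy(true_labels, predicted_labels):
    """Equation 4: mean of the per class accuracies over the N classes."""
    acc = per_class_accuracy(true_labels, predicted_labels)
    return sum(acc.values()) / len(acc)


def combined_accuracy(true_labels, predicted_labels):
    """Equation 5: correctly classified images over all images, times 100."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * hits / len(true_labels)


# Hypothetical toy labels: four forest images and one road image.
truth = ["forest", "forest", "forest", "forest", "road"]
predicted = ["forest", "forest", "forest", "road", "road"]
avg_acc = average_per_class_accuracy(truth, predicted)
comb_acc = combined_accuracy(truth, predicted)
```

With unbalanced classes the two metrics differ: the single road image counts as much as all four forest images in the average of per class accuracy, but only as one image out of five in the combined accuracy.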
Table 2. Accuracy results on different combinations of parameters. Bold font means it is the best for a certain codebook size and pyramid level.

Codebook size   Pyramid level   Scale level   Combined accuracy (%)   Avg. of per class accuracy (%)
200             3               1             81.47 ± 0.59            81.11 ± 0.68
200             3               2             83.69 ± 0.50            83.31 ± 0.59
200             3               3             83.45 ± 0.57            83.21 ± 0.61
200             2               1             79.88 ± 0.52            81.1 ± 0.30
200             2               2             82.69 ± 0.67            82.25 ± 0.52
200             2               3             82.78 ± 0.70            82.21 ± 0.75
400             3               1             81.95 ± 0.57            81.1 ± 0.60
400             3               2             83.78 ± 0.64            83.48 ± 0.58
400             3               3             83.71 ± 0.54            83.29 ± 0.70
400             2               1             80.28 ± 0.53            81.4 ± 0.50
400             2               2             83.22 ± 0.44            82.75 ± 0.40
400             2               3             83.10 ± 0.63            82.67 ± 0.78

Table 3. Our result compared to the original SPM for codebook size = 400, pyramid level = 3 and scale level = 2

                                    SPM [1]        Proposed method
Average of per class accuracy (%)   81.1 ± 0.60    83.48 ± 0.58
Combined accuracy (%)               81.95 ± 0.57   83.78 ± 0.64

Table 4. Caltech-101 result for codebook size = 400, pyramid level = 3 and scale level = 3

                                    SPM [1]        Proposed method
Average of per class accuracy (%)   64.6 ± 0.7     67.36 ± 0.17
Combined accuracy (%)               70.59 ± 0.16   76.65 ± 0.46

Table 5. Caltech-256 result for codebook size = 400, pyramid level = 3 and scale level = 3

                                    SPM [12]       Proposed method
Average of per class accuracy (%)   32.62 ± 0.41   37.54 ± 0.31
Combined accuracy (%)               34.98 ± 0.60   40.19 ± 0.12

Table 2 reports the extensive experiments done over codebook size, pyramid level and scale level. Results are first grouped by codebook size and pyramid level. The notable thing here is that a scale level greater than one always produces better results than a single scale level. Using the combined accuracy metric, we get our best result with codebook size 400, pyramid level 3 and scale level 2. Scale level 1 is basically the original SPM, so for scale level 1 we use the results from [1]. However, as the authors of [1] did not report the combined accuracy, we calculated it using our own implementation of SPM.
All results are obtained using a 2*64 bit Quad core processor with 48
GB of RAM. All experiments are run ten times with randomly selected training and testing images. The average and standard deviation over all runs are reported here. Table 3 summarizes our best result compared to the original SPM. Figure 7 shows the per class accuracy for the best result reported in Table 2. Our method outperforms SPM in eleven categories and provides comparable performance in the other four categories.

Fig. 7. Per class accuracy for the result (average of per class accuracy) reported in Table 2

We tested whether the difference between the two methods reported in table 2 is statistically significant using the Matlab function ttest. The ttest result indicated that the improvement obtained by the proposed method is indeed statistically significant. The results on Caltech-101 and Caltech-256 are presented in tables 4 and 5, and they are in line with the results obtained on the scene category dataset. On Caltech-101, the proposed method improves on SPM by about 6% in combined accuracy and about 3% in average of per class accuracy; on Caltech-256, the improvement is about 5% on both metrics.

5 Conclusion and Future Scope

This paper presents an improvement to the spatial pyramid matching scheme. We provided a simple, intuitive and effective way to improve the SPM method.
To the best of our knowledge, this has not been done by previous researchers. The proposed extension is quite general, is not limited to any specific feature descriptors or classifiers, and can be used as a surrogate module or new baseline for SPM in image categorization systems.

The weight mechanism of the spatial pyramid matching (SPM) method is not sophisticated enough. It assigns a uniformly higher weight to the finer resolution blocks and penalizes the coarse resolution blocks by assigning them less weight. As a basic method this is acceptable, but if a finer resolution block contains only background or clutter, assigning it more weight only misleads the calculation. In the future, there is therefore room for redesigning this weight mechanism so that more weight is assigned only to the corresponding blocks, irrespective of scale or spatial resolution.

References

1. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169-2178 (2006)
2. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42(3), 145-175 (2001)
3. Vogel, J., Schwaninger, A., Wallraven, C., Bülthoff, H.H.: Categorization of Natural Scenes: Local versus Global Information and the Role of Color. ACM Transactions on Applied Perception 4(3) (2007)
4. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91-110 (2004)
5. Witkin, A.P.: Scale-space filtering. In: Proceedings of 8th International Joint Conference on Artificial Intelligence, pp. 1019-1022 (1983)
6. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge.
In: VOC 2009 (2009), http://www.pascal-network.org/challenges/voc/voc2009/workshop/index.html
7. Koenderink, J., Doorn, A.V.: The structure of locally orderless images. International Journal of Computer Vision 31(2-3), 159-168 (1999)
8. Grauman, K., Darrell, T.: The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In: Proceedings of the IEEE International Conference on Computer Vision, ICCV (2005)
9. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2005)
10. Chang, C., Lin, C.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
11. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: Proceedings of IEEE Workshop on Generative-Model Based Vision, CVPR (2004)
12. Griffin, G., Holub, A., Perona, P.: Caltech-256 Object Category Dataset. Technical Report, Caltech (2007)