Combining Selective Search Segmentation and Random Forest for Image Classification


Gediminas Bertasius
November 24, 2013

1 Problem Statement

Random forest algorithms have been successfully used in many computer vision tasks such as image classification [1] and image segmentation [4]. Recently, Yao et al. showed that a random forest composed of decision trees in which every node is a discriminative classifier outperforms state-of-the-art results in fine-grained image categorization problems [8]. Yao et al. attributed their success to the two main components of their system: discrimination and randomization. Discrimination refers to the use of an SVM to learn the split at each node, whereas randomization refers to the random selection of image patches, which serve as the features used to learn the split at each node.

Several problems may arise from this randomization procedure. First, if we consider image patches of a given size in an image, the sampling space may contain thousands of patches, which makes it less likely that a randomly selected patch will contain an object of interest for the image categorization. In addition, randomly selected patches are likely to overlap with each other, which causes redundancy. Therefore, in this project, I investigated new ways of selecting image patches. In theory, more informative patch selection should result in higher-quality splits at each tree node, which in turn should increase the overall accuracy of the classifier.

Figure 1: Image a) illustrates how random image patch selection might work in the original algorithm of Yao et al. [8]. Image b) shows the regions that should be used instead to learn the splits in the tree-growing procedure.
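As a rough illustration of the two problems with random patch selection described above, the following sketch counts how many positions a fixed-size patch can take in an image and estimates how often randomly placed patches overlap. The 256x256 image size, the 64x64 patch size and all function names are assumptions chosen for illustration only; they are not taken from [8] or from my implementation.

```python
import random


def count_patch_positions(image_w, image_h, patch_w, patch_h, stride=1):
    """Number of distinct top-left positions a patch can take inside an image."""
    if patch_w > image_w or patch_h > image_h:
        return 0
    nx = (image_w - patch_w) // stride + 1
    ny = (image_h - patch_h) // stride + 1
    return nx * ny


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlap_fraction(image_w, image_h, patch_w, patch_h, n_patches,
                     trials=200, seed=0):
    """Monte Carlo estimate of the fraction of overlapping pairs among
    n_patches patches placed uniformly at random (requires n_patches >= 2)."""
    rng = random.Random(seed)
    overlapping = 0
    total = 0
    for _ in range(trials):
        boxes = []
        for _ in range(n_patches):
            x = rng.randint(0, image_w - patch_w)
            y = rng.randint(0, image_h - patch_h)
            boxes.append((x, y, x + patch_w, y + patch_h))
        for i in range(n_patches):
            for j in range(i + 1, n_patches):
                total += 1
                overlapping += boxes_overlap(boxes[i], boxes[j])
    return overlapping / total


if __name__ == "__main__":
    # Hypothetical sizes: 64x64 patches in a 256x256 image.
    print(count_patch_positions(256, 256, 64, 64))
    print(overlap_fraction(256, 256, 64, 64, n_patches=10))
```

Even with these modest assumed sizes the sampling space contains tens of thousands of candidate positions, which is the situation the segmentation-based selection is meant to avoid.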

2 Proposed Method

To fix the problems related to random patch selection, I integrated a selective search segmentation algorithm [5] into the original random forest framework. Image patches selected using selective search segmentation are more likely to contain the objects of interest. In addition, segmentation should eliminate redundant overlap between the image patches, which makes the feature space more diverse. Fixing these two problems should increase the discriminative power of the random forest.

My proposed method works in the following way. First, for each tree a random subset of images is selected for the training stage. Then, the selective search segmentation algorithm is applied to these images to produce the coordinates of the most relevant regions in each image. Afterwards, at each split the algorithm randomly picks N regions from the regions produced by selective search segmentation. Then, an SVM is trained on the features from each of these regions, and the region that produces the best split is selected as the splitting rule (the splitting rule is represented in the form of region coordinates). All of the images are normalized (resized) beforehand so that the coordinates correspond to the same region in each image. Below is pseudocode that describes the high-level procedure of growing a decision tree according to my proposed method:

for each tree t do
    - Select a set of training examples D;
    - Randomly pick a subset of coordinates corresponding to image patches produced by selective search segmentation;
    - From each image extract the image patches corresponding to the selected coordinates;
    - Train an SVM on each image patch and then select the best image patch to split the dataset D into D_1, D_2;
    - Recursively split datasets D_1, D_2;
    - Return tree t;
end

Algorithm 1: Proposed Method for Tree Growing Procedure

3 Implementation

3.1 Low Level Features

Similarly to [8], I used SIFT [3] visual descriptors as my low-level image representation. After extracting SIFT descriptors, I applied the k-means clustering algorithm to construct visual vocabularies of size 1024 and 256 for the Caltech 256 and Stanford 40 Actions datasets respectively. Then I used Locality-constrained Linear Coding [6] to match the descriptors with the specific words in the constructed vocabulary.

3.2 Selective Search Segmentation

Before beginning the random forest procedure, I standardize the images by rescaling them to the same size and then apply selective search segmentation to extract important regions from each image. Each region is represented by 4 coordinates in the image (the points in the bottom left and top right corners of the region). Then, k-means is applied to all the regions that were returned by selective search segmentation, and its centroids are chosen as the final candidate regions. In this particular case, I used 1024 centroids.

3.3 Decision Tree Framework

To build the trees, I use the same scheme as in [8]. However, instead of choosing candidate image regions randomly as is done in [8], I choose from the image regions that were previously selected by the selective search segmentation algorithm (as described in Section 3.2). At every node, each of the selected regions is considered as a possible splitting rule for the node, where the region in the tree is represented by its 4 coordinates. For each region, I use a linear SVM to determine the best hyperplane for splitting the remaining images at that node. Then the region and its respective SVM model that had the highest information gain are stored as the splitting rule at that particular node. It is important to note that before applying the SVM, the labels of the remaining images are randomly binarized in such a way that images belonging to the same class share the same binary label. This procedure makes it possible to apply a binary SVM rather than a multi-class SVM, which is highly desirable.

4 Evaluation

4.1 Datasets

To compare the performance of my proposed method with the original algorithm of Yao et al. [8], I ran both methods on two datasets: Caltech 256 [2] and the Stanford 40 Actions dataset [7]. Caltech 256 contains 256 image categories and has approximately 90 image samples for each category. The Stanford 40 Actions dataset consists of images of humans performing 40 different actions, where each action contains 300 images. It is important to note that these two datasets are significantly different. In Caltech 256, most of the objects are localized and centered in the image. In addition, there is minimal background clutter in the images. The Stanford 40 Actions dataset, on the other hand, is the complete opposite in these respects.
Actions of interest may appear anywhere in the image, not necessarily in the center. In addition, most of the images in the Stanford 40 Actions dataset contain other objects and a lot of background clutter, which makes classification significantly more challenging. Sample images from both datasets are presented in Figure 2.

Figure 2: a) illustrates the type of images in the Caltech 256 dataset (the object class is baseball glove in this case), whereas b) displays a sample image from the Stanford 40 Actions dataset (the action class is washing dishes in this case).

4.2 Results

The results suggest that incorporating the selective search segmentation algorithm into the random forest framework of Yao et al. [8] yields higher accuracy rates when classification is done on challenging image datasets, in which objects are not localized and there is a lot of background clutter. This is consistent with my expectations. If the objects in the images are well localized and there is no background clutter, segmentation is largely redundant because almost every patch in the image is informative by itself. However, in the case where images contain other objects in the background, segmentation helps to identify which of those objects are actually relevant for the classification, which increases the accuracy of the classifier. These hypotheses are supported by the results presented in the sections below.

4.2.1 Results on Stanford 40 Actions Dataset

As discussed earlier, incorporating segmentation into the original algorithm yields higher accuracy on the Stanford 40 Actions dataset. I compared the two methods for different numbers of samples used in the training procedure, and in all cases my proposed method produced higher accuracy rates. Analyzing accuracy rates for individual classes also reveals that selecting patches via segmentation is beneficial for identifying specific classes. My proposed method performed better in all but one of the classes, as illustrated in Figure 3.

Figure 3: Plot a) shows the accuracy of the two methods for different numbers of training samples, whereas b) displays the accuracies of both methods for a sample of individual classes in the Stanford 40 Actions dataset.

In Figure 4, I also present the confusion matrices produced by both methods. A summary of the results for the Stanford 40 Actions dataset is presented in Figure 5.

As a side note, the accuracy rate in Figure 3 does not improve steadily as the number of samples used in the training procedure increases. This can be explained by two factors. First, with a higher number of samples it is likely much harder for the linear SVM to learn meaningful splits. Furthermore, I did not have enough time to fine-tune all of the decision tree parameters. Therefore, with a higher number of training samples the classifier may be slightly overfitting. These two factors would explain the accuracy behavior in Figure 3.

Figure 4: Confusion matrices produced by the original algorithm of Yao et al. [8] and my modified algorithm respectively on Stanford 40 Actions.

Figure 5: Summary of the results of both methods on the Stanford 40 Actions dataset.

4.2.2 Results on Caltech 256 Dataset

In contrast, incorporating segmentation into the original algorithm of Yao et al. did not improve success rates on the Caltech 256 dataset. As mentioned earlier, this is because objects in the Caltech 256 images are already well localized and the images do not contain much background clutter. As a result, it is possible to select very informative patches even via random patch selection. In fact, random patch selection may be even more beneficial in this case, as it produces more diverse trees, which in turn improves the accuracy of the overall classification. These statements are supported by Figure 6, which shows that random patch selection produces equivalent or even better results on Caltech 256. In Figure 7, I also present the confusion matrices produced by both methods. In addition, a summary of the results for the Caltech 256 dataset is presented in Figure 8.

Similarly to the results for the Stanford 40 Actions dataset, the accuracy rate in Figure 6 does not increase steadily with a higher number of training examples. Once again, this could be explained by the same reasons: the linear SVM may fail to find good splits with a higher number of samples. Furthermore, due to suboptimal learning parameters, the decision trees may be overfitting the data as I increase the number of training samples.
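Both explanations above concern the quality of the splits learned at each node, so it may help to recall how a candidate split is scored (Section 3.3). The sketch below is a minimal illustration, not my actual implementation: it randomly binarizes class labels so that images of the same class share a binary label, and computes the information gain of a given left/right assignment. The function names are hypothetical. Splitting the classes into two roughly equal halves and measuring the gain on the original class labels are both assumptions; the text above only states that the binarization is random and that the split with the highest information gain is kept. The SVM training itself is omitted, and `goes_left` stands in for its predictions.

```python
import math
import random
from collections import Counter


def binarize_labels(labels, seed=0):
    """Randomly map each class to 0 or 1 so that all images of the same
    class share one binary label (assumption: roughly balanced halves)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    rng.shuffle(classes)
    positive = set(classes[: len(classes) // 2])
    return [1 if y in positive else 0 for y in labels]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, goes_left):
    """Entropy reduction obtained by splitting `labels` according to the
    boolean list `goes_left` (e.g. the sign of a linear SVM decision)."""
    n = len(labels)
    if n == 0:
        return 0.0
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children


if __name__ == "__main__":
    # Toy labels, not data from either dataset.
    y = ["a", "a", "b", "b"]
    print(binarize_labels(y))
    print(information_gain(y, [True, True, False, False]))
    print(information_gain(y, [True, False, True, False]))
```

A split that separates the classes perfectly receives the maximum gain, while a split that mixes them equally on both sides receives zero, which is the sense in which a linear SVM can "fail to find good splits".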

Figure 6: Plot a) shows the accuracy of the two methods for different numbers of training samples, whereas b) displays the accuracies of both methods for a sample of individual classes in the Caltech 256 dataset.

Figure 7: Confusion matrices produced by the original algorithm of Yao et al. [8] and my modified algorithm respectively on Caltech 256.

Figure 8: Summary of the results of both methods on the Caltech 256 dataset.

5 Conclusions and Future Work

As illustrated by the results, it is only beneficial to incorporate segmentation into the original algorithm of Yao et al. when the images in the dataset are very challenging. That includes cases when images contain a lot of background objects and when the objects of interest are not localized. In such cases segmentation helps to identify which regions in the image are important for the classification. In addition, because segmentation can be applied before the training procedure, it does not affect the run time of the decision tree learning procedure, a property which is highly desirable.

However, when the objects are well localized and the images do not contain background clutter (like the images in Caltech 256), picking regions via segmentation may actually hurt the performance of the classifier. This is because the resulting trees will be less diverse, and diversity is an important criterion for the overall effectiveness of a random forest framework.

Overall, I believe my proposed method is beneficial because most images in the real world will indeed contain a lot of background objects, in which case the regular algorithm of Yao et al. may not perform as well as my proposed method. Having said that, I believe some improvements could be made to enhance the performance of this algorithm. For one, there may be a better way to represent patches within the trees than simply storing them as coordinates in an image. This coordinate representation implicitly assumes that objects are located at the same positions across all of the images, which is not a correct assumption in most cases. Therefore, better ways to represent patches inside the tree structure should be explored in the future.

References

[1] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In IEEE International Conference on Computer Vision,
[2] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Technical Report CNS-TR , California Institute of Technology,
[3] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110,
[4] Jamie Shotton, Matthew Johnson, and Roberto Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR. IEEE Computer Society,
[5] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2): ,
[6] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition,
[7] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas J. Guibas, and Li Fei-Fei. Action recognition by learning bases of action attributes and parts. In International Conference on Computer Vision (ICCV), Barcelona, Spain, November
[8] Bangpeng Yao, Aditya Khosla, and Li Fei-Fei. Combining randomization and discrimination for fine-grained image categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Springs, USA, June
