Linear combinations of simple classifiers for the PASCAL challenge

Nik A. Melchior and David Lee
16-721 Advanced Perception, The Robotics Institute, Carnegie Mellon University
Email: melchior@cmu.edu, dlee1@andrew.cmu.edu

Abstract

Our class project for the 16-721 Advanced Perception course was an entry in the PASCAL Visual Object Classes Challenge 2006. The goal of this challenge is to determine whether an object from one of ten classes appears in a given image. Labelled training data was provided, and participants were free to use any method. We chose to implement a bag-of-words classifier using color, shape, and texture information. AdaBoost was used to determine the best weights for combining classifiers based on these simple descriptors. This report details the development of the overall classifier and discusses the results of the challenge.

I. INTRODUCTION

This paper describes our entry in the PASCAL Visual Object Classes Challenge (http://www.pascal-network.org/challenges/voc/voc2006/index.html). This challenge was divided into two different competitions, termed classification and detection. The goal of the classification challenge, in which we participated, was to determine the confidence that a given object appears in an image. Participants were free to choose a subset of the ten object classes selected for the challenge. Submissions for the classification challenge consisted of a real-valued confidence that an object of the given class appears in each test image. This permits the calculation of Receiver Operating Characteristic (ROC) curves and the area under the curve (AUC), which is used for quantitative comparisons between entries. The detection competition required participants to determine the bounding box of each detected object in the image.

The competition began on February 14 with the release of development code and 2618 labelled training images. For each training image, the bounding box and pose of each instance of an object class were included. The object classes are bicycle, bus, car, cat, cow, dog, horse, motorbike, person, and sheep. The test set consisted of an additional 2686 images released on March 31 without ground truth labels. Unlike typical computer vision data sets, the images used in this challenge were rarely well-composed photographs of the objects to be detected. Objects often appear occluded, partially out of frame, silhouetted, or very distant.

We began the challenge hoping to develop an entry based on a relatively simple classifier. The difficulties of this particular data set led us to believe that no single feature, or set of features, would be sufficient to classify every image. Image features based on color, texture, and shape could each contribute to the overall classification, but some features would be more helpful for some classes than for others. For example, most of the images of sheep and cows appear in the context of green grass and blue skies, so color information alone should help classify these images. The images in figure 1 illustrate some of the difficulties mentioned above. We would like our classifier to recognize and benefit from the ability to classify in the lower-dimensional space of color features, without being confounded by texture and shape. To this end, we chose to perform classification independently with each of the seven possible combinations of our three simple features. The individual classifiers used a bag-of-words [1] approach, and AdaBoost [2] was used to discover the appropriate weighting of these classifiers for each object class.
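As a concrete aside (this is our illustration, not the authors' code), the seven classifier inputs are simply the non-empty subsets of the three feature types:

```python
from itertools import combinations

FEATURES = ("color", "texture", "shape")

# All non-empty subsets of the three simple features: 2^3 - 1 = 7 combinations,
# each of which gets its own independent bag-of-words classifier.
feature_combinations = [
    subset
    for k in range(1, len(FEATURES) + 1)
    for subset in combinations(FEATURES, k)
]
assert len(feature_combinations) == 7
```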
The single-feature classifiers were based on simple histograms of colors, the textons developed by Martin et al. [3], and the shape information obtained from Histograms of Oriented Gradients (HoG) [4]. Each of these features, and the overall classifier, are described in depth in the following section. In section III, we present the results of the classification challenge, including a comparison of our classifier with other entries. Finally, section IV describes possible improvements to our strategy.

II. ALGORITHM

We classify patches at multiple scales within an image independently (the so-called bag-of-words approach). Our classifier is a combination of fourteen weak classifiers: seven trained on the bounding box of the object of interest, and seven trained on the entire image including the background. A weighted sum of these fourteen weak classifiers, where the weights are chosen by AdaBoost, gives us a confidence for each patch. The confidences of all the patches in an image are averaged to give a final confidence for the image.

A. Bag of Words Classifier

The bag-of-words classifier extracts features from image patches (or sometimes uses the actual pixel values of the patches themselves) for classification. The hope is that the patches provide more information for classification than the more traditional interest point features, without losing generalization ability.
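The scoring scheme above is just a weighted average; a minimal sketch follows (array shapes and names are our assumptions; the authors' implementation was in Matlab):

```python
import numpy as np

def image_confidence(patch_scores: np.ndarray, adaboost_weights: np.ndarray) -> float:
    """Turn weak-classifier outputs into a single image-level confidence.

    patch_scores:     (n_patches, 14) array; column j is weak classifier j's
                      score for each patch extracted from the image.
    adaboost_weights: (14,) weights chosen by AdaBoost, one per classifier.
    """
    # Weighted sum over the fourteen weak classifiers gives a per-patch score...
    per_patch = patch_scores @ adaboost_weights
    # ...and averaging over all patches gives the final image confidence.
    return float(per_patch.mean())
```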

Fig. 1. Some difficult images: (a) object in silhouette; (b) color and texture are misleading.

The features which we extract from patches are described below. The patches themselves are 16-by-16 pixel squares extracted from the image at three different scales. We evaluated the algorithm both when features were extracted from the entire image and when they were limited to the ground truth bounding box of the object. For the sake of speed, the image patches used for training were non-overlapping. During testing, the improvement from a sliding-window approach did not justify the added running time, but we added an additional rescaling of the image.

Next, the features extracted during training must be stored for comparison with those of the test images. Rather than store all extracted features, the typical approach is to cluster the features and store a limited number of cluster centers. For each of our fourteen simple classifiers, we created positive and negative dictionaries from these cluster centers. Negative dictionaries were built using features extracted from entire images which did not contain the object class. To accommodate AdaBoost's requirement to weight training examples at each iteration of training, we implemented an extension to k-means which altered the distances to examples based on their weights.

B. Descriptors

Three types of descriptors are computed on 16-by-16 pixel image patches to obtain reduced-dimension descriptors.

1) Color: A histogram in RGB space is used as the color descriptor. We choose the top two entries in the histogram to obtain a 6-dimensional color descriptor.

2) Texture: A histogram of textons is used as the texture descriptor. Textons are trained on a small set of images: a bank of filters is applied to the images, and the filter responses are clustered into 32 groups to obtain 32 textons. We used textons that were already trained by Martin et al. [3]. Given a new image, each pixel is assigned to one of the 32 textons by applying the same filterbank and finding the closest texton. The histogram of textons gives us a 32-dimensional texture descriptor.

3) Histogram of Oriented Gradients: We used the method proposed by Dalal and Triggs [4] with some minor differences in parameters. Each 16-by-16 pixel image patch is divided into four subregions of 8-by-8 pixels. For each subregion, the gradient angle is binned into 18 bins (360 degrees at 20 degrees per bin) to obtain a histogram. The histograms from the four subregions are concatenated to obtain a 72-dimensional descriptor.

C. Combining Classifiers

The results of the individual bag-of-words classifiers are combined using AdaBoost. We used the Matlab implementation provided by the MSU Graphics & Media Lab, Computer Vision Group (http://graphics.cs.msu.ru). We used a variant of AdaBoost included in this toolbox called Modest AdaBoost [5], which uses a regularized trade-off to help avoid overfitting and improve generalization. We experimented with several techniques for combining the results of the fourteen simple classifiers, but the best results were obtained by simply giving one vote to each feature with each classifier. We expected to see better results by weighting the votes based on the classification margin, or by using a nearest-neighbor ratio test to determine which features were truly discriminative, but no variation of these tests improved classification results.
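The paper does not spell out exactly how the example weights alter the clustering, so the sketch below shows one common way to make k-means respect AdaBoost's per-example weights (a weighted centroid update); the authors describe altering distances, which may differ in detail:

```python
import numpy as np

def weighted_kmeans(X, w, k, n_iter=50, seed=0):
    """k-means over descriptors X (n, d) with per-example weights w (n,).

    Weights enter the centroid update as a weighted mean, so examples that
    AdaBoost currently weights heavily pull the dictionary toward them.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every example to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = labels == j
            if w[members].sum() > 0:
                centers[j] = (w[members, None] * X[members]).sum(axis=0) / w[members].sum()
    return centers, labels
```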
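In outline, the two histogram descriptors might look as follows (the bin count, the normalization, and the reading of "top 2 entries" as the two most populated bins' center colors are our assumptions; the texton codebook is assumed precomputed, as in Martin et al. [3]):

```python
import numpy as np

def color_descriptor(patch_rgb, bins_per_channel=4):
    """6-D color descriptor: the RGB bin centers of the two most populated
    bins of a joint RGB histogram over the 16x16 patch (pixel values 0-255).
    """
    pixels = patch_rgb.reshape(-1, 3).astype(float)
    edges = [np.linspace(0.0, 256.0, bins_per_channel + 1)] * 3
    hist, _ = np.histogramdd(pixels, bins=edges)
    top2 = np.argsort(hist.ravel())[-2:][::-1]   # two most populated bins
    descriptor = []
    for flat_idx in top2:
        ijk = np.unravel_index(flat_idx, hist.shape)
        # Represent each bin by its center color (r, g, b): 2 bins x 3 = 6 dims.
        descriptor.extend((edges[c][i] + edges[c][i + 1]) / 2.0
                          for c, i in enumerate(ijk))
    return np.asarray(descriptor)

def texton_descriptor(filter_responses, textons):
    """32-D texture descriptor: assign every pixel to its nearest texton and
    histogram the assignments.

    filter_responses: (h, w, n_filters) filterbank output for the patch.
    textons:          (32, n_filters) precomputed texton centers.
    """
    r = filter_responses.reshape(-1, filter_responses.shape[-1])
    d2 = ((r[:, None, :] - textons[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                   # nearest texton per pixel
    hist = np.bincount(labels, minlength=len(textons)).astype(float)
    return hist / hist.sum()                     # normalized 32-bin histogram
```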
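Likewise, a sketch of the 72-dimensional patch descriptor (gradient-magnitude weighting is our assumption; the paper specifies only the cell layout and the 18 orientation bins):

```python
import numpy as np

def hog_descriptor(patch_gray: np.ndarray) -> np.ndarray:
    """72-D HOG-style descriptor: a 16x16 patch is split into four 8x8 cells,
    and each cell's gradient angles (0-360 degrees) are binned into 18 bins
    of 20 degrees each; the four histograms are concatenated.
    """
    gy, gx = np.gradient(patch_gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0   # signed gradient, [0, 360)
    hists = []
    for rows in (slice(0, 8), slice(8, 16)):
        for cols in (slice(0, 8), slice(8, 16)):
            h, _ = np.histogram(ang[rows, cols], bins=18,
                                range=(0.0, 360.0), weights=mag[rows, cols])
            hists.append(h)
    return np.concatenate(hists)                   # 4 cells x 18 bins = 72 dims
```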
III. RESULTS

Twenty teams submitted entries to the classification challenge, and most teams tackled all ten object classes. Although our results were the best among the few entries from Carnegie Mellon, our algorithm performed near the middle of the pack in the final results. An example of the final ROC curves of all entries is shown in figure 2 for the sheep class. Our entry is labelled AP06 Lee. Table II summarizes our performance compared to the winning entry in each class. We were able to compute the results shown for the validation set prior to submission, using the ground truth labels provided with the training data. The test results were computed by the competition organizers, and the ground truth for this image set has not yet been released. The results of all entries are temporarily available at http://www.robots.ox.ac.uk/~me/voc2006res/. A full report describing the results and each participant's approach is expected to be available shortly from the challenge website.

Fig. 2. A collection of ROC curves for one object class.

Figure 3(a) shows the 100 images from the validation set that scored highest for the class sheep, with the bottom right being the most confident and the top left the least confident. Figure 3(b) shows the 100 images with the lowest scores, again with the top left being the least confident. There were a total of 132 images containing sheep out of the 1341 images in the validation set.

Table I shows the weights assigned to the fourteen weak classifiers by AdaBoost. A higher number means the feature is more discriminative, and zero means the feature is of no use; a negative number means it is worse than a random guess. Color+HOG and Color+HOG+Texture are usually the best features, and HOG in general is good. Color by itself is not very useful, but becomes useful when combined with other features. Classifiers trained on the bounding box of the object are generally better than classifiers trained on the entire image including the background, but the background still provides some help.

IV. CONCLUSIONS

We have developed a simple but effective classifier for the PASCAL challenge. We won the CMU mini-competition, and our results were better than those of the winner of the 2005 PASCAL Challenge. We were near the middle among the participants in the 2006 challenge. We attribute our success to noticing that no single feature can be very effective, and thus using multiple features: color, texture, and histograms of oriented gradients. Our contribution is also in developing a way to combine multiple features.

Our bag-of-words approach, however, possibly loses valuable information by not considering any spatial relation between patches. We have tried to obtain a rough estimate of the object center by having each patch vote in image space for the candidate object center, and to use that information by weighting the votes of patches that estimate the object center correctly. We were somewhat successful in estimating the object center (figure 4), but failed to use that information to improve the classification rate due to a lack of time. This work still has potential for improvement, such as better-engineered descriptors, more powerful weak classifiers, and the use of spatial information by estimating the location of the object through spatial voting. If we continue on this track, we may achieve a better standing in next year's challenge.
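For illustration, the center-voting idea could be prototyped with a simple accumulator (the offset-learning step and all names here are our assumptions; the paper gives no implementation details):

```python
import numpy as np

def estimate_object_center(patch_positions, center_offsets, image_shape):
    """Each patch votes at its position plus a (learned) offset toward the
    object center; the accumulator's peak is the center estimate.

    patch_positions: list of (row, col) patch centers in the image.
    center_offsets:  list of (drow, dcol) votes, e.g. learned from training
                     patches' positions relative to ground-truth boxes.
    """
    votes = np.zeros(image_shape)
    for (r, c), (dr, dc) in zip(patch_positions, center_offsets):
        vr, vc = int(round(r + dr)), int(round(c + dc))
        if 0 <= vr < image_shape[0] and 0 <= vc < image_shape[1]:
            votes[vr, vc] += 1.0
    # Peak of the vote map, cf. Fig. 4(b).
    return np.unravel_index(votes.argmax(), votes.shape)
```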

REFERENCES

[1] G. Csurka, C. Bray, C. Dance, and L. Fan, "Visual categorization with bags of keypoints," in ECCV Workshop on Statistical Learning in Computer Vision, 2004, pp. 1-22.
[2] R. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, December 1999.
[3] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. 8th Int'l Conf. Computer Vision, vol. 2, July 2001, pp. 416-423.
[4] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[5] A. Vezhnevets and V. Vezhnevets, "Modest AdaBoost: teaching AdaBoost to generalize better," in Graphicon, 2005.

Fig. 3. (a) 100 images with highest score; (b) 100 images with lowest score.

TABLE I
WEIGHTS ASSIGNED TO EACH WEAK CLASSIFIER BY ADABOOST
(two weights per class: classifier trained on the object bounding box / classifier trained on the entire image)

Class      Color            HOG            Texture
Bicycle     0.053 / -0.089  0.358 / 0.218  0.135 / 0.070
Bus        -0.013 / -0.081  0.493 / 0.368  0.224 / 0.175
Car         0.168 /  0.085  0.427 / 0.223  0.256 / 0.116
Cat        -0.005 / -0.021  0.369 / 0.257  0.238 / 0.182
Cow        -0.050 / -0.074  0.385 / 0.218  0.230 / 0.055
Dog        -0.054 / -0.046  0.344 / 0.219  0.220 / 0.142
Horse      -0.041 / -0.101  0.334 / 0.167  0.184 / 0.073
Motorbike  -0.070 / -0.103  0.346 / 0.193  0.161 / 0.095
Person      0.108 /  0.005  0.283 / 0.071  0.174 / 0.034
Sheep      -0.002 / -0.008  0.362 / 0.215  0.227 / 0.070

The per-class values for the remaining rows could not be realigned (column alignment lost in transcription); the recovered value runs are, in row order:
Color+HOG and Color+Texture (merged): 0.350 0.180 0.256 0.112 0.609 0.467 0.243 0.491 0.332 0.255 0.400 0.252 0.333 0.249 0.425 0.335 0.285 0.326 0.190 0.267 0.113 0.320 0.205 0.193 0.074 0.368 0.272 0.208 0.164 0.294 0.191 0.089 0.061 0.377 0.294 0.330 0.293
HOG+Texture: 0.359 0.237 0.496 0.377 0.450 0.229 0.370 0.262 0.388 0.357 0.227 0.328 0.175 0.356 0.201 0.296 0.074 0.363 0.221
Color+HOG+Texture: 0.360 0.246 0.618 0.437 0.500 0.283 0.392 0.311 0.468 0.342 0.331 0.257 0.309 0.164 0.231 0.313 0.115 0.323

TABLE II
CLASSIFICATION ACCURACY AS AUC VALUES

Class      Us (Validation)  Us (Test)  Best Test
Bicycle    0.853            0.845      0.948
Bus        0.900            0.916      0.984
Car        0.905            0.897      0.977
Cat        0.846            0.859      0.937
Cow        0.873            0.838      0.940
Dog        0.770            0.766      0.876
Horse      0.750            0.694      0.926
Motorbike  0.802            0.829      0.969
Person     0.656            0.622      0.863
Sheep      0.848            0.875      0.956
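For reference, the AUC values reported in Table II can be computed from per-image confidences via the rank-sum (Mann-Whitney) identity; a minimal sketch (ties ignored for brevity):

```python
import numpy as np

def auc(confidences, labels):
    """AUC as the probability that a randomly chosen positive image receives
    a higher confidence than a randomly chosen negative one.
    """
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels).astype(bool)
    # Rank all images by confidence (rank 1 = lowest confidence).
    ranks = np.empty(len(confidences))
    ranks[np.argsort(confidences)] = np.arange(1, len(confidences) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```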

Fig. 4. Spatial voting: (a) input image; (b) estimated object center.