OBJECT CATEGORIZATION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it Slides: Ing. Lamberto Ballan November 18th, 2009
What is an Object? Merriam-Webster definition: something material that may be perceived by the senses. You already know two tasks about objects: single object recognition (find that logo) and object categorization (find a face).
Why do we care about categorization? Perception of function: we can perceive the 3D shape, texture and material properties without knowing about objects. But the concept of category also encapsulates information about what we can do with those objects. "We therefore include the perception of function as a proper - indeed, crucial - subject for vision science" (from Vision Science, chapter 9, Palmer). ICCV09 Short Course: Fei-Fei, Torralba, Fergus
The perception of function. Direct perception (affordances): flat surface, horizontal, knee-high -> sittable upon. Affordance: a quality of an object that allows an action to be performed (Gibson 1977). Mediated perception (categorization): flat surface, horizontal, knee-high -> chair -> sittable upon. [Figure: example objects labeled "Chair", one of them labeled "Chair?".]
Direct perception. Some aspects of an object's function can be perceived directly. Functional form: some forms clearly indicate a function ("sittable-upon", container, cutting device, ...). [Figure: several objects labeled "sittable-upon" and one counterexample labeled "It does not seem easy to sit upon this".]
Text Indexing and Categorization. Text categorization: the task is to assign a document to one or more categories based on its content (is it something about medicine/biology? is it a document about business?). Why is it useful? Detecting and indexing similar texts/documents in large corpora; clustering documents by topic; extracting mid/high-level concepts from documents. The Bag of Words (BoW) model, combined with advanced classification techniques, achieves state-of-the-art results. A text - such as a sentence or a document - is represented as an unordered collection of words, disregarding grammar and even word order. Three elements: i) a vocabulary, ii) a histogram representation of documents, iii) a classification method.
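The histogram representation of a document can be illustrated with a minimal Python sketch. The toy vocabulary and the example sentence are assumptions made up for illustration, and whitespace tokenization stands in for a real text pipeline.

```python
from collections import Counter

def bow_histogram(text, vocabulary):
    """Count how often each vocabulary word occurs in the text, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary and document, for illustration only.
vocabulary = ["cell", "protein", "market", "stock"]
doc = "the protein binds the cell and the cell divides"
print(bow_histogram(doc, vocabulary))  # [2, 1, 0, 0]
```

Shuffling the words of the document leaves the histogram unchanged, which is exactly the "unordered collection" property of the model.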
Same approach usable with visual data. An image can be treated as a document, and features extracted from the image are considered as the "visual words". Bag of (visual) Words: an image is represented as an unordered collection of visual words. [Figure: images of object categories (D1: face, D2: bike, D3: violin), each mapped to a bag of visual words over the vocabulary (codewords).]
Why is it useful? Object recognition and categorization. Bags of Visual Words have been successfully used for object categorization in images (e.g. faces, cars, airplanes...) and, more recently, for action recognition in video sequences (e.g. running, walking, clapping...). Aim: find (annotate) objects in this photo (a very optimistic result...).
Three stages
1. Codebook (vocabulary) formation and feature assignment. Given a training set, local descriptors (e.g. SIFT) are collected and a clustering algorithm is used to quantize the feature space. Then each cluster's center is used as an iconic word and local descriptors are assigned to the nearest word using an appropriate distance (e.g. Euclidean); the result is a Bag-of-Words representation.
2. Train a classifier to discriminate vectors corresponding to positive and negative training images. Usually Support Vector Machines (SVM) are used as classifiers.
3. Apply the trained classifiers to the test image.
Note: the approach is the same as for text, but the first stage - codebook formation and feature assignment - is really challenging because visual words have to be defined in advance using a clustering algorithm (e.g. k-means).
Note: we have to train a classifier (detector) for each object class. [Figure: the three stages (1, 2, 3) applied to training images and test images. Courtesy A. Zisserman]
Feature detection. Given an image, feature detection is the process of extracting local patches (regions). There are several methods: random sampling; regular grid (dense sampling), in which the image is segmented by horizontal and vertical lines, and which shows very good results for natural scene categorization (Fei-Fei and Perona, CVPR 2005); interest points (sparse sampling), in which local patches are detected by interest point detectors that select salient regions (such as edges, corners, blobs). Several interest point techniques exist (Mikolajczyk et al., IJCV 2005): the Harris corner detector; Difference of Gaussian (DoG), which is the SIFT detector (Lowe, IJCV 2004); affine covariant patches.
Regular grid. It is probably the simplest method for feature detection: patches are taken on an evenly sampled grid with a given spacing (e.g. 10x10 pixels). Despite its simplicity, it provides good results for textures and natural scenes because it describes more regions than interest point techniques.
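As a rough illustration of dense sampling, the sketch below lists patch centers on a regular grid. The function name, the image size and the choice of placing centers half a step from the border are assumptions for this example, not part of the slides.

```python
def grid_centers(width, height, step=10):
    """Return (x, y) patch centers on a regular grid with the given spacing in pixels."""
    half = step // 2
    return [(x, y)
            for y in range(half, height - half + 1, step)
            for x in range(half, width - half + 1, step)]

# A hypothetical 40x30 image sampled every 10 pixels gives a 4x3 grid of patches.
centers = grid_centers(40, 30, step=10)
print(len(centers), centers[0], centers[-1])  # 12 (5, 5) (35, 25)
```

The number of patches depends only on the image size and the spacing, never on the image content, which is the main difference from interest point detectors.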
Interest points. Local patches are detected at the most salient regions (such as the regions attracting human attention). This uses more information about the image itself than random or grid sampling. [Figure: an example of local patches detected using affine covariant features.]
Comparison: dense (grid) sampling vs sparse (interest points) sampling. Dense sampling. Advantage: it is able to describe the global content of an image. Disadvantage: it uses little information about the image itself. It has been used successfully for texture and natural scene categorization. Sparse sampling. Advantage: it is able to detect ("select") salient regions, which tend to be the most attractive and informative ones. Disadvantage: depending on the interest point technique and the type/resolution of the image, sometimes only a few regions are detected. It has been used for specific object recognition and categorization (better for describing background/foreground).
Feature representation. Local features are represented by local descriptors. Several kinds of information can be used but, usually, edge or gradient orientation histograms are the most common choice. Common framework: divide the local region into spatial cells; calculate the orientation of the image gradient at each pixel; pool quantized orientations over each cell, so that i) the descriptor contains an orientation histogram for each cell and ii) votes are weighted by gradient magnitude. Note: this is the basis of the popular SIFT, HOG and (Generalized) Shape Context methods.
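The common framework can be sketched in a few lines of plain Python. This is a simplified illustration, not SIFT or HOG themselves: it assumes a square grey-scale patch given as a list of rows, uses central differences for the gradient, and omits the Gaussian weighting, interpolation and normalization that real descriptors apply.

```python
import math

def orientation_descriptor(patch, cells=4, bins=8):
    """Per-cell gradient orientation histograms, with votes weighted by gradient magnitude."""
    size = len(patch)            # assumes a square patch whose size is divisible by `cells`
    cell_size = size // cells
    hist = [0.0] * (cells * cells * bins)
    for y in range(1, size - 1):             # border pixels are skipped (no central difference)
        for x in range(1, size - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins   # quantized orientation
            cy, cx = y // cell_size, x // cell_size        # spatial cell of this pixel
            hist[(cy * cells + cx) * bins + b] += magnitude
    return hist

# A hypothetical 16x16 patch with a horizontal intensity ramp: every gradient points along +x.
ramp = [[float(x) for x in range(16)] for _ in range(16)]
descriptor = orientation_descriptor(ramp)
print(len(descriptor))  # 128
```

With 4x4 cells and 8 orientation bins the descriptor has 4x4x8 = 128 elements, and for the ramp patch all the mass falls in the first orientation bin of each cell.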
SIFT descriptors. The most common choice is the SIFT descriptor (because it exhibits the highest matching accuracy). Standard SIFT is computed as a set of orientation histograms on 4x4 pixel neighborhoods (the contribution of each pixel is weighted by the gradient magnitude and by a Gaussian with sigma equal to one half the width of the descriptor window). Each histogram contains 8 bins (corresponding to 8 orientations); each descriptor contains a 4x4 array of histograms (16 in total) around the keypoint; this leads to a SIFT descriptor with 4x4x8 = 128 elements. [Fig.: standard SIFT descriptor]
Combined feature descriptors. Local descriptors (like SIFT) are usually based only on luminance and shape, so they use grey-scale values and ignore color: it is very difficult to select a color model that is sufficiently robust and general; nevertheless, color is very important to describe/distinguish objects or scenes. Different types of descriptor can be combined to improve the representation; the most common combination is between a local shape descriptor (e.g. SIFT) and a color descriptor (e.g. a color histogram in a suitable color space like Luv or HSV). Figure: an example of a color-SIFT descriptor (van de Weijer and Schmid, ECCV 2006). The combined descriptor is obtained by fusion of standard SIFT and a Hue descriptor calculated in a Color Invariant Space. Courtesy J. van de Weijer
Codebook formation. The Bag-of-Words model is built through the creation of a discrete visual vocabulary (codebook). A vocabulary in the object/scene classification domain is commonly obtained by following one of two approaches: the annotation approach or the data-driven approach. Annotation approach: a vocabulary is obtained by assigning meaningful labels to image patches (e.g. sky, water, vegetation, etc.). Data-driven approach: a vector quantization of large sets of feature vectors (usually in a high-dimensional space) is required; this is performed by clustering the feature vectors.
Data-driven approach Visual words are defined by clustering of feature vectors. An example:
The performance of this approach depends on the quantization method and on the number of words that are selected. The most common quantization approach is k-means clustering: the main reasons are its simplicity and convergence speed. [Figure: examples of visual words. Courtesy A. Zisserman]
k-means clustering. It is an algorithm to cluster n objects, based on their feature-vector representation, into k < n partitions. Its objective is to minimize the global intra-cluster variance, i.e. the squared error function
V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2
where k is the number of clusters, S_i (i = 1, ..., k) are the cluster partitions, and \mu_i is the centroid (or mean point) of all the points x_j \in S_i. The most common form of k-means is Lloyd's algorithm: the two names are often used synonymously, but in reality Lloyd's method is a heuristic for solving the k-means problem; other variations exist, but it has remained popular because it converges extremely rapidly in practice.
k-means clustering: Lloyd's algorithm. Lloyd's algorithm is a heuristic iterative solution for the k-means problem:
1. It starts by partitioning the n input points into k initial sets, either at random or using some heuristic.
2. It then calculates the centroid \mu_i of each set S_i (with i = 1, ..., k).
3. It constructs a new partition by associating each point with the closest centroid.
4. Finally, the centroids are recalculated for the new clusters, and the algorithm alternates these two steps until convergence (which is reached when i) the points no longer switch cluster or ii) the centroids no longer change).
[Figure: 1,2) initial random centroids; 3) new partition obtained by associating points with the nearest centroid; 4) centroids are moved to the center of their clusters; convergence.]
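The iteration can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: points are tuples of floats, the initial centroids are k randomly chosen input points (a common alternative to starting from a random partition), and an empty cluster simply keeps its previous centroid.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Alternate the assignment and centroid-update steps until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # assumption: start from k input points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [min(range(k), key=lambda i: squared_distance(p, centroids[i]))
                      for p in points]
        if new_labels == labels:            # convergence: no point switched cluster
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:                     # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Two hypothetical, well-separated groups of 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = lloyd_kmeans(points, k=2)
print(labels)
```

For codebook formation the points would be local descriptors (e.g. 128-dimensional SIFT vectors) and the returned centroids would be the visual words.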
k-means disadvantages. Despite its popularity, k-means clustering is not the optimal solution for codebook formation. It has some main disadvantages: 1. the number of visual words has to be known in advance; 2. the clustering is not very robust w.r.t. outliers; 3. cluster centers are attracted by the denser regions of the sample distribution, thus providing a less precise quantization for the vectors lying in these regions.
This effect (3), due to the assumption of a uniform distribution of the features in the descriptor space, is even more pronounced in high-dimensional spaces. It can be visualized with a Voronoi tessellation of the feature space. [Figure: k-means (Voronoi tessellation), with a detail of a dense region that has been split into 4 clusters.] Note: Voronoi cells do not uniformly cover the feature space...
Radius-based clustering. Given n vectors, the algorithm starts with a uniform random subsampling s of the original dataset (thus s ⊂ n). For each x_i ∈ s (grey circles in fig.), a mean-shift procedure is initialized; mean-shift is a procedure for locating the modes of a sample distribution (in other words, it is able to find the densest regions of the distribution). Given a radius R, mean-shift clustering on s is used to find the modes. A new cluster center is then allocated on the mode corresponding to the maximal density region. All vectors of the original set within a distance < R from the center are labeled as members of this cluster and eliminated from the following iterations; this prevents the algorithm from repeatedly assigning centers to the same high-density region. It can be stopped when a sufficient number of clusters (words) has been identified.
In this way, cluster centers are allocated more uniformly. This effect can be visualized with a Voronoi tessellation of the feature space, compared with that of k-means. [Figure: k-means clustering (Voronoi tessellation) vs radius-based clustering (Voronoi tessellation).] Note: the dense region that was split into 4 clusters by k-means is now correctly coded by radius-based clustering.
Meanshift. The mean-shift estimator finds the modes of a distribution non-parametrically.
Radius-based clustering:
1. Subsample the data.
2. Run a mean-shift estimator from each sampled point.
3. Find the densest mode M.
4. Assign each point within distance r of M to the new cluster.
5. Remove the assigned points.
6. Stop if enough clusters have been found or no more clusters can be found.
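The six steps can be put together in a short Python sketch. It is only an illustration under several assumptions: mean-shift uses a flat kernel of the same radius, density is measured by counting the remaining points within the radius of a mode, and the function names, sample size and toy data are invented for the example (Python 3.8+ for math.dist).

```python
import math
import random

def mean_shift_mode(start, points, radius, max_iter=50):
    """Shift `start` to the mean of the points within `radius` until it stops moving."""
    mode = start
    for _ in range(max_iter):
        neighbors = [p for p in points if math.dist(p, mode) < radius]
        if not neighbors:
            break
        new_mode = tuple(sum(c) / len(neighbors) for c in zip(*neighbors))
        if math.dist(new_mode, mode) < 1e-9:
            break
        mode = new_mode
    return mode

def radius_based_clustering(points, radius, max_clusters, sample_size=50, seed=0):
    rng = random.Random(seed)
    remaining = list(points)
    centers = []
    while remaining and len(centers) < max_clusters:
        # 1-2. Subsample and run mean-shift from every sampled point.
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        modes = [mean_shift_mode(p, sample, radius) for p in sample]
        # 3. Keep the densest mode (most remaining points within the radius).
        best = max(modes, key=lambda m: sum(math.dist(p, m) < radius for p in remaining))
        centers.append(best)
        # 4-5. Points within the radius belong to this cluster and are removed.
        remaining = [p for p in remaining if math.dist(p, best) >= radius]
    return centers

# Hypothetical data: a dense group of 30 points and a sparse group of 5 points.
dense = [(0.01 * i, 0.0) for i in range(30)]
sparse = [(10.0 + 0.01 * i, 10.0) for i in range(5)]
centers = radius_based_clustering(dense + sparse, radius=1.0, max_clusters=5)
print(centers)
```

In this toy run the dense group receives exactly one center and the sparse group also gets its own, instead of the dense group being split into several clusters.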
Feature assignment. Represent an image as a histogram of visual word frequencies. Given the codebook generated in the training stage, each region extracted from the test image has to be assigned to the corresponding visual word: usually regions are represented by SIFT descriptors, and usually these features are hard-assigned to the nearest word (in terms of Euclidean distance). [Figure: feature detection, feature representation, feature (hard) assignment, BoW model: histogram of visual words. Courtesy A. Zisserman]
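Hard assignment reduces to a nearest-neighbor search followed by counting. The sketch below assumes descriptors and codewords are tuples of floats; the two-word codebook and the three descriptors are made-up toy values.

```python
def hard_assignment_histogram(descriptors, codebook):
    """Count, for each codeword, how many descriptors have it as nearest word (Euclidean)."""
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(d, codebook[i])))
        hist[nearest] += 1
    return hist

# Hypothetical 2D codebook and descriptors (real SIFT descriptors have 128 dimensions).
codebook = [(0.0, 0.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (0.2, 0.1)]
print(hard_assignment_histogram(descriptors, codebook))  # [2, 1]
```

Comparing squared distances is enough here because the square root does not change which codeword is nearest.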
Drawbacks: hard assignment, which takes into account only the closest codeword, fails to consider two issues. Codeword uncertainty: the problem of selecting the correct codeword when two or more candidates are relevant. Codeword plausibility: the problem of selecting the correct codeword when all codewords are too far away and none is representative. Figure (feature assignment): the small blue dots are image features; the labeled red circles are codewords; the yellow triangle represents an image feature that is correctly assigned to codeword b; the green square is an example of the codeword uncertainty problem; the light-blue diamond is an example of the codeword plausibility problem.
A possible solution: a soft-assignment mechanism that is able to consider the information of two (or more) relevant candidates. Recently, solutions based on kernel density estimation have been applied to feature assignment in codebook models (van Gemert et al., ECCV 2008; Philbin et al., CVPR 2008). In this way, the word frequency histogram is calculated by smoothing the hard assignment of features to the codeword vocabulary. [Figure: hard assignment vs soft assignment (Gaussian kernel).]
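One way to smooth the assignment is to let every feature vote for all codewords with Gaussian weights. The sketch below is a hedged illustration of that idea, not the exact formulation of the cited papers: the per-feature normalization, the kernel width sigma and the toy data are assumptions.

```python
import math

def soft_assignment_histogram(descriptors, codebook, sigma=1.0):
    """Each descriptor spreads one unit of mass over the codewords with Gaussian weights."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(d, w)) / (2 * sigma ** 2))
                   for w in codebook]
        total = sum(weights)
        if total == 0.0:          # every weight underflowed: the feature is far from all words
            continue
        for i, weight in enumerate(weights):
            hist[i] += weight / total
    return hist

# A hypothetical feature lying exactly between two codewords contributes half to each.
codebook = [(0.0, 0.0), (2.0, 0.0)]
print(soft_assignment_histogram([(1.0, 0.0)], codebook))  # [0.5, 0.5]
```

Normalizing per feature addresses codeword uncertainty; dropping the normalization would instead let features far from every codeword contribute very little, which relates to the plausibility issue.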
Word frequency distributions. In text classification, the problem of selecting a good vocabulary is only related to feature selection and vocabulary size. Feature selection: pick only those terms that are really discriminative (e.g. using Mutual Information or Chi-Square statistics); stop-word removal (the most frequent words like "the", "of", "an", etc.); stemming (the process of reducing inflected/derived words to their stem, base or root form). Given a natural-language textual corpus, the word frequency distribution follows the well-known Zipf's law. Zipf's law states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table (an ideal Zipf distribution is a straight line in log-log scale). [Figure: rank-frequency plot in log-log scale, with the "stop words" and "most useful words" regions marked.]
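Whether a corpus follows Zipf's law can be checked by fitting a line to log(frequency) against log(rank): an ideal Zipf distribution gives a slope of -1. The sketch below is a minimal illustration with made-up counts; the same check applies to visual word histograms.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Word frequencies sorted from the most to the least frequent word."""
    return sorted(Counter(tokens).values(), reverse=True)

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank); needs at least 2 words."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical ideal Zipf counts: frequency = 1200 / rank for ranks 1..6.
ideal = [1200, 600, 400, 300, 240, 200]
print(round(loglog_slope(ideal), 6))  # -1.0
print(rank_frequency("a b a c a b".split()))  # [3, 2, 1]
```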
Visual words statistics. Zipf's law is one of the basic assumptions in text categorization: according to this empirical evidence, we can consider words at intermediate frequencies as the most informative for classification. Therefore, it is interesting to see how visual words are distributed in a visual corpus; in particular, we want to know whether their distribution satisfies Zipf's law and how their statistics are related to i) feature detection and ii) quantization techniques (see references [2,3] for more details).
An example: the distribution of visual words frequency using k-means and radius-based quantization Note: results are related to action categorization on KTH dataset
Vocabulary size. Unlike the vocabulary of a text corpus, whose size is relatively fixed, the size of a visual-word vocabulary is controlled by the number of clusters. Choosing the right vocabulary size involves a trade-off between discriminativity and generalizability: with small vocabularies, the visual word is not very discriminative because dissimilar features can map to the same codeword; as the vocabulary size increases, the feature becomes more discriminative but less generalizable (similar features can map to different codewords). There is no consensus as to the appropriate size of a visual vocabulary: it can vary from several hundred to thousands and tens of thousands; however, it is closely related to the dataset (e.g. image resolution) and the feature detection process; usually the optimal size is fixed by experiments (see reference [3] for more details).
Usually in text categorization the vocabulary size is reduced, keeping only the most informative terms, using feature selection methods. Among several methods, the best results are obtained using the Chi-Square statistic (CHImax) and Information Gain (IG). In a text corpus, a good feature selection method is able to improve classification performance by reducing the vocabulary size... is it the same for a visual corpus? [Figure: text categorization vs object categorization.]
Classification. Many different approaches exist; state-of-the-art results using BoW models are obtained by Support Vector Machine (SVM) classifiers. An SVM classifier constructs a separating hyperplane in the feature space, one which maximizes the margin between the two classes. SVM is a binary classifier but, usually, in the visual domain it is extended to multi-class problems; the original algorithm can be adapted to non-linear classification problems using the kernel trick. [Figure: separating hyperplane with its margin and support vectors, shown together with the optimization problem.] C is a regularization parameter weighting the error term, usually fixed by cross-validation on the training set.
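For reference, the optimization problem of an SVM is commonly written in the soft-margin form below. The symbols are assumptions, since the slide's own notation is not available: w and b define the hyperplane, x_i are the n training vectors with labels y_i in {-1, +1}, \xi_i are slack variables, and C is the regularization parameter mentioned above.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,\qquad i=1,\dots,n
```

Minimizing \lVert w\rVert maximizes the margin 2/\lVert w\rVert, while the sum of slacks penalizes training points that violate it; a larger C tolerates fewer violations.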
Non-linear classification: image features are described by high-dimensional feature vectors; therefore, data are usually (always) not linearly separable... For linearly separable data a linear kernel is sufficient; for not linearly separable data use a non-linear kernel (e.g. RBF). Optimal kernel choice: state-of-the-art results are obtained with a Gaussian kernel that uses the Chi-square distance between histograms.
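A Gaussian kernel built on the Chi-square distance can be sketched as below. The exact normalization is an assumption: the 1/2 factor in the distance and the scale parameter gamma vary between papers, and empty bins are skipped to avoid division by zero.

```python
import math

def chi_square_distance(h1, h2):
    """0.5 * sum_i (h1_i - h2_i)^2 / (h1_i + h2_i), skipping bins that are empty in both."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def chi_square_kernel(h1, h2, gamma=1.0):
    """K(h1, h2) = exp(-gamma * chi2(h1, h2)); gamma is a free scale parameter."""
    return math.exp(-gamma * chi_square_distance(h1, h2))

# Hypothetical normalized BoW histograms over a two-word vocabulary.
print(chi_square_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
print(chi_square_kernel([0.5, 0.5], [0.5, 0.5]))    # 1.0
```

The kernel equals 1 for identical histograms and decays towards 0 as the histograms diverge; gamma is often set from the mean Chi-square distance between training histograms or tuned by cross-validation.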
A classification example (on two different datasets): results are closely related to the vocabulary size; for big vocabularies linear kernels are the best choice (also, obviously, for computational cost).
Weakness of the BoW model. No rigorous geometric information about the object's components: it is intuitive that objects are made of parts... and relations between parts are really informative! [Figure: an example (face detection). Note: all the arrangements shown have equal probability for BoW models.] Not extensively tested yet for viewpoint and scale invariance. Segmentation and localization are unclear. These methods can suffer from poor recall.
Applications. Nowadays, BoW models have been successfully applied to the visual domain by several research groups; it is probably the most popular approach for large-scale categorization problems: object, scene and texture categorization in images; event and action categorization in videos.
References
[1] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In: Proc. of ICCV, 2003.
[2] F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. In: Proc. of ICCV, 2005.
[3] J. Yang, Y.-G. Jiang, A. G. Hauptmann and C.-W. Ngo. Evaluating bag-of-visual-words representations in scene classification. In: Proc. of MIR, 2007.
[4] L. Fei-Fei, R. Fergus and A. Torralba. Recognizing and learning object categories. CVPR 2007 short course (Slides, Matlab code, Datasets). URL: http://people.csail.mit.edu/torralba/shortcourserloc/index.html