OBJECT CATEGORIZATION

OBJECT CATEGORIZATION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it Slides: Ing. Lamberto Ballan November 18th, 2009

What is an Object? Merriam-Webster definition: something material that may be perceived by the senses. You already know two tasks about objects. Single object recognition: find that logo. Object categorization: find a face.

Why do we care about categorization? Perception of function: we can perceive the 3D shape, texture, and material properties without knowing about objects. But the concept of category also encapsulates information about what we can do with those objects. "We therefore include the perception of function as a proper - indeed, crucial - subject for vision science" (from Vision Science, chapter 9, Palmer). ICCV09 Short Course: Fei-Fei, Torralba, Fergus

The perception of function. Direct perception (affordances): flat surface, horizontal, knee-high -> sittable upon. Affordance: a quality of an object that allows an action to be performed (Gibson 1977). Mediated perception (categorization): flat surface, horizontal, knee-high -> chair -> sittable upon. [Figure: example objects labeled "Chair", "Chair", "Chair", "Chair?".]

Direct perception. Some aspects of an object's function can be perceived directly. Functional form: some forms clearly indicate a function ("sittable-upon", container, cutting device, ...). [Figure: example objects labeled "Sittable-upon"; one is captioned "It does not seem easy to sit upon this".]

Text Indexing and Categorization. Text categorization: the task is to assign a document to one or more categories based on its content (is it something about medicine/biology? is it a document about business?). Why is it useful? Detecting and indexing similar texts/documents in large corpora; clustering documents by topic; extracting mid/high-level concepts from documents. The Bag of Words (BoW) model, combined with advanced classification techniques, achieves state-of-the-art results. A text, such as a sentence or a document, is represented as an unordered collection of words, disregarding grammar and even word order. Three elements: i) a vocabulary, ii) a histogram representation of documents, iii) a classification method.
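The first two elements (vocabulary and histogram representation) can be illustrated with a minimal sketch. The tokenizer (lowercase alphabetic runs) and the helper names `build_vocabulary` and `bow_histogram` are assumptions made for this example, not part of the original slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lowercase alphabetic tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Collect the sorted set of distinct tokens that occur in the corpus."""
    words = set()
    for doc in documents:
        words.update(tokenize(doc))
    return sorted(words)

def bow_histogram(document, vocabulary):
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(tokenize(document))
    return [counts[w] for w in vocabulary]

docs = ["the patient took the medicine", "the market closed higher"]
vocab = build_vocabulary(docs)
hist = bow_histogram("medicine for the patient, the medicine works", vocab)
```

Words outside the vocabulary ("for", "works") are simply dropped, and two texts that contain the same words in a different order get the same histogram. The third element, the classifier, would then be trained on such histograms.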

Same approach usable with visual data. An image can be treated as a document, and features extracted from the image are considered the "visual words". Bag of (visual) Words: an image is represented as an unordered collection of visual words. [Figure: images of object categories (D1: face, D2: bike, D3: violin), each mapped to a bag of visual words drawn from the vocabulary (codewords).]

Why is it useful? Object recognition and categorization. Bag of Visual Words has been successfully used for object categorization in images (e.g. faces, cars, airplanes...) and, more recently, for action recognition in video sequences (e.g. running, walking, clapping...). Aim: find (annotate) objects in this photo. [Figure: annotated photo (a very optimistic result...).]

Three stages
1. Codebook (vocabulary) formation and feature assignment. Given a training set, local descriptors (e.g. SIFT) are collected and a clustering algorithm is used to quantize the feature space. Each cluster's center is then used as an iconic word, and local descriptors are assigned to the nearest word using an appropriate distance (e.g. Euclidean); the result is a Bag-of-Words representation.
2. Train a classifier to discriminate vectors corresponding to positive and negative training images. Usually Support Vector Machines (SVM) are used as classifiers.
3. Apply the trained classifiers to the test image.
Note: the approach is the same as for text, but the first stage (codebook formation and feature assignment) is really challenging because visual words have to be defined in advance using a clustering algorithm (e.g. k-means).

Note: we have to train a classifier (detector) for each object class. [Figure: training images and test images, annotated with stages 1, 2, 3. Courtesy A. Zisserman]

Feature detection. Given an image, feature detection is the process of extracting local patches (regions). There are several methods. Random sampling. Regular grid (dense sampling): the image is segmented by horizontal and vertical lines; it shows very good results for natural scene categorization (Fei-Fei and Perona, CVPR 2005). Interest points (sparse sampling): local patches are detected by interest point detectors that select salient regions (such as edges, corners, blobs). Several different techniques exist (Mikolajczyk et al., IJCV 2005): the Harris corner detector; Difference of Gaussians (DoG), which is the SIFT detector (Lowe, IJCV 2004); affine covariant patches.

Regular grid. It is probably the simplest method for feature detection: an evenly sampled grid, spaced at given values (e.g. 10x10 pixels), is laid over the image. Despite its simplicity, it provides good results for textures and natural scenes because it describes more regions than interest point techniques.
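Dense sampling can be sketched in a few lines. The 10-pixel spacing follows the example above; the 16-pixel patch size and the function name `grid_patches` are assumptions chosen only for illustration.

```python
def grid_patches(height, width, patch_size=16, step=10):
    """Return the (row, col) top-left corners of square patches on a regular grid.

    Only patches that fit entirely inside the image are kept.
    """
    return [(r, c)
            for r in range(0, height - patch_size + 1, step)
            for c in range(0, width - patch_size + 1, step)]

# Example: a 48x64 image gives 4 rows x 5 columns of patch positions.
corners = grid_patches(height=48, width=64)
```

The number of patches depends only on the image size and the grid parameters, never on the image content, which is the defining difference from interest point detection.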

Interest points. Local patches are detected at the most salient regions (such as the regions attracting human attention). This approach uses more information about the image itself than random or grid sampling. [Figure: an example of local patches detected using affine covariant features.]

Comparison: dense (grid) sampling vs sparse (interest points) sampling. Dense sampling. Advantage: it is able to describe the global content of an image. Disadvantage: it uses little information about the image itself. It has been used successfully for texture and natural scene categorization. Sparse sampling. Advantage: it is able to detect ("select") salient regions, which correspond to the most attractive and informative parts of the image. Disadvantage: depending on the interest point technique and the type/resolution of the image, sometimes only a few regions are detected. It has been used for specific object recognition and categorization (better for describing background/foreground).

Feature representation. Local features are represented by local descriptors. Several kinds of information can be used but, usually, edge or gradient orientation histograms are the most common choice. Common framework: divide the local region into spatial cells; calculate the orientation of the image gradient at each pixel; pool the quantized orientations over each cell, so that i) the descriptor contains an orientation histogram for each cell and ii) votes are weighted by gradient magnitude. Note: this is the basis of the popular SIFT, HOG, and (Generalized) Shape Context methods.
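The common framework can be sketched with NumPy as follows. This is a simplified illustration, not an implementation of SIFT or HOG: it omits Gaussian weighting, interpolation between bins, and normalisation, and the function name and default parameters (4x4 cells, 8 orientation bins) are assumptions.

```python
import numpy as np

def orientation_histograms(patch, cells=4, bins=8):
    """Pool gradient orientations of a square grey-level patch into per-cell histograms."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)                  # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)     # orientation in [0, 2*pi)
    bin_idx = np.minimum((angle / (2 * np.pi) * bins).astype(int), bins - 1)

    cell = patch.shape[0] // cells               # cell side length in pixels
    desc = np.zeros((cells, cells, bins))
    for i in range(cells):
        for j in range(cells):
            rows = slice(i * cell, (i + 1) * cell)
            cols = slice(j * cell, (j + 1) * cell)
            # Each pixel votes for its orientation bin, weighted by gradient magnitude.
            desc[i, j] = np.bincount(bin_idx[rows, cols].ravel(),
                                     weights=magnitude[rows, cols].ravel(),
                                     minlength=bins)
    return desc.ravel()

# A 16x16 intensity ramp along the columns: every gradient points in the same direction.
patch = np.tile(np.arange(16.0), (16, 1))
descriptor = orientation_histograms(patch)       # 4 * 4 * 8 = 128 values
```

For the ramp, all votes fall into the first orientation bin of every cell, which makes the pooling step easy to check by hand.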

SIFT descriptors. The most common choice is the SIFT descriptor (because it exhibits the highest matching accuracies). Standard SIFT is computed as a set of orientation histograms on 4x4 pixel neighborhoods. The contribution of each pixel is weighted by the gradient magnitude and by a Gaussian window; in Lowe's formulation the descriptor window uses a sigma of half the window width, while the Gaussian with sigma equal to 1.5 times the scale of the keypoint belongs to the orientation assignment step. Each histogram contains 8 bins (corresponding to 8 orientations); each descriptor contains a 4x4 array of histograms (16 in total) around the keypoint; this leads to a SIFT descriptor with 4x4x8 = 128 elements. [Figure: standard SIFT descriptor.]

Combined feature descriptors. Local descriptors (like SIFT) are usually based only on luminance and shape, so they use grey-scale values and ignore color: it is very difficult to select a color model that is sufficiently robust and general. Nevertheless, color is very important to describe/distinguish objects or scenes. Different types of descriptors can be combined to improve the representation; the most common combination is a local shape descriptor (e.g. SIFT) with a color descriptor (e.g. a color histogram in a suitable color space like Luv or HSV). Figure: an example of a color-SIFT descriptor (van de Weijer and Schmid, ECCV 2006). The combined descriptor is obtained by fusing standard SIFT with a Hue descriptor calculated in a color-invariant space. Courtesy J. van de Weijer

Codebook formation. The Bag-of-Words model is built through the creation of a discrete visual vocabulary (codebook). A vocabulary in the object/scene classification domain is commonly obtained by following one of two approaches. Annotation approach: a vocabulary is obtained by assigning meaningful labels to image patches (e.g. sky, water, vegetation, etc.). Data-driven approach: a vector quantization is performed on large sets of feature vectors (usually in a high-dimensional space) by clustering the feature vectors.

Data-driven approach. Visual words are defined by clustering feature vectors. [Figure: an example.]

The performance of this approach depends on the quantization method and on the number of words selected. The most common quantization approach is k-means clustering; the main reasons are its simplicity and convergence speed. [Figure: examples of visual words. Courtesy A. Zisserman]

k-means clustering. It is an algorithm to cluster n objects, based on their feature-vector representation, into k < n partitions. The objective is to minimize the global intra-cluster variance, i.e. the squared error function

$$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where k is the number of clusters, S_i (i = 1, ..., k) are the cluster partitions, and μ_i is the centroid (or mean point) of all the points x_j ∈ S_i. The most common form of k-means is Lloyd's algorithm. The two names are often used synonymously, but in reality Lloyd's method is a heuristic for solving the k-means problem. Other variations exist, but it has remained popular because it converges extremely rapidly in practice.

k-means clustering: Lloyd's algorithm. Lloyd's algorithm is a heuristic iterative solution for the k-means problem.
1. It starts by partitioning the n input points into k initial sets, either at random or using some heuristic.
2. It then calculates the centroid μ_i of each set S_i (with i = 1, ..., k).
3. It constructs a new partition by associating each point with the closest centroid.
4. Finally, the centroids are recalculated for the new clusters, and the algorithm is repeated by alternating these two steps until convergence (which is reached when i) the points no longer switch cluster or ii) the centroids no longer change).
[Figure: 1, 2) initial random centroids; 3) new partition obtained by associating points with the nearest centroid; 4) centroids are moved to the center of their clusters; convergence.]
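The four steps can be sketched with NumPy as below. The initialisation (k distinct input points chosen at random as centroids) and the handling of empty clusters are assumptions of this sketch; the slides allow any random or heuristic initial partition.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's heuristic for k-means; returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Steps 1-2 (assumed variant): use k distinct input points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Step 3: associate each point with the closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                # no point switched cluster: converged
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = points[labels == i]
            if len(members):                     # an empty cluster keeps its old centroid
                centroids[i] = members.mean(axis=0)
    return centroids, labels

# Two well-separated groups of three points each.
points = [[0, 0], [1, 0.5], [2, 0], [10, 0.5], [11, 0], [12, 0.5]]
centroids, labels = lloyd_kmeans(points, k=2)
```

For codebook formation, `points` would be the local descriptors collected from the training set and the returned centroids would serve as the visual words.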

k-means disadvantages. Despite its popularity, the use of k-means clustering for codebook formation is not the optimal solution. It has some main disadvantages: 1. the number of visual words has to be known in advance; 2. the clustering is not very robust w.r.t. outliers; 3. cluster centers are attracted by the denser regions of the sample distribution, thus providing a more imprecise quantization for the vectors lying in these regions.

This effect (3), due to the assumption of a uniform distribution of the features in the descriptor space, is even more pronounced in high-dimensional spaces. A representation of this effect can be obtained by visualizing a Voronoi tessellation of the feature space. [Figure: k-means (Voronoi tessellation), with a detail of a dense region that has been split into 4 clusters.] Note: Voronoi cells do not uniformly cover the feature space.

Radius-based clustering. Given n vectors, the algorithm starts with a uniform random subsample s of the original dataset (thus s is a subset of the n vectors). For each x_i ∈ s (grey circles in the figure), a mean-shift procedure is initialized; mean-shift is a procedure for locating the modes of a sample distribution (in other words, it finds the densest regions of the distribution). Given a radius R, mean-shift clustering on s is used to find the modes. A new cluster center is then allocated at the mode corresponding to the maximal-density region. All vectors of the original set within a distance < R from the center are labeled as members of this cluster and eliminated from the following iterations; this prevents the algorithm from repeatedly assigning centers to the same high-density region. The procedure can be stopped when a sufficient number of clusters (words) has been identified.

In this way, cluster centers are allocated more uniformly. A representation of this effect can be obtained by visualizing a Voronoi tessellation of the feature space and comparing it with k-means. [Figure: k-means clustering (Voronoi tessellation) vs. radius-based clustering (Voronoi tessellation).] Note: the dense region that was split into 4 clusters by k-means is now correctly coded by radius-based clustering.

Mean-shift. The mean-shift estimator finds the modes of a distribution non-parametrically.
Radius-based clustering:
1. Subsample the data.
2. Run a mean-shift estimator from each point of the subsample.
3. The densest mode M is found.
4. Each point within distance R of M is assigned to the new cluster.
5. Assigned points are removed.
6. Stop if enough clusters have been found or no more clusters can be found.
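The six steps can be sketched as follows. This is a simplified illustration under stated assumptions: a flat (uniform) mean-shift kernel, density measured as the number of subsample points inside the radius, and hypothetical names and defaults (`mean_shift_mode`, `radius_based_codebook`, `n_sub`, `max_words`). It is not the reference implementation of the method.

```python
import numpy as np

def mean_shift_mode(start, data, radius, n_iter=30):
    """Flat-kernel mean shift: repeatedly move to the mean of the points within `radius`."""
    x = np.array(start, dtype=float)
    for _ in range(n_iter):
        near = data[np.linalg.norm(data - x, axis=1) < radius]
        if len(near) == 0:
            break
        new_x = near.mean(axis=0)
        if np.allclose(new_x, x):
            break
        x = new_x
    return x

def radius_based_codebook(data, radius, max_words, n_sub=50, seed=0):
    """Greedy codebook construction following the six steps listed above."""
    rng = np.random.default_rng(seed)
    remaining = np.asarray(data, dtype=float)
    centers = []
    # 6. Stop when enough words have been found or no points are left.
    while len(centers) < max_words and len(remaining) > 0:
        # 1. Uniform random subsample of the points that are still unassigned.
        idx = rng.choice(len(remaining), size=min(n_sub, len(remaining)), replace=False)
        sub = remaining[idx]
        # 2. Run mean shift from every subsample point to locate candidate modes.
        modes = [mean_shift_mode(p, sub, radius) for p in sub]
        # 3. Keep the densest mode (most subsample points inside the radius).
        counts = [np.sum(np.linalg.norm(sub - m, axis=1) < radius) for m in modes]
        best = modes[int(np.argmax(counts))]
        centers.append(best)
        # 4-5. Points within the radius are covered by this word and removed.
        remaining = remaining[np.linalg.norm(remaining - best, axis=1) >= radius]
    return np.array(centers)

# A dense 5x5 block of points near the origin and a sparse group of three far away.
dense = [(0.2 * i, 0.2 * j) for i in range(5) for j in range(5)]
sparse = [(10.0, 10.0), (10.2, 10.0), (10.0, 10.2)]
centers = radius_based_codebook(dense + sparse, radius=2.0, max_words=5, n_sub=10)
```

In this toy example the dense block receives a single center and is then removed, so the second center goes to the sparse group instead of splitting the dense region again.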

Feature assignment. Represent an image as a histogram of visual word frequencies. Given the codebook generated in the training stage, each region extracted from the test image has to be assigned to the corresponding visual word: usually regions are represented by SIFT descriptors, and usually these features are hard-assigned to the nearest word (in terms of Euclidean distance). [Figure: pipeline of feature detection, feature representation, feature (hard) assignment, and the resulting BoW model (histogram of visual words). Courtesy A. Zisserman]
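Hard assignment amounts to a nearest-neighbour search followed by counting. In the sketch below the function name and the final normalisation to unit sum are assumptions; the toy 2-D "descriptors" stand in for real 128-dimensional SIFT vectors.

```python
import numpy as np

def visual_word_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and count occurrences."""
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Euclidean distance from every descriptor to every codeword.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    # Assumed normalisation: makes images with different numbers of regions comparable.
    return hist / hist.sum()

codebook = [[0, 0], [10, 0], [0, 10]]
descriptors = [[1, 0], [0, 1], [9, 1], [1, 9]]
hist = visual_word_histogram(descriptors, codebook)
```

Two descriptors fall on the first codeword and one on each of the others, so the histogram is [0.5, 0.25, 0.25].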

Feature assignment. Drawbacks: this hard assignment, which takes into account only the closest codeword, fails to consider two issues. Codeword uncertainty: the problem of selecting the correct codeword when two or more candidates are relevant. Codeword plausibility: the problem of selecting the correct codeword when all codewords are too far away and none is representative. Figure: the small blue dots are image features; the labeled red circles are codewords; the yellow triangle represents an image feature that is correctly assigned to codeword b; the green square is an example of the codeword uncertainty problem; the light-blue diamond is an example of the codeword plausibility problem.

A possible solution: a soft-assignment mechanism that is able to consider the information of two (or more) relevant candidates. Recently, solutions based on kernel density estimation have been applied to feature assignment in codebook models (van Gemert et al., ECCV 2008; Philbin et al., CVPR 2008). In this way, the word frequency histogram is calculated by smoothing the hard assignment of features to the codeword vocabulary. [Figure: hard assignment vs. soft assignment (Gaussian kernel).]
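One way to realise such smoothing is to let every descriptor spread one unit of mass over all codewords with Gaussian weights. The sketch below is a simplified variant written for illustration; the per-descriptor normalisation, the parameter `sigma`, and the function name are assumptions and do not reproduce any specific published formulation exactly.

```python
import numpy as np

def soft_histogram(descriptors, codebook, sigma=1.0):
    """Soft-assign descriptors to codewords with a Gaussian kernel."""
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    sq = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the row minimum avoids underflow for descriptors far from all codewords.
    weights = np.exp(-(sq - sq.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)   # each descriptor distributes one unit
    return weights.mean(axis=0)

codebook = [[0.0, 0.0], [2.0, 0.0]]
ambiguous = soft_histogram([[1.0, 0.0]], codebook)  # equidistant from both codewords
clear = soft_histogram([[0.0, 0.0]], codebook)      # sits exactly on the first codeword
```

The equidistant descriptor contributes half a vote to each codeword instead of being forced onto one of them, which addresses the codeword uncertainty problem. Codeword plausibility would need a different normalisation, since here every descriptor still contributes a full unit.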

Word frequency distributions. In text classification, the problem of selecting a good vocabulary is only related to feature selection and vocabulary size. Feature selection: pick only those terms that are really discriminative (e.g. using Mutual Information or Chi-Square statistics). Stop-word removal: drop the most frequent words ("the", "of", "an", etc.). Stemming: reduce inflected/derived words to their stem, base or root form. Given a natural-language textual corpus, the word frequency distribution follows the well-known Zipf's law, which states that the frequency of any word is inversely proportional to its rank in the frequency table (an ideal Zipf distribution is a straight line on a log-log scale). [Figure: rank-frequency plot marking the stop words and the most useful words.]
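Whether a corpus follows Zipf's law can be checked by fitting a line to the rank-frequency curve in log-log space; a slope close to -1 indicates an ideal Zipf distribution. The function name and the synthetic corpus below are assumptions made for this sketch.

```python
from collections import Counter
import numpy as np

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Synthetic corpus in which the r-th most frequent word occurs exactly 120 / r times.
tokens = [f"w{r}" for r in range(1, 7) for _ in range(120 // r)]
slope = zipf_slope(tokens)
```

The same function can be applied to a list of visual word indices (one per detected region) to inspect the frequency distribution of a visual vocabulary.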

Visual word statistics. Zipf's law is one of the basic assumptions in text categorization: according to this empirical evidence, we can consider words at intermediate frequencies as the most informative for classification. Therefore, it is interesting to see how the visual words are distributed in a visual corpus. In particular, we want to know whether their distribution satisfies Zipf's law and how their statistics are related to i) feature detection and ii) quantization techniques (see references [2,3] for more details).

An example: the distribution of visual word frequencies using k-means and radius-based quantization. Note: the results refer to action categorization on the KTH dataset.

Vocabulary size. Unlike the vocabulary of a text corpus, whose size is relatively fixed, the size of a visual-word vocabulary is controlled by the number of clusters. Choosing the right vocabulary size involves a trade-off between discriminativity and generalizability: with small vocabularies, the visual word is not very discriminative because dissimilar features can map to the same codeword; as the vocabulary size increases, the feature becomes more discriminative but less generalizable (similar features can map to different codewords). There is no consensus as to the appropriate size of a visual vocabulary: it can vary from several hundred to thousands and tens of thousands. However, it is closely related to the dataset (e.g. image resolution) and to the feature detection process; usually the optimal size is fixed by experiments (see reference [3] for more details).

Usually in text categorization the vocabulary size is reduced, keeping only the most informative terms, by means of feature selection methods. Among several methods, the best results are obtained using the Chi-Square statistic (CHImax) and Mutual Information (IG). In a text corpus, a good feature selection method is able to improve classification performance by reducing the vocabulary size. Is it the same for a visual corpus? [Figure: text categorization vs. object categorization.]

Classification. Many different approaches exist; state-of-the-art results using BoW models are obtained by Support Vector Machine (SVM) classifiers. An SVM classifier constructs a separating hyperplane in the feature space, one which maximizes the margin between the two data sets. SVM is a binary classifier but, usually, in the visual domain it is extended to multi-class problems. The original algorithm can be adapted to non-linear classification problems using the kernel trick. [Figure: separating hyperplane, margin and support vectors.] In the optimization problem, C is a regularization (error) term usually fixed by cross-validation on the training set.
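Assuming the standard soft-margin formulation, the optimization problem can be written as below. The symbols w (hyperplane normal), b (offset), ξ_i (slack variables), x_i (training vectors) and y_i ∈ {−1, +1} (class labels) are notation introduced here for illustration and are not taken from the slides.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1-\xi_i,\qquad \xi_i \ge 0,\qquad i=1,\dots,n
```

The first term maximizes the margin (which equals 2/‖w‖), while C controls how strongly margin violations ξ_i are penalized.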

Non-linear classification: image features are described by high-dimensional feature vectors; therefore, data are usually (always) not linearly separable. For linearly separable data a linear kernel is sufficient; for data that are not linearly separable, use a non-linear kernel (e.g. RBF). Optimal kernel choice: state-of-the-art results are obtained with a Gaussian kernel that uses the chi-square distance between histograms.
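A minimal sketch of such a kernel is given below, assuming one common convention for the chi-square distance (with a factor 1/2) and a free scale parameter `gamma`; both conventions, the small `eps` guard against empty bins, and the function names are assumptions rather than the exact formulas of the slides.

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two histograms (assumed convention with factor 1/2)."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def chi_square_kernel(h1, h2, gamma=1.0):
    """Gaussian-style kernel built on the chi-square distance."""
    return np.exp(-gamma * chi_square_distance(h1, h2))

same = chi_square_kernel([0.5, 0.5], [0.5, 0.5])     # identical histograms
disjoint = chi_square_kernel([1.0, 0.0], [0.0, 1.0]) # no shared mass
```

Identical histograms give a kernel value of 1, and the value decreases as the histograms share less mass. In practice `gamma` is often tied to the mean chi-square distance over the training set, which is again a convention and not something stated in the slides.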

A classification example (on two different datasets). Results are closely related to the vocabulary size; for big vocabularies, linear kernels are the best choice (also, obviously, because of computational cost).

Weakness of the BoW model. No rigorous geometric information about the object's components: it is intuitive that objects are made of parts, and relations between parts are really informative! [Figure: an example (face detection).] Note: all have equal probability for BoW models. Not extensively tested yet for viewpoint and scale invariance. Segmentation and localization are unclear. These methods can suffer from poor recall.

Applications. Nowadays, BoW models have been successfully applied to the visual domain by several research groups; it is probably the most popular approach for large-scale categorization problems. Examples: object, scene and texture categorization in images; event and action categorization in videos.

References
[1] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In: Proc. of ICCV, 2003.
[2] F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. In: Proc. of ICCV, 2005.
[3] J. Yang, Y.-G. Jiang, A. G. Hauptmann and C.-W. Ngo. Evaluating bag-of-visual-words representation in scene classification. In: Proc. of MIR, 2007.
[4] L. Fei-Fei, R. Fergus, A. Torralba. Recognizing and learning object categories. CVPR 2007 short course (Slides, Matlab code, Datasets), URL: http://people.csail.mit.edu/torralba/shortcourserloc/index.html