Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study


Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study
J. Zhang (1), M. Marszałek (1), S. Lazebnik (2), C. Schmid (1)
(1) INRIA Rhône-Alpes, LEAR - GRAVIR, Montbonnot, France
(2) Beckman Institute, University of Illinois, Urbana, USA
Beyond Patches Workshop, 2006

Overview
We have built an extensible image classification framework:
sparse local features
bag-of-features image representation
non-linear Support Vector Machines (SVMs) for classification
We have evaluated various elements of the framework on
4 texture datasets (UIUCTex, KTH-TIPS, Brodatz, CUReT)
5 object category datasets (Xerox7, Caltech6, Caltech101, Graz, PASCAL 2005)
The conclusions hold across the datasets.
We have also performed a detailed evaluation of the influence of the background, to check whether context information can be exploited and to evaluate robustness against background clutter.

Outline
1. Context information
2. Non-representative training set

Overview
Image → Interest points → Local descriptors → Bag-of-features → Classification
Salient image regions (interest points) are detected.
Regions are locally described with feature vectors.
Features are quantized or clustered.
Histograms or signatures are classified with SVMs.

Detectors
We have evaluated two widely used detectors:
Harris-Laplace detects corners.
Laplacian detects blobs.
The Laplacian demonstrates slightly higher performance; the combination, however, performs even better. The two detectors capture complementary information.
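As a rough illustration, here is a minimal sketch of multi-scale blob detection with a standard Laplacian-of-Gaussian detector from scikit-image; it only approximates the Laplacian detector evaluated in the study, and the file name and parameter values are illustrative assumptions.

```python
# Minimal Laplacian-of-Gaussian (blob) interest point detection sketch,
# approximating the Laplacian detector; parameters are illustrative.
import numpy as np
from skimage import color, io
from skimage.feature import blob_log

image = color.rgb2gray(io.imread("example.jpg"))  # hypothetical input image

# Returns one (row, col, sigma) triple per detected blob.
blobs = blob_log(image, min_sigma=2, max_sigma=30, num_sigma=10, threshold=0.1)
blobs[:, 2] *= np.sqrt(2)  # convert sigma to an approximate region radius
print(f"detected {len(blobs)} scale-invariant blobs")
```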

Descriptors
We have evaluated three descriptors:
SIFT: gradient orientation histogram.
SPIN: rotation-invariant histogram of intensities.
RIFT: rotation-invariant version of SIFT.
SIFT performs best, SPIN slightly worse; RIFT seems to lose important information.
Again, combining SIFT with SPIN improves performance, as those descriptors are complementary. Adding RIFT does not help.
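For the descriptor step, here is a minimal sketch using OpenCV's SIFT implementation; SPIN and RIFT have no widely available library implementation, so only SIFT extraction is shown, and the file name is a placeholder.

```python
# Minimal SIFT keypoint detection and description sketch using OpenCV
# (opencv-python >= 4.4); SPIN and RIFT are not covered here.
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors: (n_keypoints, 128) array of gradient orientation histograms
print(descriptors.shape)
```

Combining descriptor channels (e.g. SIFT with SPIN) can then be done at the kernel level, for example by summing the per-channel distances before exponentiation.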

Description invariance
We need invariance to recognize objects observed under varying conditions, yet we have seen that invariance leads to information loss. How much invariance do we need? No more than necessary:
We needed scale invariance in our experiments.
Rotation invariance helped only for UIUCTex.
We have not observed any improvement due to affine adaptation of interest points: object recognition is different from matching.

Visual vocabulary
The bag-of-words representation has proven its usefulness in text classification.
Visual words are created by clustering the observed features.

Bag-of-features
Given a vocabulary, we can quantize the feature vector space by assigning each observed feature to the closest visual word. Given an image, we can then create a histogram of word occurrences.
Alternatively, we can cluster the set of features of a single image into a signature. Note that there are no common underlying words in this case: the words are adapted to each image.
Note that both approaches ignore spatial relationships between features.
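A minimal sketch of both vocabulary construction and quantization, assuming k-means for clustering; the vocabulary size below is an illustrative choice, not the setting used in the study.

```python
# Build a visual vocabulary by k-means clustering of training descriptors,
# then quantize an image's descriptors into a bag-of-features histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, n_words=1000):
    """Cluster descriptors pooled over all training images into visual words."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(train_descriptors)

def bag_of_features(descriptors, vocabulary):
    """Assign each feature to its closest word and histogram the assignments."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so images with different feature counts compare
```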

Support Vector Machines
We use non-linear Support Vector Machines to classify histograms and signatures. The decision function has the form
$g(x) = \sum_i \alpha_i y_i K(x_i, x) - b$
We use extended Gaussian kernels
$K(x_j, x_k) = \exp\left(-\tfrac{1}{A} D(x_j, x_k)\right)$
where $D(x_j, x_k)$ is a similarity measure.
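A minimal sketch of the extended Gaussian kernel, given any pairwise distance matrix D; setting the scale A to the mean pairwise distance is a common heuristic and an assumption here, not a claim about the original setup.

```python
# Extended Gaussian kernel K = exp(-D / A) over a pairwise distance matrix.
import numpy as np

def extended_gaussian_kernel(D, A=None):
    if A is None:
        A = D.mean()  # heuristic scale: mean pairwise distance (assumption)
    return np.exp(-D / A)
```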

χ² kernel
To compare histograms, we use the χ² distance
$D(U, W) = \frac{1}{2} \sum_{i=1}^{m} \frac{(u_i - w_i)^2}{u_i + w_i}$
It is efficient to compute. It is a bin-to-bin measure, so common underlying words are necessary.
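A minimal sketch of the χ² distance and of plugging the resulting kernel into a precomputed-kernel SVM with scikit-learn; the data is random and purely illustrative.

```python
# Chi-square distance between histograms, the corresponding extended
# Gaussian kernel, and a precomputed-kernel SVM (illustrative data).
import numpy as np
from sklearn.svm import SVC

def chi2_distance(u, w, eps=1e-10):
    """D(U, W) = 1/2 * sum_i (u_i - w_i)^2 / (u_i + w_i)."""
    return 0.5 * np.sum((u - w) ** 2 / (u + w + eps))

def chi2_kernel(X, Y, A):
    D = np.array([[chi2_distance(x, y) for y in Y] for x in X])
    return np.exp(-D / A)

rng = np.random.default_rng(0)
X_train = rng.dirichlet(np.ones(100), size=20)  # 20 normalized word histograms
y_train = rng.integers(0, 2, size=20)
A = 1.0  # kernel scale, e.g. the mean training distance
clf = SVC(kernel="precomputed").fit(chi2_kernel(X_train, X_train, A), y_train)

X_test = rng.dirichlet(np.ones(100), size=5)
predictions = clf.predict(chi2_kernel(X_test, X_train, A))
```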

EMD kernel
To compare signatures, we use the Earth Mover's Distance
$D(U, W) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d(u_i, w_j)}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}$
It requires solving a linear programming problem to determine the flows $f_{ij}$, and we have to define the ground distance $d(u_i, w_j)$ between features. No vocabulary construction is necessary.
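A minimal sketch of the EMD computation using the POT optimal transport library (pip install POT); the library choice and the Euclidean ground distance are assumptions, not the original implementation.

```python
# Earth Mover's Distance between two signatures (cluster centers + weights),
# solved as a transportation linear program by the POT library.
import numpy as np
import ot  # Python Optimal Transport: pip install POT
from scipy.spatial.distance import cdist

def emd_distance(centers_u, weights_u, centers_w, weights_w):
    a = weights_u / weights_u.sum()  # POT expects mass-normalized weights
    b = weights_w / weights_w.sum()
    M = cdist(centers_u, centers_w)  # Euclidean ground distance d(u_i, w_j)
    return ot.emd2(a, b, M)          # optimal-flow cost, i.e. the EMD
```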

Which kernel to choose?
Both perform comparably.
The EMD kernel does not require an expensive vocabulary construction, giving short training times.
The χ² kernel is faster to compute, giving short testing times.

Outline
1. Context information
2. Non-representative training set

Texture datasets
[Figure: sample images from the texture datasets, e.g. UIUCTex classes T01-T03 (bark), T04-T06 (wood), T07 (water), T08 (granite), T09 (marble), T10 (stone), and KTH-TIPS materials such as sponge, cotton, felt, plaster, styrofoam]

Classification accuracy (%):
Method     UIUCTex      KTH-TIPS     Brodatz      CUReT
our        98.3 ± 0.5   95.5 ± 1.3   95.4 ± 0.3   95.3 ± 0.4
Hayman     92.0 ± 1.3   94.8 ± 1.2   95.0 ± 0.8   98.6 ± 0.2
Lazebnik   96.4 ± 0.9   91.3 ± 1.4   89.8 ± 1.0   72.5 ± 0.7
VZ-joint   78.4 ± 2.0   92.4 ± 2.1   92.9 ± 0.8   96.0 ± 0.4
G. Gabor   65.2 ± 2.0   90.0 ± 2.0   87.9 ± 1.0   92.4 ± 0.5

Object category datasets
[Figure: sample images from the object category datasets, e.g. Xerox7 (bikes, books, buildings, cars, faces, phones, trees), Caltech6 (airplanes, cars rear, cars side, faces, motorbikes, wildcats), Graz (bikes, people, background), and Caltech101 classes ranging from accordion (93%), carside (92%), pagoda (89%) down to scorpion (15%), ibis (8%), anchor (7%)]

Classification accuracy (%):
Method   Xerox7   CalTech6   Graz   PASCAL test set 1   PASCAL test set 2   CalTech101
our      94.3     97.9       90.0   92.8                74.3                53.9
others   82.0     96.6       83.7   94.6                70.5                43

PASCAL VOC challenge 2005 dataset
[Figure: example training and test images (test set 1 and test set 2) for the four classes: people, motorbikes, cars, bikes]
Note the object annotations.

Outline
1. Context information
2. Non-representative training set

Overview
The PASCAL VOC challenge 2005 dataset comes with detailed object annotation. Using the provided bounding boxes, we can approximately separate foreground and background features. We perform experiments on five meta-datasets:
FF: foreground features
BF: background features
AF: all features (FF + BF)
AF-CONST: foreground features with static scene background features
AF-RAND: foreground features with background features of a random image

How important is the context?
[Figure: ROC curves (true positive rate vs. false positive rate) for classifiers trained and tested on BF for bicycles, cars, motorbikes, and people, on PASCAL test sets 1 and 2]
Training and testing on BF is still better than a random method: background features carry a significant amount of context information.

Can we use this information?
[Figure: ROC curves for the people class on PASCAL test sets 1 and 2, comparing FF, AF-CONST, AF, and AF-RAND]
Training and testing on FF gives better results than on AF. Due to background clutter, we cannot use the context information to improve the classification results.

Can we deal with background clutter at all?
[Figure: ROC curves for the people class on PASCAL test sets 1 and 2, comparing FF, AF-CONST, AF, and AF-RAND]
Training and testing on AF-CONST is still better than AF and often close to FF. We can easily deal with a static background.

Outline
1. Context information
2. Non-representative training set

It's easy to get biased
What happens if the test images are not represented well by the training set?
[Figure: ROC curves for the people class on PASCAL test sets 1 and 2, comparing training sets FF, AF-CONST, AF, and AF-RAND, all tested on AF]
Training on AF and AF-RAND is significantly better than on FF or AF-CONST. One should not train on too-easy examples; it is better to choose too-hard ones.

Summary
We have created an effective image classification framework that outperforms the state of the art.
We have evaluated the parameters of the framework on a wide range of datasets and were able to deliver general design conclusions.
We have evaluated the influence that the background has on bag-of-features methods.

Future work
Spatial Weighting for object recognition
Spatial Pyramid Matching for scene classification

Thank you for your attention. I will be glad to answer your questions.