Beyond Bags of Features


Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Matching and Modeling Seminar <eliyahud@post.tau.ac.il>. Instructed by Prof. Haim J. Wolfson, School of Computer Science, Tel Aviv University. December 9th, 2015

Outline: Pyramid Matching Kernels; Spatial Matching Scheme


We want to recognize the semantic category of an image. Goals:
- Scene (forest, street, etc.)
- Objects of interest
- Efficiency
- Accuracy
Based on [Lazebnik et al., 2006]


A set of methods for identification and classification. Usually:
- Define image features; for example, SIFT descriptors in 16×16 image blocks
- An orderless representation of the (local) image features, usually a long vector; for example, a histogram with k bins
- A learning algorithm for classification or detection, such as SVM



Disadvantages. Existing methods usually use local features, and hence disregard the spatial layout of the features. For example:
- Inability to capture the shape of objects
- Inability to segment an object from its background
Overcoming these limitations is challenging, especially with noise such as clutter, occlusion, and viewpoint change.

Overcoming the Disadvantages. Previous works suggest either robustness at significant computational expense, or efficiency with inconclusive results. In this paper: statistics of local features over fixed subregions.

Off Topic: SVM. A Support Vector Machine is a supervised learning algorithm that creates a (binary) linear classifier: it finds a hyperplane that separates the two categories. It has many extensions, for example multiclass SVM.

Off Topic: SVM, More Formally. Given D = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {±1}}, find w, b that maximize the margin min_{(x_i, y_i) ∈ D} y_i (w · x_i − b). Given a new example x, its classification y satisfies: y = 1 if w · x − b > 0, and y = −1 if w · x − b ≤ 0. The hyperplane w · x − b = 0 is called the separating hyperplane.
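As a minimal numpy sketch of this decision rule: the data and the hyperplane (w, b) below are made-up values for illustration; a trained SVM would instead choose the w, b that maximize the margin.

```python
import numpy as np

# Toy linearly separable data (made-up values, for illustration only).
X = np.array([[2.0, 3.0], [3.0, 3.5], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

# A separating hyperplane w·x - b = 0, picked by hand; an SVM would
# choose the w, b maximizing the margin min_i y_i(w·x_i - b).
w = np.array([1.0, 1.0])
b = 0.0

def classify(x):
    """Decision rule: y = 1 if w·x - b > 0, else y = -1."""
    return 1 if np.dot(w, x) - b > 0 else -1

margins = y * (X @ w - b)
print(margins)  # all positive, so every training example lies on its correct side
```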


A kernel-based recognition method, following the pyramid matching scheme of Grauman and Darrell: repeatedly subdivide the image and, for each division, compute histograms of local features.

Pyramid Matching Kernels. Let X and Y be two sets of d-dimensional vectors (vectors in a feature space). We want to find an approximate correspondence between them.

Resolutions. Construct a sequence of grids at resolutions 0, ..., L. The grid at level l has 2^l cells along each dimension, for a total of D_l = 2^{dl} cells. Let H_X^l and H_Y^l denote the histograms of X and Y at this resolution, and let H_X^l(i) and H_Y^l(i) be the numbers of points from X and Y that fall into the i-th cell.

Matchings in a Resolution. A point x ∈ X that falls into cell i is counted in H_X^l(i). The number of matches at level l is the histogram intersection: I_l = I(H_X^l, H_Y^l) = Σ_{i=1}^{D_l} min(H_X^l(i), H_Y^l(i)). I_l also includes the matches of I_{l+1}; therefore, the number of new matches found at level l is I_l − I_{l+1}, for 0 ≤ l < L.

Pyramid Match Kernel. Give level l the weight 1/2^{L−l}. The pyramid match kernel is defined as: κ^L(X, Y) = I_L + Σ_{l=0}^{L−1} (1/2^{L−l}) (I_l − I_{l+1}) = (1/2^L) I_0 + Σ_{l=1}^{L} (1/2^{L−l+1}) I_l.
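The kernel can be checked numerically from precomputed per-level histograms. A small sketch (the flat-array histogram layout here is an assumption of this sketch, not the paper's data structure):

```python
import numpy as np

def intersection(h1, h2):
    """Histogram intersection I(H1, H2) = sum_i min(H1(i), H2(i))."""
    return np.minimum(h1, h2).sum()

def pyramid_match_kernel(hists_X, hists_Y):
    """kappa^L(X, Y), where hists_*[l] is the flat level-l histogram, l = 0..L.

    New matches at level l are weighted by 1/2^(L-l):
        kappa^L = I_L + sum_{l=0}^{L-1} (1/2^(L-l)) * (I_l - I_{l+1})
    """
    L = len(hists_X) - 1
    I = [intersection(hists_X[l], hists_Y[l]) for l in range(L + 1)]
    return I[L] + sum((I[l] - I[l + 1]) / 2 ** (L - l) for l in range(L))

# Example with d = 1, L = 1: level 0 has one cell, level 1 has two cells.
hX = [np.array([3.0]), np.array([2.0, 1.0])]
hY = [np.array([3.0]), np.array([1.0, 2.0])]
print(pyramid_match_kernel(hX, hY))  # I_0 = 3, I_1 = 2, kappa = 2 + (1/2)(3 - 2) = 2.5
```

The second closed form gives the same value: (1/2) · 3 + (1/2) · 2 = 2.5.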

Spatial Information. So far we have counted the number of features in each cell, but we have ignored that the features themselves may differ or be similar. We now quantize all feature vectors into M discrete types (channels), and only features of the same type can match one another. Each channel m gives us two sets of vectors, X_m and Y_m. This approach reduces to a standard bag of features when L = 0.


The Image Kernel. K^L(X, Y) = Σ_{m=1}^{M} κ^L(X_m, Y_m).

Vector Length. Our images are two-dimensional (d = 2), hence each resolution subdivides each cell of the previous one into 2 × 2 = 4. For L levels and M channels, the vector length is: M Σ_{l=0}^{L} 4^l = M (4^{L+1} − 1) / 3.
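A quick sanity check of the closed form; the values below use the strong-feature settings that appear later (M = 200, L = 2).

```python
def spm_vector_length(M, L):
    """M * sum_{l=0}^{L} 4^l, via the closed form M * (4^(L+1) - 1) / 3."""
    return M * (4 ** (L + 1) - 1) // 3

# M = 200 channels, L = 2 levels: 200 * (1 + 4 + 16) = 4200 entries.
print(spm_vector_length(200, 2))  # 4200
```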

Efficient Implementation.
1. Subdivide the image into L levels.
2. For each level and each channel:
   1. Count the features that fall into each spatial bin.
   2. Weight this spatial bin, and add its weighted value to the histogram.
The next slide shows an example.


Example. [Figure: the same set of feature points histogrammed at levels 0, 1, and 2; the three levels are weighted 1/4, 1/4, and 1/2, respectively.]
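The count-then-weight construction can be sketched as follows. The feature representation (normalized 2-D positions plus a channel id per feature) is an assumption of this sketch. Because min(w·a, w·b) = w·min(a, b), the histogram intersection of two such weighted vectors equals the image kernel K^L.

```python
import numpy as np

def spatial_pyramid_vector(points, channels, M, L):
    """Weighted, concatenated histograms for one image (the "long vector").

    points   : (N, 2) feature coordinates, normalized to [0, 1)
    channels : (N,) integer channel ids in 0..M-1
    Level weights match the kernel: 1/2^L for l = 0, 1/2^(L-l+1) for l >= 1.
    """
    parts = []
    for l in range(L + 1):
        w = 1 / 2 ** L if l == 0 else 1 / 2 ** (L - l + 1)
        cells = 2 ** l
        hist = np.zeros((M, cells, cells))
        ix = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        iy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        np.add.at(hist, (channels, ix, iy), 1.0)  # count features per spatial bin
        parts.append(w * hist.ravel())
    return np.concatenate(parts)
```

The resulting vector has M (4^{L+1} − 1) / 3 entries, matching the vector-length formula.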


About the Experiments. Each experiment was repeated 10 times, each time with different randomly selected training and testing sets. The average per-class accuracy is recorded for each run; the final result consists of the mean and standard deviation over the 10 runs. Multiclass classification was done using SVMs trained with the one-vs-all rule: the label is assigned by the classifier with the highest response.
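The one-vs-all labeling rule can be sketched like this; the per-class weights below are made-up values standing in for trained binary classifiers.

```python
import numpy as np

# Hypothetical linear one-vs-all classifiers: row c of W and entry b[c]
# form the binary "class c vs. rest" classifier (made-up values).
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.zeros(3)

def predict(x):
    """Assign the label whose classifier gives the highest response w_c·x - b_c."""
    return int(np.argmax(W @ x - b))

print(predict(np.array([2.0, 0.5])))  # classifier 0 responds strongest: 0
```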

Used Features. Two kinds of features: weak features and strong features.

The Weak Features. Oriented edge points: edge points are extracted at two scales and eight orientations, for a total of M = 16 channels.

The Strong Features. SIFT descriptors of 16 × 16 pixel patches, computed over a grid with 8-pixel spacing. The visual vocabulary was calculated by k-means clustering of a random subset of patches taken from the training set. Used with M = 200 or M = 400.
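A toy sketch of the vocabulary step: plain Lloyd iterations over descriptor vectors, then quantization of new descriptors to their nearest visual word. The deterministic initialization from the first M descriptors is a simplification of this sketch, not the paper's procedure.

```python
import numpy as np

def build_vocabulary(descriptors, M, iters=20):
    """Toy k-means over an (N, d) array of descriptors; returns M cluster centers."""
    centers = descriptors[:M].astype(float)  # deterministic init (simplification)
    for _ in range(iters):
        # Assign each descriptor to its nearest center (squared L2 distance).
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Move each center to the mean of its assigned descriptors.
        for m in range(M):
            if (labels == m).any():
                centers[m] = descriptors[labels == m].mean(0)
    return centers

def quantize(descriptors, centers):
    """Map each descriptor to its visual-word id (index of the nearest center)."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)
```

The word ids returned by `quantize` are exactly the channel labels the spatial histograms are built over.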

The dataset consists of 15 scene categories, each with 200 to 400 images; the average image size is 300 × 250 pixels. It is originally from [Li and Perona, 2005], which had 13 categories; 2 more categories were collected by [Lazebnik et al., 2006].

Dataset Sample

Results for Classification (mean ± standard deviation, %):

            Weak features (M = 16)    Strong features (M = 200)   Strong features (M = 400)
L           Single-level  Pyramid     Single-level  Pyramid       Single-level  Pyramid
0 (1×1)     45.3 ± 0.5    —           72.2 ± 0.6    —             74.8 ± 0.3    —
1 (2×2)     53.6 ± 0.3    56.2 ± 0.6  77.9 ± 0.6    79.0 ± 0.5   78.8 ± 0.4    80.1 ± 0.5
2 (4×4)     61.7 ± 0.6    64.7 ± 0.7  79.4 ± 0.3    81.1 ± 0.3   79.7 ± 0.5    81.4 ± 0.5
3 (8×8)     63.3 ± 0.8    66.8 ± 0.6  77.2 ± 0.4    80.07 ± 0.3  77.2 ± 0.5    81.1 ± 0.6

Results Analysis. Results improve dramatically as we go from L = 0 to multilevel. For strong features, performance drops from L = 2 to L = 3, because the highest level of L = 3 is too finely subdivided: individual bins yield too few matches. Still, the performance drop is small. Strong features are better than weak features; however, the number of strong features (M = 200 vs. M = 400) is almost irrelevant.

Confusion Table

Confusions. Confusions occur between indoor classes (kitchen, bedroom, living room) and between natural classes (coast, open country). This is understandable. The next slide shows examples of image retrieval.

Image Retrieval Examples

Previous Work. [Li and Perona, 2005] achieved a success rate of 65.2% with L = 0 and M = 200, whereas [Lazebnik et al., 2006] achieved 72.2% (74.7% for the 13 categories of [Li and Perona, 2005]). [Lazebnik et al., 2006] wondered why the gap is so big: apparently [Li and Perona, 2005] used worse strong features, since [Lazebnik et al., 2006] achieved only 65.9% with the features of [Li and Perona, 2005].

Based on [Fei-Fei et al., 2004]: 101 categories, 31 to 800 images per category, image size about 300 × 300 pixels. Properties: relatively little clutter; objects are centered and occupy most of the image.

Experiment Setup. Train on 30 images per class and test on the rest; for efficiency, testing is limited to 50 images per class. This is the same setup as in [Grauman and Darrell, 2005, Zhang et al., 2007].

Results (mean ± standard deviation, %):

        Weak features              Strong features (M = 200)
L       Single-level  Pyramid      Single-level  Pyramid
0       15.5 ± 0.9    —            41.2 ± 1.2    —
1       31.4 ± 1.2    32.8 ± 1.3   55.9 ± 0.9    57.0 ± 0.8
2       47.2 ± 1.1    49.3 ± 1.4   63.6 ± 0.9    64.6 ± 0.8
3       52.2 ± 0.8    54.0 ± 1.1   60.3 ± 0.9    64.6 ± 0.7

Results for M = 400 do not improve, just as for scenes.

Results Examples: Easiest vs. Hardest Categories. Highest performance with L = 2, M = 200.

Comparison to Previous Work. For L = 0, [Grauman and Darrell, 2005] got 43%; we got 41.2%. Our best result is 64.6%, with L = 2 and M = 200, which is better than the 53.9% of [Zhang et al., 2007].

Confusions. There are a lot of closely related classes. Top five confusions for L = 2, M = 200 (%):

Class 1      Class 2          Class 1 misclassified   Class 2 misclassified
                              as class 2              as class 1
ketch        schooner         21.6                    14.8
lotus        water lily       15.3                    20.0
crocodile    crocodile head   10.5                    10.0
crayfish     lobster          11.3                     9.1
flamingo     ibis              9.5                    10.4

So Far. Our method does well assuming canonical poses. What about heavy clutter? Pose changes?

[Opelt et al., 2004]. A dataset characterized by high intra-class variation. Two object classes:
- Objects: Bikes (373 images) + Persons (460 images)
- Backgrounds (270 images)
Image resolution is 640 × 480 pixels. The range of scales and poses is very diverse; for example, the persons category includes a pedestrian in the distance, a side view of a complete body, and a closeup of a head.

Examples Bikes

Examples Persons

Examples Backgrounds

Experiment Setup. Same as in [Opelt et al., 2004]: two-class detection, objects vs. background. Train on 100 positive and 100 negative images (50 from the other object class, 50 from the backgrounds class). Test on a similarly distributed set.

Results. Strong features, M = 200 (%):

Class   L = 0        L = 2        Opelt   Zhang
Bikes   82.4 ± 2.0   86.3 ± 2.5   86.5    92.0
People  79.5 ± 2.3   82.3 ± 3.1   80.8    88.0

The standard deviation is quite high; for example, the bikes class at L = 2 ranges from 81% to 91%. Still, we are close to the other methods.

Bibliography I

Fei-Fei, L., Fergus, R., and Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Computer Vision and Pattern Recognition Workshop (CVPRW '04), pages 178-178.

Grauman, K. and Darrell, T. (2005). The pyramid match kernel: Discriminative classification with sets of image features. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 2, pages 1458-1465.

Bibliography II

Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), volume 2, pages 2169-2178. IEEE Computer Society, Washington, DC, USA.

Li, F.-F. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), volume 2, pages 524-531. IEEE Computer Society, Washington, DC, USA.

Bibliography III

Opelt, A., Fussenegger, M., Pinz, A., and Auer, P. (2004). Weak hypotheses and boosting for generic object detection and recognition. In Pajdla, T. and Matas, J., editors, Computer Vision - ECCV 2004, volume 3022 of Lecture Notes in Computer Science, pages 71-84. Springer Berlin Heidelberg.

Zhang, J., Marszalek, M., Lazebnik, S., and Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213-238.