Visual Object Recognition

Visual Object Recognition. Bastian Leibe, Computer Vision Laboratory, ETH Zurich & Kristen Grauman, Department of Computer Sciences, University of Texas at Austin. Chicago, 14.07.2008

Outline
1. Detection with Global Appearance & Sliding Windows
2. Local Invariant Features: Detection & Description
3. Specific Object Recognition with Local Features
(Coffee Break)
4. Visual Words: Indexing, Bags of Words Categorization
5. Matching Local Feature Sets
6. Part-Based Models for Categorization
7. Current Challenges and Research Directions

Global representations: limitations
- Success may rely on alignment -> sensitive to viewpoint
- All parts of the image or window impact the description -> sensitive to occlusion and clutter

Local representations
Describe component regions or patches separately. Many options for detection & description:
- SIFT [Lowe]
- Salient regions [Kadir et al.]
- Shape context [Belongie et al.]
- Harris-Affine [Schmid et al.]
- Superpixels [Ren et al.]
- Spin images [Johnson and Hebert]
- Maximally Stable Extremal Regions [Matas et al.]
- Geometric Blur [Berg et al.]

Recall: Invariant local features
A subset of local feature types is designed to be invariant to scale, translation, rotation, affine transformations, and illumination.
1) Detect interest points
2) Extract descriptors, e.g., vectors x = (x_1, ..., x_d) and y = (y_1, ..., y_d)
[Mikolajczyk & Schmid, Matas et al., Tuytelaars & Van Gool, Lowe, Kadir et al.]

Recognition with local feature sets
Previously, we saw how to use local invariant features plus a global spatial model to recognize specific objects, using a planar object assumption. Now, we'll use local features for:
- Indexing-based recognition
- Bag-of-words representations
- Correspondence / matching kernels
(Example image: Aachen Cathedral)

Basic flow
1. Detect or sample features -> list of positions, scales, orientations
2. Describe features -> associated list of d-dimensional descriptors
3. Index each one into a pool of descriptors from previously seen images
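
As a concrete illustration of the first two steps, the sketch below detects interest points and computes descriptors with OpenCV's SIFT implementation. This is a minimal sketch; the file name and parameter choices are placeholders, not part of the tutorial.

```python
import cv2

# Load a grayscale image (path is a placeholder).
img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

# 1) Detect interest points and 2) describe them with SIFT.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries a position, scale, and orientation;
# each descriptor is a 128-dimensional vector.
for kp in keypoints[:3]:
    print(kp.pt, kp.size, kp.angle)
print(descriptors.shape)  # (num_keypoints, 128)
```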

Indexing local features Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)

Indexing local features When we see close points in feature space, we have similar descriptors, which indicates similar local content. Figure credit: A. Zisserman

Indexing local features We saw in the previous section how to use voting and pose clustering to identify objects using local features. (Figure credit: David Lowe)

Indexing local features
With potentially thousands of features per image, and hundreds to millions of images to search, how can we efficiently find those that are relevant to a new image?
- Low-dimensional descriptors: standard efficient data structures for nearest neighbor search
- High-dimensional descriptors: approximate nearest neighbor search methods are more practical
- Inverted file indexing schemes

Indexing local features: approximate nearest neighbor search
- Best-Bin First (BBF): a variant of k-d trees that uses a priority queue to examine the most promising branches first [Beis & Lowe, CVPR 1997]
- Locality-Sensitive Hashing (LSH): a randomized hashing technique whose hash functions map similar points to the same bin with high probability [Indyk & Motwani, 1998]
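
For instance, an approximate nearest-neighbor matcher over SIFT descriptors can be set up with OpenCV's FLANN-based matcher, which uses randomized k-d trees. This is a minimal sketch assuming `desc_query` and `desc_db` are descriptor arrays like those computed above; the placeholder data and parameter values are illustrative, not prescribed by the tutorial.

```python
import numpy as np
import cv2

# Assume desc_query and desc_db are float32 arrays of SIFT descriptors
# (shape: num_features x 128), e.g., from sift.detectAndCompute above.
desc_query = np.random.rand(500, 128).astype(np.float32)
desc_db = np.random.rand(2000, 128).astype(np.float32)

# FLANN with randomized k-d trees: approximate nearest neighbor search.
index_params = dict(algorithm=1, trees=4)   # 1 = FLANN_INDEX_KDTREE
search_params = dict(checks=64)             # number of leaves to visit
flann = cv2.FlannBasedMatcher(index_params, search_params)

# For each query descriptor, retrieve its two nearest database descriptors
# and keep only distinctive matches (Lowe's ratio test).
matches = flann.knnMatch(desc_query, desc_db, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]
print(len(good), "putative matches")
```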

Indexing local features: inverted file index
For text documents, an efficient way to find all pages on which a word occurs is to use an index. We want to find all images in which a feature occurs. To use this idea, we'll need to map our features to visual words.

Visual words: main idea
Extract some local features from a number of images, e.g., in SIFT descriptor space, where each point is 128-dimensional. (Slide credit: D. Nister)

Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space:
- Quantize via clustering; let the cluster centers be the prototype words.
- Determine which word to assign to each new image region by finding the closest cluster center.
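
A minimal sketch of this quantization step, using k-means from scikit-learn (the placeholder data and the vocabulary size of 1000 are just illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack descriptors from many training images (placeholder data here).
all_descriptors = np.random.rand(20000, 128).astype(np.float32)

# Quantize the descriptor space: cluster centers become the visual words.
k = 1000
kmeans = KMeans(n_clusters=k, n_init=4, random_state=0).fit(all_descriptors)
vocabulary = kmeans.cluster_centers_          # (k, 128) prototype words

# Assign each new descriptor to its closest cluster center (= visual word id).
new_image_descriptors = np.random.rand(300, 128).astype(np.float32)
word_ids = kmeans.predict(new_image_descriptors)
```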

Visual words Example: each group of patches belongs to the same visual word. (Figure from Sivic & Zisserman, ICCV 2003)

Visual words
First explored for texture and material representations: a texton is a cluster center of filter responses over a collection of images; textures and materials are described by their distribution of prototypical texture elements. [Leung & Malik 1999; Varma & Zisserman 2002; Lazebnik, Schmid & Ponce 2003]

Visual words More recently used for describing scenes and objects for the sake of indexing or classification. [Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others]

Inverted file index for images composed of visual words: each word number maps to the list of image numbers in which it occurs. (Image credit: A. Zisserman)
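
A minimal sketch of such an index, built from per-image visual word ids (the helper data here is hypothetical; the tutorial does not prescribe any particular implementation):

```python
from collections import defaultdict

# word_ids_per_image[i] holds the visual word ids observed in image i.
word_ids_per_image = {
    0: [3, 17, 42, 17],
    1: [5, 42, 99],
    2: [3, 99, 99, 7],
}

# Inverted file: word number -> set of image numbers containing that word.
inverted_index = defaultdict(set)
for image_id, words in word_ids_per_image.items():
    for w in words:
        inverted_index[w].add(image_id)

# Querying: only images sharing at least one word with the query are candidates.
query_words = [42, 7]
candidates = set().union(*(inverted_index[w] for w in query_words))
print(sorted(candidates))  # -> [0, 1, 2]
```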

Bags of visual words Summarize the entire image by its distribution (histogram) of word occurrences, analogous to the bag-of-words representation commonly used for documents. (Image credit: Fei-Fei Li)
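
Continuing the sketch above, the bag-of-words vector is just a (normalized) histogram of word ids over the vocabulary:

```python
import numpy as np

def bag_of_words(word_ids, vocabulary_size):
    """Histogram of visual word occurrences, L1-normalized."""
    hist = np.bincount(word_ids, minlength=vocabulary_size).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# e.g., using the word ids assigned by kmeans.predict(...) earlier
bow_vector = bag_of_words(np.array([3, 17, 42, 17]), vocabulary_size=1000)
print(bow_vector.shape)  # (1000,) fixed-length vector per image
```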

Video Google System [Sivic & Zisserman, ICCV 2003]
1. Collect all words within the query region
2. Use the inverted file index to find relevant frames
3. Compare word counts
4. Spatial verification
Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
(Figure: query region and retrieved frames)

Basic flow
1. Detect or sample features -> list of positions, scales, orientations
2. Describe features -> associated list of d-dimensional descriptors
3. Either index each one into a pool of descriptors from previously seen images, or quantize to form a bag-of-words vector for the image

Visual vocabulary formation: issues
- Sampling strategy
- Clustering / quantization algorithm
- Unsupervised vs. supervised
- What corpus provides the features (universal vocabulary?)
- Vocabulary size, number of words

Sampling strategies: sparse (at interest points), with multiple interest operators, dense (uniformly), or random.
- To find specific, textured objects, sparse sampling from interest points is often more reliable.
- Multiple complementary interest operators offer more image coverage.
- For object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006]
(Image credits: F-F. Li, E. Nowak, J. Sivic)
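
As an illustration of dense sampling (a sketch, not from the tutorial): keypoints can be placed on a regular grid and then described with SIFT, instead of relying on an interest operator. The file path, step, and patch sizes are arbitrary choices.

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
step, size = 8, 16

# Dense sampling: one keypoint every `step` pixels, all with the same scale.
keypoints = [
    cv2.KeyPoint(float(x), float(y), size)
    for y in range(0, img.shape[0], step)
    for x in range(0, img.shape[1], step)
]

# Describe the fixed grid of patches with SIFT (no detection step).
sift = cv2.SIFT_create()
keypoints, descriptors = sift.compute(img, keypoints)
print(descriptors.shape)
```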

Clustering / quantization methods
- k-means (typical choice), agglomerative clustering, mean-shift, ...
- Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies, e.g., the vocabulary tree [Nister & Stewenius, CVPR 2006]

Example: Recognition with a Vocabulary Tree. Tree construction. [Nister & Stewenius, CVPR 06] (Slide credit: David Nister)
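
A minimal sketch of hierarchical k-means tree construction in the spirit of the vocabulary tree (the branching factor, depth, and helper functions are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(descriptors, branching=10, depth=3):
    """Recursively cluster descriptors; returns a nested dict of cluster centers."""
    if depth == 0 or len(descriptors) < branching:
        return None
    km = KMeans(n_clusters=branching, n_init=2, random_state=0).fit(descriptors)
    children = []
    for c in range(branching):
        subset = descriptors[km.labels_ == c]
        children.append(build_vocab_tree(subset, branching, depth - 1))
    return {"centers": km.cluster_centers_, "children": children}

def lookup(tree, desc):
    """Descend the tree to assign a descriptor to a leaf (visual word) path."""
    path = []
    while tree is not None:
        c = int(np.argmin(np.linalg.norm(tree["centers"] - desc, axis=1)))
        path.append(c)
        tree = tree["children"][c]
    return tuple(path)

descs = np.random.rand(20000, 128).astype(np.float32)  # placeholder descriptors
tree = build_vocab_tree(descs)
print(lookup(tree, descs[0]))  # path through the tree = visual word id
```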

Vocabulary Tree Training: filling the tree. [Nister & Stewenius, CVPR 06] (Slide credit: David Nister)

Vocabulary Tree Recognition: RANSAC verification. [Nister & Stewenius, CVPR 06] (Slide credit: David Nister)

Vocabulary Tree: Performance
- Evaluated on large databases: indexing with up to 1M images
- Online recognition for a database of 50,000 CD covers, retrieval in ~1 s
- Experimentally, large vocabularies can be beneficial for recognition
[Nister & Stewenius, CVPR 06]

Vocabulary formation Ensembles of trees provide additional robustness. [Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007] (Figure credit: F. Jurie)

Supervised vocabulary formation Recent work considers how to leverage labeled images when constructing the vocabulary. [Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006]

Supervised vocabulary formation Merge words that don't aid in discriminability. [Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005]

Supervised vocabulary formation Consider vocabulary and classifier construction jointly. [Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008]

Learning and recognition with bag-of-words histograms The bag-of-words representation makes it possible to describe an unordered point set with a single vector of fixed dimension across image examples. This provides an easy way to use the distribution of feature types with various learning algorithms that require vector input.
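
For example, a discriminative classifier can be trained directly on the fixed-length histograms; a minimal sketch with scikit-learn, where the data is placeholder and the kernel choice is just one common option:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: one 1000-bin bag-of-words histogram per training image.
X_train = np.random.rand(200, 1000)
y_train = np.random.randint(0, 5, size=200)   # 5 object categories

# Any vector-input learner works; an SVM is a common choice.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

# Predict the category of a new image from its bag-of-words vector.
x_new = np.random.rand(1, 1000)
print(clf.predict(x_new))
```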

Learning and recognition with bag-of-words histograms, including unsupervised topic models designed for documents: hierarchical Bayesian text models such as pLSA [Hofmann 2001] and LDA [Blei, Ng & Jordan 2003]. For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005. (Figure credit: Fei-Fei Li)

Example: Probabilistic Latent Semantic Analysis (pLSA) applied to bag-of-words histograms, as in Sivic et al., ICCV 2005. (Figure: pLSA plate diagram over documents d, topics z, and words w; a discovered "face" topic.) [pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/] (Figure credit: Fei-Fei Li)
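
As an illustration of the general idea (not the authors' pLSA code), a topic model can be fit to a matrix of visual word counts; here a minimal sketch using scikit-learn's LDA implementation on placeholder data:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows: images ("documents"); columns: visual word counts (vocabulary of 1000).
word_counts = np.random.randint(0, 5, size=(100, 1000))

# Discover a small number of latent "topics" (e.g., recurring object themes).
lda = LatentDirichletAllocation(n_components=8, random_state=0)
topic_mixtures = lda.fit_transform(word_counts)

# Each image is summarized by its mixture over topics.
print(topic_mixtures.shape)  # (100, 8)
```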

Bags of words: pros and cons
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides a vector representation for sets
+ has yielded good recognition results in practice
- basic model ignores geometry; must verify afterwards, or encode it via features
- background and foreground are mixed when the bag covers the whole image
- interest points or sampling: no guarantee to capture object-level parts
- optimal vocabulary formation remains unclear

Outline
1. Detection with Global Appearance & Sliding Windows
2. Local Invariant Features: Detection & Description
3. Specific Object Recognition with Local Features
(Coffee Break)
4. Visual Words: Indexing, Bags of Words Categorization
5. Matching Local Feature Sets
6. Part-Based Models for Categorization
7. Current Challenges and Research Directions

Basic flow
1. Detect or sample features -> list of positions, scales, orientations
2. Describe features -> associated list of d-dimensional descriptors
3. Either index each one into a pool of descriptors from previously seen images, quantize to form a bag-of-words vector for the image, or compute a match with another image

Local feature correspondences
The matching between sets of local features helps to establish overall similarity between objects or shapes; the assigned matches are also useful for localization.
- Shape context [Belongie & Malik 2001]
- Low-distortion matching [Berg & Malik 2005]
- Match kernel [Wallraven, Caputo & Graf 2003]

Local feature correspondences
Least-cost match: minimize the total cost between matched points, $\min_{\pi : X \rightarrow Y} \sum_{x_i \in X} \left\| x_i - \pi(x_i) \right\|$.
Least-cost partial match: match all of the smaller set to some portion of the larger set.
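
The least-cost matching between two small feature sets can be computed exactly as an assignment problem; a minimal sketch using SciPy on placeholder feature sets, illustrating the optimal matching that the pyramid match (introduced next) approximates:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Two sets of local feature descriptors (placeholder data).
X = np.random.rand(30, 128)   # smaller set
Y = np.random.rand(50, 128)   # larger set

# Pairwise matching costs (Euclidean distances between descriptors).
cost = cdist(X, Y)

# Optimal partial match: every x_i in X is assigned to a distinct y in Y
# so that the total cost is minimized.
rows, cols = linear_sum_assignment(cost)
total_cost = cost[rows, cols].sum()
print(total_cost)
```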

Pyramid match kernel (PMK)
Optimal matching is expensive relative to the number of features per image (m). The PMK is an approximate partial match for efficient discriminative learning from sets of local features.
- Optimal match: O(m^3)
- Greedy match: O(m^2 log m)
- Pyramid match: O(m)
[Grauman & Darrell, ICCV 2005]

Pyramid match kernel: pyramid extraction. Histogram pyramid: level i has bins of size 2^i. (K. Grauman)

Pyramid match kernel: counting matches
Matches at each level are counted via histogram intersection; the difference in histogram intersections across levels (matches at this level minus matches at the previous level) counts the number of newly matched pairs. (K. Grauman)

Pyramid match kernel
For histogram pyramids $\Psi(X)$ and $\Psi(Y)$, the kernel is $K_{\Delta}(\Psi(X), \Psi(Y)) = \sum_i w_i N_i$, where $N_i$ is the number of newly matched pairs at level i and the weight $w_i$ reflects the difficulty of a match at level i. For similarity, the weights are inversely proportional to bin size (or may be learned discriminatively). Kernel values are normalized to avoid favoring large sets. (K. Grauman)
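
A minimal, uniform-bin sketch of the pyramid match score for 1-D features (illustrative only; the actual kernel operates on multi-dimensional histogram pyramids and includes the normalization mentioned above):

```python
import numpy as np

def histogram_intersection(h1, h2):
    return np.minimum(h1, h2).sum()

def pyramid_match(x, y, num_levels=5, feature_range=32):
    """Approximate partial-match score between two 1-D feature multisets."""
    score, prev_intersection = 0.0, 0.0
    for level in range(num_levels):
        bin_size = 2 ** level
        bins = np.arange(0, feature_range + bin_size, bin_size)
        hx, _ = np.histogram(x, bins=bins)
        hy, _ = np.histogram(y, bins=bins)
        intersection = histogram_intersection(hx, hy)
        new_matches = intersection - prev_intersection   # newly matched pairs
        score += (1.0 / bin_size) * new_matches          # weight ~ 1 / bin size
        prev_intersection = intersection
    return score

x = np.random.uniform(0, 32, size=40)   # features of image X (placeholder)
y = np.random.uniform(0, 32, size=60)   # features of image Y (placeholder)
print(pyramid_match(x, y))
```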

Example pyramid match: the pyramid match score is compared against the optimal match score. (K. Grauman)

Pyramid match kernel
- Forms a Mercer kernel -> allows classification with SVMs and use of other kernel methods
- Bounded error relative to the optimal partial match
- Linear time -> efficient learning with large feature sets

(Plots: accuracy and matching time vs. mean number of features on the ETH-80 data set, comparing the match kernel of [Wallraven et al.] (O(m^2)) with the pyramid match (O(m)).)

Pyramid match kernel
For high-dimensional feature spaces, use data-dependent pyramid partitions: vocabulary-guided pyramid bins instead of uniform pyramid bins.
Code for PMK: http://people.csail.mit.edu/jjl/libpmk/

Matching smoothness & local geometry
Solving for a linear assignment means (non-overlapping) features can be matched independently, ignoring relative geometry. One alternative: simply expand each feature vector to include spatial information before matching, e.g., [f_1, ..., f_128, x_a, y_a] for a feature at image position (x_a, y_a).

Spatial pyramid match kernel First quantize descriptors into words, then do one pyramid match per word in image coordinate space. Lazebnik, Schmid & Ponce, CVPR 2006
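
A rough sketch of the spatial-pyramid idea (simplified: per-word histograms over a 2-D grid at several resolutions, compared by weighted histogram intersection; the weights and normalization here follow the general pyramid-match recipe rather than the exact paper, and all data is placeholder):

```python
import numpy as np

def spatial_pyramid_histograms(positions, word_ids, num_words, levels=3):
    """Per-level spatial histograms of visual words; positions lie in [0, 1)^2."""
    pyramids = []
    for level in range(levels):
        cells = 2 ** level                      # cells x cells grid
        hist = np.zeros((num_words, cells, cells))
        cols = np.minimum((positions[:, 0] * cells).astype(int), cells - 1)
        rows = np.minimum((positions[:, 1] * cells).astype(int), cells - 1)
        for w, r, c in zip(word_ids, rows, cols):
            hist[w, r, c] += 1
        pyramids.append(hist)
    return pyramids

def spatial_pyramid_match(pyr_a, pyr_b):
    """Weighted histogram intersection; finer levels get larger weights."""
    levels = len(pyr_a)
    score = 0.0
    for level, (ha, hb) in enumerate(zip(pyr_a, pyr_b)):
        weight = 1.0 / 2 ** (levels - level - 1)
        score += weight * np.minimum(ha, hb).sum()
    return score

# Placeholder features: normalized (x, y) positions and their visual word ids.
pos_a, words_a = np.random.rand(200, 2), np.random.randint(0, 50, 200)
pos_b, words_b = np.random.rand(180, 2), np.random.randint(0, 50, 180)
print(spatial_pyramid_match(
    spatial_pyramid_histograms(pos_a, words_a, 50),
    spatial_pyramid_histograms(pos_b, words_b, 50)))
```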

Matching smoothness & local geometry Use correspondences to estimate a parameterized transformation, regularized to enforce smoothness. Shape context matching [Belongie, Malik, & Puzicha 2001]. Code: http://www.eecs.berkeley.edu/research/projects/cs/vision/shape/sc_digits.html

Matching smoothness & local geometry Let the matching cost include a term that penalizes distortion between pairs of matched features (e.g., comparing the relation R_ij between features i, j in the template with the relation S_i'j' between their matches i', j' in the query). Approximate for efficient solutions: Berg & Malik, CVPR 2005; Leordeanu & Hebert, ICCV 2005. (Figure credit: Alex Berg)

Matching smoothness & local geometry
Compare semi-local features: consider configurations or neighborhoods and co-occurrence relationships.
- Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006]
- Feature neighborhoods [Sivic & Zisserman, CVPR 2004]
- Proximity distribution kernel [Ling & Soatto, ICCV 2007]
- Hyperfeatures [Agarwal & Triggs, ECCV 2006]
- Tiled neighborhood [Quack, Ferrari, Leibe, van Gool, ICCV 2007]

Matching smoothness & local geometry Learn or provide an explicit object-specific shape model. [Next in the tutorial: part-based models] (Figure: part locations x_1, ..., x_6)

Summary
Local features are a useful, flexible representation:
- Invariance properties are typically built into the descriptor.
- Distinctive; especially helpful for identifying specific textured objects.
- Breaking the image into regions/parts gives tolerance to occlusions and clutter.
- Mapping to visual words forms discrete tokens from image regions.
- Efficient methods are available for indexing patches or regions, comparing distributions of visual words, and matching features.