Indexing with local features and bag of words models (10/30/2008)

Today: Indexing with local features and bag of words models
Thursday, Oct 30. Kristen Grauman, UT-Austin.
- Matching local features
- Indexing features
- Bag of words model

Main questions
- Where will the interest points come from? What salient features will we detect in multiple views?
- How do we describe a local region?
- How do we establish correspondences, i.e., compute matches?

Last time: Local invariant features
Problem 1: Detect the same point independently in both images. If we don't, there is no chance to match! We need a repeatable detector.

Harris corner detector: rotation-invariant detection
Algorithm steps:
1. Compute the M matrix within all image windows to get their R scores.
2. Find points with large corner response (R > threshold).
3. Take the points of local maxima of R.
The corner response R is invariant to image rotation: the ellipse rotates, but its shape (i.e., its eigenvalues) remains the same.

Laplacian of Gaussian: scale-invariant detection
The Laplacian-of-Gaussian is a blob detector.
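The Harris algorithm steps above can be sketched in a few lines. This is a minimal illustration, not the exact detector from the slides: a box window stands in for the usual Gaussian weighting, the toy image and parameter values are hypothetical, and the thresholding / local-maxima steps are left to the caller.

```python
import numpy as np

def harris_response(img, k=0.05, win=1):
    """Corner response R = det(M) - k * trace(M)^2 at every pixel.
    Sketch: box window instead of the usual Gaussian weighting."""
    Iy, Ix = np.gradient(img.astype(float))          # image gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def window_sum(a):                               # sum over (2*win+1)^2 window
        out = np.zeros_like(a)
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return out

    Sxx, Syy, Sxy = window_sum(Ixx), window_sum(Iyy), window_sum(Ixy)
    det_M = Sxx * Syy - Sxy * Sxy                    # product of eigenvalues
    trace_M = Sxx + Syy                              # sum of eigenvalues
    return det_M - k * trace_M ** 2

# A white square on black: corners should score above edges and flat regions
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
```

Finding points with R above a threshold, then keeping local maxima of R, completes the detection pipeline described above.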

Laplacian of Gaussian: scale-invariant detection
Interest points are local maxima in the scale space of the Laplacian-of-Gaussian,
    L_xx(σ) + L_yy(σ),
searched over both image position and a stack of scales σ1 < σ2 < ... < σ5. The result is a list of (x, y, σ).

Last time: Local invariant features
Problem 2: For each point, correctly recognize the corresponding one. We need a reliable and distinctive descriptor.

Raw patches as local descriptors
The simplest way to describe the neighborhood around an interest point is to write down the list of intensities to form a feature vector. But this is very sensitive to even small shifts and rotations.

SIFT descriptors [Lowe 2004]
A more robust way to describe the neighborhood: use histograms to bin pixels within sub-patches according to their orientation (0 to 2π). Why sub-patches? Why does SIFT have some illumination invariance?

Rotation-invariant descriptors
Find the local orientation: the dominant direction of the gradient for the image patch. Rotate the patch according to this angle. This puts the patches into a canonical orientation.
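The scale-space search above can be sketched numerically. This is a rough illustration with assumed details (a truncated separable Gaussian, a discrete 5-point Laplacian, a toy blob image): the scale-normalized response σ²(L_xx + L_yy) at the center of a blob of radius a peaks near σ = a/√2 and falls off for much smaller or larger σ.

```python
import numpy as np

def gaussian_blur(img, sigma):
    # Separable Gaussian smoothing with a kernel truncated at 3*sigma
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    g /= g.sum()
    out = np.apply_along_axis(np.convolve, 0, img.astype(float), g, mode="same")
    return np.apply_along_axis(np.convolve, 1, out, g, mode="same")

def scale_normalized_log(img, sigma):
    # sigma^2 * (L_xx + L_yy): scale-normalized Laplacian of Gaussian
    L = gaussian_blur(img, sigma)
    lap = (np.roll(L, 1, 0) + np.roll(L, -1, 0) +
           np.roll(L, 1, 1) + np.roll(L, -1, 1) - 4.0 * L)
    return sigma**2 * lap

# A bright disc of radius 4: the response magnitude at the center should be
# largest near sigma = 4 / sqrt(2) ~ 2.8, and smaller at sigma 1 and 6
yy, xx = np.mgrid[:40, :40]
img = ((yy - 20)**2 + (xx - 20)**2 <= 4**2).astype(float)
responses = {s: abs(scale_normalized_log(img, s)[20, 20]) for s in (1.0, 3.0, 6.0)}
```

Collecting the (x, y, σ) triples where this response is a local maximum over position and scale gives the interest-point list described above.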

Feature descriptors: SIFT
An extraordinarily robust matching technique:
- Can handle changes in viewpoint: up to about 60 degrees of out-of-plane rotation
- Can handle significant changes in illumination, sometimes even day vs. night
- Fast and efficient: can run in real time
- Lots of code available: http://people.csail.mit.edu/albert/ladypack/wiki/index.php/known_implementations_of_sift

Interest points + descriptors
So far we have methods to find interest points and describe the surrounding image neighborhood. This maps each image to a list of local descriptors x1, x2, ..., each e.g. 128-dimensional. (Steve Seitz) How many detections will an image have? Many!

Many existing detectors available
Hessian & Harris [Beaudet 78], [Harris 88]; Laplacian, DoG [Lindeberg 98], [Lowe 1999]; Harris-/Hessian-Laplace [Mikolajczyk & Schmid 01]; Harris-/Hessian-Affine [Mikolajczyk & Schmid 04]; EBR and IBR [Tuytelaars & Van Gool 04]; MSER [Matas 02]; Salient Regions [Kadir & Brady 01]; and others.

You can try it at home: for most local feature detectors, executables are available online:
http://robots.ox.ac.uk/~vgg/research/affine
http://www.cs.ubc.ca/~lowe/keypoints/
http://www.vision.ee.ethz.ch/~surf
http://www.robots.ox.ac.uk/~vgg/research/affine/detectors.html#binaries

Main questions
- Where will the interest points come from? What salient features will we detect in multiple views?
- How do we describe a local region?
- How do we establish correspondences, i.e., compute matches?

Matching local features
To generate candidate matches between image 1 and image 2, find patches that have the most similar appearance (e.g., lowest SSD). Simplest approach: compare them all, and take the closest (or the closest k, or all within a thresholded distance).

Ambiguous matches
In the stereo case, we may constrain matches by proximity if we make assumptions on maximum disparities. But at what SSD value do we have a good match? To add robustness to matching, we can consider the ratio: distance to the best match / distance to the second-best match. If the ratio is low, the first match looks good; if it is high, the match is ambiguous.

Applications of local invariant features & matching
- Wide baseline stereo
- Motion tracking
- Panoramas
- Mobile robot navigation
- 3D reconstruction
- Recognition: specific objects, textures, categories

Wide baseline stereo
[Image from T. Tuytelaars, ECCV 2006 tutorial]
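The nearest-neighbor matching with the ratio test described above can be sketched as follows. The 2-D toy descriptors and the 0.8 threshold are illustrative stand-ins for real (e.g., 128-D SIFT) descriptors and a tuned threshold.

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbor in desc2,
    keeping a match only if the best distance is clearly smaller than
    the second-best (the ratio test). A minimal sketch."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # distance to every candidate
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:     # low ratio = unambiguous match
            matches.append((i, int(best)))
    return matches

desc1 = np.array([[0.0, 0.0], [5.0, 5.0], [2.0, 2.0]])
desc2 = np.array([[0.1, 0.0], [5.0, 5.1], [2.0, 2.1], [2.1, 2.0]])
matches = match_descriptors(desc1, desc2)
```

Note that the third query descriptor has two nearly equidistant candidates, so the ratio test rejects it: exactly the ambiguous case discussed above.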

Panorama stitching / automatic mosaicing
Brown, Szeliski, and Winder, 2005. http://www.cs.ubc.ca/~mbrown/autostitch/autostitch.html

Recognition of specific objects and scenes
Schmid and Mohr 1997; Sivic and Zisserman 2003; Rothganger et al. 2003; Lowe 2002.

Recognition of categories
Constellation model: Weber et al. (2000), Fergus et al. (2003). Bags of words: Csurka et al. (2004), Dorko & Schmid (2005), Sivic et al. (2005), Lazebnik et al. (2006). [Slide from Lana Lazebnik, Sicily 2006]

Value of local features
- Critical to find distinctive and repeatable local regions for multi-view matching
- Complexity reduction via selection of distinctive points
- Describe images, objects, and parts without requiring segmentation; robustness to clutter and occlusion
- Robustness: similar descriptors in spite of moderate view changes, noise, blur, etc.

Today
- Matching local features
- Indexing features
- Bag of words model

Indexing local features
Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT). When we see close points in feature space, we have similar descriptors, which indicates similar local content. This is of interest not only for 3D reconstruction, but also for retrieving images of similar objects. (Figure credit: A. Zisserman)

With potentially thousands of features per image, and hundreds to millions of images to search, how do we efficiently find those that are relevant to a new image?
- Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search.
- High-dimensional descriptors: approximate nearest neighbor search methods are more practical.
- Inverted file indexing schemes.

Indexing local features: inverted file index
Text retrieval vs. image search: for text documents, an efficient way to find all pages on which a word occurs is to use an index. We want to find all images in which a feature occurs. To use this idea, we'll need to map our features to "visual words". What makes the problems similar? Different?
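The inverted file idea above is simple to sketch. Word IDs and image names here are illustrative; in practice the words would come from quantized descriptors.

```python
from collections import defaultdict

# Inverted file: for each visual word, the set of images in which it occurs
index = defaultdict(set)

def add_image(image_id, words):
    for w in set(words):
        index[w].add(image_id)

def query(words):
    # Candidate images: those sharing at least one visual word with the query.
    # Only the posting lists of the query's words are touched, not every image.
    hits = set()
    for w in set(words):
        hits |= index.get(w, set())
    return hits

add_image("img1", [3, 17, 42])
add_image("img2", [17, 99])
add_image("img3", [5, 42])
```

This is where the efficiency gain comes from: query time scales with the lengths of a few posting lists rather than with the size of the whole database.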

Visual words: main idea
Extract some local features from a number of images. Each point in the descriptor space is a local descriptor, e.g. a 128-dimensional SIFT vector. (Slide credit: D. Nister)

Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space:
- Quantize via clustering; let the cluster centers be the prototype "words".
- Determine which word to assign to each new image region by finding the closest cluster center.

Visual words
Example: each group of patches belongs to the same visual word. (Figure from Sivic & Zisserman, ICCV 2003)

Visual words: texture representation
First explored for texture and material representations. Texton = cluster center of filter responses over a collection of images. Describe textures and materials based on the distribution of prototypical texture elements. (Leung & Malik 1999; Varma & Zisserman 2002; Lazebnik, Schmid & Ponce 2003)

Recall: texture representation example
Use statistics to summarize patterns in small windows, e.g. dimension 1 = mean d/dx value, dimension 2 = mean d/dy value. Windows with primarily vertical edges, primarily horizontal edges, both, or small gradients in both directions land in different regions of this space.

    Window    mean d/dx    mean d/dy
    Win #1        4            10
    Win #2       18             7
    Win #9       20            20

Texture is characterized by the repetition of basic elements, or textons. For stochastic textures, it is the identity of the textons, not their spatial arrangement, that matters. (Julesz 1981; Cula & Dana 2001; Leung & Malik 2001; Mori, Belongie & Malik 2001; Schmid 2001; Varma & Zisserman 2002, 2003; Lazebnik, Schmid & Ponce 2003. Source: Lana Lazebnik)
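The quantize-via-clustering step can be sketched with a tiny k-means. This is a toy version with assumed details: 2-D points instead of 128-D SIFT descriptors, and a simple deterministic initialization (evenly spaced samples) rather than the random restarts used in practice.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Tiny k-means: the resulting cluster centers are the visual 'words'."""
    centers = X[:: len(X) // k][:k].copy()   # deterministic init for the sketch
    for _ in range(iters):
        # assign every descriptor to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def assign_word(desc, centers):
    """Quantize a new descriptor to the closest cluster center (visual word)."""
    return int(np.argmin(((centers - desc) ** 2).sum(-1)))

# Two well-separated clusters of toy 'descriptors'
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centers = kmeans(X, 2)
w_a = assign_word(np.array([0.0, 0.0]), centers)
w_b = assign_word(np.array([5.0, 5.0]), centers)
```

New image regions are then described by the IDs returned by `assign_word`, exactly as in the second bullet above.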

Visual words
More recently used for describing scenes and objects for the sake of indexing or classification. (Sivic & Zisserman 2003; Csurka, Bray, Dance & Fan 2004; and many others. Image credit: A. Zisserman)

Inverted file index for images comprised of visual words
For each word number, store the list of image numbers in which it occurs. When will this give us a significant gain in efficiency?

Analogy to documents
If a local image region is a visual word, how can we summarize an image (the document)? [The slide shows two example documents with their distinctive words highlighted: a passage on visual perception, summarized by words such as sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel; and a news passage on China's trade surplus, summarized by words such as China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value.] (ICCV 2005 short course, L. Fei-Fei)

Object / bag of "words". (ICCV 2005 short course, L. Fei-Fei)

Bags of visual words
Summarize an entire image based on its distribution (histogram) of word occurrences. Analogous to the bag of words representation commonly used for documents. (Image credit: Fei-Fei Li)
Similarly, bags of textons for texture representation: a histogram over a universal texton dictionary. (Julesz 1981; Cula & Dana 2001; Leung & Malik 2001; Mori, Belongie & Malik 2001; Schmid 2001; Varma & Zisserman 2002, 2003; Lazebnik, Schmid & Ponce 2003. Source: Lana Lazebnik)

Comparing bags of words
Rank frames by the normalized scalar product between their (possibly weighted) occurrence counts, e.g. [1 8 1 4] vs. [5 1 1 0]: a nearest neighbor search for similar images.

    sim(d_j, q) = (d_j . q) / (||d_j|| ||q||)

tf-idf weighting (term frequency-inverse document frequency)
Describe a frame by the frequency of each word within it, and downweight words that appear often in the database (the standard weighting for text retrieval):

    t_i = (n_id / n_d) * log(N / n_i)

where n_id = number of occurrences of word i in document d, n_d = number of words in document d, n_i = number of occurrences of word i in the whole database, and N = total number of documents in the database.

Bags of words for content-based image retrieval
What if the query of interest is a portion of a frame? (Slides from Andrew Zisserman; Sivic & Zisserman, ICCV 2003)
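The weighting-and-ranking scheme above can be sketched end to end. This toy version uses the common document-frequency form of idf (log of N over the number of documents containing the word), a slight variant of the slide's whole-database count; the image names and word lists are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(words, vocab, n_docs, doc_freq):
    # tf-idf weight per vocabulary word: (n_id / n_d) * log(N / df_i).
    # Words that occur in every document get weight 0 (idf = log 1 = 0).
    counts = Counter(words)
    n_d = len(words)
    return [(counts[w] / n_d) * math.log(n_docs / doc_freq[w]) if doc_freq.get(w) else 0.0
            for w in vocab]

def cosine(u, v):
    # Normalized scalar product used to rank frames
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

database = {"imgA": [1, 1, 2], "imgB": [2, 3, 3], "imgC": [1, 2, 3]}
vocab = [1, 2, 3]
N = len(database)
doc_freq = {w: sum(w in words for words in database.values()) for w in vocab}

vectors = {name: tfidf_vector(words, vocab, N, doc_freq) for name, words in database.items()}
query = tfidf_vector([1, 1, 2], vocab, N, doc_freq)
ranked = sorted(database, key=lambda name: cosine(query, vectors[name]), reverse=True)
```

Note how word 2, which occurs in every image, contributes nothing to the ranking: that is the "downweight words that appear often in the database" effect.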

Video Google system [Sivic & Zisserman, ICCV 2003]
1. Collect all words within the query region.
2. Use the inverted file index to find relevant frames.
3. Compare word counts.
4. Spatial verification.
Given a query region, the system returns retrieved frames. Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html

Collecting words within a query region
Query region: pull out only the SIFT descriptors whose positions are within the polygon.

Bag of words representation: spatial info
A bag of words is an orderless representation: it throws out spatial relationships between features. Middle ground:
- Visual "phrases": frequently co-occurring words
- Semi-local features: describe configuration, neighborhood
- Let position be part of each feature
- Count bags of words only within sub-grids of an image
- After matching, verify spatial consistency (e.g., look at neighbors: are they the same too?)
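One of the middle-ground options above, counting bags of words within sub-grids, is easy to sketch. Coordinates, the vocabulary size, and the feature tuples here are all hypothetical.

```python
def grid_bow(features, vocab_size, width, height, grid=2):
    """One bag-of-words histogram per grid cell, keeping coarse spatial info.
    A sketch: features are (x, y, word) tuples from a width x height image."""
    hists = [[0] * vocab_size for _ in range(grid * grid)]
    for x, y, word in features:
        col = min(int(x * grid / width), grid - 1)   # which cell the feature falls in
        row = min(int(y * grid / height), grid - 1)
        hists[row * grid + col][word] += 1
    return hists

# Hypothetical features: word 0 in the top-left, word 1 in the bottom-right
feats = [(10, 10, 0), (15, 20, 0), (90, 85, 1)]
hists = grid_bow(feats, vocab_size=2, width=100, height=100)
```

Concatenating the per-cell histograms gives a single vector that still distinguishes "word 0 top-left, word 1 bottom-right" from the reverse arrangement, which a plain orderless bag cannot.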

Visual vocabulary formation
Issues:
- Sampling strategy: where to extract features?
- Clustering / quantization algorithm
- Unsupervised vs. supervised
- What corpus provides the features (a universal vocabulary?)
- Vocabulary size, number of words

Sampling strategies
Sparse, at interest points; multiple interest operators; dense, uniformly; randomly. (Image credits: F-F. Li, E. Nowak, J. Sivic)
- To find specific, textured objects, sparse sampling from interest points is often more reliable.
- Multiple complementary interest operators offer more image coverage.
- For object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006]

Clustering / quantization methods
k-means (the typical choice), agglomerative clustering, mean-shift, and others. Tree construction via hierarchical clustering allows faster insertion / word assignment while still allowing large vocabularies.

Example: recognition with a vocabulary tree
Vocabulary tree [Nister & Stewenius, CVPR 2006]. Training: filling the tree. (Slide credit: David Nister)

Vocabulary tree: training
Filling the tree with the descriptors from the training images. (Slide credit: David Nister)

What is the computational advantage of the hierarchical bag of words representation vs. a flat vocabulary?

Vocabulary tree: recognition
RANSAC verification. (Slide credit: David Nister)

Vocabulary tree: performance
Evaluated on large databases, indexing with up to 1M images. Online recognition for a database of 50,000 CD covers, with retrieval in ~1s. They find experimentally that large vocabularies can be beneficial for recognition.
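The computational-advantage question above has a back-of-the-envelope answer: with branching factor k and depth L, assigning one descriptor costs k comparisons at each of L levels, versus one comparison per word in a flat vocabulary of the same size. The values of k and L below are illustrative (chosen to give a million-word vocabulary, in the spirit of Nister & Stewenius), not figures from the paper.

```python
# Branching factor k and depth L are illustrative assumptions
k, L = 10, 6
n_words = k ** L          # leaf count: the effective vocabulary size
flat_cost = n_words       # flat vocabulary: one distance per cluster center
tree_cost = k * L         # tree: k distances per level, L levels
speedup = flat_cost / tree_cost
```

So a million-word vocabulary costs about 60 distance computations per descriptor instead of a million: the tree is what makes very large vocabularies practical.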

Recap: indexing features
Larger vocabularies can be advantageous. But what happens if the vocabulary is too large?
Pipeline:
1. Detect or sample features: a list of positions, scales, orientations.
2. Describe features: an associated list of d-dimensional descriptors.
3. Either index each descriptor into the pool of descriptors from previously seen images, or quantize to form a bag of words vector for the image.

Learning and recognition with bag of words histograms
The bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples). It provides an easy way to use the distribution of feature types with various learning algorithms requiring vector input.

Bags of features for object recognition
Works pretty well for image-level classification: face, flowers, building. (Csurka et al. 2004; Willamowski et al. 2005; Grauman & Darrell 2005; Sivic et al. 2003, 2005. Source: Lana Lazebnik)
On the Caltech6 dataset: bag of features vs. a parts-and-shape model. (Source: Lana Lazebnik)

Bags of words: pros and cons
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides a vector representation for sets
+ has yielded good recognition results in practice
- the basic model ignores geometry: must verify afterwards, or encode it via the features
- background and foreground are mixed when the bag covers the whole image
- interest points or sampling: no guarantee to capture object-level parts
- optimal vocabulary formation remains unclear
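As a toy illustration of feeding these fixed-length vectors to a learner, here is a 1-nearest-neighbor classifier on normalized bag of words histograms. The labels, word lists, vocabulary size, and the choice of L1 distance are all hypothetical; any vector-input learner (SVM, etc.) could take the histogram's place.

```python
def bow_histogram(words, vocab_size):
    # Normalized histogram of word occurrences: a fixed-length vector,
    # regardless of how many features the image produced
    hist = [0.0] * vocab_size
    for w in words:
        hist[w] += 1.0
    total = sum(hist)
    return [c / total for c in hist] if total else hist

def nn_classify(query_words, training, vocab_size):
    """1-nearest-neighbor on BoW histograms with L1 distance (a sketch)."""
    q = bow_histogram(query_words, vocab_size)

    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    best = min(training, key=lambda ex: l1(q, bow_histogram(ex[0], vocab_size)))
    return best[1]

# Toy training set: (word list, label) pairs with hypothetical labels
training = [([0, 0, 1], "face"), ([2, 2, 3], "car")]
label = nn_classify([0, 1, 1], training, vocab_size=4)
```

The point of the sketch is the interface: once every image is a fixed-dimension histogram, any off-the-shelf learner applies directly.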

Summary
- Local invariant features: distinctive matches are possible in spite of significant view change; useful not only to provide matches for multi-view geometry, but also to find objects and scenes.
- To find correspondences among detected features, measure distances between descriptors and look for the most similar patches.
- Bag of words representation: quantize the feature space to make a discrete set of visual words; summarize an image by its distribution of words; index individual words.
- Inverted index: pre-compute the index to enable faster search at query time.

Next week: Object recognition