Lecture 12: Recognition
Davide Scaramuzza
Institute of Informatics, Institute of Neuroinformatics
http://rpg.ifi.uzh.ch/

Today's lab exercise is replaced by a Deep Learning Tutorial by Antonio Loquercio, in room ETH HG E 1.1, from 13:15 to 15:00. An optional lab exercise is online: k-means clustering and Bag of Words place recognition.

Place Recognition
Robotics: Has the robot been to this place before? Which images were taken around the same location?
Image retrieval: Have I seen this image before? Which images in my database look similar to it? E.g., Google Reverse Image Search.

Place Recognition / Image Retrieval. [Figure: a query image and its retrieval results on a database of 100 million images.]

How much is 100 million images? If each sheet of paper were 0.1 mm thick, a stack of 100 million printed images would be 10 km tall.

[Image slides visualizing the size of the database. Slide credit: Nister]

Fast visual search: how do we query an image in a database of 100 million images in less than 6 seconds?
[Video Google, Sivic & Zisserman, ICCV 2003]
[Scalable Recognition with a Vocabulary Tree, Nister & Stewenius, CVPR 2006]

Visual Place Recognition
Goal: query an image in a database of N images.
Complexity: N·M² feature comparisons (assuming each image has M features).
Example:
- assume 1,000 SIFT features per image, so M = 1,000
- assume N = 100,000,000
- then N·M² = 100,000,000,000,000 feature comparisons!
- if we assume 0.1 ms per feature comparison, 1 image query would take 317 years!
Solution: use an inverted file index! The complexity reduces to O(M).
[Video Google, Sivic & Zisserman, ICCV'03]
[Scalable Recognition with a Vocabulary Tree, Nister & Stewenius, CVPR'06]
See also FABMAP and Gálvez-López '12 (DBoW2).
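As a sanity check, here is a minimal Python sketch of this back-of-the-envelope computation; the constants are the slide's assumptions:

```python
# Brute-force matching cost for querying one image against the database.
N = 100_000_000   # images in the database
M = 1_000         # SIFT features per image
t = 0.1e-3        # seconds per feature comparison (slide's assumption)

comparisons = N * M**2                 # each query feature vs. every feature of every image
seconds = comparisons * t
print(f"{comparisons:.0e} comparisons -> {seconds / (3600 * 24 * 365):.0f} years")
# 1e+14 comparisons -> 317 years
```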

Indexing local features: inverted file index
For text documents, an efficient way to find all pages on which a word occurs is to use an index. Analogously, we want to find all images in which a feature occurs.
How many distinct SIFT or BRISK features exist?
- SIFT: infinite (real-valued descriptor)
- BRISK-128: 2^128 ≈ 3.4 × 10^38
Since the number of possible image features may be infinite, before we build our visual vocabulary we need to map the features to "visual words". Borrowing from text retrieval, we must:
- define what a visual word is;
- define a vocabulary of visual words.
This approach is known as Bag of Words (BoW).
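To make the index concrete before defining visual words, here is a minimal inverted-file sketch in Python; the word ids and function names are illustrative, not from the lecture:

```python
from collections import defaultdict

# Inverted file index: visual word id -> list of image ids containing that word.
inverted_index = defaultdict(list)

def add_image(image_id, word_ids):
    """Index an image given the visual words extracted from it."""
    for w in set(word_ids):           # each word points to the images it appears in
        inverted_index[w].append(image_id)

add_image(0, [101, 103, 105])
add_image(1, [103, 180])
print(inverted_index[103])            # -> [0, 1]: all images containing word 103
```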

How to extract visual words from descriptors:
1. Collect a large enough dataset that is representative of all the images relevant to your application (e.g., for automotive place recognition, you may want to collect millions of street images sampled around the world).
2. Extract features and descriptors from each image and map them into the same descriptor space (e.g., for SIFT, a 128-dimensional descriptor space).
3. Cluster the descriptor space into K clusters. The centroid of each cluster is a visual word, computed as the arithmetic average of all the descriptors within that cluster: e.g., for SIFT, each cluster contains SIFT features that are very similar to each other, and the visual word is then the average of all the SIFT descriptors in that cluster.
Let's see an example; a code sketch of the pipeline follows below.
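A minimal sketch of this pipeline, assuming OpenCV's SIFT implementation and scikit-learn's KMeans; the image list and the choice of K are placeholders:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, K):
    """Cluster SIFT descriptors from a training set; the K centroids are the visual words."""
    sift = cv2.SIFT_create()
    all_descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)    # desc: (num_features, 128)
        if desc is not None:
            all_descriptors.append(desc)
    X = np.vstack(all_descriptors)                    # all descriptors in one 128-D space
    kmeans = KMeans(n_clusters=K).fit(X)              # partition the descriptor space
    return kmeans.cluster_centers_                    # (K, 128): one visual word per cluster
```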

Extracting and mapping descriptors into the (feature) descriptor space. For convenience we assume that the descriptor space has 2 dimensions; in the case of SIFT, it would have 128 dimensions!


Cluster the descriptor space into K clusters.

Extract the centroids: these are the visual words.

Summary: image collection → extract features → cluster descriptors in the descriptor space. What is a visual word? A visual word is the centroid of a cluster of similar features (i.e., similar descriptors). [Figure: examples of features belonging to the same clusters.]

How do we cluster the descriptor space? k-means clustering is an algorithm that partitions n data points into k clusters, in which each data point x belongs to the cluster S_i with center m_i. It minimizes the sum of squared Euclidean distances between the points x and their nearest cluster centers m_i:

D(X, M) = \sum_{i=1}^{k} \sum_{x \in S_i} (x - m_i)^2

Algorithm:
- Randomly initialize the k cluster centers.
- Iterate until convergence:
  - Assign each data point x_j to the nearest center m_i.
  - Recompute each cluster center as the mean of all points assigned to it.
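A minimal numpy implementation of this algorithm (a sketch: single random initialization, no handling of empty clusters):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Partition the rows of X into k clusters (Lloyd's algorithm)."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]        # random initial centers
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, labels
```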

K-means demo. Source: http://shabal.in/visuals/kmeans/1.html

Applying Visual Words to Image Retrieval
Inverted file index: lists all visual words in the vocabulary (extracted at training time). Each word points to the list of images, from the whole image database (DB), in which that word appears. The DB grows as the robot navigates and collects new images.
Voting array: has as many cells as there are images in the DB. Each word in the query image votes for multiple images.
Querying 1 image is independent of the number of images in the database.
[Figure: a query image Q, its visual words (101, 103, 105, 180, ...), the inverted-file list of images for each word, and the +1 votes cast into the voting array for Q.]
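A sketch of the voting step described above, reusing the inverted_index structure from the earlier sketch:

```python
import numpy as np

def query(word_ids, inverted_index, num_db_images):
    """Score every database image by counting votes from the query's visual words."""
    votes = np.zeros(num_db_images)                  # voting array: one cell per DB image
    for w in word_ids:                               # each query word votes for all
        for image_id in inverted_index.get(w, ()):   # images in its inverted-file list
            votes[image_id] += 1
    return votes.argsort()[::-1]                     # image ids, best match first
```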

Drawback: every feature in the query image still needs to be compared against all the visual words in the vocabulary.
Example:
- assume our query image has 1,000 SIFT features, so M = 1,000
- assume 1,000,000 visual words
- number of feature comparisons = 1,000 × 1,000,000 = 1,000,000,000
- if we assume 0.1 ms per feature comparison, 1 image query would take 28 hours!
How can we make the comparison cheaper, e.g., less than 6 seconds? Solution: use hierarchical clustering [Scalable Recognition with a Vocabulary Tree, Nister & Stewenius, CVPR'06].

Hierarchical clustering
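A recursive sketch of hierarchical k-means, reusing the kmeans function from the sketch above; B (branch factor) and L (number of levels) follow the notation used later in this lecture:

```python
def build_tree(X, B, L):
    """Recursively cluster the descriptors into B children per node, L levels deep."""
    if L == 0 or len(X) < B:
        return None                                   # leaf node: stop splitting
    centers, labels = kmeans(X, B)                    # B cluster centers at this level
    children = [build_tree(X[labels == i], B, L - 1) for i in range(B)]
    return {"centers": centers, "children": children}
```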


Building the inverted file index


Retrieval: thanks to hierarchical clustering, how many comparisons between a query feature and the visual words need to be done, with B branches and L levels? Only B × L per feature: at each of the L levels, the feature is compared against the B children of the current node and then descends into the nearest one.
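A sketch of this descent, assuming the node dictionaries produced by the build_tree sketch above; each loop iteration performs exactly B comparisons:

```python
import numpy as np

def quantize(desc, node):
    """Descend the vocabulary tree: B comparisons per level, at most L levels."""
    path = []
    while node is not None:
        dists = np.linalg.norm(node["centers"] - desc, axis=1)  # B comparisons
        best = int(dists.argmin())
        path.append(best)                    # remember which branch was taken
        node = node["children"][best]
    return path                              # the leaf reached identifies the visual word
```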

Example: query an image in a database of 100 million images.
- assume our query image has 1,000 SIFT features, so M = 1,000
- assume 10 branches and 6 levels (i.e., 10^6 = 1,000,000 visual words)
- number of feature comparisons = 1,000 × 10 × 6 = 60,000
- if we assume 0.1 ms per feature comparison, 1 image query takes 6 seconds!

Robust object/scene recognition
The visual vocabulary discards the spatial relationships between features: two images with the same features shuffled around will return a 100% match when using appearance information only. This can be overcome with geometric verification: test the h most similar images to the query image for geometric consistency (e.g., using 5- or 8-point RANSAC) and retain the image with the smallest reprojection error and the largest number of inliers.
Further reading (out of the scope of this course): [Cummins and Newman, IJRR 2011], [Stewénius et al., ECCV 2012].
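A sketch of such a re-ranking step, using OpenCV's RANSAC-based fundamental-matrix estimation as one possible geometric check; the function and threshold are illustrative choices, not the lecture's exact pipeline:

```python
import cv2
import numpy as np

def geometric_score(pts_query, pts_candidate):
    """Count the RANSAC inliers of an epipolar-geometry fit between matched keypoints.

    pts_query, pts_candidate: (N, 2) float arrays of matched keypoint locations, N >= 8.
    """
    F, mask = cv2.findFundamentalMat(pts_query, pts_candidate, cv2.FM_RANSAC,
                                     ransacReprojThreshold=1.0, confidence=0.99)
    return 0 if mask is None else int(mask.sum())     # more inliers = more consistent

# Re-rank the h most similar images by their inlier count and keep the best one.
```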

Video Google System
1. Collect all words within the query region.
2. Use the inverted file index to find relevant frames.
3. Compare word counts.
4. Spatial verification.
[Figure: query region and retrieved frames.]
Sivic & Zisserman, ICCV 2003. Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/

More words is better. [Figure: retrieval performance as a function of the number of visual words, for different branch factors.]

Higher branch factor works better (but slower).

FABMAP [Cummins and Newman, IJRR 2011]
- Place recognition for robot localization.
- Uses training images to build the BoW database.
- Captures the spatial dependencies of visual words to distinguish the most characteristic structure of each scene.
- Probabilistic model of the world; for each new frame it computes P(being at a known place) and P(being at a new place).
- Very high performance.
- Binaries available online (OpenFABMAP).

Things to remember:
- K-means clustering
- Bag of Words approach: what a visual word is; the inverted file index and how it works
- Reading: Chapter 14 of Szeliski's book

Understanding Check: are you able to answer the following questions?
- What is an inverted file index?
- What is a visual word?
- How does k-means clustering work?
- Why do we need hierarchical clustering?
- Explain and illustrate image retrieval using Bag of Words.
- Discussion on place recognition: what are the open challenges, and what solutions have been proposed?