Colorado School of Mines. Computer Vision. Professor William Hoff, Dept of Electrical Engineering & Computer Science.


Professor William Hoff, Dept of Electrical Engineering & Computer Science. http://inside.mines.edu/~whoff/

Object Recognition in Large Databases. Some material for these slides comes from www.cs.utexas.edu/~grauman/courses/spring2011/slides/lecture18_index.pptx

Object Recognition. We have seen methods to do object recognition by matching a query image to a database image: extract features from each image, match their descriptors, and impose a constraint (such as a homography or the fundamental matrix) to eliminate mismatches. If a sufficient number of matches remain, we have found our object! Problem: as the database gets large, the time it takes to match a new image against each database image can become prohibitive.

Example Application: Location Recognition. Match a new image to a database of images to determine where the camera was when it took the picture. What uses can this have?

Indoor Localization. GPS is not available indoors. Chris Card, Qualitative Image Based Localization in a Large Building, MS Thesis, 2015.

Similarity of Appearance. In a large building, there can be many locations that have a similar appearance: walls and floors often have little or no texture, and doors look very similar.

Brown Hall Mapping (Chris Card). 1st, 2nd, and 3rd floors of Brown Hall; 1,382 images taken at known locations. Given a new image, match it to an image in the database. C. Card and W. Hoff, "Qualitative Image Based Localization in a Large Building," Proc. of 19th International Conference on Image Processing, Computer Vision, & Pattern Recognition, 2015.


Approach. Instead of comparing the query image to every image in the database one at a time, first narrow down the search to a few likely candidate matching images. Then use a more detailed verification stage on those.

Approach. Create an index of feature descriptors: a table containing the features, along with the images in which the features appeared, similar to the index in a text document. We'll look at two methods: hashing and bag of words.

Hashing. Transform a feature descriptor into a shorter key that indexes into a table. Store the feature keypoint there, along with the id of the image it came from. For example, the hash table might hold entries such as (feature 11, image 3), (feature 21, image 7), and (feature 65, image 7).
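The storage side of this idea can be sketched in a few lines. This is only an illustration, not Card's actual scheme: the hypothetical `hash_key` function simply keeps the high bits of each descriptor dimension, so similar descriptors often (but not always) land in the same bucket.

```python
def hash_key(descriptor, bits=4):
    """Map a descriptor (tuple of 0-255 ints) to a short key by keeping high bits."""
    return tuple(d >> (8 - bits) for d in descriptor)

def build_index(features):
    """features: list of (descriptor, image_id) pairs -> dict from key to image ids."""
    table = {}
    for desc, image_id in features:
        table.setdefault(hash_key(desc), []).append(image_id)
    return table

# Toy 3-dimensional descriptors; the first two are near-identical,
# so they hash to the same bucket.
features = [((200, 16, 240), 3),   # feature from image 3
            ((201, 17, 241), 7),   # similar feature from image 7
            ((10, 250, 5), 7)]
index = build_index(features)
print(index[hash_key((200, 16, 240))])  # images sharing this bucket -> [3, 7]
```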

Matching. To match a query image, extract feature descriptors and map them into the hash table. Retrieve the stored features (and their corresponding image ids) from those locations in the table. Images with a high number of matching features are taken to be candidate matching images. Problem: image noise can perturb a feature descriptor so that it no longer exactly matches the descriptor from the corresponding database image. This can cause the hash function to map the query descriptor to a completely different location in the hash table.

Locality Sensitive Hashing (LSH). In LSH, the hash function preserves the locality of feature descriptors: if two features are close in feature space, then their hashes will also be close. The difference between hashes is equivalent to distance in feature space [1]. If noise perturbs a query feature descriptor, it will map to a location in the hash table that is close to the correct location. So when mapping a query descriptor to the hash table, you should also retrieve entries from the table that are near the mapped location. [1] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the 20th Annual Symposium on Computational Geometry, pp. 253-262, 2004.
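A minimal sketch of one common LSH family, random-hyperplane hashing (a different family from the p-stable projections of Datar et al., but it shows the same locality property): each bit of the key records which side of a random hyperplane the descriptor falls on, so nearby descriptors share most key bits.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Fixed random hyperplanes; one hash bit per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(desc, planes):
    """Bit = which side of each hyperplane the descriptor lies on."""
    return tuple(1 if sum(p * d for p, d in zip(plane, desc)) >= 0 else 0
                 for plane in planes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(dim=4, n_bits=16)
a = (1.0, 2.0, 3.0, 4.0)
b = (1.1, 2.0, 2.9, 4.1)    # a slightly perturbed copy of a
c = (-4.0, 3.0, -1.0, 0.5)  # an unrelated descriptor
# The perturbed copy typically gets a much closer key than the unrelated
# descriptor, so a query can also probe the few adjacent buckets.
print(hamming(lsh_key(a, planes), lsh_key(b, planes)),
      hamming(lsh_key(a, planes), lsh_key(c, planes)))
```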

Example Features: ORB. ORB consists of a feature detector and a descriptor: FAST is the feature detector and BRIEF is the feature descriptor. The descriptor is 32 bytes per feature (SURF uses 64, SIFT uses 128).

Feature Matching. Ratio test: when matching a query feature, the distance (in feature space) to the closest matching feature in a database image must be less than 80% of the distance to the second-closest feature from the same database image. Spatial consistency: neighbors of the query point (in image space) should have matches that are neighbors of the database point.
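The ratio test above can be sketched as follows, using toy 2-D "descriptors" for illustration:

```python
import math

def ratio_test_match(query_desc, db_descs, ratio=0.8):
    """Return the index of the best match in db_descs, or None if ambiguous.

    A match is kept only if the best distance is less than `ratio` times
    the second-best distance (the 80% threshold from the slide).
    """
    order = sorted(range(len(db_descs)),
                   key=lambda i: math.dist(query_desc, db_descs[i]))
    best, second = order[0], order[1]
    if math.dist(query_desc, db_descs[best]) < ratio * math.dist(query_desc, db_descs[second]):
        return best
    return None

db = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(ratio_test_match((0.1, 0.0), db))   # clear winner -> 0
print(ratio_test_match((5.05, 5.0), db))  # two near-equal candidates -> None
```

Rejecting matches whose two best distances are nearly equal discards exactly the ambiguous cases that cause mismatches in repetitive indoor scenes.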

Verification. The N database images with the highest number of matches are candidate images (N = 2 in Card's work). Each candidate image is then sent through a verification step, fitting a fundamental matrix using RANSAC.

Evaluation. 1,382 images; 1,073,903 feature points; test set of 70 images. The threshold on the number of inliers to the fundamental matrix is varied. For a threshold of 16, the TPR is 94% and the FPR is 17%.

Examples of true positives were shown (query and retrieved database images, with corresponding epipolar lines), along with one false negative and several false positives as query/retrieved image pairs.

Bag of Words. In a large database, there can be a lot of features to store. Instead of storing all of them, we can quantize the descriptors into visual words. The number of possible words (the vocabulary) is relatively small. Then each image can be described by a histogram of the visual words (i.e., a "bag of words").

Visual words: main idea. Extract some local features from a number of images, e.g., SIFT. Each point in descriptor space is a local descriptor, e.g., a 128-dimensional SIFT vector. Slide credit: D. Nister, CVPR 2006

Visual words. Map high-dimensional descriptors to tokens/words by quantizing the feature space. Quantize via clustering, and let the cluster centers be the prototype words. Determine which word to assign to each new image region by finding the closest cluster center. Kristen Grauman
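The assignment step can be sketched directly. In practice the vocabulary (cluster centers) would come from k-means over many training descriptors; the 2-D prototypes below are made up for illustration.

```python
import math

# Hypothetical vocabulary of three prototype "words" in a 2-D feature space.
vocabulary = [(0.0, 0.0),   # word 0
              (10.0, 0.0),  # word 1
              (0.0, 10.0)]  # word 2

def assign_word(descriptor, vocab):
    """Assign a descriptor to its nearest cluster center (visual word)."""
    return min(range(len(vocab)), key=lambda w: math.dist(descriptor, vocab[w]))

print(assign_word((9.2, 1.1), vocabulary))  # nearest center is word 1
```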

Visual words. Example: each group of patches belongs to the same visual word. Figure from Sivic & Zisserman, ICCV 2003. Kristen Grauman

Similarity to textons. This idea was first explored for texture and material representations. A texton is a cluster center of filter responses over a collection of images. Textures and materials are described based on the distribution of prototypical texture elements. Leung & Malik, 1999; Varma & Zisserman, 2002. Kristen Grauman

Texture representation example. Statistics summarize patterns in small windows, here dimension 1 = mean d/dx value and dimension 2 = mean d/dy value. Windows with small gradient in both directions, primarily horizontal edges, primarily vertical edges, or both, fall in different regions of this space. Example values (mean d/dx, mean d/dy): window #1: (4, 10); window #2: (18, 7); ... window #9: (20, 20). Kristen Grauman

Inverted file index Database images are loaded into the index, mapping words to image numbers Kristen Grauman

Inverted file index A new query image is mapped to indices of database images that share a word. Kristen Grauman
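The two inverted-index operations above (loading database images, then querying) can be sketched with a plain dictionary; the word ids and image ids are made up for illustration.

```python
from collections import Counter

def build_inverted_index(image_words):
    """image_words: dict image_id -> set of visual word ids.
    Returns dict word_id -> list of image ids containing that word."""
    index = {}
    for image_id, words in image_words.items():
        for w in words:
            index.setdefault(w, []).append(image_id)
    return index

def query(index, query_words):
    """Rank database images by how many query words they share."""
    votes = Counter()
    for w in query_words:
        for image_id in index.get(w, []):
            votes[image_id] += 1
    return [image_id for image_id, _ in votes.most_common()]

db = {1: {3, 7, 9}, 2: {7, 12}, 3: {3, 9, 12, 20}}
index = build_inverted_index(db)
print(query(index, {3, 9, 20}))  # image 3 shares all three words -> [3, 1]
```

Only images sharing at least one word with the query are ever touched, which is what makes the index fast on large databases.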

Matching Images. Once we have extracted visual words from a query image, how do we find matching images in the database? One way is to simply look at the image ids of the matching features and retrieve those images whose ids occur most often (i.e., Chris Card's method). Another way is to look at the distribution (histogram) of the visual words in each image: the histograms from the query image and a matching database image should be very similar.

Analogy to documents. A document can be summarized by its distinctive words. First example passage: "Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image." Key words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel.
Second example passage: "China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with an 18% rise in imports to $660bn. The trade figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value." Key words: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value. ICCV 2005 short course, L. Fei-Fei

Bags of visual words. Summarize the entire image based on its distribution (histogram) of word occurrences. This is analogous to the bag-of-words representation commonly used for documents.

Comparing bags of words. Rank frames by the normalized scalar product between their (possibly weighted) occurrence counts: a nearest-neighbor search for similar images. For a vocabulary of V words, the similarity between a database vector d_j and a query vector q is sim(d_j, q) = (d_j . q) / (||d_j|| ||q||), e.g., d_j = [1 8 1 4] and q = [5 1 1 0]. Kristen Grauman
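The normalized scalar product (cosine similarity) between two word-count vectors can be computed directly, here using the slide's example counts:

```python
import math

def cosine_similarity(d, q):
    """Normalized scalar product between two word-count vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_d * norm_q)

d_j = [1, 8, 1, 4]   # word counts for a database frame (from the slide)
q   = [5, 1, 1, 0]   # word counts for the query
print(round(cosine_similarity(d_j, q), 3))  # -> 0.298
```

Normalizing by the vector lengths makes the score insensitive to the total number of features detected, so a close-up and a wide shot of the same object can still score highly.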

tf-idf weighting. Term frequency-inverse document frequency: describe a frame by the frequency of each word within it, and downweight words that appear often in the database (the standard weighting for text retrieval). The weight for word i in document d is t_i = (n_id / n_d) * log(N / n_i), where n_id is the number of occurrences of word i in document d, n_d is the number of words in document d, N is the total number of documents in the database, and n_i is the number of documents in the whole database in which word i occurs. Kristen Grauman
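The weighting is a one-line formula; the counts below are made-up numbers for illustration:

```python
import math

def tf_idf(n_id, n_d, N, n_i):
    """tf-idf weight for word i in document d.

    n_id: occurrences of word i in document d
    n_d:  total words in document d
    N:    total documents in the database
    n_i:  documents containing word i
    """
    return (n_id / n_d) * math.log(N / n_i)

# A word frequent in this image but rare in the database gets a high weight;
# a word that occurs in every database image gets weight zero.
print(tf_idf(n_id=5, n_d=100, N=1000, n_i=10))    # rare word -> positive weight
print(tf_idf(n_id=5, n_d=100, N=1000, n_i=1000))  # ubiquitous word -> 0.0
```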

Bags of words for content-based image retrieval. Sivic, Josef, and Andrew Zisserman, "Efficient visual search of videos cast as text retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence 31.4 (2009): 591-606. Slide from Andrew Zisserman; Sivic & Zisserman, ICCV 2003

Slide from Andrew Zisserman; Sivic & Zisserman, ICCV 2003

Additional Checks. Stop words: create a stop list of the most common visual words; these words are dropped from further consideration. Spatial consistency: for every matching feature, count the number of the k = 15 nearest adjacent features that also match between the two documents; this count is added to the score.

Video Google System. 1. Collect all words within the query region. 2. Use the inverted file index to find relevant frames. 3. Compare word counts. 4. Spatial verification. Sivic & Zisserman, ICCV 2003. Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html K. Grauman, B. Leibe


Scoring retrieval quality. Suppose the database has 10 images, of which 5 are relevant to the query, and results are returned in ranked order. Then precision = #relevant returned / #returned, and recall = #relevant returned / #total relevant. Plotting precision against recall as more results are returned gives a precision-recall curve. Slide credit: Ondrej Chum
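The two definitions can be computed directly; the specific result list below is hypothetical, not taken from the slide:

```python
def precision_recall(returned, relevant):
    """Precision and recall for a returned result list against a relevant set."""
    hits = len(set(returned) & set(relevant))
    precision = hits / len(returned)
    recall = hits / len(relevant)
    return precision, recall

# 5 relevant images in the database; suppose the top 4 returned results
# contain 3 of them.
relevant = {1, 2, 3, 4, 5}
p, r = precision_recall([1, 2, 7, 3], relevant)
print(p, r)  # -> 0.75 0.6
```

Sweeping a cutoff down the ranked list and recomputing both quantities at each cutoff traces out the precision-recall curve.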

Vocabulary Trees. A very large vocabulary can be organized into a tree for greater efficiency. Each descriptor vector is compared to a few prototypes at a given level of the tree, and the branch with the closest prototype is selected for further refinement. Only a few comparisons at each level are needed to quantize each descriptor.

Vocabulary Trees: hierarchical clustering for large vocabularies. Tree construction: [Nister & Stewenius, CVPR 2006]. Slide credit: David Nister
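The descent described above can be sketched with a toy two-level tree (the nested-dict layout and the 2-D centers are made up for illustration; a real tree comes from hierarchical k-means):

```python
import math

def leaf(center):
    return {"center": center, "children": []}

def quantize(descriptor, node, path=()):
    """Descend the tree, choosing the closest child at each level.
    Returns the path of branch choices, which identifies a leaf (visual word)."""
    if not node["children"]:
        return path
    best = min(range(len(node["children"])),
               key=lambda i: math.dist(descriptor, node["children"][i]["center"]))
    return quantize(descriptor, node["children"][best], path + (best,))

tree = {"center": (0.0, 0.0), "children": [
    {"center": (0.0, 10.0), "children": [leaf((0.0, 8.0)), leaf((2.0, 12.0))]},
    {"center": (10.0, 0.0), "children": [leaf((8.0, 0.0)), leaf((12.0, 2.0))]},
]}
print(quantize((11.0, 1.5), tree))  # branch choices -> (1, 1)
```

With branching factor k and depth L, quantization costs k*L comparisons instead of the k^L comparisons a flat vocabulary of the same size would need.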


Bags of words: pros and cons.
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides vector representation for sets
+ very good results in practice
- basic model ignores geometry; must verify afterwards, or encode it via features
- background and foreground are mixed when the bag covers the whole image
- optimal vocabulary formation remains unclear

Summary. Matching local invariant features is useful not only to provide matches for multi-view geometry, but also to find objects and scenes. Bag-of-words representation: quantize the feature space to make a discrete set of visual words; summarize each image by its distribution of words; index individual words. Inverted index: pre-compute an index to enable faster search at query time.