Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, 14.07.2008 & Kristen Grauman Department of Computer Sciences University of Texas at Austin
Outline 1. Detection with Global Appearance & Sliding Windows 2. Local Invariant Features: Detection & Description 3. Specific Object Recognition with Local Features Coffee Break 4. Visual Words: Indexing, Bags of Words Categorization 5. Matching Local Feature Sets 6. Part-Based Models for Categorization 7. Current Challenges and Research Directions 2
Global representations: limitations Success may rely on alignment -> sensitive to viewpoint All parts of the image or window impact the description -> sensitive to occlusion, clutter 3
Local representations Describe component regions or patches separately. Many options for detection & description SIFT [Lowe] Salient regions [Kadir et al.] Shape context [Belongie et al.] Harris-Affine [Schmid et al.] Superpixels [Ren et al.] Spin images [Johnson and Hebert] Maximally Stable Extremal Regions [Matas et al.] Geometric Blur [Berg et al.] 4
Recall: Invariant local features Subset of local feature types designed to be invariant to Scale Translation Rotation Affine transformations Illumination 1) Detect interest points 2) Extract descriptors [Mikolajczyk & Schmid, Matas et al., Tuytelaars & Van Gool, Lowe, Kadir et al.] [Figure: interest regions with descriptor vectors (x_1, ..., x_d) and (y_1, ..., y_d)]
Recognition with local feature sets Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption. Now, we'll use local features for Indexing-based recognition Bags of words representations Correspondence / matching kernels Aachen Cathedral 6
Basic flow Detect or sample features List of positions, scales, orientations Describe features Associated list of d-dimensional descriptors Index each one into pool of descriptors from previously seen images 7
Indexing local features Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)
Indexing local features When we see close points in feature space, we have similar descriptors, which indicates similar local content. Figure credit: A. Zisserman
Indexing local features We saw in the previous section how to use voting and pose clustering to identify objects using local features Figure credit: David Lowe 10
Indexing local features With potentially thousands of features per image, and hundreds to millions of images to search, how do we efficiently find those that are relevant to a new image? Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search High-dimensional descriptors: approximate nearest neighbor search methods more practical Inverted file indexing schemes 11
Indexing local features: approximate nearest neighbor search Best-Bin First (BBF), a variant of k-d trees that uses priority queue to examine most promising branches first [Beis & Lowe, CVPR 1997] Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998] 12
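To make the hashing idea concrete, here is a minimal sketch of the random-hyperplane variant of locality-sensitive hashing (a common variant, not the exact scheme of Indyk & Motwani; the function names and toy 4-d descriptors are hypothetical — real systems hash 128-d SIFT vectors into several tables):

```python
import random

def make_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each bit is the sign of a dot product
    with a random direction, so nearby vectors tend to share bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(x):
        return tuple(1 if sum(p * v for p, v in zip(plane, x)) >= 0 else 0
                     for plane in planes)
    return h

h = make_hash(dim=4, n_bits=8)
table = {}
descriptors = {"a": [1.0, 0.9, 0.0, 0.1],
               "b": [0.9, 1.0, 0.1, 0.0],    # close to "a"
               "c": [-1.0, 0.0, 1.0, -0.9]}  # far from "a"
for name, d in descriptors.items():
    table.setdefault(h(d), []).append(name)

# A query descriptor is hashed and only its bucket is searched; vectors
# close to "a" land in the same bucket with high probability (querying
# with a stored descriptor itself is guaranteed to find its bucket).
candidates = table.get(h(descriptors["a"]), [])
```

Only the colliding bucket is scanned, which is what makes the search sublinear in the database size.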
Indexing local features: inverted file index For text documents, an efficient way to find all pages on which a word occurs is to use an index We want to find all images in which a feature occurs. To use this idea, we'll need to map our features to visual words. 13
Visual words: main idea Extract some local features from a number of images e.g., SIFT descriptor space: each point is 128-dimensional Slide credit: D. Nister 14
Visual words: main idea Map high-dimensional descriptors to tokens/words by quantizing the feature space Quantize via clustering, let cluster centers be the prototype words Descriptor space 20
Visual words: main idea Map high-dimensional descriptors to tokens/words by quantizing the feature space Determine which word to assign to each new image region by finding the closest cluster center. Descriptor space 21
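The two steps above — cluster to form prototype words, then assign each new descriptor to its closest center — can be sketched with plain Lloyd's k-means (a toy 2-D stand-in for 128-d SIFT space; all function names here are our own, not from any particular library):

```python
import math

def mean(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the final centers act as the visual words."""
    centers = [list(p) for p in points[:k]]  # naive deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        centers = [mean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

def assign_word(desc, centers):
    """Word id of a new image region = index of the closest cluster center."""
    return min(range(len(centers)), key=lambda j: math.dist(desc, centers[j]))

# Toy 2-D "descriptors" forming two clusters -> a 2-word vocabulary.
descs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocab = kmeans(descs, k=2)
```

Descriptors near the same cluster get the same token, which is all that later indexing stages see.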
Visual words Example: each group of patches belongs to the same visual word Figure from Sivic & Zisserman, ICCV 2003 22
Visual words First explored for texture and material representations Texton = cluster center of filter responses over collection of images Describe textures and materials based on distribution of prototypical texture elements. Leung & Malik 1999; Varma & Zisserman, 2002; Lazebnik, Schmid & Ponce, 2003;
Visual words More recently used for describing scenes and objects for the sake of indexing or classification. Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others. 24
Inverted file index for images comprised of visual words Image credit: A. Zisserman Word number List of image numbers
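A minimal sketch of such an inverted file, with voting at query time (the image ids and word ids are hypothetical; real systems also weight words, e.g. by tf-idf):

```python
from collections import defaultdict

# Inverted file: visual word id -> set of images containing that word.
index = defaultdict(set)

def add_image(image_id, words):
    for w in words:
        index[w].add(image_id)

def query(words):
    """Rank database images by how many distinct query words they share."""
    votes = defaultdict(int)
    for w in set(words):
        for img in index.get(w, ()):
            votes[img] += 1
    return sorted(votes, key=votes.get, reverse=True)

add_image("img1", [3, 7, 7, 12])
add_image("img2", [7, 9])
add_image("img3", [1, 2])
```

Only images sharing at least one word with the query are ever touched, so query cost scales with word occurrence lists, not with database size.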
Bags of visual words Summarize entire image based on its distribution (histogram) of word occurrences. Analogous to bag of words representation commonly used for documents. Image credit: Fei-Fei Li 26
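The bag-of-words summary itself is just a fixed-length, normalized count vector over word ids (a sketch; the helper name is our own):

```python
def bag_of_words(word_ids, vocab_size):
    """Fixed-length histogram of visual word occurrences, L1-normalized
    so images with different feature counts are comparable."""
    h = [0] * vocab_size
    for w in word_ids:
        h[w] += 1
    total = sum(h)
    return [c / total for c in h] if total else h
```

Whatever the number or order of detected features, every image maps to a vector of the same dimension — the property the learning algorithms below rely on.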
Video Google System 1. Collect all words within query region 2. Inverted file index to find relevant frames 3. Compare word counts 4. Spatial verification Sivic & Zisserman, ICCV 2003 Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html Query region Retrieved frames 27
Basic flow Detect or sample features List of positions, scales, orientations Describe features Associated list of d-dimensional descriptors Index each one into pool of descriptors from previously seen images or Quantize to form bag of words vector for the image 28
Visual vocabulary formation Issues: Sampling strategy Clustering / quantization algorithm Unsupervised vs. supervised What corpus provides features (universal vocabulary?) Vocabulary size, number of words 29
Sampling strategies Sparse, at interest points Multiple interest operators Image credits: F-F. Li, E. Nowak, J. Sivic Dense, uniformly Randomly To find specific, textured objects, sparse sampling from interest points often more reliable. Multiple complementary interest operators offer more image coverage. For object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006] 30
Clustering / quantization methods k-means (typical choice), agglomerative clustering, mean-shift, Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies Vocabulary tree [Nister & Stewenius, CVPR 2006] 31
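The hierarchical lookup that makes large vocabularies practical can be illustrated with a tiny hand-built tree (branching factor k=2, depth 2, so 4 leaf words; in the vocabulary tree of Nister & Stewenius the centers come from hierarchical k-means and k and depth are much larger — this toy structure is purely illustrative):

```python
import math

# Toy vocabulary tree: each node holds k child centers; a word is the
# path of child indices from root to leaf (k=2, depth 2 -> 4 words).
tree = {
    "centers": [(0.0, 0.0), (10.0, 10.0)],
    "children": [
        {"centers": [(0.0, 0.0), (0.0, 3.0)], "children": None},
        {"centers": [(10.0, 10.0), (13.0, 10.0)], "children": None},
    ],
}

def lookup(desc, node, path=()):
    """Descend the tree, comparing against only k centers per level,
    so word assignment costs O(k * depth) instead of O(vocabulary size)."""
    j = min(range(len(node["centers"])),
            key=lambda i: math.dist(desc, node["centers"][i]))
    if node["children"] is None:
        return path + (j,)
    return lookup(desc, node["children"][j], path + (j,))
```

The returned path identifies a leaf word; with k=10 and depth 6 this gives a million-word vocabulary reachable in 60 distance computations.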
Example: Recognition with Vocabulary Tree Tree construction: [Nister & Stewenius, CVPR 06] 32 Slide credit: David Nister
Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 33
Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 34
Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 35
Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 36
Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 37
Vocabulary Tree Recognition RANSAC verification [Nister & Stewenius, CVPR 06] Slide credit: David Nister 38
Vocabulary Tree: Performance Evaluated on large databases Indexing with up to 1M images Online recognition for database of 50,000 CD covers Retrieval in ~1s Find experimentally that large vocabularies can be beneficial for recognition [Nister & Stewenius, CVPR 06] 39
Vocabulary formation Ensembles of trees provide additional robustness Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; Figure credit: F. Jurie
Supervised vocabulary formation Recent work considers how to leverage labeled images when constructing the vocabulary Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006. 41
Supervised vocabulary formation Merge words that don't aid in discriminability Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005
Supervised vocabulary formation Consider vocabulary and classifier construction jointly. Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008. 43
Learning and recognition with bag of words histograms Bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples) Provides easy way to use distribution of feature types with various learning algorithms requiring vector input. 44
Learning and recognition with bag of words histograms including unsupervised topic models designed for documents. Hierarchical Bayesian text models (pLSA and LDA) Hofmann 2001, Blei, Ng & Jordan, 2004 For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005 Figure credit: Fei-Fei Li 45
Learning and recognition with bag of words histograms including unsupervised topic models designed for documents. Probabilistic Latent Semantic Analysis (pLSA) [Plate diagram: document d, topic z, word w; plates N and D; example topic: face] Sivic et al. ICCV 2005 [pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/] Figure credit: Fei-Fei Li 46
Bags of words: pros and cons + flexible to geometry / deformations / viewpoint + compact summary of image content + provides vector representation for sets + has yielded good recognition results in practice - basic model ignores geometry; must verify afterwards, or encode via features - background and foreground mixed when bag covers whole image - interest points or sampling: no guarantee to capture object-level parts - optimal vocabulary formation remains unclear 47
Outline 1. Detection with Global Appearance & Sliding Windows 2. Local Invariant Features: Detection & Description 3. Specific Object Recognition with Local Features Coffee Break 4. Visual Words: Indexing, Bags of Words Categorization 5. Matching Local Feature Sets 6. Part-Based Models for Categorization 7. Current Challenges and Research Directions 48
Basic flow Detect or sample features List of positions, scales, orientations Describe features Associated list of d-dimensional descriptors Index each one into pool of descriptors from previously seen images or Quantize to form bag of words vector for the image or Compute match with another image 49
Local feature correspondences The matching between sets of local features helps to establish overall similarity between objects or shapes. Assigned matches also useful for localization Shape context [Belongie & Malik 2001] Low-distortion matching [Berg & Malik 2005] Match kernel [Wallraven, Caputo & Graf 2003] 50
Local feature correspondences Least cost match: minimize total cost between matched points, $\min_{\pi: X \to Y} \sum_{x_i \in X} \lVert x_i - \pi(x_i) \rVert$ Least cost partial match: match all of smaller set to some portion of larger set.
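For very small sets the least cost partial match can be computed exactly by brute force, which makes the definition concrete (a sketch with our own function name; the factorial enumeration is exactly why the efficient approximations below matter):

```python
import math
from itertools import permutations

def optimal_partial_match(X, Y):
    """Match every point of the smaller set to a distinct point of the
    larger set, minimizing total distance. Brute force: tiny sets only."""
    small, large = (X, Y) if len(X) <= len(Y) else (Y, X)
    best = None
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(math.dist(small[i], large[j]) for i, j in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best  # (total cost, assignment of small[i] -> large[perm[i]])
```

Here both points of the smaller set find their obvious neighbors: the pairing (0,0)→(0,1) and (5,5)→(5,4) has total cost 2, beating every other assignment.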
Pyramid match kernel (PMK) Optimal matching expensive relative to number of features per image (m). PMK is approximate partial match for efficient discriminative learning from sets of local features. Optimal match: O(m^3) Greedy match: O(m^2 log m) Pyramid match: O(m) [Grauman & Darrell, ICCV 2005] 52
Pyramid match kernel: pyramid extraction Histogram pyramid: level i has bins of size 2^i K. Grauman 53
Pyramid match kernel: counting matches Histogram intersection K. Grauman 54
Pyramid match kernel: counting new matches Difference in histogram intersections across levels counts number of new pairs matched [Figure: histogram intersection, matches at this level vs. matches at previous level] K. Grauman 55
Pyramid match kernel Over histogram pyramids, $K(\Psi(X), \Psi(Y)) = \sum_i w_i N_i$, where $N_i$ is the number of newly matched pairs at level $i$ and weight $w_i$ reflects the difficulty of a match at level $i$ For similarity, weights inversely proportional to bin size (or may be learned discriminatively) Normalize kernel values to avoid favoring large sets K. Grauman 56
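A 1-D sketch of the computation (assuming bins of side 2^i at level i and weights 1/2^i, i.e. inversely proportional to bin size; the feature range `extent` and all names are our own simplifications of the full algorithm):

```python
def histogram(points, level, extent=16):
    """Histogram of 1-D points in [0, extent) with bins of side 2**level."""
    size = 2 ** level
    h = [0] * (extent // size)
    for p in points:
        h[int(p) // size] += 1
    return h

def intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

def pyramid_match(X, Y, levels=4):
    """Weighted sum of *new* matches found at each level; a pair first
    matched in a bin of side 2**i contributes weight 1/2**i."""
    score, prev = 0.0, 0
    for i in range(levels):
        inter = intersection(histogram(X, i), histogram(Y, i))
        score += (inter - prev) / 2 ** i
        prev = inter
    return score
```

For X = [1.0, 5.0] and Y = [1.4, 4.0]: the pair (1.0, 1.4) already matches in unit bins (weight 1), while (5.0, 4.0) first co-occurs in bins of side 2 (weight 1/2), giving 1.5 — close to the optimal match cost structure at linear cost in the number of features.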
Example pyramid match K. Grauman 57
Example pyramid match K. Grauman 58
Example pyramid match K. Grauman 59
Example pyramid match pyramid match optimal match K. Grauman
Pyramid match kernel Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods Bounded error relative to optimal partial match Linear time -> efficient learning with large feature sets
Pyramid match kernel [Plots on the ETH-80 data set: accuracy vs. mean number of features, and match time (s) vs. mean number of features, for the match kernel of [Wallraven et al.] (O(m^2)) and the pyramid match (O(m))]
Pyramid match kernel Use data-dependent pyramid partitions for high-d feature spaces: uniform pyramid bins vs. vocabulary-guided pyramid bins Code for PMK: http://people.csail.mit.edu/jjl/libpmk/
Matching smoothness & local geometry Solving for linear assignment means (non-overlapping) features can be matched independently, ignoring relative geometry. One alternative: simply expand feature vectors to include spatial information before matching: $[f_1, \ldots, f_{128}, x_a, y_a]$ 64
Spatial pyramid match kernel First quantize descriptors into words, then do one pyramid match per word in image coordinate space. Lazebnik, Schmid & Ponce, CVPR 2006
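The idea — pyramid match in image coordinates, run separately per visual word — can be sketched as follows (a simplification with our own names; features are (x, y, word) triples, grid cells of side 2^i, and new matches weighted 1/2^i as in the pyramid match above):

```python
def spatial_histogram(features, level, extent=16):
    """2-D grid histogram of (x, y, word) features: cells of side
    2**level, kept separately per visual word via the dict key."""
    size = 2 ** level
    h = {}
    for x, y, w in features:
        key = (w, int(x) // size, int(y) // size)
        h[key] = h.get(key, 0) + 1
    return h

def intersect(h1, h2):
    return sum(min(c, h2.get(k, 0)) for k, c in h1.items())

def spatial_pyramid_match(F1, F2, levels=3):
    """Matches only same-word features, rewarding spatial agreement:
    pairs that co-locate in finer cells score higher."""
    score, prev = 0.0, 0
    for i in range(levels):
        inter = intersect(spatial_histogram(F1, i),
                          spatial_histogram(F2, i))
        score += (inter - prev) / 2 ** i
        prev = inter
    return score
```

Two images with the same words in roughly the same places score near the maximum; the same words scattered differently only match in coarse cells and score lower.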
Matching smoothness & local geometry Use correspondence to estimate parameterized transformation, regularize to enforce smoothness Shape context matching [Belongie, Malik, & Puzicha 2001] Code: http://www.eecs.berkeley.edu/research/projects/cs/vision/shape/sc_digits.html
Matching smoothness & local geometry Let matching cost include term to penalize distortion between pairs of matched features. [Figure: template points i, j with distortion cost R_ij; query points i', j' with similarity cost S_i'j'] Approximate for efficient solutions: Berg & Malik, CVPR 2005; Leordeanu & Hebert, ICCV 2005 Figure credit: Alex Berg
Matching smoothness & local geometry Compare semi-local features: consider configurations or neighborhoods and co-occurrence relationships Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006] Feature neighborhoods [Sivic & Zisserman, CVPR 2004] Proximity distribution kernel [Ling & Soatto, ICCV 2007] Hyperfeatures [Agarwal & Triggs, ECCV 2006] Tiled neighborhood [Quack, Ferrari, Leibe, & Van Gool, ICCV 2007]
Matching smoothness & local geometry Learn or provide explicit object-specific shape model [Next in the tutorial: part-based models] [Diagram: shape model over part locations x_1, ..., x_6]
Summary Local features are a useful, flexible representation Invariance properties - typically built into the descriptor Distinctive, especially helpful for identifying specific textured objects Breaking image into regions/parts gives tolerance to occlusions and clutter Mapping to visual words forms discrete tokens from image regions Efficient methods available for Indexing patches or regions Comparing distributions of visual words Matching features 70