Visual Object Recognition

Size: px

Start display at page:

Download "Visual Object Recognition"

Lorena Cain
5 years ago
Views:

1 Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, & Kristen Grauman Department of Computer Sciences University of Texas in Austin

2 Outline 1. Detection with Global Appearance & Sliding Windows 2. Local Invariant Features: Detection & Description 3. Specific Object Recognition with Local Features Coffee Break 4. Visual Words: Indexing, Bags of Words Categorization 5. Matching Local Feature Sets 6. Part-Based Models for Categorization 7. Current Challenges and Research Directions 2

3 Global representations: limitations Success may rely on alignment -> sensitive to viewpoint All parts of the image or window impact the description -> sensitive to occlusion, clutter 3

4 Local representations Describe component regions or patches separately. Many options for detection & description SIFT [Lowe] Salient regions [Kadir et al.] Shape context [Belongie et al.] Harris-Affine [Schmid et al.] Superpixels [Ren et al.] Spin images [Johnson and Hebert] Maximally Stable Extremal Regions [Matas et al.] Geometric Blur [Berg et al.] 4

Recall: Invariant local features Subset of local

Translation Rotation Affine transformations

descriptors [Mikolajczyk & Schmid, Matas et al.

5 Recall: Invariant local features Subset of local feature types designed to be invariant to Scale Translation Rotation Affine transformations Illumination 1) Detect interest points 2) Extract descriptors [Mikolajczyk & Schmid, Matas et al., Tuytelaars & Van Gool, Lowe, Kadir et al., ] y 1 y 2 y d x 1 x 2 x d

6 Recognition with local feature sets Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption. Now, we ll use local features for Indexing-based recognition Bags of words representations Correspondence / matching kernels Aachen Cathedral 6

7 Basic flow Detect or sample features List of positions, scales, orientations Describe features Associated list of d-dimensional descriptors Index each one into pool of descriptors from previously seen images 7

8 Indexing local features Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)

9 Indexing local features When we see close points in feature space, we have similar descriptors, which indicates similar local content. Figure credit: A. Zisserman

10 Indexing local features We saw in the previous section how to use voting and pose clustering to identify objects using local features Figure credit: David Lowe 10

11 Indexing local features With potentially thousands of features per image, and hundreds to millions of images to search, how to efficiently find those that are relevant to a new image? Low-dimensional descriptors : can use standard efficient data structures for nearest neighbor search High-dimensional descriptors: approximate nearest neighbor search methods more practical Inverted file indexing schemes 11

Indexing local features: approximate nearest neighbor search Best-Bin First (BBF), a variant of k-d trees that uses priority queue to examine most promising branches first [Beis & Lowe,

12 Indexing local features: approximate nearest neighbor search Best-Bin First (BBF), a variant of k-d trees that uses priority queue to examine most promising branches first [Beis & Lowe, CVPR 1997] Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998] 12

13 Indexing local features: inverted file index For text documents, an efficient way to find all pages on which a word occurs is to use an index We want to find all images in which a feature occurs. To use this idea, we ll need to map our features to visual words. 13

14 Visual words: main idea Extract some local features from a number of images Slide credit: D. Nister e.g., SIFT descriptor space: each point is 128-dimensional 14

15 Visual words: main idea Slide credit: D. Nister 15

16 Visual words: main idea Slide credit: D. Nister 16

17 Visual words: main idea Slide credit: D. Nister 17

18 Slide credit: D. Nister 18

19 Slide credit: D. Nister 19

20 Visual words: main idea Map high-dimensional descriptors to tokens/words by quantizing the feature space Quantize via clustering, let cluster centers be the prototype words Descriptor space 20

space Determine which word to assign to each new image

21 Visual words: main idea Map high-dimensional descriptors to tokens/words by quantizing the feature space Determine which word to assign to each new image region by finding the closest cluster center. Descriptor space 21

22 Visual words Example: each group of patches belongs to the same visual word Figure from Sivic & Zisserman, ICCV

Visual words First explored for texture and

distribution of prototypical texture elements.

23 Visual words First explored for texture and material representations Texton = cluster center of filter responses over collection of images Describe textures and materials based on distribution of prototypical texture elements. Leung & Malik 1999; Varma & Zisserman, 2002; Lazebnik, Schmid & Ponce, 2003;

24 Visual words More recently used for describing scenes and objects for the sake of indexing or classification. Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others. 24

25 Inverted file index for images comprised of visual words Image credit: A. Zisserman Word number List of image numbers

26 Bags of visual words Summarize entire image based on its distribution (histogram) of word occurrences. Analogous to bag of words representation commonly used for documents. Image credit: Fei-Fei Li 26

27 Video Google System 1. Collect all words within query region 2. Inverted file index to find relevant frames 3. Compare word counts 4. Spatial verification Sivic & Zisserman, ICCV 2003 Demo online at : esearch/vgoogle/index.html Query region Retrieved frames 27

28 Basic flow Detect or sample features List of positions, scales, orientations Describe features Associated list of d-dimensional descriptors or Index each one into pool of descriptors from previously seen images Quantize to form bag of words vector for the image 28

29 Visual vocabulary formation Issues: Sampling strategy Clustering / quantization algorithm Unsupervised vs. supervised What corpus provides features (universal vocabulary?) Vocabulary size, number of words 29

Sampling strategies Sparse, at interest points

textured objects, sparse sampling from interest

Multiple complementary interest operators offer

30 Sampling strategies Sparse, at interest points Multiple interest operators Image credits: F-F. Li, E. Nowak, J. Sivic Dense, uniformly Randomly To find specific, textured objects, sparse sampling from interest points often more reliable. Multiple complementary interest operators offer more image coverage. For object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006] 30

31 Clustering / quantization methods k-means (typical choice), agglomerative clustering, mean-shift, Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies Vocabulary tree [Nister & Stewenius, CVPR 2006] 31

32 Example: Recognition with Vocabulary Tree Tree construction: [Nister & Stewenius, CVPR 06] 32 Slide credit: David Nister

33 Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 33

34 Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 34

35 Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 35

36 Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 36

37 Vocabulary Tree Training: Filling the tree [Nister & Stewenius, CVPR 06] Slide credit: David Nister 37

38 Vocabulary Tree Recognition RANSAC verification [Nister & Stewenius, CVPR 06] Slide credit: David Nister 38

39 Vocabulary Tree: Performance Evaluated on large databases Indexing with up to 1M images Online recognition for database of 50,000 CD covers Retrieval in ~1s Find experimentally that large vocabularies can be beneficial for recognition [Nister & Stewenius, CVPR 06] 39

40 Vocabulary formation Ensembles of trees provide additional robustness Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; Figure credit: F. Jurie

41 Supervised vocabulary formation Recent work considers how to leverage labeled images when constructing the vocabulary Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV

42 Supervised vocabulary formation Merge words that don t aid in discriminability Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005

43 Supervised vocabulary formation Consider vocabulary and classifier construction jointly. Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR

44 Learning and recognition with bag of words histograms Bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples) Provides easy way to use distribution of feature types with various learning algorithms requiring vector input. 44

45 Learning and recognition with bag of words histograms including unsupervised topic models designed for documents. Hierarchical Bayesian text models (plsa and LDA) Hoffman 2001, Blei, Ng & Jordan, 2004 For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al Figure credit: Fei-Fei Li 45

Probabilistic Latent d z w Semantic Analysis (plsa) D N face Sivic et al.

46 Learning and recognition with bag of words histograms including unsupervised topic models designed for documents. Probabilistic Latent d z w Semantic Analysis (plsa) D N face Sivic et al. ICCV 2005 [plsa code available at: Figure credit: Fei-Fei Li 46

47 Bags of words: pros and cons + flexible to geometry / deformations / viewpoint + compact summary of image content + provides vector representation for sets + has yielded good recognition results in practice - basic model ignores geometry must verify afterwards, or encode via features - background and foreground mixed when bag covers whole image - interest points or sampling: no guarantee to capture object-level parts - optimal vocabulary formation remains unclear 47

48 Outline 1. Detection with Global Appearance & Sliding Windows 2. Local Invariant Features: Detection & Description 3. Specific Object Recognition with Local Features Coffee Break 4. Visual Words: Indexing, Bags of Words Categorization 5. Matching Local Feature Sets 6. Part-Based Models for Categorization 7. Current Challenges and Research Directions 48

Basic flow Detect or sample features List of positions, scales, orientations Describe features Associated list of d-dimensional descriptors or or

49 Basic flow Detect or sample features List of positions, scales, orientations Describe features Associated list of d-dimensional descriptors or or Index each one into pool of descriptors from previously seen images Quantize to form bag of words vector for the image Compute match with another image 49

Local feature correspondences The matching between

context [Belongie & Malik 2001] Low-distortion

50 Local feature correspondences The matching between sets of local features helps to establish overall similarity between objects or shapes. Assigned matches also useful for localization Shape context [Belongie & Malik 2001] Low-distortion matching [Berg & Malik 2005] Match kernel [Wallraven, Caputo & Graf 2003] 50

51 Local feature correspondences Least cost match: minimize total cost between matched points min π: X Y Least cost partial match: match all of smaller set to some portion of larger set. xi X x i π ( xi)

52 Pyramid match kernel (PMK) Optimal matching expensive relative to number of features per image (m). PMK is approximate partial match for efficient discriminative learning from sets of local features. Optimal match: O(m 3 ) Greedy match: O(m 2 log m) Pyramid match: O(m) [Grauman & Darrell, ICCV 2005] 52

53 Pyramid match kernel: pyramid extraction, Histogram pyramid: level i has bins of size 53 K. Grauman

54 Pyramid match kernel: counting matches Histogram intersection K. Grauman 54

55 Pyramid match kernel: counting new matches Histogram intersection matches at this level Difference in histogram intersections across levels counts number of new pairs matched K. Grauman matches at previous level 55

56 Pyramid match kernel histogram pyramids measure of difficulty of a match at level i number of newly matched pairs at level i For similarity, weights inversely proportional to bin size (or may be learned discriminatively) Normalize kernel values to avoid favoring large sets K. Grauman 56

57 Example pyramid match K. Grauman 57

58 Example pyramid match K. Grauman 58

59 Example pyramid match K. Grauman 59

60 Example pyramid match pyramid match optimal match K. Grauman

61 Pyramid match kernel Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods Bounded error relative to optimal partial match Linear time -> efficient learning with large feature sets

62 Pyramid match kernel Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods Bounded error relative to optimal partial match Linear time -> efficient learning with large feature sets Accuracy Mean number of features Time (s) Match [Wallraven et al.] Pyramid match ETH-80 data set Mean number of features O(m 2 ) O(m)

Pyramid match kernel Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods Bounded error relative to optimal partial match Linear time -> efficient learning

63 Pyramid match kernel Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods Bounded error relative to optimal partial match Linear time -> efficient learning with large feature sets Use data-dependent pyramid partitions for high-d feature spaces Uniform pyramid bins Vocabulary-guided pyramid bins Code for PMK:

64 Matching smoothness & local geometry Solving for linear assignment means (non-overlapping) features can be matched independently, ignoring relative geometry. One alternative: simply expand feature vectors to include spatial information before matching. y a x a [ f 1,,f 128, x a, y a ] 64

65 Spatial pyramid match kernel First quantize descriptors into words, then do one pyramid match per word in image coordinate space. Lazebnik, Schmid & Ponce, CVPR 2006

66 Matching smoothness & local geometry Use correspondence to estimate parameterized transformation, regularize to enforce smoothness Shape context matching [Belongie, Malik, & Puzicha 2001] Code:

67 Matching smoothness & local geometry Let matching cost include term to penalize distortion between pairs of matched features. Template j i Rij Approximate for efficient solutions: Berg & Malik, CVPR 2005; Leordeanu & Hebert, ICCV 2005 i i' Query j' Si'j' i' Figure credit: Alex Berg

Matching smoothness & local geometry Compare semi-local features:

relationships Correlograms of visual words [Savarese, Winn, &

2004] Proximity distribution kernel [Ling & Soatto, ICCV 2007]

68 Matching smoothness & local geometry Compare semi-local features: consider configurations or neighborhoods and co-occurrence relationships Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006] Feature neighborhoods [Sivic & Zisserman, CVPR 2004] Proximity distribution kernel [Ling & Soatto, ICCV 2007] Hyperfeatures: Agarwal & Triggs, ECCV 2006] Tiled neighborhood [Quack, Ferrari, Leibe, van Gool ICCV 2007]

69 Matching smoothness & local geometry Learn or provide explicit object-specific shape model [Next in the tutorial : part-based models] x 6 x 1 x 2 x 5 x 4 x 3

70 Summary Local features are a useful, flexible representation Invariance properties - typically built into the descriptor Distinctive, especially helpful for identifying specific textured objects Breaking image into regions/parts gives tolerance to occlusions and clutter Mapping to visual words forms discrete tokens from image regions Efficient methods available for Indexing patches or regions Comparing distributions of visual words Matching features 70

Local features, distances and kernels. Plan for today

Local features, distances and kernels February 5, 2009 Kristen Grauman UT Austin Plan for today Lecture : local features and matchi Papers: Video Google [Sivic & Zisserman] Pyramid match [Grauman & Darrell]