Beyond Bags of features Spatial information & Shape models

Size: px

Start display at page:

Download "Beyond Bags of features Spatial information & Shape models"

Spencer Jefferson
5 years ago
Views:

1 Beyond Bags of features Spatial information & Shape models Jana Kosecka Many slides adapted from S. Lazebnik, FeiFei Li, Rob Fergus, and Antonio Torralba

2 Detection, recognition (so far )! Bags of features models, codebooks made based on appearance! No spatial relationships between local features! Incorporating spatial information! Edge based representations! Distance Transform, Chamfer matching! Generalized Hough Transform! Combinations of edge based and patch based! Color based representations! Holistic gist based representations 2

3 Adding spatial information Forming vocabularies from pairs of nearby features doublets or bigrams Computing bags of features on subwindows of the whole image Using codebooks to vote for object position Generative partbased models

4 Spatial pyramid representation! Extension of a bag of features! Locally orderless representation at several levels of resolution level 0 Lazebnik, Schmid & Ponce (CVPR 2006)

5 Spatial pyramid representation! Extension of a bag of features! Locally orderless representation at several levels of resolution level 0 level 1 Lazebnik, Schmid & Ponce (CVPR 2006)

6 Spatial pyramid representation! Extension of a bag of features! Locally orderless representation at several levels of resolution level 0 level 1 level 2 Lazebnik, Schmid & Ponce (CVPR 2006)

7 Scene category dataset Multiclass classification results (100 training images per class)

8 Caltech101 dataset Multiclass classification results (30 training images per class)

9 Distance Transform! Given:! binary image, B, of edge and local feature locations! binary edge template, T, of shape we want to match! Let D be an array in registration with B such that D(i,j) is the distance to the nearest 1 in B.! this array is called the distance transform of B (binary image)

10 Distance Transform! Use of distance transform for template matching! Goal: Find placement of template T in D that minimizes the sum, M, of the distance transform multiplied by the pixel values in T! if T is an exact match to B at location (i,j) then M(i,j) = 0! i.e. all nonzero pixels of T will have distance 0! if the edges in B are slightly displaced from their ideal locations in T, we still get a good match using the distance transform technique

11 Computing the distance transform! Brute force, exact algorithm, is to scan B and find, for each 0, its closest 1 using the Euclidean distance.! expensive in time, and difficult to implement

12 Computing the distance transform! Two pass sequential algorithm! Initialize: set D(i,j) = 0 where B(i,j) = 1, else set D(i,j) =! Forward pass! D(i,j) = min[ D(i1,j1) 1, D(i1,j) 1, D(i1, j1) 1, D(i,j1) 1, D(i,j)]! Backward pass! D(i,j) = min[ D(i,j1) 1, D(i1,j1) 1, D(i1, j) 1, D(i 1,j1) 1, D(i,j)] f f f f b b b b f forward pass pixels b backward pass pixels

13 Computing the distance transform! Two pass sequential algorithm! Initialize: set D(i,j) = 0 where B(i,j) = 1, else set D(i,j) =! Forward pass! D(i,j) = min[ D(i1,j1) 1, D(i1,j) 1, D(i1, j1) 1, D(i,j1) 1, D(i,j)]! Backward pass! D(i,j) = min[ D(i,j1) 1, D(i1,j1) 1, D(i1, j) 1, D(i 1,j1) 1, D(i,j)]

14 Distance transform example D(i,j) = min[d(i,j), D(i,j1)1, D(i1, j1)1, D(i1,j)1, D(i1,j1)1] f f f f b b b b Extensions to nonbinary images (functions) P. Felzenswalb and D. Huttenlocher Distance Transform of Sampled Functions

15 Chamfer matching! Chamfer matching is convolution of a binary edge template with the distance transform! for any placement of the template over the image, it sums up the distance transform values for all pixels that are 1 s (edges) in the template! if, at some position in the image, all of the edges in the template coincide with edges in the image (which are the points at which the distance transform is zero), then we have a perfect match with a match score of 0.

16 Example Template n n T( i Match score is i= 1 j= 1, j) D( i k, j l)

Implicit shape models Combining the edge based Hough Transform style voting with appearance codebooks Visual codebook is used to index votes for object position training image annotated with object

17 Implicit shape models Combining the edge based Hough Transform style voting with appearance codebooks Visual codebook is used to index votes for object position training image annotated with object localization info visual codeword with displacement vectors B. Leibe, A. Leonardis, and B. Schiele, Combined Object Categorization and Segmentation with an Implicit Shape Model, ECCV Workshop on Statistical Learning in Computer Vision 2004

18 Implicit shape models Visual codebook is used to index votes for object position test image

19 Idea Implicit Shape Model! Faces rectangular templates detection windows! Does not generalize to more complex object with different shapes! How to combine patch based appearance based representations to incorporate notion of shape! Combined Object Categorization and Segmentation with an Implicit Shape Model. Bastian Leibe, Ales Leonardis, and Bernt Schiele. In ECCV'04. 30

20 Initial Recognition Approach! First Step: Generate hypotheses from local features! Training: Agglomerative Clustering How to decide when to merge two clusters Average NCC of patches NCC between two patches

center (positions parametrized by r and theta)!

21 Initial Recognition Approach! Codebook words spatial information is lost! For each codebook entry store all positions it was activated in relative to object center (positions parametrized by r and theta)! Parts vote for object center Lowe s DoG Detector 3σ x 3σ patches Resize to 25 x 25 Learn Spatial Distribution Find codebook patches

Pedestrian Detection in Crowded Scenes 1. Interleaved Object Categorization and Segmentation, BMVC 03 2.

22 Pedestrian Detection in Crowded Scenes 1. Interleaved Object Categorization and Segmentation, BMVC Combined Object Categorization and Segmentation with an Implicit Shape Model. Bastian Leibe, Ales Leonardis, and Bernt Schiele. In ECCV'04 Workshop on Statistical Learning in Computer Vision, Prague, May

Pedestrian Detection! Many applications!

difficult for any type of feature or model

23 Pedestrian Detection! Many applications! Large variation in shape, appearance! Need to combine different representations! Basic Premise: [Such a] problem is too difficult for any type of feature or model alone! Probabilistic bottom up, top down segmentation

24 ! Open Question: How would you do pedestrian detection/segmentation? Original image Segmentation from local features! Solution: integrate as many cues as possible from many sources Support of Segmentation from local features Support of segmentation from global features (Chamfer Matching)

25 Theme of the Paper! Goal: Localize AND count pedestrians in a given image! Datasets Training Set: 35 people walking parallel to the image plane Testing Set (Much harder!): 209 images of 595 annotated pedestrians

26 Theme of the Paper

27 Initial Recognition Approach! First Step: Generate hypotheses from local features (Intrinsic Shape Models)! Testing:! Initial Hypothesis: Overall

28 Initial Recognition Approach! First Step: Generate hypotheses from local features (Intrinsic Shape Models)! Testing:! Initial Hypothesis: Overall

29 Initial Recognition Approach! Second Step: Segmentation based Verification (Minimum Description Length)! Caveat: it leads to another set of problems Or four legs and three arms ISM doesn t know a person doesn t have three legs! Global Cues are needed

30 Assimilation of Global Cues! Distance Transform, Chamfer Matching get Feature Image by an edge detector Chamfer Distance between template and DT image get DT image by computing distance to nearest feature point

31 Assimilation of Global Cues (Attempt 1)! Distance Transform, Chamfer Matching Initial hypothesis generated by local features Use scale estimate to cut out surrounding region Yellow is highest Chamfer score Chamfer distance based matching Apply Canny detector and compute DT

32 Results

33 Generative Part Based Models Many slides adapted from FeiFei Li, Rob Fergus, and Antonio Torralba

34 Generative partbased models R. Fergus, P. Perona and A. Zisserman, Object Class Recognition by Unsupervised ScaleInvariant Learning, CVPR 2003

35 Probabilistic model P( image = max h object ) = P( appearance, shape object ) P( appearance h, object ) p( shape h, object ) p( h h: assignment Part of features to Part parts descriptors locations object ) Candidate parts

36 Probabilistic model P( image object ) = P( appearance, shape object ) = max h P( appearance h, object ) p( shape h, object ) p( h object ) h: assignment of features to parts Part 1 Part 3 Part 2

37 Probabilistic model P( image object ) = P( appearance, shape object ) = max h P( appearance h, object ) p( shape h, object ) p( h object ) h: assignment of features to parts Part 1 Part 3 Part 2

38 Probabilistic model P( image = max h object ) = P( appearance, shape object ) P( appearance h, object ) p( shape h, object ) p( h object ) Distribution over patch descriptors Highdimensional appearance space

39 Probabilistic model P( image = max h object ) = P( appearance, shape object ) P( appearance h, object ) p( shape h, object ) p( h object ) Distribution over joint part positions 2D image space

40 How to model location?! Explicit: Probability density functions! Implicit: Voting scheme! Invariance! Translation! Scaling! Similarity/affine! Viewpoint Translation and Scaling Affine transformation Translation Similarity transformation

41 Summary: Adding spatial information Doublet vocabularies! Pro: takes cooccurrences into account, some geometric invariance is preserved! Cons: too many doublet probabilities to estimate Spatial pyramids! Pro: simple extension of a bag of features, works very well! Cons: no geometric invariance, no object localization Implicit shape models! Pro: can localize object, maintain translation and possibly scale invariance! Cons: need supervised training data (known object positions and possibly segmentation masks) Generative partbased models! Pro: very nice conceptually, can be learned from unsegmented images! Cons: combinatorial hypothesis search problem

42 Slides from Sminchisescu

43 Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

44 ! Tested with! RGB! LAB! Grayscale! Gamma Normalization and Compression! Square root! Log

45 centered diagonal uncentered cubiccorrected Sobel Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

46 ! Histogram of gradient orientations Orientation Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

50 8 orientations X= 15x7 cells Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

52 pedestrian Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

53 Overfitting! A simple dataset.! Two models Linear Nonlinear

54 Overfitting! Let s get more data.! Simple model has better generalization.

55 Overfitting Loss! As complexity increases, the model overfits the data Real loss! Training loss decreases! Real loss increases Training loss! We need to penalize model complexity = to regularize Model complexity

56 Classification methods! K Nearest Neighbors! Decision Trees! Linear SVMs! Kernel SVMs! Boosted classifiers

57 K Nearest Neighbors o! Memorize all training data! Find K closest points to the query! The neighbors vote for the label: Vote()=2 Vote( )=1

58 KNearest Neighbors Nearest Neighbors (silhouettes) Kristen Grauman, Gregory Shakhnarovich, and Trevor Darrell, Virtual Visual Hulls: ExampleBased 3D Shape Inference from Silhouettes

59 KNearest Neighbors Silhouettes from other views 3D Visual hull Kristen Grauman, Gregory Shakhnarovich, and Trevor Darrell, Virtual Visual Hulls: ExampleBased 3D Shape Inference from Silhouettes

60 Decision tree V()=8 No X 1 >2 Yes V()=8 o V()=2 V()=8 No X 2 >1 Yes V()=4 V()=0 V()=4 V()=8 V()=2

61 Decision Tree Training V()=57% V()=64% V()=80% V()=80% V()=100%! Partition data into pure chunks! Find a good rule! Split the training data! Build left tree! Build right tree! Count the examples in the leaves to get the votes: V(), V()! Stop when! Purity is high! Data size is small! At fixed level

62 Histogram intersection 1 Assign to texture cluster Count S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.

63 (Spatial) Pyramid Match S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Adding spatial information Forming vocabularies from pairs of nearby features doublets