Beyond Bags of Features: Spatial Information & Shape Models
Jana Kosecka
Many slides adapted from S. Lazebnik, Fei-Fei Li, Rob Fergus, and Antonio Torralba
Detection, recognition (so far)
- Bag-of-features models: codebooks built from appearance alone
- No spatial relationships between local features

Incorporating spatial information
- Edge-based representations: distance transform, Chamfer matching, generalized Hough transform
- Combinations of edge-based and patch-based representations
- Color-based representations
- Holistic gist-based representations
Adding spatial information
- Forming vocabularies from pairs of nearby features ("doublets" or "bigrams")
- Computing bags of features on sub-windows of the whole image
- Using codebooks to vote for object position
- Generative part-based models
Spatial pyramid representation
- Extension of a bag of features
- Locally orderless representation at several levels of resolution (level 0, level 1, level 2)
Lazebnik, Schmid & Ponce (CVPR 2006)
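The pyramid above can be sketched as follows. This is a minimal illustration, assuming features have already been quantized to codeword indices with pixel coordinates; the function name and weighting comments follow the standard spatial-pyramid scheme, but this is not the authors' actual code.

```python
import numpy as np

def spatial_pyramid_histogram(xs, ys, words, vocab_size, width, height, levels=3):
    """Concatenate bag-of-words histograms over 1x1, 2x2, 4x4 grids.

    xs, ys : feature coordinates in pixels, in [0, width) x [0, height)
    words  : codeword index of each feature (0 .. vocab_size-1)
    Level weights follow the standard scheme: with L = levels-1,
    level 0 gets 1/2^L and level l >= 1 gets 1/2^(L-l+1).
    """
    parts = []
    for level in range(levels):
        cells = 2 ** level  # cells per side at this level
        weight = 1.0 / 2 ** (levels - 1) if level == 0 else 1.0 / 2 ** (levels - level)
        for cy in range(cells):
            for cx in range(cells):
                hist = np.zeros(vocab_size)
                for x, y, w in zip(xs, ys, words):
                    if int(x * cells / width) == cx and int(y * cells / height) == cy:
                        hist[w] += 1
                parts.append(weight * hist)
    return np.concatenate(parts)
```

The resulting vector has vocab_size * (1 + 4 + 16) entries for three levels; finer levels get larger weights, so matches at fine resolution count more.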
Scene category dataset
Multi-class classification results (100 training images per class)
Caltech101 dataset
http://www.vision.caltech.edu/image_datasets/caltech101/caltech101.html
Multi-class classification results (30 training images per class)
Distance Transform
- Given:
  - a binary image B of edge and local feature locations
  - a binary edge template T of the shape we want to match
- Let D be an array in registration with B such that D(i,j) is the distance to the nearest 1 in B
- This array D is called the distance transform of B
Distance Transform
- Use of the distance transform for template matching
- Goal: find a placement of template T in D that minimizes the sum M of the distance-transform values under the nonzero pixels of T
- If T is an exact match to B at location (i,j), then M(i,j) = 0, i.e. all nonzero pixels of T land on distance 0
- If the edges in B are slightly displaced from their ideal locations in T, we still get a good match using the distance-transform technique
Computing the distance transform
- Brute-force exact algorithm: scan B and, for each 0, find its closest 1 using the Euclidean distance
- Expensive in time and difficult to implement
(Figure: example distance-transform grid for a binary image.)
Computing the distance transform
- Two-pass sequential algorithm
- Initialize: set D(i,j) = 0 where B(i,j) = 1, else set D(i,j) = ∞
- Forward pass (top-left to bottom-right):
  D(i,j) = min[ D(i-1,j-1)+1, D(i-1,j)+1, D(i-1,j+1)+1, D(i,j-1)+1, D(i,j) ]
- Backward pass (bottom-right to top-left):
  D(i,j) = min[ D(i,j+1)+1, D(i+1,j-1)+1, D(i+1,j)+1, D(i+1,j+1)+1, D(i,j) ]
(Figure: f marks forward-pass neighbor pixels, b marks backward-pass neighbor pixels.)
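The two-pass algorithm can be sketched directly from the update rules above. A minimal version, assuming the 8-neighborhood masks on the slide (all neighbors at cost 1, i.e. a chessboard metric); this is an illustration, not optimized code.

```python
import numpy as np

def distance_transform(B):
    """Two-pass distance transform of a binary image B.

    D[i, j] is the chessboard distance from (i, j) to the nearest 1 in B,
    computed with one forward and one backward raster scan.
    """
    INF = 10**9
    h, w = B.shape
    D = np.where(np.asarray(B) == 1, 0, INF)
    # Forward pass: propagate from the already-visited upper-left neighbors.
    for i in range(h):
        for j in range(w):
            for di, dj in [(-1, -1), (-1, 0), (-1, 1), (0, -1)]:
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    D[i, j] = min(D[i, j], D[ni, nj] + 1)
    # Backward pass: propagate from the lower-right neighbors.
    for i in range(h - 1, -1, -1):
        for j in range(w - 1, -1, -1):
            for di, dj in [(0, 1), (1, -1), (1, 0), (1, 1)]:
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    D[i, j] = min(D[i, j], D[ni, nj] + 1)
    return D
```

Each pixel is visited twice, so the cost is linear in the image size, unlike the brute-force scan.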
Distance transform example
D(i,j) = min[ D(i,j), D(i,j-1)+1, D(i-1,j-1)+1, D(i-1,j)+1, D(i-1,j+1)+1 ]
(Figure: a binary edge image and its computed distance transform; f marks forward-pass pixels, b marks backward-pass pixels.)
Extensions to non-binary images (functions): P. Felzenszwalb and D. Huttenlocher, Distance Transforms of Sampled Functions
Chamfer matching
- Chamfer matching is convolution of a binary edge template with the distance transform
- For any placement of the template over the image, it sums the distance-transform values for all pixels that are 1s (edges) in the template
- If, at some position in the image, all of the edges in the template coincide with edges in the image (the points where the distance transform is zero), then we have a perfect match with a match score of 0
Example
Match score at offset (k,l):
M(k,l) = Σ_{i=1}^{n} Σ_{j=1}^{n} T(i,j) · D(i+k, j+l)
(Figure: template placed over the distance-transform image; the DT values under the template edges are summed.)
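The match score above can be sketched in a few lines. A hedged illustration, assuming T is a binary template and D a precomputed distance transform; the function names are illustrative.

```python
import numpy as np

def chamfer_score(T, D, k, l):
    """Sum of distance-transform values under the template's edge pixels.

    T : binary template (1 = edge pixel)
    D : distance transform of the image's edge map
    (k, l) : offset of the template's top-left corner in D
    A score of 0 means every template edge lands exactly on an image edge.
    """
    n, m = T.shape
    window = D[k:k + n, l:l + m]
    return float((T * window).sum())

def best_placement(T, D):
    """Slide the template over all valid offsets; return the best offset."""
    n, m = T.shape
    h, w = D.shape
    scores = {(k, l): chamfer_score(T, D, k, l)
              for k in range(h - n + 1) for l in range(w - m + 1)}
    return min(scores, key=scores.get)
```

Lower is better: displaced edges contribute small positive distances rather than failing outright, which is what makes Chamfer matching tolerant to slight misalignment.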
Implicit shape models
- Combine edge-based, Hough-transform-style voting with appearance codebooks
- The visual codebook is used to index votes for the object position
- Training images are annotated with object localization info; each visual codeword stores displacement vectors to the object center
B. Leibe, A. Leonardis, and B. Schiele, Combined Object Categorization and Segmentation with an Implicit Shape Model, ECCV Workshop on Statistical Learning in Computer Vision, 2004
Implicit shape models
- At test time, the visual codebook is used to index votes for the object position (test image)
Idea: Implicit Shape Model
- Faces: rectangular templates, detection windows
- These do not generalize to more complex objects with different shapes
- How do we combine patch-based, appearance-based representations to incorporate a notion of shape?
Combined Object Categorization and Segmentation with an Implicit Shape Model. Bastian Leibe, Ales Leonardis, and Bernt Schiele. In ECCV'04.
Initial Recognition Approach
- First step: generate hypotheses from local features
- Training: agglomerative clustering
  - When to merge two clusters? Use the average NCC (normalized cross-correlation) between their patches
Initial Recognition Approach
- Codebook words alone lose spatial information
- For each codebook entry, store all positions at which it was activated relative to the object center (positions parametrized by r and θ)
- Parts vote for the object center
- Pipeline: Lowe's DoG detector → 3σ × 3σ patches → resize to 25 × 25 → match to codebook patches → learn the spatial distribution
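The part-voting step can be sketched as a generalized-Hough accumulator. A hedged toy version: the data structures (a dict of stored displacements per codeword, a coarse grid accumulator) are illustrative assumptions, not the paper's actual implementation, which also handles scale and soft codeword matches.

```python
from collections import defaultdict

def ism_vote(detections, codebook_displacements, image_shape, bin_size=10):
    """Cast votes for the object center in a coarse accumulator grid.

    detections : list of (codeword_id, x, y) features matched in the test image
    codebook_displacements : dict codeword_id -> list of (dx, dy) offsets from
        feature position to object center, recorded during training
    Returns the accumulator cell with the most votes and its vote mass.
    """
    acc = defaultdict(float)
    for word, x, y in detections:
        offsets = codebook_displacements.get(word, [])
        if not offsets:
            continue
        w = 1.0 / len(offsets)  # split each feature's vote over its stored offsets
        for dx, dy in offsets:
            cx, cy = x + dx, y + dy  # predicted object center
            if 0 <= cx < image_shape[1] and 0 <= cy < image_shape[0]:
                acc[(int(cx // bin_size), int(cy // bin_size))] += w
    if not acc:
        return None, 0.0
    best = max(acc, key=acc.get)
    return best, acc[best]
```

Features that agree on a common center concentrate their votes in one cell, which is what makes the hypothesis stand out over clutter.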
Pedestrian Detection in Crowded Scenes
1. Interleaved Object Categorization and Segmentation, BMVC 2003
2. Combined Object Categorization and Segmentation with an Implicit Shape Model. Bastian Leibe, Ales Leonardis, and Bernt Schiele. In ECCV'04 Workshop on Statistical Learning in Computer Vision, Prague, May 2004
Pedestrian Detection
- Many applications
- Large variation in shape and appearance
- Need to combine different representations
- Basic premise: such a problem is too difficult for any single type of feature or model alone
- Probabilistic bottom-up / top-down segmentation
- Open question: how would you do pedestrian detection/segmentation?
- Solution: integrate as many cues as possible, from many sources
(Figure: original image; segmentation from local features; support of segmentation from local features; support of segmentation from global features (Chamfer matching).)
Theme of the Paper
- Goal: localize AND count pedestrians in a given image
- Datasets:
  - Training set: 35 people walking parallel to the image plane
  - Test set (much harder!): 209 images with 595 annotated pedestrians
Initial Recognition Approach
- First step: generate hypotheses from local features (Implicit Shape Model)
- Testing: generate initial hypotheses, then an overall hypothesis
Initial Recognition Approach
- Second step: segmentation-based verification (Minimum Description Length)
- Caveat: this leads to another set of problems. The ISM doesn't know that a person doesn't have three legs, or four legs and three arms
- Global cues are needed
Assimilation of Global Cues
- Distance transform, Chamfer matching:
  - Get a feature image with an edge detector
  - Get the DT image by computing the distance to the nearest feature point
  - Compute the Chamfer distance between the template and the DT image
Assimilation of Global Cues (Attempt 1)
- Distance transform, Chamfer matching:
  - Initial hypothesis generated by local features
  - Use the scale estimate to cut out the surrounding region
  - Apply a Canny detector and compute the DT
  - Chamfer-distance-based matching (yellow marks the highest Chamfer score)
Results
Generative Part-Based Models
Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba
Generative part-based models
R. Fergus, P. Perona and A. Zisserman, Object Class Recognition by Unsupervised Scale-Invariant Learning, CVPR 2003
Probabilistic model

P(image | object) = P(appearance, shape | object)
                  = max_h P(appearance | h, object) p(shape | h, object) p(h | object)

h: assignment of features to parts
(Figure: candidate parts with their part descriptors and part locations.)
Probabilistic model

P(image | object) = P(appearance, shape | object)
                  = max_h P(appearance | h, object) p(shape | h, object) p(h | object)

h: assignment of features to parts
(Figure: Parts 1, 2, and 3 highlighted in the image.)
Probabilistic model

P(image | object) = P(appearance, shape | object)
                  = max_h P(appearance | h, object) p(shape | h, object) p(h | object)

The appearance term is a distribution over patch descriptors in a high-dimensional appearance space.
Probabilistic model

P(image | object) = P(appearance, shape | object)
                  = max_h P(appearance | h, object) p(shape | h, object) p(h | object)

The shape term is a distribution over joint part positions in the 2D image space.
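The max over assignments h can be made concrete with a toy sketch. This assumes isotropic Gaussian appearance and shape models per part and a uniform p(h); all names and model choices here are illustrative assumptions, not the paper's actual implementation (which uses full-covariance Gaussians and efficient search rather than brute-force enumeration).

```python
import itertools
import numpy as np

def log_gauss(x, mu, sigma):
    """Log-density of an isotropic Gaussian with mean mu and std sigma."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = x.size
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2 - d * np.log(sigma * np.sqrt(2 * np.pi))

def best_assignment(features, positions, app_models, shape_models):
    """Maximize over assignments h of candidate features to model parts.

    features     : list of appearance descriptors (one per candidate feature)
    positions    : list of (x, y) positions (one per candidate feature)
    app_models   : per-part (mu, sigma) in appearance space
    shape_models : per-part (mu, sigma) in image space
    Returns (best log-likelihood, best tuple h mapping part -> candidate).
    """
    P = len(app_models)
    best, best_h = -np.inf, None
    # Brute-force enumeration of injective assignments: this is exactly the
    # combinatorial hypothesis search that makes these models expensive.
    for h in itertools.permutations(range(len(features)), P):
        ll = 0.0
        for part, cand in enumerate(h):
            ll += log_gauss(features[cand], *app_models[part])    # appearance term
            ll += log_gauss(positions[cand], *shape_models[part])  # shape term
        if ll > best:
            best, best_h = ll, h
    return best, best_h
```

With N candidate features and P parts the loop visits N!/(N-P)! assignments, which is why the summary slide lists the combinatorial hypothesis search as the main drawback.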
How to model location?
- Explicit: probability density functions
- Implicit: voting scheme
- Invariance: translation, scaling, similarity/affine, viewpoint
(Figure: translation; translation and scaling; similarity transformation; affine transformation.)
Summary: Adding spatial information
- Doublet vocabularies
  - Pro: takes co-occurrences into account; some geometric invariance is preserved
  - Con: too many doublet probabilities to estimate
- Spatial pyramids
  - Pro: simple extension of a bag of features; works very well
  - Con: no geometric invariance, no object localization
- Implicit shape models
  - Pro: can localize objects; maintain translation and possibly scale invariance
  - Con: need supervised training data (known object positions and possibly segmentation masks)
- Generative part-based models
  - Pro: very nice conceptually; can be learned from unsegmented images
  - Con: combinatorial hypothesis search problem
Slides from Sminchisescu
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
- Tested with: RGB, LAB, grayscale
- Gamma normalization and compression: square root, log
Gradient filters tested: centered, uncentered, diagonal, cubic-corrected, Sobel
- Histogram of gradient orientations, binned by orientation
8 orientations × 15×7 cells
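The per-cell orientation histograms above can be sketched as follows. A minimal illustration only: it uses centered [-1, 0, 1] derivative filters and unsigned orientations as in the slides, but omits block normalization, which is a key part of the full Dalal-Triggs pipeline; parameter names are assumptions.

```python
import numpy as np

def hog_descriptor(img, cell=8, n_bins=8):
    """Minimal HOG sketch: per-cell histograms of gradient orientations.

    img : 2-D grayscale array whose sides are multiples of `cell`
    Each pixel votes for an orientation bin, weighted by gradient magnitude.
    """
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # centered horizontal derivative
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # centered vertical derivative
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    h, w = img.shape
    hist = np.zeros((h // cell, w // cell, n_bins))
    bin_width = 180.0 / n_bins
    for i in range(h):
        for j in range(w):
            b = int(ang[i, j] // bin_width) % n_bins
            hist[i // cell, j // cell, b] += mag[i, j]
    return hist.ravel()
```

For a 64×128 detection window with 8×8 cells this yields the grid of small orientation histograms shown on the slide.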
pedestrian
Overfitting
- A simple dataset
- Two models: linear and nonlinear
Overfitting
- Let's get more data
- The simple model has better generalization
Overfitting
- As complexity increases, the model overfits the data:
  - Training loss decreases
  - Real loss increases
- We need to penalize model complexity, i.e. to regularize
(Figure: training loss and real loss as a function of model complexity.)
Classification methods
- K nearest neighbors
- Decision trees
- Linear SVMs
- Kernel SVMs
- Boosted classifiers
K Nearest Neighbors
- Memorize all training data
- Find the K closest points to the query
- The neighbors vote for the label (e.g. with K = 3: Vote(class 1) = 2, Vote(class 2) = 1)
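The three steps above can be sketched in a few lines; names and the 2-D point representation are illustrative.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train : list of ((x, y), label) pairs  ("memorize all training data")
    query : an (x, y) point
    """
    # Find the K closest training points to the query.
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    # The neighbors vote for the label.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

There is no training beyond storing the data; all the work happens at query time, which is the classic trade-off of nearest-neighbor methods.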
K-Nearest Neighbors: nearest neighbors (silhouettes)
Kristen Grauman, Gregory Shakhnarovich, and Trevor Darrell, Virtual Visual Hulls: Example-Based 3D Shape Inference from Silhouettes
K-Nearest Neighbors: silhouettes from other views → 3D visual hull
Kristen Grauman, Gregory Shakhnarovich, and Trevor Darrell, Virtual Visual Hulls: Example-Based 3D Shape Inference from Silhouettes
Decision tree
(Figure: a two-level decision tree. The root tests X1 > 2; one branch then tests X2 > 1; each leaf stores vote counts V for the two classes, e.g. V = 8 vs. V = 2.)
Decision Tree Training
- Partition the data into pure chunks:
  - Find a good rule
  - Split the training data
  - Build the left tree
  - Build the right tree
- Count the examples in the leaves to get the votes V for each class
- Stop when:
  - Purity is high
  - Data size is small
  - At a fixed level
(Figure: successive splits with leaf purities rising from 57% toward 100%.)
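The recursive training procedure above can be sketched as follows. A hedged toy version: it greedily minimizes misclassified examples under a per-side majority vote (real implementations typically use entropy or Gini impurity), and all names are illustrative.

```python
from collections import Counter

def train_tree(data, max_depth=3, min_size=2):
    """Recursively split ((x1, x2, ...), label) pairs on feature thresholds.

    An internal node is a dict {feature, threshold, left, right}; a leaf is a
    Counter of the labels that reached it (the votes V from the slide).
    """
    labels = Counter(label for _, label in data)
    # Stop when the node is pure, the data size is small, or at a fixed level.
    if len(labels) == 1 or len(data) <= min_size or max_depth == 0:
        return labels
    best = None
    for f in range(len(data[0][0])):
        for threshold in sorted({x[f] for x, _ in data}):
            left = [d for d in data if d[0][f] <= threshold]
            right = [d for d in data if d[0][f] > threshold]
            if not left or not right:
                continue
            # "Good rule" proxy: misclassified count under majority vote.
            err = sum(len(side) - Counter(l for _, l in side).most_common(1)[0][1]
                      for side in (left, right))
            if best is None or err < best[0]:
                best = (err, f, threshold, left, right)
    if best is None:
        return labels
    _, f, threshold, left, right = best
    return {"feature": f, "threshold": threshold,
            "left": train_tree(left, max_depth - 1, min_size),
            "right": train_tree(right, max_depth - 1, min_size)}

def tree_votes(node, x):
    """Walk the tree and return the leaf's label votes for point x."""
    while not isinstance(node, Counter):  # internal nodes are plain dicts
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node
```

At prediction time a point follows the thresholds down to a leaf and inherits that leaf's vote counts, exactly as in the tree diagram.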
Histogram intersection
- Assign each feature to a texture cluster and count
- Compare the resulting histograms with histogram intersection
S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.
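The comparison step is the histogram intersection kernel, I(h1, h2) = Σ_i min(h1[i], h2[i]), which measures the overlap between two histograms. A minimal sketch:

```python
import numpy as np

def histogram_intersection(h1, h2):
    """I(h1, h2) = sum_i min(h1[i], h2[i]) -- the overlap of two histograms."""
    return float(np.minimum(h1, h2).sum())
```

Applied to the concatenated, level-weighted histograms of a spatial pyramid, this is exactly the spatial pyramid match kernel of the next slide.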
(Spatial) Pyramid Match S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.