Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
Matching and Modeling Seminar <eliyahud@post.tau.ac.il>
Instructed by Prof. Haim J. Wolfson
School of Computer Science, Tel Aviv University
December 9th, 2015
Outline
  Pyramid Matching Kernels
  Spatial Matching Scheme
We want to recognize the semantic category of an image
Goals:
  Scene (forest, street, etc.)
  Objects of interest
  Efficiency
  Accuracy
Based on [Lazebnik et al., 2006]
Set of methods for identification and classification
Usually:
  Define image features
    For example, SIFT descriptors in 16 × 16 image blocks
  Orderless representation of (local) image features
    Usually a long vector
    For example, a histogram with k bins
  Learning algorithm for classification or detection
    Such as SVM
Disadvantages
Existing methods usually use local features
  Hence, they disregard the spatial layout of the features
For example:
  Inability to capture the shape of objects
  Inability to segment an object from its background
Overcoming these limitations is challenging
  Especially with noise, such as:
    Clutter
    Occlusion
    Viewpoint change
Overcoming the Disadvantages
Previous works suggest:
  Robustness at significant computational expense
  Efficiency with inconclusive results
In this paper: statistics of local features over fixed subregions
Off Topic: SVM
Support Vector Machine
  Supervised learning algorithm
  Creates a (binary) linear classifier
    Finds a hyperplane that separates the categories
  Has many extensions
    For example, multiclass SVM
Off Topic: SVM, More Formally
Given D = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {±1}}
Find w, b that maximize min over (x_i, y_i) ∈ D of y_i (w · x_i − b)
Given a new example x, its classification y satisfies:
  y = 1 if w · x − b > 0
  y = −1 if w · x − b ≤ 0
w · x − b = 0 is called the separating hyperplane
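As a minimal sketch (not part of the original slides), the decision rule above can be written directly in Python; the weight vector, bias, and example points below are hypothetical toy values, not a trained classifier:

```python
import numpy as np

def svm_predict(w, b, x):
    """Classify x by the sign of the decision value w·x − b."""
    return 1 if np.dot(w, x) - b > 0 else -1

# toy separating hyperplane in 2D (hypothetical values)
w = np.array([1.0, -1.0])
b = 0.0
print(svm_predict(w, b, np.array([2.0, 1.0])))  # w·x − b = 1 > 0, so prints 1
print(svm_predict(w, b, np.array([1.0, 2.0])))  # w·x − b = −1 ≤ 0, so prints -1
```

Training (finding w and b that maximize the margin) is the hard part and is done by a quadratic-programming solver in practice.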
Kernel-based recognition method
Pyramid matching scheme of Grauman and Darrell
  Repeatedly subdivide the image
  For each division, compute histograms of local features
Pyramid Matching Kernels
Let X and Y be two sets of d-dimensional vectors
  Vectors in a feature space
We want to find an approximate correspondence between them
Resolutions
Construct a sequence of grids for resolutions 0, ..., L
  The grid at level l has 2^l cells along each dimension
  Total of D_l = 2^{dl} cells
Let H^l_X and H^l_Y denote the histograms of X and Y at this resolution
  Let H^l_X(i) and H^l_Y(i) be the numbers of points from X and Y in the i-th cell
Matchings in a Resolution
If x ∈ X falls in the i-th cell at level l, it is counted in H^l_X(i)
The number of matches at level l is:
  I^l = I(H^l_X, H^l_Y) = Σ_{i=1}^{D_l} min(H^l_X(i), H^l_Y(i))
I^l also includes I^{l+1}
  Therefore, the number of new matches at level l is I^l − I^{l+1}, for 0 ≤ l < L
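The histogram intersection I^l takes one line of NumPy; the two histograms below are made-up toy values:

```python
import numpy as np

def intersection(hx, hy):
    """Number of matches between two histograms: sum_i min(H_X(i), H_Y(i))."""
    return np.minimum(hx, hy).sum()

hx = np.array([3, 0, 2, 1])
hy = np.array([1, 2, 2, 0])
print(intersection(hx, hy))  # per-cell minima 1 + 0 + 2 + 0 = 3
```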
Pyramid Match Kernel
Give level l the weight 1/2^{L−l}
The pyramid match kernel is defined as:
  κ^L(X, Y) = I^L + Σ_{l=0}^{L−1} (1/2^{L−l}) (I^l − I^{l+1})
            = (1/2^L) I^0 + Σ_{l=1}^{L} (1/2^{L−l+1}) I^l
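A sketch of the kernel computed from the per-level intersection counts, with the second (closed) form checked numerically; the counts below are toy values, not real image data:

```python
def pyramid_match(I):
    """Pyramid match kernel from per-level intersections I[0..L]:
    I^L at full weight, plus new matches (I^l - I^{l+1}) weighted 1/2^{L-l}."""
    L = len(I) - 1
    k = I[L]
    for l in range(L):
        k += (I[l] - I[l + 1]) / 2 ** (L - l)
    return k

def pyramid_match_alt(I):
    """Equivalent closed form: I^0 / 2^L + sum_{l>=1} I^l / 2^{L-l+1}."""
    L = len(I) - 1
    return I[0] / 2 ** L + sum(I[l] / 2 ** (L - l + 1) for l in range(1, L + 1))

I = [10, 6, 4]  # toy intersection counts for levels 0, 1, 2
print(pyramid_match(I), pyramid_match_alt(I))  # both give 6.0
```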
Spatial Information
So far, we have counted the number of features in each cell
  But we ignored that the features might differ or be similar
Now we quantize all feature vectors into M discrete types
  Only features of the same type can match one another
  Each channel m gives us two sets of vectors, X_m and Y_m
This approach reduces to a standard bag of features when L = 0
The Image Kernel
  K^L(X, Y) = Σ_{m=1}^{M} κ^L(X_m, Y_m)
Vector Length
Our images are two-dimensional (d = 2)
  Hence each resolution divides the previous one by 2 × 2 = 4
For L levels and M channels, the vector length is:
  M Σ_{l=0}^{L} 4^l = M (4^{L+1} − 1) / 3
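The formula is easy to sanity-check in code; 4200 below matches the L = 2, M = 200 configuration used later in the experiments:

```python
def vector_length(M, L):
    """Total length of the concatenated histogram vector for d = 2:
    level l contributes 4**l cells per channel; equals M * (4**(L+1) - 1) / 3."""
    return M * sum(4 ** l for l in range(L + 1))

print(vector_length(200, 2))  # 200 * (1 + 4 + 16) = 4200
```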
Efficient Implementation
1. Subdivide the image into L levels
2. For each level and each channel:
   1. Count the features that fall in each spatial bin
   2. Weight this spatial bin, and add its weighted value to the histogram
The next slide shows an example
Example
[Figure: feature points binned at levels 0, 1, and 2 of the pyramid, with level weights 1/4, 1/4, and 1/2.]
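The construction above can be sketched for a single channel, assuming feature coordinates normalized to [0, 1); the function name and the toy points are mine, not from the paper:

```python
import numpy as np

def spatial_histogram(points, L):
    """Concatenated, weighted spatial histogram for one channel.
    points: (n, 2) array of feature coordinates in [0, 1)^2."""
    parts = []
    for l in range(L + 1):
        cells = 2 ** l                       # 2^l x 2^l grid at level l
        h = np.zeros((cells, cells))
        for x, y in points:                  # count features per spatial bin
            h[int(x * cells), int(y * cells)] += 1
        # level weights: 1/2^L for level 0, then 1/2^(L-l+1); for L = 2
        # this gives 1/4, 1/4, 1/2 as in the example above
        w = 1 / 2 ** L if l == 0 else 1 / 2 ** (L - l + 1)
        parts.append(w * h.ravel())
    return np.concatenate(parts)

pts = np.array([[0.1, 0.2], [0.6, 0.7], [0.9, 0.3]])
vec = spatial_histogram(pts, L=2)
print(len(vec))  # 1 + 4 + 16 = 21 bins
```

With these weighted vectors, the pyramid match kernel between two images reduces to a single histogram intersection of the concatenated vectors.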
About the Experiments
Each experiment was repeated 10 times
  Each with different randomly selected training and testing sets
The per-class average is recorded for each run
  The final result consists of the mean and standard deviation
Multiclass classification was done using SVM
  Trained using the one-vs-all rule
  The label is assigned by the classifier with the highest response
Used Features
Two kinds of features:
  Weak features
  Strong features
The Weak Features
Oriented edge points
  Extract edge points at two scales and eight orientations
  Total of M = 16 channels
The Strong Features
SIFT descriptors
  16 × 16 pixel patches, computed over a grid with 8-pixel spacing
The visual vocabulary was calculated by:
  k-means clustering of a random subset of patches
  Taken from the training set
Used with M = 200 or M = 400
The dataset consists of 15 scene categories
  Each category has 200 to 400 images
  Average image size is 300 × 250 pixels
Originally from [Li and Perona, 2005] with 13 categories
  2 categories were collected by [Lazebnik et al., 2006]
Dataset Sample
Results for Classification

          Weak features (M = 16)   Strong features (M = 200)  Strong features (M = 400)
L         Single-level  Pyramid    Single-level  Pyramid      Single-level  Pyramid
0 (1×1)   45.3 ± 0.5               72.2 ± 0.6                 74.8 ± 0.3
1 (2×2)   53.6 ± 0.3   56.2 ± 0.6  77.9 ± 0.6   79.0 ± 0.5    78.8 ± 0.4   80.1 ± 0.5
2 (4×4)   61.7 ± 0.6   64.7 ± 0.7  79.4 ± 0.3   81.1 ± 0.3    79.7 ± 0.5   81.4 ± 0.5
3 (8×8)   63.3 ± 0.8   66.8 ± 0.6  77.2 ± 0.4   80.7 ± 0.3    77.2 ± 0.5   81.1 ± 0.6
Results Analysis
Results improve dramatically as we go from L = 0 to a multilevel setup
For strong features, performance drops from L = 2 to L = 3
  Because the highest level of L = 3 is too finely subdivided
    Individual bins yield too few matches
  Still, the performance drop is small
Strong features are better than weak features
  However, the number of strong features is almost irrelevant
Confusion Table
Confusions
Confusions occur between:
  Indoor classes (kitchen, bedroom, living room)
  Natural classes (coast, open country)
This is understandable
The next slide shows examples of image retrieval
Image Retrieval Examples
Previous Work
[Li and Perona, 2005] achieved a success rate of 65.2% with L = 0 and M = 200
Whereas [Lazebnik et al., 2006] achieved 72.2% (74.7% for the 13 categories of [Li and Perona, 2005])
[Lazebnik et al., 2006] wondered why the gap is so big
  Apparently, [Li and Perona, 2005] used worse strong features than [Lazebnik et al., 2006]
  [Lazebnik et al., 2006] achieved 65.9% with the features of [Li and Perona, 2005]
Based on [Fei-Fei et al., 2004]
  101 categories
  31 to 800 images per category
  Image size about 300 × 300 pixels
Properties:
  Relatively little clutter
  Objects are centered
  Objects occupy most of the image
Experiment Setup
Train on 30 images per class
Test on the rest
  For efficiency, limit testing to 50 images per class
Same setup as in [Grauman and Darrell, 2005, Zhang et al., 2007]
Results

     Weak features             Strong features (M = 200)
L    Single-level  Pyramid     Single-level  Pyramid
0    15.5 ± 0.9                41.2 ± 1.2
1    31.4 ± 1.2   32.8 ± 1.3   55.9 ± 0.9   57.0 ± 0.8
2    47.2 ± 1.1   49.3 ± 1.4   63.6 ± 0.9   64.6 ± 0.8
3    52.2 ± 0.8   54.0 ± 1.1   60.3 ± 0.9   64.6 ± 0.7

Results for M = 400 do not improve, just as for scenes
Results Examples: Easiest vs. Hardest Categories
Highest performance was achieved with L = 2, M = 200
Comparison to Previous Work
For L = 0, [Grauman and Darrell, 2005] got 43%
  We got 41.2%
Our best result is 64.6%, with L = 2, M = 200
  Better than the 53.9% of [Zhang et al., 2007]
Confusions
There are many closely related classes
Top five confusions for L = 2, M = 200:

Class 1     Class 2          Class 1 misclassified   Class 2 misclassified
                             as class 2              as class 1
ketch       schooner         21.6                    14.8
lotus       water lily       15.3                    20.0
crocodile   crocodile head   10.5                    10.0
crayfish    lobster          11.3                    9.1
flamingo    ibis             9.5                     10.4
So Far
Our method does well assuming canonical poses
What about:
  Heavy clutter?
  Pose changes?
[Opelt et al., 2004]
Characterized by high intra-class variation
Two object classes:
  Objects: Bikes (373 images) + Persons (460 images)
  Backgrounds (270 images)
Image resolution is 640 × 480 pixels
The range of scales and poses is very diverse
  For example, in the persons category:
    A pedestrian in the distance
    A side view of a complete body
    A closeup of a head
Examples Bikes
Examples Persons
Examples Backgrounds
Experiment Setup
Same as in [Opelt et al., 2004]:
  Two-class detection: objects vs. background
  Train on 100 positive and 100 negative images
    50 from the other object class
    50 from the backgrounds class
  Test on a similarly distributed set
Results
Strong features, M = 200:

Class    L = 0        L = 2        Opelt   Zhang
Bikes    82.4 ± 2.0   86.3 ± 2.5   86.5    92.0
People   79.5 ± 2.3   82.3 ± 3.1   80.8    88.0

The standard deviation is quite high
  For example, the bikes class for L = 2 ranges from 81% to 91%
Still, we are close to the other methods
Bibliography

Fei-Fei, L., Fergus, R., and Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Computer Vision and Pattern Recognition Workshop (CVPRW '04), pages 178–178.

Grauman, K. and Darrell, T. (2005). The pyramid match kernel: Discriminative classification with sets of image features. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 2, pages 1458–1465.

Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), volume 2, pages 2169–2178, Washington, DC, USA. IEEE Computer Society.

Li, F.-F. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), volume 2, pages 524–531, Washington, DC, USA. IEEE Computer Society.

Opelt, A., Fussenegger, M., Pinz, A., and Auer, P. (2004). Weak hypotheses and boosting for generic object detection and recognition. In Pajdla, T. and Matas, J., editors, Computer Vision – ECCV 2004, volume 3022 of Lecture Notes in Computer Science, pages 71–84. Springer Berlin Heidelberg.

Zhang, J., Marszalek, M., Lazebnik, S., and Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vision, 73(2):213–238.