Object Recognition in Images and Video: Advanced Bag-of-Words 27 April / 61


1 Object Recognition in Images and Video: Advanced Bag-of-Words
Prof. Andrew D. Bagdanov
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Firenze
andrew.bagdanov AT unifi.it
27 April 2017

2 Outline
1 Comments
2 Overview
3 The Bag-of-Words Model
4 Interlude
5 Advanced BOW: Spatial Pyramids
6 Advanced BOW: Sparse Coding
7 Advanced BOW: Fisher Vectors
8 Detection: Deformable Part Models
9 Discussion

3 Comments

4 Comments: Final Exam For the exam (your presentations), we can be flexible. We're all busy, and I recognize that. For those in Florence: we can arrange to do your presentations at any time (preferably by mid-June). For those outside of Florence: we can also do the final exam by Skype, if that's easier. I strongly desire to finish all exam presentations by mid-June.

5 Overview

6 Overview In this lesson I will first pick up where I left off with an explanation of the basic Bag-of-Words (BOW) model. Then, I will explain three extensions of this basic model (which is why the lecture is called Advanced Bag-of-Words). Finally, I will cover an article on detection (which is only loosely connected with the Bag-of-Words model). Since you have all read the articles, I will cover the details briefly, then I will open the floor for discussion and questions. Together, we will work to reach a deeper understanding of each contribution.

7 The Bag-of-Words Model

8 Three Magic Ingredients Now we will shift our discussion to one of the first Big Breakthroughs in modern object recognition: Visual Categorization with Bags of Keypoints, Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray. In: European Conference on Computer Vision (ECCV). These ideas were developed independently, in many places, at the same time. This paper is one of the first and, in my opinion, the simplest explanation of the basic Bag-of-Words pipeline. Returning again to our analogy with text retrieval, we now have a reasonably invariant way to describe local image structure. However, we still don't have a concept corresponding to words. SIFT features are 128-dimensional real-valued vectors, which are not discrete enough to use in a TF*IDF model.

9 Feature Quantization Key idea: use clustering to identify groups of SIFT descriptors in a training set. The cluster centers are used as the visual vocabulary (the words) in our model. All SIFT descriptors extracted from training or test images are quantized to the closest visual word in our vocabulary. We have gone from an infinite class of SIFT descriptors to a finite class of visual words.
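
The quantization step can be sketched in a few lines. This is only a minimal illustration, assuming the vocabulary has already been learned (for example by k-means); the toy 2-dimensional vectors stand in for 128-dimensional SIFT descriptors, and the names `squared_distance` and `quantize` are made up for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(descriptor, vocabulary):
    """Return the index of the visual word (cluster center) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


# Toy vocabulary of three visual words (assumed to come from k-means on a training set).
vocabulary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Toy descriptors extracted from one image.
descriptors = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.05, 0.1]]

# Each descriptor is replaced by a single integer: the index of its visual word.
words = [quantize(d, vocabulary) for d in descriptors]
print(words)
```

A brute-force nearest-neighbour search like this is fine for a sketch; with large vocabularies one would normally use an approximate nearest-neighbour structure instead.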

10 Feature Pooling One last problem: the number of SIFT descriptors is variable, since each image will yield a different number of points. Also, a direct comparison of two descriptor sets depends crucially on the order of the points, which is arbitrary. This makes it hard to apply standard machine learning techniques to our representation (e.g. SVM, naive Bayes, nearest neighbor, etc). The solution: as in text retrieval, use pooling to build a fixed-length descriptor of images that is invariant to descriptor order. Our descriptor is a histogram of frequencies of visual word occurrences in the image. To compare images we can now use inner products (like TF*IDF), SVMs, and a vast array of tried and true classifiers. This last point is the most important: given a training set of images labeled with object categories, we can train classifiers to recognize objects in unseen test images.
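
A small sketch of the pooling step, assuming each image has already been reduced to a list of visual word indices. The function name `bow_histogram` and the L1 normalization to frequencies are choices made for this illustration.

```python
def bow_histogram(words, vocabulary_size):
    """Pool a variable-length list of visual word indices into a fixed-length
    histogram of relative frequencies."""
    counts = [0] * vocabulary_size
    for w in words:
        counts[w] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * vocabulary_size
    return [c / total for c in counts]


# Two images with the same visual words in a different order (toy data).
image_a = [0, 1, 2, 0]
image_b = [2, 0, 0, 1]

hist_a = bow_histogram(image_a, vocabulary_size=3)
hist_b = bow_histogram(image_b, vocabulary_size=3)

# The descriptor has a fixed length and does not depend on descriptor order.
print(hist_a, hist_b)
```

The resulting vectors all have the same length, whatever the number of descriptors per image, so they can be fed directly to a standard classifier.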

11 The Bag-of-Words Model This full pipeline is best explained graphically

12 The Bag-of-Words Model Csurka et al. demonstrated the BOW approach on a dataset with 7 object categories. They extract BOW descriptors from training images and train a one-versus-all linear SVM for each category.

13 The Bag-of-Words The punchline: the results on this challenging dataset are impressive. The approach uses a small vocabulary of 1000 visual words (in text retrieval, 100K+ word dictionaries are common). It also uses an extremely simple linear SVM for classification.

14 The Bag-of-Words Added bonus: visual words are semantically meaningful (note, this example from Csurka et al. is highly cherry-picked):

15 The Bag-of-Words Another bonus: the one-versus-all SVM architecture can recognize multiple object categories in images.

16 BOW: Discussion Like the SIFT descriptor, it is hard to overstate the impact and influence the Bag-of-Words model has had on the development of modern object recognition. It is a hallmark result, despite its extreme simplicity (in hindsight). The paper of Csurka et al. was the first to demonstrate the plausibility of efficient, accurate, and robust object recognition over many categories with extreme visual variance. Clearly, this simple BOW model was only the beginning. The next ten years of computer vision were dominated by incremental improvements and refinements of this model. In the next lecture we will head off in that direction with a survey of advanced Bag-of-Words models that came after. Note: see the course website for the required reading for next week.

17 Interlude

18 Interlude: Ten Years of Progress The Bag-of-Words was quickly adopted by the community as *the* method for object recognition. A long series of improvements over the basic BOW model rapidly followed. In the following, we will look at some important examples. With the adoption of the BOW model, and the growing interest in object recognition, several standard benchmark datasets were also established. Many of these datasets were developed in the context of international competitions. PASCAL VOC: five years of competitions, many high-quality benchmark datasets. ImageNet: first version with 1000 object categories, over 1M images. Current version has 10M+ images.

19 Advanced BOW: Spatial Pyramids

20 BOW is an Orderless Image Representation The next paper we will examine is: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, S Lazebnik, C Schmid, J Ponce. In: Computer Vision and Pattern Recognition (CVPR). The motivation behind this work is that the Bag-of-Words is an orderless image representation. We can think of the BOW histogramming process as marginalizing spatial information away. Assume we have encoded an image in quantized visual words; that is, each spatial location is represented by a single integer, the index of the cluster center closest to its SIFT descriptor.

21 BOW is an Orderless Image Representation Assume we have $K = 1000$ visual words in our vocabulary, and let $I_q$ be the quantized image (i.e. each location is represented by an integer index). Let $\delta_i$ be the (curried) Kronecker delta function:
\[ \delta_i(j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \]
We can express the image as a discrete field of one-hot vectors, and the histogram as a sum over the field:
\[ I_{\text{1-hot}}(x, y) = \left[ \delta_i(I_q(x, y)) \right]_{i=1}^{1000} \]
\[ H(I_{\text{1-hot}}) = \sum_x \sum_y I_{\text{1-hot}}(x, y) \]

22 BOW is an Orderless Image Representation Feature: we have a fixed-length representation of images. Feature: this representation has strong invariance. Bug: we lose all spatial coherence in this representation (it's too invariant).

23 Spatial Pyramids: Impose Some Order The main idea: revisit global non-invariant representations based on aggregating statistics of local features, and use kernel-based recognition that computes rough geometric correspondence on a global scale. Once you sweep away the mumbo jumbo in the paper, the method is quite simple: repeatedly subdivide the image, and compute histograms of local features at increasingly fine resolutions. The spatial pyramid technique is simple and extremely effective. After its publication, it became ubiquitous in nearly all BOW pipelines.

24 Spatial Pyramids: A Technical Aside The Support Vector Machine (SVM) is the standard classifier for BOW. The linear SVM objective is to find the $w$ that minimizes:
\[ \left[ \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - y_i (w \cdot x_i + b)\right) \right] + \lambda \|w\|^2 \]
This can be rewritten as a constrained optimization problem:
\[ \text{minimize } \frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \|w\|^2 \quad \text{subject to } y_i (w \cdot x_i + b) \geq 1 - \zeta_i \text{ and } \zeta_i \geq 0, \text{ for all } i. \]
And the dual formulation:
\[ \text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j) y_j c_j \]
\[ \text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \leq c_i \leq \frac{1}{2n\lambda} \text{ for all } i. \]
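
To make the primal objective concrete, here is a minimal sketch that only evaluates it for a given $w$ and $b$ (it does not train anything). The function names and the toy data are assumptions made for this illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def svm_objective(w, b, X, y, lam):
    """Mean hinge loss plus L2 regularization: the linear SVM primal objective."""
    n = len(X)
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)) / n
    return hinge + lam * dot(w, w)


# Toy data: labels are +1 / -1.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]

# The first two points lie outside the margin (zero loss); the third is correctly
# classified but inside the margin, so it contributes a hinge loss of 0.5.
objective = svm_objective([1.0, 0.0], 0.0, X, y, lam=0.1)
print(objective)
```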

25 Spatial Pyramids: A Technical Aside The $c_i$ in the dual are formulated so that we can write the classifier vector as:
\[ w = \sum_{i=1}^{n} c_i y_i x_i. \]
Kernel trick: embed our feature vectors $x_i$ in a Hilbert space via $\varphi(x_i)$.
\[ \text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i \, k(x_i, x_j) \, y_j c_j \]
\[ \text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \leq c_i \leq \frac{1}{2n\lambda} \text{ for all } i, \]
where $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$. So, we never have to actually embed our features; we just need to compute the kernel matrix of inner products between all pairs.

26 Spatial Pyramids: A Technical Aside Some popular kernels:
Linear: $k(x, y) = x \cdot y$
Gaussian RBF: $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
$\chi^2$: $k(x, y) = 1 - 2 \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{x_i + y_i}$
Exponential $\chi^2$: $k(x, y) = \exp\left(-\beta (1 - \chi^2(x, y))\right)$
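
A rough sketch of these kernels and of the kernel matrix that the dual SVM needs. The $\chi^2$ kernel is written here as $1 - 2\sum_i (x_i - y_i)^2 / (x_i + y_i)$, which is my reading of the slide; skipping bins where $x_i + y_i = 0$ is an assumption added to avoid division by zero, and all function names are made up for this example.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def chi2_kernel(x, y):
    # Bins that are empty in both histograms are skipped (assumption).
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return 1.0 - 2.0 * dist


def exp_chi2_kernel(x, y, beta=1.0):
    return math.exp(-beta * (1.0 - chi2_kernel(x, y)))


def kernel_matrix(X, kernel):
    """Kernel values for all pairs: the only thing the dual formulation needs."""
    return [[kernel(a, b) for b in X] for a in X]


# Toy L1-normalized BOW histograms.
histograms = [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5], [0.0, 0.5, 0.5]]
K = kernel_matrix(histograms, exp_chi2_kernel)
for row in K:
    print(row)
```

The matrix has one entry per pair of training images, which is why kernel SVMs become expensive as datasets grow.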

27 Spatial Pyramids: Structured Matching The inspiration for Spatial Pyramids comes from a technique called pyramid matching for measuring similarity between $d$-dimensional point sets $X$ and $Y$. It constructs a sequence of grids at resolutions $0, \ldots, L$ such that the grid at level $l$ has $2^l$ cells along each of the $d$ dimensions. Thus, there is a total of $D = 2^{dl}$ cells at level $l$. Finally, let $H_X^l$ and $H_Y^l$ denote the histograms of $X$ and $Y$ at level $l$. So: $H_X^l(i)$ and $H_Y^l(i)$ are the numbers of points from $X$ and $Y$ that fall into the $i$th cell of the grid.

28 Spatial Pyramids: Structured Matching The main tool in defining Spatial Pyramids is the Histogram Intersection Kernel:
\[ I(H_X^l, H_Y^l) = \sum_{i=1}^{D} \min\left(H_X^l(i), H_Y^l(i)\right) \]
What we're doing here is simply counting the number of points that hit the same cells in the dyadic spatial decomposition. If we notice that the matches found at level $l$ also include all matches at finer levels, we can write the pyramid match kernel (with $I^l$ short for $I(H_X^l, H_Y^l)$):
\[ \kappa^L(X, Y) = I^L + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left(I^l - I^{l+1}\right) = \frac{1}{2^L} I^0 + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} I^l \]
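
The pyramid match kernel can be sketched for 2-D point sets as follows. This is an illustration under simplifying assumptions: points live in the unit square, histograms are stored sparsely as dictionaries, and the function names are invented for this example. It uses the second (weighted sum) form of the kernel.

```python
def grid_histogram(points, level):
    """Count points per cell of a 2^level x 2^level grid over the unit square."""
    cells = 2 ** level
    hist = {}
    for x, y in points:
        cx = min(int(x * cells), cells - 1)
        cy = min(int(y * cells), cells - 1)
        hist[(cx, cy)] = hist.get((cx, cy), 0) + 1
    return hist


def intersection(hx, hy):
    """Histogram intersection: number of points that match within the same cell."""
    return sum(min(count, hy.get(cell, 0)) for cell, count in hx.items())


def pyramid_match(X, Y, L):
    """Weighted sum of intersections; matches found at finer levels weigh more."""
    I = [intersection(grid_histogram(X, l), grid_histogram(Y, l)) for l in range(L + 1)]
    kappa = I[0] / 2 ** L
    for l in range(1, L + 1):
        kappa += I[l] / 2 ** (L - l + 1)
    return kappa


X = [(0.1, 0.1), (0.9, 0.9)]
Y = [(0.2, 0.2), (0.6, 0.9)]

# Both points match at levels 0 and 1, but only one pair still matches at level 2.
print(pyramid_match(X, Y, L=2))
```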

29 Spatial Pyramids: Structured Matching To extend this idea to the BOW model, we can quantize all salient locations in the image (SIFT descriptors). Let $X_m$ and $Y_m$ represent the spatial locations of all visual words of type $m$ in images $X$ and $Y$, respectively. Then, we can write the Spatial Pyramid Match Kernel as:
\[ K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m) \]
Note that $\kappa^L$ is just a weighted sum of histogram intersections. Note also that, for positive numbers, $c \min(a, b) = \min(ca, cb)$. Thus, we can write the above as a single histogram intersection of concatenations of appropriately weighted channel histograms at all levels.
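
The "single histogram intersection of a long concatenated vector" view can be sketched like this. It is a toy illustration: features are assumed to be `(x, y, word)` triples with coordinates in the unit square, and `spatial_pyramid` and `histogram_intersection` are names chosen for this example.

```python
def spatial_pyramid(features, num_words, L):
    """Concatenate level-weighted per-cell visual word histograms for levels 0..L."""
    descriptor = []
    for level in range(L + 1):
        cells = 2 ** level
        # Level weights of the pyramid match kernel.
        weight = 1 / 2 ** L if level == 0 else 1 / 2 ** (L - level + 1)
        hist = [0.0] * (cells * cells * num_words)
        for x, y, word in features:
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hist[(cy * cells + cx) * num_words + word] += weight
        descriptor.extend(hist)
    return descriptor


def histogram_intersection(a, b):
    return sum(min(ai, bi) for ai, bi in zip(a, b))


# Two toy images with the same two visual words in swapped positions.
image_a = [(0.1, 0.1, 0), (0.9, 0.9, 1)]
image_b = [(0.9, 0.9, 0), (0.1, 0.1, 1)]

desc_a = spatial_pyramid(image_a, num_words=2, L=2)
desc_b = spatial_pyramid(image_b, num_words=2, L=2)

# A plain BOW histogram cannot tell these images apart; the spatial pyramid can.
print(histogram_intersection(desc_a, desc_a), histogram_intersection(desc_a, desc_b))
```

With $L = 2$ the descriptor has $M(1 + 4 + 16) = 21M$ entries, so the price of the added spatial structure is a much longer vector.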

30 Spatial Pyramids: What's Going On This is the diagram from the paper: But this is how people usually think of the Spatial Pyramid:

31 Spatial Pyramids: Experiments New Trend: More Data is Better. Scenes-15: Fifteen categories with strong inter-class variability, intra-class similarity images per category.

32 Spatial Pyramids: Experiments New Trend: More Data is Better. Caltech-101: 101 categories with strong inter-class variability, intra-class similarity images per category.

33 Spatial Pyramids: Experiments Results are impressive and interesting on Scenes-15: And on Caltech-101:

34 Spatial Pyramids: Reflections Despite the simplicity of the method, it consistently achieves an improvement over an orderless image representation. This holds even though it works not by constructing explicit object models, but by using global cues as indirect evidence about the presence of an object. This is not a trivial accomplishment, given that a well-designed bag-of-features method can outperform more sophisticated approaches based on parts and relations. As I mentioned before, the Spatial Pyramid technique became a standard trick to significantly and consistently improve results for BOW models. In the next two papers, we will look at more sophisticated coding techniques.

35 Advanced BOW: Sparse Coding

36 Locality-constrained Linear Coding The original BOW model uses global pooling of descriptors, hence a single, global image representation. Spatial Pyramids add some spatial structure to the image representation. Can we improve the way features themselves are coded before pooling into the final image representation? We will look at one approach to better feature coding in our next paper: Locality-constrained linear coding for image classification, J Wang, J Yang, K Yu, F Lv, T Huang, Y Gong. In: Computer Vision and Pattern Recognition (CVPR).

37 Locality-constrained Linear Coding Some observations: BOW + SPM works really well. But: it requires non-linear SVMs to achieve state-of-the-art performance. As datasets grow larger, computing and storing the kernel matrix for solving the dual SVM formulation becomes onerous. We hope for a better feature encoding that allows us to achieve state-of-the-art results with linear SVMs. The key insight is to use the codebook (visual vocabulary) more effectively. This is done by using sparse coding to encode local features, followed by max pooling (as opposed to average pooling) to arrive at the global image description.

38 LLC: Basic Definitions Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$ be the local descriptors. Given a set of codewords $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$, we want to encode each $x_i$ into an $M$-dimensional code. Vector Quantization is used in the BOW, resulting in a set of 1-hot codes $C = [c_1, \ldots, c_N] \in \mathbb{R}^{M \times N}$. Sparse coding can be used instead (reference 22 of the paper). This leads to lower quantization loss by using more elements of the codebook to encode local features.

39 LLC: Paying a Price LLC proposes to use an additional locality constraint (in feature space), where locality is smoothly modeled with an exponential. So: code features, but pay a cost for using codewords far from the descriptor we are encoding.
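
For reference, the three coding objectives discussed here can be written side by side. These are transcribed from memory of Wang et al. and should be checked against the paper; $\odot$ denotes element-wise multiplication, and $\lambda$, $\sigma$ are the regularization weight and locality bandwidth used there.

```latex
% Vector quantization (BOW): each code is a 1-hot vector.
\arg\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2
\quad \text{s.t. } \| c_i \|_{\ell^0} = 1,\; \| c_i \|_{\ell^1} = 1,\; c_i \geq 0,\; \forall i

% Sparse coding: the cardinality constraint is relaxed to an L1 penalty.
\arg\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| c_i \|_{\ell^1}

% LLC: codewords far from x_i are penalized through the locality adaptor d_i.
\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| d_i \odot c_i \|^2
\quad \text{s.t. } \mathbf{1}^\top c_i = 1,\; \forall i,
\qquad d_i = \exp\!\left( \frac{\operatorname{dist}(x_i, B)}{\sigma} \right)
```

Here $\operatorname{dist}(x_i, B)$ is the vector of Euclidean distances from $x_i$ to each of the $M$ codewords, so the penalty grows exponentially with the distance between a descriptor and the codewords it uses.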

40 LLC: What's Going On Here is a comparison:

41 LLC: What's Going On The full pipeline:

42 LLC: Implementation In practice, solving the constrained optimization problem for every descriptor is too costly. Solution: select the k nearest codewords in feature space, and solve a constrained least-squares problem using only those k codewords. Codebook optimization: maybe k-means doesn't yield an optimal codebook for LLC coding. Section 3 gives an iterative algorithm for building a codebook. In practice, everyone uses k-means. Classification: the LLC embedding is rich enough to allow it to perform well with linear SVMs as classifiers.
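
The approximate encoding can be sketched as below. This is only my reading of the procedure (k nearest codewords, then a small sum-to-one least-squares problem), not the authors' code; the regularization constant `reg`, the tiny Gaussian-elimination solver and all names are assumptions made so that the example is self-contained.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def llc_code(x, codebook, k=2, reg=1e-4):
    """Approximate LLC: encode x using only its k nearest codewords."""
    dists = [sum((xi - bi) ** 2 for xi, bi in zip(x, b)) for b in codebook]
    nearest = sorted(range(len(codebook)), key=lambda j: dists[j])[:k]
    # Shift the selected codewords so that x is the origin.
    Z = [[bi - xi for bi, xi in zip(codebook[j], x)] for j in nearest]
    # Local covariance of the shifted codewords (k x k), regularized for stability.
    C = [[sum(a * b for a, b in zip(Z[r], Z[s])) for s in range(k)] for r in range(k)]
    trace = sum(C[r][r] for r in range(k))
    eps = reg * trace if trace > 0 else reg
    for r in range(k):
        C[r][r] += eps
    w = solve(C, [1.0] * k)
    total = sum(w)
    code = [0.0] * len(codebook)
    for j, wj in zip(nearest, w):
        code[j] = wj / total  # enforce the sum-to-one constraint
    return code


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
code = llc_code([0.5, 0.0], codebook, k=2)
print(code)
```

In this toy case the descriptor lies exactly halfway between the first two codewords, so the code splits its weight evenly between them and leaves all other entries at zero.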

43 LLC: Results Extract dense HOG descriptors (8-pixel stride), at three scales. Use k = 5 for approximate LLC encoding. They also use a Spatial Pyramid (but don't explain the configuration). The authors consider two pooling methods to arrive at the final image representation. Sum-pooling: just sum all codes (this is the BOW pooling). Max-pooling: take the maximum coefficient for each codeword. They then report results with max-pooling and L2 normalization (since they use linear SVMs). A linear SVM is used to train one-versus-all classifiers for each category.
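
The two pooling operators and the normalization are simple enough to sketch directly. The toy codes below are invented; in a real pipeline they would be the LLC codes of all descriptors in an image (or in one spatial pyramid cell).

```python
import math


def sum_pool(codes):
    """Sum the codes of all descriptors, codeword by codeword (BOW-style pooling)."""
    return [sum(column) for column in zip(*codes)]


def max_pool(codes):
    """Keep, for each codeword, its largest coefficient over all descriptors."""
    return [max(column) for column in zip(*codes)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


# Toy codes: three descriptors, vocabulary of three codewords.
codes = [[0.5, 0.5, 0.0],
         [0.0, 1.0, 0.0],
         [0.25, 0.0, 0.75]]

summed = sum_pool(codes)
maxed = max_pool(codes)
image_descriptor = l2_normalize(maxed)
print(summed, maxed, image_descriptor)
```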

44 LLC: Results On Caltech-101: And on Caltech-256:

45 LLC: Results A new dataset (which became the benchmark for object recognition). The PASCAL Visual Object Classes (VOC) competitions defined the state-of-the-art for five years. 20 object categories, with high-quality annotations for recognition, segmentation, detection, etc. Introduced the use of average precision to evaluate object recognition.
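
Average precision for one object category can be sketched as follows. Note the assumption: this is the plain, non-interpolated form (mean of the precision at each positive in the ranked list); the VOC 2007 protocol used an 11-point interpolated variant, so numbers from this sketch are not directly comparable to the published tables.

```python
def average_precision(scores, labels):
    """Rank images by classifier score and average the precision at each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example: four test images, two of which contain the category.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]

# Positives are found at ranks 1 and 3, with precisions 1/1 and 2/3.
ap = average_precision(scores, labels)
print(ap)
```

Because it depends only on the ranking, average precision is insensitive to the scale of the SVM scores, which makes it convenient for comparing very different classifiers.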

46 LLC: Results Results on PASCAL 2007:

47 LLC: Reflections The LLC encoding technique takes a different approach to enriching the image representation. It uses sparse codes, but these codes are sparse in that only local codewords can contribute to the encoding of features. Global pooling is done using a max operation, which helps ensure global quasi-sparsity. The resulting codes can be used with linear SVMs, which is a huge win for large datasets. It beats other BOW/SPM approaches, and achieves results comparable to more complex state-of-the-art methods.

48 Advanced BOW: Fisher Vectors

49 FV: Clusters are Not Points The main observation in Fisher Vector coding is that the quantization process is imprecise. More precisely: clusters are distributions of points.

50 FV: Start with a Generative Model Let $X = \{x_t, t = 1 \ldots T\}$ be the descriptors extracted from an image. Assume there is a generation process for $X$ modeled by a probability density function $u_\lambda$ with parameters $\lambda$. $X$ can be characterized by the gradient:
\[ G_\lambda^X = \frac{1}{T} \nabla_\lambda \log u_\lambda(X) \]
Why the gradient? Because it describes the contribution of each parameter to the generation process (also, the gradient of the data log likelihood is the Fisher Score). Plus, the dimensionality depends only on the number of parameters in $\lambda$ and not on the number of patches in the image.

51 FV: Then define a kernel A natural kernel to use for gradients of generative model likelihoods is the Fisher kernel:
\[ K(X, Y) = {G_\lambda^X}^\top F_\lambda^{-1} G_\lambda^Y \]
where $F_\lambda$ is the Fisher Information Matrix:
\[ F_\lambda = E_{x \sim u_\lambda}\left[ \nabla_\lambda \log u_\lambda(x) \, \nabla_\lambda \log u_\lambda(x)^\top \right] \]
Since $F_\lambda$ is symmetric and positive definite, its inverse has a Cholesky decomposition $F_\lambda^{-1} = L_\lambda^\top L_\lambda$. So, we can rewrite $K(X, Y)$ as a dot product between normalized vectors:
\[ \mathcal{G}_\lambda^X = L_\lambda G_\lambda^X \]

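52 FV: The Fisher Vector This normalized gradient $\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X$ is the Fisher Vector of image $X$. Big win: learning a kernel classifier with the non-linear kernel $K$ is equivalent to learning a linear classifier on Fisher Vectors. Implementation: assume the generative model is a mixture of Gaussians, and assume that all $x_t$ are generated independently:
\[ G_\lambda^X = \frac{1}{T} \sum_{t=1}^{T} \nabla_\lambda \log u_\lambda(x_t) \]
Compute gradients with respect to the mean and diagonal covariance of all mixture components. With $\gamma_t(i)$ the soft assignment of descriptor $x_t$ to component $i$, and $w_i$, $\mu_i$, $\sigma_i$ the weight, mean and standard deviation of component $i$, the final descriptor is the concatenation of:
\[ \mathcal{G}_{\mu,i}^X = \frac{1}{T \sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right) \]
\[ \mathcal{G}_{\sigma,i}^X = \frac{1}{T \sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right] \]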
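
A compact sketch of the Fisher Vector computation for a diagonal-covariance Gaussian mixture, following the mean and standard deviation gradients given on this slide. It assumes the mixture parameters were already fitted (for example by EM on training descriptors); the toy parameters and all function names are made up for this illustration, and no care is taken about numerical underflow for descriptors far from every component.

```python
import math


def gaussian_diag(x, mu, sigma):
    """Density of a Gaussian with diagonal covariance (sigma = per-dimension std)."""
    p = 1.0
    for xd, md, sd in zip(x, mu, sigma):
        p *= math.exp(-0.5 * ((xd - md) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return p


def fisher_vector(X, weights, means, sigmas):
    """Concatenate the gradients w.r.t. the means and std deviations of all components."""
    T, K, D = len(X), len(weights), len(means[0])
    g_mu = [[0.0] * D for _ in range(K)]
    g_sigma = [[0.0] * D for _ in range(K)]
    for x in X:
        lik = [w * gaussian_diag(x, m, s) for w, m, s in zip(weights, means, sigmas)]
        total = sum(lik)
        for i in range(K):
            gamma = lik[i] / total  # soft assignment of x to component i
            for d in range(D):
                z = (x[d] - means[i][d]) / sigmas[i][d]
                g_mu[i][d] += gamma * z
                g_sigma[i][d] += gamma * (z * z - 1.0)
    fv = []
    for i in range(K):
        fv += [g / (T * math.sqrt(weights[i])) for g in g_mu[i]]
    for i in range(K):
        fv += [g / (T * math.sqrt(2 * weights[i])) for g in g_sigma[i]]
    return fv


# Toy mixture: K = 2 components in D = 2 dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
descriptors = [[0.2, -0.1], [2.8, 3.3], [1.0, 1.5]]

fv = fisher_vector(descriptors, weights, means, sigmas)
print(len(fv))
```

The descriptor has $2KD$ entries regardless of how many local descriptors the image contains, so a small mixture already yields a long image representation.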

53 FV: What's Going On The Fisher Vector is a weighted average of gradients. These gradients are defined at every point in descriptor space. First compute the FV for each individual descriptor. Then average pool the vectors to compute the FV encoding for the image.

54 FV: Improvements There are really only two improvements to the Fisher Vector proposed in this paper. Improvement 1: L2 normalize the Fisher Vectors before training an SVM (not surprising). Improvement 2: power normalization. Also known as "signed square-root" or the Hellinger kernel, it compensates for bursty features.
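
Both normalizations are one-liners. The sketch below assumes the usual form $f(z) = \operatorname{sign}(z)\,|z|^\alpha$ with $\alpha = 0.5$ for power normalization, applied element-wise before the L2 normalization; the function names and the toy vector are invented for this example.

```python
import math


def power_normalize(v, alpha=0.5):
    """Signed power normalization: sign(z) * |z|^alpha, element-wise."""
    return [math.copysign(abs(z) ** alpha, z) for z in v]


def l2_normalize(v):
    norm = math.sqrt(sum(z * z for z in v))
    return [z / norm for z in v] if norm > 0 else list(v)


# Toy Fisher Vector with one very large ("bursty") component.
fv = [4.0, -9.0, 0.0]

powered = power_normalize(fv)
normalized = l2_normalize(powered)
print(powered, normalized)
```

The square root shrinks large entries much more than small ones, so a few bursty components no longer dominate the dot product between two images.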

55 FV: Baseline Comparison It has become standard practice to do a baseline comparison. In these experiments, you want to evaluate the contribution of your work.

56 FV: Us versus Them The proof is in the pudding. PASCAL 2007: Caltech 256:

57 FV: Reflections The Fisher Vector is an alternative coding method. Instead of quantizing descriptors, you represent each descriptor with a gradient. This gradient represents the relationship between a local descriptor and all clusters in a generative model. This encoding can significantly improve performance over the BOW framework. Bonus: everything is linear, and we can use efficient solvers even for large-scale datasets. The Fisher Vector was the state-of-the-art in Bag-of-Features coding until

58 Detection: Deformable Part Models

59 Deformable Part Models [OTHER PRESENTATIONS]

60 Discussion

61 Discussion Discuss
