BoW model. Textual data: Bag of Words model

Brice Wade
6 years ago

With text, categorization is the task of assigning a document to one or more categories based on its content. It is appropriate for:
- detecting and indexing similar texts/documents in large corpora
- clustering documents by topic
- extracting mid/high-level concepts from documents (is it about medicine/biology? is it about business?)

For the purpose of categorization, a text document can be represented as an unordered collection of words, a Bag of Words (BoW), that is, by a histogram over a word vocabulary. Grammar and word order are not taken into account in this model.

The Bag of Words model, combined with advanced classification techniques, makes it possible to map word distributions to document categories. It represents the state of the art in document classification.
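As a minimal sketch of the histogram representation described above, the following Python snippet counts the occurrences of each vocabulary word in a document. The function name `bag_of_words`, the toy vocabulary and the two sample documents are illustrative assumptions, not part of the original slides; tokenization is reduced to lower-casing and whitespace splitting.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the histogram of `text` over a fixed, ordered vocabulary."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; word order is discarded.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["gene", "protein", "market", "stock"]
doc_a = "protein binds gene and gene regulates protein"
doc_b = "stock market falls as market fears grow"

hist_a = bag_of_words(doc_a, vocabulary)
hist_b = bag_of_words(doc_b, vocabulary)
```

Shuffling the words of a document leaves its histogram unchanged, which is what "unordered collection" means in practice.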

Selecting a good vocabulary is related to the problem of feature selection. It is required to:
- pick only those terms that are really discriminant
- remove stop words (the most frequent words, such as "the", "of", "an")
- apply stemming (reducing inflected or derived words to their stem, base or root form)

Given a natural-language textual corpus, the word frequency distribution follows the well-known Zipf's law. Zipf's law states that, given a corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table (i.e. the most frequent word occurs twice as often as the second most frequent, three times as often as the third, and so on). An ideal Zipf distribution is a straight line on a log-log scale.

[Figure: rank-frequency plot; labelled regions: stop words, most useful words]

Visual words

Following the same reasoning, it has been hypothesized that a similar model can be adopted for images. If an image can be treated as a document, "visual words" can be identified from the local features extracted from the image. The procedure for determining visual words is as follows:
- Extract local features from a number of images, e.g. SIFT descriptors: each keypoint is a point in a 128-dimensional feature space.

[Figure: Image 1 (D. Nister)]
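Returning to Zipf's law as stated above (frequency inversely proportional to rank), the sketch below builds a rank-frequency table from a token list and compares it with the ideal prediction f(r) = f(1) / r. The helper names and the toy token counts are assumptions chosen so that the example follows the law exactly; a real corpus would only follow it approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples sorted by decreasing count."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def ideal_zipf(top_count, rank):
    """Count predicted at `rank` if counts are proportional to 1 / rank."""
    return top_count / rank


# Toy corpus constructed to follow the law exactly (illustrative only).
tokens = ["the"] * 12 + ["of"] * 6 + ["cell"] * 4 + ["gene"] * 3

table = rank_frequency(tokens)
top_count = table[0][2]
predicted = [ideal_zipf(top_count, rank) for rank, _, _ in table]
observed = [count for _, _, count in table]
```

Plotting `observed` against rank on log-log axes would give the straight line mentioned above; the most frequent entries are the stop-word candidates.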

[Figures: Image 2, Image 3, Image 4; the SIFT descriptor space (D. Nister)]

Feature descriptors that have been collected are clustered to perform a quantization of the feature space. Each cluster's center is used as a visual word.

[Figure: descriptor space (D. Nister)]

Local descriptors are assigned to the nearest word using an appropriate distance. The quantized feature space provides a visual vocabulary and suggests a vector representation of images that records the frequency of visual words. This representation can be used in conjunction with vector-based kernels or similarity measures for matching or categorization of image content.
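The assignment of local descriptors to their nearest visual word, and the resulting frequency vector, can be sketched as follows. This is an illustration under stated assumptions: the Euclidean distance is taken as the "appropriate distance", the 2-D descriptors stand in for real 128-dimensional SIFT vectors, and the function names and toy codebook are hypothetical.

```python
import math


def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest to `descriptor` (Euclidean)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))


def visual_word_histogram(descriptors, codebook):
    """Count how many descriptors fall on each visual word."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1
    return hist


# Toy codebook of three visual words and four descriptors (illustrative).
codebook = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 1.0), (9.0, 1.0), (0.5, 0.2), (1.0, 8.0)]

histogram = visual_word_histogram(descriptors, codebook)
```

The resulting vector has one entry per visual word and can be fed to any vector-based similarity measure or classifier.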

Visual words example

[Figure: each group of patches belongs to the same visual word (Sivic & Zisserman)]

Visual codebook formation

Codebook formation and feature assignment in images are substantially different from what happens in text, because visual words have to be defined in advance using a clustering algorithm. The essential choices that influence performance are:
- choice of local features
- sampling strategy
- quantization method
- number of visual words

Choice of local features

Local features are represented by local descriptors. The common framework is:
- divide the local region into spatial cells
- calculate the orientation of the image gradient at each pixel
- pool the quantized orientations over each cell: the descriptor contains an orientation histogram for each cell, with votes weighted by gradient magnitude

This is the basis of the popular SIFT, SURF and HOG methods.

[Figure: standard SIFT descriptor]

Sampling strategies

Several strategies of local feature detection are possible:
- Random sampling: local features are calculated at random points in the image.
- Sparse sampling:
  - with no spatial relationships between features: local patches are detected by interest point detectors that are able to select salient regions (such as edges, corners, blobs);
  - with spatial relations between features: the image is repeatedly subdivided into a Spatial Pyramid and histograms of image features are computed over the resulting subregions.
- Dense sampling: the image is segmented into subregions by horizontal and vertical lines according to a regular grid. Features are extracted in each subregion.
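The descriptor framework listed under "Choice of local features" (per-cell orientation histogram with votes weighted by gradient magnitude) can be illustrated for a single cell. This is a simplified sketch, not the SIFT, SURF or HOG implementation: central differences, 8 unsigned-range bins over the full circle, and hard binning without interpolation are all assumptions made for brevity.

```python
import math


def orientation_histogram(cell, n_bins=8):
    """Gradient-orientation histogram of a 2-D list of grey values.

    Gradients use central differences on interior pixels; each pixel votes
    for its orientation bin with a weight equal to the gradient magnitude.
    """
    hist = [0.0] * n_bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = (cell[y][x + 1] - cell[y][x - 1]) / 2.0
            gy = (cell[y + 1][x] - cell[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)  # in [0, 2*pi)
            bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
            hist[bin_index] += magnitude
    return hist


# Horizontal ramp: intensity grows left to right, so every interior
# gradient points along +x and all votes land in the first bin.
cell = [[float(x) for x in range(4)] for _ in range(4)]
hist = orientation_histogram(cell)
```

A full descriptor would concatenate such histograms over all spatial cells of the local region.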

Sparse sampling: local patches are detected at the most salient regions. It uses more information about the image itself than random or grid sampling.

[Figure: local patches detected using affine covariant features]

- Advantage: it detects salient regions, which correspond to the more attractive and informative parts of the image. It has been used for specific object recognition and categorization (better for describing background/foreground).
- Disadvantage: depending on the interest points and the type/resolution of the image, sometimes only a few regions are detected.

Dense sampling: the image is partitioned into equal-sized rectangular regions, the visual word feature is computed from each region, and the features of these regions are concatenated into a single feature vector. Spatial information substantially improves classification performance. Typically grids are evenly spaced at 1x1, 2x2, 3x3, 4x4.

- Advantage: it is able to describe the global content of an image. Despite its simplicity, it provides good results for textures and natural scenes because it describes more regions than interest point techniques.
- Disadvantage: it uses little information about the image itself.

The choice of the sampling strategy depends on the type of images and the application goals:
- For image classification: largely random sampling vs. sparse SIFT [Nowak 06] (sometimes DoG cannot sample densely enough to produce leading-edge classification results).
- For indoor scene classification: dense SIFT > sparse SIFT [Jurie 05]. Sampling only where the feature detector fires can be a poor representation.
- For natural scene categorization: dense, random > sparse SIFT [Li 05]. Color is important for classifying natural scenes: HSV SIFT is better than gray SIFT [Bosch 08]. Sampling only where the feature detector fires can be a poor representation.
- For object detection and concept detection: dense patches (10K) vs. sparse feature points [Jiang 07].

Quantization of the feature space

The most common quantization approach is k-means clustering, mainly because of its simplicity and convergence speed. K-means is an algorithm that clusters n objects, based on their feature vector representation, into k < n partitions. Its objective is to minimize the global intra-cluster variance, i.e. the squared error function

$$\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where k is the number of clusters, $S_i$ (i = 1, ..., k) are the cluster partitions, and $\mu_i$ is the centroid (or mean point) of all the points $x_j \in S_i$.

The most common form of k-means is Lloyd's algorithm. It is a heuristic for solving the k-means problem that is popular because it usually converges in few iterations in practice.
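To make the k-means objective concrete (the sum, over all clusters, of the squared distances between each point and its cluster centroid), the sketch below evaluates it for two different partitions of the same three points. The function name `within_cluster_sse` and the toy points are assumptions for illustration.

```python
def within_cluster_sse(clusters):
    """Sum over clusters of squared distances of members to their centroid."""
    total = 0.0
    for members in clusters:
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        mu = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - mu[d]) ** 2 for d in range(dim))
                     for p in members)
    return total


# Same three points, grouped in two different ways (illustrative data).
good_partition = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
poor_partition = [[(0.0, 0.0)], [(2.0, 0.0), (5.0, 5.0)]]

good_value = within_cluster_sse(good_partition)
poor_value = within_cluster_sse(poor_partition)
```

The partition that groups the two nearby points together has the lower objective value, which is the criterion k-means tries to minimize.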

k-means clustering: Lloyd's algorithm

Lloyd's algorithm is an iterative solution for the k-means problem:
1. partition the n input points into k initial sets, either at random or using some heuristic
2. calculate the centroid $\mu_i$ of each set $S_i$ (with i = 1, ..., k)
3. construct a new partition by associating each point with the closest centroid
4. recalculate the centroids for the new clusters, and repeat by alternating steps 3 and 4 until convergence, which occurs when the points no longer switch cluster or the centroids no longer change

[Figure: (1, 2) initial random centroids; (3) new partition by associating points to the nearest centroid; (4) centroids are moved to the center of their clusters; convergence]

K-means clustering is not the optimal solution for codebook formation. Its main disadvantages are:
- the number of visual words has to be known in advance
- the clustering is not very robust with respect to outliers
- cluster centers are attracted by the densest regions of the sample distribution, thus providing a more imprecise quantization for the vectors lying in these regions. This is due to the assumption of a uniform distribution of the features in the descriptor space.

[Figure: k-means (Voronoi tessellation). Detail of a dense region that has been split into 4 clusters; the Voronoi cells do not uniformly cover the feature space.]

Voronoi diagrams are a special decomposition of a metric space determined by distances to a set of objects in the space. Given a set of points that are Voronoi sites (e.g. the centers of clusters), each site c is associated with a Voronoi cell that contains all points closer to c than to any other site. The segments of the Voronoi diagram are all the points that are equidistant from the two nearest sites.
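The numbered steps of Lloyd's algorithm above can be sketched in plain Python. One assumption differs slightly from step 1: instead of drawing a random initial partition, this sketch picks k input points at random as initial centroids (matching the "initial random centroids" of the figure caption). The function name, the `seed` parameter and the toy data are illustrative, not from the slides.

```python
import math
import random


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (tuples) into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # steps 1-2: initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 3: associate each point with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point switched cluster: converged
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, labels


# Two well-separated groups of three points each (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = lloyd_kmeans(points, k=2)
```

Note that k must be supplied by the caller, which is the first disadvantage listed above.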

Radius-based clustering

Radius-based clustering is an effective alternative to k-means clustering.
- Given n vectors, the algorithm starts with a uniform random subsampling s of the original dataset (green dots in the figure).
- Given a radius R, for each $x_i$ in s (grey circles in the figure), a mean shift procedure is initialized to locate the modes of the distribution of samples (i.e. to find the densest regions of the distribution).
- A new cluster center is then allocated at the mode corresponding to the maximal density region.
- All vectors of the original set within a distance < R from the center are labeled as members of this cluster and eliminated from the following iterations (this prevents the algorithm from repeatedly assigning centers to the same high-density region).
- The algorithm can be stopped when a sufficient number of clusters (words) has been identified.

Mean shift algorithm

Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function. It is an iterative method: it starts with an initial estimate x and a Gaussian kernel

$$K(x_i - x) = e^{-\lVert x_i - x \rVert^2}$$

that weights the nearby points based on their distance to the current estimate. At each iteration it uses $K(x_i - x)$ to re-estimate the weighted mean of the samples in the window determined by K:

$$m(x) = \frac{\sum_{x_i} K(x_i - x)\, x_i}{\sum_{x_i} K(x_i - x)}$$

The mean shift algorithm sets x = m(x) and repeats the estimation until m(x) converges. The vector m(x) - x always points in the direction of maximum increase of density.
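The mean shift iteration described above (weight every sample with a Gaussian kernel centred on the current estimate, move the estimate to the weighted mean, repeat) can be sketched as follows. The scale factor `c` in the exponent, the tolerance, the iteration cap and the toy samples are assumptions added for the illustration; with `c = 1.0` the kernel is exp(-||x_i - x||^2).

```python
import math


def mean_shift_mode(x, samples, c=1.0, tol=1e-6, max_iter=500):
    """Iterate x <- m(x) until the shift m(x) - x is shorter than `tol`."""
    x = tuple(x)
    for _ in range(max_iter):
        # Gaussian kernel weights K(x_i - x) = exp(-c * ||x_i - x||^2).
        weights = [math.exp(-c * math.dist(s, x) ** 2) for s in samples]
        total = sum(weights)
        # Weighted mean of the samples: the re-estimated position m(x).
        m = tuple(
            sum(w * s[d] for w, s in zip(weights, samples)) / total
            for d in range(len(x))
        )
        if math.dist(m, x) < tol:
            return m
        x = m
    return x


# A dense group around the origin plus one isolated sample (illustrative).
samples = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
           (-0.2, 0.0), (0.0, -0.2), (5.0, 5.0)]
mode = mean_shift_mode((1.0, 1.0), samples)
```

Started near the dense group, the estimate converges to its centre; the isolated sample receives a negligible weight and does not attract it.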

[Figure sequence: radius-based clustering, mode of operation. Each panel shows the region of interest, its center of mass and the mean shift vector. Final goal: find the densest region.]

[Figure: region of interest and center of mass at final convergence to the densest region]

In this way, cluster centers are allocated more uniformly. This effect can be visualized through a Voronoi tessellation of the feature space, compared with that of k-means.

[Figure: k-means clustering (Voronoi tessellation) vs. radius-based clustering (Voronoi tessellation). The dense region circled in red is split into 4 clusters by k-means and is correctly coded by radius-based clustering.]

15-11-2011

Bag of Visual Words model

If visual words are extracted, an image can be represented as an unordered collection of visual words, i.e. a Bag of Visual Words. The set of real-valued feature vectors coming from one image is transformed into a single flat histogram that counts the frequency of occurrence of a number of predefined (quantized) feature prototypes.

[Figure: image of an object category (face, bike, violin); bag of visual words; codeword histograms over the vocabulary (codewords)]

Feature assignment

Given the codebook generated in the training stage, each region extracted from the test image has to be assigned to the corresponding visual word. Usually region descriptors are hard-assigned to the nearest word (in terms of Euclidean distance).

[Figure: feature detection, feature representation, feature (hard) assignment; BoW model: histogram of visual words (courtesy A. Zisserman)]

The drawback of hard assignment is that it takes into account only the closest codeword, and does not consider:
- codeword uncertainty, i.e. the problem of selecting the correct codeword when two or more candidates are relevant
- codeword plausibility, i.e. the problem of selecting the correct codeword when all codewords are too far away and not representative

[Figure: feature assignment. The small blue dots are image features; the labeled red circles are codewords; the yellow triangle represents an image feature that is correctly assigned to codeword b; the green square is an example of codeword uncertainty; the light blue diamond is an example of codeword plausibility.]

Soft assignment is able to consider the information of two (or more) relevant candidates. In this way, the word frequency histogram is calculated by smoothing the hard assignment of features to the codeword vocabulary.

[Figure: hard assignment vs. soft assignment (Gaussian kernel)]
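A minimal sketch of soft assignment with a Gaussian kernel follows. It is one possible variant, not necessarily the one used in the slides: each feature distributes a total mass of 1 over all codewords in proportion to its kernel weights, so it mainly addresses codeword uncertainty. The bandwidth `sigma`, the per-feature normalization and the toy data are assumptions.

```python
import math


def soft_histogram(descriptors, codebook, sigma=1.0):
    """Histogram in which each feature votes for every codeword,
    weighted by a Gaussian kernel on its distance to that codeword."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        weights = [
            math.exp(-math.dist(d, w) ** 2 / (2 * sigma ** 2))
            for w in codebook
        ]
        total = sum(weights)
        if total == 0.0:
            continue  # feature too far from every codeword: no vote
        for k, weight in enumerate(weights):
            hist[k] += weight / total  # each feature adds a mass of 1
    return hist


# Two codewords; the first feature is equidistant from both (uncertain),
# the second sits exactly on the first codeword (illustrative data).
codebook = [(0.0, 0.0), (2.0, 0.0)]
uncertain = soft_histogram([(1.0, 0.0)], codebook)
both = soft_histogram([(1.0, 0.0), (0.0, 0.0)], codebook)
```

The equidistant feature contributes half a vote to each codeword, whereas hard assignment would have credited one of them arbitrarily.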

Bag of Visual Words advantages

The main advantages of the Bag of Visual Words approach are:
- invariance to scale and orientation (local patches can be detected with the Harris corner detector, SIFT detector, SURF detector, affine covariant patches, MSER, ...)
- offline computation, which enables (near) real-time applications (e.g. retrieval)
- the possibility of adopting existing algorithms from the text domain, e.g. indexing, classification, mining

Bag of Visual Words weaknesses

The Bag of Visual Words representation faces several challenges in generating an appropriately descriptive representation:
- Bin boundary issues create matching problems for flat histograms (Rubner et al., 2000).
- It is not clear whether a universal feature quantization is more or less effective than dataset-dependent vocabularies.
- Generating the vocabulary from large amounts of data is generally computationally costly.

The main open issues with the Bag of Visual Words paradigm are:
- size of the vocabulary
- efficiency of generating visual words
- feature selection and reduction
- accounting for spatial information

Size of vocabulary

The number of visual words needed depends on the type of images considered. Typically:
- … in [Lazebnik 06] for natural scene categorization
- 1000 in [Zhang 05] for texture classification and object categorization
- 6,000 - 10,000 in [Sivic 03] for object retrieval (matching)
- …,000 in [Jiang 07] for retrieval and classification

General observations:
- More is better, but performance saturates at a certain size in object retrieval [Philbin 07].
- Performance saturates or even degrades as the number of visual words increases for image categorization [Yang 07].

Efficiency of generating visual words

Efficiency is obtained by making k-means clustering more efficient. This can be achieved by:
- Automatic parallelization and distribution, e.g. k-means over Hadoop; Apache Lucene Mahout, which aims to build scalable Apache-licensed machine learning libraries based on an open-source Map/Reduce framework [Dean 04].
- Hierarchical k-means clustering [Nistér 06], which provides efficient codebook generation for large visual word vocabularies (~1M). At the first level of the tree, all data points are clustered into a small number (K = 10) of cluster centers; at the next level, k-means is applied within each of the partitions independently.

Feature selection and reduction

Feature selection is mandatory for the efficiency and effectiveness of classification.
- Among the supervised methods, the Mutual Information method performs best for making classification effective. It measures both the dependence between visual words and the dependence between visual words and the class label [Yang 07].
- Among the semi-supervised methods, TF-IDF-like methods augmented with ranking scores can be used. They favor those words that have high TF (term frequency) and low document frequency (high inverse document frequency) and that stay high in the ranking [Yang 08].

Accounting for spatial information: Spatial Pyramid

A spatial pyramid is a collection of orderless feature histograms computed over cells defined by a multi-level recursive image decomposition:
- At level 0, the decomposition consists of just a single cell, and the representation is equivalent to a standard bag of features.
- At level 1, the image is subdivided into four quadrants, yielding four feature histograms.

[Figure: level 0 (Lazebnik, Schmid & Ponce)]

The pyramid is built by repeatedly subdividing the image and computing histograms of image features over the resulting subregions.

[Figure: level 0, level 1 (Lazebnik, Schmid & Ponce)]

Notable performance in image categorization is reported for L = 3 and 400 visual words [Lazebnik 06]. Normalization is necessary to account for images with different numbers of local features.

[Figure: level 0, level 1, level 2 (Lazebnik, Schmid & Ponce)]
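The spatial pyramid construction can be sketched as follows: at each level the image is split into 2^level x 2^level cells, a visual word histogram is computed per cell, and all histograms are concatenated. This is a simplified illustration under stated assumptions: features are given as hypothetical (x, y, word index) triples, normalization is a plain division by the number of features, and the per-level weights used in [Lazebnik 06] are omitted.

```python
def spatial_pyramid(features, width, height, n_words, levels=2):
    """Concatenate per-cell visual word histograms for levels 0..levels.

    `features` is a list of (x, y, word_index) triples. The result is
    divided by the number of features so that images with different
    numbers of local features remain comparable.
    """
    pyramid = []
    for level in range(levels + 1):
        cells = 2 ** level  # cells per side at this level
        hists = [[0.0] * n_words for _ in range(cells * cells)]
        for x, y, word in features:
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hists[cy * cells + cx][word] += 1.0
        for h in hists:
            pyramid.extend(h)
    total = len(features)
    return [v / total for v in pyramid] if total else pyramid


# Four features in a 10x10 image, two visual words (illustrative data):
# word 0 occurs on the left half, word 1 on the right half.
features = [(1, 1, 0), (9, 1, 1), (1, 9, 0), (9, 9, 1)]
vector = spatial_pyramid(features, width=10, height=10, n_words=2, levels=1)
```

The first `n_words` entries are the level-0 histogram, i.e. the standard bag of features; the remaining entries record where in the image each word occurs.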
