LSA-like models

Bill Freeman

June 21, 2004


1 Introduction

LSA is simple and stupid, but works pretty well for analyzing the meanings of words and text, given a large, unlabelled training set. Why? LSA factorizes a histogram observation matrix of frequency-of-occurrence counts, with words along columns and documents in rows, then reduces the dimensionality of the coordinates describing each word. The coordinates of words with similar meanings tend to be similar. Let's build up a principled model, inspired by the factorization approach of LSA, but better motivated. The result will be an iterative scheme to learn object parts and objects by observing feature responses over a training set of many images.

2 Histogram Matrix Factorization

We examine a simple model, then embellish it. For this section, we'll call a document the contents of a subsection of an image (say a circle of diameter D). We assume that we have first created a finite vocabulary of image feature indices. These could be, for example, a set of 1000 or so vector-quantized SIFT feature responses such as Bryan has used in his work. An observation consists of counting how many of each feature occur in a given document. (Again, we use a bag-of-words representation, although later we'll work our way away from that.) Thus, the measurement from a document is a column vector giving the number of times each feature was found in the document (inside that circle of the image). We create an observation matrix from a corpus by stacking side by side the column vectors of observations from all the documents. The resulting observation matrix has number-of-features rows by number-of-documents columns.

We want to use that training corpus to infer meaning from observed variables. We will rewrite the corpus matrix in terms of hidden variables we'll call objects, which we will learn. The nice thing about the bag-of-words model for documents is that things combine additively. We can then write the observations, a histogram matrix Y, as a product of two other histogram matrices: a matrix, F, whose columns tell how many of each feature are present in each object, times a matrix, G, whose columns tell what objects are contained in each training image (or document, or circular region within an image). Figure 1 shows the dimensions of Y = F G.

This factorization is reminiscent of SVD, except our product matrices are not the real, orthonormal matrices of an SVD. Nor is it the non-negative matrix factorization of Seung et al., where the multiplicand matrices are constrained to be positive. Here we have a related, but different, set of constraints on the multiplicand matrices: they must be histograms, that is, matrices composed of non-negative integers. We can call this factorization process histogram matrix factorization, or HMF. The hope is that this rather severe set of constraints provides enough structure to solve the equation Y = F G in a way that reveals meaningful objects in the real world, given a large corpus of training images.
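To make the bookkeeping above concrete, here is a minimal Matlab sketch (not from the original note) of forming the observation matrix Y from per-document lists of quantized feature indices. The vocabulary size and the variable names (nfeat, docFeatures) are illustrative assumptions, not anything prescribed by the model.

% Sketch: build Y (features along rows, documents along columns), assuming
% each document has already been reduced to a list of vector-quantized
% feature indices in 1..nfeat. Names and sizes are illustrative.
nfeat = 1000;                          % size of the quantized feature vocabulary
docFeatures = {[3 17 17 240], ...      % feature indices observed in document 1
               [17 512 999]};          % feature indices observed in document 2
ndoc = numel(docFeatures);
Y = zeros(nfeat, ndoc);
for d = 1:ndoc
    % histogram the feature indices of document d into column d of Y
    Y(:, d) = accumarray(docFeatures{d}(:), 1, [nfeat 1]);
end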

Figure 1: Histogram matrix factorization. A histogram matrix, composed of non-negative integers, is decomposed into the product of two other histogram matrices.

3 Learning and inference with histogram matrices

We want to look at the vector-quantized SIFT features of many, many images, and from the feature counts per image (or counts per image circular region), infer useful things about the objects comprising the images. In particular, given the observation of a new image, by examining the features present, we want to group the features into sets of previously learned objects.

In practice, because of occlusions and noise, the product relationship Y = F G won't hold exactly, so we want to take a probabilistic, rather than numerical, approach. There is a penalty for each undetected feature, and a second penalty incurred for each additional feature not predicted by F G. A natural way to impose these penalties is with a factor graph, allowing histogram matrix factorization by running loopy belief propagation, reminiscent of decoding turbo codes or low-density parity check codes because it's a network with large loops and low state dimension at each node. Figure 2 shows the factor graph relating an object to the observed features.

Figure 2: Bayesian graphical model relating an object variable to observed feature values.

Mathematically, here is a plausible probability that an observed set of feature vector counts, y, was created by a given set of object counts, the column vector g:

P(g \mid y) = \prod_{\text{features } j} \psi\Big( y_j - \sum_{\text{objects } l} F_{jl}\, g_l\, z_{jl} \Big) \prod_{j,l} P_o(z_{jl}),   (1)

where z_{jl} is zero if feature j from object l is occluded, and 1 otherwise. P_o(0) is the prior probability that any given feature will be occluded. (Later, when we introduce a hierarchy of objects, by finding objects within smaller-sized circles, we can incur an occlusion cost for a single small object or set of features, which would save over incurring the occlusion cost for each one of the features.) g_l = 1 if object l is present in the image. F_{jl} = 1 if object l has feature j. \psi(\cdot) is a function that tells how probable a given deviation from the observed histogram count is.

3.1 Performing the histogram matrix factorization

The learning problem: we have a histogram matrix Y and we want to break it up into the product of two other histogram matrices. Note that the product of the two found histogram matrices need not give the original matrix exactly, because we allow for occlusion and other forms of observation noise.

The inference problem: given one of the multiplicand matrices (the features for each object) and a single column of the observation matrix (the features observed for a given image), find the vector of histogram counts for objects which best explains the observed features (taking occlusion and observation noise into account). This problem seems very well matched to Bayesian belief propagation.
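Since the note leaves \psi unspecified, here is a minimal sketch of the inference problem under illustrative assumptions: \psi is taken to be a Gaussian penalty on the count deviation, occlusion is ignored (all z_{jl} = 1), and g is binary with few enough objects to enumerate exhaustively instead of running belief propagation. Every name here is hypothetical.

% Sketch of the inference problem, not the note's BP algorithm. Scores
% candidate object vectors g by a log version of Eq. (1), assuming an
% illustrative Gaussian psi and no occlusion. Save as inferObjects.m.
function gBest = inferObjects(F, y, sigma)
    dim = size(F, 2);                         % number of possible objects
    gBest = zeros(dim, 1);
    bestScore = -inf;
    for code = 0:2^dim - 1                    % enumerate all binary g
        g = double(bitget(code, 1:dim))';     % candidate object vector
        r = y - F * g;                        % deviation from predicted counts
        score = -sum(r.^2) / (2 * sigma^2);   % log psi, summed over features
        if score > bestScore
            bestScore = score;
            gBest = g;
        end
    end
end

With F known, calling inferObjects(F, Y(:,d), sigma) on each document column d solves the inference problem by brute force; the note's proposal is to replace this exhaustive search with loopy belief propagation on the factor graph of Figure 2.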

A number of possible approaches come to mind for the learning problem (histogram matrix factorization):

1. Perform an initial factorization using SVD. Enforce the constraints (positive, integer values) on one of the two matrices, and solve for the other matrix. Update that matrix (average between new and old values), then switch the roles of the two matrices, and repeat. This algorithm works at least on toy examples; see Section 6.

2. Put it all in a big Bayesian network, and run loopy belief propagation to solve for the two matrices, F and G. That's probably expecting too much from BP.

3. Do some on-line algorithm, starting from a small number of possible objects and modifying the objects-vs-features matrix, building it up as you see more and more columns of the observation matrix.

4 Reducing the number of features or feature-groups

Clearly, similar real-world objects or parts of objects will not create exactly the same features in the image. So we need some way of learning that certain features (or, in the hierarchical version, groups of features) are synonyms of each other. Following the perhaps questionable logic of LSA, this can be done by taking the SVD [U, S, V] = svd(Y^T), and regularizing by reducing the dimensionality of the matrix V^T. The columns of V^T provide vector coordinates for each feature. Features with similar coordinates could be grouped together.

There should also be other, perhaps more principled, ways of examining the matrix Y and learning which features should be merged into the same feature. This might include looking at the joint co-occurrence matrix of all the features. You could look for the pairs of features for which combining them would cause the least loss of information in the probability distribution described by that histogram.

5 A hierarchical framework for unsupervised object learning

It might make sense to gradually build up objects from features. First, SIFT features playing similar roles in images could be grouped. Then we could consider documents consisting of small regions of images (see Fig. 3). Objects (recurring patterns of features) could be found within those documents and added to the feature set. Then we'd examine documents consisting of larger regions of images, and work our way up.

The hierarchical approach has several benefits:
- It introduces some spatial localization and structure into the otherwise flat, spaceless bag-of-words model.
- It improves the treatment of occlusion: an entire object part can be occluded for a penalty less than would be incurred by occluding each feature independently. We pay the penalty once for occluding a meta-feature, and then get the occlusions of all its component features for free.

Unsupervised Object Learning

1. Form the observation matrix, Y, a histogram of the number of occurrences of each feature (rows) in each document (columns). A document can be an image, or a smaller region of an image.

2. Group similar features together, based on their co-occurrences in Y, to form a new observation matrix, Y', which histograms the new features. (Using LSA, as above, or else using something better; a sketch of one way to do this follows the list.)

3. Perform histogram matrix factorization on Y' to identify repeating objects or object-components.

4. Treat the identified object-components as features, and add them to the feature set. Form a new observation matrix histogram, Y''.

5. Group the related object-components together using LSA (or something better).

6. Perform histogram matrix factorization on Y'' to identify repeating objects or object-components.

7. Etc.
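Here is a minimal sketch of the LSA-style feature grouping of Section 4 (step 2 above). The reduced dimensionality k, the number of merged features nGroups, and the use of kmeans (from the Statistics Toolbox) to group features with similar coordinates are illustrative assumptions of this sketch, not prescriptions from the note.

% Sketch: SVD-based feature grouping. Y is nfeat x ndoc, so the rows of V
% (columns of V^T) give coordinates per feature; keep k of them and merge
% features whose coordinates fall in the same cluster.
k = 10;                                 % reduced dimensionality (assumed)
nGroups = 200;                          % number of merged features (assumed)
[U, S, V] = svd(Y', 'econ');            % Y' is ndoc x nfeat
coords = V(:, 1:k) * S(1:k, 1:k);       % one k-dim coordinate per feature
groups = kmeans(coords, nGroups);       % features in the same group are merged
% Rebuild the observation matrix over the merged features.
Ynew = zeros(nGroups, size(Y, 2));
for gidx = 1:nGroups
    Ynew(gidx, :) = sum(Y(groups == gidx, :), 1);
end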

Figure 3: Showing 3 different areas of circular regions over which to compute feature histograms, leading to a hierarchical representation of objects.

6 Some numerical experiments with histogram matrix factorization

Figure 4 shows the result of a histogram matrix factorization method applied to a small example observation matrix. For this example, the factorized matrices, ff and gg, give a product which is actually closer to the true matrix than to the observation matrix they were factored from. The sum of absolute values of errors was 26 for observations minus true, but 23 for fitted model minus true:

>> sum(sum(abs(y-ytrue)))
ans = 26
>> sum(sum(abs(y-ff*gg)))
ans = 45
>> sum(sum(abs(ytrue-ff*gg)))
ans = 23

The Matlab code for this toy example is given below. It is very doubtful that this algorithm will scale up by a factor of 100 or 1000.

The algorithm: perform an initial factorization using SVD. Enforce the constraints (positive, integer values) on one of the two matrices, and solve for the other matrix. Update that matrix (average between new and old values), then switch the roles of the two matrices, and repeat.

Figure 4: Top left: maximum absolute error for any observation-matrix histogram cell in the iterative factorization of the observation matrix. Top right: image showing histogram counts for the observation matrix before adding one-count errors to one percent of the counts (ytrue). Bottom left: the resulting observation matrix (y). Bottom right: the product of the factorized histogram matrices (ff*gg), which should explain the observation matrix.

% play with histogram matrix factorization.
% June 21, 2004 Billf.

% The number of possible objects in the toy world.
dim = 10;

% make synthetic f and g matrices.
% f tells the features (rows) for each object (column)
% g tells the objects (rows) for each document (column)
ndoc = 50;
nfeat = 80;
% Typically, these matrices will be mostly zeros with some ones.
g = rand(dim, ndoc);  g = real(g > 0.9);
f = rand(nfeat, dim); f = real(f > 0.9);
ytrue = f * g;

% optionally add some noise to the observation counts: perturb about one
% percent of the counts by one, with a random sign.
y = abs((-1)^(rand(1) > 0.5) * real(rand(size(ytrue)) > 0.99) + ytrue);

[u, s, v] = svd(y);
alpha = 0.5;            % damping: average between new and old estimates

% initialize estimates for f and g (called ff and gg).
gg = s * v';
ff = u(:, 1:dim);

niter = 300;
err = [];
for i = 1:niter
    % hold ff fixed; solve for gg by least squares, then enforce the
    % non-negative integer constraint; damp with the old estimate.
    gg = alpha * round(abs(ff \ y)) + (1 - alpha) * gg(1:dim, :);
    % switch roles: hold gg fixed and solve for ff the same way.
    ff = alpha * round(abs(y * pinv(gg))) + (1 - alpha) * ff;
    err = [err, max(max(abs(y - ff*gg)))];
end

figure;
subplot(2,2,1); plot(err); title(['max err for alpha = ' num2str(alpha)]);
subplot(2,2,2); showim(ytrue); title('ytrue');  % showim: local image-display helper
subplot(2,2,3); showim(y); title('y');
subplot(2,2,4); showim(ff*gg); title('ff*gg');
