Nonnegative matrix factorization for segmentation analysis


Nonnegative matrix factorization for segmentation analysis

Research thesis

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Roman Sandler

Submitted to the Senate of the Technion - Israel Institute of Technology

Tamuz 5770        Haifa        June 2010

The research thesis was done under the supervision of Prof. Michael Lindenbaum in the Faculty of Computer Science. The generous financial help of the Technion is gratefully acknowledged.

Contents

1 Introduction
   Segmentation
      Feature space clustering
      Graph partitioning methods
      Numerical geometry methods
      Hierarchical segmentation
      Model-assisted segmentation
   Segmentation evaluation
      Supervised evaluation
      Unsupervised evaluation
      Consensus of several segmentation hypotheses
      Theoretical performance prediction
   Nonnegative matrix factorization
   Earth mover's distance
      Fast EMD approximations

2 Segmentation Evaluation
   Framework
   Finding true segmentation distributions using nonnegative matrix factorization
      Nonnegative matrix factorization
      NMF algorithms
      Factorizing the histogram matrix
   Estimating model complexity using several modalities
   Dealing with boundary inaccuracies

3 Nonnegative Matrix Factorization with Earth Mover's Distance metric
   Observations and intuitions
   EMD NMF
      Earth mover's distance
      Single domain LP-based EMD algorithm
      Convergence
      Bilateral EMD NMF
   Efficient EMD NMF algorithms
      A gradient based approach
      A gradient optimization with WEMD approximation
      The optimization process

4 Applications
   A tool for unsupervised online algorithm tuning
   Face recognition
      The EMD NMF components
      Face recognition algorithm
   Texture modeling
   NMF and image segmentation
      A naive NMF based segmentation algorithm
      Spatial smoothing
      Multiscale factorization
      Boundary aware factorization
      Bilateral EMD NMF segmentation algorithm

5 Experiments
   Evaluation experiments
      The accuracy of unsupervised estimates
      Application: image-specific algorithm optimization
   Face recognition experiment
   Texture descriptor estimation
   Segmentation

6 Discussion

List of Figures

2.1 Distribution curves
Object distribution examples
Space-feature domain
Illustrating the intuition
Precision/recall performance with fixed and manually chosen parameter sets
Precision/recall performance of the N-cut on 100 Berkeley images for some fixed k-s
Facial space for 4 people. The two-dimensional (w1, w2) convex subspace is projected onto the triangle with corners at (1, 0), (0, 1), and (0, 0). The corners of the triangle represent the basis facial archetypes obtained by EMD NMF. The inner points show the actual facial images weighted in this basis.
Examples of texture mosaics
Multiscale W estimates
Precision/recall estimation
Inconsistency distributions for different measurement methods
Precision/recall performance for fixed sets and automatic parameter choice
The performance for the outlier images
Examples of segmentation errors
The Yale faces database. The database contains images of 15 people, and we considered 8 images for each person. The first two rows show examples of the database images. The last row shows the basis images obtained with EMD NMF.
Typical recognition error in the ORL database. When the test face image (a) is in a very different pose from that of the same person in the training set, the most similar person in the same pose (b) may be erroneously identified. The second-most similar identifications (c, d) are correct.
Texture descriptor estimation accuracy
Segmentation examples, Weizmann database
Segmentation examples, Berkeley database

List of Tables

4.1 Distribution of the images in the Berkeley test set according to the better performing algorithm
The average F and F values for the segmentation algorithms
Classification accuracies of different algorithms on the ORL database and the corresponding basis sizes, cited from [74]
Classification accuracy of EMD NMF on the ORL database for different basis sizes

Abstract

This research is concerned with image segmentation, one of the central problems of image analysis. A new model of the segmented image is proposed and used to develop tools for the analysis of image segmentations: image-specific evaluation of segmentation algorithm performance, extraction of image segment descriptors, and extraction of image segments. Prevalent segmentation models are typically based on the assumption of smoothness of the chosen image representation within the segments and of contrast between them. The proposed model, in contrast, describes segmentations using image-adaptive properties, which makes it relatively robust to context factors such as image quality or the presence of texture. The image representation in the proposed terms may be obtained in a fully unsupervised process and does not require learning from other images.

The proposed model characterizes the histograms, or some other additive feature vectors, calculated over the image segments as nonnegative combinations of basic histograms. It is shown that correct (manually drawn) segmentations generally have similar descriptions in this representation. A specific algorithm to obtain such histograms and combination coefficients is proposed; it is based on nonnegative matrix factorization (NMF). NMF approximates a given data matrix as a product of two low-rank nonnegative matrices, usually by minimizing the L2 or the KL distance between the data matrix and the matrix product. This factorization was shown to be useful for several important computer vision applications. New NMF algorithms are proposed here to minimize two kinds of Earth Mover's Distance (EMD) error between the data and the matrix product. We propose an iterative NMF algorithm (EMD NMF) and prove its convergence. The algorithm is based on linear programming. We discuss the numerical difficulties of EMD NMF and propose an efficient approximation.

The advantages of the proposed combination of a linear image model with a sophisticated decomposition method are demonstrated with several applications. First, we use the boundary mixing weights (the boundary is widened and is also considered a segment) to assess image segmentation quality in precision and recall terms without ground truth. We demonstrate a surprisingly high accuracy of the unsupervised estimates obtained with our method in comparison to human-judged ground truth, and we use the proposed unsupervised measure to automatically improve the quality of popular segmentation algorithms. Second, we discuss the advantage of EMD NMF over L2-NMF in the context of two challenging computer vision tasks: face recognition and texture modeling. The latter task builds on the proposed image model and demonstrates its application to non-histogram features. Third, we show that a simple segmentation algorithm based on rough segmentation using bilateral EMD NMF performs well for many image types.

List of Notations

NMF - Nonnegative Matrix Factorization
H̄ - Feature distributions associated with the given segments
H - Feature distributions associated with the true segments
W - Mixing weights of the distributions
EMD - Earth Mover's Distance
P - Precision
R - Recall
F value - A commonly used scalar measure representing both P and R
h_x(f) - Feature distribution at location x
h_f(x) - Spatial distribution for a feature subset f
BEMD - Bilateral Earth Mover's Distance
WEMD - Wavelet Earth Mover's Distance approximation
F(x) - Feature value at location x
TV(x) - Total variation

Chapter 1

Introduction

1 Introduction

The task of segmentation, or more generally grouping, is an essential stage in image analysis. The analysis of a typical complex scene is significantly simplified if the image is partitioned into semantically meaningful parts. High-level vision algorithms, such as recognition and scene analysis, usually require reasonably segmented image parts as input, so the demand for good grouping methods is well known. A great deal of research on segmentation has been done during the last decades, and many algorithms implementing different approaches have been proposed. However, the goal of an efficient, fully automatic image segmentation algorithm is yet to be reached. Although existing algorithms are able to solve correctly and efficiently a variety of specific (even very general) segmentation tasks, there is still no unified approach which is able to solve the segmentation problem in general.

At first sight the task of image segmentation seems to be a reasonable and well-defined one. Most humans can effortlessly segment an image into meaningful objects, in a way that seems reasonable to other people. Through the years image segmentation has been considered from numerous directions. Psychologists found different image properties which humans associate with object boundaries and details (Gestalt rules [72]). Different mathematical formulations (e.g., a functional minimization problem [11, 64, 51] and a relaxation problem [63, 29]) have been proposed to describe segmentation. Many different representations of the image data have been proposed: brightness, color, texture, etc. Many combinations of the approaches [46], spaces [29, 20, 14, 66, 19], and mathematical tools were tested. Learning methodologies for solving specific classes of segmentation problems were demonstrated [8, 37]. However, in contrast to humans, the existing state-of-the-art methods usually perform well only for specific classes of segmentation tasks. None of them can be taken as is and applied to a different class of segmentation tasks. Moreover, even if such a method is applied to an image from the appropriate class, it may still fail if the image is not representative of the class.

Consequently, this research is not concerned with the development of yet another segmentation algorithm (though, in the end, we propose one), but rather with proposing a new, segmentation-related image model and developing image-adaptive analysis in terms of this model. The model refers to two complementary domains: the spatial domain and the image feature domain. We assume that in the spatial domain each image location is associated with a single image segment. Ideally, we would like each object to be associated with a unique feature value. However, this ideal assumption delegates the concern for the effects of texture and image quality to the image features, which may then become very complex, such as distribution patterns of filter responses; see, e.g., [46]. Here we take the opposite approach. The considered image features are relatively simple, such as brightness gradient magnitude or filter responses, and naturally, many objects share similar feature values. To specify an object uniquely in the feature domain, we assume that it is associated with a feature distribution different from the distributions of other objects. This approach allows us to use the same image features for different image classes, while the feature domain object description becomes image-specific instead of class-specific.

In the proposed model, a feature description of any (not necessarily correct) image segment can be obtained from the descriptions of the true segments. A feature distribution

associated with such a segment is a mixture of the true segments' distributions. Moreover, the mixture weights are equal to the area ratios of the true segments building the analyzed segment. This model can be mathematically expressed as a matrix product:

H̄ = HW,   (1.1)

where H̄ = (h̄_1 ... h̄_M) are the feature distributions associated with M different segments, H = (h_1 ... h_K) are the object feature distributions, and W are the mixing weights, which actually have a spatial interpretation.

Note that the matrix H̄ is usually easy to calculate directly from the image. Then, knowing H, one can obtain spatial information on the segments associated with H̄. Moreover, even if H is unknown, the feature distributions associated with the true segments, along with some spatial information on these segments, can be estimated using nonnegative matrix factorization (NMF). The result of NMF is a pair of matrices holding the feature (distribution) and spatial information on the true segments. This information can be directly used in important vision tasks. Given a segmentation hypothesis, its quality can be evaluated by comparing the feature distributions associated with the hypothesized segments to the true distributions. The obtained true feature distributions can be used to identify, in a database, the textures building the image. Finally, the feature and spatial distributions can be used for actual image segmentation.

The core of the proposed approach is the NMF process. NMF approximates a given data matrix H̄ as a product of two low-rank nonnegative matrices, H̄ ≈ HW. The factorization becomes useful and interesting when the multiplied matrices are of low rank, implying usually that the factorization is approximate. In this case, the decomposition is useful for representing a signal as an additive combination of a small number of atomic signals (segment descriptors in our case).
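As a small numeric illustration of the model (1.1) — the two "true" histograms and the area split below are invented for the sketch:

```python
import numpy as np

# Columns of H: feature distributions of K=2 hypothetical true segments,
# e.g. gray-level histograms of a dark texture and a bright texture.
H = np.array([[0.7, 0.1],
              [0.2, 0.2],
              [0.1, 0.7]])          # 3 feature bins x 2 true segments

# A candidate segment whose area is 60% dark texture and 40% bright texture:
w = np.array([0.6, 0.4])            # mixing weights = true-segment area ratios

h_bar = H @ w                       # its feature distribution, per (1.1)
print(h_bar)                        # approx. [0.46, 0.20, 0.34]
```

The point of the sketch is that the histogram of any region composed of the true segments lies in the convex cone spanned by the columns of H — exactly the structure that NMF recovers.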
The basic algorithm proposed by Lee and Seung [39] gets a matrix H̄ and tries to find a pair of low-rank nonnegative matrices H and W satisfying

min_{H,W} Dist_φ(H̄, HW)   s.t.   W ≥ 0, H ≥ 0,   (1.2)

where the distance φ is either the Frobenius norm or the Kullback-Leibler distance. We argue that measuring the dissimilarity of H̄ and HW as the L2 or KL distance, even with additional bias terms, does not match the nature of errors in realistic imagery, and that other types of distances should be preferred. In particular, the Earth Mover's Distance (EMD) [56] metric is known (e.g., [71, 56, 42, 30]) to quantify the errors in image or histogram matching better than other metrics. Its error mechanism, modeled as a complex local deformation of the original descriptor, is often a good model for the actual distortion processes in image formation.

In this work it is proposed to factorize the given matrix using the EMD. That is, the minimization (1.2) is considered here with the EMD as the distance φ. We propose two EMD-based NMF tasks and provide linear programming based algorithms for the factorization. More efficient algorithms, based on the wavelet EMD approximation [65], are described as well. The more general algorithm, denoted bilateral EMD NMF, is suitable for the case when the distortion is modeled well by a small (in the EMD sense) error in

both the spatial and feature domains. The simpler algorithm is preferred when the distortion fits the EMD model in only one of the domains.

We examined the proposed approaches on four vision tasks. First, the proposed image model in conjunction with EMD NMF was applied to image segmentation quality estimation. We demonstrate a surprisingly high accuracy of the unsupervised estimates obtained with the proposed method in comparison to human-judged ground truth. Then, the proposed unsupervised measure is used to automatically improve the quality of popular segmentation algorithms. The strength of EMD NMF was then tested on a traditional NMF test case: face recognition. We handle unaligned facial images with some pose changes and different facial expressions, and show performance superior to that of other popular factorization methods. The third task is texture modeling. Given an unlabeled image containing multiple textures, we extract the descriptors of the individual textures using EMD NMF without actually segmenting the image. Finally, we show, for the first time, actual NMF-based image segmentation. In all cases we consider sets of naturally deformed signal samples and reconstruct parts which appear to be the meaningful original signals.

The rest of this chapter is organized as follows: sections 2 through 5 contain background on common segmentation algorithms, segmentation quality estimation approaches, nonnegative matrix factorization, and the Earth Mover's Distance. Chapter 2 provides intuition on the proposed image model by presenting it in the context of the segmentation evaluation task. The EMD NMF methods are developed in chapter 3. Chapter 4 presents four applications of the proposed approach to actual vision tasks. The experimental validation of the theory and benchmarks of the proposed applications are reported in chapter 5. The last chapter (6) provides a discussion of the reported research.
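The argument above for preferring the EMD over bin-wise distances such as L2 can be checked with a small pure-Python sketch. For 1-D histograms with unit ground distance, the EMD equals the L1 distance between cumulative sums (a toy stand-in for the LP formulation used later):

```python
def emd_1d(p, q):
    """EMD between two histograms on the same 1-D bins, unit ground distance
    (equals the L1 distance between their cumulative sums)."""
    diff, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        diff += pi - qi          # running difference of the CDFs
        total += abs(diff)
    return total

def l2(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

p = [0.0, 1.0, 0.0, 0.0]         # all mass in bin 1
q = [0.0, 0.0, 1.0, 0.0]         # the same mass shifted by one bin
r = [0.0, 0.0, 0.0, 1.0]         # shifted by two bins

print(emd_1d(p, q), emd_1d(p, r))   # 1.0 2.0 -- EMD grows with the shift
print(l2(p, q), l2(p, r))           # equal: L2 cannot tell the shifts apart
```

A small deformation of a histogram (e.g., a brightness shift) thus produces a proportionally small EMD error, while the L2 error saturates immediately.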

2 Segmentation

Much effort has been dedicated to the development of segmentation algorithms and their analysis. This effort has resulted in the large variety of segmentation approaches that exist today. Among the commonly used are graph cut algorithms [11, 64], hierarchical segmentation [63], model-assisted methods [8], and numerical geometry based methods [51]. The performance of the state-of-the-art algorithms associated with each approach is more or less similar, but it is still not close to human performance. A recent comparative evaluation of several algorithms is described in [27].

Most of the algorithms mentioned above are optimization procedures over several basic properties of an image. The properties commonly used for such optimization are intra-region homogeneity, inter-region contrast, and a model of boundaries. While the first two properties are relatively well studied and many algorithms for clustering and edge detection are available, the latter property is more subjective and is usually obtained by learning. It should be noted that the optimization techniques generally used in segmentation algorithms are kept relatively simple in order to remain efficient.

Over the years, many algorithms dealing with the problems described above have been proposed. Below, we review one or more representative methods for each of several popular approaches. The algorithms mentioned in this section are not completely automatic and require parameter tuning; in chapter 4 we propose an automatic method to do so.

Several families of image segmentation and segmentation evaluation algorithms have been widely discussed in the computer vision literature recently. Most of them contain some preprocessing steps, but it is common to group them according to the last (most important) step - the segmentation itself. In this summary the same approach is taken, and the differences between the preprocessing steps are discussed for the relevant examples.
Note also that some relations between the different algorithm families have been shown, and, in principle, the same algorithm may be considered from the point of view of a different algorithm type, even though from this point of view it will be much more complicated.

2.1 Feature space clustering

In this approach the pixels of an image are transferred into another representation, where each pixel has a feature vector containing values describing its location and/or its neighborhood. The segmentation algorithm receives the distribution of those vectors and searches for the best clustering of it. The locations associated with each cluster are considered parts of the same object. A geometric restriction is usually applied, so that two distant pixels are less likely to become parts of the same object. The common steps of the algorithms in this family are:

Converting the image to a feature space, such as color-location [18], Gabor [59], or wavelet features, or just the original gray levels. Some more complex modern algorithms [46, 69] cluster the obtained feature space data and take the distribution of cluster indexes in a pixel's neighborhood as its feature vector.

Clustering of the feature vectors. The simplest method is K-means, which is very popular because it is relatively fast. More precise methods, like mean shift for

example, demand a much bigger computational effort but improve the results [19].

Postprocessing. In this stage the pixels are assigned to the group of the closest cluster center. If the resulting segmentation contains some isolated pixels of different groups, it is repaired by a smoothing algorithm.

The simplest example of this approach is gray-scale image segmentation by thresholding. More complicated examples may be found in [28]. In our experiments we use the mean shift tool [19], which is one of the state-of-the-art segmentation tools as benchmarked in [27] and [4]. Mean shift has two algorithm-specific parameters: the pixel neighborhood sizes in the spatial and feature spaces [27]. Interestingly, the mean shift algorithm considers the spatial coordinates as features, and is thus related to the methods which operate in both feature and spatial domains, such as graph methods and numerical geometry methods.

2.2 Graph partitioning methods

This approach considers image pixels as graph vertices. The edges of the graph represent the similarity between the connected pixels. The algorithms in this approach look for a graph partitioning that minimizes the cost of the deleted edges and maximizes the cost of the remaining ones. Partitioning of an arbitrary graph is a hard problem, and two approaches are taken to simplify it. One [64] is to apply some simplifications to the graph structure and to recast the problem as finding eigenvalues of a sparse matrix. Another [11] is to define some vertices as seed points which belong to different graph partitions; the segmentation problem then becomes a standard maximal flow problem with a known solution. The problem of finding the minimal cut is different from feature space clustering. However, the simplifications made in [64] can be shown, e.g., [76], to bring the problem back into the feature clustering domain.
Another domain of segmentation problems which was shown to be related to the graph formulation is that of numerical geometry methods; see [12].

The graph partitioning approach implicitly supposes that the similarity values used as edge weights describe the image well. That is, the edge values between vertices belonging to the same object should be higher than the edge values between vertices belonging to different objects. If this assumption does not hold, the segmentation fails. For example, the N-cut method with features based on brightness similarity fails to segment some images containing various textured objects: the edges created within the same texture may be weaker than those between the different textures. One way to overcome this problem is to learn the edge values associated with actual object boundaries for a specific class of images [4].

The parameters of graph-cut algorithms usually specify the relative importance of the spatial vs. feature originated similarity information [46]. An additional parameter may specify the importance of the seed information over the contrast information [11]. In our experiments we considered the normalized cut algorithm [64]. We used the implementation of the authors [21] and represented the images in three feature spaces (gray-scale, color, and texture). We considered only one parameter: the number of segments.
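A minimal sketch of the spectral relaxation behind the normalized cut [64] may clarify the eigenvalue formulation; the toy gray levels and the similarity kernel below are invented for illustration and are not the authors' implementation [21]:

```python
import numpy as np

# Toy "image": 6 pixels whose gray levels form two homogeneous regions.
gray = np.array([0.10, 0.15, 0.12, 0.90, 0.85, 0.95])

# Edge weights: brightness similarity on a fully connected graph.
sigma = 0.2
W = np.exp(-(gray[:, None] - gray[None, :]) ** 2 / sigma ** 2)

# N-cut relaxation: second-smallest generalized eigenvector of
# (D - W) y = lambda D y, via the symmetric normalized Laplacian.
d = W.sum(axis=1)
L_sym = np.eye(len(gray)) - W / np.sqrt(d[:, None] * d[None, :])
vals, vecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
fiedler = vecs[:, 1] / np.sqrt(d)       # map back to the generalized problem

labels = (fiedler > 0).astype(int)      # threshold at zero -> two segments
print(labels)                           # pixels 0-2 vs. pixels 3-5
```

Thresholding the relaxed indicator vector recovers the two brightness regions; on real images the same machinery runs on a sparse neighborhood graph rather than a full one.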

2.3 Numerical geometry methods

In contrast to the previously mentioned approaches, these methods try to separate an object (or a group of objects) from the background. They usually consider the image as a potential field. A contour is initialized to encircle the area of interest and tries to shrink to zero length by moving towards the center of the encircled area. The image (the potential) resists this movement in regions associated with evidence for an object boundary (e.g., high gradient). As a result, in some places the contour advances farther than in others, and in the end it is located on the boundary of the object. The segmentation process is actually a process of solving image-based PDEs, originally formulated in [51].

This approach is very successful in solving specific classes of segmentation problems. However, it is not a good method for segmenting a general image with an unknown number of objects. As already mentioned, a connection between numerical geometry and graph partitioning methods was shown in [12]; thus, in principle, one could choose the better suited approach according to the problem at hand. The algorithm for choosing the best performing algorithm for a given image, described in chapter 4, may be a step towards practical use of this connection.

2.4 Hierarchical Segmentation

In contrast to the previous bottom-up approaches, this is a combination of bottom-up and top-down methods. Each small group of neighboring pixels of the original image is represented by a pixel in another image of smaller resolution. A pyramid of such images of representative pixels is built above the original image, terminating when each pixel of the highest image represents an object in the original image. After the pyramid is built, the association of the pixels of the lower levels may be changed according to some relaxation algorithm. Usually there are several such top-down reshuffling sweeps, after which the pyramid is rebuilt.
In [16] each pixel belongs to a single group at a time. Modern algorithms, e.g., [63], allow multi-group membership for each location. Both algorithms use only raw gray-level information. A more recent version of the latter algorithm [29] introduces the use of filter responses and shape distributions as pixel features.

The parameters used in the hierarchical approach determine the similarity thresholds used in the pixel gathering process. Additional parameters are related to the relaxation process, in which the final decision on the pixel-to-object association is taken; these parameters specify the probability thresholds.

2.5 Model-assisted segmentation

This method assumes a priori knowledge about the class of the object(s) which appear in the image (a top-down approach). Given the class, statistics about the object shape or, in more recent algorithms, statistics of the relative appearance of object parts are collected. In the segmented image the known shapes are matched to the image data, while the ability of the shape to deform and the amount of image data to ignore are controlled by the algorithm parameters.

Recent variations of this method apply hierarchical segmentation [8] or texture class recognition [37] to improve the obtained segmentation. To avoid over-fitting the model to the training examples, a combination of bottom-up and top-down approaches can be used in the learning stage [41].

3 Segmentation evaluation

In order to evaluate the quality of different segmentation algorithms, a segmentation evaluation mechanism is required. Unlike in, say, object recognition, there is no binary indicator of the success or failure of a segmentation task: a segmentation may be partially correct, and two different segmentations may both be reasonable.

Segmentation algorithms are sometimes evaluated in the context of specific applications. The advocates of this task-dependent approach argue that segmentation is not an end in itself, and therefore should be evaluated using the application performance (see, e.g., [9] for an estimation of grouping quality in the context of model-based object recognition). This approach is best when working on a specific application, but it does not support modular design and does not guarantee suitable performance on other tasks. As we know, humans can consistently discriminate between good and bad segmentations. This implies that, at least in principle, task-independent measures exist [47]. Such task-independent evaluations may be done by comparing the segmentation results to ground truth segmentations (supervised evaluation). Alternatively, the evaluation may be done without using any reference segmentation at all (unsupervised evaluation).

3.1 Supervised evaluation

Supervised, or ground truth based, evaluation is commonly used for empirical comparison of algorithms. Some approaches compare the evaluated segmentation to the reference segmentations using some type of set difference; see [3, 48, 73, 78] for some examples. Some of these methods focus on the boundaries between the segments and compare them to the reference boundaries, in the statistical terms of miss and false positive rates, or precision and recall [48]. A different approach, building on information-theoretic considerations, is proposed in [26].
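Boundary evaluation in precision/recall terms, as in [48], can be sketched in a few lines. The 1-D matching with a distance tolerance below is a simplification of the bipartite boundary matching actually used there; all locations are invented for the example:

```python
def precision_recall_f(detected, reference, tol=1):
    """Boundary P/R: a boundary point counts as matched if some point of the
    other set lies within `tol` positions (toy 1-D matching)."""
    def matched(pts, others):
        return sum(any(abs(p - o) <= tol for o in others) for p in pts)
    P = matched(detected, reference) / len(detected)   # precision
    R = matched(reference, detected) / len(reference)  # recall
    F = 2 * P * R / (P + R) if P + R else 0.0          # harmonic mean
    return P, R, F

detected  = [10, 20, 31, 55]    # boundary locations from some algorithm
reference = [10, 30, 50]        # ground-truth boundary locations

P, R, F = precision_recall_f(detected, reference)
print(P, R, F)                  # P = 0.5, R = 2/3, F = 4/7
```

The F value combines the two error types into the single scalar used throughout this work for comparing segmentations.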
The recently available large image databases associated with manual segmentations [47] reveal the inconsistency of human segmentations, allow the quantitative comparison of different approaches on a common test bed [27], and enable learning-based design of segmentation procedures [8, 48].

3.2 Unsupervised evaluation

Unsupervised evaluation of segmentation does not require ground truth and is based only on the information included in the image itself. It is usually based on heuristic measures of consistency, related to Gestalt laws, between the image and the segmentation. Some examples are intra-region uniformity, inter-region contrast [10, 17], specific region shape properties (e.g., convexity [36]), or combinations thereof [77]. It may also be based on statistical measures of quality (e.g., high likelihood), when a statistical characterization of the underlying perceptual context is available [3, 55, 25].

Unsupervised evaluation is considered rather weak for evaluating segmentation [78]. It is sensitive to texture and context, suffers from the absence of the very informative ground truth, and does not offer a clear interpretation: unsupervised evaluation algorithms provide a measure which, supposedly, increases monotonically with the perceptual quality of the

segmentation. Yet, this measure is not explicitly related to the empirical error probability provided by, say, precision/recall. Unsupervised evaluation is rarely discussed as an end in itself; see however [10, 17, 75, 49]. It is more commonly discussed in the context of the numerous segmentation methods (see, e.g., [51], [55]). In fact, every segmentation algorithm may be interpreted as an optimization of an unsupervised quality measure. Clearly, the evaluation methods associated with segmentation algorithms are often simplistic so that the resulting segmentation algorithm remains efficient.

In spite of its weaknesses, unsupervised evaluation is needed for generating effective segmentation algorithms, for selecting their parameters in different contexts, and for informing subsequent stages of the visual process (e.g., recognition), which get the segmentation as input, about its quality.

3.3 Consensus of several segmentation hypotheses

A relatively recent approach to segmentation evaluation compares the given segmentation hypothesis to a reference obtained from several other hypotheses [75, 70, 54]. This approach is based on the assumption that automatic segmentations contain many true details, while the false details are random and do not repeat across different segmentations. Thus, a consensus of many automatic segmentation hypotheses will contain only the true segmentation details.

The consensus approach is similar in some respects to the evaluation method proposed in this work: grouping quality is evaluated, in an unsupervised way, relative to a combination of the results of several grouping algorithms. This way, the evaluated segmentation may be compared to some reference in a quantitatively meaningful way. The key difference is in the unsupervised reference estimation.
While the consensus methods try to find a true segmentation reference based on an assumption that automatic segmentations are reasonable, we propose to estimate a much simpler ground truth reference (the distributions associated with the true segments), which is based on a specific image model and requires much weaker assumptions.

3.4 Theoretical performance prediction

Besides evaluating the obtained segmentation quality, there have been several attempts to predict the expected quality analytically and to control the segmentation process in order to obtain better quality [3, 6]. The evaluations are based on the data quality and the reliability of the grouping cues. The analytical prediction always depends on the details of the specific algorithm, but since the analysis relies on the maximum likelihood criterion, the generic algorithm analyzed in [3] performs similarly to or better than any general segmentation algorithm.

4 Nonnegative matrix factorization For a long time it has been popular to use component analysis (CA) methods (e.g., PCA, LDA, CCA, spectral clustering, etc.) in modeling, clustering, classification and visualization problems. The idea of CA techniques is to decompose a given signal into components that are related to the basic signals in the problem's domain. The given signal is then characterized, explicitly or implicitly (e.g., in kernel methods), by the mixing coefficients of these basic components. Many CA techniques can be formulated as eigenproblems, offering great potential for efficient learning of linear and nonlinear models without local minima. It is common to consider only the components related to the largest eigenvalues and to work with a signal approximation in a low dimensional space. This allows considering relatively few samples for successful estimation of the components. CA techniques are especially useful for handling high-dimensional data, where the curse of dimensionality usually requires a large number of samples to build accurate models. During the last century many computer vision, computer graphics, signal processing, and statistical problems were posed as problems of learning a low dimensional CA model. The traditional eigenproblems are equivalent to a least squares fit to a matrix; see, e.g., [52, 22]. When applied to physical processes, standard CA methods such as PCA fail to reconstruct physically meaningful components because of incorrect weighting and because they allow negative component entries. To overcome these drawbacks, Paatero and Tapper [52] proposed an alternative, least squares based approach. The Nonnegative Matrix Factorization (NMF) is a representation of a nonnegative matrix as a product of two nonnegative matrices. It is common to consider matrices of low rank, which usually implies that the factorization is approximate.
Initially [52], this task was formulated as follows: Given a nonnegative matrix A ∈ R^(n×m) and a positive integer k < min(m, n), find nonnegative matrices H ∈ R^(n×k) and W ∈ R^(k×m) which minimize the functional f(H, W) = (1/2)‖A − HW‖₂². (4.1) Minimizing (4.1) is difficult for several reasons, including the existence of local minima as a result of the non-convexity of f(H, W) in both H and W, and, perhaps more importantly, the non-uniqueness of the solution. Additional information is commonly used to direct the algorithm to the desired solution [23]. The problem got much attention after its information theoretic formulation and the multiplicative update algorithms for the Frobenius norm and the Kullback-Leibler distance proposed by Lee and Seung [38, 39]. Different aspects of this latter algorithm were analyzed and many improvements were proposed [7, 24, 32, 31, 23, 74]. The main research topics include speeding up the minimization process, studying the influence of the initialization seeds, and extending NMF to include additional constraints on W and H. See the survey in [7]. The factorization is commonly done by iterative algorithms: one matrix (e.g., W) is treated as a constant, getting its value from the previous iteration, while the other, H, is changed to reduce the cost f(H, W). Then the roles of the matrices are switched. The algorithms differ mostly in the specific cost reducing iteration, and in the use of additional

information. There are four main approaches to NMF iterations. Paatero and Tapper [52] used the alternating least squares (ALS) approach, which was reported to be the most accurate, but also the slowest method [7]. Our empirical observations concur with this claim [60]. As already mentioned, for successful application of NMF to practical problems, the factorization is biased toward {H, W} having some desirable properties. In such cases it is simpler to use a gradient descent step in each iteration, e.g., [34]. The third approach uses multiplicative update algorithms (MUA), which can be regarded as a special case of gradient descent methods. The speed and the simplicity of the original multiplicative update algorithm by Lee and Seung [39] are at the foundation of NMF's popularity. The algorithms which we develop in this work are related to the fourth approach, which in a sense may be considered a generalization of the ALS approach. Each NMF iteration is an optimization process by itself which solves a convex task [32]. Sometimes, even an approximate solution is sufficient for good performance [1, 60]. Biasing the solution toward a special form of H or W is usually dictated by application considerations. Common choices are weighting the importance of each entry of the factorized matrix, enforcing sparsity on the factors, and enforcing similarity on the weights factor. In [52] the matrix columns are weighted according to their reliability. In [23] a more general weighting approach is presented. It was shown that sparse basis functions [43, 34] and sparse mixing weights [1] should be preferred for many applications. If the relation between the mixing weights is known, they may be forced to comply with it [74]. In this research, the columns of both H and W are forced to sum to one. In the preliminary version of the segmentation evaluation tool we enforced a special parametrization of the H matrix columns.
The use of a better metric turned out to be a better biasing tool for this application. The NMF technique has been applied to many problems in the fields of object and face recognition, action recognition, and segmentation [74, 67, 60]. In computer vision, the use of NMF is strongly motivated by the relative difficulty of obtaining pure examples of, say, class descriptors, while their mixtures are easily available. In a sense, following [38], face representation/recognition became a standard test case for NMF methods. Therefore, in this work we also address the face recognition problem, although it is not in the main line of this research. The different NMF algorithms mentioned in this section are variants of the L₂ and KL distances with additional biasing terms. In the beginning [60] we also followed this path. It turned out, however, that using a different basic distance measure is advantageous for our line of application. Naturally, many of the methods developed for other NMF methods may be applied to EMD NMF as well. Chapter 3 of this report is dedicated to the derivation of the EMD NMF method. In Chapter 4 a bias on the W factor is demonstrated. To avoid repetition, further details on NMF background related to the proposed new factorization may be found in these chapters, in conjunction with the related details of the proposed method.
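As a concrete illustration of the multiplicative update family discussed above, the following sketch implements Lee and Seung's Frobenius-norm updates in NumPy. This is a minimal illustration of the classical algorithm, not the EMD-based factorization developed in this work; the function name and iteration count are our choices:

```python
import numpy as np

def nmf_multiplicative(A, k, iters=1000, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for A ~ H @ W under the
    Frobenius norm; each update keeps the factors nonnegative."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    H = rng.random((n, k)) + eps
    W = rng.random((k, m)) + eps
    for _ in range(iters):
        # Alternate: update W with H held fixed, then H with W fixed.
        W *= (H.T @ A) / (H.T @ H @ W + eps)
        H *= (A @ W.T) / (H @ W @ W.T + eps)
    return H, W
```

At a stationary point each multiplicative ratio equals one, so the updates stop changing the factors; nonnegativity is preserved because every factor is multiplied by a nonnegative ratio.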

5 Earth mover's distance The problem of distribution comparison is very important for computer vision. Many vision tasks deal with large amounts of data and thus need to summarize it with descriptors, e.g., mean filter responses [56]. It was shown that distributions of such features are more informative than mean values alone [40, 69, 45]. However, when the data is described by a histogram or histogram-like descriptor, the need for a distribution comparison tool arises. The problem of comparing a distribution S = (s₁, ..., s_n) to a distribution T = (t₁, ..., t_n) is a thoroughly studied one. When the compared distributions are related to known and theoretically studied processes, the comparison techniques are also usually theoretically well founded. Some examples are: The Kullback-Leibler divergence is a tool from information theory. It tells how well, on average, T is coded in terms of S. Formally: D(S, T) = Σᵢ sᵢ log(sᵢ/tᵢ). (5.1) The χ² distance has a statistical justification. It measures how unlikely it is that T describes a sample of the population represented by S. Formally: D(S, T) = Σᵢ (sᵢ − tᵢ)²/sᵢ. (5.2) The Kolmogorov-Smirnov distance also has a statistical justification. It measures how unlikely it is that T and S are two samples drawn from the same distribution. Formally: D(S, T) = maxᵢ |ŝᵢ − t̂ᵢ|, (5.3) where ŝᵢ and t̂ᵢ are bins of the cumulative versions of S and T, respectively. While these measures have been successfully applied to important computer vision problems (see, e.g., [48, 38]), in practice they suffer from numerical difficulties. Division by zero in the first two distances brings the overall measure to infinity. The discrete histogram bins may cause a situation where the data corresponding to some bin in one histogram has moved to a neighboring bin in the second histogram. Moreover, sometimes it is reasonable to consider histograms with different bin boundaries and different bin sums.
In the latter case we would like, following [56], to consider distances between signatures. To overcome the division by zero, more numerically stable versions of the Kullback-Leibler divergence (the Jeffrey divergence) and of the χ² distance were proposed. It is also common to use empirically justified L_p norms and some other less common methods. To overcome the binning problem, besides using the Kolmogorov-Smirnov distance, which is useful only in 1D, it is common to consider the quadratic-form distance d(S, T) = (S − T)ᵀ A (S − T), where the matrix A contains the probability of the content of bin i moving to bin j. Note, however, that each of these solutions is ad hoc and does not provide a general tool to compare a pair of signatures with a prespecified relation between the bins. To address all these problems in the same framework, Rubner proposed to use the Earth mover's distance [56].
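The three classical distances above translate directly into code; a small ε guards the divisions and logarithms, which is exactly the numerical fragility discussed in the text (the ε value and function names are our choices):

```python
import numpy as np

def kl_divergence(s, t, eps=1e-12):
    """Kullback-Leibler divergence (5.1); eps guards log(0) and /0."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    return float(np.sum(s * np.log((s + eps) / (t + eps))))

def chi2_distance(s, t, eps=1e-12):
    """Chi-square distance (5.2); blows up when a bin of s is empty."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    return float(np.sum((s - t) ** 2 / (s + eps)))

def ks_distance(s, t):
    """Kolmogorov-Smirnov distance (5.3) on cumulative histograms."""
    return float(np.max(np.abs(np.cumsum(s) - np.cumsum(t))))
```

Note that the first two compare histograms bin by bin, so mass that merely drifts to a neighboring bin is penalized as heavily as mass that moves far away; this is the binning problem the EMD addresses.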

The Earth Mover's Distance (EMD) is a method to evaluate the dissimilarity between two distributions in some feature space, where a distance measure between single features, the ground distance, is given. The EMD, whose variants are known as the Monge-Kantorovich metric, the Wasserstein metric, the Mallows distance, etc., was first applied to computer vision tasks by Werman et al. [71] and generalized by Rubner [56]. The name EMD follows the intuitive explanation of the measure: Given two distributions, one can be seen as a mass of earth properly spread in space, the other as a collection of holes in that same space. Then, the EMD measures the least amount of work needed to fill the holes with earth. Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance. Formally, to compute the EMD one should solve a transportation problem. This can be done by solving a linear program; see eqs. (4.5) and (2.4). In this formulation the bins of S represent the suppliers, the bins of T represent the receivers, and the ground distance measures the cost of moving a mass unit of goods from each supplier to each receiver. EMD was shown to outperform other distances for numerous computer vision problems, e.g., [71, 56, 42, 30]. However, the solution of the transportation problem takes O(N³ log N) for a pair of histograms of N bins. Thus, it is frequently not applied to practical problems due to its computation time. 5.1 Fast EMD approximations The different accelerated EMD computation techniques proposed over the years may be roughly divided into two main groups. One group refers to special cases of the EMD problem and shows fast and exact EMD calculation methods which work only for a specific ground distance or a specific type of signature. The other proposes fast but approximate calculation methods for more general cases. We start with the exact, special case methods. In their original work, Werman et al.
[71] showed that the EMD between one dimensional histograms with L₁ as the ground distance is equal to the L₁ distance between the cumulative histograms. Ling and Okada proposed EMD-L₁ [44]. They showed that if the ground distance is L₁, the number of variables in the LP problem can be reduced from O(N²) to O(N). The worst-case time complexity is exponential, as for all simplex methods; empirically, however, they showed that this algorithm has an average time complexity of O(N²). Pele and Werman [53] proposed using EMD with thresholded distances. They have shown that in this case the thresholded EMDs are metrics and their computation time is an order of magnitude faster than that of the original algorithm. A special case of a thresholded ground distance of 0 for corresponding bins, 1 for adjacent bins, and 2 for farther bins and for the extra mass can be computed in linear time. Now we mention some examples of approximate methods. Indyk and Thaper [35] proposed approximating EMD-L₁ by embedding it into the L₁ norm. The embedding complexity is O(Nd log Δ), where N is the feature set size, d is the feature space dimension and Δ is the diameter of the underlying space. Grauman and Darrell [30] substituted L₁ with histogram intersection in order to approximate partial matching. Shirdhonkar and Jacobs [65] presented a linear-time algorithm for approximating the EMD with some L_n ground distances using the wavelet coefficients of the difference between histograms. In this work we use the wavelet EMD approximation [65] because it is one of the most general and fastest methods, and mainly because it has an analytic gradient expression.
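Werman et al.'s special case is easy to state in code: for 1-D histograms of equal total mass, with |i − j| as the ground distance, the EMD reduces to the L₁ distance between the cumulative histograms (a sketch; the function name is ours):

```python
import numpy as np

def emd_1d(s, t):
    """EMD between two equal-mass 1-D histograms with ground
    distance |i - j|: the L1 distance of cumulative histograms."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    assert np.isclose(s.sum(), t.sum()), "EMD needs equal total mass"
    return float(np.sum(np.abs(np.cumsum(s) - np.cumsum(t))))
```

For example, moving a unit of mass three bins away costs 3, whereas a bin-wise distance would charge the same for a one-bin shift and a three-bin shift.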

Chapter 2 Segmentation Evaluation

(a) (b) (c) (d) Figure 2.1: Edge strength distributions on different regions of an image. The regions are specified by manual (a) or automatic segmentations. Distribution densities on the segments specified by (several) manual segmentations (b). Each curve is associated with the distribution of a single segment. The plot contains distributions of several different segmentations of the same image. Cumulative distributions corresponding to manual segmentations (c). Cumulative distributions corresponding to incorrect segmentations of the same image (d). The thick curves are actually clusters of similar thin curves. 1 Framework We consider the evaluation of segmentations such as those created by general purpose segmentation algorithms, e.g., [19, 64]. Thus, a segmentation is a partition of the image into disjoint regions, separated by thin boundaries. We denote the evaluated segmentation as the hypothesized or given segmentation. Every point in the image may be characterized by some local properties, such as intensity, color, or texture, which may be represented by some feature vector. In this work we used three boundary sensitive operators corresponding to texture, brightness and color. Consider a good segmentation of some image (note that some images have more than one good segmentation). Our basic model is that for every pixel within a particular segment, the local characterization may be regarded as an instance of a random variable, associated with some (discrete) distribution. The distributions associated with different segments are not necessarily different. The region around the boundary is considered as another segment and is characterized by a distribution, just like every other segment. Intuitively, for boundary sensitive operators, we expect this distribution to put higher weights on high values. Note, however, that due to texture, the other distributions are not expected to be disjoint from the boundary distribution.
A given image is associated with a small number of distributions, characterizing the different object appearances and the boundaries. We assume that the distributions of the true segment parts are approximately equal to the distribution of the whole true segment. As an example, consider the distributions associated with the intensity operator for several human segmentations of the same image (Fig. 2.1(a),(b)). Note that the distributions are clustered into several types. The clustering phenomenon is clearer in the less noisy, cumulative representation of these distributions (Fig. 2.1(c)). The lower cumulative distribution curves, which rise only for relatively high values, are those associated with the manually marked boundaries. The examples in Figure 2.2 show that this phenomenon occurs in many images in each of the three modalities. It should be emphasized that these distributions, characterizing the true segments and the true boundary, are not only unknown but are not

(a) original (b) manual segmentations (c) brightness (d) color (e) texture Figure 2.2: Examples of the object distribution clustering effect for different modalities. The boundary distributions associated with the different manual segmentations shown in the second column are plotted in red. The object distributions are blue. even uniquely specified. Note, however, that despite the significant difference in boundary markings made by different people, the corresponding boundary distributions are similar for each image; see Figure 2.2. We shall show that estimating them leads to a quantitative, meaningful and yet unsupervised quality measure for a given segmentation. Consider now an incorrect segmentation hypothesis. Every incorrect segment contains parts from several true segments. Each such part is associated with the distribution of the true segment to which the part belongs. Therefore, we expect the incorrect segments to be characterized by mixtures of the true distributions; see Fig. 2.1(d). The basic goal considered here is to estimate the correctness of a given segmentation when no ground truth is given. Specifically, we would like to estimate the accuracy of the inter-segment boundaries in precision/recall terms [48]. To carry out this seemingly impossible task, we first consider a simpler one. Assume that the number of true segments (including the boundary segment), k, as well as the associated distributions, are known. All the mixture distributions lie in the convex hull of these true distributions. Therefore, given a particular hypothesized segment and its distribution, the mixture coefficients associated with the hypothesized segment may be obtained by solving an overconstrained system of linear equations. Then the precision and recall may be easily calculated; see below.

Consider now the more difficult task, where the true distributions are not known. To find these true distributions, specify many (not necessarily correct) hypothesized segments and find their distributions. Each of these hypothesized distributions lies in the convex hull of the k true distributions. Thus, this set of distributions may be regarded as a matrix product of the true distributions and a nonnegative weights matrix. Therefore, finding the true (hidden) distributions associated with the true segments is a nonnegative matrix factorization task. Formally, let hᵢ be the operator response distribution in the i-th segment, represented as an n-bin histogram or column vector. Thus, H = (h₁, h₂, ..., h_k) ∈ R^(n×k) represents all the underlying (true) distributions on the image. Consider now some segmentation containing m segments (including the boundary). Let H′ = (h′₁, h′₂, ..., h′_m) ∈ R^(n×m) be the matrix of the distributions associated with these segments. Then, H′ may be written as H′ = HW, (1.1) where W ∈ R^(k×m) is a weight matrix. In practice, the measured distributions may be noisy. The factorization still holds as an approximation H′ ≈ HW for an effective value of k, which we estimate. W.l.o.g., let h₁ and h′₁ be the histograms associated with the boundaries in the true segmentation and in the hypothesized one. Then, by definition, Precision = w₁₁, Recall = α₁w₁₁ / Σⱼ αⱼw₁ⱼ, (1.2) where αⱼ is the size of the j-th segment. Thus, the quality of a given hypothesized segmentation may be found by decomposing its operator response histogram matrix into two matrices H and W, representing the distributions associated with the true segments and the mixture coefficients, respectively. Note that we do not find the ground truth segmentation at any step of this evaluation; we have only its description in terms of feature distributions.
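With 0-based indices and the boundary distribution placed first in both the true and hypothesized orderings, eq. (1.2) is a two-liner (a sketch; the names are our choices):

```python
import numpy as np

def precision_recall(W, alpha):
    """Eq. (1.2): W is the k x m mixture-weight matrix whose row 0 is
    the true boundary and whose column 0 is the hypothesized boundary;
    alpha holds the sizes of the m hypothesized segments."""
    W = np.asarray(W, float)
    alpha = np.asarray(alpha, float)
    precision = float(W[0, 0])
    recall = float(alpha[0] * W[0, 0] / np.dot(alpha, W[0]))
    return precision, recall
```

A perfect segmentation (W = I) gives precision = recall = 1; any true-boundary mass leaking into other hypothesized segments lowers the recall.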
2 Finding true segmentation distributions using nonnegative matrix factorization 2.1 Nonnegative matrix factorization The decomposition of the measured histogram matrix H′ into a mixture of basic histograms is a nonnegative matrix factorization (NMF) task [39, 23, 7, 24]. This task is often formulated as follows: Given a nonnegative matrix H′ ∈ R^(n×m) and a positive integer k < min(m, n), find nonnegative matrices H ∈ R^(n×k) and W ∈ R^(k×m) which minimize the functional f(H, W) = Dist(H′, HW). (2.1) The matrix pair {H, W} is called a nonnegative matrix factorization of H′, although H′ is not necessarily exactly equal to the product HW. Minimizing (2.1) is difficult for several reasons, including the existence of local minima as a result of the nonconvexity of f(H, W) in both H and W, and, perhaps more importantly, the nonuniqueness of the solution. Additional information is commonly used to direct the algorithm to the desired solution [23].

2.2 NMF algorithms The problem of nonnegative matrix factorization was introduced by Paatero and Tapper [52] but got much attention only after its information theoretic formulation and the multiplicative update algorithm by Lee and Seung [39]; see the survey in [7]. The factorization is commonly done by iterative algorithms: one matrix (e.g., W) is treated as a constant, getting its value from the previous iteration, while the other, H, is changed to reduce the cost f(H, W). Then the roles of the matrices are switched and W is changed for fixed H. The algorithms differ mostly in the specific cost reducing iteration, and in the use of additional information. The preliminary version [60] of this work was based on a variation of Lee and Seung's algorithm. This method is based on the traditional minimization of the L₂ distance in (2.1), and required additional constraints to perform well. For many signal comparison tasks, where the error mechanism is not modeled well by additive noise but is rather a complex local deformation of the original signal or the signal descriptor, the Earth mover's distance (EMD) performs better than many other metrics [56, 42, 13]. Histogram comparison is a well-known case of such a task, because any noise added to a signal affects several bins of that signal's histogram. EMD measures the minimal change needed to convert one of the compared histograms into the other, subject to the given deformation costs. In [61] it is shown how to perform an NMF task that minimizes the EMD between the matrix columns. As expected, EMD NMF performs much better for the factorization of histogram sets than its L₂ based analogs [61]. In EMD NMF, H′ is factorized by a sequence of linear tasks: H^k = argmin_H Σ_m EMD(H′_m, (HW^(k−1))_m), W^k = argmin_W Σ_m EMD(H′_m, (H^k W)_m), (2.2) where A_m denotes the m-th column of a matrix A; see [61] for details.
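The inner quantity of each iteration in (2.2), the EMD itself, is the transportation LP described in Chapter 1. A direct (unoptimized) sketch using SciPy's `linprog` is shown below; this is our illustrative helper, not the solver used in [61]:

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(s, t, D):
    """Exact EMD between equal-mass histograms s and t with ground
    distance matrix D, via the transportation LP: minimize the total
    cost sum(f_ij * D_ij) subject to row sums s, column sums t, f >= 0."""
    n = len(s)
    c = np.asarray(D, float).ravel()        # cost of each flow f_ij
    A_eq, b_eq = [], []
    for i in range(n):                      # mass leaving bin i of s
        row = np.zeros(n * n); row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row); b_eq.append(s[i])
    for j in range(n):                      # mass arriving at bin j of t
        col = np.zeros(n * n); col[j::n] = 1.0
        A_eq.append(col); b_eq.append(t[j])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return float(res.fun)
```

With D_ij = |i − j| this agrees with the 1-D cumulative-histogram formula; for N bins the LP has N² flow variables, which is exactly the cost motivating the fast approximations of Chapter 1.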
2.3 Factorizing the histogram matrix To carry out the factorization (1.1), we used the EMD NMF method [61] as well as several supporting techniques. Much data is needed for a successful factorization. Therefore, instead of factorizing the matrix H′ associated with a single segmentation, we consider a larger matrix associated with several segmentations (of the same image). Such segmentations are either available or may be created using a segmentation algorithm with different sets of parameters. H′ is thus redefined as an n × M matrix whose columns are the M histograms associated with all segments of all segmentations: H′ = (h′₁, h′₂, ..., h′_m, h′_(m+1), ..., h′_M). (2.3) The factorization (1.1) is now changed to H′ ≈ HW, where H is an n × k matrix (unchanged) and W is a much larger k × M weight matrix. Clearly, for a successful factorization, H′ should contain different combinations of the true H vectors. Geometrically, the columns of H′ are points in the convex hull specified by the

columns of H in R^n₊. To get a stable reconstruction of the convex hull, the samples (vectors in H′) should represent it well. An ideal situation would be if each H′ vector were equal to some vector in the true H matrix and all H vectors were represented. The more realistic scenario is to get some vectors in H′ that are close to the vectors of H and many others that are in the inner regions of the convex hull. In terms of segmentation quality, this means that we are interested in segments which are pure examples of the existing object classes, in addition to segments which are mixtures of different objects. A trivial way to obtain such examples is an over-segmentation of the image, but in practice this is not a good solution. For reliable histogram estimation, the pure segments should be large and include a significant (10%-20%) fraction of the true segment. Empirically, we found that if the segmentations constructing the H′ matrix are associated with a large diversity of precision and recall grades (though no segmentation necessarily has both high), the reconstruction is stable. This can usually be achieved by choosing a large diversity of automatic segmentation algorithm parameter sets. For common segmentation sets, many segments are associated with very similar distributions. For an example, see Fig. 2.1(d), where the middle cluster of curves corresponds to such similar distributions. Geometrically, this means that the center of the convex hull is overrepresented. Such uneven representation increases the computational effort and sometimes may even cause an incorrect factorization. Following [24], we represent the combinations of H distributions more evenly by a dilution process which replaces every set of similar columns with a single representative. (Technically, we consider only a prespecified number of the most EMD-different H′ vectors.) After the NMF is carried out, the full, nondiluted weight matrix W is found by a single W iteration from (3.1).
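The dilution step can be sketched as a greedy farthest-point selection over the columns of H′, using the 1-D cumulative form of the EMD as the dissimilarity. This is our simplified stand-in for the procedure of [24], not the exact implementation used in the thesis:

```python
import numpy as np

def dilute_columns(Hp, n_keep):
    """Keep n_keep mutually most-EMD-different columns of Hp; every
    set of near-duplicate columns ends up represented by one of them."""
    Hc = np.cumsum(Hp, axis=0)          # cumulative histograms
    m = Hp.shape[1]
    chosen = [0]                        # seed with the first column
    while len(chosen) < n_keep:
        # distance of each column to its closest already-chosen column
        d = np.full(m, np.inf)
        for c in chosen:
            d = np.minimum(d, np.abs(Hc - Hc[:, [c]]).sum(axis=0))
        chosen.append(int(np.argmax(d)))
    return Hp[:, chosen], chosen
```

Duplicated columns get distance 0 to their representative and are never selected again, which thins out the overrepresented center of the convex hull.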
The factorization algorithm is formally described in Algorithm 1. This algorithm actually finds the precision and recall not only for the segmentation of interest, but for all segmentations in H′. These estimates are useful for finding the model complexity k; see below. Given H, we still need to identify the distribution (column of H) associated with the boundary. We choose the distribution with the highest µ + 2σ value, where µ is the expected value of the distribution and σ is its standard deviation. Algorithm 1 Factorization Input: Histogram matrix H′, model complexity k. 1: Dilute H′ as described above. 2: Initialize H₀ ∈ R^(n×k) with the most EMD-different columns from H′. Initialize W ∈ R^(k×m) with random values, and normalize its columns to sum to 1. 3: Do W and H iterations (3.1), solving H′ ≈ HW, until convergence. 4: Order the columns of H by µ + 2σ. 5: Solve H′ = HW for W with a W iteration using the obtained H. 6: Decompose W into segmentation-specific coefficient matrices Wᵢ. Estimate the precisions (Pᵢ) and the recalls (Rᵢ) for each segmentation using Wᵢ and (1.2). Output: {Pᵢ}, {Rᵢ}.

2.4 Estimating model complexity using several modalities The factorization algorithm described above decomposes the available histograms into sums of k basic histograms. We found, empirically, that assigning the correct value to the model complexity k is critical to the algorithm's success: for example, a too-high value of k may lead to a decomposition of the true boundary histogram into two or more estimated basic histograms, and to inaccurate estimates of P and R. The best value of k differs from image to image and depends on the type of boundary-sensitive operator (modality) as well. Specifying the true number of clusters is a hard and, in the general case, unsolved problem. However, for the current problem, we found that the correspondence between estimates in different modalities can assist us in finding the correct model complexities. We use three modalities: brightness, color and texture. The NMF (Algorithm 1) is applied to each of them separately, using three corresponding model complexities, k₁, k₂, k₃. Let P̂ᵢⱼ, R̂ᵢⱼ be the estimated precision and recall associated with the i-th segmentation and the j-th modality. We found empirically that these estimates should have the following properties: Consistency between modalities. In principle, if we use different boundary sensitive operators, we should still get similar precision and recall if they function properly. Likewise, we should get similar F-values (the commonly used scalar measure representing both P and R, F = 2PR/(P + R)) for each modality if the model complexities were chosen properly. Thus, α_c = maxᵢ Σⱼ₌₁³ |F̂ᵢⱼ − medianⱼ(F̂ᵢⱼ)| (2.4) measures the consistency and should be small. This measure was empirically chosen over other possible measures, e.g., measuring the consistency in P and R separately.
Diversity of evaluations. The parameters of the segmentation algorithm are chosen to maximize the diversity of the precision and recall grades of the respective segmentations, to ensure a good estimation of the H vectors. Thus, it is expected that α_v = stdᵢ(medianⱼ(P̂ᵢⱼ)) · stdᵢ(medianⱼ(R̂ᵢⱼ)) (2.5) will be large for correct P̂ᵢⱼ and R̂ᵢⱼ estimates. Note that when kⱼ is too small (i.e., one vector in H represents two or more actual classes), the variety of recall grades decreases, because substantial parts of the inner segments are assigned to be the boundary and always remain undetected. Analogously, when kⱼ is too large, the variety of precision grades decreases. The boundary size cannot be too large, because otherwise the image would contain only boundaries and no pixels inside segments. Thus, using the P̂ᵢⱼ's and the segment sizes, we estimate the boundary area percentage in the image and denote it α_b. Then, one empirically selected way to quantify the considerations discussed above is to minimize c(k₁, k₂, k₃) = α_c α_b / α_v. (2.6)
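Under our reading of eqs. (2.4)-(2.6), the selection cost can be sketched as follows; the lower clipping constant ε and the assumption that α_b is supplied externally are our choices:

```python
import numpy as np

def complexity_cost(P, R, alpha_b, eps=1e-3):
    """c(k1,k2,k3) from per-(segmentation, modality) precision P and
    recall R estimates; small alpha_c (cross-modality consistency) and
    large alpha_v (diversity of grades) are rewarded, and each factor
    is clipped from below so that none dominates the product."""
    P, R = np.asarray(P, float), np.asarray(R, float)
    F = 2 * P * R / (P + R)                        # per-cell F-measure
    med_F = np.median(F, axis=1, keepdims=True)
    alpha_c = float(np.max(np.sum(np.abs(F - med_F), axis=1)))
    alpha_v = float(np.std(np.median(P, axis=1)) *
                    np.std(np.median(R, axis=1)))
    return max(alpha_c, eps) * max(alpha_b, eps) / max(alpha_v, eps)
```

A complexity triple whose three modalities agree with each other while still producing diverse grades across segmentations gets a low cost; inconsistent modalities or collapsed grade diversity drive the cost up.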

Algorithm 2 Evaluation Input: A test image I and its segmentation(s) sᵢ, i ∈ 1, ..., M. 1: If needed, add additional segmentations using a segmentation algorithm and different parameter sets. 2: Run three boundary sensitive operators (denoted different modalities), and measure their distributions within the segments. Construct three matrices H′₁, H′₂, H′₃. 3: For all combinations of k₁, k₂, k₃ ∈ {2, 3, 4, 5}, factorize every matrix H′ⱼ using the corresponding kⱼ value by applying Algorithm 1, and obtain the precisions {P̂ᵢ,ₖⱼ} and recalls {R̂ᵢ,ₖⱼ} of all segmentations. Choose the (k₁, k₂, k₃) triple minimizing c(k₁, k₂, k₃) (2.6). 4: Calculate: Pᵢ = medianⱼ(P̂ᵢ,ₖⱼ) and Rᵢ = medianⱼ(R̂ᵢ,ₖⱼ). Output: {Pᵢ}, {Rᵢ}. Note that none of the factors should dominate this expression even if it is very small. Thus each α is bounded by some small constant. 2.5 Dealing with boundary inaccuracies Typical segmentation algorithms distort the boundaries. That is, even for a segmentation providing roughly true segments, the boundary locations are inaccurate. This problem is recognized in supervised evaluation methods [48, 27], and some small location error margin is allowed. Naturally, the problem arises here as well: the distribution evaluated on the inaccurate boundary is not the one characterizing the true boundary, and the distribution evaluated within a segment contains contributions from the boundary. Thus, we do not use the distribution of the boundary sensitive operator directly. Rather, to handle this difficulty, we replace the responses in the boundary points with the highest response in their circular neighborhood (r = 5). Because we expect higher values on the boundary, the pixel contributing the maximal value is indeed likely to belong to the true boundary. The other segment distributions are calculated similarly, except that points which were considered when the boundary distribution was calculated do not contribute to these distributions.
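The boundary-response replacement of Sec. 2.5 amounts to a maximum filter applied only at boundary pixels. A plain NumPy sketch, looped over boundary pixels for clarity rather than speed (the function name is ours):

```python
import numpy as np

def boundary_responses(resp, boundary_mask, r=5):
    """Replace each boundary pixel's operator response with the
    maximum response inside its circular neighborhood of radius r."""
    h, w = resp.shape
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    disk = [(int(dy), int(dx)) for dy, dx in zip(yy.ravel(), xx.ravel())
            if dy * dy + dx * dx <= r * r]       # circular footprint
    out = resp.copy()
    for y, x in zip(*np.nonzero(boundary_mask)):
        out[y, x] = max(resp[y + dy, x + dx] for dy, dx in disk
                        if 0 <= y + dy < h and 0 <= x + dx < w)
    return out
```

Non-boundary pixels are left untouched; only the marked boundary points absorb the strongest nearby response, which is the pixel most likely to lie on the true boundary.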
A summary of the full factorization-based evaluation algorithm is described in Algorithm 2.

Chapter 3 Nonnegative Matrix Factorization with Earth Mover's Distance metric

(a) (b) (c) (d) Figure 3.1: Bilateral relation between the spatial and the feature domain representations. The image (a) may be represented in two domains. The highlighted spatial bin is associated with the feature distribution h_x₀(f) (b). The highlighted feature bin in the whole image histogram h(f) (d) is associated with the spatial distribution h_f₀(x) (c). The highlighted bins in the spatial (c) and the feature (b) distributions are identical; h_f₀(x₀) = h_x₀(f₀). 1 Observations and intuitions The EMD NMF methods we propose are general and are not limited to the image domain. For concreteness and a more intuitive explanation, we chose to focus here on image representation. Consider an image f(x) describing some feature f as a function of the coordinate x. We shall be interested in two types of histograms representing, respectively, parts of the image and parts of the feature space. A feature distribution h_x(f) corresponds to a region R_x in the image and describes the feature distribution corresponding to the pixel values in this region. A spatial distribution h_f(x) corresponds to a subset f of the feature space and describes the distribution of spatial locations corresponding to pixels having a value in this subset. Note that the spatial distributions do not necessarily sum to one. See Figure 3.1 for the relations between the two domains and the respective histograms. In this work we consider only spatial regions and feature domain subsets large enough to contain a reasonable number of samples. Note that many other image representations follow this formulation. Two examples are orientation histograms [45] and Gabor jets [62]. While both coordinates may be multidimensional (e.g., Gabor jets), for simplicity we discuss only a scalar feature f in what follows.

Consider representing an image object, or several similar objects (denoted a visual class throughout this paper), using spatial and feature distributions. Ideally, we would expect such an object to be associated with the same feature vector in all its locations. We would also expect the spatial distributions to be piecewise constant within the objects for every feature subset. Naturally, this expectation is unrealistic and the respective distributions differ somewhat, though these differences often follow a systematic pattern described below.

Consider a region belonging to a visual class with some ideal gray level histogram h(f). Different regions of the same class may be associated with different surface normal directions and correspondingly brighter or darker histograms. In this case, the absence of some gray level in the histogram is better explained by the presence of additional gray levels in nearby feature histogram bins than in distant, unrelated bins. Consider now the spatial domain. In realistic textures, the distribution of gray levels in every region is not entirely uniform. Consider, for example, two adjacent regions in an image of a zebra. One region may contain more black pixels than the other, but the union of the regions has a histogram which is closer to the ideal class histogram. More generally, the absence of some gray level in a spatial bin is better explained by the presence of surplus instances of this gray level in nearby spatial bins than in other locations. This model of distortion leads to comparison of distributions with the Earth mover's distance, as will be explained in greater detail in the next section.

The proposed image model is well-suited to the NMF representation. Let the (i, j)-th element of Ĥ measure the number of pixels with the i-th feature value in the j-th region of the image. Then the j-th column of Ĥ contains the feature distribution in region j, h_j(f).
Analogously, the i-th row contains the spatial distribution of the i-th feature subset, h_i(x). The factorization variables H and W refer to the feature and spatial representations of the visual classes of the image. The columns of H represent the ideal feature distributions, and the rows of W represent the ideal visual class locations, i.e., the image segments. The value of the (i, j)-th bin in the product matrix HW is the sum of the i-th feature probabilities in the different classes, weighted by their relative areas in the j-th region. In other words, it tells us how many of the feature values in range i we expect to find in region j, which is exactly the property measured by the (i, j)-th bin of Ĥ. By factorizing Ĥ, we therefore perform clustering in both the spatial and the feature domains. For image segmentation it is common to consider such groupings and gather pixels with similar appearance features and spatial locations. Some methods, for example, explicitly use this principle by clustering pixels in a combined (color, spatial coordinates) space [66, 19]. Here we show that NMF models both the spatial and the feature image descriptors in a complementary way and acts as an iterative, EM-like segmentation algorithm. For a reasonable factorization we should ensure that Ĥ ≈ HW and that the differences follow the local deformation model discussed earlier. This compels us to minimize the EMD error between both the rows and the columns of Ĥ and HW. In the next sections we quantify these requirements and use them to propose EMD NMF.
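To make the construction of Ĥ concrete, the following sketch builds the histogram matrix from the gray-level tiles of an image. The tile and bin sizes are arbitrary illustrative choices, not values used in the thesis:

```python
import numpy as np

def histogram_matrix(image, tile_size=16, n_bins=32):
    """Build the N x M matrix H-hat whose m-th column is the gray-level
    histogram of the m-th non-overlapping tile of `image` (values in [0, 1))."""
    h, w = image.shape
    cols = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            hist, _ = np.histogram(tile, bins=n_bins, range=(0.0, 1.0))
            cols.append(hist / hist.sum())   # each column sums to one
    return np.stack(cols, axis=1)            # shape (n_bins, n_tiles)

rng = np.random.default_rng(0)
H = histogram_matrix(rng.random((64, 64)))
print(H.shape)   # (32, 16)
```

Each column is then a feature distribution h_j(f) in the sense above, and the columns of Ĥ sum to one by construction.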

Figure 3.2: Intuitive explanation of the model. The feature distributions in the graphs (b) (lower: feature histograms; upper: cumulative histograms) are associated with the squares in the image (a). The red dotted lines refer to the red squares lying inside the totem segments, the green dashed lines refer to the green squares lying inside the background segments, and the yellow solid lines refer to the yellow squares intersecting both the totem and the background segments. The feature indicator images (c) show the pixels with equal feature values. The respective spatial histograms are shown in (d).

2 EMD NMF

Consider M nonnegative histograms with N bins. The histograms are represented in matrix form, Ĥ ∈ R^{N×M}, where the j-th histogram is the column Ĥ_j. The matrix Ĥ may be decomposed into a product of H ∈ R^{N×K} and W ∈ R^{K×M}, where H and W are interpreted as K basis vectors in two complementary domains. In most cases, a low dimensional approximation is more meaningful than an exact factorization. Then, the desired factorization H, W is a solution of eq. (1.2) for small K values.

Let Dist_φ(A, B) be the sum of distances φ between the corresponding columns of A and B. Then Ĥ ≈ HW implies that Dist_φ(Ĥ, HW) is the sum of distances between the feature histograms. Analogously, Ĥ^T ≈ W^T H^T implies that Dist_φ(Ĥ^T, W^T H^T) is the sum of distances between the spatial histograms. Therefore, in order to find the spatial distributions, we should factorize Ĥ^T by solving

arg min_{H,W} Dist_φ(Ĥ^T, W^T H^T)   s.t.  W ≥ 0, H ≥ 0.    (2.1)

A joint clustering in both domains is, therefore,

arg min_{H,W} λ_1 Dist_φ(Ĥ, HW) + λ_2 Dist_φ(Ĥ^T, W^T H^T)   s.t.  W ≥ 0, H ≥ 0.    (2.2)

Conveniently, the L_2 distance is bin-wise and Dist_{L_2}(Ĥ, HW) = Dist_{L_2}(Ĥ^T, W^T H^T).
Thus, segmenting an image in the spatial and feature domains is equivalent to solving the traditional L_2-NMF of the feature distribution matrix associated with this image. Unfortunately, this approach fails for real images. Solving (2.2) with L_2-NMF implicitly assumes that the errors in different histogram bins are independent. This assumption is not a good

model for the sample deviations in the approximation Ĥ ≈ HW, in either the feature or the spatial domain. As already mentioned, we propose to use the EMD metric for column comparison and show its ability to solve such problems.

2.1 Earth mover's distance

The Earth mover's distance (EMD) evaluates the dissimilarity between two distributions in some feature space, where a distance measure between single features is given [56]. For image features, the EMD is motivated by the following intuitive observation: some histogram bin mass may transfer to nearby bins due to natural image formation processes. The distance between two distributions which may be considered small local deformations of each other should be smaller than that of other distribution pairs, which differ in non-neighboring bins. Intuitively, the EMD may be viewed as the total cost of the changes required to transform one distribution into the other, with low cost assigned to local deformations and high cost to nonlocal ones.

Formally, the EMD between two histograms is defined through a linear program whose goal is to minimize the total cost of the flow f(i, j) between the bins (i) of the source histogram and the bins (j) of the target histogram, for a given inter-bin flow cost d(i, j); see [56]. The cost parameter d(i, j), also denoted the ground distance, specifies the inter-bin flow cost for each pair of source and target bins. EMD is a metric when d(i, j) is a metric as well; thus, we consider only this type of cost function and denote it the underlying metric. We consider the nonnormalized distance

EMD(h^s, h^t) = Σ_{i,j} f(i, j) d(i, j),    (2.3)

where f(i, j) is a solution of

min_f Σ_{i,j} f(i, j) d(i, j)    (2.4)

s.t.  f(i, j) ≥ 0,   Σ_j f(i, j) ≤ h^s_i,   Σ_i f(i, j) ≤ h^t_j,
      Σ_{i,j} f(i, j) = min(Σ_i h^s_i, Σ_j h^t_j);    (2.5)

we use the nonnormalized form because the total flow in our case is prespecified.
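The LP (2.4)–(2.5) is a standard transportation problem and can be handed to an off-the-shelf solver. The sketch below (our illustration, not the thesis implementation) computes the EMD of two normalized histograms, for which the inequality constraints in (2.5) tighten to equalities:

```python
import numpy as np
from scipy.optimize import linprog

def emd(hs, ht, D):
    """Exact EMD between normalized histograms hs, ht (eqs. 2.3-2.5),
    solved as a transportation LP; D[i, j] is the ground distance."""
    n, m = len(hs), len(ht)
    # Both histograms sum to one, so the flow is fully used:
    # sum_j f(i,j) = hs[i]  and  sum_i f(i,j) = ht[j]
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0     # row sums of the flow
    for j in range(m):
        A_eq[n + j, j::m] = 1.0              # column sums of the flow
    b_eq = np.concatenate([hs, ht])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

bins = np.arange(4)
D = np.abs(bins[:, None] - bins[None, :]).astype(float)  # 1D ground distance
h1 = np.array([1.0, 0.0, 0.0, 0.0])
h2 = np.array([0.0, 0.0, 0.0, 1.0])
print(emd(h1, h2, D))   # 3.0: all the mass moves three bins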
Earth mover's distance between matrices. We define the EMD between two matrices with M columns as the sum of EMDs between each column of the source matrix and the corresponding column of the target matrix:

‖H^s − H^t‖_EMD = Σ_{m=1}^{M} EMD(H^s_m, H^t_m).    (2.6)

For columns representing feature vectors, this distance measures the sum of distances between the respective feature pairs. Naturally, to consider the EMD in the spatial domain, we should evaluate ‖H^{sT} − H^{tT}‖_EMD.

2.2 Single domain LP-based EMD algorithm

The general NMF problem is nonconvex and has a unique solution only in limited cases [24]. However, if one of the variable matrices H or W is given, the problem becomes linear. Thus, by consecutively fixing either H or W, one can find a local minimum of (1.2) by solving a sequence of convex tasks. This approach is also applicable to the case at hand by a simple reformulation of the EMD linear programming problem. As a result, a local minimum of EMD NMF is found by solving a sequence of linear programming tasks.

Consider h^s = Ĥ_m and h^t = (HW)_m. Both vectors are normalized histograms and thus sum to one: Σ_i h^s_i = Σ_j h^t_j = 1; this constraint implies that the columns of W sum to 1 as well. With these normalizations, the linear programming constraints associated with the EMD between Ĥ_m and (HW)_m (eq. 2.5) become

f_m(i, j) ≥ 0,
Σ_j f_m(i, j) = Ĥ(i, m),    (2.7)
Σ_i f_m(i, j) = Σ_k H(j, k) W(k, m).

Note that the constraint Σ_{i,j} f_m(i, j) = 1 is satisfied automatically, since Σ_{i,j} f_m(i, j) = Σ_i Ĥ(i, m) = 1. Note also that if we know H, both f_m(i, j) and the matrix W minimizing the objective may be found as

arg min_{f,W} Σ_m Σ_{i,j} f_m(i, j) d(i, j)   s.t. (2.7).    (2.8)

Analogously, if we know W, we can find both f_m(i, j) and the matrix H minimizing it as

arg min_{f,H} Σ_m Σ_{i,j} f_m(i, j) d(i, j)   s.t. (2.7).    (2.9)

Thus, given some initial guess for H or W, we can improve the solution by the two-phase Algorithm 3. For columns representing feature distributions, this algorithm finds a set of basic distributions (H) and the mixing weights (W) that construct the samples in Ĥ from this set. For the spatial domain we factorize Ĥ^T.
This way we find a set of basic spatial distributions (rows of W) and the mixing weights (H) that construct the samples in Ĥ^T from this set.

2.3 Convergence

Theorem 1.1. Algorithm 3 converges to a local minimum.

Algorithm 3 EMD NMF
Input: the objective matrix Ĥ ∈ R^{N×M} and an initial guess for the basis H^0 ∈ R^{N×K}.
1: Find W^0 using (2.8).
2: k = 0
3: repeat
4:   k = k + 1
5:   Find H^k using (2.9).
6:   Find W^k using (2.8).
7: until ε > ‖Ĥ − H^{k−1}W^{k−1}‖_EMD − ‖Ĥ − H^k W^k‖_EMD
Output: W^k and H^k.

Proof.
1. Feasibility: First note that Algorithm 3 is a sequence of LP processes, and we should show that a feasible solution exists for each of them. The minimization (2.8) receives a pair Ĥ, H^k of normalized matrices. Any normalized matrix W^k then ensures that Σ_i Ĥ(i, m) = Σ_j (H^k W^k)(j, m), which implies that a feasible solution exists: EMD is a transportation problem, which has a feasible solution when Σ_i h^s_i = Σ_j h^t_j [33]. An identical argument shows the existence of a feasible solution for the minimization (2.9).
2. Linear programming, by definition, minimizes the flow cost and, due to (2.6), minimizes ‖Ĥ − HW‖_EMD. Thus, applying (2.9) finds a globally optimal H^k for a given W^{k−1}, and applying (2.8) finds a globally optimal W^k for a given H^k.
3. Since the objective in (2.9) and (2.8) is the same, ‖Ĥ − H^k W^{k−1}‖_EMD ≤ ‖Ĥ − H^{k−1} W^{k−1}‖_EMD and ‖Ĥ − H^k W^k‖_EMD ≤ ‖Ĥ − H^k W^{k−1}‖_EMD.
4. From the above it follows that every cycle of Algorithm 3 monotonically decreases the distance ‖Ĥ − H^k W^k‖_EMD. This distance is bounded from below, and therefore the algorithm converges (to a local minimum).

2.4 Bilateral EMD NMF

Algorithm 3 minimizes the EMD between the corresponding columns of a given matrix and the matrix product approximating it. Note, however, that in the general case specified by eq. (2.2), our goal is to minimize the EMD both between the corresponding columns and between the corresponding rows. W.l.o.g. we shall denote the columns as feature distributions and the rows as spatial distributions, as we did in section 1.
The proposed bilateral NMF is a mathematically similar extension of Algorithm 3: while Algorithm 3 considers only the feature domain and regards the spatial histogram errors as independent, we now add minimization of the EMD in the spatial domain to the objective. Thus the

bilateral EMD distance is

BEMD(Ĥ, HW) = λ_1 Σ_{m=1}^{M} EMD(ĥ_m, Hw_m) + λ_2 Σ_{f=1}^{F} EMD(Ĥ^T_f, W^T H^T_f).    (2.10)

Both EMD terms depend, of course, on the ground distance metric [56]; see the detailed specification below. To minimize the proposed distance, we extend the EMD NMF technique of alternating convex minimizations. Thus, analogously to Algorithm 3, each step of the proposed minimization is a linear programming task, and a sequence of such tasks reaches a local minimum and provides estimates for H and W.

The EMD between one column Ĥ_m and Hw_m is

min_{f_m} Σ_{i,j} f_m(i, j) d_f(i, j)   s.t. (2.7),    (2.11)

where f_m is the flow between the feature histogram bins that we want to minimize, and d_f(i, j) is a ground distance measuring the cost of moving between the bins. In the new distance we need to minimize the flow f_m between feature histogram bins while also minimizing the flow f_s between spatial histogram bins. Thus, the new cost function is

min_{f_m, f_s} Σ_{m,i,j} f_m(i, j) d_f(i, j) + Σ_{i,u,v} f^i_s(u, v) d_x(u, v),    (2.12)

subject to the constraints (2.7) and the additional constraints on the spatial flow for the i-th rows of Ĥ and HW:

f^i_s(u, v) ≥ 0,
Σ_v f^i_s(u, v) ≤ Ĥ(i, u),
Σ_u f^i_s(u, v) ≤ Σ_k H(i, k) W(k, v),    (2.13)
Σ_{u,v} f^i_s(u, v) = min(Σ_u Ĥ(i, u), Σ_{k,v} H(i, k) W(k, v)).

The ground distance d_x(u, v) measures the cost of moving between the spatial bins u and v. The alternating steps are:

W step: minimize (2.12) over f_m, f_s, and W, subject to (2.7) and (2.13).
H step: minimize (2.12) over f_m, f_s, and H, subject to (2.7) and (2.13).

Note that the two sets of constraints, (2.7) and (2.13), are not of the same form. The first specifies equality constraints and thus requires the total flow Σ_{i,j} f_m(i, j) to equal one. This is necessary to ensure that the columns of the solution matrices H and W still sum to one. The second constraint set (2.13), on the other hand, cannot be of the equality type, because formally there is no constraint on the sums of the Ĥ and HW rows. In practice, however, the sums of the HW rows are very similar to the sums of the Ĥ rows. We apply here the standard inequality constraints of the EMD [56]. In a sense, this formulation of the problem may be regarded as solving EMD NMF between the columns with an EMD penalty term on the distance between the rows.
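Returning to the single-domain case, the W step (2.8) decomposes into one small LP per column of Ĥ; the full step simply repeats it for every column. The following sketch (ours, with an illustrative variable layout, not the thesis code) solves it for a single column:

```python
import numpy as np
from scipy.optimize import linprog

def w_step_column(h_col, H, D):
    """One W-step of the alternating LP (eq. 2.8) for a single column:
    given the data histogram h_col and a fixed basis H (N x K), find the
    weights w minimizing EMD(h_col, H @ w) with w >= 0, sum(w) = 1.
    LP variables: the N*N flow values followed by the K weights."""
    N, K = H.shape
    c = np.concatenate([D.ravel(), np.zeros(K)])   # only the flow has cost
    A_eq = np.zeros((2 * N + 1, N * N + K))
    b_eq = np.zeros(2 * N + 1)
    for i in range(N):                             # sum_j f(i,j) = h_col[i]
        A_eq[i, i * N:(i + 1) * N] = 1.0
        b_eq[i] = h_col[i]
    for j in range(N):                             # sum_i f(i,j) = (H w)_j
        A_eq[N + j, j:N * N:N] = 1.0
        A_eq[N + j, N * N:] = -H[j]
    A_eq[2 * N, N * N:] = 1.0                      # weights sum to one
    b_eq[2 * N] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x[N * N:], res.fun                  # (w, EMD value)

D = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
H = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])   # two "pure" basis bins
w, cost = w_step_column(np.array([0.5, 0.0, 0.5]), H, D)
```

Here the data histogram is an exact mixture of the two basis columns, so the optimal weights are (0.5, 0.5) and the residual EMD is zero. The `bounds=(0, None)` argument enforces both f ≥ 0 and w ≥ 0.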

3 Efficient EMD NMF algorithms

It is possible to find a local minimum of (2.6) by iterative application of (2.9) and (2.8), starting from some reasonable guess for H. Linear programming is a well-studied problem, and plenty of freeware and commercial solvers are available. However, for (2.9) the dimension of the problem is MN². This means that even for a traditional, relatively small problem of factorizing 100 facial images (each in resolution), the LP optimization problem involves about 6 million variables. This makes even the specification of the problem (construction of the constraint matrix) a challenging task for today's solvers. Most of the variables arise from the need to calculate the flow f_m(i, j) (and possibly f_s(i, j)) in order to estimate the EMD between the histograms. The actual variables of interest are H and W, which are only a small fraction of the variables in both (2.8) and (2.9).

3.1 A gradient based approach

The task of finding H^k and W^k in each step of Algorithm 3 is

H^k = arg min_H Σ_m EMD(Ĥ_m, (HW^{k−1})_m),
W^k = arg min_W Σ_m EMD(Ĥ_m, (H^k W)_m).    (3.1)

For bilateral EMD NMF it is

H^k = arg min_H BEMD(Ĥ, HW^{k−1}),
W^k = arg min_W BEMD(Ĥ, H^k W).    (3.2)

Given both H and W, the error (2.6) can be calculated by solving M (or M + N) independent, relatively small LP problems. We can solve both minimizations in (3.1) or (3.2) with some gradient based optimization over the possible H (or W) values. We are guaranteed to find the globally optimal solution of each minimization because tasks (2.8) and (2.9) are convex. Unfortunately, the complexity of a single precise EMD computation is O(N³ log N); thus, the gradient based approach is expected to be computationally costly as well.
For example, the match distance [71], which is the EMD between 1D histograms with a specific underlying metric, can be calculated as the L_1 distance between the cumulative versions of the histograms. A short survey of other methods for faster EMD calculation may be found in [65, 53]. Shirdhonkar and Jacobs [65] proposed an efficient way to calculate the EMD between two histograms for some common underlying metrics d(i, j). They proved that the result of

optimization (2.4) is approximated very well by

d_WEMD(h^t, h^s) = Σ_λ α_λ |W_λ(h^t − h^s)|,    (3.3)

where W_λ(h^t − h^s) are the wavelet transform coefficients of the n-dimensional difference h^t − h^s for all shifts and scales λ, and α_λ are scale dependent coefficients. Different underlying metrics are obtained by different choices of the scale weightings and wavelet kernels.

Note that we are looking for local minima of some calculated EMD values and not for the EMD values themselves. Empirically, we found that the local minima of EMD and WEMD are generally co-located, and thus the accuracy of the WEMD approximation of the actual EMD is less important for our goal. Using the approximation (3.3) in (3.1) and (3.2) reduces the computational complexity of the EMD evaluation to linear. However, gradient methods naturally require the gradient with respect to the optimization variables. In the case of linear programming, the gradient may be derived from the solution of the dual problem, and is therefore a byproduct of the EMD calculation. For the WEMD, we need to calculate the gradient separately:

∇d_WEMD = Σ_λ α_λ sign(W_λ(h^t − h^s)) ∇W_λ(h^t),    (3.4)

where the explicit expression for the gradient ∇W_λ(h^t), with respect to either W or H, is lengthy but straightforward. The complexity of computing the gradient (3.4) for H is O(N²K). Note, however, that many terms remain constant between the iterations, and a careful incremental calculation of the gradient greatly accelerates the computation.

Note that formally applying the WEMD requires equality constraints in (2.13). This condition is not satisfied, but in practice the sums of the Ĥ rows are similar to those of the HW rows. Thus, we used the WEMD to find the EMD and its gradient for both the columns and the rows of the matrices.
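A minimal sketch of the approximation (3.3) using the PyWavelets package follows; the db2 wavelet and the 2^{−3s/2} scale weighting (the 1D case of 2^{−s(1+n/2)} from [65]) are illustrative assumptions to be tuned, not the thesis's exact settings:

```python
import numpy as np
import pywt  # PyWavelets

def wemd(ht, hs, wavelet="db2"):
    """Wavelet EMD approximation (eq. 3.3): a weighted sum of the absolute
    wavelet detail coefficients of the histogram difference."""
    diff = np.asarray(ht, float) - np.asarray(hs, float)
    coeffs = pywt.wavedec(diff, wavelet)           # [cA_L, cD_L, ..., cD_1]
    L = len(coeffs) - 1
    total = 0.0
    for k, c in enumerate(coeffs[1:], start=1):    # detail bands, coarse to fine
        s = L - k + 1                              # dyadic scale of this band
        total += 2.0 ** (-1.5 * s) * np.abs(c).sum()
    return total

h1 = np.zeros(32); h1[4] = 1.0
h2 = np.zeros(32); h2[20] = 1.0
d = wemd(h1, h2)   # a fast stand-in for EMD(h1, h2)
```

The cost is one wavelet transform per histogram pair, i.e., linear in the number of bins, versus the O(N³ log N) exact LP.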
3.3 The optimization process

We tested two optimization strategies: constrained optimization (H ≥ 0, W ≥ 0) of the distance (3.3), and unconstrained optimization with a high penalty for negative variable values:

arg min_x Σ_m d_WEMD(Ĥ_m, (HW)_m) + Φ(x),    (3.5)

where x is either W or H, according to the relevant iteration, and Φ(x) is a quadratic penalty term for x < 0. The latter, unconstrained, optimization appears to be more precise and faster. Still, EMD NMF iterations are more complex than those of L_2-NMF. Using Matlab on an Intel Core 2 Quad 2.5 GHz processor, one full H iteration for M = 256, N = 32, K = 3 (corresponding to the texture experiment described in section 3) takes around 30 seconds. One full H iteration for M = 200, N = 1024, K = 40 (corresponding to the face recognition experiment described in section 2) may take up to 20 minutes.
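The penalized strategy of (3.5) can be illustrated on a toy quadratic objective standing in for the WEMD term; the penalty weight of 10³ and the target vector are arbitrary illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def penalty(x, rho=1e3):
    """Quadratic penalty Phi(x) for negative entries (eq. 3.5)."""
    neg = np.minimum(x, 0.0)
    return rho * np.dot(neg, neg)

# Stand-in objective: ||x - t||^2 with a target t that has a negative entry.
t = np.array([-0.5, 0.7])
res = minimize(lambda x: np.sum((x - t) ** 2) + penalty(x), x0=np.ones(2))
```

The unconstrained minimizer is pulled back to (approximately) the nonnegative orthant: the first coordinate ends up near zero instead of −0.5, while the unconstrained-feasible second coordinate reaches its target 0.7.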

Chapter 4

Applications

Figure 4.1: Precision/recall performance with fixed and manually chosen parameter sets of the 5 tested algorithms on Berkeley images.

1 A tool for unsupervised online algorithm tuning

Benchmark databases ([47] and, more recently, e.g., [2]) have become very popular in the last decade. Comparing the performance of an algorithm on a large set of images helps to determine its advantages and disadvantages for diverse image types, to learn algorithm parameters for optimal performance [48, 4], and to compare its performance to that of competing algorithms [27]. Naturally, the requirement for ground truth segmentations restricts these works to the off-line mode. In section 1 we show that estimating segmentation performance online allows us both to select the optimal algorithm (out of a set) and to increase the performance of a given algorithm by fitting better parameters for each image.

Segmentation algorithms usually depend on a set of parameters, and their performance is characterized by measuring the average precision and recall grades achieved on an etalon set of images for each parameter set. We denote these grades fixed parameter set precision and recall; the term fixed explicitly refers to the parameters being the same for all images in the set. Performance curves, in which the fixed parameter set precisions are plotted versus the recalls, are often used to illustrate the performance characteristics and to choose the algorithm parameter set with the desired performance specifications; see Figure 4.1. For general evaluation, an algorithm's performance is usually measured with a scalar performance grade. A common choice for such a grade is the maximal F-value, F = 2PR/(P + R), associated with a point on the performance curve. The optimal fixed parameter set is considered to be the one for which the maximal F-value is obtained.
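The F-value selection can be sketched as follows; the (P, R) pairs below are hypothetical numbers for illustration only, although the k values echo those of Figure 4.2:

```python
def f_value(precision, recall):
    """Harmonic mean F = 2PR / (P + R) used to score a performance-curve point."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Pick the parameter set with the best F-value along a performance curve
# (hypothetical (P, R) points for illustration):
curve = {"k=2": (0.9, 0.3), "k=24": (0.7, 0.6), "k=105": (0.4, 0.8)}
best = max(curve, key=lambda k: f_value(*curve[k]))
print(best)   # k=24
```

The harmonic mean rewards balanced precision and recall, so the middle point wins here despite not having the best value of either grade alone.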
It is rather interesting that the curve points summarize a wide distribution of image-specific performances; see Figure 4.2 for the image-specific performances of all database images associated with three curve points from Figure 4.1. What we show here is that online analysis of existing algorithms can improve their performance by allowing an optimal parameter set to be chosen for each image. The single purple point in Figure 4.1 illustrates the possible gain. This point shows the average (over the dataset) precision/recall performance when the optimal algorithm and

Figure 4.2: Precision/recall performance of the N-cut on 100 Berkeley images for k = 2, 24, and 105. The thick points correspond to the averaged precision and recall of the 100 segmentations.

Table 4.1: Distribution of the images in the Berkeley test set according to the better performing algorithm

Algorithm:             N-cut                     Mean shift
Modality:              gray   color   texture    gray   color
# of images (fixed):
# of images (auto):

parameter set choice is manually chosen for each image. Note that because the parameters are now image dependent, the algorithm's performance is described by a single point in this plot (and not by a curve). In all precision-recall plots (Figures 4.1, 5.3, and 5.4), we show the performance of different parameter optimization methods on each image, compared to the performance of the traditional method of optimizing the parameters over an ensemble of images. Additional details are given in section 1.

Segmentation performance when the algorithm (with its optimal parameters) is manually chosen for each image is much better than that of any of the five algorithms with any fixed parameter set, because the traditional method of optimizing the algorithm parameters over an ensemble of images may not work well for outliers. Even the best parameter set might not be optimal for these outliers, which might be segmented better with the same algorithm albeit with a different parameter set; see Figure 5.3. This is where our unsupervised evaluation algorithm comes into play: it can serve as an independent referee, able to automatically estimate the performance of a segmentation algorithm with different parameter sets on a specific image. Choosing better parameters for each image can enhance performance notably. If we are not limited to a specific algorithm, performance can be further improved by using several candidate segmentation algorithms; see the distribution of the Berkeley test images between the algorithms in Table 4.1.
Each algorithm can segment an image with different parameter sets; the proposed evaluation algorithm points out the better performing algorithm and the best parameter set for it. Although each algorithm performs moderately well for

predefined parameter sets, choosing the optimal parameter set for each image boosts their performance close to that of the state-of-the-art segmentation with predefined parameter sets; see [4]. Choosing the optimal algorithm as well improves the performance even further.

The proposed segmentation tuning process is hierarchical: the external part uses the proposed evaluation to select a particular internal algorithm and to tune its parameters. Any common algorithm may be used; here we use five variations of two algorithms: N-cut [64] and mean shift [19]. In section 1 we show that this approach indeed improves the performance of each internal algorithm and of their combination. Note that the NMF is run only once, on a small number of segmentations (we used 10); the F-values for all other segmentations are calculated using the basic histograms computed during this initialization.

2 Face recognition

Face representation is a common test case for NMF algorithms [38, 43, 74]. Traditional NMF algorithms measure the differences between faces with translation-sensitive L_2-related metrics, and thus require good alignment of the facial features. It was shown that when the NMF is forced to prefer spatially limited basis components, these L_2-based algorithms perform better and provide perceptually reasonable parts [43, 34]. Here we show that NMF with the EMD metric yields different, but still perceptually meaningful, components. We found these components to be even more effective for face classification.

2.1 The EMD NMF components

Unlike the L_2 distance, the EMD is not very sensitive to small misalignments, facial expressions, and pose changes. The basis components provided by EMD NMF are facial archetypes, each of which looks like a slightly deformed face. Each facial feature (e.g., the shape of the head, the haircut, or the shape of the nose) associated with some archetype is shared by several people. The face images associated with the same person, in different poses and with different expressions, are usually close (in the EMD sense) to a common facial prototype. This prototype is usually a convex combination of a small number of archetypes. Every face image is thus a combination of a few archetypes with relatively high coefficients (the prototype) and some other archetypes with much lower coefficients.

To better illustrate this structure, we start with a simple image set of 4 faces: two parents, their daughter, and another, male, non-family member (six images of each person; see examples in Figure 4.3). The people in the database share several features: the males have rougher facial features, while the female faces are smoother; the daughter shares facial features with both of her parents, especially with her father.
The 24 images were put into the columns of Ĥ, which was decomposed with EMD NMF with K = 3; the ground distance is the 2D distance between the image pixels. Note that the number of archetypes is smaller than the number of people. The resulting weight diagram is shown in Figure 4.3. The three weights associated with every image may be plotted in 2D because w_1 + w_2 + w_3 = 1: in Figure 4.3 the input faces are plotted as (w_1, w_2) points, and the K = 3 archetypes correspond to the (1, 0), (0, 1), and (0, 0) points. The archetypes and some input images are shown as well. Note the similarity between the father (red circles) and the daughter (black triangles): both are represented mainly by the archetype at (0, 0). However, the father also shares some male facial features with the archetype at (0, 1), while the daughter shares many facial features with her mother's archetype, located at (1, 0). The very noticeable changes in facial appearance caused by pose and expression are represented by small translations in the obtained subspace. Interestingly, the representation of visual objects as a combination of object-like archetypes has been suggested as a plausible model for object recognition in the human visual system [15, 68].

2.2 Face recognition algorithm

To demonstrate the power of EMD NMF, we use a straightforward recognition algorithm based on 1-NN in the coefficient space. Let {(I_j, C_j) | j = 1, ..., L} be the training set (I_j is

Figure 4.3: Facial space for 4 people. The two-dimensional (w_1, w_2) convex subspace is projected onto the triangle with corners at (1, 0), (0, 1), and (0, 0). The corners of the triangle represent the basis facial archetypes obtained by EMD NMF. The inner points show the actual facial images weighted in this basis.

an image, and C_j is the corresponding class label).

Training:
Input: {(I_j, C_j) | j = 1, ..., L}
1: Normalize every image I_j so that ‖I_j‖_1 = 1.
2: Decompose the matrix I (with columns I_j) by EMD NMF: I = HW.
3: Normalize every column w_j so that ‖w_j‖_2 = 1.
Output: H, W

Test:
Input: I_t, H, W.
1: Normalize the test image I_t so that ‖I_t‖_1 = 1.
2: Approximate I_t as a convex combination of the columns of H, with weights w_t = arg min_w EMD(I_t, Hw).
3: Normalize w_t so that ‖w_t‖_2 = 1.
4: Find j* = arg max_j ⟨w_j, w_t⟩.
Output: C_{j*}.

This algorithm was successfully tested on two standard face recognition databases; see section ??.
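The matching step (step 4 of the test phase) reduces to a maximal inner product between L_2-normalized coefficient vectors. A sketch with hypothetical 3-archetype coefficients for 4 training images of 2 people:

```python
import numpy as np

def classify(w_test, W, labels):
    """1-NN by inner product of L2-normalized EMD NMF coefficient vectors."""
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # normalize columns
    w_test = w_test / np.linalg.norm(w_test)
    j = int(np.argmax(W.T @ w_test))                   # nearest training face
    return labels[j]

# Hypothetical coefficient columns (archetype weights) for 4 training images:
W = np.array([[0.9, 0.8, 0.1, 0.2],
              [0.1, 0.1, 0.7, 0.6],
              [0.0, 0.1, 0.2, 0.2]])
label = classify(np.array([0.85, 0.1, 0.05]), W, ["A", "A", "B", "B"])
print(label)   # A
```

The only EMD computation at test time is the projection in step 2; the classification itself is a cheap dot-product comparison.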

Figure 4.4: Examples of texture mosaics. The mosaic borders change randomly, resulting in random combinations of the textures in the sample rectangles. Here, the images contain 3, 4, 6, and 7 textures. Note the high local variability of the textures.

3 Texture modeling

A texture mosaic is an image containing several types of textures in a random arrangement; see the examples from [50] in Figure 4.4. We consider the task of estimating the texture descriptors associated with each texture class of the mosaic. We would also like to classify the textures in each mosaic location, at least roughly (e.g., for subsequent segmentation). To that end, we consider the texture in nonoverlapping square image patches (blocks). The texture in each block is a positive mixture of the basic textures; therefore, NMF suggests itself as an analysis tool.

The textures in the database [50] exhibit a lot of spatial variation. Even for relatively large blocks, the average texture descriptor in a block differs greatly from the average descriptor of the whole texture patch. Nor are the mosaics large enough to render descriptor distribution methods (e.g., [40]) effective. The EMD metric compensates for the variability of the texture descriptor within the same texture better than L_2 does [56, 13]. Therefore, EMD NMF is expected to be more accurate than L_2-NMF in estimating the texture descriptors and their mixing coefficients.

We rephrase the image model from section 1 as follows. Let each texture class be associated with some vector descriptor h^true_k in each location of this texture. Then the K descriptors associated with a mosaic image are H^true = (h^true_1, ..., h^true_K). Ideally, the mean texture descriptor in the j-th image block should be h_j = H^true w^true_j, where w^true_j is the vector of the true fractions of the j-th block area associated with each texture class. We applied NMF to the texture mosaics by:

1.
Converting the image to some feature vector representation. Following the findings in [56], we chose to work with Gabor features; thus, each location is represented by a 6-orientation, 5-scale feature vector of Gabor responses [62]. Although the texture descriptors are organized in matrix columns, we again consider a 2D ground distance, here in the scale-orientation space.

2. Dividing the image into M nonoverlapping rectangular blocks and calculating the mean feature vector h_j for each block. We denote the matrix of all sampled mean block descriptors Ĥ = (h_1 ... h_M).

3. Finding the factorization Ĥ ≈ HW. In this case only the domain of the texture descriptors fits the EMD noise model; thus, we use the single domain EMD NMF version. The results of the factorization are the approximated representative texture descriptors H = (h_1 ... h_K) and the approximated fraction of each texture in each block, W = (w_1 ... w_M).

In section 3 we show that the results obtained with EMD NMF are more accurate and more robust than those obtained with L_2-NMF.
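Step 2 of this pipeline can be sketched as follows; random values stand in for the per-pixel Gabor responses, whose computation is omitted, and the block size is an illustrative choice:

```python
import numpy as np

def block_descriptors(feat, block=16):
    """Mean feature vector per non-overlapping block (step 2 of the pipeline).
    feat: (rows, cols, d) per-pixel feature responses (e.g. Gabor magnitudes)."""
    r, c, d = feat.shape
    cols = []
    for y in range(0, r - block + 1, block):
        for x in range(0, c - block + 1, block):
            patch = feat[y:y + block, x:x + block]
            cols.append(patch.reshape(-1, d).mean(axis=0))
    return np.stack(cols, axis=1)    # (d, M): the columns are the h_j

# 6 orientations x 5 scales = 30-dimensional stand-in features:
feat = np.random.default_rng(2).random((64, 64, 30))
Hhat = block_descriptors(feat)
print(Hhat.shape)   # (30, 16)
```

The resulting matrix Ĥ is then passed to the single domain EMD NMF of step 3.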

4 NMF and image segmentation

4.1 A naive NMF based segmentation algorithm

The NMF may be applied to image segmentation. We start by describing a preliminary, naive NMF based segmentation procedure and then continue developing it to achieve better results. Suppose that we use the NMF procedure to obtain an H and W associated with relatively small tiles R_m covering the image. W gives us a rough localization of the segments at the same resolution as the tiles; see Figure 4.5, top line. To obtain a refined, pixel-resolution segmentation, we use the following Bayesian consideration: the fraction w_{k,m} is the fraction of pixels coming from class k in the tile R_m, and may be regarded as the prior probability that a pixel in R_m belongs to class k. We propose to decide, for every pixel, to which class it belongs by means of a maximum a-posteriori decision. Suppose the image is scalar and F(x) is the value at pixel x. Let H_{k,f} be the value of the bin associated with the feature value f in the histogram of class k. Then:

C(x) = \arg\max_k P(c_k | f = F(x)) = \arg\max_k  w_{k,m} H_{k,F(x)} / \sum_{k'=1}^{K} w_{k',m} H_{k',F(x)}.   (4.1)

The preliminary NMF-based segmentation algorithm is: 1. Tile the image with M regions. 2. Compute H̃ for these regions. 3. Factorize H̃ with NMF and obtain H and W. 4. Compute C(x) for each image pixel using eq. (4.1). For computational simplicity we use square tiles. Unfortunately, this algorithm does not work well for real images. Even though the EMD NMF succeeds in finding reasonable approximations of the H and W matrices, as shown in section 3, the inaccuracies in the obtained W estimates cause frequent errors in the Bayesian assignment (4.1). We now propose several improvements which bias the bilateral EMD NMF toward even more accurate W estimation, and a correspondingly better image segmentation algorithm.

4.2 Spatial smoothing

Recall that, ideally, the spatial basis histograms W^T are piecewise constant.
To use this information, we propose to implement the NMF under the BEMD distance with a preference for minimizing the total variation [57]:

[Ĥ, Ŵ] = \arg\min_{H,W} BEMD(H̃, HW) + λ TV(W),   (4.2)

where

TV(W) = \sum_{k=1}^{K} \sum_{m=1}^{M} |d_x W_{m,k}| + |d_y W_{m,k}|.   (4.3)

d_x W_{m,k} (respectively d_y W_{m,k}) is the difference between the spatial histogram value W_{m,k} and the value associated with the following x (y) coordinate on the image plane. In the new distance we need to minimize z_x and z_y, the differences between neighboring W entries, in addition to minimizing the flows f^m and f^s between the feature and spatial histogram bins. Thus, the new cost function is:

\min_{f^m, f^s, z}  \sum_m \sum_{i,j} f^m(i,j) d_f(i,j) + \sum_s \sum_{x,y} f^s(x,y) d_s(x,y) + \sum_{m,k} ( z_x(m,k) + z_y(m,k) ),   (4.4)

subject to the constraints (2.7), (2.13), and the additional constraints on the spatial changes of W (similar for the x and y directions):

z_x(m,k) ≥ 0,   −z_x(m,k) ≤ d_x W_{m,k} ≤ z_x(m,k).   (4.5)

The ground distance d_s(u,v) measures the cost of moving between the spatial bins u and v. The alternating steps become:

W step: minimize (4.4) over f^m, f^s, z, and W subject to (2.7), (2.13), and (4.5).
H step: minimize (4.4) over f^m, f^s, z, and H subject to (2.7), (2.13), and (4.5).

In practice, we use WEMD based optimization to solve each step, analogously to what is described in section 3.2.

4.3 Multiscale factorization

The preferred solution for W is piecewise constant. Thus, we can save a lot of computational effort by working with W at lower resolution during most of the factorization process. Moreover, the feature histogram estimation is more precise when applied to larger regions; see, e.g., section 3. To exploit this twofold advantage we worked with a hierarchical, or multiscale, BEMD NMF solver. First, the image is divided into large tiles and a small H̃ matrix is built. This matrix is factorized quickly, and a rough W along with a precise H are estimated. Then, a new H̃ associated with smaller tiles is constructed and factorized with BEMD NMF; the latter factorization is initialized with the estimated H. This process may be continued to finer resolutions; however, at the finer scales the complexity grows and the model becomes less accurate.
Therefore, we usually applied the factorization with 3-4 scales; see Figure 4.5.

4.4 Boundary aware factorization

We refer to a boundary as a special, one pixel wide segment in which each pixel has at least a pair of neighboring pixels belonging to different object classes.

Figure 4.5: W estimates by multiscale BEMD. The results are for a three-class factorization. The rightmost image for every scale shows the boundary class.

Because of its small size and high variability, the boundary is not modeled as a standard row of W. In each W step the factorization algorithm assigns a small part α (2.5% in our implementation) of each non-single-class region to the boundary segment; see Figure 4.5, the rightmost image for each scale. For a single-class region (i.e., a region with W_{m,k} > 1 − α for some k) the boundary class weight is zero. The boundary class is usually associated with a wide distribution because of the high variation in the boundary feature values. Technically, the boundary class is associated with a column in H, and the H step of BEMD NMF remains the same. Effectively we gain a twofold advantage: the boundary feature histogram collects the feature distribution of the outliers in non-single-class regions, and the class feature histograms become more precise.

4.5 Bilateral EMD NMF segmentation algorithm

The final segmentation algorithm (Algorithm 4) enhances the naive algorithm proposed at the beginning of this section with the spatial smoothing term, the hierarchical decomposition, and the boundary extraction. The parameters are: β_max, the number of scales (we used 3 or 4); the tile side length (we used 80 pixels); and K, the manually specified number of classes. Pixelwise Bayesian assignment sometimes creates a salt-and-pepper-like mix between two classes if both classes have similar probability in a region. To avoid this kind of noise, we smoothed the obtained probability maps with several iterations of anisotropic diffusion.
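The final smoothing step can be illustrated with a few Perona-Malik-style diffusion iterations. This is one plausible form of the anisotropic diffusion mentioned above, not the thesis code: the function name and parameter values are ours, and periodic borders are used for brevity. The edge-stopping weight is near 1 across small probability fluctuations (so they are averaged away) and near 0 across large jumps (so class boundaries survive).

```python
import numpy as np

def smooth_probability_map(P, iters=10, kappa=0.25, lam=0.2):
    """Edge-preserving smoothing of a per-class probability map P."""
    P = P.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)  # edge-stopping function
    for _ in range(iters):
        # differences to the four neighbours (periodic border for brevity)
        dN = np.roll(P, 1, axis=0) - P
        dS = np.roll(P, -1, axis=0) - P
        dW = np.roll(P, 1, axis=1) - P
        dE = np.roll(P, -1, axis=1) - P
        # lam * 4 < 1 keeps the explicit update stable
        P += lam * (g(dN) * dN + g(dS) * dS + g(dW) * dW + g(dE) * dE)
    return P
```

Applied to the per-class posterior maps before the MAP decision, this suppresses the salt-and-pepper mixing without blurring the transition between classes.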

Algorithm 4 Bilateral EMD NMF segmentation
Input: I(x, y), K, β_max, the tile side length.
1: Guess an initial Ĥ ∈ R^{n×(K+1)} in a reasonable way. Set the boundary distribution to uniform.
2: for scale β = 1 : β_max do
3: Calculate H̃_β for the tiles at scale β.
4: repeat
5: Find Ŵ_β using the W step.
6: Ŵ_β ← FindBoundary(Ŵ_β); see section 4.4.
7: Find Ĥ using the H step.
8: until convergence
9: end for
10: Find P(c_k | F(x)) with (4.1).
11: Smooth P(c_k | F(x)) and find C(x) with MAP.
Output: Ŵ_β, Ĥ, and C(x, y).
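The pixelwise decision of eq. (4.1), used in the final steps of Algorithm 4, reduces to an argmax over k of w_{k,m} H_{k,F(x)}, since the normalizing denominator does not affect the argmax. A minimal sketch (array names are ours; the feature value of each pixel is assumed to be quantized to a histogram bin index):

```python
import numpy as np

def map_labels(feat_bin, tile_idx, H, W):
    """Pixelwise MAP assignment of eq. (4.1).
    feat_bin: (rows, cols) int, feature-histogram bin of each pixel
    tile_idx: (rows, cols) int, index m of the tile containing each pixel
    H:        (n_bins, K) class feature histograms
    W:        (K, M) per-tile class fractions, used as priors"""
    # posterior (up to normalisation): prior w[k, m] times likelihood H[f, k]
    post = H[feat_bin, :] * np.moveaxis(W[:, tile_idx], 0, -1)  # (rows, cols, K)
    return post.argmax(axis=-1)
```

Note how the tile prior can override the likelihood: a pixel whose feature mildly favors class 1 is still labeled class 0 when its tile is almost entirely class 0.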

Chapter 5

Experiments

1 Evaluation experiments

The proposed evaluation method was experimentally validated as follows. We segmented the images in the Berkeley dataset using 5 segmentation algorithms, with dozens of parameter sets for each. We then compared the precision and recall grades obtained by the manual procedure described in [48] with the automatic grades of the proposed Algorithm 2. We also used the automatic estimates obtained in this experiment to choose the best algorithm and parameter set pair for each image. We used two popular segmentation tools: normalized cut [64] and mean shift [19]. For both we used the code published by the authors (for normalized cut we used a faster multiscale code version [20]). We tested mean shift in grayscale and color modalities and normalized cut in grayscale, color, and texture modalities. The proposed factorization, Algorithm 3, and the normalized cut algorithm require image edge strength maps as input. For multidimensional features (color and texture), edge detection is not as simple an operation as it is for the grayscale feature. Here we used the edge detection algorithm described in [62]. We expect, however, that other edge detection operators would give results similar to those reported here.

1.1 The accuracy of unsupervised estimates

The manual markings supplied as part of the Berkeley database serve as ground truth and are considered below as true. Note, however, that different people segmented the same images differently, so the ground truth is not unique. Thus, before comparing the automatic estimates to the supervised quality assessments, we ascertain the intrinsic limitation of the supervised assessment method itself. The Berkeley benchmark tool contains two types of manual segmentations: the human operators saw either a color or a grayscale version of each image. The resulting segmentations are cataloged according to the observed modality.
We adopt here the supervised quality assessment procedure as described in [48]. The two observed modalities correspond to two different assessment methods, and we compare our automatic estimates to the grades given by both. The difference between the two manual methods is considered an intrinsic accuracy limitation. The distribution of manual assessment inconsistencies is shown in Figure 5.2(a). We found that the precision of a segmentation assessed according to the graylevel based ground truth is systematically lower by 4% than that assessed according to the color based ground truth. Analogously, the precision assessed automatically is systematically 8.5% higher than that assessed against the color based ground truth. The recall assessments are unbiased for the three methods. We normalized the grayscale and the automatic precision grades by the factors of 1.04 and accordingly for inter-modality comparisons. Note that this normalization is not needed if one uses a single-modality estimation, e.g., for algorithm comparison, as is done in common precision estimations, e.g., [48]. The distributions of the differences between the automatic estimates and the manual estimates in both modalities are shown in Figure 5.2(b) and (c); see details below. The automatic estimation of the recall is almost as good as the manual one. The differences between the manual and the automatic estimations of the precision are slightly greater

Figure 5.1: A comparison of unsupervised precision (blue) and recall (green) estimates with the supervised evaluations made with two types of ground truth. Each image was segmented 48 times by the N-cut algorithm. The examples include the best and the worst consistencies between the three evaluation methods. Note that poor segmentation quality, e.g., the bird image in the second line, does not stand in the way of a good quality assessment. On the other hand, good but too consistent segmentations, e.g., the image of the boys in the sixth line, might result in a worse assessment.

Figure 5.2: Inconsistency distributions for different measurement methods: (a) human color vs. gray; (b) automatic vs. human gray; (c) automatic vs. human color; (d) automatic vs. human. The color and automatic precision assessments were normalized to be unbiased relative to the graylevel based assessments; see text.

than between the two manual estimations. This result is expected: the boundary is the smallest and most highly varying segment in the image. Thus, adding a small amount of non-boundary data to the estimated boundary distribution yields significant differences in the precision estimation, because there are many miscounted non-boundary pixels. On the other hand, adding a small amount of boundary data to the estimated non-boundary distributions does not change the recall much, because the few miscounted boundary pixels are insignificant relative to the non-boundary area. Note also that the manual evaluations are partially based on semantic image analysis, and yet, surprisingly, the precision of the proposed automatic evaluation is very high. The accuracy was tested as follows: most of the experiments were performed on the Berkeley test database. To estimate the histogram basis H, we segmented each color image by mean shift with 10 different parameter sets. We found empirically that this is the best way to quickly obtain segmentations with the most diverse precision and recall values. In all successive experiments we used the established H bases and estimated W in a single W-iteration. The histograms shown in Figure 5.2 refer to 4800 N-cut segmentations of 100 images from the Berkeley test database.
Each image was segmented 16 times in each of the three modalities using a different number of expected cuts (we used k = 2, 4, 8, 12, 16, 20, 24, 29, 34, 40, 50, 60, 70, 80, 90, and 105). Note that the segmentation method is not important for this test; similar results were obtained in [60] using mean shift segmentations. Typical precision/recall correspondences for different images are shown in Figure 5.1. Note that the automatic recall estimates are always very similar to the supervised ones. The precision estimates may be less consistent with human opinion for some difficult images, for which the differences in supervised precision are also higher. Even for the worst (over 100 images) inconsistency between the automatic and manual precision assessments (the boys image in Figure 5.1), the automatic precision estimate is still very similar to the supervised ones. Note that the unsupervised estimate is usually monotonic in the true one. An interesting observation is that the input segmentation quality does not influence the automatic estimation performance; see, e.g., the bird in Figure 5.1. We found that the distribution of the differences between the automatic and the manual estimations is correlated with the inconsistencies in human judgment (Fig. 5.2(d)). Large precision or recall differences between the automatic and the manual estimations are as rare

as between two different manual estimations. Numerically, the variance of the precision (recall) inconsistencies between the two supervised estimations is 0.01 (0.012), and between the supervised (color) and the algorithmic estimations it is 0.03 (0.018). The variance of the inconsistency difference is 0.02, much smaller than the sum of the individual variances; i.e., the difference is far from being random, and the difference magnitudes are correlated for the two measurements. For recall the situation is similar.

1.2 Application: image-specific algorithm optimization

We now test the power of the proposed evaluation method for unsupervised image-specific performance optimization. We used 100 images from the Berkeley test database. Each image was segmented 16 times by normalized cut in graylevel, color, and texture modalities and 75 times by the mean shift algorithm in graylevel and color modalities. For each image segmentation we estimated the precision, the recall, and the F-value in the proposed unsupervised way. Then, for each image we selected the 5 segmentations associated with the highest automatic F-values, one for each algorithm. Among those we also chose the segmentation with the highest F-value, which represents the optimal algorithm and segmentation for this image; see Figure 5.3.

Figure 5.3: F-value differences between the segmentations in a set (the parameter sets and the automatically chosen segmentations) and the best segmentation of the same image with the same algorithm and the optimal, image-specific, segmentation parameters. The width of the marker is proportional to the average F-value difference for the set. Note that the automatic estimations are better and closer to the optimum than any fixed parameter set for all algorithms. The top-right automatic estimation (black) is associated with choosing the best segmentation from the 5 algorithms.

Similarly to [48] and [27], we tested the
average performance on the dataset, although, as shown in Figure 4.2, this performance is very different for different images. As expected, we found that choosing the best parameter set for an algorithm increases its performance; see Figure 5.3 for algorithm-specific graphs. We also estimated, in a supervised way, for each algorithm and for their union, the best possible segmentation. We checked, for each image, the F-value difference from the chosen

Table 5.1: The average F values for the segmentation algorithms: fixed parameter sets (F_fixed) vs. the automatic per-image choice (F_auto), for N-cut in the gray, color, and texture modalities, mean shift in the gray and color modalities, and all algorithms combined.

segmentation to the optimal one and found that the variability of these differences for the chosen segmentations is smaller than for the predefined parameter sets; see Figure 5.3. One more way to see the improvement due to the online parameter tuning is to compare the average F-values of the best parameter set and of the chosen segmentations in Table 5.1. Note that because the F-value is a nonlinear combination of P and R, the average F differs from the F computed from the average P and average R (as done in [48]). The F values associated with average P and R are larger by 0.02 than those in Table 5.1.

Figure 5.4: The performance for the outlier images, shown for the mean shift color segmentations (P-deficient: 23%; standard: 63%; R-deficient: 10%). The points with circle markers show the performance of the proposed automatic choice algorithm, while the other end of the lines connected to them shows the performance with the optimal set of algorithm parameters. The algorithm's tuning (P-R) curve is plotted for illustration purposes.

The most important improvement proposed by our method is associated with the outlier images. For these images the optimal segmentation is associated with a parameter set significantly different from the algorithm's optimal parameter set. See Figure 5.4 for the performance curves associated with such images. As described in section 1.1, there are natural differences between the different estimation methods, even between two supervised methods. All the precision/recall curves are calculated in the supervised color modality, and even in another supervised modality they are somewhat different.
Thus, it is not surprising that the automatic parameter tuning did not bring the algorithms to the optimum in the color modality. Parameters which are optimal for

the color based estimation are non-optimal for the grayscale based one. Note, however, that even for the largest misses of the online algorithm, the chosen segmentations are reasonable; see Figure 5.5.

Figure 5.5: The largest inconsistencies between the manual and the automatic choice of the best mean shift segmentation from a set. The left segmentation in each pair corresponds to the automatic choice; the right segmentation in each pair corresponds to the best supervised score.

2 Face recognition experiment

We tested the EMD NMF based recognition algorithm on the popular Yale [5] and ORL [58] face databases. We follow the experimental procedure of [74], so that we can relate our results to those reported there for the ORL database. Accordingly, the face images are downsampled so that their longer side is 32 pixels. Moreover, as observed in [74], the recognition performance depends to a small extent on the partition of the database into training and test sets. Following [74] and the approaches cited there, we provide the best results obtained over several training/test partitions. In contrast to [74], we did not tightly align the faces by forcing the eye positions to coincide. Both databases contained images that were only roughly aligned. We did not touch the ORL database, and in the Yale database we only centered the faces. This was necessary to avoid a situation in which facial position plays too great a role in identification.

Figure 5.6: The Yale faces database. The database contains images of 15 people, and we considered 8 images for each person. The first two rows show examples of the database images. The last row shows the basis images obtained with EMD NMF.

The Yale face database contains fewer people than ORL but is more challenging for recognition. We used a subset of it containing a set of images corresponding to the same lighting direction. Even with this restriction, the recognition task is not easy due to the high variability of expressions and the possible presence of glasses. This implies that even for the best partition of the database into training and test sets, the test faces always differ considerably from their closest training examples. Four images were used to represent every person in the training set. A relatively high recognition rate of 86.6% was achieved using only 6 basis archetypes (representing 15 people).
The archetypes obtained in this test are shown in Figure 5.6 together with examples of the faces they represent. Increasing the number of archetypes to 15 (one per person) increased the recognition rate to 95%. All the misses are due to glasses appearing in the test image but not in the corresponding training images. It is interesting to observe that the proposed algorithm does not behave like a nearest neighbor algorithm with the EMD metric. When a representative archetype for each person was computed as the image minimizing the sum of EMD distances over the corresponding training images, and 1-NN (with the EMD metric) was used for recognition, the accuracy was only 73.3%. This advantage of the EMD NMF based algorithm could also be predicted from the weight diagram in Figure 4.3, where, clearly, the father's images are closer to the daughter's mean image than to his own mean image (in weight space) and can be recognized only by the additional components. The ORL database contains images of 40 people and is somewhat easier. As in [74], five images were used to represent every person in the training set. The recognition accuracy

Figure 5.7: A typical recognition error in the ORL database. When the test face image (a) is in a very different pose from those of the same person in the training set, the most similar person in the same pose (b) may be erroneously identified. The second-most similar identifications (c, d) are correct.

naturally changes with the basis size K. For K equal to or larger than the number of classes (people), the EMD NMF algorithm outperforms all the NMF based algorithms considered in [74], which often use much larger bases; see Table 5.2. Even with a much lower basis dimension, the proposed algorithm achieves very high, competitive accuracy. Analyzing the (few) recognition errors, we found that they are associated with poses which differ notably from those in the training set; see Figure 5.7.

Table 5.2: Classification accuracies and the corresponding basis sizes of different algorithms (NMF, LNMF, NGE, PCA, LDA, MFA) on the ORL database, cited from [74].

Table 5.3: Classification accuracy of EMD NMF on the ORL database for different basis sizes.
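The recognition scheme can be sketched as follows. This is our reading, with hypothetical names, and with an L2 multiplicative-update weight fit standing in for the EMD fit used in the thesis: each face is described by its nonnegative weights over the learned archetype basis, and a test face takes the label of the nearest training face in weight space.

```python
import numpy as np

def basis_weights(x, H, iters=300):
    """Nonnegative weights w roughly minimising ||x - Hw|| by
    multiplicative updates; the thesis fits the weights under EMD."""
    w = np.full(H.shape[1], 1.0 / H.shape[1])
    for _ in range(iters):
        w *= (H.T @ x) / (H.T @ (H @ w) + 1e-12)
    return w

def classify(x, H, train_weights, train_labels):
    """Label of the training weight vector nearest to the test weights."""
    w = basis_weights(x, H)
    d = np.linalg.norm(train_weights - w, axis=1)
    return train_labels[int(d.argmin())]
```

Comparing in weight space rather than image space is what distinguishes this from plain 1-NN with EMD: two faces far apart as images can still share the same archetype mixture.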

3 Texture descriptor estimation

We applied the algorithm described in section 3 to 90 online generated mosaics [50]. Each test was repeated for combinations of two parameters: the number of textures in the mosaic (K = 3, ..., 12) and the number of blocks M = 16, 64, 256, 1024 (the number of columns in H̃). The blocks tessellate the image; therefore, M also specifies the block size to be 128×128, 64×64, 32×32, and 16×16 pixels, respectively. In each test the K parameter was set to the number of texture classes in the image. We compared the estimated H and W matrices with the actual matrices H^true and W^true using the following correlation measure:

Q_a(A, A^true) = (1/K) \sum_{i=1}^{K} ⟨a_i, a_i^true⟩ / (‖a_i‖ ‖a_i^true‖).   (3.1)

The estimated Q_h = Q_a(H, H^true) and Q_w = Q_a(W^T, (W^true)^T) values for the different test parameters are shown in Figure 5.8. The columns/rows are assigned to the respective ones in the true matrices by a sequential greedy assignment which maximizes Q_a. Note that as the block size increases, the descriptors H are evaluated over a bigger area and are thus more precise for both metrics. The graphs in Figure 5.8 illuminate two important differences in the behavior of the two metrics. Although they perform comparably when sufficiently many (64) relatively reliable samples (64×64 blocks) are available, EMD NMF outperforms L2-NMF when the number of sample vectors is small or the samples are less reliable. For the EMD metric, the performance of the H reconstruction does not depend on the number of classes, whereas for the L2 metric it decreases with larger K. These findings also support the observation that EMD is more robust when ideal data is not available. In addition to the mean of the column/row correlations (3.1), we also measured their standard deviation. We found that the EMD NMF is generally associated with a much smaller (by 30-50%) standard deviation than the L2-NMF.
The intuitive explanation is that while the L2-NMF estimations of H columns and W rows are either very accurate or very inaccurate, the EMD NMF estimations are generally more stable. Together with the average correlation results, this makes the EMD NMF estimations for both H and W more reliable than those of L2-NMF.
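The measure (3.1), together with the sequential greedy column assignment described above, can be sketched as follows (the function name is ours; both matrices are assumed to have K nonzero columns):

```python
import numpy as np

def q_measure(A, A_true):
    """Mean normalised inner product (3.1) of the columns of A and
    A_true, matched by a sequential greedy assignment maximising it."""
    K = A_true.shape[1]
    # all pairwise normalised correlations: C[i, j] for columns a_i, a_j^true
    An = A / np.linalg.norm(A, axis=0)
    Tn = A_true / np.linalg.norm(A_true, axis=0)
    C = An.T @ Tn
    total = 0.0
    for _ in range(K):
        i, j = np.unravel_index(C.argmax(), C.shape)  # best remaining pair
        total += C[i, j]
        C[i, :] = -np.inf                             # consume row i
        C[:, j] = -np.inf                             # and column j
    return total / K
```

The greedy matching makes the measure invariant to the column permutation ambiguity inherent to NMF: a perfect estimate up to a permutation scores exactly 1.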

Figure 5.8: Texture descriptor estimation accuracy for L2-NMF and EMD NMF. The first row shows the reconstruction quality of the basis descriptors and the second row shows the reconstruction quality of the mixing coefficients. The left column shows the average (over different K) reconstruction quality for the different sizes of the sampling blocks (128×128 down to 16×16), and the right column shows the reconstruction quality as a function of the number of texture classes for two sizes of the sampling blocks.

4 Segmentation

We experimented on two popular image databases: the Berkeley Segmentation Dataset [47] and the Weizmann Segmentation Evaluation Database [2]. Both databases are built on similar ideas and provide tools to benchmark algorithm performance against manual segmentations of the database images. Both test performance in similar terms: an algorithm receives an F-number score for each segmented database image. The evaluation task associated with the F-value score, however, differs between the two databases. The score in the Berkeley database judges the algorithm by its ability to detect all object boundaries specified in the manual segmentations and to avoid boundary detection elsewhere. The evaluation task in the Weizmann database is to specify the main object's pixels in the image as accurately as possible. We performed a similar simple test on both databases. Each pixel was characterized by its gray level value and gradient magnitude as a 2D feature. Each image was segmented with Algorithm 4 into a manually specified number (between 2 and 7) of classes plus the boundary class. In both tests the proposed algorithm showed consistent results; see some examples in Figures 5.9 and 5.10. However, the interpretation of these results is different for the two databases.

Weizmann database. The goal of this benchmark is to detect the main object in the image accurately. The database was purposely designed to contain a variety of images with objects that differ from their surroundings by intensity, texture, or other low level cues, and these cues may vary along the goal object as well as along the background. The images in the database are gray scale. According to [2], the algorithm's best achieved performance on this database is much better than that of N-Cut (F = 0.72) and MeanShift (F = 0.57), and even better than that of some complex multifeature algorithms.
The algorithm succeeded best with images in which the object and the background have different feature descriptions, no matter how complex these descriptions are, and failed mostly on images where the object and background descriptions share a large part of the feature space, especially if these shared features have a large spatial presence; see examples in Figure 5.9.

Berkeley database. The Berkeley test checks an algorithm's performance on boundary detection for color images. The ground truth segmentations include some of the image objects, chosen manually. Algorithm 4 provides for each image point a probability of being a boundary point, and these probability maps were tested by the database benchmark tools. Testing the algorithm on this database reveals both its merits and its deficiencies. While its results (F = 0.55) are worse than those obtained by state-of-the-art learning based algorithms [4], it should be noted that the state-of-the-art results use the color information in the images. Our results are similar to those obtained by N-cut and mean shift on grayscale images. Looking at some examples, it is apparent that the algorithm is able to extract the appearance model but fails to exploit this knowledge to segment the fine details of the object. Stronger features (e.g., texture and color) and a more sophisticated final segmentation stage are needed to exhibit the strength of the proposed algorithm in this test.
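Both benchmarks score with the F-number, the harmonic mean of precision and recall. A minimal sketch, which also illustrates the point made in section 1: because F is nonlinear in P and R, averaging F over images is not the same as computing F from the average P and R (the numbers below are illustrative, not from the experiments).

```python
def f_measure(precision, recall):
    """F = 2PR / (P + R), the score used by both benchmarks."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# averaging F differs from F of the averages:
pairs = [(0.9, 0.1), (0.5, 0.5)]
mean_f = sum(f_measure(p, r) for p, r in pairs) / len(pairs)       # 0.34
p_bar = sum(p for p, _ in pairs) / len(pairs)                      # 0.7
r_bar = sum(r for _, r in pairs) / len(pairs)                      # 0.3
f_of_means = f_measure(p_bar, r_bar)                               # 0.42
```

Here f_of_means exceeds mean_f by 0.08; the thesis reports a gap of 0.02 in its experiments, in the same direction.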

Figure 5.9: Segmentation examples, Weizmann database.

Figure 5.10: Segmentation examples, Berkeley database.


More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Figure 1: Workflow of object-based classification

Figure 1: Workflow of object-based classification Technical Specifications Object Analyst Object Analyst is an add-on package for Geomatica that provides tools for segmentation, classification, and feature extraction. Object Analyst includes an all-in-one

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Invariant shape similarity. Invariant shape similarity. Invariant similarity. Equivalence. Equivalence. Equivalence. Equal SIMILARITY TRANSFORMATION

Invariant shape similarity. Invariant shape similarity. Invariant similarity. Equivalence. Equivalence. Equivalence. Equal SIMILARITY TRANSFORMATION 1 Invariant shape similarity Alexer & Michael Bronstein, 2006-2009 Michael Bronstein, 2010 tosca.cs.technion.ac.il/book 2 Invariant shape similarity 048921 Advanced topics in vision Processing Analysis

More information

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis

More information

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors Texture The most fundamental question is: How can we measure texture, i.e., how can we quantitatively distinguish between different textures? Of course it is not enough to look at the intensity of individual

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Overview of Clustering

Overview of Clustering based on Loïc Cerfs slides (UFMG) April 2017 UCBL LIRIS DM2L Example of applicative problem Student profiles Given the marks received by students for different courses, how to group the students so that

More information

Coarse-to-fine image registration

Coarse-to-fine image registration Today we will look at a few important topics in scale space in computer vision, in particular, coarseto-fine approaches, and the SIFT feature descriptor. I will present only the main ideas here to give

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

An Introduction to Content Based Image Retrieval

An Introduction to Content Based Image Retrieval CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction Computer vision: models, learning and inference Chapter 13 Image preprocessing and feature extraction Preprocessing The goal of pre-processing is to try to reduce unwanted variation in image due to lighting,

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Norbert Schuff VA Medical Center and UCSF

Norbert Schuff VA Medical Center and UCSF Norbert Schuff Medical Center and UCSF Norbert.schuff@ucsf.edu Medical Imaging Informatics N.Schuff Course # 170.03 Slide 1/67 Objective Learn the principle segmentation techniques Understand the role

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Supervised texture detection in images

Supervised texture detection in images Supervised texture detection in images Branislav Mičušík and Allan Hanbury Pattern Recognition and Image Processing Group, Institute of Computer Aided Automation, Vienna University of Technology Favoritenstraße

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Training-Free, Generic Object Detection Using Locally Adaptive Regression Kernels

Training-Free, Generic Object Detection Using Locally Adaptive Regression Kernels Training-Free, Generic Object Detection Using Locally Adaptive Regression Kernels IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIENCE, VOL.32, NO.9, SEPTEMBER 2010 Hae Jong Seo, Student Member,

More information

Segmentation and Grouping

Segmentation and Grouping Segmentation and Grouping How and what do we see? Fundamental Problems ' Focus of attention, or grouping ' What subsets of pixels do we consider as possible objects? ' All connected subsets? ' Representation

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Image Segmentation. Selim Aksoy. Bilkent University

Image Segmentation. Selim Aksoy. Bilkent University Image Segmentation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Examples of grouping in vision [http://poseidon.csd.auth.gr/lab_research/latest/imgs/s peakdepvidindex_img2.jpg]

More information

Image Segmentation. Selim Aksoy. Bilkent University

Image Segmentation. Selim Aksoy. Bilkent University Image Segmentation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Examples of grouping in vision [http://poseidon.csd.auth.gr/lab_research/latest/imgs/s peakdepvidindex_img2.jpg]

More information

CS 534: Computer Vision Segmentation and Perceptual Grouping

CS 534: Computer Vision Segmentation and Perceptual Grouping CS 534: Computer Vision Segmentation and Perceptual Grouping Spring 2005 Ahmed Elgammal Dept of Computer Science CS 534 Segmentation - 1 Where are we? Image Formation Human vision Cameras Geometric Camera

More information

K-Means Clustering Using Localized Histogram Analysis

K-Means Clustering Using Localized Histogram Analysis K-Means Clustering Using Localized Histogram Analysis Michael Bryson University of South Carolina, Department of Computer Science Columbia, SC brysonm@cse.sc.edu Abstract. The first step required for many

More information

Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map

Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Markus Turtinen, Topi Mäenpää, and Matti Pietikäinen Machine Vision Group, P.O.Box 4500, FIN-90014 University

More information

Spectral Clustering and Community Detection in Labeled Graphs

Spectral Clustering and Community Detection in Labeled Graphs Spectral Clustering and Community Detection in Labeled Graphs Brandon Fain, Stavros Sintos, Nisarg Raval Machine Learning (CompSci 571D / STA 561D) December 7, 2015 {btfain, nisarg, ssintos} at cs.duke.edu

More information

Segmentation Computer Vision Spring 2018, Lecture 27

Segmentation Computer Vision Spring 2018, Lecture 27 Segmentation http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 218, Lecture 27 Course announcements Homework 7 is due on Sunday 6 th. - Any questions about homework 7? - How many of you have

More information

Chapter 4. Clustering Core Atoms by Location

Chapter 4. Clustering Core Atoms by Location Chapter 4. Clustering Core Atoms by Location In this chapter, a process for sampling core atoms in space is developed, so that the analytic techniques in section 3C can be applied to local collections

More information

Occluded Facial Expression Tracking

Occluded Facial Expression Tracking Occluded Facial Expression Tracking Hugo Mercier 1, Julien Peyras 2, and Patrice Dalle 1 1 Institut de Recherche en Informatique de Toulouse 118, route de Narbonne, F-31062 Toulouse Cedex 9 2 Dipartimento

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu ECG782: Multidimensional Digital Signal Processing Spring 2014 TTh 14:30-15:45 CBC C313 Lecture 10 Segmentation 14/02/27 http://www.ee.unlv.edu/~b1morris/ecg782/

More information

Shape from Texture: Surface Recovery Through Texture-Element Extraction

Shape from Texture: Surface Recovery Through Texture-Element Extraction Shape from Texture: Surface Recovery Through Texture-Element Extraction Vincent Levesque 1 Abstract Various visual cues are used by humans to recover 3D information from D images. One such cue is the distortion

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

EE 701 ROBOT VISION. Segmentation

EE 701 ROBOT VISION. Segmentation EE 701 ROBOT VISION Regions and Image Segmentation Histogram-based Segmentation Automatic Thresholding K-means Clustering Spatial Coherence Merging and Splitting Graph Theoretic Segmentation Region Growing

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

More information

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400

More information

An ICA based Approach for Complex Color Scene Text Binarization

An ICA based Approach for Complex Color Scene Text Binarization An ICA based Approach for Complex Color Scene Text Binarization Siddharth Kherada IIIT-Hyderabad, India siddharth.kherada@research.iiit.ac.in Anoop M. Namboodiri IIIT-Hyderabad, India anoop@iiit.ac.in

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

Stereo and Epipolar geometry

Stereo and Epipolar geometry Previously Image Primitives (feature points, lines, contours) Today: Stereo and Epipolar geometry How to match primitives between two (multiple) views) Goals: 3D reconstruction, recognition Jana Kosecka

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Saliency Detection in Aerial Imagery

Saliency Detection in Aerial Imagery Saliency Detection in Aerial Imagery using Multi-scale SLIC Segmentation Samir Sahli 1, Daniel A. Lavigne 2 and Yunlong Sheng 1 1- COPL, Image Science group, Laval University, Quebec, Canada 2- Defence

More information

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Berk Gökberk, M.O. İrfanoğlu, Lale Akarun, and Ethem Alpaydın Boğaziçi University, Department of Computer

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS

CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS 130 CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS A mass is defined as a space-occupying lesion seen in more than one projection and it is described by its shapes and margin

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Topology-Preserved Diffusion Distance for Histogram Comparison

Topology-Preserved Diffusion Distance for Histogram Comparison Topology-Preserved Diffusion Distance for Histogram Comparison Wang Yan, Qiqi Wang, Qingshan Liu, Hanqing Lu, and Songde Ma National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy

More information

Diffusion Distance for Histogram Comparison

Diffusion Distance for Histogram Comparison Diffusion Distance for Histogram Comparison Haibin Ling Center for Automation Research, Computer Science Department, University of Maryland College Park, MD, 20770, USA hbling@umiacs.umd.edu Kazunori Okada

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

Content-based Image and Video Retrieval. Image Segmentation

Content-based Image and Video Retrieval. Image Segmentation Content-based Image and Video Retrieval Vorlesung, SS 2011 Image Segmentation 2.5.2011 / 9.5.2011 Image Segmentation One of the key problem in computer vision Identification of homogenous region in the

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking Feature descriptors Alain Pagani Prof. Didier Stricker Computer Vision: Object and People Tracking 1 Overview Previous lectures: Feature extraction Today: Gradiant/edge Points (Kanade-Tomasi + Harris)

More information

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary

More information

A Fast Distance Between Histograms

A Fast Distance Between Histograms Fast Distance Between Histograms Francesc Serratosa 1 and lberto Sanfeliu 2 1 Universitat Rovira I Virgili, Dept. d Enginyeria Informàtica i Matemàtiques, Spain francesc.serratosa@.urv.net 2 Universitat

More information

Lecture 7: Segmentation. Thursday, Sept 20

Lecture 7: Segmentation. Thursday, Sept 20 Lecture 7: Segmentation Thursday, Sept 20 Outline Why segmentation? Gestalt properties, fun illusions and/or revealing examples Clustering Hierarchical K-means Mean Shift Graph-theoretic Normalized cuts

More information

What is Computer Vision?

What is Computer Vision? Perceptual Grouping in Computer Vision Gérard Medioni University of Southern California What is Computer Vision? Computer Vision Attempt to emulate Human Visual System Perceive visual stimuli with cameras

More information

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis Application of fuzzy set theory in image analysis Nataša Sladoje Centre for Image Analysis Our topics for today Crisp vs fuzzy Fuzzy sets and fuzzy membership functions Fuzzy set operators Approximate

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Image retrieval based on region shape similarity

Image retrieval based on region shape similarity Image retrieval based on region shape similarity Cheng Chang Liu Wenyin Hongjiang Zhang Microsoft Research China, 49 Zhichun Road, Beijing 8, China {wyliu, hjzhang}@microsoft.com ABSTRACT This paper presents

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS

CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS 38 CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS 3.1 PRINCIPAL COMPONENT ANALYSIS (PCA) 3.1.1 Introduction In the previous chapter, a brief literature review on conventional

More information

Classification: Feature Vectors

Classification: Feature Vectors Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12

More information

Exploiting a database to predict the in-flight stability of the F-16

Exploiting a database to predict the in-flight stability of the F-16 Exploiting a database to predict the in-flight stability of the F-16 David Amsallem and Julien Cortial December 12, 2008 1 Introduction Among the critical phenomena that have to be taken into account when

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

5. Feature Extraction from Images

5. Feature Extraction from Images 5. Feature Extraction from Images Aim of this Chapter: Learn the Basic Feature Extraction Methods for Images Main features: Color Texture Edges Wie funktioniert ein Mustererkennungssystem Test Data x i

More information