Diploma Thesis

Color Image Segmentation Based on an Iterative Graph Cut Algorithm Using Time-of-Flight Cameras

Markus Franke

Tutor: Dipl.-Inf. Anatol Frick
Institut für Informatik
Christian-Albrechts-Universität zu Kiel
Multimedia Information Processing
Prof. Dr.-Ing. Reinhard Koch

January 16, 2011


Abstract

This work describes an approach to efficient color image segmentation that supports an iterative segmentation algorithm with depth data collected by time-of-flight cameras. The algorithm uses an energy minimization approach to segment an image, taking both color and region information into account. The minimization itself is performed by formulating an equivalent minimum cut problem on an image graph, which can then be solved efficiently by maximum flow algorithms. The foreground and background color distributions of the images subject to segmentation are represented by probabilistic color models, which are optimized iteratively by parameter learning. To initialize these models for the purpose of optimization, the algorithm requires an initial classification of pixels. Depth data collected by time-of-flight cameras is used to provide this preliminary pixel classification for the initialization of the color models. This approach automates the initialization step, which otherwise relies on user input. It is shown that the presented segmentation technique can be successfully applied to difficult segmentation problems where foreground and background color distributions are similar. Furthermore, different implementation techniques for the maximum flow algorithms are investigated to optimize the performance of the segmentation.

Keywords: segmentation, graph cut, time-of-flight


Acknowledgments (Danksagung)

At this point I would like to thank everyone who supported me during the creation of this work. My thanks go to Prof. Dr.-Ing. Reinhard Koch, who gave me the opportunity to work on such an interesting topic. I would especially like to thank Dipl.-Inf. Anatol Frick for his first-rate supervision; his many suggestions and his assistance contributed substantially to this work. My thanks also go to René Klein Gunnewiek and Patrick Vandewalle, who supervised me during my internship at Philips Research and always had an open ear for questions. Finally, I would like to thank my family for their years of support throughout my entire studies. In the same vein I thank my girlfriend, without whose cheerfulness and helpfulness many a day would have gone differently.

Declaration of Authorship (Selbstständigkeitserklärung)

I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.

...
Markus Franke


Contents

1 Introduction
   1.1 Motivation and Ambition
   1.2 Structure of this Work
2 Theoretical Background
   2.1 Graphs and Networks
       2.1.1 The Minimum Cut
       2.1.2 Kolmogorov Maximum Flow Algorithm
       2.1.3 Push-Relabel Maximum Flow Algorithm
   2.2 Color Models
       2.2.1 RGB Color Model
       2.2.2 XYZ Color Model
       2.2.3 L*u*v* Color Model
   2.3 Statistics
       2.3.1 Random Variables and Probability Functions
       2.3.2 Mixture Models
       2.3.3 Vector Quantization
   2.4 Binary Morphology
   2.5 Cameras and Perspective Geometry
       2.5.1 Pinhole Camera Model
       2.5.2 Perspective Geometry
       2.5.3 Time-of-Flight Cameras
       2.5.4 Depth Image-Based Rendering
3 Image Segmentation
   3.1 Segmentation by Energy Minimization
   3.2 Graph-Based Color Image Segmentation
       3.2.1 Trimap Generation
   3.3 Color Clustering
       3.3.1 K-means Clustering
       3.3.2 Binary Tree Color Quantization
       3.3.3 Gaussian Mixture Model Initialization
   3.4 Iterative Segmentation
       3.4.1 Assignment of Pixels to GMM Components
       3.4.2 Updating the Gaussian Mixture Models
       3.4.3 Segmentation by Graph Cut
4 Performance Improvements
   4.1 Reusing the Graph
   4.2 Reusing the Flow
   4.3 Improvements for the Kolmogorov Algorithm
       4.3.1 Augmenting Terminal Paths
       4.3.2 Saving Visited Edges
       4.3.3 Distance Heuristic
       4.3.4 Timestamp Heuristic
       4.3.5 Reusing the Search Trees
       4.3.6 Order of Orphan Processing
   4.4 Improvements for the Push-Relabel Algorithm
       4.4.1 Order of Vertex and Edge Processing
       4.4.2 Global Relabeling Heuristic
       4.4.3 Gap Relabeling Heuristic
       4.4.4 Early Termination
5 Experimental Results
   5.1 Comparison of Clustering Algorithms
   5.2 Convergence of the Iterative Minimization
   5.3 Segmentation Parameters
       5.3.1 Influence of Parameter γ
       5.3.2 Influence of Parameter K
   5.4 Segmentation Performance Evaluation
   5.5 Segmentation of Downsampled Images
   5.6 Results for L*u*v* Images
   5.7 Video Segmentation
6 Conclusion and Future Work
A Appendix
   A.1 Singular Value Decomposition

List of Figures

2.1 A s-t network with flow and the corresponding residual network
2.2 A s-t network subject to a minimum cut
2.3 Example of the search trees formed by the Kolmogorov algorithm
2.4 A s-t network with initialized preflow
2.5 The RGB color model
2.6 Gaussian mixture example
2.7 Example of binary erosion and dilation
2.8 The pinhole camera model
2.9 Time-of-flight principle
2.10 Principle of PMD camera
2.11 Overview of image warping
2.12 Disocclusion at image warping
3.1 Ambiguity of segmentation
3.2 Camera setup
3.3 Sample images from camera setup
3.4 Warping result images
3.5 Depth image from combined warping
3.6 Trimap generation
3.7 Binary tree structure of divisive hierarchical clustering
3.8 Orchard-Bouman clustering
3.9 Neighborhood of a pixel
3.10 Overview of the iterative segmentation algorithm
3.11 Example of image graph created for graph cut
4.1 Example of s-t network reparameterization
4.2 Push operations prevented by global relabeling
5.1 Clustering comparison
5.2 Influence of clustering on segmentation
5.3 Input images for clustering performance analysis
5.4 Convergence behavior for test images
5.5 Influence of parameter γ I
5.6 Influence of parameter γ II
5.7 Influence of parameter K
5.8 Input images for performance analysis
5.9 Segmentation results for downsampled images
5.10 Influence of parameter γ on segmentation (L*u*v*)
5.11 Input images for RGB/L*u*v* comparison
5.12 Comparison of RGB and L*u*v* results
5.13 Video segmentation results

List of Algorithms

1 Push(u, v)
2 Relabel(u)
3 Push-Relabel(G, s, t, c)
4 K-means(I, K)
5 Orchard-Bouman-Clustering(I, K, A)
6 Update-t-links(G)
7 Repair-trees(G)
8 Discharge(u)


1 Introduction

The ongoing progress in camera and display technology is but one cause of the rising demand for methods of computer-aided processing of visual information. Systems like 3D television, camera-based driver assistance, or medical imaging all use image processing techniques to extract important image information. The processing done by these systems often includes the partitioning of an image into several different regions, which is called a segmentation of the image. For example, magnetic resonance tomography images of patients are segmented to identify certain organs for diagnosis and treatment. Driver assistance systems, such as collision avoidance, try to identify foreground objects like persons or obstacles in camera images to help prevent accidents. Techniques to postprocess captured 3D video content include segmentation to improve the depth perception experienced by the user of a 3D television system. In this thesis we are concerned with the segmentation of images captured by a color camera.

1.1 Motivation and Ambition

The segmentation of an image into foreground and background regions can be regarded as an extraction of the foreground objects of the depicted scene, which are closer to an observer than the background of the scene. Besides this depth information, a human observer uses other cues to identify foreground objects of a 3D scene, e.g. perspective, shadows, color difference, occlusion, and recognition of known shapes. These cues even enable a human to identify foreground objects in 2D images, where no depth information is available. However, the creation of an algorithm that automatically segments a 2D color image similar to visual perception, without depth information, is a difficult task. Often additional user input is necessary to narrow down the number of feasible segmentation results.

There are several segmentation approaches available; many of them try to assign image pixels to foreground and background based on their color values. However, relying solely on color data is problematic if the colors of the foreground objects are similar or equal to the background colors.

If we could acquire accurate per-pixel depth information for a 2D image, the segmentation task would be trivial. Luckily, the development of cameras that are able to acquire this depth information directly is making progress. However, the measurement principle of these so-called time-of-flight cameras is subject to noise, and thus the generated depth information is not completely reliable. Additionally, the resolution of the produced depth images is limited compared to high definition color images. Thus, to generate per-pixel depth information for corresponding high resolution images, the depth data needs to be upsampled, introducing additional errors due to the large resolution differences.

In [BJ01] an algorithm is presented that uses a graph representation of an image to efficiently perform a segmentation by energy minimization. Foreground and background regions of the image are represented by statistical models based on color data. The pixels are then assigned to the region to which they fit best in terms of color similarity. This technique is further developed in [RKB04] into an iterative optimization algorithm featuring parameter learning. However, to establish said models for foreground and background, a preliminary segmentation of several or all image pixels is necessary to create the statistical models in the first place. This initial segmentation is obtained by user input.

The aim of this thesis is to combine the iterative segmentation algorithm of [RKB04] with depth data collected by time-of-flight cameras. The preliminary segmentation required by the algorithm can be created from this data. As the depth images are not completely reliable at depth discontinuities, the pixels in an uncertainty region around foreground objects will be processed by the segmentation algorithm. This combines the advantages of both techniques, leading to an automatic segmentation approach, as the amount of user input for the iterative optimization is reduced. Emphasis is also placed on a high-performance implementation, which is important if the segmentation is applied to long sequences of images.

1.2 Structure of this Work

After the introduction presented in this chapter, the theoretical background of this thesis is given in Chapter 2. This includes an introduction to basic graph theory, as the segmentation of images is done by using maximum flow algorithms on a graph representation. In Chapter 3 the actual segmentation algorithm is presented, including the modification of the initialization step by integrating depth data from time-of-flight cameras. After that, in Chapter 4, several measures of performance improvement for the maximum flow algorithms are presented. These measures are implemented to decrease the run time of the segmentation algorithm. Experimental results of the segmentation are presented in Chapter 5, including a performance analysis of the maximum flow algorithms. Finally, a conclusion is given in Chapter 6, along with a discussion of possible future work to improve or expand the techniques presented here.

2 Theoretical Background

In this chapter several terms and definitions used in this thesis are presented [CLRS01]. At first a general introduction to graphs and maximum flows in networks is given. Then the equivalence of the maximum flow problem to the minimum cut problem is presented, which enables us to efficiently solve the image segmentation task by calculating the maximum flow in a corresponding network.

2.1 Graphs and Networks

A graph is a mathematical abstraction of a set of objects affected by a certain kind of relation. The objects are represented by a set of vertices, and the relation by a set of edges that connect related vertices. If the relation is not symmetric, the corresponding graph is connected by directed edges:

Definition 1. A directed graph is a tuple G = (V, E) with V being the set of vertices and E ⊆ V × V being the set of edges, where ∀(u, v) ∈ E : u ≠ v.

Suppose we want to model a network of interconnected pipes that route certain amounts of water from a source to a target basin. We define two distinguished vertices: the source vertex s, which supplies water to the network, and the sink vertex t, which represents the target basin. The other vertices represent the connection points of the pipes, which are represented by the edges. As yet we can only describe a binary relation among the vertices, namely whether water flows between them or not. To model the amount of water that each edge can hold at any given time, we define a capacity function c, assigning to each edge a positive capacity value. We can now define a network as follows:

Definition 2. An s-t network is a tuple N = (G, s, t, c) comprising a directed graph G = (V, E), the two distinguished vertices s (source) and t (sink) with s, t ∈ V, and a capacity function c : V × V → R⁺. We require that (u, v) ∉ E implies c(u, v) = 0.

The vertices s and t are often referred to as terminal vertices, whereas the vertices v ∈ V \ {s, t} are called nonterminal vertices. The value representing the amount of water currently inside a pipe is, like the capacity, assigned to each corresponding edge by a real-valued function:

Definition 3. The flow of a graph G is a function f : V × V → R, satisfying for all u, v ∈ V:

    f(u, v) ≤ c(u, v)    (capacity constraint),    (2.1)
    f(u, v) = −f(v, u)   (skew symmetry),          (2.2)

and for all u ∈ V \ {s, t}:

    ∑_{v∈V} f(u, v) = 0    (flow conservation).    (2.3)

Definition 4. The residual capacity of an edge (u, v) ∈ E is given by r(u, v) = c(u, v) − f(u, v). An edge (u, v) ∈ E with r(u, v) = 0 is called saturated. An edge (u, v) ∈ E with r(u, v) > 0 is called a residual edge.

Note that, in contrast to the capacity and residual capacity, the flow on an edge can be negative. In our water network this corresponds to a flow of water in the opposite direction. The capacity constraint states that the flow on an edge can never exceed its capacity. Skew symmetry and flow conservation ensure that no flow is produced by any vertex except the source and none is consumed by any vertex other than the sink.

To maximize efficiency while not changing any capacities in a network, we want to find an optimal capacity utilization. That corresponds to maximizing the amount of flow that passes from the source to the sink:

    |f| = ∑_{v∈V} f(s, v).    (2.4)

|f| is called the flow value, and finding its maximum for a given network is referred to as the maximum flow problem. To keep track of those edges that are not saturated, many algorithms utilize a residual graph, where the saturated edges are removed:

Definition 5. The residual graph of a given network N = (G, s, t, c) with G = (V, E) and a flow f is the induced graph G_f = (V, E_f), where

    E_f = {(u, v) ∈ V × V : r(u, v) > 0}.    (2.5)

The residual graph G_f induces the residual network N_f = (G_f, s, t, c).

Notice that the edge (u, v) in (2.5) is in the set V × V and not necessarily in E. This is because for every edge (u, v) with f(u, v) > 0, the reverse edge (v, u) satisfies r(v, u) = c(v, u) − f(v, u) > 0, as f(v, u) < 0 due to skew symmetry (2.2). Thus, for every edge (u, v) with positive flow, the reverse edge (v, u) is also present in the residual graph, while (u, v) is no longer present if r(u, v) = 0.

Figure 2.1: Left: A s-t network with flow (blue) and capacity (red) displayed for all edges. The value of the flow is 1. Right: The residual network for the s-t network in the left image, displaying the edges' residual capacities.

Also notice that Definition 4 only applies to graphs G = (V, E) with (u, v) ∈ E implying (v, u) ∉ E. However, for our image graph created for segmentation purposes it holds that ∀(u, v) ∈ E : (v, u) ∈ E, meaning that there is always a reverse edge present for every edge. Then the residual capacity of an edge (u, v) has to account for the flow present in the opposite direction, becoming

    r(u, v) = c(u, v) − f(u, v) + f(v, u).    (2.6)

The definitions for residual graphs and residual networks still apply for those graphs if the residual capacity is calculated as given by Equation (2.6).

2.1.1 The Minimum Cut

There are several efficient algorithms available to calculate a maximum flow [GT86, KB01]. In order to use them to solve our image segmentation task, we have to formulate a problem that is equivalent to calculating a maximum flow in a network. For this purpose we introduce a cut, which is a partition of the network vertices [CLRS01]:

Definition 6. A cut of a network N = (G, s, t, c) with G = (V, E) is a partition of V into two disjoint sets C = (S, T). A cut is called an s-t cut if s ∈ S and t ∈ T.

The capacity of an s-t cut is the sum of the capacities of those edges that originate in S and target T:

Definition 7. The capacity of an s-t cut C = (S, T) is given by

    c(S, T) = ∑_{(u,v) ∈ S×T} c(u, v).    (2.7)

The minimum cut problem is to find an s-t cut whose capacity is minimal among all cuts of a network N = (G, s, t, c); such a cut is called a minimum cut. One way to solve the minimum cut problem directly is to check all possible cuts of a network for their capacity and take the minimum. However, the number of possible cuts grows exponentially with the number of vertices, so for networks with thousands or millions of vertices, like those used for image segmentation, this method is obviously very inefficient. There is one important theorem that identifies a connection between the minimum cut and the maximum flow of a network:

Theorem (Max-flow min-cut theorem). Given a network N = (G, s, t, c), the value |f| of a maximum flow f in N is equal to the capacity c(S, T) of a minimum cut C = (S, T) of N.

The theorem is easier to understand if we imagine that the partitions S and T are only connected by a single edge (u, v). Then it becomes obvious that the flow from S to T is limited by the capacity of (u, v), as this edge forms the bottleneck connection between the two partitions. Thus, if we want to maximize the flow, we have to utilize the bottleneck to full capacity. This does not change if there are more edges connecting the two partitions, as those can be seen as a set of bottlenecks, all of which have to be utilized.

Figure 2.2: A s-t network subject to a minimum cut C = (S, T) with S = {s, 1, 2} and T = {t}, denoted by the red dashed line. Notice how the capacity of the cut is equal to the maximum flow value |f| = 3.
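The equality claimed by the theorem can be checked by hand on small networks. The following sketch is for illustration only (the toy network and all names are made up, and it is not part of the thesis implementation): it enumerates every s-t cut of a four-vertex network and reports the one with minimum capacity. By the max-flow min-cut theorem, no flow can exceed this value.

from itertools import combinations

# hypothetical toy network: edge (u, v) -> capacity
cap = {('s', '1'): 2, ('s', '2'): 2, ('1', 't'): 1, ('1', '2'): 1, ('2', 't'): 2}
V = {'s', '1', '2', 't'}

def cut_capacity(S):
    # sum the capacities of edges leaving S, cf. Equation (2.7)
    return sum(c for (u, v), c in cap.items() if u in S and v not in S)

inner = V - {'s', 't'}                 # nonterminal vertices
cuts = [{'s'} | set(sub) for r in range(len(inner) + 1)
                         for sub in combinations(sorted(inner), r)]
best = min(cuts, key=cut_capacity)
print(best, cut_capacity(best))        # the minimum cut here has capacity 3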

The max-flow min-cut theorem allows us to segment an image by formulating an equivalent minimum cut problem, which itself is essentially a segmentation of vertices based on edge weights. We can then calculate the maximum flow of the created network and obtain a solution for our segmentation problem.

The available efficient maximum flow algorithms can be categorized into augmenting-path algorithms and push-relabel algorithms. In the following sections one of either category will be introduced.

2.1.2 Kolmogorov Maximum Flow Algorithm

The algorithm presented in this section was published in 2001 by Yuri Boykov and Vladimir Kolmogorov [KB01] and is referred to as the Kolmogorov algorithm throughout this thesis. Before the actual algorithm is explained, some definitions need to be introduced to understand how it works [CLRS01].

Definition 8. A (finite) path in a graph G = (V, E) is a sequence of vertices p = (v_1, v_2, ..., v_n) where from each vertex there exists an edge to the next vertex in the sequence, and all vertices are distinct: ∀ v_i, v_j ∈ p, i ≠ j : v_i ≠ v_j.

Definition 9. An s-t path in a network N = (G, s, t, c) is a path in G whose first vertex is the source s and whose last vertex is the sink t.

Definition 10. An augmenting path in a residual network N_f = (G_f, s, t, c) is an s-t path (v_1, v_2, ..., v_n) in G where successive vertices are connected by residual edges:

    ∀ v_i, v_{i+1} ∈ (v_1, v_2, ..., v_n) : r(v_i, v_{i+1}) > 0.    (2.8)

The basic augmenting path algorithm was published by Ford and Fulkerson in 1956 [CLRS01]. It works by executing the following steps for a given network N = (G, s, t, c), after the flow f on all edges has been initialized to zero:

1. Find an augmenting path in the residual network N_f = (G_f, s, t, c). If no augmenting path can be found, the algorithm terminates.
2. Find the smallest residual capacity r_min of the edges along the augmenting path.
3. Increase the flow f on all edges (u, v) along the augmenting path by r_min (and update the flow on the respective reverse edges by subtracting r_min to maintain skew symmetry). Go to step 1.

As seen above, the algorithm finishes if no augmenting path can be found in the residual network. That means that no additional flow can be routed from the source to the sink. It can be shown that the resulting flow f is a maximum flow [KB01].
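To make the three steps concrete, here is a minimal, self-contained sketch of the augmenting-path scheme in Python. It is not the Kolmogorov algorithm (each path is found from scratch by breadth-first search instead of reusing search trees), and the dictionary-based network encoding is a made-up choice for this example:

from collections import deque

def max_flow(cap, s, t):
    """cap: dict mapping edge (u, v) to its capacity."""
    for (u, v) in list(cap):            # ensure every edge has a reverse edge
        cap.setdefault((v, u), 0)       # with zero capacity (Definition 2)
    flow = {e: 0 for e in cap}
    adj = {}
    for (u, v) in cap:
        adj.setdefault(u, []).append(v)
    while True:
        # Step 1: find an augmenting path in the residual network by BFS.
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, []):
                if v not in parent and cap[(u, v)] - flow[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:             # no augmenting path left:
            break                       # the flow is maximal
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        # Step 2: smallest residual capacity along the path.
        r_min = min(cap[e] - flow[e] for e in path)
        # Step 3: augment, maintaining skew symmetry on reverse edges.
        for (u, v) in path:
            flow[(u, v)] += r_min
            flow[(v, u)] -= r_min
    return sum(flow[(s, v)] for v in adj[s])   # flow value |f|, Eq. (2.4)

Applied to the toy network from the previous example, max_flow(cap, 's', 't') returns 3, matching the minimum cut capacity found there.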

The practical performance of this kind of algorithm obviously depends on the implementation of the first step, that is, finding the augmenting path. This is achieved by performing a depth-first search or a similar algorithm for graph traversal. The result of those algorithms is a tree:

Definition 11. A tree is a connected graph that has no cycles. A connected graph is a graph containing a path for every pair of vertices. A cycle is a sequence of edges (e_1, ..., e_n) with e_1 = e_n. A disjoint union of trees is called a forest.

Thus, the tree displays which vertices are reachable from a chosen start vertex. Depth-first search, for example, has a worst-case performance of O(|V| + |E|). Since we are dealing with very large graphs for our segmentation problem, where one vertex corresponds to one image pixel, something more efficient is required. The Kolmogorov algorithm features two important aspects that make it efficient and usable for very large graphs. First, it starts its search for an augmenting path not only from the source but also from the sink, forming two different trees instead of one. Secondly, in each iteration it does not build the trees again from scratch, but reuses the trees from the previous iteration.

To calculate the maximum flow of a given network N = (G, s, t, c), at first two trees S and T are created, with s and t being their respective roots. Each vertex has a status that indicates whether it is passive or active. If a vertex has active status, it belongs to one of the trees and not all of its outgoing edges have been checked for the purpose of adding new vertices to its tree. Only vertices connected by residual edges can be added to a tree. If all neighbors of an active vertex have been checked, the vertex becomes passive. As only active vertices are checked to find an augmenting path, this prevents the algorithm from building the search trees from scratch in each growth phase. Only s and t are marked as active at the beginning. The algorithm calculates the maximum flow by repeatedly executing the following three phases:

1. Growth phase. The active vertices try to add additional vertices to their tree by checking residual edges. If all residual edges of an active vertex have been checked, the vertex becomes passive and the algorithm continues with the next active vertex. The phase ends if one tree tries to acquire a vertex from the other tree, as this forms an augmenting path.

2. Augmentation phase. The flow along the found path is augmented by the smallest residual capacity of its edges. Thus, at least one edge becomes saturated. This causes one or both trees to collapse into a forest. The disconnected vertices become orphans.

3. Adoption phase. Try to re-adopt all orphans to their previous search trees. This is done by examining their residual edges to neighboring vertices. If one of those neighbors has an unsaturated path to the orphan's previous search tree root, the orphan (and its subtree) is re-adopted. If no such parent vertex can be found, the orphan will remain unbound, and all its children become orphans, too. Unbound vertices can be reacquired by the trees in subsequent growth phases.

Note that vertices from the source tree S try to add vertices connected by outgoing edges, while vertices from the sink tree T check incoming edges in the growth stage. This is because the augmenting paths should consist of edges originating from S and targeting T in order to route the flow from s to t. If a vertex u adds a vertex v to its search tree, then u is called the parent of v. Similarly, the vertices v acquired by u are called children of u.

Figure 2.3: Example of trees formed in the growth phase of the Kolmogorov algorithm. The white vertices form the source tree and the black vertices form the sink tree. Grey vertices are not yet part of a tree. The border color indicates whether a vertex is active (red) or passive (blue). If vertex u tries to add vertex v to its tree, an augmenting path is found, indicated by the dashed edges.

The algorithm terminates with a maximum flow if no connection between the trees can be found in the growth stage [KB01]. This is the case if no more active vertices are left. There are several implementation techniques that can further improve the performance of the Kolmogorov algorithm. These are explained in detail in Section 4.3.

In the following section the push-relabel algorithm is introduced, which takes a different approach to the solution of a maximum flow problem.

2.1.3 Push-Relabel Maximum Flow Algorithm

The push-relabel algorithm was published by Andrew Goldberg and Robert Tarjan in 1986 [GT86]. In contrast to augmenting path algorithms, the push-relabel method does not rely on finding applicable paths in the network. Instead it works by temporarily restraining a property the flow must otherwise satisfy.

To understand how this works, the concept of excess flow needs to be introduced [CLRS01].

Definition 12. Given a graph G = (V, E) and a flow f, the excess of a vertex v ∈ V is the sum of the flow on all edges targeting v minus the sum of the flow on all edges originating from v:

    e(v) = ∑_{(u,v)∈E} f(u, v) − ∑_{(v,w)∈E} f(v, w).    (2.9)

During execution of the push-relabel algorithm for a given network N, a preflow is maintained, which is a relaxation of a flow: it does satisfy the capacity constraint (2.1) and skew symmetry (2.2), but relaxes flow conservation (2.3) in the following way:

    ∀ u ∈ V \ {s} : ∑_{v∈V} f(v, u) ≥ 0.    (2.10)

In other words, the preflow allows all vertices (except the source) to have an excess of flow. If we compare this to our water pipe network example from Section 2.1, this can be seen as if each vertex represents a reservoir where water can be stored. Now suppose that each of those reservoirs has a variable height, implying that water only flows from higher reservoirs to lower ones. This leads to the definition of a height function [CLRS01].

Definition 13. Given a network N = (G, s, t, c) with G = (V, E) and a preflow f, a function h : V → N is called a height function if

    h(s) = |V|,    (2.11)
    h(t) = 0,      (2.12)
    ∀ (u, v) ∈ E_f : h(u) ≤ h(v) + 1,    (2.13)

where E_f is the set of edges in the residual graph G_f = (V, E_f) (2.5).

To be able to increase the flow on a residual edge (u, v), the push-relabel algorithm requires the heights of u and v to obey condition (2.13). In other words: for the height function to remain valid during execution, a residual edge (u, v) may only be used for an increase of flow if the height of u exceeds the height of v by one at most; flow is actually pushed only where it exceeds it by exactly one.

Definition 14. An edge (u, v) is called an admissible edge if (u, v) is a residual edge and u, v obey h(u) = h(v) + 1.

Before the push-relabel algorithm is started for a given network N = (G, s, t, c) with G = (V, E), the height of every vertex except the source is set to zero, and the height of the source is set to |V|. Note that the heights of the two terminal vertices s and t are never changed during execution of the algorithm, as this would violate conditions (2.11) and (2.12).

The preflow f is initialized by setting the flow on all outgoing edges of the source equal to their capacity, saturating them. The flow on the reverse edges is updated accordingly. That means the algorithm starts with an excess on those vertices adjacent to the source that were connected to it by residual edges.

Figure 2.4: A s-t network with the preflow initialized for the push-relabel algorithm (the height and excess of the vertices are displayed in brown and green, respectively).

The push-relabel algorithm basically consists of two operations executed on a given network, initialized as described in the previous paragraph:

Algorithm 1: Push(u, v)
  if e(u) > 0 and r(u, v) > 0 and h(u) = h(v) + 1 then
      temp := min(e(u), r(u, v));
      f(u, v) += temp;
      f(v, u) −= temp;
      e(u) −= temp;
      e(v) += temp;

The Push operation only applies when the vertex u of the current edge (u, v) has available excess and the edge is admissible. It then pushes as much flow as possible to the target vertex v and updates the excess flow accordingly.

Algorithm 2: Relabel(u)
  if e(u) > 0 and ∀ v ∈ V : (u, v) ∈ E_f implies h(u) ≤ h(v) then
      h(u) := min{h(v) : (u, v) ∈ E_f} + 1;

If a vertex u has excess and outgoing edges with residual capacity, but no adjacent vertex v satisfies h(u) = h(v) + 1, then the height of u is increased by a minimal amount so that (at least) one edge becomes admissible.

With the push and relabel operations defined, a basic version of a push-relabel algorithm works as follows for a given network N = (G, s, t, c) with G = (V, E) and an initialized preflow f:

Algorithm 3: Push-Relabel(G, s, t, c)
  while Push(u, v) or Relabel(u) is applicable for vertices u, v ∈ V do
      execute an applicable Push(u, v) or Relabel(u)

The algorithm terminates if no vertex except the sink has remaining excess. After termination the preflow f is a legal flow, as the previously relaxed flow conservation (2.3) is satisfied again. Additionally, it can be shown that the flow is not only legal but also a maximum flow [GT86]. The final excess of the sink t equals the maximum flow value. The set of vertices S reachable from s in the residual graph G_f forms the source side of the minimum cut.
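The generic loop in Algorithm 3 leaves the order of operations open. The following compact Python sketch, an illustrative FIFO variant and not the thesis implementation, repeatedly discharges vertices with positive excess until no applicable operation remains:

def push_relabel(cap, s, t, V):
    """cap: dict (u, v) -> capacity; V: iterable of all vertices."""
    V = list(V)
    for (u, v) in list(cap):                    # add missing reverse edges
        cap.setdefault((v, u), 0)
    f = {e: 0 for e in cap}
    adj = {u: [] for u in V}
    for (u, v) in cap:
        adj[u].append(v)
    h = {u: 0 for u in V}; h[s] = len(V)        # heights, (2.11) and (2.12)
    e = {u: 0 for u in V}
    for v in adj[s]:                            # saturate all source edges
        f[(s, v)] = cap[(s, v)]; f[(v, s)] = -cap[(s, v)]
        e[v] += cap[(s, v)]; e[s] -= cap[(s, v)]
    active = [v for v in V if v not in (s, t) and e[v] > 0]
    while active:
        u = active.pop(0)                       # FIFO processing order
        while e[u] > 0:
            pushed = False
            for v in adj[u]:
                r = cap[(u, v)] - f[(u, v)]
                if r > 0 and h[u] == h[v] + 1:  # Push(u, v), Algorithm 1
                    d = min(e[u], r)
                    f[(u, v)] += d; f[(v, u)] -= d
                    e[u] -= d; e[v] += d
                    if v not in (s, t) and v not in active:
                        active.append(v)
                    pushed = True
                    if e[u] == 0:
                        break
            if not pushed:                      # Relabel(u), Algorithm 2
                h[u] = min(h[v] for v in adj[u]
                           if cap[(u, v)] - f[(u, v)] > 0) + 1
    return e[t]                                 # maximum flow value

On the toy network used above, push_relabel(cap, 's', 't', 'st12') also returns 3.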

2.2 Color Models

The images that are subject to segmentation can be based on different color models, that is, different ways of numerically representing the pixels' color values. For the segmentation task investigated in this thesis, we are dealing with images employing two different color models: the RGB color model and the L*u*v* color model. For the purpose of conversion between these two models, the XYZ model is also introduced [Pau98].

2.2.1 RGB Color Model

The RGB color model is an additive model with a basis of three primary colors. That means colors other than the primary ones are produced by adding specific amounts of the three primary colors red, green, and blue together. The fact that three colors are used as a basis is derived from the three different receptor types in the human retina.

The RGB colors can be treated as Cartesian coordinates in Euclidean space, denoted by a triple (r, g, b) relative to a white point that defines the white color (1, 1, 1). The specification of this model by the International Commission on Illumination (CIE) provided the foundation for the electronic representation of images, which is why this color model is still the most widely used one in computer graphics today.

One disadvantage of the RGB model is that the perceptual difference of two colors does not equal their distance in Euclidean color space. As the result of image segmentation should ideally be consistent with the perceptual sensation of a human observer, this model is not well suited for segmentation purposes. Another disadvantage is that not all colors can be described by an RGB triple. For the modelling of some colors the triple would have to include negative values. As the triple corresponds to the intensity of human receptor stimulation, those colors would represent a physically impossible, negative stimulus.

Figure 2.5: The RGB color model (the unit cube spanned by the R, G, and B axes, with black at (0,0,0), white at (1,1,1), the primaries red (1,0,0), green (0,1,0), and blue (0,0,1), the mixed colors yellow (1,1,0), cyan (0,1,1), and magenta (1,0,1), and the grey scale along the main diagonal).

2.2.2 XYZ Color Model

At the same time as the introduction of the RGB color space, the CIE derived from it another color space, whose values X, Y, Z are not designed to directly correspond to perceptual color sensations like the RGB primary colors. The Y coordinate of the XYZ color model is closely related to the luminance of a color, while the primary color information is contained in the coordinates X and Z. The chromaticity of a color can be outlined as a plane defined by parameters x and y, given by

    x = X / (X + Y + Z),    y = Y / (X + Y + Z).    (2.14)

The XYZ color model forms the foundation of all CIE color models and thus can be used to convert between RGB and L*u*v* coordinates. This is done by first converting colors from RGB to XYZ, followed by a conversion from XYZ to L*u*v*. The conversion from RGB coordinates to XYZ coordinates can be performed by a linear transformation [Pau98]:

    (X, Y, Z)^T = M (R, G, B)^T,    (2.15)

where M is a 3×3 matrix whose entries are determined by the chosen RGB primaries and reference white point.

2.2.3 L*u*v* Color Model

The L*u*v* color model was introduced by the CIE in 1976. It is similar to the XYZ color model in that the L* channel represents the lightness and the channels u* and v* are chromaticity coordinates. Its advantage over the XYZ and RGB color models is the fact that it is intentionally designed to feature perceptual uniformity. That means that colors with the same Euclidean distance are also approximately perceived as equally distant by a human observer. That makes the L*u*v* color model more suitable for segmentation purposes than the other models, as the Euclidean distances of colors have a stronger relation to perceptual color difference.

The images we are using for our experiments are RGB images. In order to evaluate the segmentation for L*u*v* images, the available RGB images have to be transformed accordingly. This transformation is done in two steps. As already mentioned in Section 2.2.2, the image is first transformed from RGB to XYZ values. After that, the transformation from XYZ to L*u*v* values is done as follows [Pau98]:

    L* = 116 (Y/Y_r)^(1/3) − 16    if Y/Y_r > ε,
    L* = κ (Y/Y_r)                 otherwise,    (2.16)

    u* = 13 L* (4X / (X + 15Y + 3Z) − 4X_r / (X_r + 15Y_r + 3Z_r)),    (2.17)
    v* = 13 L* (9Y / (X + 15Y + 3Z) − 9Y_r / (X_r + 15Y_r + 3Z_r)),    (2.18)

where (X_r, Y_r, Z_r) is the reference white point of the XYZ model and ε = 0.008856 along with κ = 903.3 are constants defined by the CIE standard.
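As an illustration of the two-step conversion, here is a small sketch. The matrix entries assume sRGB primaries with a D65 white point, which is an assumption made only for this example; the thesis images may use a different primary set, and linear (not gamma-encoded) RGB values in [0, 1] are expected.

import numpy as np

# assumption: linear sRGB primaries, D65 white point
M = np.array([[0.4124, 0.3576, 0.1805],
              [0.2126, 0.7152, 0.0722],
              [0.0193, 0.1192, 0.9505]])
WHITE = M @ np.ones(3)                 # XYZ of the reference white (R = G = B = 1)
EPS, KAPPA = 0.008856, 903.3           # CIE constants from Eq. (2.16)

def rgb_to_luv(rgb):
    X, Y, Z = M @ np.asarray(rgb, dtype=float)      # step 1: Eq. (2.15)
    Xr, Yr, Zr = WHITE
    y = Y / Yr
    L = 116.0 * y ** (1.0 / 3.0) - 16.0 if y > EPS else KAPPA * y   # Eq. (2.16)
    d, dr = X + 15*Y + 3*Z, Xr + 15*Yr + 3*Zr
    if d == 0.0:                       # pure black: chromaticity undefined
        return L, 0.0, 0.0
    u = 13.0 * L * (4*X / d - 4*Xr / dr)            # Eq. (2.17)
    v = 13.0 * L * (9*Y / d - 9*Yr / dr)            # Eq. (2.18)
    return L, u, v

print(rgb_to_luv((1.0, 1.0, 1.0)))     # the white point maps to L* = 100, u* = v* = 0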

2.3 Statistics

During segmentation of a color image we want to group image pixels with similar visual properties into the same subgroup. If we have established a model of the foreground and background of an image, we can compare the image pixels to those models. This allows us to decide into which category the pixels belong. The models themselves are represented by probability density functions. Thus, the comparison of pixels to a model based on definite foreground and background pixels essentially equates to the calculation of a probability. In this section several basic terms of statistics and cluster analysis are presented [SS01].

2.3.1 Random Variables and Probability Functions

To be able to define random variables and functions of probabilities, we first have to introduce the probability space as the statistical description of a random experiment.

Definition 15. Let S := (Ω, Σ, P) be a probability space with:
- the sample space Ω, consisting of atomic events,
- the σ-algebra Σ, consisting of subsets of Ω called events,
- the probability measure P : Σ → [0, 1].

The σ-algebra Σ includes the impossible event, denoted by ∅, which implies that none of the atomic events in Ω occur, and the certain event, denoted by Ω, which implies that any of the atomic events in Ω may occur. The probability measure P assigns to each event a probability value between 0 and 1. We can conclude that P(∅) = 0 and P(Ω) = 1.

Often we are not interested in the probability of particular events, but in an outcome of the experiment. Thus, we require an assignment of a value to an outcome. Since we are interested in the success of the experiment, we assign the value 1 to success and the value 0 to failure. To express this mathematically, we introduce a random variable defined on the probability space S as a function X : Ω → R. The function that assigns probabilities to the values of a random variable is called a probability function. X is called discrete if the possible values of X form a countable set. The probability function for a discrete X is then denoted by f_X : R → [0, 1], directly assigning probabilities to the occurrence of single values x of X.

If a random experiment is repeated several times, some values of a corresponding random variable can occur more frequently than others. The probability distribution of a random variable X is a mapping of the different values of X to the probability of their occurrence. The expected value E(X) = ∑_i x_i f_X(x_i) is similar to the arithmetic mean of the possible values x_i of X, but with each value weighted by its probability of occurrence.

The variance of a random variable X with expected value E(X) is given by Var(X) = E[(X − E[X])²]. It describes the amount of variation of the possible values of X, or more precisely, the expected squared deviation of X from its mean value E(X). Given two random variables X_1, X_2, the covariance Cov(X_1, X_2) = E[(X_1 − E[X_1])(X_2 − E[X_2])] is a measure of how strongly the two variables correlate statistically. Thus, if Cov(X_1, X_2) = 0 then the two variables are uncorrelated. Variance can be regarded as a special case of covariance, as Cov(X, X) = Var(X).

Suppose that we want a random variable Y to represent the execution time of a program. As time is a measure that can be arbitrarily accurate, the possible values of Y are uncountable. The execution time can only be specified up to an interval. Thus, the probability that the execution time maps to exactly one specific value is zero. Only the probability that the outcome of the experiment lies within a time interval may be greater than zero. A random variable that maps outcomes of an experiment to an uncountable set, like the set of real numbers, is called continuous. The probability function f_Y for a continuous random variable Y is called a density. In contrast to the discrete case, the values of the density do not directly give a probability. Instead, the probability that the value y of Y falls within a given interval [a, b] is given by the integral of its density f_Y over this interval:

    P[a ≤ Y ≤ b] = ∫_a^b f_Y(y) dy.    (2.19)

Given a density f_Y for a continuous random variable Y, the corresponding distribution F_Y is also called continuous. One important density, used in this thesis for the statistical modelling of image pixels, is the Gaussian density, given by

    f_X(x) = (1 / √(2πσ²)) e^(−(x−µ)² / (2σ²))    (2.20)

for a continuous random variable X with value x, mean µ, and variance σ².

2.3.2 Mixture Models

When applying statistics to image processing, we regard image pixels as a set of data points. They can be seen as observations from a greater population, or in other words, as samples of an unknown probability density function. To get information about the distribution of the data points, we have to construct an estimate of the density that best fits the observed data.

A Gaussian density is often a good choice if the data points are clustered around a central mean, as is often the case if the number of data points is very large.

Sometimes, however, a single Gaussian density is not accurate enough to model the distribution of a set of observations. For example, the pixels of an image tend to be centered around several mean color values. By selecting a Gaussian with a single mean we would thus lose information and only have an imprecise model of the real distribution. A solution to this problem is to choose a model that is built upon a combination of several densities. In the case of Gaussian densities, each of them has its own mean and variance values. Additionally, each single Gaussian has a weight factor that represents the relative significance of this density, corresponding to the fraction of data points assigned to this function. This enables us to create one parametric density for each of the clusters formed by the data.

Figure 2.6: Gaussian mixture example. Left: Three 1D Gaussian densities with different means and variances. Right: Convex combination of the Gaussians with the respective weights ω_1, ω_2, ω_3 denoted in the left image.

Definition 16. A mixture model is a probabilistic model for density estimation using a mixture distribution, which is a distribution based on a convex combination of K probability density functions.

If a random variable X models, for example, the possible pixel values of an RGB image, we have to define X as a 3-dimensional random vector X ∈ R^3 to represent each color channel. This leads to a generalization of Gaussian densities to higher dimensions [SS01].

Definition 17. A multivariate Gaussian distribution for a random vector X ∈ R^d is characterized by the density

    f(x | µ, Σ) = (1 / √((2π)^d det Σ)) e^(−½ (x−µ)^T Σ^(−1) (x−µ))    (2.21)

with mean vector µ and covariance matrix Σ. The covariance matrix Σ = [σ_ij] is a symmetric (Σ = Σ^T), positive-semidefinite matrix which can be regarded as a generalization of the variance to dimensions d > 1. Each entry σ_ij gives the covariance of the random variables x_i, x_j ∈ X, while the main diagonal entries σ_ii denote the variance of the variable x_i. A convex combination of the above Gaussian densities leads to the definition of a Gaussian mixture model:

Definition 18. A Gaussian mixture model (GMM) with K components is a model using a weighted sum of K Gaussian densities f for a d-dimensional random variable X ∈ R^d, components k ∈ {1, ..., K}, and component weights ω_k ≥ 0, given by

    F(X | ω_k, µ_k, Σ_k) = ∑_{k=1}^{K} ω_k f(X | µ_k, Σ_k),    (2.22)

with the weights ω_k satisfying ∑_{k=1}^{K} ω_k = 1.

A GMM can be used to model the distribution of K clusters formed by color-based classification of image pixels and presents a parametric alternative to non-parametric models like histograms.
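A sketch of how such a model could be evaluated, e.g. to score a pixel color against a foreground GMM, follows. The parameters here are made-up placeholders; in the actual algorithm they are learned from the initially classified pixels:

import numpy as np

def gaussian_density(x, mu, cov):
    # multivariate Gaussian density, Eq. (2.21)
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_density(x, weights, means, covs):
    # weighted sum of K Gaussian components, Eq. (2.22)
    return sum(w * gaussian_density(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# hypothetical 2-component model for 3-dimensional RGB pixel vectors
weights = [0.7, 0.3]
means = [np.array([0.8, 0.2, 0.2]), np.array([0.2, 0.3, 0.8])]
covs = [0.05 * np.eye(3), 0.02 * np.eye(3)]
print(gmm_density(np.array([0.75, 0.25, 0.2]), weights, means, covs))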

2.3.3 Vector Quantization

The process of quantization in signal processing is related to clustering, as its purpose is the compression of a signal by reducing its number of different values to a smaller number. A special quantization technique that can also be used for density estimation is called vector quantization (VQ) [GG92]. It works by dividing a set of data vectors into smaller sets that are approximately equally large. A vector quantization is characterized by a so-called codebook C, consisting of a limited number of prototype vectors C = {c_1, ..., c_n}. The initial set of data or source vectors X = {x_1, ..., x_m}, with n < m, is called the training sequence. As usual with density estimation, the set of source vectors is assumed to be a large enough set of observations to capture the statistical properties of their source population.

The intention is now to approximate each source vector x_i, 1 ≤ i ≤ m, by a prototype vector c_j, 1 ≤ j ≤ n. Let P = {P_1, ..., P_n} be the partition of the training sequence, where x_i ∈ P_j if x_i has been assigned to the code vector c_j. In other words, each P_j is the set of source vectors that are approximated by the corresponding code vector c_j. The decision to which code vector a source vector is assigned relates to the minimization of some measure of error. As in our case the source vectors are 3-dimensional color vectors, an adequate measure is, for example, the mean squared error

    E = (1/m) ∑_{i=1}^{m} ‖x_i − c‖²,    (2.23)

where c is the code vector to which x_i is assigned. The problem of quantization can thus be summarized as follows: Given a training sequence X and a desired number of code vectors n, find a codebook C and a partition P such that the error E is minimized. The code vectors c_j work as the centroids of each partition P_j. Thus, for each source vector x in P_j, its distance to the assigned centroid c_j must not be greater than its distance to any other code vector:

    P_j = {x : ‖x − c_j‖² ≤ ‖x − c_k‖² for all k ∈ {1, ..., n}}.    (2.24)

Vector quantization can be done iteratively by initially sorting all data vectors into one set, which is then divided into equally large subsets. The next set to be split is chosen by some criterion, and this is repeated until the required quantization level is reached. This kind of quantization follows a binary tree structure: the root vertex is the initial set, and from each vertex there are two outgoing edges, representing the splitting of the corresponding set into two sets, which are represented by the target vertices of the edges. Clustering algorithms, such as K-means, can be used to generate a codebook, with the parameter K corresponding to the number of code vectors.
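A compact sketch of codebook generation with K-means (Lloyd's algorithm) is given below. This is a generic illustration under simple assumptions (random initialization, fixed iteration count), not the Orchard-Bouman clustering used later in the thesis:

import numpy as np

def kmeans_codebook(X, K, iterations=20, seed=0):
    """X: (m, d) array of source vectors; returns a (K, d) codebook."""
    rng = np.random.default_rng(seed)
    code = X[rng.choice(len(X), size=K, replace=False)]   # initial code vectors
    for _ in range(iterations):
        # assignment step: each source vector joins its nearest code vector (2.24)
        dist = ((X[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # update step: each code vector becomes the centroid of its partition
        for j in range(K):
            if (labels == j).any():
                code[j] = X[labels == j].mean(axis=0)
    return code

# e.g. quantize 1000 random RGB vectors to a codebook of 8 colors
X = np.random.default_rng(1).random((1000, 3))
print(kmeans_codebook(X, K=8))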

2.4 Binary Morphology

Binary morphology is a technique that allows changing the geometrical shapes in binary images. A binary image B is an image whose pixels have only two possible values, 0 and 1, making it possible to store each pixel in a single bit. Binary images can be used as a mask for a corresponding color image I, e.g. to indicate that pixels p ∈ I are only sorted into a certain category if the corresponding pixels in B are nonzero.

The two basic morphological operations are erosion and dilation. Let B be a binary image, regarded as a subset of the Euclidean space R². The operations are transformations of B by a structuring element S, which itself is a binary image of a predefined shape, e.g. a disk or a rectangle.

Definition 19. The erosion of a binary image B by a structuring element S with B, S ⊆ R² is given by [GW92]

    B ⊖ S = {x : S_x ⊆ B}    (2.25)

with S_x = {y : y = s + x, s ∈ S}.

S_x is a translation of the points of the structuring element S so that its origin is centered at the point x. If the translated points form a subset of the binary image, the point x will be part of the resulting erosion. Informally, given a rectangular structuring element S, we superimpose S on every nonzero pixel x of B so that the central pixel of S coincides with x. If S is then not visible because it is completely contained within B, the pixel x will also have a value of one in the result image B ⊖ S; otherwise it will be set to zero.

Definition 20. The dilation of a binary image B by a structuring element S with B, S ⊆ R² is given by [GW92]

    B ⊕ S = {x : (Ŝ)_x ∩ B ≠ ∅}    (2.26)

with Ŝ = {y : y = −s, s ∈ S} and (Ŝ)_x = {y : y = s + x, s ∈ Ŝ}.

Ŝ is the reflection of the points s ∈ S, because for the dilation we do not want to superimpose S on the nonzero pixels only, but on the zero pixels x of B as well. So after S is placed with its center on a zero pixel x of B, if at least one pixel of S coincides with a nonzero pixel of B, then x will be set to one in the resulting image B ⊕ S; otherwise it will remain zero.

(a) Binary image. (b) Eroded image. (c) Dilated image.

Figure 2.7: Example of erosion and dilation of a binary image with a square-shaped structuring element (10 × 10 pixels). Notice the loss of fine image details caused by the erosion process.

An example binary image subject to the above morphological operations is given in Figure 2.7. Erosion and dilation can be combined to select a border strip around a contour: let B be a binary image and let B_e be the erosion and B_d the dilation of B. Then the pixelwise subtraction B_d − B_e features only those pixels that have been changed by the two operations.
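The border-strip construction can be sketched as follows, assuming SciPy with its ndimage morphology routines is available; the strip width is a hypothetical parameter chosen for this example:

import numpy as np
from scipy import ndimage

def border_strip(mask, width=5):
    """mask: 2D boolean array; returns the band B_d - B_e around the contour."""
    structure = np.ones((width, width), dtype=bool)       # square structuring element
    eroded = ndimage.binary_erosion(mask, structure=structure)    # B_e
    dilated = ndimage.binary_dilation(mask, structure=structure)  # B_d
    return dilated & ~eroded                              # pixelwise subtraction

# e.g. a filled square yields a frame-shaped band around its contour
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
print(border_strip(mask).sum())   # number of pixels in the band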

2.5 Cameras and Perspective Geometry

In this section the mapping process from 3D objects, given in 3D world coordinates, to image pixel coordinates is explained [HZ03]. At first the pinhole camera model is introduced, which describes the perspective projection process of a perspective camera. After the projective geometry has been introduced, time-of-flight cameras are presented as an instrument for the acquisition of depth information. Finally, rendering techniques for the creation of synthetic images are presented, which are used for the trimap generation of the segmentation algorithm.

2.5.1 Pinhole Camera Model

The pinhole camera model describes the geometry between a 3D point in Euclidean space and its corresponding projection onto a 2D image plane. A pinhole camera is a closed box with just one hole in it. Through this hole the emitted or reflected light of an object passes and is projected onto the surface of the box opposite the hole. The orientation of the projected scene is rotated by 180° and mirrored, because the light from every object point travels in a straight line through the pinhole.

The scene is represented geometrically by 3D points P = (X, Y, Z) in Euclidean space, and the center of projection is called the camera center C. The 3D points are projected onto the image plane I, and the distance of the image plane to the camera center is the focal length f. In order to get rid of the rotation and mirroring of the scene, the image plane is moved in front of the camera center. This is physically impossible but convenient for the geometric model, as mathematically only the orientation of the coordinate system changes. Thus, the 2D point p = (x, y), which is the projection of the 3D point P = (X, Y, Z), is the point on the image plane where a straight line from P to the camera center C meets I. The axis going through the camera center perpendicular to the image plane is called the optical axis, and the point where the optical axis meets the image plane is called the principal point; see also Figure 2.8.

Figure 2.8: The pinhole camera model.

As the described projection is a mapping of 3D coordinates to 2D coordinates, the depth of the 3D point, corresponding to its Z coordinate, is lost during the process. Thus, a mapping back from 2D to the original 3D coordinates is not possible without additional information.

2.5.2 Perspective Geometry

Usually, points in 3D space are denoted as a triple P = (X, Y, Z) of Euclidean coordinates. However, the Euclidean space is not suitable for the description of transformations in projective space. Human visual perception is based on perspective: objects that are closer to the observer appear larger than objects at a distance. Accordingly, parallel lines, like the two rails of a railroad track, appear to converge at the horizon, which geometrically is a point at infinity. Describing such a perspective transformation in terms of Euclidean geometry is cumbersome, as points at infinity cannot be modeled. To expand the Euclidean space to allow for the modelling of infinitely distant points in each direction, the Euclidean coordinates are embedded into perspective space by the use of homogeneous coordinates [HZ03]. A 3D point P = (X', Y', Z')^T in Euclidean space is represented by homogeneous coordinates P = [X, Y, Z, W]^T in 4-dimensional perspective space, where

    X' = X/W,    Y' = Y/W,    Z' = Z/W,    with W ≠ 0.    (2.27)

In general, given a scaling factor λ ≠ 0, a point (X_1, X_2, ..., X_n)^T in n-dimensional Euclidean space represents all points in (n+1)-dimensional perspective space created by scaling the homogeneous point [X_1, X_2, ..., X_n, 1]^T by λ:

    (X_1, X_2, ..., X_n)^T  corresponds to  λ [X_1, X_2, ..., X_n, 1]^T.    (2.28)

Thus, homogeneous coordinates are scale invariant.

The mapping of 3D world coordinates P = (X, Y, Z)^T to image pixel coordinates p = (x, y)^T can be described as three separate operations. At first the world coordinates are transformed into the camera coordinate system. This operation includes a translation that describes the position of the camera center relative to the world coordinate origin, and a rotation to account for the orientation of the camera. The next step is the perspective projection described by the pinhole camera model.

Here the points P = (X_C, Y_C, Z_C)^T in camera coordinates are projected onto the points p = (u, v)^T in image coordinates by [HZ03]

    u = X_C f / Z_C,    v = Y_C f / Z_C,    (2.29)

where f > 0 is the focal length of the camera. The last step is the transformation of the points p = (u, v)^T on the image plane into pixel coordinates p = (x, y)^T. This step compensates for the different origins of the two coordinate systems, since pixel coordinates typically originate in the upper left corner of the image plane and not at its center. Also the conversion from the image coordinate unit into pixel units is performed.

The intrinsic parameters of the camera can be combined into a single matrix transformation. Assuming that pixels on the camera sensor are square and non-skewed, those parameters are:

- the focal length f,
- the scale factors α_u, α_v, describing the scaling of image coordinates (u, v) to pixel coordinates (x, y),
- the offset of the principal point from the pixel coordinate origin, described by a vector (o_u, o_v).

The matrix created by the combination of the transformations induced by the above parameters is referred to as the camera matrix K and is given by

    K = [ fα_u   0      o_u ]
        [ 0      fα_v   o_v ]
        [ 0      0      1   ]    (2.30)

The position of the camera center in world coordinates, given as a 3×1 translation vector C, and the rotation of the camera, given by a 3×3 rotation matrix R, are the extrinsic parameters of the camera. Thus, the pixel coordinates [x, y, 1]^T of a point [X, Y, Z, 1]^T in world coordinates are given by [Mor09]

    λ [x, y, 1]^T = K [ R | −RC ] [X, Y, Z, 1]^T    (2.31)

up to a homogeneous scaling factor λ.
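Equation (2.31) translates directly into code. A minimal sketch with made-up camera parameters (f in pixel units with α_u = α_v = 1, identity rotation, arbitrary camera center) follows:

import numpy as np

def project(P_world, K, R, C):
    """Project a 3D world point onto pixel coordinates via Eq. (2.31)."""
    P = np.append(P_world, 1.0)                      # homogeneous world point
    M = K @ np.hstack([R, (-R @ C).reshape(3, 1)])   # 3x4 projection matrix
    x = M @ P
    return x[:2] / x[2]                              # divide out the scale factor

# hypothetical camera: f = 800 pixel units, principal point at (320, 240),
# identity rotation, camera center 2 units behind the world origin
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
C = np.array([0.0, 0.0, -2.0])
print(project(np.array([0.5, 0.25, 2.0]), K, R, C))  # -> [420., 290.]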

2.5.3 Time-of-Flight Cameras

There are two input images for our segmentation algorithm: a color image I, which is the actual subject of segmentation, and a depth image D. The depth information is used to provide an initial classification of the pixels of the color image into foreground and background. This is required by the segmentation algorithm, which itself is explained in detail in Chapter 3.

The depth images are generated based on the time-of-flight (TOF) principle. A camera utilizing this principle is able to capture, in addition to the intensity information of a scene, information about the depth of the scene objects relative to the camera. In a basic form, those cameras work by sending infrared light onto a scene, which is reflected by the illuminated objects. The reflected light is then gathered by a camera lens and projected onto the camera's pixel sensor array. The time until the reflected light reaches the camera sensor varies for objects at different depths, because the light travels at a certain speed and thus needs a longer time to travel a larger distance. Based on those time differences and the speed of light c, the depth d of an object's pixels can be estimated by

    d = tc/2,    (2.32)

where t is the time measured and the division by two compensates for the fact that the light has to travel the distance twice: from the emitter to the object and back to the receiver.

In our experiments photonic mixer device (PMD) cameras are used, which take a slightly different approach. They emit continuously modulated light, and instead of measuring the time until the light signals are reflected, they measure the phase shift between the emitted modulated signal and the reflected signal.

Figure 2.9: Sketch of the time-of-flight principle used by a PMD camera.

Let P_e be the sinusoidally modulated signal emitted by the PMD camera with mean optical power P_A. Due to the travel from the light source to the 3D scene and back to the camera sensor, the reflected signal P_r has a different zero-crossing than the emitted signal P_e; see Figure 2.10.

The difference of the zero-crossings is measured by the phase shift φ. It can also be seen that the amplitude of the reflected signal P_r is reduced compared to the amplitude of the emitted signal by a factor k due to optical loss. Also, the mean optical power P_B of the reflected signal is smaller than that of the emitted signal, which is caused by the imperfection of the demodulation process and background noise.

Figure 2.10: Principle of PMD camera [BOL+05].

The demodulation is necessary to obtain depth information based on the measured phase shift. This is done by sampling the signal four times per modulation period, shifted by 90° each, to obtain an unambiguous reconstruction of the unmodulated output signal. A sinusoidal signal is completely defined by its amplitude A, its phase φ, and its frequency f. Based on the four samples denoted by S_1, S_2, S_3, and S_4, the amplitude of the reflected signal is given by

    A = √((S_1 − S_3)² + (S_2 − S_4)²) / 2    (2.33)

and the phase can be calculated as

    φ = arctan((S_1 − S_3) / (S_2 − S_4)).    (2.34)

Given the phase φ, the distance d of the 3D object from which the signal was reflected is given by

    d = (c / (4π f)) φ,    (2.35)

where c is the speed of light constant and f is the modulation frequency.
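The demodulation equations can be sketched directly; the four sample values below are invented to illustrate the computation, and atan2 is used instead of arctan to resolve the quadrant of the phase:

import math

def pmd_depth(s1, s2, s3, s4, mod_freq):
    """Amplitude, phase and depth from four 90-degree-shifted samples,
    following Eqs. (2.33)-(2.35)."""
    c = 299_792_458.0                                    # speed of light in m/s
    amplitude = math.hypot(s1 - s3, s2 - s4) / 2.0       # Eq. (2.33)
    phase = math.atan2(s1 - s3, s2 - s4)                 # Eq. (2.34)
    depth = c * phase / (4.0 * math.pi * mod_freq)       # Eq. (2.35)
    return amplitude, phase, depth

# hypothetical samples at a 20 MHz modulation frequency
print(pmd_depth(1.8, 1.2, 0.4, 1.0, 20e6))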

Here c is the speed of light constant and f is the modulation frequency. The above measurements are done at each pixel of the PMD sensor array. The intensity information received at each PMD pixel is given by the offset O:

    O = \frac{S_1 + S_2 + S_3 + S_4}{4}.    (2.36)

Note that O also includes any background light gathered along with that particular reflected signal. The maximum distance for the target scene of the camera is limited to half of the signal wavelength, because the light signal has to travel the distance twice. By using the phase shift of modulated light as the basis for depth measurement instead of the light travel time, the background noise produced by other light sources can be disregarded more easily, as the sensor only needs to respond to the modulation frequency. A depth image of a time-of-flight camera can be used to obtain depth data for a 2D color image created by another camera. If both images are of the exact same view, the process is trivial, since the TOF camera delivers per-pixel depth information. However, due to the measurement concept and the need to maintain a high depth accuracy, the resolution of readily available TOF cameras is limited; a typical PMD camera offers only a small pixel array resolution [Gmb10]. Thus, if the 2D color image is of higher resolution, the depth image needs to be upsampled to match the required resolution. Due to the upsampling process, large resolution differences result in pixel mismatches, especially at object boundaries.

2.5.4 Depth Image-Based Rendering

The term rendering traditionally denotes the process of creating views of a 3D scene by use of geometric modelling. However, if several different 2D image views of a 3D scene are available, a novel synthetic view can be rendered from these existing views without the need for modelling a complete 3D representation of the scene. In the special case of depth image-based rendering, one or more 2D reference images and corresponding depth images, e.g. acquired by a time-of-flight camera, are used to render novel views of a 3D scene. Assume that a single reference image I was created through perspective projection by a perspective camera C with known intrinsic and extrinsic parameters. Then each pixel position p in I can first be reprojected to its original 3D location, and then projected into a novel view. Given a homogeneous point p = [x, y, 1]^T in the reference image, the corresponding 3D point P = (X, Y, Z)^T in Euclidean world coordinates, using Equation (2.31), is given by

    P = (KR)^{−1} (λp + KRC),    (2.37)

where K holds the intrinsic parameters of the camera at C subject to a rotation R, and λ is the homogeneous scaling parameter of p. If we now want to create a synthetic

view I′ of point P, we define a second perspective camera at C′ with rotation R′ and intrinsic parameters K′. Then the corresponding point p′ = [x′, y′, 1]^T in I′ can be calculated by [Mor09]

    λ′ \begin{pmatrix} x′ \\ y′ \\ 1 \end{pmatrix} = K′ \left[ R′ \mid 0_3 \right] \begin{bmatrix} I_3 & −C′ \\ 0_3^T & 1 \end{bmatrix} [X, Y, Z, 1]^T    (2.38)
        = K′R′ (X, Y, Z)^T − K′R′C′    (2.39)
        = K′R′ (KR)^{−1} (λp + KRC) − K′R′C′,    (2.40)

where 0_3 denotes the vector (0, 0, 0)^T, see also Figure 2.11.

Figure 2.11: Warping of a point p in reference view I into a synthetic view I′.

However, there are several issues associated with the above warping process. The point p′ usually does not map to a discrete pixel position, so it is possible that several points from the reference view are mapped to the same point in the synthetic view. Also, especially if the distance between the camera centers C and C′ is large, it is likely that areas occluded in I will become disoccluded in I′, see Figure 2.12. As those areas are not visible in the reference view, the corresponding pixels in the synthetic view are not defined. To address the aforementioned issues, a warping technique using triangle meshes can be used [FBK10]. Here the pixels in the reference view I are locally approximated by a mesh of triangles, where e.g. each pixel corresponds to a triangle vertex. After the vertices have been warped into their new positions in I′, the triangles are rasterized to produce the actual 2D image.
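Assuming the per-pixel depth has already been converted into the scaling factor λ, a single warping step according to Equation (2.40) might look as follows. The 3×3 helpers, including the adjugate-based inverse, are minimal stand-ins for this sketch; a real implementation would use a linear algebra library.

    #include <array>

    using Vec3 = std::array<double, 3>;
    using Mat3 = std::array<std::array<double, 3>, 3>;

    Vec3 mul(const Mat3& M, const Vec3& v) {
        Vec3 r{};
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) r[i] += M[i][j] * v[j];
        return r;
    }

    Mat3 mul(const Mat3& A, const Mat3& B) {
        Mat3 r{};
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                for (int k = 0; k < 3; ++k) r[i][j] += A[i][k] * B[k][j];
        return r;
    }

    // Inverse of a 3x3 matrix via the adjugate; assumes det(M) != 0,
    // which holds for K*R (K is upper triangular, R is a rotation).
    Mat3 inverse(const Mat3& m) {
        double det = m[0][0] * (m[1][1]*m[2][2] - m[1][2]*m[2][1])
                   - m[0][1] * (m[1][0]*m[2][2] - m[1][2]*m[2][0])
                   + m[0][2] * (m[1][0]*m[2][1] - m[1][1]*m[2][0]);
        Mat3 r;
        r[0][0] = (m[1][1]*m[2][2] - m[1][2]*m[2][1]) / det;
        r[0][1] = (m[0][2]*m[2][1] - m[0][1]*m[2][2]) / det;
        r[0][2] = (m[0][1]*m[1][2] - m[0][2]*m[1][1]) / det;
        r[1][0] = (m[1][2]*m[2][0] - m[1][0]*m[2][2]) / det;
        r[1][1] = (m[0][0]*m[2][2] - m[0][2]*m[2][0]) / det;
        r[1][2] = (m[0][2]*m[1][0] - m[0][0]*m[1][2]) / det;
        r[2][0] = (m[1][0]*m[2][1] - m[1][1]*m[2][0]) / det;
        r[2][1] = (m[0][1]*m[2][0] - m[0][0]*m[2][1]) / det;
        r[2][2] = (m[0][0]*m[1][1] - m[0][1]*m[1][0]) / det;
        return r;
    }

    // Warp homogeneous pixel p = (x, y, 1) with scale lambda from the reference
    // camera (K, R, C) into the synthetic camera (K2, R2, C2), Equation (2.40):
    // lambda2 * p2 = K2 R2 (KR)^{-1} (lambda p + K R C) - K2 R2 C2.
    Vec3 warp(const Mat3& K, const Mat3& R, const Vec3& C,
              const Mat3& K2, const Mat3& R2, const Vec3& C2,
              double x, double y, double lambda) {
        Mat3 KR = mul(K, R), K2R2 = mul(K2, R2);
        Vec3 p{lambda * x, lambda * y, lambda};
        Vec3 t = mul(KR, C);
        Vec3 s{p[0] + t[0], p[1] + t[1], p[2] + t[2]};  // lambda*p + K*R*C
        Vec3 world = mul(inverse(KR), s);               // 3D point (X, Y, Z)^T
        Vec3 q = mul(K2R2, world);
        Vec3 shift = mul(K2R2, C2);
        Vec3 h{q[0] - shift[0], q[1] - shift[1], q[2] - shift[2]};
        return Vec3{h[0] / h[2], h[1] / h[2], 1.0};     // dehomogenize (h[2] != 0 assumed)
    }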

Figure 2.12: Example of disocclusion during image warping. The grey area is not visible from the reference view I. However, the shaded area is visible from the synthetic view I′, for which no image data is available.

2.6 Image Segmentation

The segmentation of an image into coherent, homogeneous regions is an important step in image processing. It allows analysis and processing steps to be localized to certain image regions of interest. In this thesis we are concerned with the segmentation of an image into foreground and background regions. The idea of an optimal segmentation result is derived from the object recognition and segmentation achieved by human visual perception. Ideally, a segmentation is obtained automatically, with little or no user input required. Most of the available segmentation techniques can be classified into pixel-based, boundary-based, and region-based methods. Pixel-based methods examine each pixel of an image and decide whether it belongs to the foreground or the background, e.g. by comparing its color to the color distribution of the whole image. This obviously does not work well if the distributions of foreground and background colors overlap considerably. Boundary-based methods try to detect those pixels that belong to discontinuities or edges around foreground objects. If the detected edges do not form a closed contour around an object, one can attempt to close the contour, e.g. by acquiring additional pixels for the edge region based on image gradients. However, if the image discontinuities are not very distinctive, they are difficult to detect with this method. Finally, region-based methods try to find coherent image regions. For example, region growing works by initializing a region with a single pixel. The region is then expanded by acquiring neighboring pixels that are similar in color to the pixels already belonging to that region. Notice that this is in itself a pixel-based method. Again, if the edges around objects are not very distinctive, regions will most likely grow beyond their actual borders. Another region-based method is the energy minimization approach. Here an energy function is created that assigns a cost to each feasible segmentation. The goal is then

to minimize this energy in order to find an optimal segmentation. The advantage of this method is that the cost function can be defined to include different terms accounting for color as well as discontinuity information. The relative importance of these two terms can then be changed simply by adding a weighting factor to the function.

(a) Kanizsa triangle. (b) Segmentation 1. (c) Segmentation 2.
Figure 2.13: The Kanizsa triangle: an example of the ambiguity of segmentation. (a): observers perceive a white, equilateral triangle in the foreground, partially occluding the circles. (b) and (c): possible segmentations, with the respective foregrounds colored in red.

Aside from the combination of different types of information, often additional user input is used to support the segmentation process. This input ranges from marking several pixels as definite foreground or background to creating complex models that approximate the geometrical shape of the objects to be extracted. The latter method requires previous knowledge of the image and is thus difficult to adapt to different images. The existence of ambiguous segmentation tasks, where user input seems necessary to achieve a desirable result, can be illustrated by the well-known Kanizsa triangle in Figure 2.13. Neither color nor edge information can be used to derive the existence of a white triangle in the foreground, which is perceived nevertheless. Thus, for a segmentation technique based on color similarity or discontinuity information, or a combination of both, it seems impossible to arrive at the segmentation result 2.13(c), and it seems logical to deliver the segmentation 2.13(b). The segmentation technique used in this thesis is based on energy minimization. This approach not only combines different types of image information, but is also supported by user input in a way that is convenient for the integration of depth data. The general idea behind the energy minimization approach is presented in the following section.

2.6.1 Segmentation by Energy Minimization

Within the scope of this thesis, color image segmentation refers to the process of partitioning the set of pixels p of a color image I into a foreground and a background

region. The important aspect is the identification and extraction of one or more foreground objects. If we represent the set of N pixels of an image I as a vector I = {p_1, ..., p_N}, the segmentation problem corresponds to finding optimal labels A = {α_1, ..., α_N} for every pixel, where α_n = 0 denotes that the pixel p_n belongs to the background, and α_n = 1 sorts p_n into the foreground. The assignment of labels α_n to pixels p_n is called a labeling l = {l_1, ..., l_N}. Note that such a labeling can seldom produce a perceptually perfect segmentation, as the colors of pixels in the border region between foreground and background are mostly a combination of foreground and background colors. To account for this we would have to allow every α_n to take values between 0 and 1. This kind of labeling, referred to as alpha matting, is not discussed in this thesis, as our goal is to find the best possible hard segmentation, which can e.g. be expanded to alpha matting in a postprocessing step [RKB04]. If we think of segmentation as an optimization problem, we have to find an optimal segmentation among the feasible solutions. A good segmentation, represented by a labeling, is prominently identified by two qualities:

1. The labeling fits well to a previously established model of foreground and background.
2. The number of discontinuities across actually solid objects or surfaces in the segmented image is as small as possible.

The desire to prevent fitting and discontinuity errors leads to the definition of a function where the quality of a segmentation l can be expressed as a minimizable cost or energy of the form

    E(l) = E_{data}(l) + λ E_{smooth}(l)    (2.41)

with energies E_data and E_smooth, and a factor λ controlling the relative importance of the two energies. E_data is the energy representing the fit of l to the background and foreground models. It is given by

    E_{data}(l) = \sum_{n=1}^{N} m(l_n),    (2.42)

where m(l_n) gets larger the more the labeling l_n disagrees with the models, and vice versa. For example, if a pixel p_n fits perfectly to the background but the labeling l_n assigns p_n to the foreground, then m(l_n) will penalize this with a large value. As we do not want to examine all pairs of pixels for discontinuities, we limit the search to the neighboring pixels q of a given pixel p. The set of all pairs (p, q) of neighboring pixels is called the neighborhood N_p of pixel p. Now the smoothness term E_smooth of the energy E(l) is given by

    E_{smooth}(l) = \sum_{(p,q) \in \mathcal{N}} V(l_p, l_q),    (2.43)

where \mathcal{N} is the set of all pairs of neighboring pixels of an image I = {p_1, ..., p_N}. Similar to Equation (2.42), V(l_p, l_q) is large if the labels of pixels p and q are different while the pixels are otherwise similar in a certain way. This similarity can e.g. be expressed by the Euclidean distance of the pixel colors. Unfortunately, finding the global minimum of an arbitrary energy function is intractable, and directly minimizing the energy E(l) as defined above is an NP-hard problem [ZVB99]. However, another advantage of the energy minimization approach is that it is solvable by a graph cut method [BJ01]. Here an equivalent minimum cut problem is formulated, which can then be solved efficiently using a maximum flow algorithm. This involves the creation of a graph that represents the image subject to segmentation, where each pixel is represented by a vertex. The user is able to mark pixels as definite foreground or background, which is reflected in the capacities of the edges connected to the corresponding vertices. As we want to integrate per-pixel depth data from time-of-flight cameras, this method of user input can easily be replaced. We can summarize the reasons for an energy minimization approach solved by graph cuts as follows:

- it combines color and discontinuity information, which improves segmentation results for images with weak edges or similar foreground and background colors,
- the user input of the graph cut method consists of determining pixels that definitely belong to the foreground or background, which can be replaced directly by reliable per-pixel depth data,
- no complicated model-based approaches requiring previous knowledge of the image are used, which makes this segmentation technique usable for arbitrary images,
- there exist several efficient maximum flow algorithms to find a minimum cut in a graph, allowing high performance even for the large graphs arising from high definition images,
- the graph cut approach can be extended to an iterative algorithm, successively optimizing the segmentation [RKB04].
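As a small illustration of Equations (2.41)–(2.43), the following sketch evaluates the energy of a binary labeling on a grayscale image. The per-pixel misfit values and the concrete form of V are placeholders chosen for this example; the actual terms used in this thesis are defined in Chapter 3.

    #include <array>
    #include <cmath>
    #include <vector>

    // Evaluate E(l) = E_data(l) + lambda * E_smooth(l) for a binary labeling of
    // a W x H grayscale image. misfit[n][a] plays the role of m(l_n): the cost
    // of giving pixel n the label a under the foreground/background models.
    // V is realized here as exp(-beta * (I_p - I_q)^2) for pairs with differing
    // labels, so similar pixels with different labels are penalized the most.
    double energy(const std::vector<std::array<double, 2>>& misfit,
                  const std::vector<int>& label,   // alpha_n in {0, 1}
                  const std::vector<double>& I,    // pixel intensities
                  int W, int H, double lambda, double beta) {
        double e_data = 0.0;
        for (int n = 0; n < W * H; ++n) e_data += misfit[n][label[n]];

        auto V = [&](int p, int q) {
            if (label[p] == label[q]) return 0.0;  // no discontinuity penalty
            double d = I[p] - I[q];
            return std::exp(-beta * d * d);        // large for similar colors
        };
        double e_smooth = 0.0;
        // 4-neighborhood: visiting the right and bottom neighbor of each pixel
        // enumerates every neighboring pair exactly once.
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                int p = y * W + x;
                if (x + 1 < W) e_smooth += V(p, p + 1);
                if (y + 1 < H) e_smooth += V(p, p + W);
            }
        return e_data + lambda * e_smooth;
    }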


3 Graph-Based Color Image Segmentation

In this chapter the actual segmentation process for a color image is described in detail. It is based on the graph cut approach presented in [RKB04]. There, the single graph cut calculation from [BJ01] is replaced by an iterative optimization, successively improving the segmentation results. Gaussian mixture models are used to model the color distributions of foreground and background, accounting for correlations between the different color channels. After the explanation of the trimap generation for the integration of the depth data in the following section, the different methods for the initial generation of the foreground and background GMMs are introduced. In Section 3.4.3, after the explanation of the final graph cut step, including the actual minimization realized by a maximum flow calculation on an image graph, a complete overview of the iterative segmentation algorithm is given in Figure 3.9.

3.1 Trimap Generation

The purpose of the trimap is twofold: First, it provides an initial classification of pixels, needed to build the foreground and background models for the segmentation algorithm. Second, the number of image pixels subject to segmentation is limited by disregarding areas which are assumed to be accurately segmented by the depth data alone. In the following, it is first explained how the color and depth images for the segmentation algorithm are acquired; then the process of creating the trimap from the depth image is described. The color images subject to segmentation are acquired by a Sony X300 color camera (aspect ratio 16:9). The time-of-flight cameras used are PMD CamCube cameras with a considerably lower resolution (aspect ratio 1:1). Due to their limited resolution, two of them are used to obtain a depth image of the same view as the color camera. The three cameras are aligned in the setup shown in Figure 3.1(a), where C5 is the color camera used and the satellites T1, T2 are the PMD cameras. Note that from the depicted setup only these three cameras are used. The distance of the right and left PMD satellites T1 and T2 to the center color camera C5 is 180 mm each, as can be seen in the schematic view in Figure 3.1(b).

(a) Camera setup. (b) Schematic front view of the used cameras.
Figure 3.1: Camera setup used to produce color and depth image data. (a): Only camera C5 (Sony X300) and the TOF cameras T1, T2 (PMD CamCube) are used. (b): Schematic front view of the camera setup showing the used cameras.

Each of the PMD cameras T1, T2 is attached to a respective module M1, M2, which can be rotated along the axis indicated in the schematic view. Also, all three cameras are horizontally aligned to lie on the same baseline. Figure 3.2 shows sample images generated by the three cameras. The modules M1, M2 are rotated outwards in order to cover the full viewing area of camera C5 by the combined views of the PMD cameras. As our goal is a depth image D of the same view as camera C5, both images must be warped into the central view. If both images are warped independently, the resulting images feature large disoccluded areas due to their distance from the central view, as shown in Figure 3.3.

(a) Camera T2 (left). (b) Camera C5 (center). (c) Camera T1 (right).
Figure 3.2: Sample images generated by the cameras. (a): Depth image from left PMD camera T2. (b): Color image from center camera C5. (c): Depth image from right PMD camera T1.

(a) Warped image from left PMD camera. (b) Warped image from right PMD camera.
Figure 3.3: Depth images resulting from independently warping the right and left depth images of PMD cameras T1, T2 into the central view. The black areas are disocclusions where no image information was available. Notice, however, that both images complement each other in the disoccluded areas.

Now the advantage of using two PMD cameras becomes obvious: the depth information missing in the left warped image can be obtained from the right warped image and vice versa. Thus, the depth image information from both views is combined for the warping process, preventing most of the disocclusions. For more detail on the warping technique, which uses a triangle mesh reduction to handle the disocclusions, see [BSBK08, FBK10]. The resulting depth image D from warping the left and right PMD images of Figure 3.2 into the central view is shown in Figure 3.4. Notice that some disocclusions are still visible due to the large resolution mismatch between the PMD cameras and the color camera, especially at the border regions of foreground and background. Those disocclusions will have to be disregarded when creating an initial pixel classification for the segmentation algorithm.

Figure 3.4: Depth image D created by simultaneously warping both PMD images into the central view [FBK10].

The next step is to create an initial segmentation of the color image pixels into foreground and background, based on the depth image data D. For that, the depth image is thresholded to create a binary image B. The respective threshold can be

selected by user input or can be acquired automatically using a filtering algorithm [PD07, Maj10]. The pixels of a binary image have only two possible values, 0 and 1, storing each pixel as a single bit. Thus, the binary image is already a segmentation of the corresponding color image I, if the color image pixels are classified as foreground and background based on the pixel values of the binary image. However, as already mentioned in the last paragraph, the depth data from which the binary image is created is not reliable in the border regions of foreground and background. In order to indicate the unreliability of the border region, a trimap T is generated from the binary image. A trimap differs from a binary image in that it uses three different pixel classifications instead of two. Trimaps are generally displayed as greyscale images where only three of the available 256 values are used. The trimap is created by first eroding the binary image B, forming the eroded image B_I. Then B is dilated, resulting in the dilated image B_D. The difference between the images B_I and B_D constitutes the unknown region, indicating that the corresponding pixels in the color image will be subject to segmentation (a code sketch of this construction is given at the end of this section). Thus, the final trimap T presents a classification of the color image pixels p ∈ I into three sets:

- T_FG: the foreground region of the trimap,
- T_UN: the unknown region of the trimap,
- T_BG: the background region of the trimap.

(a) Original image. (b) Depth warping result. (c) Binary image. (d) Trimap.
Figure 3.5: Example of trimap generation. (c): Binary image created by thresholding the depth image (b). (d): Trimap created from binary image (c) by erosion and dilation, with regions T_FG (white), T_UN (green), and T_BG (red).

Notice that in Figure 3.5 the trimap actually features four different regions instead of three, because the background region T_BG does not include the whole image background. This is done to improve the performance of the segmentation, as the size of the constructed image graph subject to the maximum flow calculation (see Section 3.4.3) is reduced significantly. Also notice that the black region of the trimap is not only disregarded during the graph cut step, but is also not considered for the creation of the background Gaussian mixture model. This limits the possible creation of large background color clusters which are similar to foreground colors, which would negatively affect the segmentation results. In the next section the initialization of the color models is presented, which uses the pixel classifications established in the current step to create different GMM representations of foreground and background.
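A minimal sketch of the trimap construction by erosion and dilation could look as follows; the square structuring element and the integer encoding of the three regions are choices made for this example only.

    #include <vector>

    // Build a trimap from a thresholded binary depth mask: erode the mask to
    // obtain the certain foreground, dilate it to bound the foreground, and
    // mark the band in between as unknown. The structuring element is a square
    // of the given radius; values: 0 = T_BG, 1 = T_UN, 2 = T_FG.
    std::vector<int> makeTrimap(const std::vector<int>& mask, int W, int H, int radius) {
        auto probe = [&](int x, int y, bool wantAll) {
            // wantAll = true:  erosion test (all neighbors must be foreground);
            // wantAll = false: dilation test (any foreground neighbor suffices).
            for (int dy = -radius; dy <= radius; ++dy)
                for (int dx = -radius; dx <= radius; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    bool fg = nx >= 0 && nx < W && ny >= 0 && ny < H
                              && mask[ny * W + nx] == 1;
                    if (wantAll && !fg) return false;
                    if (!wantAll && fg) return true;
                }
            return wantAll;  // erosion: all passed; dilation: none found
        };
        std::vector<int> trimap(W * H, 0);
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                if (probe(x, y, true))       trimap[y * W + x] = 2;  // eroded mask: T_FG
                else if (probe(x, y, false)) trimap[y * W + x] = 1;  // dilated \ eroded: T_UN
            }
        return trimap;
    }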

3.2 Color Clustering

As the task of the segmentation algorithm is to sort pixels into foreground and background classes based on their color, models of both classes have to be created before the actual iterative segmentation starts. Thus, an initial cluster analysis is performed in order to create preliminary models of foreground and background, which the segmentation algorithm can then optimize. Cluster analysis is the sorting of N data points into K ≤ N categories or clusters, so that the points in each cluster are similar to each other. The measure of similarity can be arbitrary. In the case of pixel color values, we assign pixels to clusters based on their Euclidean distance in color space. Each cluster is represented by its center, which is the mean color value of the fraction of the N total pixels p ∈ I assigned to this particular cluster. Hence, given a cluster C of M pixels p = (p_1, p_2, p_3) ∈ C, its mean vector µ = (µ_1, µ_2, µ_3) is calculated as the sample mean

    µ = \frac{1}{|C|} \sum_{m=1}^{M} p_m, \quad \text{with} \quad µ_i = \frac{1}{|C|} \sum_{m=1}^{M} p_{im}, \quad i = (1, 2, 3).    (3.1)

Likewise, the corresponding covariance matrix Σ is given by the sample covariance matrix

    Σ = [σ_{i,j}]_{3×3} = \frac{1}{|C| − 1} \sum_{m=1}^{M} (p_{im} − µ_i)(p_{jm} − µ_j).    (3.2)

The sample mean and sample covariance matrix can be regarded as estimates of the mean and covariance of the greater population that the samples p_m are taken from. When performing cluster analysis, an appropriate number K of resulting clusters needs to be selected. In our case, where the distribution of color values is represented by a Gaussian mixture model, K is adequately chosen if the underlying data can be approximated well by a combination of K Gaussian densities. In the color clustering step for a color image I, a total of 2K clusters will be created, K for the foreground and K for the background region, respectively. Those regions can be directly adopted from the trimap T created in the previous step. However, a decision remains regarding the unknown region T_UN of the trimap. We could either ignore this region for the purpose of clustering or include it in either the foreground or the background. In the original GrabCut algorithm presented in [RKB04], the trimap is created from user input rather than from a depth image. There the user specifies the unknown region T_UN of the trimap T by pulling a rectangle around the foreground object(s). All other pixels are marked as T_BG, so initially there are no pixels marked as foreground. A preliminary segmentation A = {α_1, ..., α_N} is created by sorting all pixels marked as T_UN into the foreground class (α = 1) and all pixels marked as T_BG into the background class (α = 0). To stay true to this mechanism, in our case the unknown region T_UN and the foreground region T_FG are combined into the preliminary foreground class (α = 1), while the initial background class (α = 0) is formed by T_BG. Thus, the first iterations of the segmentation algorithm will reduce the foreground region by excluding pixels that are identified to fit better to the background. The clustering step results in an assignment or matching of pixels to clusters K = {k_1, ..., k_n, ..., k_N}, where each value k_n is the index k of the cluster to which the pixel p_n has been assigned. Later, as the clusters will be represented by components of Gaussian mixture models, k_n will accordingly be the index of the assigned component. Hence, the classification of each pixel p_n is completely defined by its region (α_n ∈ A) and the assigned cluster (k_n ∈ K) in that region. In the following sections two clustering algorithms used in our experiments are presented, which follow rather different approaches to solving the clustering problem.

3.2.1 K-means Clustering

A simple but effective algorithm for cluster analysis is called K-means clustering. Its parameter K determines the number of clusters created. K-means is a partitional clustering algorithm, meaning that all K clusters are created right at the beginning,

which are then optimized consecutively. This is done by iteratively minimizing the squared deviation within each cluster C_k, with P = {C_k : k = (1, ..., K)}:

    \arg\min_P \sum_{k=1}^{K} \sum_{p_j \in C_k} \|p_j − µ_k\|^2,    (3.3)

where µ_k is the mean color vector of cluster C_k and p_j is the j-th pixel color vector assigned to cluster C_k. Algorithm 4 presents an example execution of K-means to create K clusters for an RGB image I with pixels p ∈ I. Note that in our case, K-means is performed for the foreground and background regions separately to create 2K clusters in total, as already pointed out in the previous section.

Algorithm 4: K-means(I, K)
    Create K random mean color vectors µ_k, k = (1, ..., K);
    while pixel assignments change do
        for k ← 1 to K do
            C_k ← ∅;
        for p ∈ I do
            l ← arg min_k \|p − µ_k\|^2;
            C_l ← C_l ∪ {p};
        for k ← 1 to K do
            µ_k ← \frac{1}{|C_k|} \sum_{p \in C_k} p;

As the initial cluster centers are assigned randomly, it is possible, but highly unlikely, to find the global optimum of Equation (3.3). Thus, the quality of the clustering result largely depends on the choice of the initial cluster centers.
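A compact C++ version of Algorithm 4 is sketched below. As a small deviation from the algorithm as stated, the means are seeded with randomly chosen pixels instead of fully random color vectors, a common variation that avoids empty initial clusters.

    #include <array>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    using Color = std::array<double, 3>;

    double dist2(const Color& a, const Color& b) {
        double s = 0.0;
        for (int i = 0; i < 3; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // K-means over pixel colors: alternate between assigning each pixel to its
    // nearest cluster mean and recomputing the means, until the assignment no
    // longer changes. Returns the per-pixel cluster index.
    std::vector<int> kmeans(const std::vector<Color>& pixels, int K) {
        std::vector<Color> mean(K);
        for (int k = 0; k < K; ++k) mean[k] = pixels[std::rand() % pixels.size()];
        std::vector<int> assign(pixels.size(), -1);
        for (bool changed = true; changed;) {
            changed = false;
            for (std::size_t n = 0; n < pixels.size(); ++n) {   // assignment step
                int best = 0;
                for (int k = 1; k < K; ++k)
                    if (dist2(pixels[n], mean[k]) < dist2(pixels[n], mean[best])) best = k;
                if (assign[n] != best) { assign[n] = best; changed = true; }
            }
            std::vector<Color> sum(K, Color{0, 0, 0});          // update step
            std::vector<int> cnt(K, 0);
            for (std::size_t n = 0; n < pixels.size(); ++n) {
                for (int i = 0; i < 3; ++i) sum[assign[n]][i] += pixels[n][i];
                ++cnt[assign[n]];
            }
            for (int k = 0; k < K; ++k)
                if (cnt[k] > 0)
                    for (int i = 0; i < 3; ++i) mean[k][i] = sum[k][i] / cnt[k];
        }
        return assign;
    }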

3.2.2 Binary Tree Color Quantization

Another technique that can be utilized for clustering is color quantization. There the intention is to reduce the number of different colors of an image, while the quantized image should look as similar as possible to the original image. This is useful when, e.g., the image is to be displayed on a device that can only display fewer colors at once than the image provides. The problem of reducing the colors of an image to a number K of different colors is equal to cluster analysis for three-dimensional data points. Hence, the techniques used in color quantization can be used to create the initial clusters required for our segmentation algorithm. The quantization technique presented here was introduced by Charles Bouman and Michael Orchard in 1991 [BO91]. It can be classified as a divisive hierarchical clustering algorithm, indicating that, unlike in partitional methods, the clusters are not all created initially but are successively created from clusters established in previous iterations. Divisive means that the algorithm starts with all data points in a single cluster, which is then split by some criterion into several clusters. In the present technique, in each iteration one previously established cluster is selected, which is then split into exactly two clusters. Thus, the hierarchy of clusters can be represented by a binary tree.

Figure 3.6: Schematic view of binary tree divisive hierarchical clustering. In each clustering step one of the previously created clusters is split until the required number of clusters (here 5) is reached.

Which cluster is split in a given iteration is determined by calculating the principal components of the clusters. To understand this we need to introduce some basics of linear algebra. The spectral theorem [Fis03] states that every symmetric real matrix, like our sample covariance matrix Σ given by Equation (3.2), can be diagonalized, meaning that Σ is similar to a diagonal matrix Σ_D whose entries outside the main diagonal equal zero. Similar means that both matrices describe the same linear transformation, but with respect to different bases. Thus, a matrix Σ can be diagonalized by finding an invertible matrix E such that

    Σ = E Σ_D E^{−1}.    (3.4)

From the spectral theorem it also follows that the above equation describes the eigendecomposition of Σ. That means that E can be expressed as a matrix of eigenvectors e_n of Σ, and the main diagonal entries of Σ_D are the corresponding eigenvalues λ_n.

An eigenvector e of a matrix Σ is a vector that, if linearly transformed by Σ, only changes its magnitude but not its direction [Fis03]:

    Σe = λe.    (3.5)

The eigenvalue λ expresses the change in magnitude of the corresponding eigenvector e. Since every real symmetric n × n matrix has n linearly independent eigenvectors, E can be chosen such that the eigenvectors e_n are orthogonal, making E an orthonormal basis if the eigenvectors have unit length. Then the eigenvectors e_n are called the principal components of Σ. We summarize that the covariance matrix is diagonal in eigenvector coordinates. Since Σ is the sample covariance matrix of an image I = {p_1, ..., p_N}, consisting of possibly correlated 3-dimensional color vectors p = (p_x, p_y, p_z), the diagonalization of Σ is equal to the transformation of the color data into a coordinate system in which the color coordinates are uncorrelated. This follows because Σ_D has the form

    Σ_D = \mathrm{diag}(λ_1, λ_2, λ_3) = \begin{pmatrix} λ_1 & 0 & 0 \\ 0 & λ_2 & 0 \\ 0 & 0 & λ_3 \end{pmatrix} = \begin{pmatrix} σ_1^2 & 0 & 0 \\ 0 & σ_2^2 & 0 \\ 0 & 0 & σ_3^2 \end{pmatrix},    (3.6)

showing the decorrelation, since the covariances Σ_D^{i,j}, i ≠ j, are all zero. An important fact can be noticed in Equation (3.6): the eigenvalues λ_i of Σ are equal to the variances σ_i^2 of the color data points orthogonally projected onto the corresponding eigenvectors e_i. It follows that the variance of the points is greatest along the axis formed by the eigenvector e_i with the largest eigenvalue λ_i. Note that the above procedure corresponds to the attempt to minimize the total squared error (TSE) of projecting the points onto the principal components:

    TSE = \sum_{k=1}^{K} \sum_{p \in C_k} \|p − µ_k\|^2,    (3.7)

where K is the total number of clusters and µ_k is the mean of cluster C_k. However, the above technique only works well if the data points are linearly correlated, e.g. like those subject to a Gaussian distribution. This is because the procedure builds principal components from the variance of the data. As the new-found basis is just a linear combination of the original basis, this will not produce meaningful results if the variance lies along a non-linear path. Once the cluster C_k that should be split is determined, the splitting is done by constructing a plane that is perpendicular to the largest eigenvector e_k of C_k, passing through the cluster mean µ_k. That splits the cluster by reducing its variance along the direction of its largest eigenvector e_k. Algorithm 5 lists the clustering process for the foreground pixels of an image p_n ∈ I into K clusters, indicated by

the corresponding pixels α_n = 1 of the preliminary segmentation A. Note that the eigenvectors and eigenvalues of the covariance matrices are computed using the singular value decomposition (SVD). When used for symmetric, positive-semidefinite matrices like covariance matrices, the SVD provides a solution to the eigenvalue problem, see Appendix A.

Figure 3.7: Sketch of Orchard-Bouman clustering. Left: Random sample cluster of normally distributed 3D points with the largest eigenvector e (blue) of the corresponding covariance matrix. Right: A plane perpendicular to e passing through the cluster mean depicts the splitting of the cluster along the axis of greatest variance.

Algorithm 5: Orchard-Bouman-Clustering(I, K, A)
    C_1 ← {p_n ∈ I : α_n = 1};
    Calculate mean µ_1;
    Calculate covariance matrix Σ_1;
    for l ← 2 to K do
        Perform SVD to find the eigenvectors of C_k, k = (1, ..., l − 1);
        Find the cluster C_k with the largest eigenvector e_k;
        C_l ← {p ∈ C_k : e_k^T p > e_k^T µ_k};
        C_k ← C_k \ C_l;
        Calculate means µ_k, µ_l and covariance matrices Σ_k, Σ_l;

The clustering process is essentially a vector quantization as introduced in Section 2.3.3, where the cluster means µ_k are the codevectors forming the codebook, the color vectors p ∈ I are the training sequence, and the resulting set of clusters is the desired partitioning that minimizes the error TSE.
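The central step of Algorithm 5, splitting one cluster at the plane through its mean perpendicular to the principal axis, can be sketched as follows. Power iteration is used here as a simple stand-in for the SVD employed in the thesis; it suffices for covariance matrices with a clearly largest eigenvalue.

    #include <array>
    #include <cmath>
    #include <vector>

    using Vec3 = std::array<double, 3>;
    using Mat3 = std::array<std::array<double, 3>, 3>;

    // Dominant eigenvector of a symmetric 3x3 matrix via power iteration.
    Vec3 dominantEigenvector(const Mat3& S, int iters = 100) {
        Vec3 v{1.0, 1.0, 1.0};
        for (int it = 0; it < iters; ++it) {
            Vec3 w{};
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j) w[i] += S[i][j] * v[j];
            double norm = std::sqrt(w[0]*w[0] + w[1]*w[1] + w[2]*w[2]);
            for (int i = 0; i < 3; ++i) v[i] = w[i] / norm;
        }
        return v;
    }

    // One split of Algorithm 5: divide a cluster by the plane through its mean
    // mu that is perpendicular to the principal axis e of its covariance sigma.
    void splitCluster(const std::vector<Vec3>& cluster, const Vec3& mu,
                      const Mat3& sigma,
                      std::vector<Vec3>& lower, std::vector<Vec3>& upper) {
        Vec3 e = dominantEigenvector(sigma);
        double threshold = e[0]*mu[0] + e[1]*mu[1] + e[2]*mu[2];  // e^T mu
        for (const Vec3& p : cluster) {
            double proj = e[0]*p[0] + e[1]*p[1] + e[2]*p[2];      // e^T p
            (proj <= threshold ? lower : upper).push_back(p);
        }
    }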

3.3 Gaussian Mixture Model Initialization

After the initial clusters have been created by the methods described in the previous section, the next step in the segmentation is the initialization of the statistical models representing foreground and background. Hence, each of the K foreground clusters is used to create a component of a foreground Gaussian mixture GM_FG; the same is done for the background, creating GM_BG. We will denote a single Gaussian mixture component by GM_α^k with α ∈ {0, 1}, k = (1, ..., K); the corresponding cluster is given by C_α^k. Note that in this step the matching K is not changed, as each component with index k is directly created from the cluster with index k. When looking at Equations (2.21) and (2.22), we see that the single components, each representing a Gaussian density, are defined by the following four elements:

- the mean value µ,
- the inverse of the covariance matrix Σ^{−1},
- the determinant of the covariance matrix det Σ,
- the component weight ω.

The mean value and covariance matrix for a component GM_α^k are already given by the sample color mean µ_α^k and sample covariance matrix Σ_α^k produced in the clustering step for the corresponding cluster C_α^k. An n × n matrix M is called invertible if the inverse n × n matrix M^{−1} exists, for which

    M M^{−1} = M^{−1} M = I,    (3.8)

where I = diag(1, ..., 1) is the n × n identity matrix. If no such inverse matrix exists, the matrix M is called singular. This is exactly the case if the determinant det M equals zero. Since our sample covariance matrices have dimension 3 × 3, their determinant can be calculated by [Fis03]

    \det Σ = \det \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} = aei + bfg + cdh − afh − bdi − ceg.    (3.9)

The inverse Σ^{−1} of Σ is calculated by dividing the adjugate matrix of Σ by its determinant [Fis03]

    Σ^{−1} = \frac{\mathrm{adj}(Σ)}{\det Σ},    (3.10)

with adj(Σ) = [c_{j,i}], c_{i,j} = (−1)^{i+j} \det Σ_{i,j}, where Σ_{i,j} is the matrix that results from deleting row i and column j from Σ. As the determinant equals zero for singular matrices, Equation (3.10) is not defined in that case because of the zero denominator. In our case it cannot be ruled out that a sample covariance matrix Σ becomes singular. The least significant interference with its statistical properties that can be utilized to make the matrix invertible again is adding a small constant to its main diagonal entries Σ_{i,i}, as this only negligibly changes the variances.
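The determinant and inverse of Equations (3.9) and (3.10), together with the diagonal regularization just described, can be sketched for the 3×3 case as follows; the threshold eps is an arbitrary example value.

    #include <array>

    using Mat3 = std::array<std::array<double, 3>, 3>;

    // Determinant by the rule of Sarrus (Equation 3.9).
    double det3(const Mat3& m) {
        return m[0][0]*m[1][1]*m[2][2] + m[0][1]*m[1][2]*m[2][0] + m[0][2]*m[1][0]*m[2][1]
             - m[0][0]*m[1][2]*m[2][1] - m[0][1]*m[1][0]*m[2][2] - m[0][2]*m[1][1]*m[2][0];
    }

    // Invert a covariance matrix via the adjugate (Equation 3.10). Covariance
    // matrices are positive-semidefinite, so det >= 0; if the matrix is
    // (nearly) singular, a small constant is added to the main diagonal first,
    // which only negligibly perturbs the variances but restores invertibility.
    Mat3 invertRegularized(Mat3 m, double eps = 1e-8) {
        while (det3(m) < eps)                  // regularize until well-conditioned
            for (int i = 0; i < 3; ++i) m[i][i] += eps;
        double d = det3(m);
        Mat3 inv;
        inv[0][0] = (m[1][1]*m[2][2] - m[1][2]*m[2][1]) / d;
        inv[0][1] = (m[0][2]*m[2][1] - m[0][1]*m[2][2]) / d;
        inv[0][2] = (m[0][1]*m[1][2] - m[0][2]*m[1][1]) / d;
        inv[1][0] = (m[1][2]*m[2][0] - m[1][0]*m[2][2]) / d;
        inv[1][1] = (m[0][0]*m[2][2] - m[0][2]*m[2][0]) / d;
        inv[1][2] = (m[0][2]*m[1][0] - m[0][0]*m[1][2]) / d;
        inv[2][0] = (m[1][0]*m[2][1] - m[1][1]*m[2][0]) / d;
        inv[2][1] = (m[0][1]*m[2][0] - m[0][0]*m[2][1]) / d;
        inv[2][2] = (m[0][0]*m[1][1] - m[0][1]*m[1][0]) / d;
        return inv;
    }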

The weight ω_α^k of a Gaussian component GM_α^k is simply computed as the fraction of the total foreground (α = 1) or background (α = 0) pixels that were assigned to the corresponding cluster C_α^k in the previous clustering step.

3.4 Iterative Segmentation

The current section describes the iterative learning part of the algorithm. Its goal is the consecutive enhancement of the Gaussian mixture models, improving the distribution models for the foreground and background regions. Each iterative segmentation step consists of three different steps: At first, all pixels are assigned to appropriate GMM components. Then the GMMs are updated based on this assignment. At last, the segmentation is performed by finding a minimum cut of a graph representation of the segmentation problem.

3.4.1 Assignment of Pixels to GMM Components

Now that each of the two Gaussian mixture models is completely defined by initializing all of their components, we can express the pixel color distributions in terms of those models. Given our color image p ∈ I, our preliminary segmentation α ∈ A, and our mixtures of Gaussians GM_FG, GM_BG for foreground and background, we assign each pixel to the most likely component of its respective GMM. In other words, we assign a pixel p_n ∈ I with α_n = 1 to the component k of the foreground model GM_FG with the highest probability of producing the pixel's color. The same is done for each background pixel. Thus, in this step, no pixel changes its classification from foreground to background or vice versa. As each GMM component represents a probability density function, the probability that a given pixel p_n is produced by a certain component GM_α^k with mean µ_α^k and covariance matrix Σ_α^k is given by

    \mathrm{prob}_α(p_n, k) = \frac{1}{\sqrt{\det Σ_α^k}} e^{−\frac{1}{2} [p_n − µ_α^k]^T (Σ_α^k)^{−1} [p_n − µ_α^k]},    (3.11)

where α = 1 denotes a foreground pixel, and α = 0 indicates a background pixel, respectively. The new assignment of pixels to GMM components updates the matching K = {k_1, ..., k_N}, so that each k_n now represents the index of the assigned component k. Together with the preliminary segmentation A = {α_1, ..., α_N}, which is unchanged in this step, the assignment of each pixel is completely defined. For example, a pixel p_n belonging to the foreground (α_n = 1) is assigned to the foreground GMM component GM_FG^k if k_n = k.
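The assignment step can be sketched as follows; the constant normalization factor of the Gaussian density is dropped, as in Equation (3.11), which does not affect which component attains the maximum.

    #include <array>
    #include <cmath>
    #include <vector>

    using Vec3 = std::array<double, 3>;
    using Mat3 = std::array<std::array<double, 3>, 3>;

    // One GMM component with precomputed inverse covariance and determinant,
    // corresponding to the four elements listed in Section 3.3.
    struct Component {
        Vec3 mu;
        Mat3 sigmaInv;
        double sigmaDet;
        double weight;
    };

    // Unnormalized component likelihood of Equation (3.11).
    double prob(const Component& c, const Vec3& p) {
        Vec3 d{p[0] - c.mu[0], p[1] - c.mu[1], p[2] - c.mu[2]};
        double q = 0.0;                       // Mahalanobis term d^T Sigma^{-1} d
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) q += d[i] * c.sigmaInv[i][j] * d[j];
        return std::exp(-0.5 * q) / std::sqrt(c.sigmaDet);
    }

    // Assignment step of Section 3.4.1: each pixel keeps its region label alpha
    // and is (re)assigned to the most likely component of that region's GMM.
    int mostLikelyComponent(const std::vector<Component>& gmm, const Vec3& pixel) {
        int best = 0;
        for (int k = 1; k < static_cast<int>(gmm.size()); ++k)
            if (prob(gmm[k], pixel) > prob(gmm[best], pixel)) best = k;
        return best;
    }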

3.4.2 Updating the Gaussian Mixture Models

Although the pixel region classification A did not change in the previous step, the above process shifted some pixels from their initial clusters to different ones in the same region. This is due to the GMMs using a different distribution representation than the clustering algorithms did. As the goal of the iterative part of the algorithm is to improve the GMMs through a process of learning, we want to make use of the updated pixel assignments for our models. That means that the GMMs created from the initial clustering are now discarded and new GMMs are created based on the updated pixel assignments of the previous GMM assignment step. This ensures that our following first segmentation approach uses the latest distribution modelling information available. Thus, the means and the covariance matrices together with their inverses and determinants are all recalculated, based on the updated component assignments K = {k_1, ..., k_N}, updating each component of each GMM. Those calculations are done exactly as described in Section 3.3.

3.4.3 Segmentation by Graph Cut

With the updated GMMs we are now ready to perform the segmentation. This is done by formulating an energy function E in such a way that the minimum of E corresponds to a good segmentation. Given a segmentation A and a model G, we can obtain such an energy, similar to Equation (2.41), for a color image I with |I| = N as

    E(A, G, I) = E_{data}(A, G, I) + λ E_{smooth}(A, I).    (3.12)

Here E_data is supposed to evaluate how well the segmentation A fits to the color data I, given the model G. The smoothness term E_smooth penalizes discontinuities resulting from the choice of A. The factor λ again controls the relative importance of the two terms. Since our model G is a GMM with K components for each region, we can denote it by

    G = \{µ_α^k, Σ_α^k, ω_α^k : α = (0, 1), k = (1, ..., K)\}    (3.13)

with means µ, covariance matrices Σ, component weights ω, and pixel classifications α ∈ A. Recall that each pixel is assigned a component of our GMM, represented by our matching K = {k_1, ..., k_N} with k_n ∈ {1, ..., K}. As the definition of G depends on this matching K, we include it in our energy function as part of the optimization:

    E(A, G, K, I) = E_{data}(A, G, K, I) + λ E_{smooth}(A, I).    (3.14)

As already mentioned, E_data should penalize a bad choice of A, that is, a segmentation that does not comply with the GMM G. With Equation (3.11) we already denoted the probability that a pixel p_n belongs to a certain GMM component. Thus, the likelihood that the pixels p_n belong to the foreground or background is given by the product of their respective component probabilities of either foreground or background. In order to properly penalize a bad fit, we have to take the negative likelihood, so that E_data is low for a high probability. Finally, to get rid of the product and replace it by a sum, a logarithmic scale is used, which simplifies the minimization while not changing the extrema of the energy [GPS89]. This results in the following definition of the data term:

    E_{data}(A, G, K, I) = \sum_{n=1}^{N} D(α_n, G, k_n, p_n),    (3.15)

with D being the negative log-likelihood of pixel p_n belonging to component k_n of the foreground (α_n = 1) or background (α_n = 0) GMM. For a single pixel p with classification α, this function is given by [RKB04]

    D(α, G, k, p) = −\log\left( ω_α^k \frac{1}{\sqrt{\det Σ_α^k}} e^{−\frac{1}{2} [p − µ_α^k]^T (Σ_α^k)^{−1} [p − µ_α^k]} \right).    (3.16)

Here, the exponent [p − µ_α^k]^T (Σ_α^k)^{−1} [p − µ_α^k] is the squared Mahalanobis distance of the pixel p to the mean µ_α^k. Compared to the Euclidean distance, this metric provides an improved distance measurement when dealing with non-spherical multivariate normal distributions, as the correlations between the data points are considered. In contrast to the data term, the smoothness term E_smooth should measure the fit of the segmentation in terms of neighboring pixels, disregarding the GMMs. Let \mathcal{N} denote the set of all pairs (p_n, p_m) of neighboring pixels of image I. Then the smoothness term is given by [RKB04]

    E_{smooth}(A, I) = \sum_{(p_n, p_m) \in \mathcal{N}} \frac{γ}{d(p_n, p_m)} e^{−β \|p_n − p_m\|^2} δ(α_n, α_m),    (3.17)

where d(p_n, p_m) is the Euclidean distance of pixels p_n and p_m in image coordinates, and with the indicator function δ given by

    δ(α_n, α_m) = \begin{cases} 1 & \text{if } α_n ≠ α_m, \\ 0 & \text{otherwise.} \end{cases}

β is a user-adjustable variance parameter of the Gaussian function. γ is also a parameter, corresponding to λ in Equation (2.41); it controls the relative importance of the two energy terms and is here included in the summation. With the energy function fully defined, we can now build a network N = (G, s, t, c) based on a directed image graph G = (V, E), which will allow us to find a good segmentation of an image by computing a minimum s-t cut. The process of creating the graph [BJ01] is described in the remainder of this section. For each pixel p_n ∈ I, n = (1, ..., N), one nonterminal vertex v_n ∈ V is created. Two additional terminal vertices s ∈ V and t ∈ V are created, where s represents the foreground and t represents the background. The 8-neighborhood N_p of a pixel p consists of all pairs (p, q), where each pixel q is adjacent to p horizontally, vertically, or diagonally in terms of pixel coordinates (see Figure 3.8).

Figure 3.8: Sketch of the pixels forming the 8-neighborhood of the center pixel p(x, y), with their image coordinates displayed.

The edges in the set E created for our graph can be classified into two types: terminal links (t-links) and nonterminal links (n-links):

- t-links: For each vertex v_n ∈ V \ {s, t}, the two directed edges (s, v_n) and (v_n, t) are created.
- n-links: For each vertex v_n ∈ V \ {s, t}, (a maximum of) 8 edges are created, one for each pair (p_n, q) ∈ N_{p_n}. Note that the number of n-links is limited for pixels at image borders, as not all pixels have 8 neighbors.

Also note that for every edge (u, v) ∈ E a reverse edge (v, u) ∈ E is also created in order to maintain skew symmetry throughout the execution of the used maximum flow algorithms, which require a directed graph as input. However, as positive flow is not

allowed on the t-links (v, s) and (t, v) for all v ∈ V due to the definition of a network, the capacity of those edges is set to zero. The capacity of an n-link (v_n, v_m) should correspond to a penalty for sorting the represented neighboring pixels (p_n, p_m) into different regions if they are similar in color. From Equation (3.17) we can derive the capacity for a pair (p_n, p_m) ∈ N_{p_n} as

    N(p_n, p_m) = \frac{γ}{d(p_n, p_m)} e^{−β \|p_n − p_m\|^2}.    (3.18)

The capacity of a t-link, connecting a vertex v_n with one of the terminals s, t, depends on the classification of the pixel into one of the three classes indicated by the trimap value. As the cut represents the segmentation, e.g. for foreground pixels p_n ∈ T_FG, we want the capacity of the t-link (s, v_n) to be very high, so that the cut will not include that edge, because that would separate p_n from the foreground. Accordingly, the capacity of the other t-link (v_n, t) is set to zero, definitely cutting the edge (v_n, t) and separating p_n from the background. In order to prevent a t-link of a vertex from being included in the cut, its capacity must exceed the sum of all n-link capacities of that particular vertex. To avoid calculating this capacity for every vertex, a constant c_max can be defined as

    c_{max} > \max_{p \in I} \sum_{q: (p,q) \in \mathcal{N}} N(p, q),    (3.19)

where \mathcal{N} is the set of all neighboring pixel pairs of image I. Derived from Equation (3.16), the capacity for pixels classified as unknown, p_n ∈ T_UN, is given by

    D_α(p_n) = −\log \sum_{k=1}^{K} ω_α^k \frac{1}{\sqrt{\det Σ_α^k}} e^{−\frac{1}{2} [p_n − µ_α^k]^T (Σ_α^k)^{−1} [p_n − µ_α^k]},    (3.20)

where α indicates that the summation is either over the foreground GMM components, if α = 1, or over the background GMM components, if α = 0. Thus, D_α(p_n) denotes the negative log-likelihood that p_n belongs to either the foreground or background GMM, respectively. See Table 3.1 for an overview of all edge capacities.

    edge type           capacity c     pixel type
    n-link (v_n, v_m)   N(p_n, p_m)    (p_n, p_m) ∈ N_{p_n}
    t-link (s, v_n)     c_max          p_n ∈ T_FG
                        0              p_n ∈ T_BG
                        D_0(p_n)       p_n ∈ T_UN
    t-link (v_n, t)     0              p_n ∈ T_FG
                        c_max          p_n ∈ T_BG
                        D_1(p_n)       p_n ∈ T_UN

    Table 3.1: Overview of edge capacities for the graph cut.
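The capacity assignments of Table 3.1 and Equation (3.18) can be sketched as follows; D_0, D_1, and c_max are assumed to be precomputed according to Equations (3.19) and (3.20).

    #include <cmath>

    // t-link capacities of Table 3.1 for one pixel, given its trimap class and
    // the negative log-likelihoods D0 (background) and D1 (foreground).
    enum TrimapClass { T_BG, T_UN, T_FG };

    void tLinkCapacities(TrimapClass t, double D0, double D1, double cmax,
                         double& capSourceLink, double& capSinkLink) {
        switch (t) {
            case T_FG: capSourceLink = cmax; capSinkLink = 0.0;  break;  // stay with source s
            case T_BG: capSourceLink = 0.0;  capSinkLink = cmax; break;  // stay with sink t
            case T_UN: capSourceLink = D0;   capSinkLink = D1;   break;  // data-driven
        }
    }

    // n-link capacity of Equation (3.18) for neighboring pixels with colors a, b
    // at Euclidean pixel distance dist (1 for horizontal/vertical neighbors,
    // sqrt(2) for diagonal ones in the 8-neighborhood).
    double nLinkCapacity(const double a[3], const double b[3],
                         double dist, double gamma, double beta) {
        double c2 = 0.0;
        for (int i = 0; i < 3; ++i) c2 += (a[i] - b[i]) * (a[i] - b[i]);
        return gamma / dist * std::exp(-beta * c2);
    }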

Note that the minimum cut C = (S, T) calculated on the network as defined above will be an s-t cut. That means that to sort the terminals s and t into S and T respectively, at least one t-link must be cut for every nonterminal vertex; otherwise there would be a path from s to t. Also, because the cut is minimal, it will not include both t-links of any particular vertex, since then a subset of C would already be a minimum cut. Thus, for each nonterminal vertex, exactly one t-link is cut. Furthermore, C will cut an n-link (v_n, v_m) if the corresponding pixels p_n, p_m are assigned to different terminals, due to its minimality, but not otherwise, due to c_max. As c_max is greater than the sum of all n-link capacities of a background vertex v_n, p_n ∈ T_BG, it is cheaper to cut all n-links of v_n together with the t-link (s, v_n) of capacity c(s, v_n) = 0 than to cut the t-link (v_n, t) with capacity c_max. Hence, the cut results in a unique assignment of all pixels to either the foreground, if their background t-link is cut, or the background, if their foreground t-link is cut. In other words, a resulting segmentation A for an image I = {p_1, ..., p_N} represented by vertices v_1, ..., v_N can be derived from the cut C = (S, T) as

    A = \{α_1, ..., α_N\}, \quad α_n = \begin{cases} 0 & \text{if } v_n \in T, \\ 1 & \text{if } v_n \in S. \end{cases}

It can be shown that the minimum s-t cut of a network as given above indeed minimizes the energy denoted by Equation (3.14) [BJ01]. Figure 3.10 provides an example graph setup for a 3×3 image. Instead of doing the above calculation only once, the segmentation can be improved iteratively by updating the GMMs from the segmentation results and using that information for the next iteration. This is repeated until the segmentation meets a certain convergence criterion. A final summary of the whole iterative segmentation procedure is given in Figure 3.9.

Input: color image I = {p_1, ..., p_N}, depth image D, number of clusters K
Output: segmentation A = {α_1, ..., α_N} of color image I

Segmentation algorithm:
1. Create the trimap T with regions T_FG, T_BG, T_UN from D as described in Section 3.1.
2. Create a preliminary segmentation A = {α_1, ..., α_N} from T:
   - p_n ∈ T_FG ∨ p_n ∈ T_UN ⇒ α_n = 1
   - p_n ∈ T_BG ⇒ α_n = 0
3. Perform clustering to create K foreground (α = 1) clusters C_FG^k and K background (α = 0) clusters C_BG^k.
4. Create Gaussian mixture models GM_FG, GM_BG with K components each, initialized from the clusters created in the previous step.
5. Repeat the following steps until the segmentation A converges:
   5.1. Assign each foreground pixel {p_n : α_n = 1} to the most likely foreground GMM component. Assign each background pixel {p_n : α_n = 0} to the most likely background GMM component.
   5.2. Discard both GMMs and create new ones, based on the assignments made in the previous step.
   5.3. Build an image network and calculate a minimum s-t cut as described in the current section, updating the segmentation A.

Figure 3.9: Overview of the segmentation algorithm, following [RKB04]. Steps 5.1 and 5.2 present an incremental update of the GMMs that avoids having to rerun one of the clustering algorithms presented in Section 3.2, which would negatively affect performance.

Figure 3.10: Example of a graph created for an image with 3×3 pixels, displaying the connectivity of the vertices by n-links (green) and t-links (red, blue) with their respective edge capacities. The blue vertex represents a pixel marked as background in the trimap, whereas the red vertex indicates a foreground pixel. Note that for visualization purposes, each line represents a pair of edges.


4 Performance Improvements

In this chapter several possibilities for improving the performance of the segmentation algorithm are investigated. Included are several heuristics that can be used to speed up the basic maximum flow algorithms introduced in Chapter 2. The actual impact on the run-time of the segmentation algorithm will be investigated later in this thesis.

4.1 Reusing the Graph

In the iterative part of the segmentation algorithm (step 5 of Figure 3.9) the minimum s-t cut is calculated on a graph built from scratch each time. This is unnecessary, as the mapping of pixels to vertices never changes during the algorithm. Also note that the function assigning the capacities to the n-links, given by Equation (3.18), is derived from the smoothness term of the energy given in Equation (3.14), which does not depend on the Gaussian mixture models. In fact the n-link capacities are just functions of the fixed positions and colors of the represented pixels and will, once computed, never change. The only things that need to be updated are the t-link capacities, since they describe the likelihood of the represented pixels under the constantly updated GMMs. Thus, to save execution time, only the t-links are updated during execution of the segmentation algorithm, while the rest of the graph data structure is reused.

4.2 Reusing the Flow

In Section 4.1 we saw that most parts of the graph created for the minimum cut calculation can be reused for consecutive iterations. But still the flow on all edges is reset to zero. Not all of the edges will have their capacities changed in each iteration. Thus, what can happen with those edges is that the flow is first reset and then incremented back to its original value. This of course impacts performance, since e.g. the run-time of the Kolmogorov maximum flow algorithm depends on the number of augmenting paths. Hence, if the maximum flow calculation could start on the residual graph resulting from a previous iteration, this would reduce the number of augmenting paths that need to be checked. We already noticed that the capacities of the n-links will not change throughout the algorithm, only the t-link capacities do. Also note that increasing a t-link capacity

does not invalidate any of the flow properties given by Equations (2.1), (2.2), and (2.3). Even reducing a t-link's capacity is not problematic if the updated capacity is still greater than or equal to the flow on that edge. Thus, the only problematic case is that an updated t-link capacity c(u, v) is smaller than the flow f(u, v) on that edge. This results in a negative residual capacity, invalidating the capacity constraint (Equation (2.1)). In each iteration of the segmentation algorithm the energy function changes, as the GMMs are updated, thereby changing the t-link capacities. If additional modifications of the graph are necessary to reuse the flow from the previous iteration, we have to be careful not to change its representation of the updated energy function. Thus, the flow can only be reused if the necessary capacity changes do not modify the minimum s-t cut. The modifications done to keep the flow consistent are based on the following fact: Let N = (G, s, t, c) with G = (V, E) be our image network as defined in Section 3.4.3. Given a nonterminal vertex v ∈ V and its t-links (s, v), (v, t) ∈ E, the addition of a positive constant δ to the capacities of both t-links will never change the minimum s-t cut. The capacity of the cut will change, but not the cut itself. This is true because the minimum s-t cut severs exactly one of the two t-links for every nonterminal vertex. Hence, if the same value is added to both, the cut will still select the same edge with the lower cost, see Figure 4.1.

Figure 4.1: Left: A residual s-t network with minimum s-t cut C = ({s, 2}, {t, 1}), depicted by the red dashed line. Right: The same constant δ is added to the capacities of both t-links of vertex 1. This will only change the cut capacity, but not the cut itself.

Thus, for the purpose of reusing the flow, the t-link capacities are updated as given in Algorithm 6. By setting δ to the difference of flow and updated capacity, it is ensured that δ is always positive. Also, δ equals the least amount of capacity that can be added to make the flow consistent again, saturating the t-link with the previously inconsistent residual capacity [BJ01, KT07].

Algorithm 6: Update-t-links(G)
    Update all t-link capacities;
    foreach source t-link (s, v) ∈ E do
        if f(s, v) > c(s, v) then
            δ ← f(s, v) − c(s, v);
            c(s, v) ← c(s, v) + δ;    // (s, v) is now saturated
            c(v, t) ← c(v, t) + δ;
    foreach sink t-link (v, t) ∈ E do
        if f(v, t) > c(v, t) then
            δ ← f(v, t) − c(v, t);
            c(v, t) ← c(v, t) + δ;    // (v, t) is now saturated
            c(s, v) ← c(s, v) + δ;
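Per vertex, Algorithm 6 reduces to a few lines of code; the following sketch operates on a hypothetical per-vertex record of t-link capacities and flows.

    // Repair the t-links of one vertex after a capacity update. If an updated
    // capacity dropped below the current flow, the deficit delta is added to
    // BOTH t-links: this restores the capacity constraint, leaves the minimum
    // cut unchanged (the same constant on both t-links), and leaves the
    // previously inconsistent edge exactly saturated.
    struct TLinks {
        double capSource, flowSource;  // edge (s, v)
        double capSink, flowSink;      // edge (v, t)
    };

    void repairTLinks(TLinks& e) {
        if (e.flowSource > e.capSource) {
            double delta = e.flowSource - e.capSource;
            e.capSource += delta;      // (s, v) is now exactly saturated
            e.capSink += delta;        // compensate so the cut is unchanged
        }
        if (e.flowSink > e.capSink) {
            double delta = e.flowSink - e.capSink;
            e.capSink += delta;        // (v, t) is now exactly saturated
            e.capSource += delta;      // compensate so the cut is unchanged
        }
    }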

4.3 Improvements for the Kolmogorov Algorithm

Several modifications of the Kolmogorov maximum flow algorithm will be presented in this section that gradually or significantly increase performance. Most of them are also implemented in the publicly available version of the algorithm; some of them are suggested by the version published in the boost C++ libraries.

4.3.1 Augmenting Terminal Paths

Each nonterminal vertex of our image network N has a source and a sink connection. Thus, for each nonterminal vertex v ∈ V the terminal path (s, v, t) is a possible augmenting path. During the growth phase of the Kolmogorov algorithm, each of those paths would have to be explored. Hence, to reduce the number of available augmenting paths, before the actual algorithm starts, all paths of the form (s, v, t) are augmented by increasing the flow on the corresponding edges (s, v), (v, t) by an amount equal to min(r(s, v), r(v, t)), where r(·) denotes residual capacity. That means that after the operation, at least one t-link of every nonterminal vertex is saturated, significantly reducing the number of possible augmenting paths.
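This pre-augmentation can be sketched as follows: pushing the bottleneck min(r(s, v), r(v, t)) through each terminal path saturates at least one t-link per nonterminal vertex before the search trees are grown.

    #include <algorithm>
    #include <vector>

    struct TLinkPair { double rSource, rSink; };   // residual capacities r(s,v), r(v,t)

    // Augment all terminal paths (s, v, t); returns the flow gained. Afterwards
    // at least one of the two residuals of every pair is zero.
    double augmentTerminalPaths(std::vector<TLinkPair>& tlinks) {
        double flow = 0.0;
        for (TLinkPair& e : tlinks) {
            double bottleneck = std::min(e.rSource, e.rSink);
            e.rSource -= bottleneck;
            e.rSink -= bottleneck;
            flow += bottleneck;
        }
        return flow;
    }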

Also note that at the beginning of the growth phase, s and t are the only active vertices. Remember that only active vertices are allowed to add vertices to their search tree. Thus, the first steps in the growth phase consist of activating all vertices adjacent to the terminals. To prevent this, after the augmentation of the terminal paths, first all nonterminal vertices v having an unsaturated edge (s, v) are added to the source tree. Similarly, all vertices with an unsaturated edge (v, t) are added to the sink tree. Additionally, all nonterminal vertices connected to one of the terminals by an unsaturated edge are marked as active. In exchange, the terminals s and t are not added to the list of active vertices. This prevents the initial activation of all nonterminal vertices by the search tree roots s and t in the growth phase.

4.3.2 Saving Visited Edges

At the beginning of the growth phase an active vertex v is selected and each of its outgoing edges is checked for available residual capacity. It is often the case that this examination is interrupted because an augmenting path that includes v has been found. The algorithm then proceeds to the augmentation stage. Note that v is not set to passive if not all of its outgoing edges have been checked. Thus, in the next growth phase, v is still the currently selected active vertex. The algorithm would now start to check all outgoing edges of v again. In order to prevent this, the last checked edge e is saved for every active vertex currently processed. So if an augmenting path has interrupted the edge iteration, the next growth phase can restart from the last checked edge. This greatly reduces the number of edge checks. Since our image graph has a very high edge density, this has the potential to significantly improve performance.

4.3.3 Distance Heuristic

One important performance factor for augmenting path algorithms is the length of the selected paths. For example, the maximum flow algorithms of Dinic or Edmonds and Karp [CLRS01] do not select an arbitrary augmenting path, but the path with the shortest length, which is found by performing a breadth-first search. Due to this fact, for a graph G = (V, E), their worst-case run-time complexities are O(|V|^2 |E|) for the Dinic algorithm and O(|V| |E|^2) for the Edmonds-Karp algorithm, respectively. The augmenting paths found in the growth phase of the Kolmogorov algorithm are not necessarily shortest paths, so the time complexity for finding a shortest path is not applicable. In order to reduce the length of the augmenting paths and thereby improve the run-time, for each vertex its distance d to its search tree root, which is either s or t, is saved. The distance of a vertex v is the number of edges traversed from the search tree root to reach v. For instance, a vertex adjacent to a root has distance 1, whereas the roots themselves have distance 0. As already explained, in the growth phase active

4.3.3 Distance Heuristic

One important performance factor for augmenting path algorithms is the length of the selected paths. For example, the maximum flow algorithms of Dinic or Edmonds and Karp [CLRS01] do not select an arbitrary augmenting path, but the path with the shortest length, which is found by performing a breadth-first search. Due to this fact, for a graph G = (V, E), their worst case run-time complexities can be given by O(|V|²|E|) for the Dinic algorithm and O(|V||E|²) for the Edmonds-Karp algorithm, respectively. The augmenting paths found in the growth phase of the Kolmogorov algorithm are not necessarily shortest paths, so the time complexity for finding a shortest path is not applicable. In order to reduce the length of the augmenting paths and thereby improve the run-time, for each vertex its distance d to its search tree root, which is either s or t, is saved. The distance of a vertex v is the number of edges traversed from the search tree root to reach v. For instance, a vertex adjacent to a root has distance 1, whereas the roots themselves have distance 0. As already explained, in the growth phase active vertices try to add neighboring vertices connected by residual edges to their search tree. Thus, when an active vertex v finds a vertex u connected by a residual edge, three different situations can occur:

1. u does not belong to any search tree. Then u is added to the same search tree as v and is marked as active.
2. u belongs to the other search tree. Then an augmenting path has been found and the algorithm proceeds to the augmentation phase.
3. u belongs to the same search tree as v.

Normally, if the third case occurs, nothing has to be done at all. Note that for each vertex, its parent vertex is saved. The parent of a vertex v is the vertex w which acquired v for its search tree by visiting the residual edge (w, v). Thus, if an augmenting path has been found in the growth phase, it is retraced in the augmentation phase by traversing the edges from the parent vertices, starting with the vertices connecting the search trees. This is done until the search tree roots have been reached. To shorten the augmenting paths, the saved distances d_v, d_u of v and u can be utilized. If d_u is smaller than d_v, then u has a shorter distance to the root than v. In this case the current parent vertex w of v is replaced by u and the distance is updated accordingly. Hence, when retracing the augmenting path in the augmentation phase, or checking whether a vertex has a path to a terminal in the adoption phase, fewer edge traversals have to be performed [KB01].
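A sketch of this reparenting step for the third case is given below; the arrays parent and dist and the function maybeReparent are illustrative names. Note that adopting u as the new parent gives v the distance d_u + 1, so the sketch reparents only on a strict improvement, which is slightly more conservative than the plain comparison d_u < d_v stated above:

    #include <vector>

    // Case 3 of the growth phase: v and u lie in the same search tree and
    // are connected by a residual edge. Reparent v through u whenever this
    // strictly shortens v's path to the search tree root.
    void maybeReparent(std::vector<int>& parent, std::vector<int>& dist,
                       int v, int u) {
        if (dist[u] + 1 < dist[v]) {
            parent[v] = u;
            dist[v] = dist[u] + 1;  // retracing now visits fewer edges
        }
    }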

4.3.4 Timestamp Heuristic

In the adoption phase, the algorithm tries to find a new parent vertex from the previous search tree for each orphaned vertex. A valid parent is not only connected by a residual edge but must also have a path to its search tree root. This last requirement is necessary because in the augmentation stage it is common that not only single vertices are disconnected from the search trees, but connected components consisting of several vertices. For instance, if the edge (u, v) gets saturated, separating v from its search tree, the vertices whose parent vertex is v will also become orphans. Thus, if a vertex w is supposed to be the new parent of v, we have to make sure that the search tree root is the origin of w. To test this we first visit the parent x of w, then the parent y of x, and so on, until the search tree root is reached or not. This process can include a large number of edge checks and is performed for every parent candidate of every orphan encountered in the adoption phase. To speed up this test, in addition to the distance d_v, a timestamp t_v is saved for every vertex v. The timestamp equals the iteration number of the algorithm in which the distance d_v to the terminal has been updated.

If now in the adoption phase a vertex v tries to make w its new parent vertex, the path from w to the terminal has to be traced as usual. If the terminal is reached, the timestamps and distances of the vertices along the path are updated and w becomes the new parent of v. This updated timestamp and distance information can be utilized for future origin checks by comparing the timestamps of encountered vertices to the current iteration number. If, during another origin check, a vertex u is encountered whose timestamp t_u equals the current iteration number, we know that the origin of u was already verified in this iteration. Thus, the process can be stopped early and the whole path up to the search tree root does not need to be traversed [KB01].

4.3.5 Reusing the Search Trees

In Section 2.1.2 we explained that one advantage of the Kolmogorov algorithm over other augmenting path algorithms is that it reuses the search trees built in previous iterations, resulting in increased performance. In Section 4.2 we presented a method to reuse the flow in subsequent executions of the maximum flow algorithms during iterative segmentation. However, even when reusing the flow with the Kolmogorov algorithm, we still discard the search trees after it terminates. Hence, the next logical step is to reuse not only the search trees throughout iterations of a single execution, but also the last known search trees from a previous execution, since rebuilding the search trees from scratch consumes a lot of execution time for large graphs.

The method for reusing the flow presented in Section 4.2 leaves one t-link of a vertex v saturated if its capacity was decreased below its flow value, see Algorithm 6. For example, if for the t-link (s, v) of a vertex v it held that f(s, v) > c(s, v) before the update, then r(s, v) = 0 applies after the t-link update. The problem here is that if s was the parent vertex of v, then v will become an orphan along with all its children. This problem could be solved by simply performing an additional adoption stage after Algorithm 6. However, this is not optimal, as any applicable neighbor u of v could become its new parent, while the distance from u to its search tree root could be arbitrarily large. That lengthens a possible augmenting path along these vertices, negatively affecting the performance of the augmentation stage, which has to traverse said path. However, we also know that if (s, v) is saturated after the execution of Algorithm 6, then (v, t) must have available residual capacity. This is true because the same constant δ is added to the capacities of both t-links. The same holds if (v, t) became saturated, as then (s, v) will have positive residual capacity. If we denote the set of all vertices v which were affected by Algorithm 6 with G_U ⊆ G, meaning that each v ∈ G_U has exactly one saturated t-link, Algorithm 7 is performed to repair the search trees. Note that the search tree roots are directly assigned as parent vertices, thereby decreasing the length of possible augmenting paths.
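The combined effect of the distance and timestamp data on the origin check can be sketched as follows; parent, stamp, dist, and hasRootOrigin are illustrative names, with parent[x] = -1 marking a vertex without a valid parent. The walk stops as soon as it reaches the root or a vertex already verified in the current iteration, and stamps the traversed path for future checks:

    #include <vector>

    bool hasRootOrigin(const std::vector<int>& parent, std::vector<int>& stamp,
                       std::vector<int>& dist, int w, int root, int iteration) {
        std::vector<int> path;
        int x = w;
        // Walk the parent chain until the root, an already verified vertex,
        // or a dead end (an orphaned ancestor) is reached.
        while (x != root && x != -1 && stamp[x] != iteration) {
            path.push_back(x);
            x = parent[x];
        }
        if (x == -1) return false;  // w has no connection to the root
        // Update distances and timestamps along the verified path, so that
        // later origin checks can stop early at any of these vertices.
        int d = (x == root) ? 0 : dist[x];
        for (auto it = path.rbegin(); it != path.rend(); ++it) {
            dist[*it] = ++d;
            stamp[*it] = iteration;
        }
        return true;
    }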

If a vertex v changes the search tree, all its children must become orphans and cannot remain children of v. That is because the search trees use different edges (outgoing edges for S and incoming edges for T) to access their children, as already mentioned in Section 2.1.2. The adoption stage in Algorithm 7 is performed to find new parents for such orphans. Also note that, after this update has been performed, possible augmenting paths in the next execution of the Kolmogorov algorithm can only pass through vertices whose edges underwent a capacity change [KT07]. This limits the set of vertices that have to be activated to G_U. This limitation further improves performance, as the growth phase processes all active vertices and normally all vertices adjacent to the source or sink would become active, see also Section 4.3.1.

Algorithm 7: Repair-trees(G)
    foreach v ∈ G_U do
        if v belongs to source tree S then
            if r(s, v) > 0 then
                parent(v) ← s
            else if r(v, t) > 0 then
                parent(v) ← t            // v now belongs to sink tree T
                foreach u ∈ G_U with parent(u) = v do
                    Add u to orphans
        else if v belongs to sink tree T then
            if r(v, t) > 0 then
                parent(v) ← t
            else if r(s, v) > 0 then
                parent(v) ← s            // v now belongs to source tree S
                foreach u ∈ G_U with parent(u) = v do
                    Add u to orphans
    Perform adoption stage

4.3.6 Order of Orphan Processing

As already pointed out in Section 2.1.2, the augmentation phase can create orphans, which need to be adopted to make the search trees consistent again. However, it is also possible that the adoption stage itself creates orphans: if no parent vertex for an orphan v can be found, then v will not belong to any of the trees S or T at the start of the next iteration.

If v is also a parent vertex to any number of children, they will also become orphans, as their parent has no valid connection to any of the trees. The orphans created in the augmentation phase are processed in order of increasing distance to their previous search tree root. This is done because if a valid parent for those orphans is found, all their children will also be adopted by the same search tree, and for each potential parent vertex it is tested whether it has a connection to its terminal. Thus, starting with the vertices closest to the terminals reduces the number of edge checks needed to validate the terminal origin, especially in combination with the timestamp and distance heuristics [KB01]. Orphans created in the adoption stage are processed before the other orphans. That means if an orphan v does not find a valid parent, all its children will be processed before any other orphans. In [Kol04] it is further suggested that these children should be processed in FIFO order to increase performance, as sketched below.
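The processing order can be sketched with a double-ended queue: orphans from the augmentation phase enter at the back, sorted by root distance, while orphans created during adoption enter at the front and are therefore handled first. The structure below is illustrative (OrphanQueue and all member names are assumptions, not the actual interface); whether the adoption-stage orphans among themselves follow the FIFO order suggested in [Kol04] or the LIFO alternative evaluated later in Chapter 5 only changes how they are inserted at the front:

    #include <algorithm>
    #include <deque>
    #include <vector>

    struct OrphanQueue {
        std::deque<int> q;

        // Augmentation-phase orphans: process closest-to-terminal first.
        void addFromAugmentation(std::vector<int> orphans,
                                 const std::vector<int>& dist) {
            std::sort(orphans.begin(), orphans.end(),
                      [&dist](int a, int b) { return dist[a] < dist[b]; });
            for (int v : orphans) q.push_back(v);
        }

        // Adoption-stage orphans jump ahead of all remaining orphans.
        void addFromAdoption(int v) { q.push_front(v); }

        bool empty() const { return q.empty(); }
        int pop() { int v = q.front(); q.pop_front(); return v; }
    };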

4.4 Improvements for the Push-Relabel Algorithm

The basic push-relabel algorithm given by Algorithm 3 has a worst case time complexity of O(|V|²|E|) for a network N = (G, s, t, c) with graph G = (V, E). The push operations Push(u, v) performed by the algorithm can be categorized into saturating pushes, which saturate the edge (u, v), and nonsaturating pushes, which leave the edge (u, v) with residual capacity (r(u, v) > 0). The number of saturating pushes performed is at most |V||E|, while the number of nonsaturating pushes is at most 4|V|²|E| [CLRS01]. The performance of the algorithm is thus mostly dependent on the number of nonsaturating push operations. In the following paragraphs several methods to improve the performance of the push-relabel algorithm by reducing the number of nonsaturating pushes are investigated.

4.4.1 Order of Vertex and Edge Processing

The basic algorithm does not impose any order on the vertices or edges on which the push and relabel operations are performed. To impose an ordering of the vertices, we categorize all vertices v ∈ V \ {s, t} into active and inactive vertices. A vertex v is called active if e(v) > 0 and inactive otherwise. The discharge operation is introduced by Algorithm 8. It works by selecting an active vertex u and then performing push operations Push(u, v) on applicable neighboring vertices v, until u becomes inactive.

Algorithm 8: Discharge(u)
    foreach edge e = (u, v), resuming from recent_edge(u) do
        try Push(u, v)
        if e(v) > 0 ∧ v is inactive then
            remove v from inactive list I_h(v)
            add v to active list A_h(v)
        if all neighbors v visited ∧ e(u) > 0 then
            Relabel(u)
            reset recent_edge(u)
        if e(u) = 0 then
            remove u from active list A_h(u)
            add u to inactive list I_h(u)
            recent_edge(u) ← e
            break

Which active vertex is selected by discharge is now determined by a certain selection strategy. There are two selection strategies available, both of which are able to reduce the time complexity of the push-relabel algorithm to O(|V|³) [GT86]. The first uses a first-in-first-out (FIFO) strategy. Here the algorithm maintains a queue of active vertices. The discharge operation always selects the first vertex in the queue, while vertices becoming active are inserted at the rear of the queue. Another strategy is called highest-label-first. Here the active vertex selected by discharge is always the vertex u with maximum height. The latter strategy benefits more from the additional use of the gap relabeling heuristic [CG94] in most cases and has been determined several times as the implementation with the best performance [CG94, KB01]. It works by maintaining one list of active vertices A_h and one list of inactive vertices I_h for every set of vertices with height h. If a vertex becomes active during the algorithm, it is added to the front of the corresponding active list. The discharge operation then always selects the vertex u with height h from the front of the list A_h, where h satisfies h ≥ h(v) for all v ∈ V \ {s, t}. After the preflow has been initialized as usual, the push-relabel algorithm now consists of executing the discharge operation until no more active vertices are left, resulting in a maximum flow represented by the excess value e(t) [GT86].
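The bucket structure behind the highest-label-first rule can be sketched as follows; HighestLabelBuckets and its members are illustrative names. One list of active vertices is kept per height, and selection simply walks down from the highest non-empty bucket (the inactive lists, needed for the gap relabeling bookkeeping, are omitted here for brevity):

    #include <vector>

    struct HighestLabelBuckets {
        std::vector<std::vector<int>> active;  // active[h]: vertices of height h
        int highest = 0;                       // upper bound on occupied heights

        void activate(int v, int h) {
            active[h].push_back(v);
            if (h > highest) highest = h;
        }

        // Return an active vertex of maximum height for discharge,
        // or -1 if no active vertex is left and the algorithm may stop.
        int selectHighest() {
            while (highest >= 0 && active[highest].empty()) --highest;
            if (highest < 0) return -1;
            int v = active[highest].back();
            active[highest].pop_back();
            return v;
        }
    };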

Also crucial for the performance of the push-relabel algorithm is the limitation of the number of unproductive edge checks. If discharge selects a vertex u, it will eventually terminate when u has no excess flow left. If another vertex processed by discharge activates u again, it is added to the front of the respective list and will eventually be selected by discharge again. Discharge would then iterate over all outgoing edges of u again, including the saturated edges already checked. To prevent this, the most recently checked outgoing edge (u, v) of vertex u is saved before discharge terminates (recent_edge(u)). When u is selected again, discharge resumes the edge iteration from this edge. This strategy is necessary to limit the time complexity of all edge checks to O(|V||E|) [GT86], which is in turn required to meet the bound O(|V|²|E|) for the whole algorithm.

4.4.2 Global Relabeling Heuristic

The relabel operation given by Algorithm 2 is a local operation and will only adjust the height of a vertex v in order to make local push operations applicable. Note that the third condition of a valid height function (Definition 13) ensures that flow is only pushed in the direction of the sink. Due to this condition, the height function represents a lower bound on the length of a shortest path to the sink. However, due to the locality of the relabel operation, even the heights of adjacent vertices will often vary greatly. This leads to unnecessary push operations, as shown in Figure 4.2. Ideally, the height h(v) of a vertex v is equal to the shortest path distance from v to the sink t. That would allow the excess of v to be routed to the sink without any relabel operations and nonsaturating pushes.

To obtain shortest path distances from any vertex v ∈ V \ {s, t} to the sink t, a breadth-first search (BFS) originating from t can be performed. For the directed graph G of the network N = (G, s, t, c), the BFS is done in the residual graph G_f induced by the preflow f currently present in the network, as only the edges in G_f can still increase the flow. Note that the BFS has to be performed backwards from t, traversing the incoming edges. This is done because we want to find the shortest distances to the sink from all other vertices in the residual graph. If, after the BFS, the height of every visited vertex v is set to the number of edges traversed to reach v, this can be regarded as a global relabeling that optimizes the height function.

The backwards breadth-first search has a worst case time complexity of O(|V| + |E|), since at most all vertices and all edges will be visited [CLRS01]. It is, however, rarely the case that all edges are present in the residual graph. Still, if performed too often, the practical performance of the push-relabel algorithm would suffer due to the size of the graph. We already noted that the local relabel operations cause the suboptimal values of the height function. Therefore, the frequency of the global relabeling should depend on the number of conventional relabel operations performed since the last global relabeling. Significant practical performance improvements have been accomplished, e.g., by performing a global relabeling after every |V| local relabelings [CG94].
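The backwards BFS can be sketched as below, assuming for each vertex u a list inArcs[u] of pairs (v, r) describing the residual capacity r of the incoming edge (v, u); the function name globalRelabel and this layout are illustrative. Every vertex reached gets its exact sink distance as its new height, while unreached vertices keep the height n = |V|:

    #include <queue>
    #include <utility>
    #include <vector>

    void globalRelabel(int n, int t,
                       const std::vector<std::vector<std::pair<int, double>>>& inArcs,
                       std::vector<int>& height) {
        std::vector<char> seen(n, 0);
        height.assign(n, n);  // n marks "cannot reach the sink" (and the source)
        std::queue<int> bfs;
        height[t] = 0;
        seen[t] = 1;
        bfs.push(t);
        while (!bfs.empty()) {
            int u = bfs.front();
            bfs.pop();
            for (const auto& arc : inArcs[u]) {
                int v = arc.first;
                double r = arc.second;          // residual capacity of (v, u)
                if (r > 0.0 && !seen[v]) {
                    seen[v] = 1;
                    height[v] = height[u] + 1;  // exact distance to the sink
                    bfs.push(v);
                }
            }
        }
    }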

Figure 4.2: Example of unnecessary push operations prevented by global relabeling. Displayed are excess (green) and height (brown) for every vertex, as well as flow (blue) and capacity (red) for every edge. (a) through (e): flow is pushed back and forth between vertex 1 and vertex 2. (f): only after the height of vertex 2 finally exceeds the height of vertex 3 does the circulation end.

4.4.3 Gap Relabeling Heuristic

The gap relabeling heuristic is one way to reduce the number of nonsaturating push operations, which are critical for the time complexity of the algorithm. This heuristic tries to restrict the vertices subject to push operations to those from which the sink is reachable in the residual graph. It is based on the following observation: if at any time during the execution of the push-relabel algorithm a vertex v with height h(v) exists for which

    h(v) > i > 0    (4.1)

76 4 Performance Improvements for a height i not taken by any vertex, then the sink is not reachable from v in the residual graph [GT86]. This can be explained as follows. As given by Algorithm 1, a vertex u with positive excess and height h(u) will only push flow to a vertex v with P ush(u, v), if h(u) = h(v) + 1. Also a relabel operation, given by Algorithm 2, will only increase the height of a vertex to one above the height of the neighbor with the lowest height. Given our network N = (G, s, t, c) with G = (V, E), let V >i = {v V : h(v) > i} and V <i = {v V : h(v) < i} be the partition of V based on the gap i. Note that t V <i as h(t) is fixed at zero. If now t would be reachable from a vertex u V >i in the residual graph, then there exists an edge (u, v) with v V <i and r(u, v) > 0. However, then h(u) would have been relabeled to h(v) + 1 at most. This contradicts our assumption h(v) < i < h(u) h(u) > h(v) + 1. Thus, if a height is not taken by any vertex, higher vertices will not be able to rout flow to the sink. To use this heuristics, after any relabel operation on a vertex v with previous height h has been performed, it is tested whether v was the only vertex with height h. As for every height value there is one list of active vertices with that height, this is just a matter of testing if the list corresponding to height h is empty. If now vertex v satisfies Equation (4.1), its height will be increased to V. Together with the early termination presented in Section 4.4.4, the vertices which have their height increased to V will not have to be tested for applicability of push operations and can be removed from the corresponding list of active vertices. It has been shown that, especially in combination with the highest-label-first vertex selection rule, that the gap relabeling can significantly improve the practical performance of the push-relabel algorithm [CG94] Early Termination As already stated in section 2.1.3, the push-relabel algorithm terminates with a maximum flow f, if none of the vertices except the sink t has remaining excess. Then the excess of t is equal to the maximum flow value. However, it is also true that the excess of the sink does not change any more, if the following condition holds: v G \ {s, t} : h(v) >= h(s). (4.2) Remember that the height of the source s is fixed at V for a graph G = (V, E). In order to inrease the flow f and thus increase the maximum flow value e(t), a vertex v must have a path to the sink in the residual graph G f. That means v must have a path to t along edges with available residual capacity. However, due to the properties of the height function h, see Definition 13, said path must include at least V + 1 vertices. As the graph only has V vertices, the sequence of vertices must include a cycle and thus can not be a path (Definition 8). 64

As we are only interested in the minimum cut induced by the maximum flow, which does not change once the above condition holds, we can terminate the calculation as soon as all vertices have reached the height of the source. Furthermore, any vertex v reaching the source height |V| can immediately be disregarded for the purpose of increasing the flow.


5 Experimental Results

In this chapter, experimental results produced by the segmentation algorithm described in Chapter 3 are presented and evaluated. Available for the evaluation are RGB images from three video sequences captured by the camera setup described in Section 3.1. They provide a challenging segmentation task due to the similarity of foreground and background colors, as visible in the example images displayed in Figure 5.3. This color similarity not only results from the choice of certain foreground colors similar to the background, but also from a varying amount of exposure and suboptimal lighting that leaves some regions over- or underexposed.

The evaluation begins with a comparison of the clustering results of the two algorithms presented in Section 3.2. Then the convergence behavior of the iterative segmentation approach is investigated. Afterwards, a closer look is taken at the parameters of the energy function and the influence of parameter changes on the segmentation results. An evaluation of the performance of the maximum flow algorithms follows, as these are supposed to be the main performance factor of the whole segmentation algorithm; here the impact of the improvements presented in Chapter 4 is shown. After this comes a comparison of segmentation results for RGB and L*u*v* images. Finally, segmentation results for video sequences are given. All segmentation results and run-times were obtained using Linux on a PC with a 3.16 GHz Intel Core 2 Duo CPU (E8500) and 7.8 GB of RAM. If not stated otherwise, the parameters for the segmentation algorithm were set to K = 5 and γ = 50, as proposed by Rother et al. [RKB04].

5.1 Comparison of Clustering Algorithms

The two clustering algorithms explained in Section 3.2 work differently to cluster the pixels of an image. While the K-means algorithm creates initial random clusters that are refined iteratively, the quantization approach begins with all pixels in a single cluster, which is partitioned hierarchically. The drawback of the K-means algorithm used is its heuristic nature. Because the initial cluster centers are chosen randomly at the beginning, there is no guarantee that the algorithm converges to the optimal clustering, as it only iteratively optimizes a candidate solution. Thus, the clustering result depends on the initial random values for the cluster means.

Another drawback is its tendency to produce clusters that are similar in size, because each pixel is simply assigned to the nearest cluster center. This can result in an imprecise partition of pixels, especially if the colors of an image are distributed unevenly. An example of suboptimal clustering results produced by K-means is displayed in Figure 5.1. Here the lighting used for the scene in Figure 5.1(a) results in overexposed regions, e.g. of the hands and legs of the person sitting in the foreground. Hence, these regions appear similar in color to the white wall in the background of the scene. While the color quantization approach manages to separate the overexposed pixels from the wall pixels in the uncertainty region, they are put in the same cluster by the K-means algorithm.

Figure 5.1: Comparison of the clustering results (K = 5) for the (preliminary) foreground region of image (a), with clusters denoted by color: (a) original image, (b) color quantization, (c) K-means. Notice that K-means puts overexposed regions of the hands and legs in the same cluster (blue) as the background wall.

Due to the iterative energy minimization approach of the segmentation algorithm, variability of the initial clustering is likely to be compensated. However, experiments show that the initial clusters can have a noticeable influence on the segmentation results, as can be seen in Figure 5.2. For example, the foreground of the K-means result displayed in Figure 5.2(b) includes several underexposed background regions at the bottom of the image. In contrast, those parts are successfully segmented to the background by the quantization approach, see Figure 5.2(a). The quantization approach splits a cluster directly at its center, while simultaneously accounting for the direction of the cluster's greatest variance. Especially for large spherical Gaussian clusters this method provides an optimal clustering [BO91]. However, the quantization algorithm performs a singular value decomposition of the covariance matrix of each cluster in each of its iterations, which might negatively affect performance. It is thus of interest whether the use of the quantization algorithm over K-means implies a trade-off between quality and performance.

Figure 5.2: Influence of the initial clustering on the segmentation results: (a) K = 5, γ = 150, color quantization; (b) K = 5, γ = 150, K-means; (c) K = 5, γ = 91, color quantization; (d) K = 5, γ = 91, K-means. Although the segmentation parameters remained constant, the results from K-means clustering show several differences (red circles), which vary with each execution.

To compare the performance of the clustering algorithms, one image of each of the three sequences was selected for evaluation, as pictured in Figure 5.3. Performing both algorithms on these images reveals that the quantization approach outperforms the K-means algorithm, especially for a large number of clusters, see the run-times in seconds given in Table 5.1. Note that the number of iterations of K-means was limited to 10.

Figure 5.3: Input images used for the performance analysis of the clustering algorithms (Table 5.1). Given below each image is the number of foreground (fg) and background (bg) pixels that are subject to clustering.

Table 5.1: Performance overview for the clustering algorithms (K-means with 10 iterations vs. color quantization) for increasing values of K, with run-times given in seconds. Note that the times are given for the clustering performed to initialize the Gaussian mixture models; thus, e.g. K = 5 actually means that 2·K = 10 clusters are created. The times given are an average over the three images displayed in Figure 5.3.

As shown in this section, using the quantization algorithm instead of K-means not only increases the quality of the segmentation, but also increases performance. Additionally, the heuristic nature of the K-means algorithm leads to another problem: the quality of the segmentation results will vary with each K-means clustering due to its randomness. For example, over a series of runs, segmentation results like those displayed in Figures 5.2(b) and 5.2(d) were only produced by about half of the runs. The other half delivered results equal to those displayed in Figures 5.2(a) and 5.2(c). This circumstance makes further evaluations difficult when using K-means clustering, as e.g. the influence of segmentation parameter changes is sometimes not noticeable due to the general variance in quality.

For the above reasons, the quantization approach was used to provide the initial clusters for all segmentations performed in this chapter, if not stated otherwise.

5.2 Convergence of the Iterative Minimization

As already explained in Section 3.4, the presented segmentation algorithm uses an iterative approach to energy minimization [RKB04], representing an extension of the singular minimization approach presented in [BJ01]. In each iteration the image pixels are assigned to the GMM component that most likely produced that pixel's color. Here foreground pixels are only assigned to a foreground GMM component; the same holds for background pixels. Hence, this step only optimizes the representations of the respective color distributions and does not directly optimize the segmentation, as the region classification of pixels does not change, but only their assignment to components within the same GMM. In order to use this updated assignment to improve the segmentation, the GMM parameters like mean and covariance matrix need to be updated to reflect the new color distributions. Instead of rerunning one of the clustering algorithms for this purpose, the GMM parameters can simply be discarded and recomputed based on the updated component assignments. The updated GMMs define an updated energy function, which is then minimized to arrive at an optimized segmentation. This minimization step is the only step in which pixels change their classification from foreground to background, or vice versa. The optimized segmentation can then be used to again improve the color distributions within each region.

It can be shown that the above iterative approach guarantees convergence of the minimized energy function, at least to a local minimum [RKB04]. This holds because each of its three single steps (updating the GMM component assignments, recomputing the GMMs, and updating the segmentation) minimizes the energy E defined by Equation (3.14), with E decreasing monotonically. If the segmentation derived from the energy minimization step, as given in Section 3.4.3, remains constant after a certain iteration, it means that no pixel changed its classification from foreground to background, or vice versa. As the energy E decreases monotonically, very few classification changes can be expected after a certain point, changing the segmentation only insignificantly. Therefore, we end the iterative segmentation procedure if only 0.01% or less of the total image pixels change their classification. For an image with 1920 × 1080 pixels, this corresponds to about 207 pixels or less.
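The overall control flow with this termination criterion can be outlined as follows; the three steps are passed in as callables (assign, refit, and minimize stand in for the GMM component assignment, the GMM refit, and the graph cut minimization) and are assumptions of this sketch rather than the actual interfaces of the implementation:

    #include <cstddef>
    #include <utility>
    #include <vector>

    enum class Label { FG, BG };  // per-pixel classification

    template <class AssignStep, class RefitStep, class MinimizeStep>
    std::vector<Label> iterateSegmentation(std::vector<Label> labels,
                                           AssignStep assign, RefitStep refit,
                                           MinimizeStep minimize) {
        // Stop once 0.01% or fewer of all pixels change their classification.
        const std::size_t threshold =
            static_cast<std::size_t>(labels.size() * 0.0001);
        for (;;) {
            assign(labels);                  // pixels -> GMM components
            refit(labels);                   // recompute the GMM parameters
            std::vector<Label> next = minimize(labels);  // graph cut step
            std::size_t changed = 0;
            for (std::size_t i = 0; i < labels.size(); ++i)
                if (next[i] != labels[i]) ++changed;
            labels = std::move(next);
            if (changed <= threshold) break; // about 207 pixels at 1920x1080
        }
        return labels;
    }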

To analyze the convergence behavior of the segmentation algorithm with the above termination criterion, a total of nine test images were selected, three from each of the three available image sequences. The parameter K was set to 5 for all images, while different values of γ were selected for the three sequences (sequence 1: γ = 91; sequences 2 and 3: γ = 150). The iterative segmentation was aborted as described above if the number of pixel classification changes was equal to or lower than 0.01% of the total pixels. The Kolmogorov maximum flow algorithm was used with all improvements enabled (see Section 5.4). The results of the analysis are displayed in Figure 5.4. They show that the number of pixel classification changes converges after 5 to 13 iterations, with an average of 9 iterations over all images.

Figure 5.4: Example of the convergence behavior of the iterative minimization (pixel classification changes per iteration) for nine images from the three different image sequences: (a) sequence 1 (γ = 91), images 0192, 0243, 0342; (b) sequence 2 (γ = 150), images 0700, 0755, …; (c) sequence 3 (γ = 150), images 1292, 1340, …. The last iteration shown in each graph marks the required number of iterations until all three images reach their convergence criterion in terms of pixel classification changes.

The abortion criterion stated above was used to produce all segmentation results in this chapter. Our experiments featured rare cases in which the result from an earlier iteration was of higher quality than the result from the last iteration. However, this behavior mostly results from an inappropriate choice of parameters, leading to a convergence of the segmentation energy to an inadequate minimum.

5.3 Segmentation Parameters

In this section a closer look is taken at the parameters of the smoothness term, which is defined by the energy in Equation (3.14). Given an image I and a segmentation A, the smoothness term of the aforementioned energy is given by

    E_{smooth}(A, I) = \sum_{(p_n, p_m) \in N} \frac{\gamma}{d(p_n, p_m)} \, e^{-\beta \|p_n - p_m\|^2} \, \delta(\alpha_n, \alpha_m),

where N is the set of all neighboring pixels p_n, p_m ∈ I (see Equation (3.17) for further details). The first parameter we discuss is the variance parameter β. When disregarding β, the smoothness term simply corresponds to a penalty for discontinuities between similarly colored pixels p_n, p_m. However, if the pixels are very different, this penalty should ideally be very small, relaxing the tendency for coherence at object borders. To support this behavior, β is set to [BJ01, RKB04]

    \beta = \frac{1}{2 \langle (p_n - p_m)^2 \rangle},    (5.1)

where p_n, p_m ∈ N and ⟨·⟩ denotes an expected value. Hence, ⟨(p_n − p_m)²⟩ denotes the expected value of the squared Euclidean distance over all neighboring pixels p_n, p_m ∈ N. Thus, if the squared color distance of two neighboring pixels is lower than this expected value, the penalty of the smoothness term for a discontinuity between those pixels is high, and vice versa. In this way, β functions as a global parameter representing the image noise between neighboring pixels, resulting in a reduced weighting of coherence in image regions of high color contrast.
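The computation of β from Equation (5.1) can be sketched as follows: the squared color distance is averaged over all neighboring pixel pairs, visiting the 8-neighborhood through four forward offsets so that each unordered pair is counted exactly once. The types and names (Rgb, computeBeta) are illustrative, not those of the actual implementation:

    #include <cstddef>
    #include <vector>

    struct Rgb { double r, g, b; };

    double computeBeta(const std::vector<Rgb>& img, int width, int height) {
        // Right, down, down-right, down-left: each unordered pair once.
        static const int offs[4][2] = {{1, 0}, {0, 1}, {1, 1}, {-1, 1}};
        double sum = 0.0;
        std::size_t pairs = 0;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                for (const auto& o : offs) {
                    const int nx = x + o[0], ny = y + o[1];
                    if (nx < 0 || nx >= width || ny >= height) continue;
                    const Rgb& p = img[y * width + x];
                    const Rgb& q = img[ny * width + nx];
                    const double dr = p.r - q.r, dg = p.g - q.g, db = p.b - q.b;
                    sum += dr * dr + dg * dg + db * db;
                    ++pairs;
                }
            }
        }
        return 1.0 / (2.0 * (sum / static_cast<double>(pairs)));
    }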

Also present in the smoothness term is the Euclidean distance d(p_n, p_m) in pixel coordinates, found in the denominator of the first fraction. The numerator of that fraction is the parameter γ, controlling the relative importance of the data term and the smoothness term, see Equation (3.14). As d(·) measures distance in pixel coordinates, it equals 1 for pixels that are vertically or horizontally adjacent, and √2 for diagonally adjacent pixels in an 8-neighborhood. Thus, the parameter γ is weakened for diagonal pixel neighbors, diminishing the weight of the smoothness term for these pixels. This relaxes the tendency of the segmentation algorithm to prefer diagonal cuts over horizontal or vertical cuts [BJ01, RKB04, TX06]. The remaining parameter γ is treated as an arbitrary parameter, and its influence on the segmentation results is discussed in the following section.

5.3.1 Influence of Parameter γ

The parameter γ corresponds to the weighting factor λ in the energy subject to minimization and thus controls the weight of the smoothness term relative to the data term. As the smoothness term encourages coherence for regions with similar color values, a high value of γ will, for example, more likely segment similarly colored regions together, even if this segmentation conflicts with the color distributions of foreground and background. On the other hand, a low value of γ will support a segmentation complying with the color cluster information of the Gaussian mixture models, disregarding region coherence. Rother et al. [RKB04] propose a constant value of 50, which has been obtained by comparing the segmentation results for 15 images to ground truth data. This value is supported by experiments described in [BRB+04] performed on a variety of different images.

A first example of the influence of γ on the segmentation results is given in Figure 5.5, where a constant number of clusters K = 5 was used to produce the results. The input image and the trimap used are displayed in Figures 5.5(a) and (b), respectively. Here a value of γ at or below the proposed value of 50 produces a noticeable relaxation of coherence, resulting in discontinuities of the table in the foreground, see Figures (c) and (d). Also, many small underexposed regions of the background behind the plant are segmented into the foreground region, as they are similar in color to the clothes of the persons and to the lower, underexposed region of the chair. Values of 90 to 150 tend to produce quite similar results, with very slightly increasing quality towards 150. Here only a small part of the illuminated upper region of the table is segmented to the background due to its strong color similarity to the wall in the background, see Figure (e). Note that parts of the lower background behind the plant are still segmented to the foreground. This mostly results from the erosion of the depth image, which still leaves some pixels of this region marked as definite foreground, visible in the trimap (b). However, an erosion with a larger structuring element would remove further foreground regions from the trimap, also negatively affecting the segmentation results. Values above 150 did not show any significant changes of the segmentation until reaching a value of 300, where the arm of the person in the foreground is segmented to the background (f). This results from the now prevalent coherence of similarly colored regions due to the large weight of the smoothness term, as the color of the arm is similar to the surrounding background region.

Figure 5.5: Influence of parameter γ on segmentation results with K = 5: (a) input image, (b) input trimap, (c) γ = 20, (d) γ = 50, (e) 90 ≤ γ ≤ 150, (f) γ = 300. The red circles mark segmentation errors.

Another example of the influence of γ is shown in Figure 5.6. Notice that due to unfavorable lighting conditions during capture of the image, some areas are overexposed. For example, the stripes of the right person's sweatshirt are actually white but seem bluish like the background wall, especially at the border region of the foreground, visible in the original image (a). Despite the suboptimal lighting conditions and the large uncertainty region of the trimap, the algorithm was mostly able to correctly segment the persons from the background.

Even for low values of γ like 30 or 50, only a minimal amount of discontinuities of the right person's sweatshirt is visible in the region of the right arm, see (c) and (d). While the results for γ values below 50 erroneously segment a part of the background at the head of the right person into the foreground, this is not the case once the weighting factor of the smoothness term is increased. The best result is achieved with γ = 91 (e), as now also the overexposed part of the left person's head is correctly segmented. Higher γ values above 100 start to include the enclosed background region between the magazine and the right arm of the left person (f). The algorithm was not able to correctly segment the nose regions of the two persons into the foreground for any value of γ, because they appear very similar to the background color due to the lighting conditions.

Figure 5.6: Influence of parameter γ on segmentation results with K = 5: (a) input image, (b) input trimap, (c) γ = 30, (d) γ = 50, (e) γ = 91, (f) γ > 100. The red circles mark segmentation errors.

5.3.2 Influence of Parameter K

Another parameter of the algorithm is the number of clusters K used for the clustering of the image prior to segmentation. These clusters are also used to initialize the components of the two Gaussian mixture models; thus a total of 2·K clusters are created. The number of components of a GMM limits the complexity and precision of the modelling of the underlying color distribution data. For example, assume that some foreground pixels of an image are similar in color to a background cluster. Further assume that the number of these foreground pixels is not large enough to form a separate foreground cluster, because K is too small. Then either those foreground pixels will be sorted into an existing foreground cluster, shifting its mean value, or they will be erroneously sorted into the similar background cluster.

Experiments show that the influence of the parameter K on the segmentation results is limited compared to the influence of γ described in the previous section. An example is given in Figure 5.7. The input image and trimap used are the same as given in Figures 5.5(a) and 5.5(b), respectively. A constant γ value of 150 was used to produce the results. For numbers of clusters below the proposed value of 5, the arms of the two persons in the foreground are segmented into the background. This happens because their color is very similar to the adjacent background and there are not enough other similar foreground pixels to form a separate cluster. Also, some parts of the lower background around the plant are segmented into the foreground because of their similarity to the clothes of the left person and the underexposed regions below the chair.

The experiments also show that increasing the number of clusters does not automatically imply a higher quality segmentation, as can be seen by comparing Figures 5.7(b) and (c). For K = 12 a small part of the background is segmented into the foreground, which is not the case for K = 5. However, a further increase in quality can be observed for the result with K set to 16 (d). Here the top part of the table is correctly segmented into the foreground, despite its strong similarity to the background. Note that some of the upper table pixels are marked as background in the trimap (5.5(b)), resulting from depth image errors and the amount of erosion and dilation necessary to correctly segment the plant into the foreground.

For most of the images from the different sequences, the proposed value of K = 5 is confirmed to deliver satisfying segmentation results. To improve the quality of a nonsatisfying result, it is advisable to first modify γ to adjust the balance between the fit of the Gaussian mixture models and coherence. Once an appropriate value for γ has been found, segmentations can sometimes be improved slightly by increasing the number of clusters from 5 to 16. However, this will of course increase the run-time of the segmentation and the clustering algorithms, see Table 5.1.

Figure 5.7: Influence of parameter K on segmentation results for the image and trimap displayed in Figures 5.5(a) and 5.5(b), with (a) K = 3, (b) K = 5, (c) K = 12, (d) K = 16. The parameter γ is set to 150 for all images. The red circles mark segmentation errors.

5.4 Segmentation Performance Evaluation

One ambition of this work is not only to obtain high quality segmentation results with limited user input, but also to perform the segmentation efficiently to enable the processing of large image sequences. The graph data structures needed to perform the maximum flow calculations were created using the publicly available Boost Graph Library (BGL). This library also features implementations of the push-relabel and the Kolmogorov maximum flow algorithms. However, several modifications, like reusing the flow for successive iterations, are not implemented there. Therefore both algorithms were implemented with the possibility to enable or disable the performance improvements introduced in Chapter 4.

As the maximum flow algorithms are supposed to be the main performance factor of the whole segmentation process, we performed a performance evaluation of both algorithms on a test image displayed in Figure 5.8(a) with the corresponding trimap in Figure 5.8(b). For all of the following performance figures, the image graph is only built for the pixels corresponding to non-black regions of the trimap. The number of graph vertices subject to an 8-connectivity thus corresponds to the number of pixels found in the regions T_FG (white), T_UN (green), and T_BG (red). For the aforementioned image these pixels make up about 56% of the total image pixels. The segmentation parameters were fixed at K = 5 and γ = 91 for all segmentations in this section. To obtain a better overview of the performance behavior of the algorithms, we also executed them on downsampled versions of our image. In this case the parameters for the erosion, the dilation, and the background strip width of the trimap were scaled down accordingly. The downsampling was performed using a mean shift filter.

Figure 5.8: Left: input image for the performance analysis of the maximum flow algorithms. Right: corresponding trimap used, with an unknown region width of 48 pixels (green) and a background strip width of 30 pixels (red).
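The maximum flow computations build on the BGL interface mentioned above. For reference, the following self-contained sketch shows how a small network can be handed to the BGL implementation of the Kolmogorov algorithm; it follows the property-map setup from the BGL documentation, and the tiny four-vertex graph is purely illustrative, not the image graph of the actual implementation. Each edge is paired with an explicit reverse edge, as the algorithm requires:

    #include <boost/graph/adjacency_list.hpp>
    #include <boost/graph/boykov_kolmogorov_max_flow.hpp>
    #include <iostream>

    using Traits = boost::adjacency_list_traits<boost::vecS, boost::vecS,
                                                boost::directedS>;
    using Graph = boost::adjacency_list<
        boost::vecS, boost::vecS, boost::directedS,
        boost::property<boost::vertex_index_t, long,
        boost::property<boost::vertex_color_t, boost::default_color_type,
        boost::property<boost::vertex_distance_t, long,
        boost::property<boost::vertex_predecessor_t, Traits::edge_descriptor>>>>,
        boost::property<boost::edge_capacity_t, double,
        boost::property<boost::edge_residual_capacity_t, double,
        boost::property<boost::edge_reverse_t, Traits::edge_descriptor>>>>;

    // Insert the antiparallel edge pair (u, v) / (v, u) the algorithm expects.
    void addEdgePair(Graph& g, Traits::vertex_descriptor u,
                     Traits::vertex_descriptor v, double cap, double revCap) {
        auto e = boost::add_edge(u, v, g).first;
        auto re = boost::add_edge(v, u, g).first;
        boost::put(boost::edge_capacity, g, e, cap);
        boost::put(boost::edge_capacity, g, re, revCap);
        boost::put(boost::edge_reverse, g, e, re);
        boost::put(boost::edge_reverse, g, re, e);
    }

    int main() {
        Graph g(4);
        auto s = boost::vertex(0, g), a = boost::vertex(1, g);
        auto b = boost::vertex(2, g), t = boost::vertex(3, g);
        addEdgePair(g, s, a, 3.0, 0.0);  // t-links and one n-link, analogous
        addEdgePair(g, s, b, 2.0, 0.0);  // to the image graph of Chapter 3
        addEdgePair(g, a, b, 1.0, 1.0);
        addEdgePair(g, a, t, 1.0, 0.0);
        addEdgePair(g, b, t, 4.0, 0.0);
        double flow = boost::boykov_kolmogorov_max_flow(g, s, t);
        std::cout << "max flow: " << flow << "\n";  // prints: max flow: 4
        return 0;
    }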

Table 5.2 displays the performance results for the push-relabel algorithm in seconds. The times given are an average over the single runs performed in each iteration of the segmentation algorithm until the convergence criterion is reached (see Section 5.2). The first row (w/o heuristics) gives the run-times without the gap relabeling and global relabeling heuristics, but with the order of vertex and edge processing defined by Algorithm 8 (Discharge) in Section 4.4.1. The early termination explained in Section 4.4.4 is also used. The second and third rows display the run-times if only one of the two heuristics is added, while both of them are combined for the results displayed in row (4).

Table 5.2: Performance overview for the push-relabel algorithm (K = 5, γ = 91) with run-times given in seconds, over four image resolutions; rows: (1) w/o heuristics, (2) w. gap relabeling, (3) w. global relabeling, (4) w. gap + global relabeling. Entries for the higher resolutions are not available (n/a).

The results show that global relabeling provides a greater performance increase than the gap relabeling heuristic and that a combination of both increases performance further. However, it is also visible that the run-times increase by a factor greater than 10 from one tested resolution to the next. Accordingly, the run-times for higher resolutions, including the original resolution of the image, exceed several minutes and are thus not given in the table (n/a). This identifies the implemented push-relabel algorithm as unsuited to segment high definition images in the current framework.

Note that it would be possible to remove the early termination mechanism to leave the image graph with a valid maximum flow after each iteration of the segmentation algorithm. This would enable us to reuse the flow in subsequent runs of the push-relabel algorithm, as described in Section 4.2. However, this would only significantly increase the performance of all runs but the first one, which still leaves this maximum flow algorithm too slow for our high definition images.

Displayed in Table 5.3 are the results for the Kolmogorov algorithm. The times given are again an average over all single runs occurring during the iterations of the segmentation algorithm. In the first row all improvements are disabled, including the saving of edges already visited in the growth phase of the algorithm, as explained in Section 4.3.2. Here the run-time almost reaches one minute at one of the intermediate resolutions and increases to several minutes for higher resolutions.

The times given in row (2) show that saving visited edges in the growth stage and restarting from the last visited edge dramatically improves the run-time of the algorithm. As a consequence, this mechanism is enabled, in addition to the denoted improvement, for all other results displayed in rows (3) to (7).

Table 5.3: Performance overview for the Kolmogorov algorithm (K = 5, γ = 91) with run-times given in seconds, over four image resolutions (lowest to highest); rows: (1) w/o improvements, (2) saving visited edges, (3) distance + timestamp heuristics, (4) terminal path augmentation, (5) reusing flow + trees, (6) all improvements, (7) all improvements without orphan ordering. In rows (5) to (7) the first number gives the run-time of the first run only; the number in brackets denotes the average run-time of subsequent runs during iterations of the segmentation algorithm:

    (5) reusing flow + trees       …  (0.01)   0.11 (0.02)   0.46 (0.07)   2.10 (0.25)
    (6) all improvements         0.02 (0.01)   0.07 (0.02)   0.28 (0.08)   1.20 (0.30)
    (7) all (no orphan ord.)     0.02 (0.01)   0.06 (0.02)   0.22 (0.06)   1.03 (0.24)

The distance and timestamp heuristics (see Sections 4.3.3 and 4.3.4) and the augmentation of terminal paths (see Section 4.3.1) provide a marginal increase in performance, with the terminal path augmentation producing a slightly better result, as visible in rows (3) and (4). In row (5) the flow already present on the graph from a previous iteration of the segmentation algorithm is reused for the next iteration (see Section 4.2). This of course implies that the graph is reused too and not built from scratch again (see Section 4.1). Also, the source and sink search trees are reused during subsequent runs of the Kolmogorov algorithm (see Section 4.3.5). While this does not improve the run-time of the first run, it significantly improves the run-time of subsequent runs, as indicated by the numbers in brackets. Row (6) gives the results for the combination of all improvements, including a FIFO processing order for orphans generated in the adoption stage, as described in Section 4.3.6. Row (7) displays the run-times for all improvements, but without the FIFO processing order. Instead, new orphans generated in the adoption stage are added to the front of a single orphan list also containing all other orphans, corresponding to a LIFO processing order. The results show that the run-time is slightly improved using this LIFO order, as can be seen by comparing rows (6) and (7). Enabling the FIFO processing order in addition to the saving of visited edges, without any other improvements, also confirmed this, as the run-times increased compared to those given in row (2).

The improved Kolmogorov algorithm finishes its maximum flow calculation in 1.03 seconds for an image at the highest resolution, while reusing the flow decreases the run-time of subsequent runs to 0.23 seconds. Assuming an average of nine iterations necessary until convergence of the segmentation algorithm (see Section 5.2), the run-times of the nine corresponding runs of the Kolmogorov algorithm sum up to a total of approximately 3.2 seconds.

In Table 5.4 a performance overview for the whole segmentation algorithm is given, using the Kolmogorov algorithm with the improvements discussed above. Note that the graph is also reused for subsequent iterations, only updating the t-link capacities (see Section 4.1), because otherwise the flow cannot be reused. Subject to segmentation were three different images from each of the three different sequences, for a total of nine images with different values of γ, while K was fixed at K = 5. The given run-times are an average over these nine test images. The first number denotes the average run-time for all first iterations, while the number in brackets denotes the average run-time of all subsequent iterations over all images. The first iterations take longer because the graph has to be initialized and the flow cannot be reused from a previous iteration. Thus, a comparison of the first number to the second number reveals how reusing the graph affects the performance of the segmentation.

Table 5.4: Performance overview for the segmentation algorithm (K = 5) for single iterations, using the Kolmogorov algorithm with all improvements enabled; run-times in seconds, averaged over nine test images, over four image resolutions (lowest to highest): … (0.04), 0.37 (0.12), 1.24 (0.30), 5.12 (1.06).

The run-times show that, without reusing the graph, a single iteration takes about 3 to 5 times longer than when reusing the graph. Assuming an average of nine iterations until convergence, the segmentation algorithm finishes in about 13 seconds at the highest resolution. This time diminishes to under 4 seconds at the next lower resolution. The experiments show that segmentations performed at a smaller image resolution tend to converge much earlier than after the assumed nine iterations, so the actual run-times are even shorter. The above run-times also show that the maximum flow calculation by the improved version of the Kolmogorov algorithm is but a fraction of the total execution time of each iteration. Thus, further efforts to improve performance should concentrate on a more efficient implementation of the data structures used.

5.4.1 Segmentation of Downsampled Images

Looking at the execution times in the previous section, we see that the use of downsampled images is another possibility to further improve the performance of the segmentation process. We thus want to know how the quality of the segmentation results is affected if the segmentation is performed on downsampled versions of an image. Figure 5.9 shows an example comparison of segmentation results for the image used during the performance analysis of the maximum flow algorithms (Figure 5.8). Displayed are the results for all resolutions used above. The trimap used (Figure 5.8(b)) was scaled down accordingly by adjusting the amount of erosion and dilation performed on the corresponding depth image and by adjusting the width of the background strip. The same parameters (K = 5, γ = 150) were used for all resolutions.

When looking at the results in Figures 5.9(a) and (b), very few differences are visible. Part of the white background adjacent to the door is segmented into the foreground and a small part of the armrest of the chair is missing from the foreground. The segmentation of the magazine in the center of the foreground is actually better at the lower resolution. Further decreasing the resolution (c) results in an erroneous segmentation of the overexposed head region of the left person. Also, the face region of the right person is not segmented correctly due to the general loss of detail. The results for the lowest resolution (d) are still of acceptable quality. However, fine image structures like the armrests of the chair suffer from the loss of detail and already tend to merge with the background in the input image due to the downsampling process.

Experiments show that reducing the resolution from the full resolution to the next lower one rarely has a negative effect on the quality of the segmentation. This reduction in resolution not only decreases the run-time of the Kolmogorov algorithm by a factor of about 5, but also reduces the time for the initialization process of the graph, improving the performance of the segmentation's first iteration. Thus, a reduction of the resolution can be a worthwhile action to further improve the performance of the segmentation with only slight degradations in quality.

Also, to possibly improve the segmentation of a high resolution image, a segmentation on a much lower resolution version of the image could be performed first. The corresponding result could be upsampled again and used to improve the trimap for the segmentation at the higher resolution. This method could possibly deliver smaller unknown regions around foreground objects and a larger number of definite foreground pixels. Notice that this process is similar to what the algorithm already performs when updating the GMMs based on a preliminary segmentation of a previous iteration. However, if segmentation errors occur in the lower resolution result, like the erroneously segmented background region inside the chair in Figure 5.9(d), this will lead to errors in the higher resolution result, because the corresponding pixels would be marked as definite foreground pixels in the trimap (T_FG).

Figure 5.9: Segmentation results (K = 5, γ = 150) for the image in Figure 5.8, which was downsampled to all sizes used during the performance analysis ((a) through (d), from the highest to the lowest resolution). The trimaps used were scaled down accordingly from the one displayed in Figure 5.8(b). The red circles mark segmentation errors.

5.5 Results for L*u*v* Images

The energy minimized during the execution of the segmentation algorithm combines color and coherence information, while the latter is ultimately also based on the similarity of pixel colors. This similarity is measured by calculating the Euclidean distance of colors in the respective color space of the image. So far we have only used RGB images to produce the segmentation results. In Section 2.2 we also introduced the L*u*v* color model as an alternative to the RGB color model. We mentioned that the L*u*v* model was designed with the intention of perceptual conformity, which is not the case for the device-dependent RGB color model. As we judge segmentation results by comparing them to the segmentation performed by our visual perception system, the use of the L*u*v* color model seems more appropriate for segmentation purposes. The L*u*v* input images used in this section were acquired by first converting our available RGB images to the XYZ color space. These images were then subsequently converted to the L*u*v* color space, as explained in Section 2.2.3.

As the L*u*v* color model uses a different range and representation of color values than the RGB color model, the optimal values for the segmentation parameters γ and K determined for RGB images will most likely not apply. Figure 5.10 shows an example of the influence of γ on the segmentation results for L*u*v* images. For γ = 10 (c), many parts of the lower background are segmented into the foreground. Also, the top part of the table is completely missing due to the reduced coherence weight. The segmentation is only slightly improved for γ = 50 (d). When increasing γ to 70 (e), the table is almost correctly segmented, but now more of the background visible inside the plant is segmented into the foreground. When further increasing γ to 100 or above (f), almost all background regions enclosed inside the plant are erroneously segmented, despite their different color. The example shows that an optimal value of γ is not easily determined, as no value seems to offer the right balance between coherence and color distribution information. It seems that an excessive weight of coherence information must be selected to compensate for errors resulting from the fit of the Gaussian mixture models. Experiments show that the parameter K has little scope to improve the segmentation due to the large weight of the smoothness energy necessary to achieve acceptable results. Using lower values of γ to mitigate the region coherence has been shown to increase the influence of K to about the same degree as for RGB images. Values of K between 8 and 12 produced the best results.

Figure 5.11 shows two input images and their corresponding trimaps that are used for an example comparison of RGB and L*u*v* segmentation results. The actual comparison is displayed in Figure 5.12. The RGB result 5.12(a) for the image in Figure 5.11(a) delivers an almost perfect segmentation. The only visible error is that the left part of the person's glasses is not segmented into the foreground. Here the limitation of the present segmentation technique becomes obvious.

Figure 5.10: Example of the influence of γ on the segmentation results for L*u*v* images (K = 5). (a) Input image. (b) Input trimap. (c) γ = 10. (d) γ = 50. (e) γ = 70. (f) γ = 100. The red circles mark segmentation errors.

As we only sort pixels into either foreground or background, transparent objects will often be segmented incorrectly, because the colors of their corresponding pixels are a mixture of foreground and background colors. Thus, due to the translucency of the glasses, they are consequently segmented into the background. In the L*u*v* result 5.12(b), the translucent part of the glasses is simply cut off, and the left armrest of the chair is also segmented incorrectly.
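This mixing at translucent pixels follows the standard compositing model, where an observed color is a blend C = α·F + (1 − α)·B of a foreground color F and a background color B. A minimal sketch (with made-up colors, purely for illustration) shows why such pixels drift toward the background distribution for small α:

    import numpy as np

    # Observed color of a translucent pixel: alpha * F + (1 - alpha) * B.
    fg = np.array([200.0, 180.0, 160.0])  # made-up foreground color
    bg = np.array([240.0, 240.0, 240.0])  # made-up white background

    for alpha in (1.0, 0.5, 0.1):
        observed = alpha * fg + (1.0 - alpha) * bg
        # A binary labelling must round alpha to 0 or 1; for small alpha
        # the mixed color lies close to the background distribution, so
        # the cut assigns the pixel to the background.
        print(alpha, observed)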

Figures 5.12(c) and (d) display the RGB and L*u*v* segmentation results for the image in Figure 5.11(c). The results are quite similar: both methods fail to correctly segment the chair due to its similarity to the adjacent background, although the L*u*v* result is slightly better regarding the chair. However, the L*u*v* result also cuts off part of the shoulder of the sitting person and part of the table, both of which are visible in the RGB result.

In Section 5.3 we took a closer look at the smoothness term of the energy subject to minimization. There we explained that the Euclidean distance in pixel coordinates is used to mitigate the tendency of the segmentation algorithm to prefer diagonal cuts over horizontal and vertical ones (see the sketch after Figure 5.11). When changing the color model from RGB to L*u*v*, a different measure of similarity is used that possibly does not need such compensation. The expected behavior of this distance term is a slight improvement of segmentations at object corners, which is confirmed by our experiments. However, its influence has been shown to be very marginal, for both L*u*v* and RGB images.

Overall, the experiments have shown that, contrary to expectations, using the RGB color space resulted in higher quality segmentations than the L*u*v* model. As the presented segmentation technique is designed for RGB images, additional modifications might be necessary to benefit from the measure of color similarity provided by the L*u*v* model.

Figure 5.11: Input images and trimaps for the comparison of segmentation results for different color models given in Figure 5.12. (a) Input image 1. (b) Input trimap 1. (c) Input image 2. (d) Input trimap 2.
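Returning to the smoothness term discussed above, the following sketch shows one plausible form of the boundary weight. It assumes the common GrabCut-style formulation [RKB04]; the thesis's exact term and constants may differ.

    import numpy as np

    def smoothness_weight(z_p, z_q, p, q, beta, gamma=91.0):
        # Euclidean color distance in the image's color space
        # (RGB or L*u*v*), as described in the text.
        color_dist2 = float(np.sum((np.asarray(z_p, float)
                                    - np.asarray(z_q, float)) ** 2))
        # Euclidean distance in pixel coordinates: 1 for horizontal and
        # vertical neighbors, sqrt(2) for diagonal ones.  Dividing by it
        # is the compensation for the preference of diagonal cuts.
        pixel_dist = float(np.hypot(p[0] - q[0], p[1] - q[1]))
        return gamma * np.exp(-beta * color_dist2) / pixel_dist

    # In [RKB04], beta is set from the image statistics as
    # beta = 1 / (2 * mean ||z_p - z_q||^2) over all neighboring pairs.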

Figure 5.12: Comparison of segmentation results for the images given in Figure 5.11, using the RGB and L*u*v* color models. (a) RGB result (K = 5, γ = 91). (b) L*u*v* result (K = 12, γ = 70). (c) RGB result (K = 12, γ = 91). (d) L*u*v* result (K = 12, γ = 70). The red circles mark differences between the segmentations.
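For reference, the RGB-to-L*u*v* conversion used for the input images of this section can be sketched as below. This is a minimal version assuming linear RGB in [0,1] with sRGB primaries and a D65 white point; the exact matrix and white point used in the thesis may differ.

    import numpy as np

    # Linear-RGB -> XYZ matrix for sRGB primaries and D65 white
    # (an assumption; the camera characterization may use other values).
    RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                        [0.2126, 0.7152, 0.0722],
                        [0.0193, 0.1192, 0.9505]])
    WHITE = RGB2XYZ @ np.ones(3)  # XYZ of the reference white

    def rgb_to_luv(rgb):
        # Convert linear RGB (shape (..., 3), values in [0,1]) first to
        # XYZ and then to CIE L*u*v*.
        xyz = rgb @ RGB2XYZ.T
        X, Y, Z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
        Xn, Yn, Zn = WHITE

        y = Y / Yn
        L = np.where(y > (6 / 29) ** 3,
                     116.0 * np.cbrt(y) - 16.0,
                     (29 / 3) ** 3 * y)

        denom = X + 15.0 * Y + 3.0 * Z + 1e-12  # guard against black pixels
        up, vp = 4.0 * X / denom, 9.0 * Y / denom
        dn = Xn + 15.0 * Yn + 3.0 * Zn
        upn, vpn = 4.0 * Xn / dn, 9.0 * Yn / dn

        u = 13.0 * L * (up - upn)
        v = 13.0 * L * (vp - vpn)
        return np.stack([L, u, v], axis=-1)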

5.6 Video Segmentation

In Section 5.4 we showed that the present segmentation algorithm is capable of efficiently processing a large number of images. However, when segmenting continuous sequences of images, not only the quality of each single segmentation result is important, but also the stability of the results. For example, if a certain segmentation error is present only on every other frame of a video sequence, the viewer will experience an annoying flickering effect, because the corresponding region changes back and forth between foreground and background.

As already mentioned, the depth data captured by our camera system is unreliable at the border regions of foreground objects. Thus, similar instability errors can also occur in the depth images of our video sequences. Depth and color image pairs can, for example, also be used as input for 3-dimensional displays, enabling depth perception for the observer. Here the depth map determines which of the color image regions should be displayed as foreground and background. If the depth data features errors like those mentioned above, the corresponding flickering effects will be even more annoying for the observer, because objects seem to change their position continuously between foreground and background.

Figure 5.13 shows segmentation results for several frames of a video sequence. For this sequence, a very large size for the erosion structuring element (45 pixels) had to be chosen. This was necessary because the time-of-flight cameras were not properly synchronized with the color camera during capturing, which resulted in a mismatch of depth data and color data, as the time-of-flight cameras captured their data a little earlier than the color camera. The sizes for erosion, dilation (20 pixels), and the background strip (30 pixels) remained constant for all images.

Visible in all results of Figure 5.13 are segmentation errors at the border regions of the foreground objects. These result not only from similar colors but also from the movement of the persons: the colors of the border pixels receive contributions from both foreground and background. To segment these pixels correctly, the segmentation technique would have to be extended to include alpha matting [RKB04].

An example of a flickering artefact is given in the result in the second row of Figure 5.13, where part of the white background wall is segmented into the foreground, while this is not the case for the result in the first row. Also, the result in the last row erroneously segments part of the table into the foreground, which was segmented correctly in all other results. Methods to prevent this behavior try to maintain the consistency of the segmentation over time, according to the following principle: if an object was segmented into the background for all of the previous frames, it is highly unlikely that it belongs to the foreground in the current frame. This can be realized, for example, by linking the graphs of several images together. The edges connecting the graphs are then able to carry information about previous segmentations, stabilizing the results over time [LSyS05].
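As an illustration of this temporal consistency principle, the sketch below turns the foreground history of previous frames into an extra per-pixel bias on the data term. This is a deliberate simplification and not the graph-linking method of [LSyS05], which instead connects the graphs of consecutive frames by additional edges.

    import numpy as np

    def temporal_bias(history, lam=2.0):
        # history: list of previous binary masks (1 = foreground).
        # Returns per-pixel costs to be added to the data term of the
        # current frame: pixels that were consistently background become
        # expensive to label foreground, and vice versa; a simplified
        # stand-in for linking the image graphs over time [LSyS05].
        freq = np.mean(np.stack(history, axis=0), axis=0)  # fg frequency
        eps = 1e-6
        cost_fg = -lam * np.log(freq + eps)        # high where rarely fg
        cost_bg = -lam * np.log(1.0 - freq + eps)  # high where rarely bg
        return cost_fg, cost_bg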

Figure 5.13: Example of video segmentation results (K = 5, γ = 91). Left: input frames of the video sequence. Right: segmentation result for the corresponding frame on the left. The red circles mark temporal artefacts.
