Graphcut-based Interactive Segmentation using Colour and Depth cues


Hu He, University of Queensland, Australia
David McKinnon, Queensland University of Technology, Australia
Michael Warren, University of Queensland, Australia
Ben Upcroft, University of Queensland, Australia, ben.upcroft@uq.edu.au

Abstract

Segmentation of novel or dynamic objects in a scene, often referred to as background subtraction or foreground segmentation, is critical for robust high-level computer vision applications such as object tracking, object classification and recognition. However, automatic real-time segmentation for robotics still poses challenges, including global illumination changes, shadows, inter-reflections, colour similarity of foreground to background, and cluttered backgrounds. This paper introduces depth cues provided by structure from motion (SFM) into interactive segmentation to alleviate some of these challenges. Two prevailing interactive segmentation algorithms are compared: Lazysnapping [Li et al., 2004] and Grabcut [Rother et al., 2004], both based on graphcut optimisation [Boykov and Jolly, 2001]. The algorithms are extended to include depth cues rather than colour only, as in the original papers. Results show that interactive segmentation based on colour and depth cues enhances segmentation performance, yielding a lower error with respect to ground truth.

1 Introduction

Object recognition is an integral component for building robots capable of interacting in human environments. However, real-time object recognition in real scenes remains one of the most challenging problems in computer vision. Many issues must be overcome, such as tracking, robustly identifying objects in complex cluttered backgrounds, and viewpoint changes [Schulz et al., 2001; Batavia and Singh, 2001; Salembier et al., 1997]. Furthermore, computational complexity must be kept to a minimum for mobile robotic applications, as object information can rapidly become obsolete in a dynamic world.

This paper moves towards robust object recognition using reliable object segmentation based on graphcut optimisation which combines both colour and depth cues. Here, we only consider interactive segmentation requiring input from a human operator. However, we propose that depth combined with colour cues can provide rich information enabling reliable automatic segmentation and eventual object recognition for mobile robotics. This paper is motivated by the Lazysnapping [Li et al., 2004] and Grabcut [Rother et al., 2004] algorithms, which are based on graphcut optimisation [Boykov and Jolly, 2001]. Both algorithms require user input for initialisation. For Lazysnapping, users draw strokes on the original colour image indicating foreground and background (Figure 3). Pixels under the strokes are collected to model the terms in an energy function framework. The graphcut algorithm is then employed to minimise the energy function, with the segmented foreground pixels stored in a label vector L. Grabcut initialisation requires users to draw a rectangle around the objects of interest. The pixels internal and external to the rectangular boundary are used to model the energy terms. The graphcut algorithm is run iteratively until the resulting label vector L remains constant. The common attribute of both algorithms is the use of colour as the only cue to model the energy terms; thus both are prone to failure when foreground and background colours are similar. In this paper, we use both colour and depth cues to model the energy terms, while using graphcuts to minimise the new energy function. Depth cues are acquired using 3D reconstruction from structure from motion, an increasingly common vision technology in mobile robotics.
If there are considerable depth differences between foreground and background, even where they have similar colour, segmentation should be improved. Our experimental results show that segmentation based on both colour and depth cues outperforms segmentation relying purely on colour, based on an error evaluation with respect to ground truth.

Figure 1: Statue dataset. The colour image and the corresponding depth image. White pixels indicate closer range, black pixels more distant range.

2 Colour-Based Segmentation

Currently, graphcut-based interactive segmentation uses region (colour or intensity similarity) [Agarwala and Dontcheva, 2004; Barrett and Cheney, 2002; Reese and Barrett, 2002] and/or boundary (colour or intensity contrast) [Gleicher, 1995; Mortensen and Barrett, 1995; 1999] information to construct an objective function, often referred to as an energy function. Hard constraints are then introduced by a user, e.g., by selecting foreground and background pixels. Soft constraints refer to inherent properties, e.g., an assumption of smoothness or piecewise smoothness across pixels. For completeness, we briefly summarise the graphcut optimisation algorithm as follows.

The original image is represented as a graph G = <V, E>, defined as a set of nodes (V) and a set of unordered edges (E) that connect those nodes. The nodes correspond to the pixels in the original image, while the edges encode the relationships between adjacent pixels. Let L = (L_1, ..., L_i, ..., L_|V|) be the binary vector whose elements L_i specify the assignments to pixels i in V. Each L_i can be either background (L_i = 0) or foreground (L_i = 1). The optimised vector L defines the final segmentation.

Figure 2: Camel dataset. The colour image and the corresponding depth image. White pixels indicate closer range, black pixels more distant range.

The energy function used in this paper is similar to the Gibbs energy function described in [Geman and Geman, 1984]:

E(L) = Σ_{i∈V} E_R(L_i) + λ Σ_{(i,j)∈E} E_B(L_i, L_j)    (1)

where E_R(L_i) is the region-based energy, encoding the cost when the label of node i is L_i; E_B(L_i, L_j) is the boundary-based energy, representing the cost when the labels of adjacent nodes (pixels) i and j are L_i and L_j respectively; and λ ∈ [0, 1] indicates the relative importance of the region-based energy versus the boundary-based energy. Note that taking the negative logarithm of both sides of the basic Bayesian formula (2) gives Eq. (3):

P(L|D) ∝ P(D|L) P(L)    (2)

-ln P(L|D) ∝ -ln P(D|L) + (-ln P(L))    (3)

where L denotes the labels for all pixels, D encodes the observed data, i.e., the pixels with pre-defined labels, and P(L|D) is a conditional probability. Comparing (1) and (3), the term E_R(L_i) in the energy function corresponds to the likelihood probability, while the term E_B(L_i, L_j) corresponds to the prior probability. We therefore also refer to E_R(L_i) and E_B(L_i, L_j) as the likelihood energy and prior energy respectively.
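As an illustration of the construction above, the following toy sketch builds the s-t graph for a six-pixel 1D "image" and minimises energy (1) exactly with a hand-rolled Edmonds-Karp max-flow. This is a didactic stand-in, not the paper's implementation (which uses the far faster algorithm of [Boykov and Kolmogorov, 2004]); the function name and the t-link convention (source side = foreground) are illustrative assumptions.

```python
from collections import deque

def min_cut_labels(unary_bg, unary_fg, pairwise, lam=1.0):
    """Binary MRF segmentation via s-t min-cut.

    unary_bg[i] = E_R(L_i = 0), unary_fg[i] = E_R(L_i = 1);
    pairwise = {(i, j): w} gives the boundary cost E_B for adjacent pixels.
    Returns the label vector L with L_i in {0, 1}.
    """
    n = len(unary_bg)
    s, t = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        # Edge s->i is cut iff pixel i lands on the sink (background) side,
        # so its capacity must equal the cost of labelling i background.
        cap[s][i] = unary_bg[i]
        # Edge i->t is cut iff i stays on the source (foreground) side.
        cap[i][t] = unary_fg[i]
    for (i, j), w in pairwise.items():
        cap[i][j] += lam * w      # n-link: paid only when labels differ
        cap[j][i] += lam * w

    def bfs_path():
        parent = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n + 2):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    if v == t:
                        return parent
                    q.append(v)
        return None

    # Edmonds-Karp: augment along shortest residual paths until none remain.
    while (parent := bfs_path()) is not None:
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        f = min(cap[u][v] for u, v in path)     # bottleneck capacity
        for u, v in path:
            cap[u][v] -= f
            cap[v][u] += f

    # Pixels still reachable from s in the residual graph are foreground.
    reach, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in range(n + 2):
            if v not in reach and cap[u][v] > 1e-12:
                reach.add(v)
                q.append(v)
    return [1 if i in reach else 0 for i in range(n)]
```

For example, six pixels whose unaries strongly prefer foreground on the left and background on the right, chained by unit smoothness edges, are cut exactly once, between pixels 2 and 3.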

Figure 3: Lazysnapping GUI. Red strokes select the foreground seed, and blue strokes select the background seed.

Using colour or intensity similarity and contrast, we obtain the weights between each pair of nodes in the graph corresponding to the original image. The combinatorial max-flow/min-cut optimisation algorithm [Boykov and Kolmogorov, 2004] is then employed to minimise the energy function. This groups the nodes into different classes (e.g., source and sink in binary graph theory), which is equivalent to assigning the corresponding pixels to different classes (e.g., foreground and background in bi-layer segmentation).

3 Depth and Colour-Based Segmentation

In contrast to the above formulation, we use both colour and depth cues to construct the energy function (1), which becomes:

E(L) = θ Σ_{i∈V} E_R^c(L_i) + (1-θ) Σ_{i∈V} E_R^d(L_i) + λ [θ Σ_{(i,j)∈E} E_B^c(L_i, L_j) + (1-θ) Σ_{(i,j)∈E} E_B^d(L_i, L_j)]    (4)

where θ denotes the relative importance of the colour and depth terms, the superscripts c and d of the energy terms denote the colour term and depth term respectively, and the other variables have the same meaning as in (1). The depth terms are constructed by the same method within framework (4), and the tuning parameters other than θ are set to the values described in [Li et al., 2004; Rother et al., 2004].

3.1 Colour Data and Depth Data Modelling

In the implementation of Lazysnapping, we select some pixels as foreground and others as background using pen strokes (Figure 3). The corresponding pixels in the depth image are also selected. This results in two sets: foreground set F and background set B. All the known labelled pixels in F and B are used to construct the colour and depth energy terms respectively.

Figure 4: GrabCut GUI. The rectangle is drawn by the user, while the internal thin red line indicates the graph cut between foreground and background generated by Grabcut.
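The weighting in Eq. (4) amounts to a convex blend of the colour and depth energies. The sketch below shows that blend, together with one common form of the contrast-based boundary weight (an exponential of the squared feature difference); the exact boundary terms in Lazysnapping and Grabcut differ in their constants, so treat sigma and the function names here as illustrative assumptions.

```python
import numpy as np

def contrast_weight(fi, fj, sigma=10.0):
    """A typical boundary energy between adjacent pixels: high when the two
    pixels are similar (cutting there is expensive), low across a strong edge.
    fi, fj can be colour triples or scalar depths."""
    diff = np.linalg.norm(np.asarray(fi, float) - np.asarray(fj, float))
    return float(np.exp(-diff ** 2 / (2.0 * sigma ** 2)))

def blend(E_colour, E_depth, theta):
    """Eq. (4): theta weights the colour term, (1 - theta) the depth term.
    Works elementwise for both region (unary) and boundary (pairwise) energies."""
    return theta * np.asarray(E_colour, float) + (1.0 - theta) * np.asarray(E_depth, float)
```

Setting theta = 1 recovers the colour-only energies of the original algorithms, and theta = 0 uses depth alone, matching the parameter sweep in Section 4.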
For the region-based energy term, we use the distance between each unlabelled pixel and the centroids of the selected foreground and background pixels, precomputed using K-means [Kwatra et al., 2003]. Specifically, the selected foreground and background pixels are used to construct foreground and background models with K-means based on the colour distribution. For the boundary-based energy term, we use the gradient, or contrast, between two adjacent pixels.

In the implementation of Grabcut, we use a rectangle on the colour image to divide the pixels into the uncertain set U (internal to the rectangle, used to model the foreground) and the background set B (external) (Figure 4). The corresponding pixels in the depth image are chosen as well. Here we construct a Gaussian Mixture Model (GMM) from each selected set to represent the region-based energy term, while the boundary-based energy term is again constructed from the gradient, or contrast, between two adjacent pixels. Further details on user input and modelling are given in [Li et al., 2004; Rother et al., 2004].

3.2 Segmentation using Lazysnapping and GrabCut

Image resolution strongly affects computational efficiency. To improve efficiency, we use the Watershed algorithm [Vincent and Soille, 1991] to divide the entire image into small patches which still locate boundaries well and preserve small differences between patches. This process is referred to as pre-segmentation in [Li et al., 2004].
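The Lazysnapping-style unary term above can be sketched as follows: cluster the stroke pixels with a small K-means, then score each unlabelled pixel by its normalised distance to the nearest foreground versus background centroid. This is a minimal numpy-only sketch under the assumption of Euclidean colour distance; function names, k, and the iteration count are illustrative, not the paper's settings.

```python
import numpy as np

def kmeans_centroids(X, k, iters=20, seed=0):
    """Plain Lloyd's K-means on an (m, d) float array; returns k centroids."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        a = d.argmin(axis=1)                 # nearest-centroid assignment
        for c in range(k):
            if np.any(a == c):               # skip empty clusters
                C[c] = X[a == c].mean(axis=0)
    return C

def region_energy(pixels, fg_centroids, bg_centroids):
    """Unary energies: E_R(L_i = 1) = d_F / (d_F + d_B) and symmetrically for
    the background, where d_F, d_B are distances to the nearest centroids.
    A pixel close to a foreground cluster is cheap to label foreground."""
    d_f = np.linalg.norm(pixels[:, None, :] - fg_centroids[None], axis=2).min(axis=1)
    d_b = np.linalg.norm(pixels[:, None, :] - bg_centroids[None], axis=2).min(axis=1)
    denom = d_f + d_b + 1e-12
    return d_f / denom, d_b / denom          # (E_R(fg), E_R(bg)) per pixel
```

The same construction is reused unchanged on the depth channel, giving the E_R^d terms of Eq. (4).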

(a) Weight θ = 0  (b) Weight θ = 0.9  (c) Weight θ = 1

Figure 5: Lazysnapping results with varying weights between the colour and depth terms. Note that as θ increases, i.e., as the weight of the colour cue increases, segmentation performance improves due to the depth ambiguity in this dataset.

Lazysnapping runs the graphcut algorithm once to acquire the final segmentation result, while Grabcut runs iteratively until the final segmentation result remains constant.

4 Results

This section presents results from an implementation of the Lazysnapping and Grabcut algorithms using depth and colour cues on two datasets: a statue and a toy camel. Modifications to freely available code provided by Gupta [Gupta, 2005] were used to produce the following results. Each dataset consists of colour images and corresponding depth map images. The depth map for the colour images is obtained from SFM. For a pair of colour images, the 3D positions of a dense set of visual features are estimated using SFM algorithms [Ullman, 1979; Zhang et al., 2006]. We use the GPU to extract SURF features [Bay et al., 2006] and then match them so that the features can be tracked across multiple views. RANSAC [Fischler and Bolles, 1981] is used to discard outliers, i.e., mismatched feature pairs. Finally, the depth map is generated by triangulation and interpolation over multiple views.

The tuning parameter θ in (4) is varied from 0 to 1. If θ = 0, segmentation depends only on depth information, while θ = 1 forces a dependence on colour information only, identical to the original Lazysnapping and Grabcut algorithms. Both colour and depth cues contribute to the segmentation when θ ∈ (0, 1). The images used in this paper are shown in Figure 1 and Figure 2. In Figure 1, there is high contrast in the depth information, which can compensate for the drawbacks of colour alone.
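Within the SFM pipeline just described, the step that actually produces depth is triangulation of each matched (and RANSAC-validated) feature pair. A minimal linear (DLT) triangulation for one correspondence can be sketched as below; the function name and the two-view restriction are assumptions for illustration, since the paper triangulates and interpolates over multiple views.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a single matched feature from two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel coordinates.
    Each image measurement gives two rows of the form u * p3^T - p1^T = 0."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                      # null-space vector = homogeneous 3D point
    return X[:3] / X[3]
```

With noiseless projections the recovered point is exact; in practice each feature's reprojection residual would be checked before the point contributes to the interpolated depth map.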
In Figure 2, by contrast, the dataset has similar depth information, i.e., the objects lie in an approximate plane, so colour information should be more useful in this scenario.

4.1 Lazysnapping based on Colour and Depth

Here the pixels intersecting the red strokes form the foreground set F, and those intersecting the blue strokes form the background set B. Figure 3 illustrates the GUI for Lazysnapping. The final segmentation result is shown in Figure 5.

4.2 GrabCut based on Colour and Depth

With Grabcut, we use a red rectangle to divide the pixels into the background set B (external to the rectangle) and the uncertain set U (internal to the rectangle). The Grabcut GUI is shown in Figure 4. Figure 6 illustrates the final segmentation result with different weights for the colour term. Note that the Grabcut algorithm performs iterative energy minimisation, so the energy should decrease with the number of iterations, as shown in Figure 7.

Figure 7: The energy E for the segmentation converges over 13 iterations.
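Grabcut's outer loop, iterating until the label vector L stops changing, can be caricatured with a deliberately stripped-down sketch: per-class means stand in for the GMMs, and a nearest-mean relabelling stands in for the graph cut. This is an assumption-laden toy to show the stopping criterion only, not the algorithm itself.

```python
import numpy as np

def iterate_until_stable(pixels, init_labels, max_iter=50):
    """Toy version of Grabcut's outer loop: alternately refit the class models
    (here just per-class means) and relabel all pixels, stopping when the label
    vector L is unchanged. The real algorithm refits GMMs and relabels with a
    graph cut; the alternation and stopping rule are the same."""
    pixels = np.asarray(pixels, dtype=float)
    labels = np.asarray(init_labels, dtype=int).copy()
    for _ in range(max_iter):
        mu_fg = pixels[labels == 1].mean()   # refit foreground model
        mu_bg = pixels[labels == 0].mean()   # refit background model
        new = (np.abs(pixels - mu_fg) < np.abs(pixels - mu_bg)).astype(int)
        if np.array_equal(new, labels):      # L constant -> converged
            break
        labels = new
    return labels
```

Starting from a rectangle that wrongly includes one background pixel, the loop reassigns it on the first pass and then stabilises, mirroring the rapid early energy drop seen in Figure 7.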

(a) Weight θ = 0  (b) Weight θ = 0.55  (c) Weight θ = 1

Figure 6: GrabCut results with varying weights between the colour and depth terms. Note that as θ increases, i.e., as the weight of the colour cue increases, segmentation performance degrades due to the colour ambiguity in this dataset.

4.3 Evaluation

To evaluate segmentation performance, we manually segmented both images to form a ground truth (Figure 8). Results were then compared using two evaluation measures: 1) the L2 distance between the segmentation result from the proposed method and the ground truth segmentation, referred to as ε1 in (5); 2) the number of mislabelled pixels compared to the ground truth, as a ratio of the total pixels in the original image, denoted ε2 in (6):

ε1 = ||P - P_G||_2    (5)

ε2 = N_error / N_total    (6)

where P is the intensity of the pixels in the segmentation derived by the proposed method, and P_G is the intensity of the pixels in the ground truth segmentation. In equation (6), N_error is the number of misclassified pixels compared to the ground truth and N_total is the total number of pixels in the original image.

4.4 Result Discussion

In Figure 7, the energy converges after approximately 10 iterations. In particular, it decreases dramatically during the first couple of iterations. This indicates that the Grabcut algorithm still converges quickly even after introducing the depth cue. Figure 9 indicates that the best results are obtained when the weight θ is set to 0.55 for the statue dataset using the Grabcut algorithm. For the camel dataset using the Lazysnapping algorithm, the best weight θ is 0.9, meaning colour information is more important than depth information in this case. The quantitative evaluations show that jointly using colour and depth cues in general achieves better accuracy than using colour alone. Furthermore, segmentation performance can be refined by choosing the weight θ based on the discriminative capabilities of the colour and depth cues.
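The two error measures of Section 4.3 are straightforward to compute; the sketch below evaluates ε1 (5) and ε2 (6) on binary masks. The paper's P, P_G are pixel intensities, which only rescales ε1 relative to the binary case; the function name is illustrative.

```python
import numpy as np

def segmentation_errors(P, P_G):
    """eps1 (Eq. 5): L2 distance between result and ground-truth masks.
    eps2 (Eq. 6): fraction of mislabelled pixels, N_error / N_total."""
    P = np.asarray(P, dtype=float)
    P_G = np.asarray(P_G, dtype=float)
    eps1 = float(np.linalg.norm(P - P_G))
    eps2 = np.count_nonzero(P != P_G) / P.size
    return eps1, eps2
```

Sweeping θ and plotting these two quantities reproduces the shape of the curves in Figure 9.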
Further investigation of Figure 9 shows that the colour information is quite ambiguous for the statue segmentation. The depth information is a more reliable cue and demonstrates good performance, especially as the weight for the depth term increases; however, the accuracy of the statue segmentation slightly decreases as the weight for the depth term is increased further. For this case, segmentation accuracy reaches its maximum at θ = 0.55, and further increasing the importance of the depth information causes background objects to be mislabelled as foreground. By contrast, depth information is more ambiguous for the camel dataset, as all the small objects lie on the table and the depth distributions of the toy camel and the books are quite similar. In this case, increasing the weight of the colour cue refines the segmentation results.

5 Conclusion and Future Work

We have introduced the depth cue into the basic energy function framework for interactive segmentation of a static image. As shown above, better segmentation results are obtained using colour and depth cues together rather than either cue alone. With an appropriate weight θ, the proposed method outperforms the Lazysnapping and Grabcut methods based only on the colour cue. In the future, we would like to explore several directions in which the accuracy of segmentation could be further improved:

- Develop a novel model to automatically construct the foreground and background data selected by the user
- Develop an adaptive way to adjust the weight θ
- Extend the current method to video sequences
- Use both depth and colour to design an automatic segmentation method with reasonable accuracy for object tracking for mobile robotics [Wang et al., 2006; Zhao et al., 2008]

(a) Groundtruth for camel segmentation  (b) Groundtruth for statue segmentation

Figure 8: Groundtruth dataset.

(a) ε1 over different depth weights θ for the statue dataset  (b) ε2 over different depth weights θ for the statue dataset  (c) ε1 over different depth weights θ for the camel dataset  (d) ε2 over different depth weights θ for the camel dataset

Figure 9: Segmentation error versus varying weight θ between colour and depth. (a, b) show the error for the statue segmentation, while (c, d) show the error for the camel segmentation. As can be seen, error decreases as the weight of the depth term increases (to some extent) for the statue dataset, whereas error increases for the camel dataset.

Acknowledgements

This research is supported by CRCMining and a CSC scholarship. The authors acknowledge helpful discussions with Mohit Gupta.

References

[Agarwala and Dontcheva, 2004] A. Agarwala, M. Dontcheva, et al. Interactive digital photomontage. ACM Transactions on Graphics (TOG), 23(3), 2004.

[Barrett and Cheney, 2002] W.A. Barrett and A.S. Cheney. Object-based image editing. ACM Transactions on Graphics, 21(3), 2002.

[Batavia and Singh, 2001] P.H. Batavia and S. Singh. Obstacle detection using adaptive color segmentation and color stereo homography. In IEEE International Conference on Robotics and Automation, volume 1, 2001.

[Bay et al., 2006] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Computer Vision - ECCV 2006, 2006.

[Boykov and Jolly, 2001] Y. Boykov and M.P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In International Conference on Computer Vision, volume 1, 2001.

[Boykov and Kolmogorov, 2004] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.

[Fischler and Bolles, 1981] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 1981.

[Geman and Geman, 1984] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.

[Gleicher, 1995] M. Gleicher. Image snapping. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques. ACM, 1995.

[Gupta, 2005] M. Gupta. Interactive segmentation toolbox. http://www.cs.cmu.edu/~mohitg/Research/segmentation.htm, 2005.

[Kwatra et al., 2003] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. ACM Transactions on Graphics, 22(3), 2003.

[Li et al., 2004] Y. Li, J. Sun, C.K. Tang, and H.Y. Shum. Lazy snapping. ACM Transactions on Graphics (TOG), 23(3), 2004.

[Mortensen and Barrett, 1995] E.N. Mortensen and W.A. Barrett. Intelligent scissors for image composition. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques. ACM, 1995.

[Mortensen and Barrett, 1999] E.N. Mortensen and W.A. Barrett. Toboggan-based intelligent scissors with a four-parameter edge model. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 1999.

[Reese and Barrett, 2002] L.J. Reese and W.A. Barrett. Image editing with intelligent paint. In Proceedings of Eurographics, volume 21, 2002.

[Rother et al., 2004] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH 2004 Papers. ACM, 2004.

[Salembier et al., 1997] P. Salembier, F. Marques, M. Pardas, J.R. Morros, I. Corset, S. Jeannin, L. Bouchard, F. Meyer, and B. Marcotegui. Segmentation-based video coding system allowing the manipulation of objects. IEEE Transactions on Circuits and Systems for Video Technology, 7(1):60-74, 1997.

[Schulz et al., 2001] D. Schulz, W. Burgard, D. Fox, and A.B. Cremers. Tracking multiple moving targets with a mobile robot using particle filters and statistical data association. In Proceedings 2001 ICRA, IEEE International Conference on Robotics and Automation, volume 2, 2001.

[Ullman, 1979] S. Ullman. The interpretation of structure from motion. Proceedings of the Royal Society of London, Series B, Biological Sciences, 203(1153):405, 1979.

[Vincent and Soille, 1991] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6), 1991.

[Wang et al., 2006] L. Wang, M. Liao, M. Gong, R. Yang, and D. Nister. High-quality real-time stereo using adaptive cost aggregation and dynamic programming. In 3D Data Processing, Visualization, and Transmission, Third International Symposium on, 2006.

[Zhang et al., 2006] G. Zhang, J. Jia, W. Xiong, T.T. Wong, P.A. Heng, and H. Bao. Moving object extraction with a hand-held camera. In Proc. International Conference on Computer Vision, pages 1-8, 2006.

[Zhao et al., 2008] T. Zhao, R. Nevatia, and B. Wu. Segmentation and tracking of multiple humans in crowded environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7), 2008.


UNSUPERVISED CO-SEGMENTATION BASED ON A NEW GLOBAL GMM CONSTRAINT IN MRF. Hongkai Yu, Min Xian, and Xiaojun Qi UNSUPERVISED CO-SEGMENTATION BASED ON A NEW GLOBAL GMM CONSTRAINT IN MRF Hongkai Yu, Min Xian, and Xiaojun Qi Department of Computer Science, Utah State University, Logan, UT 84322-4205 hongkai.yu@aggiemail.usu.edu,

More information

FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES

FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES D. Beloborodov a, L. Mestetskiy a a Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University,

More information

III. VERVIEW OF THE METHODS

III. VERVIEW OF THE METHODS An Analytical Study of SIFT and SURF in Image Registration Vivek Kumar Gupta, Kanchan Cecil Department of Electronics & Telecommunication, Jabalpur engineering college, Jabalpur, India comparing the distance

More information

Epipolar Geometry Based On Line Similarity

Epipolar Geometry Based On Line Similarity 2016 23rd International Conference on Pattern Recognition (ICPR) Cancún Center, Cancún, México, December 4-8, 2016 Epipolar Geometry Based On Line Similarity Gil Ben-Artzi Tavi Halperin Michael Werman

More information

Adding a Transparent Object on Image

Adding a Transparent Object on Image Adding a Transparent Object on Image Liliana, Meliana Luwuk, Djoni Haryadi Setiabudi Informatics Department, Petra Christian University, Surabaya, Indonesia lilian@petra.ac.id, m26409027@john.petra.ac.id,

More information

Thin Plate Spline Feature Point Matching for Organ Surfaces in Minimally Invasive Surgery Imaging

Thin Plate Spline Feature Point Matching for Organ Surfaces in Minimally Invasive Surgery Imaging Thin Plate Spline Feature Point Matching for Organ Surfaces in Minimally Invasive Surgery Imaging Bingxiong Lin, Yu Sun and Xiaoning Qian University of South Florida, Tampa, FL., U.S.A. ABSTRACT Robust

More information

Saliency Based Video Object Recognition in Single Concept Video

Saliency Based Video Object Recognition in Single Concept Video Saliency Based Video Object Recognition in Single Concept Video Raihanath A.S 1, Chithra Rani P R 2 1 P G Student, Department of Computer Science, Ilahia College of Engineering & Technology (ICET), Kerala,

More information

[2006] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image

[2006] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image [6] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image Matching Methods, Video and Signal Based Surveillance, 6. AVSS

More information

Visualization 2D-to-3D Photo Rendering for 3D Displays

Visualization 2D-to-3D Photo Rendering for 3D Displays Visualization 2D-to-3D Photo Rendering for 3D Displays Sumit K Chauhan 1, Divyesh R Bajpai 2, Vatsal H Shah 3 1 Information Technology, Birla Vishvakarma mahavidhyalaya,sumitskc51@gmail.com 2 Information

More information

From Orientation to Functional Modeling for Terrestrial and UAV Images

From Orientation to Functional Modeling for Terrestrial and UAV Images From Orientation to Functional Modeling for Terrestrial and UAV Images Helmut Mayer 1 Andreas Kuhn 1, Mario Michelini 1, William Nguatem 1, Martin Drauschke 2 and Heiko Hirschmüller 2 1 Visual Computing,

More information

Image Inpainting and Selective Motion Blur

Image Inpainting and Selective Motion Blur Image Inpainting and Selective Motion Blur Gaurav Verma Dept. of Electrical Engineering, IIT Kanpur 14244, gverma@iitk.ac.in Abstract: This report is presented as a part of the coursework for EE604A, Image

More information

Fundamental Matrices from Moving Objects Using Line Motion Barcodes

Fundamental Matrices from Moving Objects Using Line Motion Barcodes Fundamental Matrices from Moving Objects Using Line Motion Barcodes Yoni Kasten (B), Gil Ben-Artzi, Shmuel Peleg, and Michael Werman School of Computer Science and Engineering, The Hebrew University of

More information

Discrete Optimization Methods in Computer Vision CSE 6389 Slides by: Boykov Modified and Presented by: Mostafa Parchami Basic overview of graph cuts

Discrete Optimization Methods in Computer Vision CSE 6389 Slides by: Boykov Modified and Presented by: Mostafa Parchami Basic overview of graph cuts Discrete Optimization Methods in Computer Vision CSE 6389 Slides by: Boykov Modified and Presented by: Mostafa Parchami Basic overview of graph cuts [Yuri Boykov, Olga Veksler, Ramin Zabih, Fast Approximation

More information

Fast Interactive Image Segmentation by Discriminative Clustering

Fast Interactive Image Segmentation by Discriminative Clustering Fast Interactive Image Segmentation by Discriminative Clustering Dingding Liu Department of Electrical Engineering University of Washington Seattle, WA, USA, 98105 liudd@u.washington.edu Kari Pulli Nokia

More information

arxiv: v4 [cs.cv] 8 Jan 2017

arxiv: v4 [cs.cv] 8 Jan 2017 Epipolar Geometry Based On Line Similarity Gil Ben-Artzi Tavi Halperin Michael Werman Shmuel Peleg School of Computer Science and Engineering The Hebrew University of Jerusalem, Israel arxiv:1604.04848v4

More information

Interactive Shadow Editing from Single Images

Interactive Shadow Editing from Single Images Interactive Shadow Editing from Single Images Han Gong, Darren Cosker Department of Computer Science, University of Bath Abstract. We present a system for interactive shadow editing from single images

More information

Iterated Graph Cuts for Image Segmentation

Iterated Graph Cuts for Image Segmentation Iterated Graph Cuts for Image Segmentation Bo Peng 1, Lei Zhang 1, and Jian Yang 2 1 Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China. 2 School of Computer Science

More information

Image Guided Cost Aggregation for Hierarchical Depth Map Fusion

Image Guided Cost Aggregation for Hierarchical Depth Map Fusion Image Guided Cost Aggregation for Hierarchical Depth Map Fusion Thilo Borgmann and Thomas Sikora Communication Systems Group, Technische Universität Berlin {borgmann, sikora}@nue.tu-berlin.de Keywords:

More information

Announcements. Image Segmentation. From images to objects. Extracting objects. Status reports next Thursday ~5min presentations in class

Announcements. Image Segmentation. From images to objects. Extracting objects. Status reports next Thursday ~5min presentations in class Image Segmentation Announcements Status reports next Thursday ~5min presentations in class Project voting From Sandlot Science Today s Readings Forsyth & Ponce, Chapter 1 (plus lots of optional references

More information

Segmentation. Separate image into coherent regions

Segmentation. Separate image into coherent regions Segmentation II Segmentation Separate image into coherent regions Berkeley segmentation database: http://www.eecs.berkeley.edu/research/projects/cs/vision/grouping/segbench/ Slide by L. Lazebnik Interactive

More information

Active 3D Shape Acquisition Using Smartphones

Active 3D Shape Acquisition Using Smartphones Active 3D Shape Acquisition Using Smartphones Jae Hyun Won won1425@gmail.com Man Hee Lee maninara@gmail.com In Kyu Park pik@inha.ac.kr School of Information and Communication Engineering, Inha University,

More information

Textureless Layers CMU-RI-TR Qifa Ke, Simon Baker, and Takeo Kanade

Textureless Layers CMU-RI-TR Qifa Ke, Simon Baker, and Takeo Kanade Textureless Layers CMU-RI-TR-04-17 Qifa Ke, Simon Baker, and Takeo Kanade The Robotics Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Abstract Layers are one of the most well

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

FOREGROUND SEGMENTATION BASED ON MULTI-RESOLUTION AND MATTING

FOREGROUND SEGMENTATION BASED ON MULTI-RESOLUTION AND MATTING FOREGROUND SEGMENTATION BASED ON MULTI-RESOLUTION AND MATTING Xintong Yu 1,2, Xiaohan Liu 1,2, Yisong Chen 1 1 Graphics Laboratory, EECS Department, Peking University 2 Beijing University of Posts and

More information

Markov Random Fields and Segmentation with Graph Cuts

Markov Random Fields and Segmentation with Graph Cuts Markov Random Fields and Segmentation with Graph Cuts Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project Proposal due Oct 27 (Thursday) HW 4 is out

More information

USER DRIVEN SPARSE POINT-BASED IMAGE SEGMENTATION. Sachin Meena Kannappan Palaniappan Guna Seetharaman

USER DRIVEN SPARSE POINT-BASED IMAGE SEGMENTATION. Sachin Meena Kannappan Palaniappan Guna Seetharaman USER DRIVEN SPARSE POINT-BASED IMAGE SEGMENTATION Sachin Meena Kannappan Palaniappan Guna Seetharaman Department of Computer Science, University of Missouri-Columbia, MO 6511 USA US Naval Research Laboratory,

More information

Automatic Colorization of Grayscale Images

Automatic Colorization of Grayscale Images Automatic Colorization of Grayscale Images Austin Sousa Rasoul Kabirzadeh Patrick Blaes Department of Electrical Engineering, Stanford University 1 Introduction ere exists a wealth of photographic images,

More information

Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision

Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision Zhiyan Zhang 1, Wei Qian 1, Lei Pan 1 & Yanjun Li 1 1 University of Shanghai for Science and Technology, China

More information

VIDEO STABILIZATION WITH L1-L2 OPTIMIZATION. Hui Qu, Li Song

VIDEO STABILIZATION WITH L1-L2 OPTIMIZATION. Hui Qu, Li Song VIDEO STABILIZATION WITH L-L2 OPTIMIZATION Hui Qu, Li Song Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University ABSTRACT Digital videos often suffer from undesirable

More information

A Benchmark for Interactive Image Segmentation Algorithms

A Benchmark for Interactive Image Segmentation Algorithms A Benchmark for Interactive Image Segmentation Algorithms Yibiao Zhao 1,3, Xiaohan Nie 2,3, Yanbiao Duan 2,3, Yaping Huang 1, Siwei Luo 1 1 Beijing Jiaotong University, 2 Beijing Institute of Technology,

More information

Edge tracking for motion segmentation and depth ordering

Edge tracking for motion segmentation and depth ordering Edge tracking for motion segmentation and depth ordering P. Smith, T. Drummond and R. Cipolla Department of Engineering University of Cambridge Cambridge CB2 1PZ,UK {pas1001 twd20 cipolla}@eng.cam.ac.uk

More information

Principal-channels for One-sided Object Cutout

Principal-channels for One-sided Object Cutout Principal-channels for One-sided Object Cutout Lior Gavish, Lior Wolf, Lior Shapira, and Daniel Cohen-Or Tel-Aviv University, Tel-Aviv, Israel Abstract We introduce principal-channels for cutting out objects

More information

CS6670: Computer Vision

CS6670: Computer Vision CS6670: Computer Vision Noah Snavely Lecture 19: Graph Cuts source S sink T Readings Szeliski, Chapter 11.2 11.5 Stereo results with window search problems in areas of uniform texture Window-based matching

More information

Segmentation. Bottom up Segmentation Semantic Segmentation

Segmentation. Bottom up Segmentation Semantic Segmentation Segmentation Bottom up Segmentation Semantic Segmentation Semantic Labeling of Street Scenes Ground Truth Labels 11 classes, almost all occur simultaneously, large changes in viewpoint, scale sky, road,

More information

Animating Characters in Pictures

Animating Characters in Pictures Animating Characters in Pictures Shih-Chiang Dai jeffrey@cmlab.csie.ntu.edu.tw Chun-Tse Hsiao hsiaochm@cmlab.csie.ntu.edu.tw Bing-Yu Chen robin@ntu.edu.tw ABSTRACT Animating pictures is an interesting

More information

Segmentation with non-linear constraints on appearance, complexity, and geometry

Segmentation with non-linear constraints on appearance, complexity, and geometry IPAM February 2013 Western Univesity Segmentation with non-linear constraints on appearance, complexity, and geometry Yuri Boykov Andrew Delong Lena Gorelick Hossam Isack Anton Osokin Frank Schmidt Olga

More information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Mustafa Berkay Yilmaz, Hakan Erdogan, Mustafa Unel Sabanci University, Faculty of Engineering and Natural

More information

CS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching

CS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix

More information

Chaplin, Modern Times, 1936

Chaplin, Modern Times, 1936 Chaplin, Modern Times, 1936 [A Bucket of Water and a Glass Matte: Special Effects in Modern Times; bonus feature on The Criterion Collection set] Multi-view geometry problems Structure: Given projections

More information

Viewpoint Invariant Features from Single Images Using 3D Geometry

Viewpoint Invariant Features from Single Images Using 3D Geometry Viewpoint Invariant Features from Single Images Using 3D Geometry Yanpeng Cao and John McDonald Department of Computer Science National University of Ireland, Maynooth, Ireland {y.cao,johnmcd}@cs.nuim.ie

More information

Stereo and Epipolar geometry

Stereo and Epipolar geometry Previously Image Primitives (feature points, lines, contours) Today: Stereo and Epipolar geometry How to match primitives between two (multiple) views) Goals: 3D reconstruction, recognition Jana Kosecka

More information

Optimization. Intelligent Scissors (see also Snakes)

Optimization. Intelligent Scissors (see also Snakes) Optimization We can define a cost for possible solutions Number of solutions is large (eg., exponential) Efficient search is needed Global methods: cleverly find best solution without considering all.

More information

What have we leaned so far?

What have we leaned so far? What have we leaned so far? Camera structure Eye structure Project 1: High Dynamic Range Imaging What have we learned so far? Image Filtering Image Warping Camera Projection Model Project 2: Panoramic

More information

Extracting Smooth and Transparent Layers from a Single Image

Extracting Smooth and Transparent Layers from a Single Image Extracting Smooth and Transparent Layers from a Single Image Sai-Kit Yeung Tai-Pang Wu Chi-Keung Tang saikit@ust.hk pang@ust.hk cktang@cse.ust.hk Vision and Graphics Group The Hong Kong University of Science

More information

PAPER Video Segmentation with Motion Smoothness

PAPER Video Segmentation with Motion Smoothness IEICE TRANS. INF. & SYST., VOL.E93 D, NO.4 APRIL 2010 873 PAPER Video Segmentation with Motion Smoothness Chung-Lin WEN a), Nonmember, Bing-Yu CHEN b), and Yoichi SATO c), Members SUMMARY In this paper,

More information

Occlusion Detection of Real Objects using Contour Based Stereo Matching

Occlusion Detection of Real Objects using Contour Based Stereo Matching Occlusion Detection of Real Objects using Contour Based Stereo Matching Kenichi Hayashi, Hirokazu Kato, Shogo Nishida Graduate School of Engineering Science, Osaka University,1-3 Machikaneyama-cho, Toyonaka,

More information

An Integrated System for Digital Restoration of Prehistoric Theran Wall Paintings

An Integrated System for Digital Restoration of Prehistoric Theran Wall Paintings An Integrated System for Digital Restoration of Prehistoric Theran Wall Paintings Nikolaos Karianakis 1 Petros Maragos 2 1 University of California, Los Angeles 2 National Technical University of Athens

More information

VIDEO MATTING USING MOTION EXTENDED GRABCUT

VIDEO MATTING USING MOTION EXTENDED GRABCUT VIDEO MATTING USING MOTION EXTENDED GRABCUT David Corrigan 1, Simon Robinson 2 and Anil Kokaram 1 1 Sigmedia Group, Trinity College Dublin, Ireland. Email: {corrigad, akokaram}@tcd.ie 2 The Foundry, London,

More information

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth Common Classification Tasks Recognition of individual objects/faces Analyze object-specific features (e.g., key points) Train with images from different viewing angles Recognition of object classes Analyze

More information

Separating Objects and Clutter in Indoor Scenes

Separating Objects and Clutter in Indoor Scenes Separating Objects and Clutter in Indoor Scenes Salman H. Khan School of Computer Science & Software Engineering, The University of Western Australia Co-authors: Xuming He, Mohammed Bennamoun, Ferdous

More information

Learning to Grasp Objects: A Novel Approach for Localizing Objects Using Depth Based Segmentation

Learning to Grasp Objects: A Novel Approach for Localizing Objects Using Depth Based Segmentation Learning to Grasp Objects: A Novel Approach for Localizing Objects Using Depth Based Segmentation Deepak Rao, Arda Kara, Serena Yeung (Under the guidance of Quoc V. Le) Stanford University Abstract We

More information

An Approach for Real Time Moving Object Extraction based on Edge Region Determination

An Approach for Real Time Moving Object Extraction based on Edge Region Determination An Approach for Real Time Moving Object Extraction based on Edge Region Determination Sabrina Hoque Tuli Department of Computer Science and Engineering, Chittagong University of Engineering and Technology,

More information

Image Blending and Compositing NASA

Image Blending and Compositing NASA Image Blending and Compositing NASA CS194: Image Manipulation & Computational Photography Alexei Efros, UC Berkeley, Fall 2016 Image Compositing Compositing Procedure 1. Extract Sprites (e.g using Intelligent

More information

Pairwise Threshold for Gaussian Mixture Classification and its Application on Human Tracking Enhancement

Pairwise Threshold for Gaussian Mixture Classification and its Application on Human Tracking Enhancement Pairwise Threshold for Gaussian Mixture Classification and its Application on Human Tracking Enhancement Daegeon Kim Sung Chun Lee Institute for Robotics and Intelligent Systems University of Southern

More information

Image Segmentation Via Iterative Geodesic Averaging

Image Segmentation Via Iterative Geodesic Averaging Image Segmentation Via Iterative Geodesic Averaging Asmaa Hosni, Michael Bleyer and Margrit Gelautz Institute for Software Technology and Interactive Systems, Vienna University of Technology Favoritenstr.

More information

Fast Lighting Independent Background Subtraction

Fast Lighting Independent Background Subtraction Fast Lighting Independent Background Subtraction Yuri Ivanov Aaron Bobick John Liu [yivanov bobick johnliu]@media.mit.edu MIT Media Laboratory February 2, 2001 Abstract This paper describes a new method

More information

Image Resizing Based on Gradient Vector Flow Analysis

Image Resizing Based on Gradient Vector Flow Analysis Image Resizing Based on Gradient Vector Flow Analysis Sebastiano Battiato battiato@dmi.unict.it Giovanni Puglisi puglisi@dmi.unict.it Giovanni Maria Farinella gfarinellao@dmi.unict.it Daniele Ravì rav@dmi.unict.it

More information

Map-Enhanced UAV Image Sequence Registration and Synchronization of Multiple Image Sequences

Map-Enhanced UAV Image Sequence Registration and Synchronization of Multiple Image Sequences Map-Enhanced UAV Image Sequence Registration and Synchronization of Multiple Image Sequences Yuping Lin and Gérard Medioni Computer Science Department, University of Southern California 941 W. 37th Place,

More information

Calibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern

Calibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern Calibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern Pathum Rathnayaka, Seung-Hae Baek and Soon-Yong Park School of Computer Science and Engineering, Kyungpook

More information

Supplementary Materials

Supplementary Materials GONG, COSKER: INTERACTIVE SHADOW REMOVAL AND GROUND TRUTH 1 Supplementary Materials Han Gong http://www.cs.bath.ac.uk/~hg299 Darren Cosker http://www.cs.bath.ac.uk/~dpc Department of Computer Science University

More information

Automatic Shadow Removal by Illuminance in HSV Color Space

Automatic Shadow Removal by Illuminance in HSV Color Space Computer Science and Information Technology 3(3): 70-75, 2015 DOI: 10.13189/csit.2015.030303 http://www.hrpub.org Automatic Shadow Removal by Illuminance in HSV Color Space Wenbo Huang 1, KyoungYeon Kim

More information

Data Term. Michael Bleyer LVA Stereo Vision

Data Term. Michael Bleyer LVA Stereo Vision Data Term Michael Bleyer LVA Stereo Vision What happened last time? We have looked at our energy function: E ( D) = m( p, dp) + p I < p, q > N s( p, q) We have learned about an optimization algorithm that

More information

Interactive Differential Segmentation of the Prostate using Graph-Cuts with a Feature Detector-based Boundary Term

Interactive Differential Segmentation of the Prostate using Graph-Cuts with a Feature Detector-based Boundary Term MOSCHIDIS, GRAHAM: GRAPH-CUTS WITH FEATURE DETECTORS 1 Interactive Differential Segmentation of the Prostate using Graph-Cuts with a Feature Detector-based Boundary Term Emmanouil Moschidis emmanouil.moschidis@postgrad.manchester.ac.uk

More information

A robust stereo prior for human segmentation

A robust stereo prior for human segmentation A robust stereo prior for human segmentation Glenn Sheasby, Julien Valentin, Nigel Crook, Philip Torr Oxford Brookes University Abstract. The emergence of affordable depth cameras has enabled significant

More information

A Method of Annotation Extraction from Paper Documents Using Alignment Based on Local Arrangements of Feature Points

A Method of Annotation Extraction from Paper Documents Using Alignment Based on Local Arrangements of Feature Points A Method of Annotation Extraction from Paper Documents Using Alignment Based on Local Arrangements of Feature Points Tomohiro Nakai, Koichi Kise, Masakazu Iwamura Graduate School of Engineering, Osaka

More information

HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION. Gengjian Xue, Li Song, Jun Sun, Meng Wu

HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION. Gengjian Xue, Li Song, Jun Sun, Meng Wu HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION Gengjian Xue, Li Song, Jun Sun, Meng Wu Institute of Image Communication and Information Processing, Shanghai Jiao Tong University,

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information