Superpixel Optimization using Higher-Order Energy


JOURNAL OF LATEX CLASS FILES

Jianteng Peng, Jianbing Shen, Senior Member, IEEE, Angela Yao, and Xuelong Li, Fellow, IEEE

Abstract: A novel superpixel extraction algorithm using a higher-order energy optimization framework is proposed in this paper. We first adopt the k-means clustering technique to quickly obtain an initial superpixel result. A higher-order energy function is then employed to optimize and refine these initial superpixels. We use a more general energy function which includes a first-order data term, a second-order smoothness term and a higher-order term. Pre-segments provide the prior information about edges and segment regions for our higher-order energy term. According to the texture measured in different local regions, our algorithm adaptively computes the ratios of the different energy terms to obtain better superpixels. The experimental results demonstrate that our method generates superpixels with better-aligned boundaries and more homogeneous shapes than existing superpixel algorithms.

Index Terms: Superpixel, higher-order energy, optimization.

I. INTRODUCTION

Superpixels [2] group together pixels of similar characteristics and are widely used in a variety of image processing algorithms such as segmentation [9], [20], [30], [3], [34], synthesis [8], [6], saliency detection [23], [25], [34] and tracking [4]. Grouping pixels allows for a more compact representation of images, which in turn improves computational efficiency. The challenge of extracting superpixels is to preserve enough detail to adhere to image boundaries while keeping the superpixels as compact as possible. One class of algorithms for extracting superpixels [2], [7], [22], [32] is based on a graph topology and optimization framework and includes the well-known normalized cut [2] and graph-cut [7] superpixel algorithms.
Fig. 1. An illustration of the proposed superpixel optimization method. The higher-order adaptive energy optimization (step (b)) is applied to the initial superpixel result ($I_{init}$), which is obtained by applying fast k-means clustering (step (a)) to the input image. The pre-segmentation image ($I_{seg}$) produced by the mean-shift algorithm is used as prior information to construct our higher-order energy term.

This work was supported in part by the National Basic Research Program of China (973 Program) (No. 203CB328805), the National Natural Science Foundation of China (Nos and 62506), the Program for New Century Excellent Talents in University (NCET--0789), the Shaanxi Key Innovation Team of Science and Technology (No. 202KCT-04), and the Specialized Fund for Joint Building Program of Beijing Municipal Education Commission. (Corresponding author: Jianbing Shen.)
J. Peng and J. Shen are with the Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 0008, P. R. China. shenjianbing@bit.edu.cn.
A. Yao is with the Institute of Computer Science II, University of Bonn, Germany. yao@cs.uni-bonn.de.
Xuelong Li is with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), the School of Computer Science, Northwestern Polytechnical University, Xi'an 7029, Shaanxi, P. R. China. xuelong_li@nwpu.edu.cn
Copyright (c) 205 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

The original image is mapped to a graph; superpixels are then extracted from connected subgraphs formed by cutting graph edges. Veksler et al. [7] introduced the use of both a data term and a smoothness term, as is typical in graph-cut based optimization. Zhang et al.
[22] further extended this by using a pseudo-boolean optimization with only two labels to achieve square-like superpixels of similar shape and size. In both works, the ratio of the energy terms remains fixed; our algorithm adapts the ratio based on local image statistics.

Non-graph based superpixel algorithms include the TurboPixel method [3] and the SLIC algorithm [5], [33]. The TurboPixel algorithm [3] formulates the superpixel problem as a set of local region-growing processes and is based on a geometric-flow level-set evolution algorithm. The SLIC algorithm [5] represents each image pixel as a feature point and clusters the feature points into superpixels using fast k-means. The fast k-means, while very efficient, extracts superpixels that are not always well aligned to image boundaries. We adopt a similar scheme to initialize our superpixels but then refine them using the higher-order energy term to improve superpixel alignment.

Graph-cut [5] based energy minimization algorithms have been very popular for solving various image processing problems [3], [7], [24], including that of superpixel extraction. Most algorithms use energy functions with only unary and pairwise terms (e.g. data and smoothness terms) because higher-order functions are difficult to optimize efficiently. Higher-order terms, however, are necessary to capture the complex and rich statistics of natural scenes [9], [4], and several methods have been proposed for the efficient optimization of these terms [5], [10], [4], [2], [35]. Inspired by the improvements seen for higher-order modeling in other image


processing tasks such as segmentation, stereo matching and denoising [26], [9], [10], [2], [28], we propose the use of higher-order terms for superpixel extraction. Our higher-order term makes use of image segmentation results and encourages the superpixels to be well aligned to these segments. Furthermore, we propose an adaptive weighting of the energy terms so as to emphasize the impact of the different energy terms according to local image texture. The resulting superpixels are homogeneous in shape and size but also well aligned to major image boundaries. An overview of the method can be found in Fig. 1.

In the context of previous work on image superpixels, our main contributions can be summarized as follows: (i) a novel image superpixel algorithm, the first of its kind, which uses a higher-order energy function to extract superpixels well aligned to object boundaries; (ii) an adaptation of the energy function to be texture-aware so as to yield superpixels homogeneous in shape and size; (iii) an efficient optimization of the proposed higher-order energy function, in which the fast k-means initialization step is used for acceleration.

II. OUR APPROACH

A. Energy Optimization Formulation

We formulate superpixel extraction as an energy optimization problem. We initialize based on a fast k-means clustering and then optimize the energy function $E(L)$ defined in Eq. (1).

Fig. 2. An illustration of the relationship between the pixel $p$ and the center $p_c$ of the initial superpixel $S_i$. The data term measures the difference between $p$ and $p_c$ by gradient feature. $R_{p_c}$ is a local window (typically 3 × 3) centered at $p_c$.

In the data term of Eq. (2), the window $R_{p_c}$ is centered at $p_c$, where $p_c \in O_{init}$, the set of all superpixel centers in $I_{init}$. $dist_{xy}(p, p_c)$ is the Euclidean distance between pixel $p$ and the center pixel $p_c$ with the same superpixel label. When distances exceed $d_{max}$, $f$ defaults to $\mu_{max}$, a very large upper threshold fixed in our experiments. Fig.
2 shows the relationship of $p$ to $p_c$ visually. The data term $E_{data}$ is then defined as a summation over $P$, the set of all pixels in image $I$:

$$E_{data}(L) = \sum_{p \in P} f(l_p).$$

The overall energy is

$$E(L) = \lambda_u E_{data}(L) + \lambda_p E_{smooth}(L) + \lambda_h E_{high}(L) \tag{1}$$

where $L$ is the superpixel labeling, which is initialized by $I_{init}$ ($I_{init}$ is generated by fast k-means). Each variable in $L$ takes its value from the label set $\mathcal{L} = \{1, 2, \ldots, N_{init}\}$, where $N_{init}$ is the total number of superpixels in $I_{init}$ and is set manually. For an image with $|P|$ pixels, $L$ is a vector of length $|P|$, i.e. $L \in \mathcal{L}^{|P|}$. $E_{data}$ and $E_{smooth}$ are the standard unary data term and pairwise smoothness term used in other graph-cut based energy functions [5], [3], [7], [24], while $E_{high}$ is a higher-order term. $\lambda_u$, $\lambda_p$ and $\lambda_h$ are the weights of the unary, pairwise and higher-order energy terms respectively. When $\lambda_h$ is set to 0, Eq. (1) degenerates to the standard lower-order energy function.

B. Lower-order terms $E_{data}$ and $E_{smooth}$

We compute $E_{data}$ from image gradients rather than following the standard approach of using intensity, since gradients are less sensitive to lighting variations in the scene. The more similar a pixel $p$ is to the center pixels of a superpixel, based on its image gradient value, the more likely it is that $p$ belongs to that superpixel. We define such a similarity measure $f$ as

$$f(l_p) = \begin{cases} |\nabla I_p - \nabla R_{p_c}|, & \text{if } dist_{xy}(p, p_c) < d_{max} \\ \mu_{max}, & \text{otherwise} \end{cases} \tag{2}$$

where $\nabla I_p$ is the gradient value of pixel $p$ and $\nabla R_{p_c}$ denotes the average gradient of the pixels in the local window $R_{p_c}$ centered at $p_c$.

The initial superpixels in $I_{init}$ tend to be irregular in shape, as the fast k-means clustering is extremely sensitive to pixel intensity. We handle the shape irregularity with the smoothness term $E_{smooth}$, where

$$E_{smooth}(L) = \sum_{(p,q) \in N} V_{pq}\, \omega_{pq}\, \varphi\left(L_{init(p)}, L_{init(q)}\right).$$

The smoothness term is computed over $N$, the set of all pairs of neighbouring pixels in image $I$.
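The data term and the weighted sum of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the absolute gradient difference is assumed as the similarity measure, and the function names, the dictionary layout for `centers` and `window_mean`, and the nested-list image format are all hypothetical.

```python
import math

def data_cost(grad_p, mean_grad_center, p, center, d_max, mu_max):
    """Unary cost f(l_p) of assigning pixel p to the superpixel centred at `center`.

    Assumption: the gradient difference is measured as an absolute value.
    """
    if math.dist(p, center) < d_max:
        return abs(grad_p - mean_grad_center)
    return mu_max  # pixels too far from the centre get a very large cost

def data_energy(grad, labels, centers, window_mean, d_max, mu_max):
    """E_data: sum of the unary costs over all pixels.

    grad[y][x]       gradient value of the pixel at (x, y)
    labels[y][x]     superpixel label of the pixel
    centers[label]   (x, y) centre of that superpixel
    window_mean[l]   average gradient in the local window around centre l
    """
    total = 0.0
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            total += data_cost(grad[y][x], window_mean[lab], (x, y),
                               centers[lab], d_max, mu_max)
    return total

def total_energy(e_data, e_smooth, e_high, lam_u, lam_p, lam_h):
    """Weighted combination of the three terms, as in Eq. (1)."""
    return lam_u * e_data + lam_p * e_smooth + lam_h * e_high
```

Setting `lam_h = 0` in `total_energy` recovers the usual data-plus-smoothness energy, which matches the degenerate case noted in the text.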
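$V$ is an indicator variable equal to 1 when $p$ and $q$ have differing superpixel labels, i.e.

$$V_{pq} = \begin{cases} 1, & \text{if } L_p \neq L_q \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

while $\omega$ measures the similarity between neighbouring pixels $p$ and $q$ according to the feature vector $f = \{hue, saturation, intensity\}$:

$$\omega_{pq} = \frac{1}{dist_{xy}(p, q)} \exp\left\{ -\frac{\| f_p - f_q \|^2}{2\sigma^2} \right\}.$$

$\varphi(\cdot)$ controls the extent to which the initial clustering $I_{init}$ influences the smoothness term and is defined as

$$\varphi\left(L_{init(p)}, L_{init(q)}\right) = \begin{cases} 1, & \text{if } L_{init(p)} = L_{init(q)} \\ \beta, & \text{otherwise} \end{cases} \quad \text{where } 0 < \beta < 1. \tag{4}$$

Here, $L_{init(p)}$ and $L_{init(q)}$ are the labels of pixels $p$ and $q$ from $L_{init}$. The function $\varphi$ allows for a preference of preserving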

superpixel boundaries from $I_{init}$ between pixels $p$ and $q$, while $\beta$ influences the probability that these boundaries will change (we set $\beta$ to 0.8 in our experiments).

C. Higher-order term $E_{high}$

We formulate $E_{high}$ to encourage the superpixel boundaries to respect the edges of objects, while reducing boundaries which cut across objects. To approximate object boundaries, we use mean shift [] to obtain a crude segmentation $I_{seg}$ that acts as a prior. We choose mean shift for its speed and edge-preserving capabilities, though other alternatives for generating $I_{seg}$ exist [6], [], [27]. Note that $I_{seg}$ need not be very accurate, since $E_{high}$ can automatically select the right edges from the segments to follow during the optimization process. Let $M$ be the set of all pre-segments in $I_{seg}$ and $C \in M$ be a specific pre-segment of $M$. The higher-order energy term $E_{high}$ is then defined as the sum over all the pre-segments in $M$:

$$E_{high}(L) = \sum_{C \in M} \psi(L_C) \tag{5}$$

where $\psi(\cdot)$ is a penalty term which encourages alignment of superpixel edges to pre-segmentation boundaries:

$$\psi(L_C) = \begin{cases} \gamma_{\max}, & N(L_C) > Q \\ \dfrac{\gamma_{\max}}{Q}\, N(L_C), & \text{otherwise} \end{cases} \tag{6}$$

where $N(L_C)$ represents the number of pixels belonging to the set of superpixels intersecting the boundary of pre-segment $C$; its formula is given below. $\psi(\cdot)$ is proportional to $N(L_C)$ when $N(L_C)$ is not larger than $Q$. $Q$ is an upper limit on the number of pixels which can influence this term. We set $Q = |C|/10$ in our experiments, where $|C|$ denotes the number of pixels in pre-segment $C$.

$$N(L_C) = \sum_{S_L \cap C \neq \emptyset,\ S_L \not\subset C} |S_L \cap C|$$

where $S_L$ is the set of pixels belonging to a superpixel from $L$. As shown in Eq. (6), $\psi(\cdot)$ is capped at a maximum value $\gamma_{\max}$ defined as $\gamma_{\max} = |C|\,[1 + q(C)]$, where $q(C)$ is a measure of segmentation quality defined as

$$q(C) = \begin{cases} \exp\{-var(C)\}, & |C| > \bar{S}_L \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

based on $var(C)$, the variance of the pixel intensity in pre-segment $C$. $\bar{S}_L$ represents the average size of superpixels in $L$.
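The higher-order penalty of Eqs. (5) to (7) can be illustrated with a small Python sketch. Several points here are assumptions rather than statements of the paper: the names `q_limit` and `gamma_max` stand for the pixel limit and the maximum penalty, the linear ramp is scaled so that the penalty is continuous at the limit, and the quality measure uses exp(-variance). Pre-segments are represented as collections of (x, y) coordinates, which is a hypothetical layout chosen for brevity.

```python
import math
from collections import defaultdict

def crossing_pixel_count(labels, segment):
    """N(L_C): pixels of pre-segment C that belong to superpixels
    which are not fully contained in C."""
    segment = set(segment)
    inside = defaultdict(int)   # pixels of each superpixel inside C
    total = defaultdict(int)    # pixels of each superpixel overall
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            total[lab] += 1
            if (x, y) in segment:
                inside[lab] += 1
    return sum(n for lab, n in inside.items() if n < total[lab])

def segment_quality(intensities, avg_superpixel_size):
    """q(C): zero for pre-segments no larger than the average superpixel,
    otherwise exp(-variance) of the intensities (assumed sign)."""
    n = len(intensities)
    if n <= avg_superpixel_size:
        return 0.0
    mean = sum(intensities) / n
    var = sum((v - mean) ** 2 for v in intensities) / n
    return math.exp(-var)

def higher_order_penalty(n_cross, seg_size, quality):
    """psi(L_C): linear in N(L_C) up to the limit, then capped."""
    q_limit = seg_size / 10             # limit set to |C| / 10 in the text
    gamma_max = seg_size * (1 + quality)
    if n_cross > q_limit:
        return gamma_max
    return gamma_max * n_cross / q_limit
```

With this form, a pre-segment whose boundary is crossed by many superpixel pixels pays the same capped penalty regardless of how many pixels cross, so a poor pre-segment cannot dominate the energy.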
As such, when pre-segment $C$ is too small, $q(C) = 0$, which limits the maximum penalty and hence the impact of the higher-order term, thereby preventing small and fragmented pre-segments from affecting the superpixels.

The data and smoothness energy terms can be converted into st-graph structures [5], which can then be optimized with graph-cut [5], [24]. Optimizing the higher-order energy term efficiently, however, requires some approximations, such as order reduction, to transform the higher-order energy term into a lower-order term. Furthermore, if the variables of the higher-order energy function can take multiple labels, they need to be simplified into binary variables after the order reduction. Move-making algorithms are one of the most popular ways of minimizing energy functions with multiple discrete variables, and we use the α-expansion move algorithm [5]. In α-expansion, each variable is assigned some initial label; in each successive iteration, some label α is selected and every non-α variable can either change to α or keep its own label. The algorithm continues to iterate through each possible label α until it reaches convergence. More formally, let the transformation function $T_\alpha$ be defined as

$$T_\alpha(x_i, t_i) = \begin{cases} \alpha, & \text{if } t_i = 0 \\ x_i^p, & \text{if } t_i = 1 \end{cases} \tag{8}$$

where $x_i$ denotes a given variable in the energy function and $x_i^p$ is its former label before the expansion move. $t_i$ is the operation for $x_i$: when $t_i = 0$, the label changes from $x_i^p$ to α, and when $t_i = 1$, it stays the same.

We now describe how to apply α-expansion to minimizing our higher-order energy term. We have defined $C$ as a particular pre-segment from $I_{seg}$; $\psi(x_C)$ is then the corresponding higher-order term. We use the notation $\psi(x_C)$ in place of $\psi(L_C)$ from Eq. (5) in order to emphasize its variable; the two notations have the same meaning. Let $C_{sp}$ be the set of all pixels belonging to superpixels which cross the boundary of pre-segment $C$.
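The expansion move of Eq. (8) and the outer loop over labels can be sketched as follows. This is a toy illustration only: the best move for each α is found by exhaustive search over all binary move vectors, which stands in for the graph-cut step and is feasible only for a handful of variables. The function names and the flat-list labeling are hypothetical.

```python
from itertools import product

def apply_expansion(prev_labels, alpha, moves):
    """T_alpha from Eq. (8): t = 0 switches a variable to alpha,
    t = 1 keeps its previous label."""
    return [alpha if t == 0 else lab for lab, t in zip(prev_labels, moves)]

def alpha_expansion(labels, label_set, energy):
    """Cycle through the labels until no expansion move lowers the energy.

    Assumption: the optimal move is found by brute force here; a real
    implementation would solve each move with a graph cut instead.
    """
    labels = list(labels)
    improved = True
    while improved:
        improved = False
        for alpha in label_set:
            candidates = (apply_expansion(labels, alpha, t)
                          for t in product((0, 1), repeat=len(labels)))
            best = min(candidates, key=energy)
            if energy(best) < energy(labels):
                labels, improved = best, True
    return labels
```

Because a move is accepted only when it strictly lowers the energy and the label space is finite, the loop terminates.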
Let $C_r$ be the intersecting set of pixels, $C_r = C \cap C_{sp}$, and $C_c$ be the complementary set of pixels, $C_c = C \setminus C_{sp}$. Note that we do not assume $C_c$ to be non-empty during the α-expansion. When $C$ is a very small pre-segment which only contains superpixels crossing its boundaries, then $N(L_C) = |C|$ and $\psi(L_C)$ takes its maximum value in Eq. (6). Therefore, regardless of how the labels in $C$ change during the expansion, the higher-order penalty stays the same.

After a single expansion move, let the set $R$ contain the previously non-α pixels now assigned to α, with superpixel labels $x_C^p$ and $x_C^n$ before and after the expansion respectively. The higher-order energy term can then be reformulated into α-expansion terms as follows:

$$\psi(T_\alpha(x_i, t_i)) = \psi(x_C^n) = \min\left\{ \frac{\gamma_{\max}}{Q}\, N(x_C^n),\ \gamma_{\max} \right\} \tag{9}$$

where $Q$ and $\gamma_{\max}$ denote the pixel limit and the maximum penalty of Eq. (6).

In traditional α-expansion, the superpixel labels of all pixels belonging to $C$ have a chance to be chosen as α and expand. However, the expansion move is significantly more complex when the superpixels which cross the boundary of $C$ are also able to expand. As such, we restrict expansion to labels of superpixels which are entirely contained in $C$. According to the location of the newly expanded region $R$, there are two situations in our expansion moves (see Fig. 3): (s.1) when $R$ lies entirely in segment $C$ ($R \subset C$) and (s.2) when $R$ lies on the boundary of segment $C$ ($R \not\subset C$


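, $R \cap C \neq \emptyset$). With $Q$ and $\gamma_{\max}$ denoting the pixel limit and the maximum penalty of Eq. (6), the energy of an expansion move can then be defined as:

$$\psi(T_\alpha(x_i, t_i)) = \begin{cases} \min\left\{ \dfrac{\gamma_{\max}}{Q}\, |C_r \setminus R|,\ \gamma_{\max} \right\}, & \text{s.1} \\ \min\left\{ \dfrac{\gamma_{\max}}{Q} \left( |C_r| + |R \setminus C_{sp}| \right),\ \gamma_{\max} \right\}, & \text{s.2} \end{cases} \tag{10}$$

More specifically,

$$\psi(T_\alpha) = \begin{cases} \dfrac{\gamma_{\max}}{Q}\, |C_r \setminus R|, & \text{if } |C_r \setminus R| < Q \text{ and } |R \setminus C| = 0 \\ \dfrac{\gamma_{\max}}{Q} \left[ |C_r| + |R \setminus C_{sp}| \right], & \text{if } |C_r| + |R \setminus C_{sp}| < Q \text{ and } |R \setminus C| \neq 0 \\ \gamma_{\max}, & \text{otherwise} \end{cases} \tag{11}$$

Here we denote $t_C = \{t_i \mid i \in C\}$, and let $f_k^m(t_C)$ be the number of variables in segment $C$ taking the operation $k$ ($k \in \{0, 1\}$). We then define $f_k^m(t_C)$ as follows:

$$f_k^m(t_C) = \begin{cases} f_0^m(t_C) = \sum_{i \in C} (1 - t_i), & \text{if } k = 0 \\ f_1^m(t_C) = \sum_{i \in C} t_i, & \text{if } k = 1 \end{cases} \tag{12}$$

We can therefore transform our higher-order energy by the α-expansion move algorithm to:

$$\psi(T_\alpha) = \begin{cases} \dfrac{\gamma_{\max}}{Q}\, f_1^m(t_{C_r}), & \text{if } f_1^m(t_{C_r}) < Q \text{ and } f_0^m(t_{C_{sp} \setminus C_r}) = 0 \\ \dfrac{\gamma_{\max}}{Q} \left[ f_0^m(t_{C_c}) + |C_r| \right], & \text{if } f_0^m(t_{C_c}) < Q - |C_r| \text{ and } f_0^m(t_{C_{sp} \setminus C_r}) \neq 0 \\ \gamma_{\max}, & \text{otherwise} \end{cases} \tag{13}$$

Fig. 3. Different situations in the α-expansion moves. Situation 1 means that the pixels with their labels chosen as α are entirely in segment $C$. Situation 2 means that the group of pixels with the newly obtained label α lies on the boundary of segment $C$.

Fig. 4. The st-graph obtained by the Theorem in Eq. (13).

Fig. 5. Comparison between data-smoothness energy optimization and data-smoothness-higher-order energy optimization. (a) is the result of the lower-order energy optimization; (b) is the result of our higher-order energy optimization.

By introducing the additional binary variables $m_0$ and $m_1$, the above higher-order energy function can be approximately transformed into a lower-order pairwise energy function (see Eq. (14)), which allows our higher-order energy function to be solved with graph-cut.

Theorem 1. The α-expansion move energy can be approximately transformed into the second-order pairwise energy

$$\psi(T_\alpha(x_i, t_i)) = \min_{m_0, m_1} \left\{ \theta \sum_{i \in C_r} t_i (1 - m_0) + r_0 m_0,\ \ \theta \sum_{i \in C_c} (1 - t_i)(1 - m_1) + r_1 m_1 + \delta \right\} \tag{14}$$

where $\theta = \gamma_{\max}/Q$, $r_0 = \gamma_{\max}$, $r_1 = \theta\,(Q - |C_r|)$, and $\delta = \theta\,|C_r|$. The α-expansion move energy can be upper-bounded by the second-order pairwise energy.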
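The role of an auxiliary binary variable in this kind of order reduction can be checked numerically on the simplest case, a linear penalty capped at a maximum value. The sketch below is a generic illustration of that trick, not a reproduction of Eq. (14): `theta` and `gamma_max` are illustrative names for the slope and the cap, and `m` is a single hypothetical auxiliary variable.

```python
from itertools import product

def truncated_linear(t, theta, gamma_max):
    """Higher-order form: depends on all variables at once through their sum."""
    return min(theta * sum(t), gamma_max)

def pairwise_form(t, theta, gamma_max):
    """Equivalent form with one auxiliary binary variable m.

    Every term is either unary in m or a product t_i * (1 - m),
    so the expression is pairwise and can be mapped to an st-graph.
    """
    return min(theta * sum(ti * (1 - m) for ti in t) + gamma_max * m
               for m in (0, 1))

def forms_agree(n_vars, theta, gamma_max):
    """Exhaustively compare both forms over all binary assignments."""
    return all(truncated_linear(t, theta, gamma_max)
               == pairwise_form(t, theta, gamma_max)
               for t in product((0, 1), repeat=n_vars))
```

Choosing m = 0 selects the linear part and m = 1 selects the cap, so minimizing over m reproduces the truncation exactly for every assignment of the original variables.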
By minimizing this upper bound, we can make expansion moves that always decrease the original energy. The detailed proof of this transformation is given in Appendix A. We build the st-graph for the α-expansion move algorithm as shown in Fig. 4. The comparison between the lower-order energy and our higher-order energy is given in Fig. 5. The quality of the superpixels is improved by adding the higher-order term to our energy function.

D. Texture-aware adaptive energy optimization

By selectively setting the λs in Eq. (1) to zero, one can determine the impact of each term; Fig. 6 shows an example. The superpixels extracted using only the data term (Fig. 6(a)) tend to match object edges better but are less homogeneous in size and shape, while the superpixels extracted using only the smoothness term (Fig. 6(b)) have smooth boundaries but are less aligned to image edges. Using both the data and smoothness terms (Fig. 6(c)) offers some tradeoff between the two, but the superpixels lack regularity and some are still fragmented. Including the higher-order energy term, however,


as in Fig. 6(d), results in superpixels regular in size which adhere nicely to objects in the image.

Fig. 6. Impact of the different terms in the energy function: (a) data term only, $(\lambda_u, \lambda_p, \lambda_h) = (1, 0, 0)$; (b) smoothness term only, $(\lambda_u, \lambda_p, \lambda_h) = (0, 1, 0)$; (c) data and smoothness terms, $(\lambda_u, \lambda_p, \lambda_h) = (1, 1, 0)$; (d) data, smoothness and higher-order terms, $(\lambda_u, \lambda_p, \lambda_h) = (1, 1, 1)$.

Fig. 7. Image patches (from p(1) to p(6)) have the same size. The ratio of each energy term in our adaptive energy function is determined by both the mean and the variance of their gradients.

Fig. 8. Strategy of our image patch-based adaptive optimization: (a) calculate $G_{mean}$ and $G_{var}$ for patch p(i); (b) use the piecewise functions in Fig. 9 to compute the corresponding $\lambda_u$, $\lambda_p$ and $\lambda_h$ for the energy function; (c) optimize the energy function and get the superpixel result of patch p(i); (d) select the next patch p(i+1) according to the clustering center points; (e) repeat steps (a) to (d) for patch p(i+1).

Fig. 9. The piecewise functions for computing $\lambda_u$ and $\lambda_h$.

We observe that local image regions can be divided into three types according to their local texture: plain regions with little to no texture, e.g. cloudless sky or monochrome backgrounds; highly textured regions, e.g. fur, feathers or grass; and edge regions with some texture that contain boundaries of objects. For plain regions, the smoothness term plays a strong role in the optimization in order to obtain homogeneous superpixels, and as such the ratio of $\lambda_p$ should be much higher than in textured regions. In regions with object boundaries, however, the data and higher-order terms have a much bigger influence, and the more complicated these regions are, the higher the ratios of $\lambda_u$ and $\lambda_h$ should be with respect to $\lambda_p$.
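The patch statistics of step (a) in Fig. 8, and one possible shape for a piecewise weight function, can be sketched as follows. Everything specific here is an assumption made for illustration: gradients are taken as forward-difference magnitudes, the weight is a clamped linear ramp between two thresholds, and the function and parameter names are hypothetical. The actual piecewise functions are those plotted in Fig. 9.

```python
def patch_gradient_stats(patch):
    """Return (G_mean, G_var): mean and variance of the gradient
    magnitudes of a 2-D patch, using forward differences."""
    h, w = len(patch), len(patch[0])
    mags = []
    for y in range(h - 1):
        for x in range(w - 1):
            gx = patch[y][x + 1] - patch[y][x]
            gy = patch[y + 1][x] - patch[y][x]
            mags.append((gx * gx + gy * gy) ** 0.5)
    mean = sum(mags) / len(mags)
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    return mean, var

def ramp(value, low, high, w_min, w_max):
    """Clamped linear ramp: w_min below `low`, w_max above `high`,
    linear interpolation in between (assumed shape)."""
    if value <= low:
        return w_min
    if value >= high:
        return w_max
    return w_min + (w_max - w_min) * (value - low) / (high - low)
```

A flat patch yields zero mean and variance, so a ramp of this kind would assign it the smallest data and higher-order weights, which agrees with the qualitative behaviour described for plain regions.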
Based on our observations of these three types of image regions, we adjust the λs according to the texture of image patches centered on the respective superpixels (see Fig. 8 for an overview). Measuring the mean and variance of image patch gradients is a good way of classifying an image region into one of these three types. Plain regions have low means and variances, while highly textured regions have higher mean gradients but still relatively low variance. Finally, edge regions have both high mean and high variance gradients. We illustrate this in Fig. 7 and Table I. We fix $\lambda_p$ to 10 in our experiments and then adjust $\lambda_u$ and $\lambda_h$ according to the gradient mean $G_{mean}$ and the gradient variance $G_{var}$. We use a piecewise linear function based on four parameters: $\xi_{m1}$, $\xi_{m2}$, $\xi_{v1}$ and $\xi_{v2}$ (see Fig. 9). To set these parameters, we chose 50 images from the Berkeley segmentation dataset and randomly cut 9 image patches (3 of each texture type) from each. We compute the mean and variance of the image gradients in each of these patches and use the computed minima and maxima to set $\xi_{m1} = 1.5$, $\xi_{m2} = 5.5$, $\xi_{v1} = 1.0$, $\xi_{v2} = 2.5$ for our experiments. We summarize the overall texture-adaptive higher-order energy optimization in Algorithm 1.

TABLE I. Mean ($G_{mean}$) and variance ($G_{var}$) of the gradients in image patches p(1) to p(6) of Fig. 7 (the numerical entries are not recoverable from this copy).

E. Discussion of superpixel absorption

In the process of α-expansion, it is possible that some of the superpixels get absorbed by other expanding superpixels. This absorption phenomenon is rare; in most iterations, the superpixels only change shape and become better aligned to the object. However, when a superpixel does get absorbed, we still continue using its center from $I_{init}$ to locate its patch, for computational simplicity. Fig. 10 gives an example of such a case. On the top left, the red and blue dots denote the centers of


two initial superpixels before the optimization process.

Algorithm 1. Texture-adaptive higher-order energy optimization for extracting superpixels
initialize: superpixels $I_{init}$ and pre-segments $M$
for each superpixel $S_i$ in $I_{init}$ do
  1) find the center pixel $(x_i, y_i)$ of $S_i$ and the associated image patch $p_i$ (see Fig. 8)
  2) compute $G_{mean}$ and $G_{var}$ of $p_i$ and the associated $\lambda_u^i$, $\lambda_p^i$ and $\lambda_h^i$ (see Fig. 9)
  3) determine the superpixel labels $L_i$ of $p_i$ by minimizing Eq. (1)
  4) $I_{init} \leftarrow L_i$
end for
output: $I_{final} = I_{init}$

The α-expansion process with superpixel absorption (Fig. 10).
The i-th iteration in the red patch:
  1. Choose a as α and expand: (a) → (b)
  2. Choose b as α and expand: (b) → (c)
  3. Choose c as α and expand: (c) → (d)
  4. Label d is absorbed by superpixel c: (d) → (e)
  5. Choose e as α and expand: (e) → (f)
  6. Go back to step 2, choose the labels a, b, c, e as α to do the α-expansion, until we get the minimum energy.
The (i+1)-th iteration in the blue patch:
  7. Move the optimization patch from p(i) to p(i+1): (f) → (g)
  8. Choose a as α and expand: (g) → (h)
  9. Choose f as α and expand: (h) → (i)
  10. Choose c as α and expand: (i) → (j)
  11. Choose b as α and expand: (j) → (k)
  12. Choose e as α and expand: (k) → (l)
  13. Go back to step 8, choose the labels a, f, c, b, e as α to do the α-expansion, until we obtain the minimum energy.

Fig. 10. Illustration of the α-expansion process with superpixel absorption and the label distribution in the i-th and (i+1)-th iterations, panels (a) to (l). The purple dot in (g) is the new center of the newly fused superpixel after the absorption.

We first optimize the red patch p(i) in the i-th iteration, which absorbs the blue superpixel, making a newly fused superpixel centered at the purple dot (Fig. 10(g)). However, we continue using the blue dot as the center of p(i+1) in the (i+1)-th patch optimization iteration.
This simplification is based on the assumption that the centers of absorbed superpixels will be relatively close to the centers of the newly created ones, thereby having little influence on the local texture measurements and the associated λs. As shown in Fig. 10, the blue superpixel is absorbed in (d), and the purple dot in (g) is the new center of the fused superpixel, which is close to the blue dot. Therefore, the textures around the purple dot and the blue dot are similar. Using the initial center as the new center of the absorbed superpixels is thus an approximate computation.

III. EXPERIMENTAL RESULTS

We conduct a variety of experiments to evaluate the effectiveness of the proposed algorithm. We use the test images from the Berkeley segmentation dataset (BSD), which includes three hundred images with human-labeled segmentations as ground truth, and compare the results of our proposed algorithm against those of the SLIC [5], graph-cut (GC) [7] and pseudo-boolean optimization (PB) [22] methods.

A. Qualitative comparison

The experimental results of SLIC, GC and PB are all generated from implementations provided by the authors on their respective websites. To make the comparison as fair as possible, we run their algorithms with the generic parameter settings provided by the authors. We make a qualitative comparison of the superpixels extracted by our method with those of these three methods in Fig. 11. The SLIC approach [5] generates superpixels which are well aligned to the edges of objects, but the superpixels themselves have non-smooth boundaries (see the first column of Fig. 11). The reason is that the SLIC method only considers color and location features in the k-means clustering stage, which makes the superpixel boundaries sensitive to pixel color. The GC method [7] uses both a data and a smoothness term and generates superpixels with smooth boundaries that are, however, poorly aligned to objects in the scene (see the second column of Fig. 11).
The PB method [22] initializes the superpixels as narrow vertical and horizontal strips of equal width before the pseudo-boolean optimization, which strongly restricts the basic shape of the resulting superpixels (see the third column of Fig. 11). In comparison, our method, by adopting both lower-order and higher-order energy terms, is able to align the superpixel boundaries to object edges adaptively (see the last column of Fig. 11), resulting in smooth, uniformly


shaped superpixels in the sky regions and better-aligned superpixels in regions containing object boundaries.

Fig. 11. The superpixels produced by SLIC [5], graph-cut (GC) [7], pseudo-boolean optimization (PB) [22] and our method respectively (from left to right).

B. Quantitative comparison

In addition to the qualitative comparison, we evaluate and compare our method to other superpixel approaches quantitatively. We adopt three quantitative measures [3], [7], [2]: boundary recall (BR), achievable segmentation accuracy (ASA) and undersegmentation error (UE). Let $S$ and $G$ denote the labels in the superpixel result and the ground truth respectively, and let $n_s$ and $n_g$ be the number of segments in each set. Furthermore, let $S_k$ denote a superpixel in $S$ and $G_i$ a segment in $G$, i.e. $S_k \in S = \{S_1, S_2, \ldots, S_{n_s}\}$ and $G_i \in G = \{G_1, G_2, \ldots, G_{n_g}\}$. The boundaries of the superpixels and of the ground truth segments are denoted $\delta S$ and $\delta G$.

Boundary Recall measures the percentage of ground truth boundary pixels $p \in \delta G$ that are within a distance $\epsilon$ of some superpixel boundary pixel $q \in \delta S$:

$$BR_G(S) = \frac{\sum_{p \in \delta G} \phi\left( \min_{q \in \delta S} dist_{xy}(p, q) < \epsilon \right)}{|\delta G|} \tag{15}$$

where $\phi(\cdot)$ is an indicator function equal to 1 when its argument is true and 0 otherwise. As such, the better the superpixels are aligned to the ground truth segments, the larger BR will be. We set $\epsilon = 2$ in our experiments.

Achievable Segmentation Accuracy measures the percentage of pixels in the superpixels $S$ that can be used to form the ground truth segments $G$. For every superpixel $S_k$, we can find the segment $G_i$ that has the largest overlap; the fraction of the overlap is the achievable part and can be calculated as:

$$ASA_G(S) = \frac{\sum_k \max_i |S_k \cap G_i|}{\sum_i |G_i|} \tag{16}$$

where the denominator $\sum_i |G_i|$ is the total number of pixels in the image.

Under-segmentation Error measures the number of pixels within a superpixel which are located outside of a segmentation boundary.
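The boundary recall and achievable segmentation accuracy measures of Eqs. (15) and (16) can be sketched in plain Python as follows. This is an illustrative sketch rather than the evaluation code used for the reported results: the boundary definition (a pixel whose right or lower neighbour carries a different label) is an assumption, label maps are nested lists, and the brute-force nearest-boundary search is only suitable for small images.

```python
import math
from collections import Counter

def boundary_pixels(labels):
    """Pixels whose right or lower neighbour has a different label
    (assumed boundary definition)."""
    h, w = len(labels), len(labels[0])
    out = set()
    for y in range(h):
        for x in range(w):
            right = x + 1 < w and labels[y][x + 1] != labels[y][x]
            below = y + 1 < h and labels[y + 1][x] != labels[y][x]
            if right or below:
                out.add((x, y))
    return out

def boundary_recall(gt, sp, eps=2):
    """Fraction of ground-truth boundary pixels lying within eps
    of some superpixel boundary pixel."""
    gb, sb = boundary_pixels(gt), boundary_pixels(sp)
    if not gb:
        return 1.0
    hits = sum(1 for (x, y) in gb
               if any(math.hypot(x - u, y - v) < eps for (u, v) in sb))
    return hits / len(gb)

def achievable_segmentation_accuracy(gt, sp):
    """Sum over superpixels of their largest overlap with any
    ground-truth segment, divided by the number of pixels."""
    overlap = Counter()
    total = 0
    for gt_row, sp_row in zip(gt, sp):
        for g, s in zip(gt_row, sp_row):
            overlap[(s, g)] += 1
            total += 1
    best = {}
    for (s, _g), n in overlap.items():
        best[s] = max(best.get(s, 0), n)
    return sum(best.values()) / total
```

For example, a superpixel map identical to the ground truth scores 1.0 on both measures, while a single superpixel covering two equally sized ground-truth segments scores 0.5 on ASA.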
If a superpixel $S_k$ is entirely contained within a ground truth segment $G_i$, it is considered a correct superpixel and does not contribute to the value of UE. Therefore, the smaller UE is, the better the superpixel extraction will be. UE is computed as follows:

$$UE_G(S) = \frac{\sum_i \sum_{k:\, S_k \cap G_i \neq \emptyset} |S_k \setminus G_i|}{\sum_i |G_i|} \tag{17}$$

We plot the quantitative measures of our method in comparison to SLIC [5], GC [7], PB [22] and TurboPixel (TP) [3] in Fig. 12. The measures are obtained by averaging the results over all test images in the BSD benchmark. Our proposed method outperforms the others according to the BR and ASA measures (see Fig. 12(a) and (b)) across the entire


8 JOURNAL OF L A TEX CLASS FILES Boundary Recall GC SLIC Turbo PB Ours Achievable Segmentation Accuracy GC SLIC Turbo PB Ours Undersegmentation Error GC SLIC Turbo PB Ours (a) (b) (c) Fig. 2. Comparison of boundary recall, achievable segmentaiton accuracy and undersegmentation error between SLIC [5], GC [7], PB [22], TurboPixel (TP) [3] and our method respectively (from left to right) on the BSD benchmark. Higher BR and ASA values indicate a better performance in superpixel alignment; lower UE values represents less error in the superpixelling results. BR Value data and smooth adaptive data, smooth and higher order ASA Value data and smooth adaptive data, smooth and higher order UE Value data and smooth adaptive data, smooth and higher order (a) (b) (c) (d) Fig. 3. An illustration that shows the advantage of our adaptive higher-order optimization. (a) gives the comparison results between the higher-order energy (top) and our adaptive higher-order energy (bottom). The quantitative evaluations (BR, ASA, UE) of the top and bottom images in (a) are (0.8289, 0.946, ) and (0.8255, , ), respectively. (b), (c), and (d) are the BR, ASA, and UE curves between the lower-order energy optimization and our adaptive higher-order energy optimization on the elephant image in (a). range of superpixel densities. The high BR values indicates the superiority of our method in aligning the superpixels, while the high ASA values suggests that our adaptive segmentation can be further extended to other unsupervised image segmentation applications. In terms of UE, our method achieves comparable results as the best-performing TurboPixel for UE curve (see Fig. 2(c)) though we are better at lower superpixel densities. C. Texture-adaptive results By including a higher-order energy term, we ensure that extracted superpixels are well-aligned to object edges and the difference can be seen in the BR, ASA and UE curves in Fig. 
3(a), (b) and (c) comparing the results of using only the lower-energy terms and our adaptive higher-order energy. In particular, the texture-adaptation allows for the extraction of superpixels more homogenous in shape and size in highly texture regions. This is due to the use of a smaller of λ h, thereby limiting the contribution of the higher-order term. For example, in Fig. 3(a), the superpixels using non-adaptive and adaptive λs respectively are similar in the smooth and object boundary regions. In highly texture regions, however, the adaptive algorithm is able to generate superpixels more homogeneous in shape and size (see superpixels on body of elephant and in the grass). (a) Fig. 4. Run-time comparison: (a) our approach without k-means clustering step (run-time: seconds); (b) our full approach with k-means clustering initialization (run-time: seconds). The quantitative evaluations (BR, ASA, UE) of (a) and (b) are (0.887,0.9895,0.064) and (0.8858,0.9900,0.054), respectively. (b)
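The three measures defined in the quantitative comparison (BR, ASA and UE) can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation code used for the experiments. The function names are hypothetical, a boundary pixel is assumed to be any pixel whose right or lower neighbour carries a different label, the distance between boundary pixels is assumed to be Euclidean, and the ground truth is assumed to label every pixel, so that the denominator of ASA and UE equals the image size.

```python
import numpy as np

def boundary_map(labels):
    """Mark pixels whose right or lower neighbour has a different label (assumed boundary definition)."""
    b = np.zeros(labels.shape, dtype=bool)
    b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    b[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return b

def boundary_recall(sp, gt, eps=2):
    """Fraction of ground-truth boundary pixels with a superpixel boundary pixel closer than eps."""
    gt_pts = np.argwhere(boundary_map(gt))
    sp_pts = np.argwhere(boundary_map(sp))
    if len(gt_pts) == 0:
        return 1.0
    if len(sp_pts) == 0:
        return 0.0
    # Brute-force pairwise distances: fine for a sketch, memory-hungry on large images.
    d = np.linalg.norm(gt_pts[:, None, :] - sp_pts[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) < eps))

def overlap_counts(sp, gt):
    """counts[k, i] = number of pixels shared by superpixel k and ground-truth segment i."""
    _, sp_ids = np.unique(sp.ravel(), return_inverse=True)
    _, gt_ids = np.unique(gt.ravel(), return_inverse=True)
    counts = np.zeros((sp_ids.max() + 1, gt_ids.max() + 1), dtype=np.int64)
    np.add.at(counts, (sp_ids, gt_ids), 1)
    return counts

def achievable_segmentation_accuracy(sp, gt):
    """Sum over superpixels of the largest overlap with any segment, divided by the pixel count."""
    counts = overlap_counts(sp, gt)
    return float(counts.max(axis=1).sum() / sp.size)

def undersegmentation_error(sp, gt):
    """Sum of |S_k - G_i| over all overlapping (k, i) pairs, divided by the pixel count."""
    counts = overlap_counts(sp, gt)
    sizes = counts.sum(axis=1, keepdims=True)        # |S_k|
    leak = np.where(counts > 0, sizes - counts, 0)   # |S_k - G_i| where S_k meets G_i
    return float(leak.sum() / sp.size)
```

On a toy 2x4 image where the superpixel boundary sits one pixel to the left of the ground-truth boundary, the second superpixel straddles both segments, so ASA drops below 1 and UE becomes positive, while BR with a tolerance of 2 still counts the boundary as recalled.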

D. Computation speed and pre-segmentation

As stated in the overview, we use a fast k-means clustering initialization step to accelerate computation. Fig. 14 compares the superpixels of direct optimization (a) and the fast k-means initialization (b). To generate Fig. 14(a), we use regular image patches as initial superpixels: the input image is first divided into regular patches as in [17], and the boundaries of these patches are then treated as the boundaries of the initial superpixels for computing $N(L_C)$. The two results are virtually identical, even though using k-means requires only seconds of computation time in comparison to seconds by the direct optimization.

Finally, our method is insensitive to the pre-segmentation method. Fig. 15 shows pre-segmentation results from different approaches [1], [6], [11], [27] as well as the corresponding superpixel results. Note that the main role of the pre-segments is to provide prior information for the higher-order energy term; as long as the pre-segments contain sufficient edges and segment regions, we can extract fairly consistent superpixels even from very different pre-segments. We propose the use of the classic mean-shift algorithm [1] due to its computational efficiency.

Fig. 15. Comparison with different pre-segments. (a), (c), (e) and (g) are the pre-segments by mean-shift [1], efficient graph [6], edge-preserving method [11], and probabilistic method [27], respectively. (b), (d), (f) and (h) are the corresponding superpixel results by our method using the pre-segments in (a), (c), (e) and (g), respectively. The quantitative evaluations (BR, ASA, UE) of (b), (d), (f), and (h) are (0.794,0.9606,0.3936), (0.7892,0.9609,0.3855), (0.7904,0.9608,0.3862), and (0.794,0.9607,0.3942), respectively.

IV. CONCLUSION

We have presented a novel superpixel algorithm using higher-order energy optimization. In order to quickly obtain an initial superpixel result, fast k-means clustering is performed on the input image. We then apply our higher-order energy optimization to the initial superpixels to produce the final superpixel results. By adapting the energy terms to local texture information, our extracted superpixels are well aligned to object boundaries yet remain relatively homogeneous in shape and size. Our experimental results show that our approach achieves better superpixel performance than the state-of-the-art approaches.

For future work, we first intend to design new alignment and regularity metrics for supervoxels by extending the evaluation metrics in [29]. To make our algorithm run in real time, we also plan to explore parallel acceleration implementations [18]. Finally, it would be interesting to model different forms of higher-order potentials and other image priors for superpixel segmentation.

APPENDIX A

Theorem. The α-expansion move energy can be approximately transformed into the pairwise function:

$$\psi(T_\alpha(x_i, t_i)) = \min_{m_0, m_1} \Big\{ \theta \sum_{i \in C_r} t_i (1 - m_0) + r_0 m_0 + \theta \sum_{i \in C_c} (1 - t_i)\, m_1 + r_1 (1 - m_1) + \delta \Big\}, \tag{18}$$

where $\theta = 1/Q$, $r_0 = 1$, $r_1 = 1 - \frac{|C_r|}{Q}$, and $\delta = \frac{|C_r|}{Q}$.

Proof. We first decompose the above energy function as:

$$\psi(T_\alpha(x_i, t_i)) = F_0(t_{C_r}) + F_1(t_{C_c}),$$

$$F_0(t_{C_r}) = \min_{m_0} \theta \sum_{i \in C_r} t_i (1 - m_0) + r_0 m_0,$$

$$F_1(t_{C_c}) = \min_{m_1} \theta \sum_{i \in C_c} (1 - t_i)\, m_1 + r_1 (1 - m_1) + \theta |C_r|.$$

Then we transform $F_0(t_{C_r})$ and $F_1(t_{C_c})$ as follows:

$$\begin{aligned}
F_0(t_{C_r}) &= \min_{m_0} \theta \sum_{i \in C_r} t_i (1 - m_0) + r_0 m_0 \\
&= \min_{m_0} \frac{1}{Q} \sum_{i \in C_r} t_i (1 - m_0) + m_0 \\
&= \min_{m_0} f_m^1(t_{C_r}) (1 - m_0) + m_0.
\end{aligned} \tag{19}$$
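As a quick sanity check on the last step of the derivation above, the following sketch enumerates both values of the auxiliary variable and confirms that the minimum is the coefficient of $(1 - m_0)$ truncated at 1. It assumes that $m_0$ is binary and that this coefficient is a non-negative scalar; the function names are illustrative and do not come from the paper.

```python
def f0_by_enumeration(f_value):
    """Minimise f * (1 - m0) + m0 over the two admissible values of the binary variable m0."""
    return min(f_value * (1 - m0) + m0 for m0 in (0, 1))

def f0_closed_form(f_value):
    """Closed form of the same minimisation: the coefficient truncated at 1."""
    return min(f_value, 1)
```

Choosing $m_0 = 0$ keeps the linear cost, while $m_0 = 1$ pays the constant 1, so the auxiliary variable acts as a switch between the linear and the truncated regime.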

In the same way, we obtain the following equations:

$$\begin{aligned}
F_1(t_{C_c}) &= \min_{m_1} \theta \sum_{i \in C_c} (1 - t_i)\, m_1 + r_1 (1 - m_1) + \theta |C_r| \\
&= \min_{m_1} \theta \sum_{i \in C_c} (1 - t_i)\, m_1 + \Big(1 - \frac{|C_r|}{Q}\Big)(1 - m_1) + \theta |C_r| \\
&= \min_{m_1} \theta \sum_{i \in C_c} (1 - t_i)\, m_1 + (1 - \theta |C_r|)(1 - m_1) + \theta |C_r| \\
&= \min_{m_1} \theta \sum_{i \in C_c} (1 - t_i)\, m_1 + \theta |C_r|\, m_1 + (1 - m_1) \\
&= \min_{m_1} \theta \Big[ \sum_{i \in C_c} (1 - t_i) + |C_r| \Big] m_1 + (1 - m_1) \\
&= \min_{m_1} \big[ f_m^0(t_{C_c}) + \theta |C_r| \big] m_1 + (1 - m_1).
\end{aligned} \tag{20}$$

We now define a new set $C_o = C_{sp} \setminus C_r$, which contains the pixels in $C_{sp}$ but not in $C_r$. After setting $z(f_m^0(t_{C_o})) = \min\{f_m^0(t_{C_o}), 1\}$ and adding the constraint conditions that relate to $C_o$, the following optimization functions are obtained:

$$\begin{aligned}
F_0(t_{C_r}, t_{C_o}) &= \min_{m_0} f_m^1(t_{C_r}) (1 - m_0) \big[1 - z(f_m^0(t_{C_o}))\big] + m_0 \\
&\quad \text{s.t.}\;\; m_0 + \big[1 - z(f_m^0(t_{C_o}))\big] > 0, \\
F_1(t_{C_c}, t_{C_o}) &= \min_{m_1} \big[ f_m^0(t_{C_c}) + \theta |C_r| \big] m_1\, z(f_m^0(t_{C_o})) + (1 - m_1) \\
&\quad \text{s.t.}\;\; (1 - m_1) + z(f_m^0(t_{C_o})) > 0.
\end{aligned} \tag{21}$$

From the above constraint in $F_0(t_{C_r}, t_{C_o})$, we can conclude that $m_0 = 0$ and $z(f_m^0(t_{C_o})) = 1$ cannot be satisfied simultaneously. If $m_0 = 0$, then there must be $z(f_m^0(t_{C_o})) \neq 1$ and $F_0(t_{C_r}, t_{C_o}) = f_m^1(t_{C_r})$; if $m_0 = 1$, $F_0(t_{C_r}, t_{C_o}) = 1$, and we further have:

$$F_0(t_{C_r}, t_{C_o}) \approx \begin{cases} f_m^1(t_{C_r}), & \text{if } f_m^1(t_{C_r}) < 1 \text{ and } f_m^0(t_{C_o}) = 0, \\ 1, & \text{otherwise} \end{cases} = \begin{cases} f_m^1(t_{C_r}), & \text{if } f_m^1(t_{C_r}) < 1 \text{ and } f_m^0(t_{C_{sp} \setminus C_r}) = 0, \\ 1, & \text{otherwise.} \end{cases} \tag{22}$$

In the same way, we obtain $F_1(t_{C_c}, t_{C_o})$ as follows:

$$F_1(t_{C_c}, t_{C_o}) \approx \begin{cases} f_m^0(t_{C_c}) + \theta |C_r|, & \text{if } f_m^0(t_{C_c}) < 1 - \theta |C_r| \text{ and } f_m^0(t_{C_o}) = 1, \\ 1, & \text{otherwise} \end{cases} = \begin{cases} f_m^0(t_{C_c}) + \theta |C_r|, & \text{if } f_m^0(t_{C_c}) < 1 - \theta |C_r| \text{ and } f_m^0(t_{C_{sp} \setminus C_r}) \neq 0, \\ 1, & \text{otherwise.} \end{cases} \tag{23}$$

Because the constraint conditions $f_m^0(t_{C_{sp} \setminus C_r}) = 0$ and $f_m^0(t_{C_{sp} \setminus C_r}) \neq 0$ are mutually exclusive, we then add Eq.(22) and Eq.(23) together:

$$F_0(t_{C_r}, t_{C_o}) + F_1(t_{C_c}, t_{C_o}) = \begin{cases} 1 + f_m^1(t_{C_r}), & \text{if } f_m^1(t_{C_r}) < 1 \text{ and } f_m^0(t_{C_{sp} \setminus C_r}) = 0, \\ 1 + \big[ f_m^0(t_{C_c}) + \theta |C_r| \big], & \text{if } f_m^0(t_{C_c}) < 1 - \theta |C_r| \text{ and } f_m^0(t_{C_{sp} \setminus C_r}) \neq 0, \\ 2, & \text{otherwise.} \end{cases} \tag{24}$$

Since our higher-order term is built for every segment $C$ and the region $C_o$ is not in $C$, we would like to delete the constraint of $C_o$ and derive Eq.(22) and Eq.(23) approximately. Moreover, on inspecting Eq.(22) and Eq.(23), we find that $C_o$ does not enter the values of $F_0(t_{C_r}, t_{C_o})$ and $F_1(t_{C_c}, t_{C_o})$; it only appears in their constraints. Therefore, we relax the constraint conditions and weaken the constraint of $C_o$ as follows:

$$F_0(t_{C_r}) \approx F_0(t_{C_r}, t_{C_o}), \qquad F_1(t_{C_c}) \approx F_1(t_{C_c}, t_{C_o}). \tag{25}$$

Finally, we get the result as follows:

$$\psi(T_\alpha(x_i, t_i)) = F_0(t_{C_r}) + F_1(t_{C_c}) = \begin{cases} f_m^1(t_{C_r}), & \text{if } f_m^1(t_{C_r}) < 1 \text{ and } f_m^0(t_{C_{sp} \setminus C_r}) = 0, \\ f_m^0(t_{C_c}) + \theta |C_r|, & \text{if } f_m^0(t_{C_c}) < 1 - \theta |C_r| \text{ and } f_m^0(t_{C_{sp} \setminus C_r}) \neq 0, \\ 1, & \text{otherwise.} \end{cases} \tag{26}$$

This is the solution of Eq.(3) in Section II.C. When the labels in one pre-segment $C$ expand, they also affect the higher-order values of the neighboring pre-segments. We assume that the penalty of the other pre-segments does not increase when the labels in $C$ expand. That is to say, Eq.(4) in Section II.C is a pairwise transformation of the upper bound of the higher-order expansion energy. Therefore, the α-expansion move energy (Eq.(3)) can be approximately transformed into the pairwise function (Eq.(4)) as its upper bound.

REFERENCES

[1] D. Comaniciu and P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 5.
[2] X. Ren and J. Malik, Learning a classification model for segmentation, In: Proceedings of IEEE ICCV, pp. 10-17.
[3] C. W. Ngo, Y. F. Ma, and H. J. Zhang, Video summarization and scene detection by graph modeling, IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 2.
[4] Y. Yuan, J. Fang, and Q. Wang, Robust superpixel tracking via depth fusion, IEEE Trans. on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 15-26, 2014.
[5] V. Kolmogorov and R. Zabih, What energy functions can be minimized via graph cuts? IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, no. 2.
[6] P. Felzenszwalb and D. Huttenlocher, Efficient graph-based image segmentation, International Journal of Computer Vision, vol. 59, no. 2, pp. 167-181, 2004.

[7] J. M. Bioucas-Dias and G. Valadao, Phase unwrapping via graph cuts, IEEE Trans. on Image Processing, vol. 16, no. 3, pp. 698-709, 2007.
[8] Y. P. Tsai, C. H. Ko, Y. P. Hung, and Z. C. Shih, Background removal of multiview images by learning shape priors, IEEE Trans. on Image Processing, vol. 16, no. 10.
[9] N. Komodakis and N. Paragios, Beyond pairwise energies: efficient optimization for higher-order MRFs, In: Proceedings of IEEE CVPR, pp. 2985-2992, 2009.
[10] S. Ramalingam, P. Kohli, K. Alahari, and P. H. S. Torr, Exact inference in multi-label CRFs with higher order cliques, In: Proceedings of IEEE CVPR, pp. 1-8.
[11] G. Sfikas, C. Nikou, and N. Galatsanos, Edge-preserving spatially varying mixtures for image segmentation, In: Proceedings of IEEE CVPR, pp. 1-7, 2008.
[12] H. Ishikawa, Higher-order clique reduction in binary graph cut, In: Proceedings of IEEE CVPR.
[13] A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, and K. Siddiqi, Turbopixels: fast superpixels using geometric flows, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 12.
[14] P. Kohli, L. Ladicky, and P. Torr, Robust higher order potentials for enforcing label consistency, International Journal of Computer Vision, vol. 82, no. 3.
[15] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, SLIC superpixels, EPFL Technical Report, 2010.
[16] L. C. Tran, C. J. Pal, and T. Q. Nguyen, View synthesis based on conditional random fields and graph cuts, In: Proceedings of IEEE ICIP, 2010.
[17] O. Veksler, Y. Boykov, and P. Mehrani, Superpixels and supervoxels in an energy optimization framework, In: Proceedings of ECCV, 2010.
[18] P. Strandmark and F. Kahl, Parallel and distributed graph cuts by dual decomposition, In: Proceedings of IEEE CVPR, 2010.
[19] Y. Mu, B. Zhou, and S. Yan, Information-theoretic analysis of input strokes in visual object cutout, IEEE Trans. on Multimedia, vol. 12, no. 8, 2010.
[20] M. B. Salah, A. Mitiche, and I. B. Ayed, Multiregion image segmentation by parametric kernel graph cuts, IEEE Trans. on Image Processing, vol. 20, no. 2, 2011.
[21] M. Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa, Entropy rate superpixel segmentation, In: Proceedings of IEEE CVPR, 2011.
[22] Y. Zhang, R. I. Hartley, J. Mashford, and S. Burn, Superpixels via pseudo-boolean optimization, In: Proceedings of IEEE ICCV, 2011.
[23] H. Li and K. N. Ngan, A co-saliency model of image pairs, IEEE Trans. on Image Processing, vol. 20, no. 12, 2011.
[24] E. Bae, J. Shi, and X. C. Tai, Graph cuts for curvature based image denoising, IEEE Trans. on Image Processing, vol. 20, no. 5, 2011.
[25] Z. Liu, R. Shi, L. Shen, Y. Xue, K. N. Ngan, and Z. Zhang, Unsupervised salient object segmentation based on kernel density estimation and two-phase graph cut, IEEE Trans. on Multimedia, vol. 14, no. 2, 2012.
[26] K. Park and S. Gould, On learning higher-order consistency potentials for multi-class pixel labeling, In: Proceedings of ECCV, 2012.
[27] B. Andres, J. H. Kappes, T. Beier, U. Kothe, and F. A. Hamprecht, Probabilistic image segmentation with closedness constraints, In: Proceedings of IEEE ICCV, 2011.
[28] J. Peng, J. Shen, Y. Jia, and X. Li, Saliency cut in stereo images, In: Proceedings of IEEE ICCV Workshop, 2013.
[29] C. Xu and J. J. Corso, Evaluation of super-voxel methods for early video processing, In: Proceedings of IEEE CVPR, 2012.
[30] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, Contour detection and hierarchical image segmentation, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, 2011.
[31] J. Shen, Y. Du, and X. Li, Interactive segmentation using constrained Laplacian optimization, IEEE Trans. on Circuits and Systems for Video Technology, vol. 24, no. 7, 2014.
[32] J. Shen, Y. Du, W. Wang, and X. Li, Lazy random walks for superpixel segmentation, IEEE Trans. on Image Processing, vol. 23, no. 4, 2014.
[33] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, 2012.
[34] W. Wang, J. Shen, and F. Porikli, Saliency-aware geodesic video object segmentation, In: Proceedings of IEEE CVPR, 2015.
[35] A. Fix, A. Gruber, E. Boros, and R. Zabih, A graph cut algorithm for higher-order Markov Random Fields, In: Proceedings of IEEE ICCV, 2011.

Jianteng Peng is currently working toward the PhD degree in the School of Computer Science, Beijing Institute of Technology, Beijing, China. His current research interests include image and video segmentation using higher-order energy optimization.

Jianbing Shen (M'11-SM'12) is a Full Professor with the School of Computer Science, Beijing Institute of Technology, Beijing, China. He has published about 50 journal and conference papers in venues such as IEEE TIP, IEEE TCSVT, IEEE TCYB, IEEE TMM, IEEE CVPR, IEEE ICCV, and IEEE ICME. He has also obtained many flagship honors including the Fok Ying Tung Education Foundation from Ministry of Education, the Program for Beijing Excellent Youth Talents from Beijing Municipal Education Commission, and the Program for New Century Excellent Talents in University from Ministry of Education. His research interests include computer vision and multimedia processing. He is on the editorial board of Neurocomputing.

Angela Yao is currently an Assistant Professor at the Institute of Computer Science at the University of Bonn. She received a BASc degree in Engineering Science from the University of Toronto in 2006 and a PhD in Information Technology and Electrical Engineering from ETH Zurich in 2012. Her research interests are in computer vision and machine learning, with special focus on human pose estimation and action recognition.

Xuelong Li (M'02-SM'07-F'12) is a full professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), the School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, Shaanxi, P. R. China.
