A Region-based MRF Model for Unsupervised Segmentation of Moving Objects in Image Sequences


Yaakov Tsaig, Amir Averbuch
Department of Computer Science, School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel

Abstract

This paper addresses the problem of segmentation of moving objects in image sequences, which is of key importance in content-based applications. We transform the problem into a graph labeling problem over a region adjacency graph (RAG), by introducing a Markov random field (MRF) model based on spatio-temporal information. The initial partition is obtained by a fast, color-based watershed segmentation. The motion of each region is estimated and validated in a hierarchical framework. A dynamic memory, based on object tracking, is incorporated into the segmentation process to maintain temporal coherence. The performance of the algorithm is evaluated on several real-world image sequences.

1. Introduction

Segmentation of moving objects in image sequences is one of the key issues in computer vision, since it lies at the base of virtually any scene analysis problem. In particular, segmentation of moving objects is a crucial factor in content-based applications such as interactive TV, content-based scalability for video coding, and content-based indexing and retrieval. Such applications require an accurate and stable partition of an image sequence into semantically meaningful objects.

Commonly, two approaches exist for detecting moving objects in a sequence of images. The first approach relies on change detection as the source of temporal information [1, 2, 3]. This approach is motivated by the assumption that moving objects usually entail intensity changes between successive frames. Hence, moving objects can be identified by applying a decision rule to the intensity differences between successive frames in a sequence.
Various decision rules have been suggested to identify moving objects, based on Markov random fields [1, 3], or on a statistical F-test [2]. Yet, this approach suffers from two major drawbacks. First, unless moving objects are sufficiently textured, only occlusion areas (covered background) will be marked as changed, while the interior of objects will remain unchanged. Second, uncovered background will be marked as changed in the process as well, so the boundaries of the moving objects are likely to be inaccurate. To overcome this latter drawback, a post-processing step that distinguishes between moving objects and uncovered background has to be applied [1, 3].

In contrast, several researchers choose to rely directly on motion information to identify moving objects [4, 5, 6, 7]. The techniques in [4, 5] begin with estimation of a dense motion field, and partition it into a set of regions with parametric motion representations, by clustering in the parameter space [5], or through the use of an MRF model [4]. The technique in [6], on the other hand, begins with a spatial partition of the frame, and then merges regions with similar motion using a k-medoid clustering technique. A similar approach was taken in [7], where an initial partition is obtained by an MRF-based texture segmentation, and an MRF model is employed to merge regions with uniform motion. Most of the techniques in this category were designed to address the needs of second generation video coding schemes, namely, to partition each frame in a sequence into a set of regions, such that each region can be efficiently represented by a parametric motion model. Yet, most physical objects exhibit non-rigid motion and cannot be characterized by a parametric motion model, so these techniques usually result in a finer partition than actually required. In addition, these techniques rely on an accurate estimation of the motion in the scene, so inaccurate motion estimation will inevitably lead to inaccurate segmentation.
This paper presents a new algorithm for automatic segmentation of moving objects in image sequences. The approach underlying our algorithm is to classify regions obtained in an initial partition as foreground or background, based on motion information. A block diagram of the proposed algorithm is depicted in Fig. 1.

Figure 1: Block diagram of the proposed algorithm.

We formulate the segmentation problem as detection of moving objects over a static background. Thus, the first step in the algorithm is to compensate the motion of the camera. The global motion is modeled by an eight-parameter perspective motion model, and estimated using a robust gradient-based technique [8].

An initial spatial partition of the current frame is obtained by applying the watershed segmentation algorithm. For this purpose, the spatial gradient is first estimated in the color space through the use of Canny's gradient [9]. Then, the optimized rainfalling watershed algorithm is applied [10]. To reduce the oversegmentation, small regions are merged together in a post-processing step based on spatio-temporal information.

Based on the initial partition, the classification stage labels regions as foreground or background. It begins with an initial classification based on a statistical significance test, which marks regions as foreground candidates. The motion of each foreground candidate is estimated by region matching in a hierarchical framework. To avoid false movements caused by occlusion, an iterative validation scheme examines the estimated motion vectors, and corrects them if necessary. The estimated motion vectors are then incorporated into a Markov random field model, along with spatial information and information gathered from previous frames. The optimization of the MRF model is performed using highest confidence first (HCF) [11], leading to a classification of the regions in the initial partition. Finally, a dynamic memory is introduced into the algorithm to ensure temporal coherency of the segmentation process. The memory is updated using the estimated motion vectors and the MRF-based classification. Each region is tracked to the next frame in the sequence, and the memory is updated accordingly.

The paper is organized as follows. Section 2 describes the global motion estimation and compensation stage. Section 3 discusses the initial partition, and Section 4 explains the classification process. Experimental results are presented in Section 5, and concluding remarks are given in Section 6.

2. Global Motion Estimation and Compensation

For the estimation of the global motion, we utilize a gradient-based parametric motion estimation technique suggested in [8]. The camera motion is modeled by the eight-parameter perspective motion model,

x'_i = (a_1 + a_2 x_i + a_3 y_i) / (a_7 x_i + a_8 y_i + 1)
y'_i = (a_4 + a_5 x_i + a_6 y_i) / (a_7 x_i + a_8 y_i + 1)    (1)

and the global motion estimation (GME) is carried out using the Levenberg-Marquardt (LM) non-linear minimization algorithm. To increase the robustness of the estimation process, as well as to reduce computation time, the LM algorithm is applied within a hierarchical framework, using a 3-level multi-resolution pyramid, and the estimation is carried out only within background regions of the current frame. Background regions are determined according to the tracked mask of the previous frame (the tracking mechanism is described in Sec. 4.4). To assure convergence of the LM algorithm, an initial stage is performed that computes a coarse estimate of the translation component of the displacement, by applying a matching technique at the coarsest level of the pyramid.

Following the GME, the camera motion between frames I_k and I_{k+1} is compensated by warping frame I_{k+1} to frame I_k according to the estimated motion parameters of Eq. (1).
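For illustration, the compensation warp might be implemented as follows. This is a minimal NumPy/SciPy sketch, not the authors' code; the function name `perspective_warp` and the use of bilinear resampling are our assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def perspective_warp(frame, a):
    """Warp `frame` (I_{k+1}) toward I_k using the eight-parameter
    perspective model of Eq. (1); `a` holds (a1, ..., a8)."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    denom = a[6] * xs + a[7] * ys + 1.0
    xw = (a[0] + a[1] * xs + a[2] * ys) / denom   # x' of Eq. (1)
    yw = (a[3] + a[4] * xs + a[5] * ys) / denom   # y' of Eq. (1)
    # Sample the frame at the transformed coordinates (bilinear interpolation).
    return map_coordinates(frame, [yw, xw], order=1, mode='nearest')
```

With the identity parameters (0, 1, 0, 0, 0, 1, 0, 0) the warp leaves the frame unchanged, which is a convenient sanity check.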

3. Initial Partition

3.1. Gradient Approximation

The spatial gradient of the current frame is estimated using Canny's gradient approximation [9]. Specifically, the image is convolved with the first derivative of a Gaussian G'(x) in both directions, where

G(x) = (1/√(2πσ_G²)) e^(−x²/2σ_G²)    (2)

and the standard deviation of the Gaussian is chosen to be σ_G = 1. Due to the separability property of Gaussians, the 2D convolution can be implemented by applying the 1D filter (2) in the horizontal and vertical directions, and performing the differentiation afterwards.

To increase the accuracy of the initial partition, and to ensure that moving objects are separated from the background, we incorporate color information into the segmentation process. For this purpose, Canny's gradient is first computed on each of the three color components, resulting in the gradient magnitudes G_Y, G_Cr and G_Cb. Then, the common gradient image G_col is generated by combining the information of the three color components, according to

G_col = max { w_Y G_Y / σ_Y, w_Cr G_Cr / σ_Cr, w_Cb G_Cb / σ_Cb }    (3)

where σ_Y, σ_Cr, σ_Cb are the variances of the gradient magnitudes of the three color components, and w_Y, w_Cr, w_Cb are the associated weight coefficients. A weighted maximization is used in (3) rather than a weighted sum in order to avoid spurious local maxima due to the scaling of the chrominance components. The weights in (3) were defined empirically, with w_Y = 0.5 and equal weights for the two chrominance components.

3.2. Watershed Segmentation

To obtain an initial partition of the current frame, the watershed algorithm is applied on the gradient image G_col. Watershed segmentation draws its origins from mathematical morphology; it is in fact a region growing algorithm that treats the input image as a topographic surface and, through the intuitive process of water-filling, creates a partition of the image.
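Returning to the gradient computation, the per-channel derivative-of-Gaussian filtering and the weighted maximum of Eq. (3) can be sketched as follows. Python with NumPy/SciPy is assumed; the chrominance weight values and the use of the gradient-magnitude spread for normalization are our assumptions, since not all published values survived extraction.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def color_gradient(Y, Cr, Cb, weights=(0.5, 0.25, 0.25), sigma=1.0):
    """Weighted maximum of per-channel gradient magnitudes, as in Eq. (3).
    Each channel's magnitude is normalized by its spread before the
    maximum is taken, to counter chrominance scaling."""
    def grad_mag(c):
        c = np.asarray(c, float)
        gx = gaussian_filter1d(c, sigma, axis=1, order=1)  # dG/dx response
        gy = gaussian_filter1d(c, sigma, axis=0, order=1)  # dG/dy response
        return np.hypot(gx, gy)

    terms = []
    for w, chan in zip(weights, (Y, Cr, Cb)):
        m = grad_mag(chan)
        s = m.std()
        terms.append(w * m / s if s > 0 else np.zeros_like(m))
    return np.max(terms, axis=0)
```

A vertical luminance edge then produces a ridge in the combined gradient image at the edge location, which is what the watershed stage operates on.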
For the implementation of the watershed algorithm, we use a fast method based on rainfalling simulations, suggested in [10]. It has several advantages over the traditional implementation in terms of immersion simulations [12]. First, it does not require the input image to be discrete, in contrast to [12]. Moreover, the optimized implementation is 3 to 4 times faster than the one presented in [12]. In addition, the algorithm introduces a new drowning step that removes some of the weakest edges and helps reduce the influence of noise. This drowning step can be thought of as flooding the surface with water at a certain level. The process creates a number of lakes, grouping all the pixels that lie below the drowning level (see Fig. 2).

Figure 2: A topographic watershed cross-section, after the drowning process; lakes are formed by merging neighboring pixels below the drowning threshold.

While the watershed algorithm in [10] produces very accurate results, it has one inherent drawback: it is extremely sensitive to gradient noise, and usually results in oversegmentation. In order to eliminate this oversegmentation, we employ a post-processing step that merges small regions, based on spatio-temporal information.

3.3. Spatio-Temporal Merging

The goal of the merging step is to reduce the number of regions in the partition by eliminating small regions, while maintaining the accuracy of the partition. To ensure that moving objects are not merged with background regions, we consider temporal information as well as spatial information in the merge process. Suppose that the output of the watershed segmentation consists of N regions {R_1, ..., R_N}. We define a spatial distance measure between adjacent regions R_i and R_j as

D_ij = (A_ij + G_ij) / 2    (4)

where A_ij is the sum of differences between the average intensity values of the three color components within R_i and R_j, and G_ij is the difference of intensities on the common boundary of R_i and R_j.
In addition, we define a temporal distance measure based on the absolute luminance differences d_k^abs(x, y) = |I_{k+1}(x, y) − I_k(x, y)|, as

B_ij = (1/N_ij) Σ |d_k^abs(x_i, y_i) − d_k^abs(x_j, y_j)|    (5)

where the summation in (5) is carried out over all pairs of 4-connected pixels (x_i, y_i) ∈ R_i and (x_j, y_j) ∈ R_j, and N_ij is their cardinality.

The merge process begins by examining all regions that consist of T_sz pixels or less. Each region R_i is merged into the region R_j whose spatial distance D_ij to R_i is minimal, among all regions with temporal distance B_ij that is smaller

than T_B. We start the merging process by setting T_sz = 20, and repeat the process iteratively, each time increasing T_sz by 20, until its value reaches 100 (a total of 5 iterations). Selection of the threshold T_B determines the sensitivity of the merge process. Experimentally, we have found that using T_B = 25 leads to satisfactory results.

4. Classification

4.1. Initial Classification

The classification stage begins with an initial classification that is based on a statistical significance test [13]. This initial phase serves two main purposes. First, it reduces the computational load of the subsequent motion estimation phase, by eliminating most of the background regions from the estimation process. Second, it increases the robustness of the motion estimation, by eliminating noisy background regions that might be falsely detected as moving.

Let d_k(x, y) = I_{k+1}(x, y) − I_k(x, y) denote the image of luminance differences between frames I_k and I_{k+1}. Under the hypothesis that no change occurred at position (x, y) (the null hypothesis H_0), the corresponding difference d_k(x, y) follows a zero-mean Gaussian distribution

p(d_k(x, y) | H_0) = (1/√(2πσ²)) e^(−d_k²(x,y)/2σ²)    (6)

where the noise variance σ² is equal to twice the variance of the camera noise, assuming that the camera noise is white. Rather than performing the significance test directly on the values d_k(x, y), it is better to evaluate a local sum of normalized differences

Δ_k(x, y) = Σ_{(x′,y′) ∈ W(x,y)} d_k²(x′, y′) / σ²    (7)

where W(x, y) is a window of observation centered at (x, y). Under the assumption that no change occurs within the window, the normalized differences d_k/σ obey a Gaussian distribution N(0, 1) and are spatially uncorrelated, thus the local sum Δ_k(x, y) follows a χ² distribution with N degrees of freedom, N being the number of pixels within the window W(x, y).
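As a sketch (Python with NumPy/SciPy assumed; function names are our own), the local statistic Δ_k can be computed with a box filter and compared against the appropriate χ² quantile, with the region-level candidate rule added for completeness:

```python
import numpy as np
from scipy.ndimage import uniform_filter
from scipy.stats import chi2

def changed_pixel_mask(I_k, I_k1, sigma, alpha=1e-2, win=5):
    """Per-pixel change mask from the local chi-square test: Delta_k is the
    windowed sum of squared normalized differences, thresholded at the
    (1 - alpha) quantile of a chi-square law with win*win dof."""
    d2 = ((I_k1.astype(float) - I_k.astype(float)) / sigma) ** 2
    delta = uniform_filter(d2, size=win) * (win * win)  # local sum, Eq. (7)
    T_alpha = chi2.ppf(1.0 - alpha, df=win * win)       # false-alarm threshold
    return delta > T_alpha

def foreground_candidates(mask, labels, frac=0.10):
    """Region-level rule: candidate if more than `frac` of pixels changed."""
    return {int(r): bool(mask[labels == r].mean() > frac)
            for r in np.unique(labels)}
```

For identical frames the statistic is zero everywhere and no pixel is marked, while a sufficiently strong intensity change drives the local sum far above the threshold.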
With the distribution p(Δ_k(x, y)) known, we can obtain a decision rule for each pixel by performing a significance test on Δ_k(x, y). Specifically, for a given significance level α, we can compute a corresponding threshold T_α using

α = Pr{Δ_k(x, y) > T_α | H_0}    (8)

The significance level α is in fact the false alarm rate associated with the statistical test. The higher the value of α, the more likely we are to classify unchanged pixels as changed. Experimentally, we have found that using a significance level α = 10⁻², with a 5×5 window, leads to satisfactory results.

It is important to note that the suggested significance test has one major drawback. Like all methods based on change detection, moving pixels will be marked as changed only within areas that are sufficiently textured. In other words, within smooth, uniform areas, only a small percentage of the overall pixels will be marked as changed, mostly due to covered/uncovered background. Therefore, a region is classified as a foreground candidate if more than 10% of its pixels are marked as changed; otherwise it is marked as background. In order to avoid elimination of slowly moving regions, we also consider the segmentation of previous frames of the sequence. For this purpose, regions for which a majority of pixels appear in the tracked mask of the previous frame are marked as foreground candidates as well (the tracking mechanism is described in Sec. 4.4).

4.2. Hierarchical Motion Estimation and Validation

Following the initial classification stage, the motion of potential foreground candidates is estimated. The estimation takes place within a segmentation-based framework. That is, the motion of each region is determined by estimating a parametric motion model for the region. As the motion model, we selected the simple 2-parameter translational motion model. In other words, we assume that the motion of objects in the scene can be approximated as piecewise constant.
Our choice of this motion model is supported by the following facts. First, the regions resulting from the initial segmentation are usually small enough to justify the assumption of piecewise constant motion. Second, the goal of the motion estimation phase is to identify moving regions rather than to minimize motion-compensation errors, so a complex model that accurately describes the motion of each region is not required. Finally, the use of a simple 2-parameter motion model allows for a very efficient implementation.

The motion estimation is carried out by intensity matching in a hierarchical framework [14]. For this purpose, the 3-level multi-resolution pyramid constructed in the global motion estimation stage (Sec. 2) is used.

The hierarchical estimation technique suggested above, like all motion estimation techniques, suffers from the occlusion problem. Let us illustrate this problem with a simple example. Fig. 3(a) shows a synthetic scene composed of three objects, where the leftmost object (the black circle) is moving to the right, while partially occluding the middle object, and thereby creating high temporal differences in the occlusion area. Since the ultimate goal of the motion estimation is to minimize the sum of temporal differences, the differences caused by the occlusion induce motion towards the right object in order to compensate for these differences. Since the intensity differences between the left circle and the middle rectangle are significantly higher than the differences between the middle and right rectangles, the middle rectangle will be identified as moving, as shown in Fig. 3(b).

Figure 3: Illustration of the occlusion problem: (a) a synthetic scene where the left object is moving over the middle object; the numbers denote the intensity values. (b) the estimated motion; the middle object was falsely detected as moving.

For the segmentation task at hand, this poses a serious problem, since it causes background regions to be detected as moving, which will inevitably lead to their misclassification as foreground. In order to overcome this problem, we propose a motion validation scheme, based on a statistical hypothesis test.

Let (Δx, Δy) denote the estimated displacement for the region R_i. If we assume that this motion vector is valid, then, according to the brightness change constraint,

I_k(x, y) = I_{k+1}(x + Δx, y + Δy),  ∀(x, y) ∈ R_i    (9)

Assuming that the sequence is subject to camera noise, then for pixels (x, y) within the occlusion area, the frame differences d_k satisfy

d_k(x + Δx, y + Δy) = I_{k+1}(x + Δx, y + Δy) − I_k(x + Δx, y + Δy)
                    = I_k(x, y) − I_k(x + Δx, y + Δy) + n    (10)

where the noise n obeys a zero-mean Gaussian distribution,

p(n) = (1/√(2πσ²)) e^(−n²/2σ²)    (11)

and its variance σ² is equal to twice the variance of the camera noise. Conversely, if the estimated motion vector for the region R_i is invalid, then the differences within the occlusion area are attributed only to camera noise,

d_k(x + Δx, y + Δy) = n    (12)

Therefore, we can define the two hypotheses as

H_0: d_k(x + Δx, y + Δy) = n
H_1: d_k(x + Δx, y + Δy) = D + n    (13)

where we define D = I_k(x, y) − I_k(x + Δx, y + Δy) to simplify notation. Using Eq. (11) we can write

p(d_k | H_0) = (1/√(2πσ²)) e^(−d_k²/2σ²)
p(d_k | H_1) = (1/√(2πσ²)) e^(−(d_k − D)²/2σ²)    (14)

Now, with the a-priori probabilities unknown, the optimal decision rule is the Maximum Likelihood (ML) criterion:

Decide H_1 if p(d_k | H_1) > p(d_k | H_0)    (15)

Substituting Eq. (14) into Eq. (15), we obtain

Decide H_1 if d_k(x + Δx, y + Δy) > D/2    (16)

Thus, a decision rule is obtained for the validation of the motion vector at a pixel in the occlusion area of a given region R_i. Assuming that the noise is spatially uncorrelated, the decision for the entire region is obtained by applying the above decision rule at each pixel within the occlusion area, and deciding based on a majority rule.

If, based on this decision, we find that the estimated motion vector is invalid, it does not imply that the region is not moving, only that its current displacement estimate is wrong. Hence, we must estimate its motion again, while denying movement to the same location. This is done by applying the hierarchical matching algorithm again, while considering only displacements that do not coincide with the occlusion area. After a new motion vector has been estimated, it is validated in the same manner, and if it is not valid, the new occlusion area is appended to the previous one, and the entire process is repeated. This leads to an iterative motion estimation and validation scheme. To increase the efficiency of the validation process, only regions that lie on the border of a moving object (i.e., have a common boundary with a background region) are validated, since these regions are the most likely to be affected by the occlusion problem.

4.3. MRF-Based Classification

Following the motion estimation and validation, each region is classified as foreground or background. To this end, we consider a Markovian labeling approach applied to the regions in the initial partition.
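Returning to the validation scheme, the per-pixel ML test and majority rule of Eqs. (13)-(16) can be sketched as follows. For simplicity, the test is applied over a supplied pixel mask rather than the exact occlusion area, and the decision rule is written in the sign-symmetric form |d_k − D| < |d_k|, which is equivalent to the D/2 threshold for either sign of D; both simplifications are ours.

```python
import numpy as np

def validate_motion(I_k, I_k1, mask, dx, dy):
    """Majority-vote ML validation of a candidate displacement (dx, dy).
    `mask` holds the frame-k pixels over which the test is applied
    (ideally the occlusion area; a simplifying assumption here).
    Returns True when the displacement is accepted (hypothesis H1)."""
    ys, xs = np.nonzero(mask)
    yt, xt = ys + dy, xs + dx                          # displaced coordinates
    ok = (yt >= 0) & (yt < I_k.shape[0]) & (xt >= 0) & (xt < I_k.shape[1])
    ys, xs, yt, xt = ys[ok], xs[ok], yt[ok], xt[ok]
    d_k = I_k1[yt, xt].astype(float) - I_k[yt, xt]     # frame difference
    D = I_k[ys, xs].astype(float) - I_k[yt, xt]        # expected difference
    votes_h1 = np.abs(d_k - D) < np.abs(d_k)           # per-pixel ML rule
    return bool(votes_h1.mean() > 0.5)                 # majority rule
```

A displacement that actually moved the region's intensities is accepted, while a spurious displacement over an unchanged area is rejected.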
First, we construct a region adjacency graph (RAG) G = (S, E), where the set of nodes S = {S 1,..., S N } corresponds to the set of regions in the initial partition, and E is the set of edges which relate two neighboring connected regions. Using this definition, the segmentation problem is transformed into a labeling problem on the RAG G, using the label set L = {F, B}, in which F denotes foreground and B denotes background. To obtain a labeling of the regions, we define a MRF model over the RAG G. By applying the maximum

a-posteriori (MAP) criterion, and using the equivalence between MRFs and Gibbs distributions [15], an optimal labeling ω̂ is obtained by minimizing

ω̂ = arg min_ω U(ω | O)    (17)

where U(ω | O) is the energy term of the Gibbs distribution, and O is the set of observations for the nodes of the RAG G. For the labeling, we use the set of observations O = {O_1, ..., O_N}, where the observation O_i for the region R_i is defined as O_i = {MV_i, A_i, M_i}; MV_i = (Δx_i, Δy_i) is the estimated motion vector of the region, A_i is the sum of the average intensity values of the three color components Y, C_r and C_b within the region, and M_i is the average value of the memory within the region (the memory mechanism is described in Sec. 4.4).

We define the energy function U(ω | O) as a composition of three terms,

U(ω | O) = Σ_{i=1}^{N} [ V_i^M(ω, O) + V_i^T(ω, O) ] + Σ_{(i,j) ∈ E} V_ij^S(ω, O)    (18)

where, for practical reasons, only singleton and pairwise cliques are considered. V_i^M(ω, O) is the motion term, which represents the likelihood of the region R_i being classified as foreground/background, based on its estimated motion:

V_i^M(ω, O) = −α N_i  if (ω_i = F and MV_i ≠ 0) or (ω_i = B and MV_i = 0)
V_i^M(ω, O) = +α N_i  if (ω_i = F and MV_i = 0) or (ω_i = B and MV_i ≠ 0)    (19)

where N_i is the size (number of pixels) of the region R_i. The motion term simply states that moving regions should be classified as foreground, whereas static regions should be classified as background. V_i^T(ω, O) is a temporal continuity term,

V_i^T(ω, O) = −β M_i N_i  if ω_i = F
V_i^T(ω, O) = +β M_i N_i  if ω_i = B    (20)

The temporal continuity term allows us to consider the segmentation of prior frames, thus maintaining the coherency of the segmentation through time. If a region has been classified as foreground several times in the past, its memory value will be high, and it is likely to be classified as foreground again.
Finally, V_ij^S(ω, O) is a spatial continuity term,

V_ij^S(ω, O) = −γ_F f(|A_i − A_j|) N_ij  if ω_i = ω_j = F
V_ij^S(ω, O) = −γ_B f(|A_i − A_j|) N_ij  if ω_i = ω_j = B
V_ij^S(ω, O) = +γ_diff f(|A_i − A_j|) N_ij  if ω_i ≠ ω_j    (21)

where N_ij is the length of the common boundary between R_i and R_j, and the function f(d) is given by

f(d) = [ T_h − ((T_h − T_l)/(d_h − d_l)) (d − d_l) ]_{T_l}^{T_h}    (22)

where the operator [·]_{T_l}^{T_h} denotes clipping to the range [T_l, T_h], as illustrated in Fig. 4.

Figure 4: The similarity function f(d).

The spatial continuity term expresses the relationships between pairs of regions. A similarity measure is used in order to incorporate the spatial properties of the regions into the optimization process. Specifically, two regions with similar spatial properties are likely to belong to the same moving object, thus more weight is given to an equal labeling for the two regions. The values T_h, T_l, d_h, d_l in the function f(d) determine the effect of neighboring regions on one another. In our experiments, we have used T_h = 2, T_l = 0.5, d_h = 80, d_l = 20.

The constants α, β, γ_F, γ_B and γ_diff determine the relative contributions of the three terms to the energy function. Experimentally, we have found that using α = 1.75, β = 0.75 and γ_F = γ_B = γ_diff = 5 leads to satisfactory results.

The minimization of the energy function U(ω | O) is performed using an iterative deterministic relaxation scheme known as highest confidence first (HCF), which was presented in [11]. Our implementation uses a slightly modified version of HCF. Rather than starting with an all-uncommitted configuration, we set the initial configuration based on the initial classification phase. Specifically, all regions that were detected as foreground candidates are set to the uncommitted state, while all regions that were classified as background are set to the background state.
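A minimal sketch of the three clique potentials of Eqs. (19)-(22) follows. The signs assume the usual MRF convention that consistent configurations lower the energy, which is our assumption, and the labels are represented here as the strings 'F' and 'B'.

```python
def similarity(d, T_l=0.5, T_h=2.0, d_l=20.0, d_h=80.0):
    """Clipped, decreasing similarity function f(d) of Eq. (22)."""
    f = T_h - (T_h - T_l) * (d - d_l) / (d_h - d_l)
    return max(T_l, min(T_h, f))

def V_M(label, moving, N_i, alpha=1.75):
    """Motion term, Eq. (19): a moving region is pushed toward label F."""
    consistent = (label == 'F') == moving
    return -alpha * N_i if consistent else alpha * N_i

def V_T(label, M_i, N_i, beta=0.75):
    """Temporal continuity term, Eq. (20): high memory M_i favors F."""
    return -beta * M_i * N_i if label == 'F' else beta * M_i * N_i

def V_S(label_i, label_j, A_i, A_j, N_ij, gamma=5.0):
    """Pairwise spatial term, Eq. (21), with gamma_F = gamma_B = gamma_diff."""
    f = similarity(abs(A_i - A_j))
    return (-gamma if label_i == label_j else gamma) * f * N_ij
```

Summing these terms over all regions and adjacent pairs gives the energy U(ω | O) of Eq. (18) for a candidate labeling.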
This modification increases the computational efficiency of the algorithm, since the number of uncommitted sites is reduced, especially when the moving objects are small compared to the background.

4.4. Object Tracking and Memory Update

To impose temporal coherency on the segmentation process, we incorporate a memory into the algorithm. Rather

than using a static memory, as was used, e.g., in [1], we use a dynamic tracking memory. This memory contains, for each region, the number of times it was classified as foreground in past frames. However, unlike the static memory in [1], the update of the dynamic memory consists of tracking each region to its new location in the frame, using the displacement vector estimated in the hierarchical motion estimation phase. This enables us to track objects as they move through the scene, without accumulating uncovered background in the process. Error propagation is avoided by slowly decreasing the memory values of regions that stop moving. Thus, if a background region was mistakenly detected as moving once during the sequence, its effect is diminished through time.

Let MEM_k denote the memory at the k-th frame. The memory is initialized in the first frame by setting it to zero. The update of the memory consists of two steps. For moving regions, that were classified as foreground, the values within the regions are updated according to

MEM_{k+1}(x + Δx_i, y + Δy_i) = MEM_k(x, y) + 1    (23)

while for static regions, or regions labeled as background,

MEM_{k+1}(x, y) = MEM_k(x, y) − 0.25    (24)

In order to maintain a constant memory value for each region, and to avoid leaving residues due to inaccuracies in the tracking mechanism, a post-processing step is employed following the segmentation of the next frame. In this step, a constant memory value for each region is determined by averaging the memory values over the entire region. If the majority of values is zero, then all memory values in the region are set to zero; otherwise, all memory values are set to the average value of the memory in the region. Finally, note that in case of camera motion, the memory must be adapted to the global motion as well, in order to maintain accurate tracking.
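Leaving aside the global-motion adaptation, the per-region update of Eqs. (23)-(24) might look as follows. This is a sketch under our own assumptions: the memory is clamped at zero, and a region's vacated pixels simply decay.

```python
import numpy as np

def update_memory(mem, labels, moving, mv, decay=0.25):
    """Tracking-memory update, Eqs. (23)-(24). `labels` is the region map,
    `moving` the set of region ids classified as foreground, and `mv` maps
    a region id to its (dx, dy) displacement. Out-of-frame targets drop."""
    h, w = mem.shape
    new = np.maximum(mem - decay, 0.0)        # default: decay, Eq. (24)
    for r in moving:
        dx, dy = mv[r]
        ys, xs = np.nonzero(labels == r)
        yt, xt = ys + dy, xs + dx
        ok = (yt >= 0) & (yt < h) & (xt >= 0) & (xt < w)
        # Foreground regions: carry the value to the tracked location, +1.
        new[yt[ok], xt[ok]] = mem[ys[ok], xs[ok]] + 1.0
    return new
```

Applying the update twice, once with the region moving and once static, shows the increment followed by the slow decay.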
Therefore, the memory is warped to the next frame in the global motion compensation phase, based on the estimated global motion parameters.

5. Experimental Results

The performance of the proposed algorithm was evaluated on several MPEG test sequences. All sequences used are in CIF format (352 × 288 pixels).

Fig. 5 illustrates the segmentation process for the sequence Mother & Daughter, which is a typical videoconference scene with a static camera. Due to the use of color information, the initial partition shown in Fig. 5(b), although slightly oversegmented, is accurate enough to obtain a reliable segmentation. Fig. 5(c) shows the result of the initial classification, where white areas correspond to regions detected by the memory, and gray areas correspond to regions detected based on the significance test. Moving regions, as detected by the hierarchical motion estimation and validation scheme, are depicted in Fig. 5(d). Due to the slow movement exhibited by the objects in the scene, only parts of the objects were detected as moving. Nonetheless, the MRF-based classification (Fig. 5(e)) provided a perfect labeling, resulting in an accurate segmentation mask (Fig. 5(f)).

Figure 5: Segmentation of the 43rd frame of Mother & Daughter: (a) original frame, (b) initial partition, (c) initial classification, (d) moving regions, (e) MRF labeling, (f) segmentation mask.

Fig. 6 displays the segmentation results for several other image sequences, namely Foreman, Table Tennis and Stefan. All three sequences exhibit a moving camera. As can be seen, the segmentation mask obtained in the sequence Foreman is quite accurate. Note that the object's boundaries were identified correctly, even though they are not clearly distinct. This is attributed to the use of color information in the spatial partition. Satisfactory results were achieved in the sequence Table Tennis as well.
The results in Stefan are somewhat less accurate, mainly due to the fact that the background in this sequence is cluttered.

For sequences with a stationary background (without application of the GME phase), the execution time of the algorithm is about 2 seconds per frame in CIF format. For sequences with a moving camera, execution times are in the range of 5-10 seconds per frame. The experiments were carried out on a Pentium III 500 MHz workstation.

6. Discussion and Conclusions

This paper introduces a new algorithm for unsupervised segmentation of moving objects in image sequences. The algorithm is based on a Markov random field model defined over a region adjacency graph, which uses motion information to classify regions as foreground or background.

The location of object boundaries is guided by the initial partition, which consists of a color-based watershed segmentation. Therefore, the proposed technique succeeds in locating object boundaries that are not clearly distinct, where other techniques fail. A hierarchical motion estimation and validation scheme detects moving regions in the scene, while avoiding misdetections caused by the occlusion problem. A tracking memory ensures that a reliable segmentation is maintained throughout the sequence, even when the objects stop moving. Experimental results demonstrated that the proposed technique can successfully extract moving objects from various sequences, with a stationary or a moving camera.

Future work should concentrate on incorporating temporal information, in the form of change or motion, into the initial partition, rather than relying only on spatial information. This way, the number of resulting regions can be reduced dramatically, while maintaining the structural integrity of moving objects in the scene. For instance, the entire background can be segmented as one region, while only moving objects are partitioned into smaller segments. This would reduce the computational load of the algorithm while increasing its robustness.

Figure 6: Segmentation results for the sequences: (a) Foreman, (b) Table Tennis, (c) Stefan.

References

[1] R. Mech and M. Wollborn, "A noise robust method for 2D shape estimation of moving objects in video sequences considering a moving camera," Signal Processing, vol. 66, no. 2, Apr.
[2] M. Kim, J.G. Choi, D. Kim, H. Lee, M.H. Lee, C. Ahn, and Y. Ho, "A VOP generation tool: Automatic segmentation of moving objects in image sequences based on spatio-temporal information," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, Dec.
[3] N. Paragios and G. Tziritas, "Adaptive detection and localization of moving objects in image sequences," Signal Processing: Image Commun., vol. 14, Feb.
[4] M.M. Chang, A.M. Tekalp, and M.I. Sezan, "Motion-field segmentation using an adaptive MAP criterion," Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, ICASSP '93, Minneapolis, Apr. 1993, vol. V.
[5] J.Y.A. Wang and E.H. Adelson, "Spatio-temporal segmentation of video data," Proc. SPIE: Image and Video Processing II, vol. 2182, Feb.
[6] F. Dufaux, F. Moscheni, and A. Lippman, "Spatio-temporal segmentation based on motion and static segmentation," Proc. IEEE Int. Conf. Image Processing, ICIP '95, Washington, DC, Oct. 1995, vol. 1.
[7] M. Gelgon and P. Bouthemy, "A region-level motion-based graph labeling approach to motion-based segmentation," Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, CVPR '97, Puerto Rico, June 1997.
[8] F. Dufaux and J. Konrad, "Efficient, robust, and fast global motion estimation for video coding," IEEE Trans. Image Processing, vol. 9, no. 3.
[9] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-8, no. 6, Nov.
[10] P. De Smet and D. De Vleeschauwer, "Performance and scalability of a highly optimized rainfalling watershed algorithm," Proc. Int. Conf. Imaging Science, Systems and Technology, CISST '98, Las Vegas, NV, USA, July 1998.
[11] P.B. Chou and C.M. Brown, "The theory and practice of Bayesian image labeling," Int. J. Comput. Vision, vol. 4.
[12] L. Vincent and P. Soille, "Watersheds in digital spaces: An efficient algorithm based on immersion simulations," IEEE Trans. Pattern Anal. Machine Intell., vol. 13, no. 6.
[13] T. Aach, A. Kaup, and R. Mester, "Statistical model-based change detection in moving video," Signal Processing, vol. 31, no. 2.
[14] J.R. Bergen, P. Anandan, K.J. Hanna, and R. Hingorani, "Hierarchical model-based motion estimation," Proc. European Conf. Computer Vision, ECCV '92, Italy, May 1992.
[15] J. Besag, "Spatial interaction and the statistical analysis of lattice systems," J. Roy. Statist. Soc. B, vol. 36, no. 2, 1974.


More information

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Final Report Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This report describes a method to align two videos.

More information

Noise Reduction in Image Sequences using an Effective Fuzzy Algorithm

Noise Reduction in Image Sequences using an Effective Fuzzy Algorithm Noise Reduction in Image Sequences using an Effective Fuzzy Algorithm Mahmoud Saeid Khadijeh Saeid Mahmoud Khaleghi Abstract In this paper, we propose a novel spatiotemporal fuzzy based algorithm for noise

More information

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b

More information

Accelerating Pattern Matching or HowMuchCanYouSlide?

Accelerating Pattern Matching or HowMuchCanYouSlide? Accelerating Pattern Matching or HowMuchCanYouSlide? Ofir Pele and Michael Werman School of Computer Science and Engineering The Hebrew University of Jerusalem {ofirpele,werman}@cs.huji.ac.il Abstract.

More information

Evaluation of texture features for image segmentation

Evaluation of texture features for image segmentation RIT Scholar Works Articles 9-14-2001 Evaluation of texture features for image segmentation Navid Serrano Jiebo Luo Andreas Savakis Follow this and additional works at: http://scholarworks.rit.edu/article

More information

Image Analysis Image Segmentation (Basic Methods)

Image Analysis Image Segmentation (Basic Methods) Image Analysis Image Segmentation (Basic Methods) Christophoros Nikou cnikou@cs.uoi.gr Images taken from: R. Gonzalez and R. Woods. Digital Image Processing, Prentice Hall, 2008. Computer Vision course

More information

A Feature Point Matching Based Approach for Video Objects Segmentation

A Feature Point Matching Based Approach for Video Objects Segmentation A Feature Point Matching Based Approach for Video Objects Segmentation Yan Zhang, Zhong Zhou, Wei Wu State Key Laboratory of Virtual Reality Technology and Systems, Beijing, P.R. China School of Computer

More information

INVARIANT CORNER DETECTION USING STEERABLE FILTERS AND HARRIS ALGORITHM

INVARIANT CORNER DETECTION USING STEERABLE FILTERS AND HARRIS ALGORITHM INVARIANT CORNER DETECTION USING STEERABLE FILTERS AND HARRIS ALGORITHM ABSTRACT Mahesh 1 and Dr.M.V.Subramanyam 2 1 Research scholar, Department of ECE, MITS, Madanapalle, AP, India vka4mahesh@gmail.com

More information

GENERAL AUTOMATED FLAW DETECTION SCHEME FOR NDE X-RAY IMAGES

GENERAL AUTOMATED FLAW DETECTION SCHEME FOR NDE X-RAY IMAGES GENERAL AUTOMATED FLAW DETECTION SCHEME FOR NDE X-RAY IMAGES Karl W. Ulmer and John P. Basart Center for Nondestructive Evaluation Department of Electrical and Computer Engineering Iowa State University

More information

Research on Recognition and Classification of Moving Objects in Mixed Traffic Based on Video Detection

Research on Recognition and Classification of Moving Objects in Mixed Traffic Based on Video Detection Hu, Qu, Li and Wang 1 Research on Recognition and Classification of Moving Objects in Mixed Traffic Based on Video Detection Hongyu Hu (corresponding author) College of Transportation, Jilin University,

More information

Marcel Worring Intelligent Sensory Information Systems

Marcel Worring Intelligent Sensory Information Systems Marcel Worring worring@science.uva.nl Intelligent Sensory Information Systems University of Amsterdam Information and Communication Technology archives of documentaries, film, or training material, video

More information