Inferring Segmented Dense Motion Layers Using 5D Tensor Voting

Changki Min and Gérard Medioni, Fellow, IEEE

Abstract: We present a novel local spatiotemporal approach to produce motion segmentation and dense temporal trajectories from an image sequence. A common representation of image sequences is a 3D spatiotemporal volume (x, y, t), and its corresponding mathematical formalism is the fiber bundle. However, directly enforcing the spatiotemporal smoothness constraint is difficult in the fiber bundle representation. Thus, we convert the representation into a new 5D space (x, y, t, v_x, v_y) with an additional velocity domain, where each moving object produces a separate 3D smooth layer. The smoothness constraint is now enforced by extracting 3D layers using the tensor voting framework in a single step that solves both correspondence and segmentation simultaneously. Motion segmentation is achieved by identifying those layers, and the dense temporal trajectories are obtained by converting the layers back into the fiber bundle representation. We proceed to address three applications (tracking, mosaicking, and 3D reconstruction) that are hard to solve from the video stream directly because of the segmentation and dense matching steps but become straightforward with our framework. The approach does not make restrictive assumptions about the observed scene or camera motion and is therefore generally applicable. We present results on a number of data sets.

Index Terms: Motion analysis, tensor voting, optical flow, segmentation, mosaicking.

• C. Min is with Apple Inc., 6 Infinite Loop, MS 306-2CW, Cupertino, CA. E-mail: changkimin@gmail.com.
• G. Medioni is with the Institute for Robotics and Intelligent Systems, University of Southern California, PHE 212, MC-0273, Los Angeles, CA. E-mail: medioni@usc.edu.

Manuscript received 13 Feb. 2006; revised 10 Aug. 2006; accepted 12 Sept. 2007; published online 10 Oct. 2007. Recommended for acceptance by J. Weickert.

1 INTRODUCTION

MANY complex video-related vision tasks require the segmentation of moving objects from the background. For instance, tracking systems locate moving objects first and then find their temporal trajectories; surveillance systems need to detect movement in their field of view. Also, videos can be efficiently compressed by providing region-level segmentation and temporal estimation of their motion. In addition to region segmentation, we notice that these tasks require the temporal motion information of the segmented regions. Thus, the two important concepts in motion analysis are motion segmentation and the dense temporal trajectories of pixels, and they are studied in this paper within a new framework.

The main goal of motion segmentation is to segment an image into regions based on the similar motions of nearby individual pixels. Here, the most important constraint, which also defines the meaningful regions, is the coherence of the motion of pixels in a region. The temporal trajectories are created by identifying pixel correspondences across frames, and we create dense trajectories by considering all pixels in the reference image. In our proposed approach, these two tasks, motion segmentation and temporal trajectory inference (or correspondence), are solved at the same time, and the individual moving objects are simply identified by grouping pixels that have similar temporal trajectories. The resulting segmentation and dense trajectory information serve as fundamental building blocks in many other vision problems.

A set of pixels in a region with similar motion characteristics constitutes a smooth layer. The layered representation for motion analysis was introduced by Wang and Adelson [33] and adopted by other researchers [4], [2], [34]. Image sequences are decomposed into multiple layers, and maps are associated with the layers to define their motion, opacity, and so on. Most layer-based approaches use an affine motion model, which may fail when the scene contains strong parallax effects or when moving objects cover a large part of the image.

Many algorithms for two-frame motion segmentation have been proposed. For instance, Kolmogorov and Zabih [19] proposed a graph-cut approach to minimize an energy function for the correspondence problem. The energy function is constructed in such a way that a solution that violates uniqueness (a pixel in one image should correspond to either one pixel or no pixel in the other image) produces infinite energy. The optimization of this energy function is NP-hard, so the authors use the α-expansion algorithm to compute a strong local minimum. Ayer and Sawhney [4] estimated motion parameters based on an ML estimation of mixture models and the MDL encoding principle. However, generalizing the above methods to multiple frames is not trivial.

There are many approaches that utilize multiple frames for motion analysis. The factorization method introduced by Tomasi and Kanade [29] is one of the most popular techniques and has been significantly improved by other researchers [12], [17], [32], [36]. For instance, Costeira and Kanade [12] introduced the shape interaction matrix for multibody factorization, and solutions for some of its limitations have been presented in [32], [36]. Most factorization approaches, however, are sensitive to noise and assume linear models such as affine.

Fig. 1. Spatiotemporal volume representation. (Left) An image sequence. (Right) Its spatiotemporal volume, which is created by stacking all of the images along the time axis.

The approaches presented in [3], [10], [13], [21], [26] solve both the optical flow and segmentation problems simultaneously. Amiaz and Kiryati [3] embedded the accurate optical flow estimation framework introduced by Brox et al. [9] within a contour-based segmentation framework for dense optical flow computation and crisp flow boundaries. Cremers and Soatto [13] and Brox et al. [10] incorporated level set and variational methods to minimize energy functions with the spatiotemporal smoothness constraint. Although these approaches produce fairly good optical flow and segmentation results, they are limited to processing only two frames or two competing regions, except [10], [13].

Nicolescu and Medioni [25] first proposed using the tensor voting framework with a layered representation for motion segmentation. The initial noisy motion layers are smoothed by using tensor voting, and then precise motion boundaries are computed by applying 2D intensity cues around candidate motion boundary regions. Although the resulting segmentation is usually accurate, it may fail to find accurate motion boundaries if, for example, the boundary regions have complex textures or there is a stronger intensity cue around the true motion boundaries. This is because the use of a 2D intensity cue is not appropriate except in cases where the motion boundaries have high contrast with the background. Also, this approach is inherently limited to a pair of images, meaning that rich temporal motion information is not used; thus, it generates only the segmentation, without any temporal trajectories.

A convenient representation of a video sequence is a spatiotemporal 3D (x, y, t) volume, as illustrated in Fig. 1. The (x, y, t) spatiotemporal volume on the right side of Fig. 1 is constructed by stacking all of the images on the left side of Fig. 1 in temporal order. Bolles et al. [7] exploited the spatiotemporal volume to analyze translational motion in image sequences. An epipolar-plane image, or EPI for short, is formed by slicing the spatiotemporal volume, taking into account the epipolar geometry (for example, the top surface on the right side of Fig. 1), and the motion information of the image sequence is obtained by analyzing the line patterns in the EPI. An extended analysis for more general motions is presented in [5]. Although it is easy to visualize the motion in this space, the constraints are difficult to implement.

The appropriate mathematical formalism to describe these structures is that of fiber bundles [16]. A fiber bundle is a map p : E → B, where B is the base space, E is the total space, and the map p is called the projection of the bundle. For each point b ∈ B in the base space, the space p⁻¹(b) is called a fiber of the bundle over b ∈ B. This is shown in Fig. 2a. For the motion problem, we assign the reference image as the base space. Then, the temporal trajectory of each pixel in the reference image forms a fiber, as can be seen in Fig. 2b.

Fig. 2. Fiber bundle representation. (a) Fiber bundle concept. (b) Fiber bundle representation for motion analysis.
Although the fiber bundle formalism gives a good representational framework, it does not provide any tools to solve the stated motion problem. Therefore, we convert the fiber bundle representation to a 5D space (x, y, t, v_x, v_y), in which (x, y) represents the pixel coordinates in the reference image, t represents time (frame number), and (v_x, v_y) represents the velocity of a pixel in the reference image at each time instance t. In this 5D space, each moving object is represented as a smooth layer by v_x(x, y, t) and v_y(x, y, t). Because the initial layers are not smooth due to noisy initial correspondences, we impose a spatiotemporal smoothness constraint upon the layers. Segmentation is performed on the resulting smoothed layers, which are then transformed back into the fiber bundle representation to obtain temporal trajectories. In fact, the velocity (v_x, v_y) implicitly exists in the 3D (x, y, t) space as well. However, at the cost of increasing the dimension of the space by explicitly adding the velocity, we can easily deal with the spatiotemporal data (more technically, this kind of approach is called a phase space representation in control theory).

In our proposed approach, we enforce a single constraint: spatiotemporal smoothness [22]. In other words, we assume that each pixel in the reference frame moves smoothly in time (that is, a temporal smoothness constraint) and that neighboring pixels belonging to a single object move in a similar way (that is, a spatial smoothness constraint). These two constraints are enforced simultaneously in the 5D space. This is why our approach does not require any restrictive camera or motion model such as affine. Even nonrigid motion can be properly handled because it still satisfies the spatiotemporal smoothness constraint. It is important to note that the spatiotemporal smoothness of motion is equivalent to the smoothness of the layers in the 5D space. Therefore, our main tasks are smoothing the layers and identifying the individual layers that correspond to independently moving regions.

To enforce the spatiotemporal smoothness constraint in the 5D space, we use the tensor voting framework [20]. It is a generic perceptual organization framework, and one of its main functions is to extract geometric features such as curves and surfaces from a given set of points.
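As an illustration of this conversion (a sketch of ours, not the authors' code; the (T, H, W, 2) array layout and the helper name are assumptions), the following builds the 5D tokens from per-frame velocity hypotheses attached to the reference image:

```python
import numpy as np

def build_5d_tokens(velocities):
    """Convert per-pixel velocity hypotheses into 5D tokens (x, y, t, vx, vy).

    velocities: float array of shape (T, H, W, 2), where velocities[t, y, x]
    is the estimated velocity of reference pixel (x, y) at time t; entries
    may be NaN where no correspondence hypothesis exists.
    """
    T, H, W, _ = velocities.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W),
                          indexing="ij")
    tokens = np.stack([x.ravel(), y.ravel(), t.ravel(),
                       velocities[..., 0].ravel(),
                       velocities[..., 1].ravel()], axis=1)
    return tokens[~np.isnan(tokens).any(axis=1)]  # drop missing hypotheses
```

Each moving object should then trace out a locally smooth 3D layer inside this (x, y, t, v_x, v_y) point set, which is exactly what the voting stage extracts.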

In summary, the outstanding advantage of our voting-based motion analysis approach is that it eliminates most limitations that other approaches inherently have and produces better results. For instance, we do not make any assumption regarding the camera model, motion model, or scene structure, except to expect that motion should be locally smooth in space and time. Also, our framework is designed to detect and segment multiple moving objects simultaneously from multiple frames. Using the tensor voting framework also ensures that our approach is very robust to noise. These advantages over other approaches allow us to apply our framework to a broader range of image sequences and produce accurate results with the simultaneous use of spatial and temporal motion information.

This paper is organized as follows: In Section 2, we present the structure of the 5D space and the 5D tensor voting framework. The details of our proposed algorithm are discussed in Section 3. Section 4 shows various experimental results of our framework, including not only simple motion segmentation examples but also interesting vision problems that can be easily solved by our framework. The conclusion and future work are presented in Section 5.

2 DATA REPRESENTATION AND 5D TENSOR VOTING

Visualizing the proposed 5D space and its structure is crucial to understanding our approach. In practice, however, it is not possible to visualize the 5D space. Thus, we first present an equivalent 3D space for 1D image sequences and then extend it to the 5D space. We also explain 5D tensor voting in this section.

2.1 Analogy for One-Dimensional Image

By assuming that all of the pixels in a 2D image have only horizontal motion, a 1D image can be generated by extracting a single scan line of the 2D image. This process is illustrated in Fig. 3a. The original 2D image sequence is a synthetic one in which all of the pixels move only horizontally (the car moves to the left and the background moves to the right). Therefore, we can produce a 1D image sequence by extracting the same scan line from all images (the horizontal lines in Fig. 3a). Their spatiotemporal space (x, t) is shown in Fig. 3b, created by stacking all of the 1D images together in temporal order. The two parts of the background are labeled object 1 and object 3, and the car is labeled object 2. Although we can visually observe the clear motion of the car and the background in the spatiotemporal space, there is no actual motion information yet because we have not established any correspondences. Using cross correlation, we first find correspondences between the reference 1D image (the horizontal line in Fig. 3b) and all other images. The resulting initial temporal trajectories (that is, fibers) are presented in Fig. 3c with object numbers from 1 to 3. As can be seen there, the trajectories are quite noisy due to occlusions and camera motion. Since the spatiotemporal smoothing and the segmentation are difficult in this spatiotemporal space, we convert it into the proposed 3D space (a 5D space in the case of 2D image sequences), which is illustrated in Fig. 3d.

Fig. 3. One-dimensional image example. (a) Original 2D synthetic images. (b) Vertical stack of all 1D images. (c) Initial noisy temporal trajectories. (d) Initial noisy layers. (e) Smoothed layers by tensor voting. (f) Spatiotemporally smoothed trajectories.
The 3D space has three components: the image domain x, the time domain t, and the velocity domain v. Thus, a point in the 3D space indicates the velocity v of a pixel x in the reference frame at a given moment t. The initial layers in the 3D space include a large number of outliers since they were generated from noisy trajectories. Through the tensor voting stage, these outliers are easily detected and removed, and their correct positions in the space are estimated via a densification stage, resulting in the smooth layers with 2D support in Fig. 3e. The smoothness of each layer along the time axis represents the temporal smoothness constraint of motion, while the smoothness of each layer along the image axis represents the spatial smoothness constraint of motion. This illustrates how we enforce both the spatial and temporal smoothness constraints simultaneously through tensor voting. The numbers on the layers in Figs. 3d and 3e correspond to the object numbers in Figs. 3b and 3c. Hence, it should be clear that finding independently moving regions is equivalent to identifying each layer in Fig. 3e. The final temporal trajectories computed from the smooth layers in Fig. 3e are shown in Fig. 3f. Through the spatiotemporal smoothing process, we can find accurate trajectories for all pixels in the reference frame. This also means that the full continuous motion of objects that are partially or fully occluded in some frames can be recovered, as shown in Fig. 3f. The layers in Fig. 3e are perfectly flat in this example because every pixel has a constant-velocity translational motion. In general, however, the layers are not necessarily flat. Indeed, they can be any smooth shapes, depending on how the objects and the camera move.

2.2 Description of the 5D Space

The previous 1D image example is a special case where all of the pixels have only horizontal motion. In general sequences, pixels move arbitrarily in the image space; thus, the layered representation of 2D image sequences should be described in the 5D space instead of the 3D space. Basically, both the 5D and 3D spaces have the same structure except for their dimensions. The comparison is shown in Table 1.

TABLE 1. Comparison between the 5D Space and the 3D Space

More specifically, we define a velocity vector function V over the 3D spatiotemporal domain D = I × T, where I ⊂ ℝ² represents the pixels in the reference frame and T ⊂ ℝ represents time; that is,

V : I × T → ℝ², (x, y, t) ↦ (v_x, v_y).

In practice, we have to deal with discrete image coordinates and time. The velocities can be real numbers due to subpixel-accuracy computation. In our approach, all of the initial mappings from (x, y, t) to (v_x, v_y) are obtained from measurements (that is, correspondence hypotheses), and smooth motion (in space-time) produces 3D layers in the 5D space. Note that our framework does not try to compute an analytic form of the function V. Rather, each final (v_x, v_y) at (x, y, t) is obtained by voting, which is a spatiotemporal process in the 5D space.

2.3 Tensor Voting in 5D Space

In this section, we briefly explain the concept of tensor voting in a 5D space. More details about the general tensor voting framework can be found in [20]. The tensor voting framework has two elements: tensor calculus for data representation and tensor voting for data communication. Each input point is initially encoded as a tensor, which is a symmetric nonnegative definite matrix. In the 5D space, the eigenvalues of the matrix are λ_1, λ_2, λ_3, λ_4, and λ_5 in descending order, and the corresponding eigenvectors are e_1, e_2, e_3, e_4, and e_5, respectively. Intuitively, the shape of the tensor defines the type of geometric feature (for example, curve or surface), and the size defines its saliency, or confidence measure. If the orientation of a point is unknown, then the point is initialized as the identity matrix, which is the case in our approach. Otherwise, the given orientation, such as a surface normal vector, is properly encoded in its tensor.

After the encoding step, each token (a 5D point with its associated tensor) casts votes to its neighboring tokens based on predefined voting kernels. Each voting kernel is a tensor field, and it encapsulates all voting-related information, such as the size and shape of the voting neighborhood and the vote strength and orientation at each point. The basic idea of a voting kernel can be explained with the fundamental 2D stick field, which is illustrated in Fig. 4. Let us assume that the voter O (the token that casts votes) is given with its normal N and that we want to compute its vote to the receiver P. To generate the vote, we must consider two things: the orientation and the strength of the vote. As can be seen in Fig. 4a, the orientation (the gray arrow starting from P) is obtained by drawing a circle whose center lies on the line of N (in this case, at C) and that passes through both O and P while preserving the normal N. This construction ensures the smoothest connection between the two points O and P with their associated normals.

Fig. 4. Generation of the fundamental 2D stick field. (a) Orientation and strength. (b) Resulting fundamental 2D stick field. See text for more details.
The strength of the vote is computed by the following decay function:

DF(s, κ, σ) = e^(−(s² + cκ²) / σ²).

Here, s is the arc length, κ is the curvature, c is the degree of decay, and σ is the scale of voting, which essentially defines the neighborhood size. The resulting fundamental 2D stick field is shown in Fig. 4b, where a darker region represents stronger votes. If the orientation of the voter is unknown, then it is encoded as a ball tensor (that is, an identity matrix). In this case, the vote is generated by integrating the contributions of the rotating fundamental field. Voting fields in higher dimensional spaces can be generated in a similar way.

During the voting process, each input token collects votes from its neighbors by tensor addition, and the resulting tensor T at a point is decomposed as

T = λ_1 e_1 e_1^T + λ_2 e_2 e_2^T + λ_3 e_3 e_3^T + λ_4 e_4 e_4^T + λ_5 e_5 e_5^T
  = (λ_1 − λ_2) e_1 e_1^T
  + (λ_2 − λ_3)(e_1 e_1^T + e_2 e_2^T)
  + (λ_3 − λ_4)(e_1 e_1^T + e_2 e_2^T + e_3 e_3^T)
  + (λ_4 − λ_5)(e_1 e_1^T + e_2 e_2^T + e_3 e_3^T + e_4 e_4^T)
  + λ_5 (e_1 e_1^T + e_2 e_2^T + e_3 e_3^T + e_4 e_4^T + e_5 e_5^T).

Table 2 shows the relation between the dimensionality of geometric features and their tensor interpretation in the 5D space. The saliency of each feature is defined as shown in the table, and it represents the confidence of each feature. The dimensionality of each geometric feature is given by the number of parameters that are required to describe the feature in a parametric representation.

TABLE 2. Tensor Interpretation in the 5D Space
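To make these two ingredients concrete, here is a minimal sketch (ours, not the authors' implementation; the function names are our own) of the decay function above and of the eigenvalue gaps used to read an accumulated 5D tensor:

```python
import numpy as np

def decay(s, kappa, c, sigma):
    """Vote strength DF(s, kappa, sigma) = exp(-(s^2 + c*kappa^2) / sigma^2)."""
    return np.exp(-(s**2 + c * kappa**2) / sigma**2)

def saliencies(T):
    """Decompose an accumulated 5x5 tensor into its saliency values.

    Returns the gaps (l1-l2, l2-l3, l3-l4, l4-l5, l5); for the motion
    layers sought here (3D varieties in 5D), the relevant gap is l2-l3.
    """
    # eigvalsh returns eigenvalues in ascending order; flip to descending.
    lam = np.linalg.eigvalsh(T)[::-1]
    return np.append(lam[:-1] - lam[1:], lam[-1])
```

An isolated wrong match accumulates a nearly isotropic tensor, so all of its gaps stay small and the token can later be rejected as an outlier.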

3 OUR APPROACH

In this section, we provide the details of our proposed approach; Fig. 5 shows the overall algorithm. Our motion analysis framework is based on the 4D framework proposed by Nicolescu and Medioni [25], and our overall processing steps are similar to theirs. For instance, both approaches have the same candidate match step, initial tensor voting step, outlier rejection step, and final densification step. However, our 5D framework is quite different from the 4D framework in two important aspects. First, we combine the rich temporal motion information with the spatial motion information by processing multiple frames together, while the 4D framework uses only two-frame motion information. Second, we use monocular cues in a different way (that is, oversegmentation and merging) to estimate more accurate motion boundaries. In other words, we process different kinds of data sets using similar processing steps. Also note that the use of temporal motion information provides not only a better motion analysis but also straightforward solutions to other difficult vision problems, as shown in Section 4.

3.1 Initial Color Oversegmentation

In order to obtain accurate motion boundaries, we start our framework with a color-based segmentation of the reference image. We have a very strong preference for erring on the side of oversegmentation, from which we can recover. After smoothing these segments individually, the motion boundary of each moving object is created by merging the segments based on their motion similarity. Thus, the quality of the final motion segmentation is affected by the quality of the oversegmentation. However, when it comes to oversegmentation, most known segmentation approaches generate reasonably good results. For our experiments, we use the mean shift-based image segmentation described in [11] with color reference images, and it usually generates 100 to 200 segments.

The color-based oversegmentation approach ensures that each segment contains pixels of only a single object. Therefore, we can assume that the motion layer associated with each segment forms a smooth layer because of the spatiotemporal smoothness constraint. In [25], the spatial smoothness constraint is enforced on the layers in a 4D space without prior segmentation. For the initial tensor voting and outlier rejection steps, this is probably reasonable because the motion layers of different objects are physically separated in the 4D space, minimizing votes between different motion layers. In the densification step, however, it is not easy to estimate missing tokens around motion boundaries because different motion layers can cast votes to all candidates. To solve this problem, the authors utilize 2D image gradient information around motion boundaries, which, unfortunately, often fails when the image has textures around the motion boundaries. In our approach, the voting process is restricted within each segment. Thus, we can avoid unwanted votes between different objects in both the initial tensor voting and the densification processes.

Fig. 5. Overall proposed algorithm for motion segmentation and analysis.

3.2 Finding Candidate Matches

In order to find candidate matches between the reference image and the other images, we use a simple cross correlation. For a given pair of images, we first define the size of the correlation window and the search range. The window sizes we use in our experiments roughly range from 5 to 17, and they work fine in most cases. A large correlation window tends to oversmooth the motion layers, while a small one tends to capture detailed local motion with many outliers.
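A minimal version of this matching step might look as follows (our own sketch using normalized cross correlation over a square window; the window size and search range are the tunable parameters discussed above):

```python
import numpy as np

def best_match(ref, img, x, y, win=7, search=10):
    """Find the best match for reference pixel (x, y) in img.

    Compares a (win x win) patch around (x, y) in ref against patches at
    offsets within +/-search in img, using normalized cross correlation.
    Assumes (x, y) lies at least win//2 pixels inside ref.
    Returns the velocity (vx, vy) of the highest-scoring offset.
    """
    r = win // 2
    patch = ref[y - r:y + r + 1, x - r:x + r + 1].astype(float)
    patch = (patch - patch.mean()) / (patch.std() + 1e-9)
    best, best_v = -np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = img[y + dy - r:y + dy + r + 1,
                       x + dx - r:x + dx + r + 1].astype(float)
            if cand.shape != patch.shape:   # skip offsets outside the image
                continue
            cand = (cand - cand.mean()) / (cand.std() + 1e-9)
            score = (patch * cand).mean()
            if score > best:
                best, best_v = score, (dx, dy)
    return best_v
```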
A rough search range is manually given at the beginning by observing the sequence. Fig. 6 shows how to find matches using the cross-correlation window. For each pixel in Fig. 6a, a correlation window is defined around the pixel, and its best match (that is, the one with the highest correlation score) is found in Fig. 6b within the search range.

Fig. 6. Finding matches using cross correlation. A correlation window of a certain size is associated with each pixel in the left-side image, and its best match (that is, highest correlation score) is then searched for in the right-side image to estimate the velocity of the pixel. The search is made within the predefined search range, as shown in the right-side image.

In our framework, we find a candidate temporal trajectory of a pixel instead of a single match between two images. Assume that four images (I_0, I_1, I_2, I_3) are given and the first one, I_0, is the reference image, as presented in Fig. 7. From the correlation computation, we obtain the velocities (V_01(p), V_02(p), V_03(p)), where V_ij(p) is the velocity of a pixel p in the reference image between I_i and I_j. Since the appropriate velocity form in our approach is the velocity of the pixel p at each time t (that is, V_01(p), V_12(p), and V_23(p)), we need to convert the obtained velocities into these forms using the simple equation V_{t,t+1}(p) = V_{0,t+1}(p) − V_{0,t}(p). A sequential search that finds V_01(p), V_12(p), and V_23(p) directly from the given images is not appropriate because any wrong match at time t would result in wrong matches for all of the frames after t.
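In code, this conversion from reference-anchored velocities to per-frame velocities is a single difference (our sketch, under the array conventions used earlier; V0[t] here stands for V_{0,t+1}(p) stacked over all pixels):

```python
import numpy as np

def to_stepwise_velocities(V0):
    """Convert reference-anchored velocities to per-time-step velocities.

    V0: array of shape (T, H, W, 2), where V0[t] holds V_{0,t+1}(p), the
    displacement of each reference pixel p between frames I_0 and I_{t+1}.
    Returns V with V[t] = V_{t,t+1}(p) = V_{0,t+1}(p) - V_{0,t}(p),
    taking V_{0,0}(p) = 0.
    """
    V = np.empty_like(V0)
    V[0] = V0[0]                    # V_{0,1} = V_{0,1} - V_{0,0}
    V[1:] = V0[1:] - V0[:-1]        # V_{t,t+1} = V_{0,t+1} - V_{0,t}
    return V
```

Because every match is anchored to the reference frame, an early wrong match does not propagate, which is exactly why the sequential search is avoided.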

Fig. 7. Finding candidate matches. V_01(p), V_02(p), and V_03(p) are found first from the images, and then V_01(p), V_12(p), and V_23(p) are computed from them.

When we extend the 4D framework [25] to the 5D framework, there is one limitation in this candidate match step. In the 4D framework, where only two frames are processed, each pixel can have multiple candidates at the same time, increasing the chances of having correct matches. In our 5D framework, however, each pixel in the reference frame is allowed to have only a single candidate match for each of the other frames. This limitation arises from the fact that we are considering candidate temporal trajectories in multiple frames. If multiple candidate matches were allowed between the reference frame and each of the other frames, then the number of possible combinations for temporal trajectories would explode.

It is very important to note that the current simple cross-correlation approach to finding candidate matches can be replaced by any other approach that also finds matches. Indeed, our proposed motion analysis framework focuses on the spatiotemporal smoothing of a given set of candidate matches that include many outliers; thus, the candidate match step is not a main part of the framework. For instance, it is obvious that the cross-correlation approach may not correctly detect rapid rotation and zooming because it implicitly assumes local translation. In this case, we might be able to generate better candidate matches by using more sophisticated approaches such as [9], [10]. The motion analysis framework itself, however, is unaffected by this issue and can process all kinds of smooth motion, including, for example, rapid rotation and zooming.

3.3 Initial Tensor Voting

The generation of 5D tokens from the correspondences proceeds as follows: Assume that a pixel P whose coordinates are (x_r, y_r) in the reference frame has its matches in the frames at times t and t + 1 and that their coordinates are (x_t, y_t) and (x_{t+1}, y_{t+1}), respectively. In the 3D temporal trajectory representation (that is, the fiber bundle), those two matches form a short trajectory segment between (x_t, y_t, t) and (x_{t+1}, y_{t+1}, t + 1). Our main proposed idea is to convert the temporal trajectory representation into the 5D point representation, and the corresponding 5D point of pixel P in frame t can be represented as (x_r, y_r, t, x_{t+1} − x_t, y_{t+1} − y_t).

Fig. 8 illustrates the conversion process and how differently the spatiotemporal smoothness constraint is expressed in the two spaces. In the figure, we are observing the motion of four pixels (A, B, C, D) in three consecutive frames (t − 1, t, t + 1), and their temporal trajectories are shown in Fig. 8a. Assuming that x, y, and t are continuous, a trajectory curve can be expressed by the parametric equations

x = x(t), y = y(t), t = t.

Thus, we can compute a tangent vector T at any point of the trajectories:

T = (dx/dt, dy/dt, 1) = (v_x, v_y, 1).

The tangent vectors of the 12 sample points in the 3D region between t − 1 and t + 1 in Fig. 8a must be similar because we assume spatiotemporally smooth motion.

Fig. 8. Converting the 3D temporal trajectory representation into the 5D spatiotemporal space. We are observing four pixels (A, B, C, D) in three consecutive frames, t − 1, t, and t + 1.
In the 5D space (Fig. 8b), the v_x and v_y components of the tangent vector at each (x, y, t) are explicitly represented along the additional (v_x, v_y) axes. The set of all such 5D points constitutes layers that are locally smooth because adjacent 5D points are generated from spatiotemporally adjacent temporal trajectories (for example, the 12 points in Fig. 8a). The layer in the 5D space can also be represented by the parametric equations

x = x, y = y, t = t, v_x = v_x(x, y, t), v_y = v_y(x, y, t).

Since these equations are controlled by the three parameters (x, y, t), the layer has three tangent vectors (T_1, T_2, T_3),

T_1 = (1, 0, 0, ∂v_x/∂x, ∂v_y/∂x),
T_2 = (0, 1, 0, ∂v_x/∂y, ∂v_y/∂y),
T_3 = (0, 0, 1, ∂v_x/∂t, ∂v_y/∂t),

and two normal vectors (N_1, N_2), which span the orthogonal complement of the tangent space. Therefore, the layer we are trying to extract has a dimensionality of three (that is, it is a 3D variety), and the five eigenvectors associated with the tensor at each point are interpreted as two normal vectors (e_1, e_2) and three tangent vectors (e_3, e_4, e_5). Accordingly, the saliency we are interested in is computed by λ_2 − λ_3, as can be found in Table 2.
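To make this geometry concrete, the following sketch (ours, not from the paper) estimates the three tangent vectors of a layer at every grid point by finite differences on a dense velocity field:

```python
import numpy as np

def layer_tangents(vel):
    """Tangent vectors of the layer (x, y, t, vx(x,y,t), vy(x,y,t)).

    vel: dense velocity field of shape (T, H, W, 2).
    Returns T1, T2, T3, each of shape (T, H, W, 5), where, e.g.,
    T1 = (1, 0, 0, dvx/dx, dvy/dx) at every grid point.
    """
    T, H, W, _ = vel.shape
    dvt, dvy, dvx = np.gradient(vel, axis=(0, 1, 2))
    ones, zeros = np.ones((T, H, W)), np.zeros((T, H, W))
    T1 = np.stack([ones, zeros, zeros, dvx[..., 0], dvx[..., 1]], axis=-1)
    T2 = np.stack([zeros, ones, zeros, dvy[..., 0], dvy[..., 1]], axis=-1)
    T3 = np.stack([zeros, zeros, ones, dvt[..., 0], dvt[..., 1]], axis=-1)
    return T1, T2, T3
```

The two normal directions at a point can then be taken as an orthonormal basis of the complement of span(T_1, T_2, T_3), for example, via an SVD.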

Since the initial matches do not provide any orientation, we encode the obtained 5D points as ball tensors. This means that all of the eigenvalues are set to one and the eigenvector matrix is the identity matrix. After all points are encoded, each of them casts its information to its neighborhood, the size of which is defined by σ, through the tensor voting process.

3.4 Outlier Rejection and Densification

Through the first tensor voting step, each token collects votes from its neighbors within the same segment. As mentioned in the previous section, the geometric feature that represents the desired smooth layers in the 5D space is the 3D variety, and its saliency is computed by λ_2 − λ_3. If a token is within a layer, then its neighbors strongly support the token, resulting in high 3D variety saliency. On the other hand, if a token results from a wrong match, then it is more likely an isolated point in the 5D space and collects weaker votes from its neighbors than tokens in layers do. Thus, we can locate those outliers by comparing the 3D variety saliency of each token with a threshold that is set by averaging all of the 3D variety saliencies in the segment.

After the outlier rejection process, we need to estimate the correct positions of the removed tokens to obtain dense velocity fields. The smoothness constraint is also applied in this densification process, which Fig. 9 illustrates. At a removed position (x, y, t), we set up multiple candidate velocities (v_x, v_y) between the minimum and maximum velocities of its neighbors within the same segment and generate the corresponding candidate 5D tokens (square dots). Each candidate token collects votes from its existing neighbors (circular dots), and we then select the one with the highest 3D variety saliency (the square in the middle) for the (x, y, t) position.

Fig. 9. Densification process. Circular dots indicate inliers, and the vertical square dots indicate a set of candidate token positions. Due to the smoothness constraint, the candidate in the middle obtains the highest saliency, so it is selected.

In the densification process, the previous oversegmentation greatly helps in preserving the exact motion boundaries. In Fig. 10, we assume that the removed token position belongs to the boundary of object 2, so the best estimated token position must be the second candidate from the bottom. Without the oversegmentation, the inlier tokens of both object 1 and object 2 cast votes to the candidate tokens. Thus, it is not clear which candidate token, the second one from the top or the second one from the bottom, will obtain the highest saliency. On the other hand, if we know in advance from the oversegmentation that the removed token position belongs to object 2, then we can allow only the inlier tokens from object 2 to cast votes to the candidates. This correctly finds the best candidate token, which is the second one from the bottom.

Fig. 10. Densification process without the oversegmentation. The candidates for the missing token position that belongs to object 2 collect votes from the inlier tokens of both object 1 and object 2.
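A schematic version of this candidate search might look as follows (our illustration; `variety_saliency` stands for the λ_2 − λ_3 computation after votes are collected from the segment's inlier tokens, and its implementation is assumed; sampling a 2D grid of candidate velocities is our reading of "between the minimum and maximum velocities"):

```python
import numpy as np

def densify_position(x, y, t, neighbor_vel, variety_saliency, n=10):
    """Estimate the velocity at a removed token position (x, y, t).

    neighbor_vel: (N, 2) velocities of the surviving (inlier) tokens of the
    same segment near (x, y, t). Candidate velocities are sampled between
    their minimum and maximum; the candidate 5D token with the highest
    3D variety saliency (lambda_2 - lambda_3) is kept.
    variety_saliency: assumed callback that collects votes for a candidate
    token from the segment's inliers and returns lambda_2 - lambda_3.
    """
    lo, hi = neighbor_vel.min(axis=0), neighbor_vel.max(axis=0)
    vxs = np.linspace(lo[0], hi[0], n)
    vys = np.linspace(lo[1], hi[1], n)
    cands = np.array([(vx, vy) for vx in vxs for vy in vys])
    scores = [variety_saliency(np.array([x, y, t, vx, vy]))
              for vx, vy in cands]
    return cands[int(np.argmax(scores))]
```

In the full method, the votes come only from inliers of the same segment, which is what keeps the estimate on the correct side of a motion boundary.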
3.5 Merging Motion Layer Segments

Through the previous densification process, we have obtained smooth layers and velocity information for every pixel in the 5D space. Each layer is associated with a small homogeneous image segment, so we need to merge the layers to form complete object shapes. In our approach, we compare all adjacent layers in pairs and use the following two properties to measure the similarity between them:

• the average of the (v_x, v_y) components (that is, the average velocity of the pixels) of the 5D tokens around the boundary between the two layers and
• the normal space that those tokens span.

Roughly speaking, the first property examines whether two layers are connected to each other, and the second property examines whether the connection is smooth. In the case of nontranslational motion such as zooming, rotation, or nonrigid motion, the layers may not be flat or parallel to the (x, y, t) space. In such a case, the average velocity and the normal space of all the tokens in a layer do not correctly characterize the layer. Therefore, we consider only tokens around the boundary region where the two layers meet, as shown in Fig. 11a; those tokens are enough to examine how smoothly the two layers are connected. For the tokens in each square window, we compute the average velocity components and normal spaces for both layer 1 and layer 2 and then compute the similarity of the two layers based on them. The window size, 21 × 21, was obtained empirically, considering a size small enough to avoid the above problem but large enough to capture the local motion characteristics around the boundaries. This process is repeated for the other square windows, and the final similarity measures between the two layers are computed by averaging the errors over the square windows. The following equations are used for the error computation:

• Average velocity error: (1/N) Σ_n |V_n^(layer1) − V_n^(layer2)|.
• Normal space error: S_3 / S_2.

Fig. 11. Merging process and various cases. (a) Top view of two adjacent layers, where (v_x, v_y) is orthogonal to the page (only the tokens in each square are used in the similarity computation). (b) Both the velocity and normal spaces are similar: merge. (c) Similar velocity but different normal spaces: do not merge. (d) Different velocity but similar normal spaces: do not merge. (e) Both velocity and normal spaces are different: do not merge.

In the first equation, N is the total number of square windows and V_n is the average velocity component of the tokens in each layer at a given window n. In the second equation, the S_i are the singular values (S_1 is the largest) of the matrix E in which we stack the two normal vectors (e_1 and e_2) from each token in both layer 1 and layer 2. If the two normal spaces are compatible, then E will be rank two and S_3 will be small. Otherwise, the third singular value, S_3, will be large, meaning that the two normal spaces are not compatible. Note that the two similarity measures, average velocity and normal space, are independent; thus, both of them should be satisfied at the same time to merge layers. Figs. 11b, 11c, 11d, and 11e show the different cases. If two layers have similar velocities and normal spaces around their boundary, then they can be merged (Fig. 11b). However, if either of the two similarity measures fails, then the two layers cannot be merged (Figs. 11c, 11d, and 11e). Successively, all layers are tested in pairs for merging, and, based on those pairwise results, we build the final motion segmentation for the given reference image.

In this merging stage, we need to consider how to set the thresholds. When we decide whether two layers are merged or not, their similarity measures are compared with certain thresholds, which are obtained empirically. For most sequences, a constant threshold for the average velocity error can be used because we roughly normalize the maximum velocity of objects when we initially generate the 5D points. Also, the threshold for the normal space error does not significantly affect the merging results, so a constant, empirically obtained threshold value can be used as well. In some cases, however, we might need to try several thresholds to get a reasonable result. For instance, if the given sequence has many moving objects with different sizes and speeds and the quality of the initial matching is poor, then it is hard to distinguish which layers are moving objects and which layers are sets of wrong matches.
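The normal-space test in particular is compact in code. The sketch below (ours; the names are hypothetical) computes both error measures for one boundary window from the tokens' velocities and normal vectors:

```python
import numpy as np

def window_errors(vel1, vel2, normals1, normals2):
    """Similarity errors between two layers within one boundary window.

    vel1, vel2: (N1, 2) and (N2, 2) token velocities from each layer.
    normals1, normals2: (N1, 2, 5) and (N2, 2, 5) normal vectors (e1, e2)
    of each token. Returns (velocity_error, normal_space_error).
    """
    vel_err = np.linalg.norm(vel1.mean(axis=0) - vel2.mean(axis=0))
    # Stack all normal vectors into E; compatible normal spaces keep E
    # close to rank two, so S3/S2 stays small.
    E = np.vstack([normals1.reshape(-1, 5), normals2.reshape(-1, 5)])
    S = np.linalg.svd(E, compute_uv=False)   # singular values, descending
    return vel_err, S[2] / S[1]
```

Two layers are merged only if both errors fall below their (empirically set) thresholds.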
Fig. 12. Various motion segmentation results. (a)-(d) The Blue car sequence. (e)-(h) The Truck sequence. (i)-(l) The Flower garden sequence. (a), (e), and (i) Reference images. (b) and (f) Ground-truth segmentation without shadows. (c) and (g) Ground-truth segmentation with shadows. (j) Ground-truth segmentation with thin branches. (k) Ground-truth segmentation without thin branches. (d), (h), and (l) Our motion segmentation results.

4 EXPERIMENTAL RESULTS

In this section, we present some experimental results. Section 4.1 focuses on our framework itself by showing various segmentation results. In Section 4.2, we demonstrate the importance of the temporal smoothness constraint on some challenging sequences that have repetitive patterns or severe occlusions. In Section 4.3, we present various vision applications and show how our segmentation and dense trajectory results can easily solve those problems. Section 4.4 provides all of the input parameters that are used for the experiments. The experiments were performed under the following system environment: a Pentium GHz CPU, 2 Gbytes of RAM, an Nvidia GeForce 7800 GTX GPU, and both the CPU-based and GPU-based motion analysis frameworks [23].

4.1 Basic Motion Segmentation

Various real image sequences. We first present basic motion segmentation results for three different sequences: the Blue car, Truck, and Flower garden sequences, shown in the columns of Fig. 12. In the cases of Blue car and Truck, there are shadows that move along with the moving cars, so the shadows are merged into the car regions because their motion is compatible. Thus, we separately show the ground-truth segmentation of the cars without and with the shadows in the second row (Figs. 12b and 12f) and the third row (Figs. 12c and 12g), respectively. Our motion segmentation results for those two sequences are shown in the last row (Figs. 12d and 12h).

TABLE 3. Segmentation Error Rates for the Sequences in Fig. 12

In the case of Flower garden, we see two erroneous segmentation results, at the thin branches and at the trunk of the tree. In fact, both are caused by the same undersegmentation problem. Although most segmentation tools produce good oversegmentation outputs, they sometimes fail in extreme cases. For instance, the thin branches in Flower garden are incorrectly included in the large sky segment and hence are classified as outliers and smoothed out during the densification. If the branches were separately segmented in the oversegmentation step, they would certainly appear in the final segmentation result. Another example can be found in the boundary regions between the tree and the flowers. Due to the very complex textures around those boundaries, some segments span both the tree and the flowers, resulting in noisy boundaries. The overall segmentation error rates, given as the percentage of wrong pixel assignments, for the three sequences are provided in Table 3.

Teddy bear and Cones sequences. In order to show the accuracy of the per-pixel motion estimation and to quantitatively compare our proposed approach with others, we tested two sequences, Teddy bear and Cones, which come with ground-truth disparity maps [1]. Figs. 13a and 13c show the ground-truth disparity maps of Teddy bear and Cones, respectively, and Figs. 13b and 13d show our disparity map results. Counting all pixels in the images and setting the error threshold to one, the error rates were 11.5 percent for Teddy bear and 8.47 percent for Cones. A comparison of the error rates with other methods is provided in Table 4. Unlike other methods that are optimized to handle stereo pairs only, our approach is more generic in that it can handle all kinds of motion, not just horizontal disparity. Also, our method is purely local, and it is the only local method that compares competitively with global methods. For instance, Graph Cuts, one of the popular global methods, shows a 12.4 percent error rate (Graph Cuts with occlusion handling [19]) on Cones [1].

Fig. 13. Results of Teddy bear and Cones. (a) Ground-truth disparity map of Teddy bear. (b) Our disparity map of Teddy bear. (c) Ground-truth disparity map of Cones. (d) Our disparity map of Cones.

TABLE 4. Error Rate Comparison for Teddy bear and Cones [1]

One of the most important properties of the tensor voting framework is its robustness to noise. In order to demonstrate this, we tested the same Teddy bear after adding synthetic noise to the original set of 5D input points. The first noisy input, Teddy bear 50, was generated by randomly selecting 50 percent of the input points and adding random real numbers between −5 and 5 to the points. The second noisy input, Teddy bear 80, was generated by adding the same kind of random real numbers to 80 percent of the original input points. Fig. 14 shows those noisy inputs and their final spatiotemporally smoothed layers. The disparity errors were computed in the same way, and the resulting error rates are 12.1 percent for Teddy bear 50 and 16.8 percent for Teddy bear 80. These low error rates demonstrate that our motion analysis framework is very robust to noisy inputs.
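For reference, the error measure used above is simple to state in code: a pixel counts as wrong when its estimated disparity differs from the ground truth by more than the threshold of one (our sketch of the standard metric from [1]):

```python
import numpy as np

def disparity_error_rate(estimate, ground_truth, threshold=1.0):
    """Percentage of pixels whose disparity error exceeds the threshold."""
    bad = np.abs(estimate - ground_truth) > threshold
    return 100.0 * bad.mean()
```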
4.2 Challenging Sequences

Walking person. This sequence consists of five frames, and the third frame is selected as the reference frame. With this sequence, we want to show how our 5D approach with temporal information outperforms the 4D approach in [25]. Fig. 15a shows the five input frames, with frame 3 as the reference frame. The two boxed regions in the reference frame show regions where it is difficult to find correspondences due to their periodic patterns. This problem is clearly seen in Fig. 15b, which shows the v_x components of the initial matches between the reference frame and the first frame: some groups of wrong matches for pixels in the boxed regions accidentally form smooth regions. If we use only two frames (the third and the first) in the 4D space, those wrong matches cannot be removed due to their smoothness, as can be seen in Fig. 15c. The wrong matches, marked with black ellipses, have survived the outlier rejection process, and they would eventually generate false motion layers after densification. On the other hand, Fig. 15d shows the correct outlier rejection result between the third and first frames obtained by our proposed 5D approach, where the wrong layers are clearly removed in spite of their smoothness. This is possible because the wrong matches are not temporally consistent in the 5D space.

Fig. 14. Testing Teddy bear after adding synthetic noise. (a) Original input. (b) Teddy bear 50: noise added to 50 percent of the input points. (c) Teddy bear 80: noise added to 80 percent of the input points. (d) Output of (a), error rate: 11.5 percent. (e) Output of (b), error rate: 12.1 percent. (f) Output of (c), error rate: 16.8 percent.

Fig. 15. Results of the Walking person sequence. (a) Input sequence. (b) Initial matches. (c) Outlier rejection by two frames in 4D. (d) Outlier rejection by multiple frames in 5D. (e) Final segmentation. (f) Some temporal trajectories.

In other words, the matches between the reference frame and the frames other than the first frame do not show those accidental smooth layers, so they are classified as outliers in the 5D space. Fig. 15e presents the final motion segmentation by coloring each segment, and Fig. 15f shows the temporal trajectories of some pixels.

Two pedestrians. With this sequence, we present the advantage of the temporal smoothness constraint in the presence of serious occlusion. As can be seen in Fig. 16a, two people are walking in opposite directions, so the one in the back suffers from serious occlusions in frames 4-6. Therefore, the matches for him between the reference frame (frame 2) and those three frames include a significant number of wrong matches, and these are mostly removed during the outlier rejection process due to their random distribution in the 5D space. Fig. 16b shows the result of the outlier rejection for frame 5, where most of the matches of the person in the back have been removed. On the other hand, he is mostly visible in the other frames, and the matches for those frames create smooth layers in the 5D space. During the densification process, those smooth layers, which consist of good matches, extend into the regions where the wrong matches were removed. This is possible due to the temporal smoothness constraint, and the final densification result for frame 5 is presented in Fig. 16c. Figs. 16d and 16e show the temporal trajectories of the pixels in a single scan line for the initial noisy matches and the spatiotemporally smoothed matches, respectively. In Fig. 16d, the pixels of the person in the back lose their temporal tracks due to occlusions, but the tracks are recovered by the spatiotemporal smoothing process, as can be seen in Fig. 16e.

Fig. 16. Results of the Two pedestrians sequence. (a) Input sequence. (b) Outlier rejection for frame 5. (c) Densification for frame 5. (d) Initial temporal trajectories. (e) Spatiotemporally smoothed trajectories.

4.3 Extended Applications

In this section, we present some extended vision applications where our results, motion segmentation and dense trajectories, allow us to solve problems with greater ease than other approaches.

Tracking in a long sequence. A ubiquitous problem in computer vision is tracking moving objects in a sequence. Tracking is especially difficult when the camera moves and multiple moving objects occlude each other. All of these difficulties are handled properly under our framework, in which each independently moving object is segmented, occlusion is appropriately processed (as can be seen in the Two pedestrians example), and so on.
Thus, tracking is a straightforward problem within our framework, and we even provide pixel-level tracking information. Fig. 17a shows an example sequence in which four cars and the camera move. In Fig. 17b, we illustrate the tracking results for the sequence by showing images reprojected from the computed dense temporal trajectories of the reference image. Since the cars are already segmented, tracking them is straightforward, as the sketch below illustrates.
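A minimal sketch of this reprojection (ours; the trajectory array layout is assumed, as in the earlier sketches) warps the reference image to frame t using the dense trajectories, so any segmented object can be followed pixel by pixel:

```python
import numpy as np

def reproject(reference, traj_t, shape):
    """Reproject the reference image to another frame via dense trajectories.

    reference: (H, W) or (H, W, 3) reference image.
    traj_t: (H, W, 2) positions (x, y) that each reference pixel reaches
    at frame t (reference coordinates plus accumulated velocities).
    """
    out = np.zeros(shape, dtype=reference.dtype)
    H, W = reference.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    tx = np.rint(traj_t[..., 0]).astype(int)
    ty = np.rint(traj_t[..., 1]).astype(int)
    ok = (tx >= 0) & (tx < shape[1]) & (ty >= 0) & (ty < shape[0])
    out[ty[ok], tx[ok]] = reference[ys[ok], xs[ok]]
    return out
```

Restricting the assignment to the pixels of one segment tracks just that object.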

Fig. 17. Tracking example 1. (a) Input sequence: four cars and a moving camera. (b) Pixel-level tracking with segmented objects.

Another tracking example is shown in Fig. 18. The temporal trajectories of the pixels in the reference image are presented in Fig. 18b, and the reprojected images generated from the trajectories are shown in Fig. 18c. We process about 10 frames at a time, so, for a long sequence, we divide the sequence into multiple overlapping frame sets and then process them separately. After processing all of the sets, we simply combine the results to obtain long temporal trajectories.

Fig. 18. Tracking example 2. (a) Input sequence: one car and a moving camera. (b) Temporal trajectories. (c) Pixel-level tracking with a segmented object.

Mosaic generation. A mosaic is created by geometrically aligning a set of images and stitching them together. It is one of the oldest research topics in computer vision, and many researchers have proposed various algorithms. For instance, Shum and Szeliski [27] applied both global and local alignment techniques to reduce accumulated registration errors and small motion parallax, Uyttendaele et al. [31] proposed methods to deal with blurry regions caused by moving objects and exposure changes across images, and Davis [14] focused on the registration problem in the presence of moving objects based on the Mellin transform. Brown and Lowe [8] used the popular SIFT features for robust image registration and provided a complete mosaic framework called Autostitch. Most of these algorithms, however, fail to generate accurate mosaics when an image set has relatively large moving objects or suffers from strong parallax in large areas because those independently moving parts cause poor parameter estimation for image registration. On the other hand, our framework allows us to generate very accurate mosaics even in the presence of moving objects, since they are already identified and separated from the background before the parameter estimation step [24]. Also, the resulting temporal trajectories (that is, correspondences) can be directly used for parameter estimation.

Fig. 19 compares our method with the Autostitch framework on the sequence in Fig. 18 (the moving-object case), and the comparison for the Teddy bear sequence (the strong-parallax case) is shown in Fig. 20. As can be seen in both examples, Autostitch generates distorted mosaics due to the moving car or the strong parallax of the foreground objects. In the case of the Teddy bear sequence, we can also produce a mosaic of the foreground objects. Those separate mosaics of foreground and background are possible thanks to our accurate motion segmentation framework. Another advantage of our motion segmentation framework for mosaics is that we can generate background-only mosaics, whereas the mosaics generated by most other approaches include moving objects because they do not distinguish the foreground from the background. This difference can be seen in Fig. 19.

Fig. 19. Mosaic example: moving object. (a) Our result. (b) The result of Autostitch. The dark trail in our result is caused by the shadows of the car on the ground; it is not a registration problem.

Fig. 20. Mosaic example: strong parallax. (a) The foreground mosaic generated by our method (the Teddy bear is another independent object, so it is not included). (b) The background mosaic generated by our method. (c) The mosaic generated by Autostitch.

Three-dimensional reconstruction.
In the case of rigid 3D motion, we can reconstruct the dense 3D structure directly from our results. Although the basic 3D reconstruction framework is simple, it is hard to implement in practice.

Fig. 21. Three-dimensional reconstruction example. (a) One of the given images. (b) Generated 3D reconstruction.

For instance, finding good feature points with accurate correspondences is not easy, and multiple moving objects together with a moving camera make the problem much harder. In our case, however, such problems have already been solved in the main framework, so we can use any standard 3D reconstruction method without difficulty. Fig. 21a shows an example image in which both the tissue box and the camera move. By applying the 3D reconstruction algorithm described in [30] to each of our segmented dense correspondence groups separately and combining the resulting 3D structures afterward, we obtain the overall 3D structure of the scene in Fig. 21b. Note that the structure consists of dense 3D points, unlike many other approaches, which obtain only sparse 3D points from selected image features.

4.4 Parameters for Each Experiment

Table 5 lists all of the image sequences used and their input parameters. The provided data includes the sequence names, corresponding figure numbers, image sizes, the number of input images (NI), the correlation window sizes for the initial candidate matches (CWS), σ for tensor voting, the parameters for the oversegmentation, the thresholds in the merging step, and the overall processing time. As can be seen from the table, different values are used for each image sequence. However, the overall outputs are not very sensitive to the selection of these values. For instance, we tested the Flower garden sequence with different σ values for segmentation. The resulting segmentation error rates (refer to Fig. 12), based on the ground truth without branches, are 2.22 percent, 3.15 percent, and 3.38 percent for σ = 3, 7, 11, respectively.

TABLE 5. List of Parameters and Processing Time. 1: Number of input images. 2: Correlation window size. 3: Parameters for oversegmentation [11]: spatial/color/minimum region. 4: Thresholds in the merging step: velocity error/normal space error. 5: Processing time (sec) by the GPU-based implementation.

Table 6 shows the average processing time of each step for the sequences in Table 5. Note that the One car sequence requires a batch process, so only one process time of the batch is counted in the average computation. Also, the table compares the processing time of the conventional CPU-based code with that of the new GPU-based code described in [23]. Roughly speaking, the new GPU-based code is about 10 times faster than the CPU-based code. Both the CPU and GPU codes use the same oversegmentation and initial matching codes.

TABLE 6. Average Processing Time of Each Step

5 CONCLUSION AND FUTURE WORK

We have presented a novel spatiotemporal smoothing process for motion analysis that preserves discontinuities in a 5D space. The outputs of the system are accurate motion boundaries and dense temporal trajectories. Through many experimental results, we demonstrated the accuracy of the outputs and showed the importance of the temporal smoothing process on some challenging sequences. Besides these direct outputs of our framework, we

One important technical issue in our framework is the initial oversegmentation step. Although the mean shift-based segmentation tool usually produces very good oversegmentations, it sometimes fails, violating our assumption that a segment contains pixels from only a single object. Unfortunately, our current system does not have a module that fixes such undersegmentation (see, for example, the results on the Flower garden sequence). Another issue concerns the initial candidate match step. In principle, our main motion analysis framework can handle any kind of motion; however, because the current initial matching algorithm is limited (a simple cross correlation), some classes of motion may not be processed correctly. We will investigate these issues in future research.

ACKNOWLEDGMENTS
The research has been funded in part by the Integrated Media Systems Center, a US National Science Foundation (NSF) Engineering Research Center, Cooperative Agreement EEC, and NSF Grant IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the NSF.

REFERENCES
[1] Middlebury College Stereo Evaluation Webpage, middlebury.edu/stereo/.
[2] E. Adelson and Y. Weiss, "A Unified Mixture Framework for Motion Segmentation: Incorporating Spatial Coherence and Estimating the Number of Models," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[3] T. Amiaz and N. Kiryati, "Dense Discontinuous Optical Flow via Contour-Based Segmentation," Proc. IEEE Int'l Conf. Image Processing, vol. 3.
[4] S. Ayer and H. Sawhney, "Layered Representation of Motion Video Using Robust Maximum-Likelihood Estimation of Mixture Models and MDL Encoding," Proc. Fifth Int'l Conf. Computer Vision.
[5] H. Baker and R. Bolles, "Generalizing Epipolar-Plane Image Analysis on the Spatiotemporal Surface," Int'l J. Computer Vision, vol. 3, no. 1.
[6] M. Bleyer and M. Gelautz, "A Layered Stereo Algorithm Using Image Segmentation and Global Visibility Constraints," Proc. IEEE Int'l Conf. Image Processing, vol. 5.
[7] R. Bolles, H. Baker, and D. Marimont, "Epipolar-Plane Image Analysis: An Approach to Determining Structure from Motion," Int'l J. Computer Vision, vol. 1, no. 1, pp. 7-56.
[8] M. Brown and D. Lowe, "Recognising Panoramas," Proc. Ninth IEEE Int'l Conf. Computer Vision.
[9] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High Accuracy Optical Flow Estimation Based on a Theory for Warping," Proc. Eighth European Conf. Computer Vision, vol. 4.
[10] T. Brox, A. Bruhn, and J. Weickert, "Variational Motion Segmentation with Level Sets," Proc. Ninth European Conf. Computer Vision.
[11] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5.
[12] J. Costeira and T. Kanade, "A Multibody Factorization Method for Independently Moving Objects," Int'l J. Computer Vision, vol. 29, no. 3.
[13] D. Cremers and S. Soatto, "Motion Competition: A Variational Approach to Piecewise Parametric Motion Segmentation," Int'l J. Computer Vision, vol. 62, no. 3.
[14] J. Davis, "Mosaics of Scenes with Moving Objects," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[15] H. Hirschmuller, "Stereo Vision in Structured Environments by Consistent Semi-Global Matching," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[16] D. Husemoller, Fibre Bundles, third ed., Springer.
[17] K. Kanatani, "Motion Segmentation by Subspace Separation and Model Selection," Proc. Eighth Int'l Conf. Computer Vision, vol. 2.
[18] A. Klaus, M. Sormann, and K. Karner, "Segment-Based Stereo Matching Using Belief Propagation and a Self-Adapting Dissimilarity Measure," Proc. 18th Int'l Conf. Pattern Recognition.
[19] V. Kolmogorov and R. Zabih, "Computing Visual Correspondence with Occlusions via Graph Cuts," Proc. Eighth Int'l Conf. Computer Vision, vol. 2.
[20] G. Medioni, M. Lee, and C. Tang, A Computational Framework for Segmentation and Grouping, first ed., Elsevier.
[21] E. Mémin and P. Pérez, "Hierarchical Estimation and Segmentation of Dense Motion Fields," Int'l J. Computer Vision, vol. 46, no. 2.
[22] C. Min and G. Medioni, "Motion Segmentation by Spatiotemporal Smoothness Using 5D Tensor Voting," Proc. Fifth IEEE Workshop Perceptual Organization in Computer Vision.
[23] C. Min and G. Medioni, "Tensor Voting Accelerated by Graphics Processing Units (GPU)," Proc. 18th Int'l Conf. Pattern Recognition.
[24] C. Min, Q. Yu, and G. Medioni, "Multi-Layer Mosaics in the Presence of Motion and Depth Effects," Proc. 18th Int'l Conf. Pattern Recognition.
[25] M. Nicolescu and G. Medioni, "Layered 4D Representation and Voting for Grouping from Motion," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 4, Apr. 2003.
[26] C. Schnörr, "Determining Optical Flow for Irregular Domains by Minimizing Quadratic Functionals of a Certain Class," Int'l J. Computer Vision, vol. 6, no. 1.
[27] H. Shum and R. Szeliski, "Systems and Experiment Paper: Construction of Panoramic Image Mosaics with Global and Local Alignment," Int'l J. Computer Vision, vol. 36, no. 2.
[28] J. Sun, Y. Li, S. Kang, and H. Shum, "Symmetric Stereo Matching for Occlusion Handling," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[29] C. Tomasi and T. Kanade, "Shape and Motion from Image Streams under Orthography: A Factorization Method," Int'l J. Computer Vision, vol. 9, no. 2.
[30] P. Torr and D. Murray, "The Development and Comparison of Robust Methods for Estimating the Fundamental Matrix," Int'l J. Computer Vision, vol. 24, no. 3.
[31] M. Uyttendaele, A. Eden, and R. Szeliski, "Eliminating Ghosting and Exposure Artifacts in Image Mosaics," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[32] R. Vidal and R. Hartley, "Motion Segmentation with Missing Data Using Power Factorization and GPCA," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[33] J. Wang and E. Adelson, "Layered Representation for Motion Analysis," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[34] J. Xiao and M. Shah, "Motion Layer Extraction in the Presence of Occlusion Using Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, Oct. 2005.
[35] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister, "Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation and Occlusion Handling," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[36] L. Zelnik-Manor and M. Irani, "Degeneracies, Dependencies and Their Implications in Multi-Body and Multi-Sequence Factorizations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2003.

Changki Min received the BS degree in mathematics from Hanyang University, Korea, in 1997 and the MS and PhD degrees in electrical engineering from the University of Southern California in 2000 and 2006, respectively. Currently, he is working at Apple Inc. as a senior graphics software engineer. His research interests include motion analysis, segmentation, tensor voting, computer graphics, and GPGPU.

Gérard Medioni received the Diplôme d'Ingénieur from the Ecole Nationale Supérieure des Télécommunications (ENST), Paris, in 1977 and the MS and PhD degrees from the University of Southern California (USC) in 1980 and 1983, respectively. He has been with USC since then and is currently a professor of computer science and electrical engineering, a codirector of the Institute for Robotics and Intelligent Systems (IRIS), and a codirector of the USC Games Institute. He served as the chairman of the Computer Science Department beginning in 2001. He has made significant contributions to the field of computer vision. His research covers a broad spectrum of the field, such as edge detection, stereo and motion analysis, shape inference and description, and system integration. He has published 3 books, more than 50 journal papers, and 150 conference articles and is the holder of eight international patents. He is an associate editor of the Image and Vision Computing Journal, the Pattern Recognition and Image Analysis Journal, and the International Journal of Image and Video Processing. He served as a program cochair of the 1991 IEEE Computer Vision and Pattern Recognition (CVPR) Conference and the 1995 IEEE International Symposium on Computer Vision, a general cochair of the 1997 IEEE CVPR Conference, a conference cochair of the 1998 International Conference on Pattern Recognition, a general cochair of the 2001 IEEE CVPR Conference, a general cochair of the 2007 IEEE CVPR Conference, and a general cochair of the upcoming 2009 IEEE CVPR Conference. He is a fellow of the IEEE, IAPR, and AAAI and a member of the IEEE Computer Society.
