Inferring Segmented Dense Motion Layers Using 5D Tensor Voting

Changki Min and Gérard Medioni, Fellow, IEEE

Abstract: We present a novel local spatiotemporal approach to produce motion segmentation and dense temporal trajectories from an image sequence. A common representation of image sequences is a 3D spatiotemporal volume (x, y, t), and its corresponding mathematical formalism is the fiber bundle. However, directly enforcing the spatiotemporal smoothness constraint is difficult in the fiber bundle representation. Thus, we convert the representation into a new 5D space (x, y, t, v_x, v_y) with an additional velocity domain, where each moving object produces a separate 3D smooth layer. The smoothness constraint is now enforced by extracting 3D layers using the tensor voting framework in a single step that solves both correspondence and segmentation simultaneously. Motion segmentation is achieved by identifying those layers, and the dense temporal trajectories are obtained by converting the layers back into the fiber bundle representation. We proceed to address three applications (tracking, mosaicking, and 3D reconstruction) that are hard to solve from the video stream directly because of the segmentation and dense matching steps but become straightforward with our framework. The approach does not make restrictive assumptions about the observed scene or camera motion and is therefore generally applicable. We present results on a number of data sets.

Index Terms: Motion analysis, tensor voting, optical flow, segmentation, mosaicking.

• C. Min is with Apple Inc., 6 Infinite Loop, MS 306-2CW, Cupertino, CA. E-mail: changkimin@gmail.com.
• G. Medioni is with the Institute for Robotics and Intelligent Systems, University of Southern California, PHE 212, MC-0273, Los Angeles, CA. E-mail: medioni@usc.edu.

Manuscript received 13 Feb. 2006; revised 10 Aug. 2006; accepted 12 Sept. 2007; published online 10 Oct. 2007. Recommended for acceptance by J. Weickert.

1 INTRODUCTION

MANY complex video-related vision tasks require the segmentation of moving objects from the background. For instance, tracking systems locate moving objects first and then find their temporal trajectories; surveillance systems need to detect movement in their field of view. Also, videos can be efficiently compressed by providing region-level segmentation and temporal estimation of their motion. In addition to region segmentation, we notice that these tasks require the temporal motion information of the segmented regions. Thus, the two important concepts in motion analysis are motion segmentation and the dense temporal trajectories of pixels, and they are studied in this paper within a new framework.

The main goal of motion segmentation is to segment an image into regions based on the similar motions of nearby individual pixels. Here, the most important constraint, which also defines the meaningful regions, is the coherence of the motion of pixels in a region. The temporal trajectories are created by identifying pixel correspondences across frames, and we create dense trajectories by considering all pixels in the reference image. In our proposed approach, these two tasks, motion segmentation and temporal trajectory inference (or correspondence), are solved at the same time, and the individual moving objects are simply identified by grouping pixels that have similar temporal trajectories. The resulting segmentation and dense trajectory information serve as fundamental building blocks in many other vision problems.

A set of pixels in a region with similar motion characteristics constitutes a smooth layer. The layered representation for motion analysis was introduced by Wang and Adelson [33] and adopted by other researchers [4], [2], [34]. Image sequences are decomposed into multiple layers, and maps are associated with the layers to define their motion, opacity, and so on. Most layer-based approaches use an affine motion model, which may fail when the scene contains strong parallax effects or when moving objects cover a large part of the image.

Many algorithms for two-frame motion segmentation have been proposed. For instance, Kolmogorov and Zabih [19] proposed a graph-cut approach to minimize an energy function for the correspondence problem. The energy function is constructed in such a way that a solution that violates uniqueness (a pixel in one image should correspond to either one pixel or no pixel in the other image) produces infinite energy. The optimization of this energy function is NP-hard, so the authors use the α-expansion algorithm to compute a strong local minimum. Ayer and Sawhney [4] estimated motion parameters based on an ML estimation of mixture models and the MDL encoding principle. However, generalizing the above methods to multiple frames is not trivial.

There are many approaches that utilize multiple frames for motion analysis. The factorization method introduced by Tomasi and Kanade [29] is one of the most popular techniques and has been significantly improved by other researchers [12], [17], [32], [36]. For instance, Costeira and Kanade [12] introduced the shape interaction matrix for multibody factorization, and solutions for some of its limitations have been presented in [32], [36]. Most factorization approaches, however, are sensitive to noise and assume linear models such as affine.

Fig. 1. Spatiotemporal volume representation. (Left) An image sequence. (Right) Its spatiotemporal volume, which is created by stacking all of the images along the time axis.

The approaches presented in [3], [10], [13], [21], [26] solve both the optical flow and segmentation problems simultaneously. Amiaz and Kiryati [3] embedded the accurate optical flow estimation framework introduced by Brox et al. [9] within a contour-based segmentation framework for dense optical flow computation and crisp flow boundaries. Cremers and Soatto [13] and Brox et al. [10] incorporated level set and variational methods to minimize energy functions with the spatiotemporal smoothness constraint. Although these approaches produce fairly good optical flow and segmentation results, they are limited to processing only two frames or two competing regions, except [10], [13].

Nicolescu and Medioni [25] first proposed using the tensor voting framework with a layered representation for motion segmentation. The initial noisy motion layers are smoothed by using tensor voting, and then precise motion boundaries are computed by applying 2D intensity cues around candidate motion boundary regions. Although the resulting segmentation is usually accurate, it may fail to find accurate motion boundaries if, for example, the boundary regions have complex textures or there is a stronger intensity cue around the true motion boundaries. This is because the use of a 2D intensity cue is not appropriate except in cases where the motion boundaries have high contrast with the background. Also, this approach is inherently limited to a pair of images, meaning that rich temporal motion information is not used; thus, it generates only the segmentation, without any temporal trajectories.

A convenient representation of a video sequence is a spatiotemporal 3D (x, y, t) volume, as illustrated in Fig. 1. The (x, y, t) spatiotemporal volume on the right side of Fig. 1 is constructed by stacking all of the images on the left side of Fig. 1 in temporal order. Bolles et al. [7] exploited the spatiotemporal volume to analyze translational motion in image sequences. An epipolar-plane image, or EPI for short, is formed by slicing the spatiotemporal volume, taking into account the epipolar geometry (for example, the top surface on the right side of Fig. 1), and the motion information of the image sequence is obtained by analyzing the line patterns in the EPI. An extended analysis for more general motions is presented in [5]. Although it is easy to visualize the motion in this space, the constraints are difficult to implement.

The appropriate mathematical formalism to describe these structures is that of fiber bundles [16]. A fiber bundle is a map p : E → B, where B is the base space, E is the total space, and the map p is called the projection of the bundle. For each point b ∈ B in the base space, the space p⁻¹(b) is called a fiber of the bundle over b ∈ B. This is shown in Fig. 2a. For the motion problem, we assign the reference image as the base space. Then, the temporal trajectory of each pixel in the reference image forms a fiber, as can be seen in Fig. 2b.

Fig. 2. Fiber bundle representation. (a) Fiber bundle concept. (b) Fiber bundle representation for motion analysis.
Although the fiber bundle formalism gives a good representational framework, it does not provide any tools to solve the stated motion problem. Therefore, we convert the fiber bundle representation to a 5D space (x, y, t, v_x, v_y), in which (x, y) represents the pixel coordinates in the reference image, t represents time (frame number), and (v_x, v_y) represents the velocity of a pixel in the reference image at each time instance t. In this 5D space, each moving object is represented as a smooth layer by v_x(x, y, t) and v_y(x, y, t). Because the initial layers are not smooth due to noisy initial correspondences, we impose a spatiotemporal smoothness constraint upon the layers. Segmentation is performed on the resulting smoothed layers, which are then transformed back into the fiber bundle representation to obtain temporal trajectories. In fact, the velocity (v_x, v_y) implicitly exists in the 3D (x, y, t) space as well. However, at the cost of increasing the dimension of the space by explicitly adding the velocity, we can easily deal with the spatiotemporal data (more technically, this kind of approach is called a phase space representation in control theory).

In our proposed approach, we enforce a single constraint: spatiotemporal smoothness [22]. In other words, we assume that each pixel in the reference frame moves smoothly in time (that is, a temporal smoothness constraint) and that neighboring pixels belonging to a single object move in a similar way (that is, a spatial smoothness constraint). These two constraints are enforced simultaneously in the 5D space. This is why our approach does not require any restrictive camera or motion model such as affine. Even nonrigid motion can be properly handled because it still satisfies the spatiotemporal smoothness constraint. It is important to note that the spatiotemporal smoothness of motion is equivalent to the smoothness of the layers in the 5D space. Therefore, our main tasks are smoothing the layers and identifying the individual layers that correspond to independently moving regions.

To enforce the spatiotemporal smoothness constraint in the 5D space, we use the tensor voting framework [20]. It is a generic perceptual organization framework, and one of its main functions is to extract geometric features such as curves and surfaces from a given set of points.
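As an illustration of this conversion (a sketch of ours, not the authors' code; the (T, H, W, 2) array layout and the helper name are assumptions), the following builds the 5D tokens from per-frame velocity hypotheses attached to the reference image:

```python
import numpy as np

def build_5d_tokens(velocities):
    """Convert per-pixel velocity hypotheses into 5D tokens (x, y, t, vx, vy).

    velocities: float array of shape (T, H, W, 2), where velocities[t, y, x]
    is the estimated velocity of reference pixel (x, y) at time t; entries
    may be NaN where no correspondence hypothesis exists.
    """
    T, H, W, _ = velocities.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W),
                          indexing="ij")
    tokens = np.stack([x.ravel(), y.ravel(), t.ravel(),
                       velocities[..., 0].ravel(),
                       velocities[..., 1].ravel()], axis=1)
    return tokens[~np.isnan(tokens).any(axis=1)]  # drop missing hypotheses
```

Each moving object should then trace out a locally smooth 3D layer inside this (x, y, t, v_x, v_y) point set, which is exactly what the voting stage extracts.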

In summary, the outstanding advantage of our voting-based motion analysis approach is that it eliminates most limitations that other approaches inherently have and produces better results. For instance, we do not make any assumption regarding the camera model, motion model, or scene structure, except to expect that motion should be locally smooth in space and time. Also, our framework is designed to detect and segment multiple moving objects simultaneously from multiple frames. Using the tensor voting framework also ensures that our approach is very robust to noise. These advantages over other approaches allow us to apply our framework to a broader range of image sequences and produce accurate results with the simultaneous use of spatial and temporal motion information.

This paper is organized as follows: In Section 2, we present the structure of the 5D space and the 5D tensor voting framework. The details of our proposed algorithm are discussed in Section 3. Section 4 shows various experimental results of our framework, including not only simple motion segmentation examples but also interesting vision problems that can be easily solved by our framework. The conclusion and future work are presented in Section 5.

2 DATA REPRESENTATION AND 5D TENSOR VOTING

Visualizing the proposed 5D space and its structure is crucial to understanding our approach. In practice, however, it is not possible to visualize the 5D space. Thus, we first present an equivalent 3D space for 1D image sequences and then extend it to the 5D space. We also explain 5D tensor voting in this section.

2.1 Analogy for One-Dimensional Image

By assuming that all of the pixels in a 2D image have only horizontal motion, a 1D image can be generated by extracting a single scan line of the 2D image. This process is illustrated in Fig. 3a. The original 2D image sequence is a synthetic one in which all of the pixels move only horizontally (the car moves to the left and the background moves to the right). Therefore, we can produce a 1D image sequence by extracting the same scan line from all images (the horizontal lines in Fig. 3a). Their spatiotemporal space (x, t) is shown in Fig. 3b, created by stacking all of the 1D images together in temporal order. The two parts of the background are labeled object 1 and object 3, and the car is labeled object 2. Although we can visually observe the clear motion of the car and the background in the spatiotemporal space, there is no actual motion information yet because we have not established any correspondences. Using cross correlation, we first find correspondences between the reference 1D image (the horizontal line in Fig. 3b) and all other images. The resulting initial temporal trajectories (that is, fibers) are presented in Fig. 3c with object numbers from 1 to 3. As can be seen there, the trajectories are quite noisy due to occlusions and camera motion. Since the spatiotemporal smoothing and the segmentation are difficult in this spatiotemporal space, we convert it into the proposed 3D space (a 5D space in the case of 2D image sequences), which is illustrated in Fig. 3d.

Fig. 3. One-dimensional image example. (a) Original 2D synthetic images. (b) Vertical stack of all 1D images. (c) Initial noisy temporal trajectories. (d) Initial noisy layers. (e) Smoothed layers by tensor voting. (f) Spatiotemporally smoothed trajectories.
The 3D space has three components: the image domain x, the time domain t, and the velocity domain v. Thus, a point in the 3D space indicates the velocity v of a pixel x in the reference frame at a given moment t. The initial layers in the 3D space include a large number of outliers since they were generated from noisy trajectories. Through the tensor voting stage, these outliers are easily detected and removed, and their correct positions in the space are estimated via a densification stage, resulting in the smooth layers with 2D support in Fig. 3e. The smoothness of each layer along the time axis represents the temporal smoothness constraint of motion, while the smoothness of each layer along the image axis represents the spatial smoothness constraint of motion. This illustrates how we enforce both the spatial and temporal smoothness constraints simultaneously through tensor voting. The numbers on the layers in Figs. 3d and 3e correspond to the object numbers in Figs. 3b and 3c. Hence, it should be clear that finding independently moving regions is equivalent to identifying each layer in Fig. 3e. The final temporal trajectories computed from the smooth layers in Fig. 3e are shown in Fig. 3f. Through the spatiotemporal smoothing process, we can find accurate trajectories for all pixels in the reference frame. This also means that the full continuous motion of objects that are partially or fully occluded in some frames can be recovered, as shown in Fig. 3f. The layers in Fig. 3e are perfectly flat in this example because every pixel has a constant-velocity translational motion. In general, however, the layers are not necessarily flat. Indeed, they can be any smooth shapes, depending on how the objects and the camera move.

2.2 Description of the 5D Space

The previous 1D image example is a special case where all of the pixels have only horizontal motion. In general sequences, pixels move arbitrarily in the image space; thus, the layered representation of 2D image sequences should be described in the 5D space instead of the 3D space. Basically, both the 5D and 3D spaces have the same structure except for their dimensions. The comparison is shown in Table 1.

TABLE 1. Comparison between the 5D Space and the 3D Space

More specifically, we define a velocity vector function V over the 3D spatiotemporal domain D = I × T, where I ⊂ ℝ² represents the pixels in the reference frame and T ⊂ ℝ represents time; that is,

V : I × T → ℝ², (x, y, t) ↦ (v_x, v_y).

In practice, we have to deal with discrete image coordinates and time. The velocities can be real numbers due to subpixel-accuracy computation. In our approach, all of the initial mappings from (x, y, t) to (v_x, v_y) are obtained from measurements (that is, correspondence hypotheses), and smooth motion (in space-time) produces 3D layers in the 5D space. Note that our framework does not try to compute an analytic form of the function V. Rather, each final (v_x, v_y) at (x, y, t) is obtained by voting, which is a spatiotemporal process in the 5D space.

2.3 Tensor Voting in 5D Space

In this section, we briefly explain the concept of tensor voting in a 5D space. More details about the general tensor voting framework can be found in [20]. The tensor voting framework has two elements: tensor calculus for data representation and tensor voting for data communication. Each input point is initially encoded as a tensor, which is a symmetric nonnegative definite matrix. In the 5D space, the eigenvalues of the matrix are λ_1, λ_2, λ_3, λ_4, and λ_5 in descending order, and the corresponding eigenvectors are e_1, e_2, e_3, e_4, and e_5, respectively. Intuitively, the shape of the tensor defines the type of geometric feature (for example, curve or surface), and the size defines its saliency, or confidence measure. If the orientation of a point is unknown, then the point is initialized as the identity matrix, which is the case in our approach. Otherwise, the given orientation, such as a surface normal vector, is properly encoded in its tensor.

After the encoding step, each token (a 5D point with its associated tensor) casts votes to its neighboring tokens based on predefined voting kernels. Each voting kernel is a tensor field, and it encapsulates all voting-related information, such as the size and shape of the voting neighborhood and the vote strength and orientation at each point. The basic idea of a voting kernel can be explained with the fundamental 2D stick field, which is illustrated in Fig. 4. Let us assume that the voter O (the token that casts votes) is given with its normal N and that we want to compute its vote to the receiver P. To generate the vote, we must consider two things: the orientation and the strength of the vote. As can be seen in Fig. 4a, the orientation (the gray arrow starting from P) is obtained by drawing a circle whose center lies on the line of N (in this case, at C) and that passes through both O and P while preserving the normal N. This construction ensures the smoothest connection between the two points O and P with their associated normals.

Fig. 4. Generation of the fundamental 2D stick field. (a) Orientation and strength. (b) Resulting fundamental 2D stick field. See text for more details.
The strength of the vote is computed by the following decay function:

DF(s, κ, σ) = e^(−(s² + cκ²) / σ²).

Here, s is the arc length, κ is the curvature, c is the degree of decay, and σ is the scale of voting, which essentially defines the neighborhood size. The resulting fundamental 2D stick field is shown in Fig. 4b, where a darker region represents stronger votes. If the orientation of the voter is unknown, then it is encoded as a ball tensor (that is, an identity matrix). In this case, the vote is generated by integrating the contributions of the rotating fundamental field. Voting fields in higher dimensional spaces can be generated in a similar way.

During the voting process, each input token collects votes from its neighbors by tensor addition, and the resulting tensor T at a point is decomposed as

T = λ_1 e_1 e_1^T + λ_2 e_2 e_2^T + λ_3 e_3 e_3^T + λ_4 e_4 e_4^T + λ_5 e_5 e_5^T
  = (λ_1 − λ_2) e_1 e_1^T
  + (λ_2 − λ_3)(e_1 e_1^T + e_2 e_2^T)
  + (λ_3 − λ_4)(e_1 e_1^T + e_2 e_2^T + e_3 e_3^T)
  + (λ_4 − λ_5)(e_1 e_1^T + e_2 e_2^T + e_3 e_3^T + e_4 e_4^T)
  + λ_5 (e_1 e_1^T + e_2 e_2^T + e_3 e_3^T + e_4 e_4^T + e_5 e_5^T).

Table 2 shows the relation between the dimensionality of geometric features and their tensor interpretation in the 5D space. The saliency of each feature is defined as shown in the table, and it represents the confidence of each feature. The dimensionality of each geometric feature is given by the number of parameters that are required to describe the feature in a parametric representation.

TABLE 2. Tensor Interpretation in the 5D Space
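To make these two ingredients concrete, here is a minimal sketch (ours, not the authors' implementation; the function names are our own) of the decay function above and of the eigenvalue gaps used to read an accumulated 5D tensor:

```python
import numpy as np

def decay(s, kappa, c, sigma):
    """Vote strength DF(s, kappa, sigma) = exp(-(s^2 + c*kappa^2) / sigma^2)."""
    return np.exp(-(s**2 + c * kappa**2) / sigma**2)

def saliencies(T):
    """Decompose an accumulated 5x5 tensor into its saliency values.

    Returns the gaps (l1-l2, l2-l3, l3-l4, l4-l5, l5); for the motion
    layers sought here (3D varieties in 5D), the relevant gap is l2-l3.
    """
    # eigvalsh returns eigenvalues in ascending order; flip to descending.
    lam = np.linalg.eigvalsh(T)[::-1]
    return np.append(lam[:-1] - lam[1:], lam[-1])
```

An isolated wrong match accumulates a nearly isotropic tensor, so all of its gaps stay small and the token can later be rejected as an outlier.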

3 OUR APPROACH

In this section, we provide the details of our proposed approach; Fig. 5 shows the overall algorithm. Our motion analysis framework is based on the 4D framework proposed by Nicolescu and Medioni [25], and our overall processing steps are similar to theirs. For instance, both approaches have the same candidate match step, initial tensor voting step, outlier rejection step, and final densification step. However, our 5D framework is quite different from the 4D framework in two important aspects. First, we combine the rich temporal motion information with the spatial motion information by processing multiple frames together, while the 4D framework uses only two-frame motion information. Second, we use monocular cues in a different way (that is, oversegmentation and merging) to estimate more accurate motion boundaries. In other words, we process different kinds of data sets using similar processing steps. Also note that the use of temporal motion information provides not only a better motion analysis but also straightforward solutions to other difficult vision problems, as shown in Section 4.

3.1 Initial Color Oversegmentation

In order to obtain accurate motion boundaries, we start our framework with a color-based segmentation of the reference image. We have a very strong preference for erring on the side of oversegmentation, from which we can recover. After smoothing these segments individually, the motion boundary of each moving object is created by merging the segments based on their motion similarity. Thus, the quality of the final motion segmentation is affected by the quality of the oversegmentation. However, when it comes to oversegmentation, most known segmentation approaches generate reasonably good results. For our experiments, we use the mean shift-based image segmentation described in [11] with color reference images, and it usually generates 100 to 200 segments.

The color-based oversegmentation approach ensures that each segment contains pixels of only a single object. Therefore, we can assume that the motion layer associated with each segment forms a smooth layer because of the spatiotemporal smoothness constraint. In [25], the spatial smoothness constraint is enforced on the layers in a 4D space without prior segmentation. For the initial tensor voting and outlier rejection steps, this is probably reasonable because the motion layers of different objects are physically separated in the 4D space, minimizing votes between different motion layers. In the densification step, however, it is not easy to estimate missing tokens around motion boundaries because different motion layers can cast votes to all candidates. To solve this problem, the authors utilize 2D image gradient information around motion boundaries, which, unfortunately, often fails when the image has textures around the motion boundaries. In our approach, the voting process is restricted within each segment. Thus, we can avoid unwanted votes between different objects in both the initial tensor voting and the densification processes.

Fig. 5. Overall proposed algorithm for motion segmentation and analysis.

3.2 Finding Candidate Matches

In order to find candidate matches between the reference image and the other images, we use a simple cross correlation. For a given pair of images, we first define the size of the correlation window and the search range. The window sizes we use in our experiments roughly range from 5 to 17, and they work fine in most cases. A large correlation window tends to oversmooth the motion layers, while a small one tends to capture detailed local motion with many outliers.
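A minimal version of this matching step might look as follows (our own sketch using normalized cross correlation over a square window; the window size and search range are the tunable parameters discussed above):

```python
import numpy as np

def best_match(ref, img, x, y, win=7, search=10):
    """Find the best match for reference pixel (x, y) in img.

    Compares a (win x win) patch around (x, y) in ref against patches at
    offsets within +/-search in img, using normalized cross correlation.
    Assumes (x, y) lies at least win//2 pixels inside ref.
    Returns the velocity (vx, vy) of the highest-scoring offset.
    """
    r = win // 2
    patch = ref[y - r:y + r + 1, x - r:x + r + 1].astype(float)
    patch = (patch - patch.mean()) / (patch.std() + 1e-9)
    best, best_v = -np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = img[y + dy - r:y + dy + r + 1,
                       x + dx - r:x + dx + r + 1].astype(float)
            if cand.shape != patch.shape:   # skip offsets outside the image
                continue
            cand = (cand - cand.mean()) / (cand.std() + 1e-9)
            score = (patch * cand).mean()
            if score > best:
                best, best_v = score, (dx, dy)
    return best_v
```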
A rough search range is manually given at the beginning by observing the sequence. Fig. 6 shows how to find matches using the cross-correlation window. For each pixel in Fig. 6a, a correlation window is defined around the pixel, and its best match (that is, the one with the highest correlation score) is found in Fig. 6b within the search range.

Fig. 6. Finding matches using cross correlation. A correlation window of a certain size is associated with each pixel in the left-side image, and its best match (that is, highest correlation score) is then searched for in the right-side image to estimate the velocity of the pixel. The search is made within the predefined search range, as shown in the right-side image.

In our framework, we find a candidate temporal trajectory of a pixel instead of a single match between two images. Assume that four images (I_0, I_1, I_2, I_3) are given and the first one, I_0, is the reference image, as presented in Fig. 7. From the correlation computation, we obtain the velocities (V_01(p), V_02(p), V_03(p)), where V_ij(p) is the velocity of a pixel p in the reference image between I_i and I_j. Since the appropriate velocity form in our approach is the velocity of the pixel p at each time t (that is, V_01(p), V_12(p), and V_23(p)), we need to convert the obtained velocities into these forms using the simple equation V_{t,t+1}(p) = V_{0,t+1}(p) − V_{0,t}(p). A sequential search that finds V_01(p), V_12(p), and V_23(p) directly from the given images is not appropriate because any wrong match at time t would result in wrong matches for all of the frames after t.
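In code, this conversion from reference-anchored velocities to per-frame velocities is a single difference (our sketch, under the array conventions used earlier; V0[t] here stands for V_{0,t+1}(p) stacked over all pixels):

```python
import numpy as np

def to_stepwise_velocities(V0):
    """Convert reference-anchored velocities to per-time-step velocities.

    V0: array of shape (T, H, W, 2), where V0[t] holds V_{0,t+1}(p), the
    displacement of each reference pixel p between frames I_0 and I_{t+1}.
    Returns V with V[t] = V_{t,t+1}(p) = V_{0,t+1}(p) - V_{0,t}(p),
    taking V_{0,0}(p) = 0.
    """
    V = np.empty_like(V0)
    V[0] = V0[0]                    # V_{0,1} = V_{0,1} - V_{0,0}
    V[1:] = V0[1:] - V0[:-1]        # V_{t,t+1} = V_{0,t+1} - V_{0,t}
    return V
```

Because every match is anchored to the reference frame, an early wrong match does not propagate, which is exactly why the sequential search is avoided.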

Fig. 7. Finding candidate matches. V_01(p), V_02(p), and V_03(p) are found first from the images, and then V_01(p), V_12(p), and V_23(p) are computed from them.

When we extend the 4D framework [25] to the 5D framework, there is one limitation in this candidate match step. In the 4D framework, where only two frames are processed, each pixel can have multiple candidates at the same time, increasing the chances of having correct matches. In our 5D framework, however, each pixel in the reference frame is allowed to have only a single candidate match for each of the other frames. This limitation arises from the fact that we are considering candidate temporal trajectories in multiple frames. If multiple candidate matches were allowed between the reference frame and each of the other frames, then the number of possible combinations for temporal trajectories would explode.

It is very important to note that the current simple cross-correlation approach to finding candidate matches can be replaced by any other approach that also finds matches. Indeed, our proposed motion analysis framework focuses on the spatiotemporal smoothing of a given set of candidate matches that include many outliers; thus, the candidate match step is not a main part of the framework. For instance, it is obvious that the cross-correlation approach may not correctly detect rapid rotation and zooming because it implicitly assumes local translation. In this case, we might be able to generate better candidate matches by using more sophisticated approaches such as [9], [10]. The motion analysis framework itself, however, is unaffected by this issue and can process all kinds of smooth motion, including, for example, rapid rotation and zooming.

3.3 Initial Tensor Voting

The generation of 5D tokens from the correspondences proceeds as follows: Assume that a pixel P whose coordinates are (x_r, y_r) in the reference frame has its matches in the frames at times t and t + 1 and that their coordinates are (x_t, y_t) and (x_{t+1}, y_{t+1}), respectively. In the 3D temporal trajectory representation (that is, the fiber bundle), those two matches form a short trajectory segment between (x_t, y_t, t) and (x_{t+1}, y_{t+1}, t + 1). Our main proposed idea is to convert the temporal trajectory representation into the 5D point representation, and the corresponding 5D point of pixel P in frame t can be represented as (x_r, y_r, t, x_{t+1} − x_t, y_{t+1} − y_t).

Fig. 8 illustrates the conversion process and how differently the spatiotemporal smoothness constraint is expressed in the two spaces. In the figure, we are observing the motion of four pixels (A, B, C, D) in three consecutive frames (t − 1, t, t + 1), and their temporal trajectories are shown in Fig. 8a. Assuming that x, y, and t are continuous, a trajectory curve can be expressed by the parametric equations

x = x(t), y = y(t), t = t.

Thus, we can compute a tangent vector T at any point of the trajectories:

T = (dx/dt, dy/dt, 1) = (v_x, v_y, 1).

The tangent vectors of the 12 sample points in the 3D region between t − 1 and t + 1 in Fig. 8a must be similar because we assume spatiotemporally smooth motion.

Fig. 8. Converting the 3D temporal trajectory representation into the 5D spatiotemporal space. We are observing four pixels (A, B, C, D) in three consecutive frames, t − 1, t, and t + 1.
In the 5D space (Fig. 8b), the v_x and v_y components of the tangent vector at each (x, y, t) are explicitly represented along the additional (v_x, v_y) axes. The set of all such 5D points constitutes layers that are locally smooth because adjacent 5D points are generated from spatiotemporally adjacent temporal trajectories (for example, the 12 points in Fig. 8a). The layer in the 5D space can also be represented by the parametric equations

x = x, y = y, t = t, v_x = v_x(x, y, t), v_y = v_y(x, y, t).

Since these equations are controlled by the three parameters (x, y, t), the layer has three tangent vectors (T_1, T_2, T_3),

T_1 = (1, 0, 0, ∂v_x/∂x, ∂v_y/∂x),
T_2 = (0, 1, 0, ∂v_x/∂y, ∂v_y/∂y),
T_3 = (0, 0, 1, ∂v_x/∂t, ∂v_y/∂t),

and two normal vectors (N_1, N_2), which span the orthogonal complement of the tangent space. Therefore, the layer we are trying to extract has a dimensionality of three (that is, it is a 3D variety), and the five eigenvectors associated with the tensor at each point are interpreted as two normal vectors (e_1, e_2) and three tangent vectors (e_3, e_4, e_5). Accordingly, the saliency we are interested in is computed by λ_2 − λ_3, as can be found in Table 2.
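To make this geometry concrete, the following sketch (ours, not from the paper) estimates the three tangent vectors of a layer at every grid point by finite differences on a dense velocity field:

```python
import numpy as np

def layer_tangents(vel):
    """Tangent vectors of the layer (x, y, t, vx(x,y,t), vy(x,y,t)).

    vel: dense velocity field of shape (T, H, W, 2).
    Returns T1, T2, T3, each of shape (T, H, W, 5), where, e.g.,
    T1 = (1, 0, 0, dvx/dx, dvy/dx) at every grid point.
    """
    T, H, W, _ = vel.shape
    dvt, dvy, dvx = np.gradient(vel, axis=(0, 1, 2))
    ones, zeros = np.ones((T, H, W)), np.zeros((T, H, W))
    T1 = np.stack([ones, zeros, zeros, dvx[..., 0], dvx[..., 1]], axis=-1)
    T2 = np.stack([zeros, ones, zeros, dvy[..., 0], dvy[..., 1]], axis=-1)
    T3 = np.stack([zeros, zeros, ones, dvt[..., 0], dvt[..., 1]], axis=-1)
    return T1, T2, T3
```

The two normal directions at a point can then be taken as an orthonormal basis of the complement of span(T_1, T_2, T_3), for example, via an SVD.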

Since the initial matches do not provide any orientation, we encode the obtained 5D points as ball tensors. This means that all of the eigenvalues are set to one and the eigenvector matrix is the identity matrix. After all points are encoded, each of them casts its information to its neighborhood, the size of which is defined by σ, through the tensor voting process.

3.4 Outlier Rejection and Densification

Through the first tensor voting step, each token collects votes from its neighbors within the same segment. As mentioned in the previous section, the geometric feature that represents the desired smooth layers in the 5D space is the 3D variety, and its saliency is computed by λ_2 − λ_3. If a token is within a layer, then its neighbors strongly support the token, resulting in high 3D variety saliency. On the other hand, if a token results from a wrong match, then it is more likely an isolated point in the 5D space and collects weaker votes from its neighbors than tokens in layers do. Thus, we can locate those outliers by comparing the 3D variety saliency of each token with a threshold that is set by averaging all of the 3D variety saliencies in the segment.

After the outlier rejection process, we need to estimate the correct positions of the removed tokens to obtain dense velocity fields. The smoothness constraint is also applied in this densification process, which Fig. 9 illustrates. At a removed position (x, y, t), we set up multiple candidate velocities (v_x, v_y) between the minimum and maximum velocities of its neighbors within the same segment and generate the corresponding candidate 5D tokens (square dots). Each candidate token collects votes from its existing neighbors (circular dots), and we then select the one with the highest 3D variety saliency (the square in the middle) for the (x, y, t) position.

Fig. 9. Densification process. Circular dots indicate inliers, and the vertical square dots indicate a set of candidate token positions. Due to the smoothness constraint, the candidate in the middle obtains the highest saliency, so it is selected.

In the densification process, the previous oversegmentation greatly helps in preserving the exact motion boundaries. In Fig. 10, we assume that the removed token position belongs to the boundary of object 2, so the best estimated token position must be the second candidate from the bottom. Without the oversegmentation, the inlier tokens of both object 1 and object 2 cast votes to the candidate tokens. Thus, it is not clear which candidate token, the second one from the top or the second one from the bottom, will obtain the highest saliency. On the other hand, if we know in advance from the oversegmentation that the removed token position belongs to object 2, then we can allow only the inlier tokens from object 2 to cast votes to the candidates. This correctly finds the best candidate token, which is the second one from the bottom.

Fig. 10. Densification process without the oversegmentation. The candidates for the missing token position that belongs to object 2 collect votes from the inlier tokens of both object 1 and object 2.
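A schematic version of this candidate search might look as follows (our illustration; `variety_saliency` stands for the λ_2 − λ_3 computation after votes are collected from the segment's inlier tokens, and its implementation is assumed; sampling a 2D grid of candidate velocities is our reading of "between the minimum and maximum velocities"):

```python
import numpy as np

def densify_position(x, y, t, neighbor_vel, variety_saliency, n=10):
    """Estimate the velocity at a removed token position (x, y, t).

    neighbor_vel: (N, 2) velocities of the surviving (inlier) tokens of the
    same segment near (x, y, t). Candidate velocities are sampled between
    their minimum and maximum; the candidate 5D token with the highest
    3D variety saliency (lambda_2 - lambda_3) is kept.
    variety_saliency: assumed callback that collects votes for a candidate
    token from the segment's inliers and returns lambda_2 - lambda_3.
    """
    lo, hi = neighbor_vel.min(axis=0), neighbor_vel.max(axis=0)
    vxs = np.linspace(lo[0], hi[0], n)
    vys = np.linspace(lo[1], hi[1], n)
    cands = np.array([(vx, vy) for vx in vxs for vy in vys])
    scores = [variety_saliency(np.array([x, y, t, vx, vy]))
              for vx, vy in cands]
    return cands[int(np.argmax(scores))]
```

In the full method, the votes come only from inliers of the same segment, which is what keeps the estimate on the correct side of a motion boundary.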
3.5 Merging Motion Layer Segments

Through the previous densification process, we have obtained smooth layers and velocity information for every pixel in the 5D space. Each layer is associated with a small homogeneous image segment, so we need to merge the layers to form complete object shapes. In our approach, we compare all adjacent layers in pairs and use the following two properties to measure the similarity between them:

• the average of the (v_x, v_y) components (that is, the average velocity of the pixels) of the 5D tokens around the boundary between the two layers and
• the normal space that those tokens span.

Roughly speaking, the first property examines whether two layers are connected to each other, and the second property examines whether the connection is smooth. In the case of nontranslational motion such as zooming, rotation, or nonrigid motion, the layers may not be flat or parallel to the (x, y, t) space. In such a case, the average velocity and the normal space of all the tokens in a layer do not correctly characterize the layer. Therefore, we consider only tokens around the boundary region where the two layers meet, as shown in Fig. 11a; those tokens are enough to examine how smoothly the two layers are connected. For the tokens in each square window, we compute the average velocity components and normal spaces for both layer 1 and layer 2 and then compute the similarity of the two layers based on them. The window size, 21 × 21, was obtained empirically, considering a size small enough to avoid the above problem but large enough to capture the local motion characteristics around the boundaries. This process is repeated for the other square windows, and the final similarity measures between the two layers are computed by averaging the errors over the square windows. The following equations are used for the error computation:

• Average velocity error: (1/N) Σ_n |V_n^(layer1) − V_n^(layer2)|.
• Normal space error: S_3 / S_2.

Fig. 11. Merging process and various cases. (a) Top view of two adjacent layers, where (v_x, v_y) is orthogonal to the page (only the tokens in each square are used in the similarity computation). (b) Both the velocity and normal spaces are similar: merge. (c) Similar velocity but different normal spaces: do not merge. (d) Different velocity but similar normal spaces: do not merge. (e) Both velocity and normal spaces are different: do not merge.

In the first equation, N is the total number of square windows and V_n is the average velocity component of the tokens in each layer at a given window n. In the second equation, the S_i are the singular values (S_1 is the largest) of the matrix E in which we stack the two normal vectors (e_1 and e_2) from each token in both layer 1 and layer 2. If the two normal spaces are compatible, then E will be rank two and S_3 will be small. Otherwise, the third singular value, S_3, will be large, meaning that the two normal spaces are not compatible. Note that the two similarity measures, average velocity and normal space, are independent; thus, both of them should be satisfied at the same time to merge layers. Figs. 11b, 11c, 11d, and 11e show the different cases. If two layers have similar velocities and normal spaces around their boundary, then they can be merged (Fig. 11b). However, if either of the two similarity measures fails, then the two layers cannot be merged (Figs. 11c, 11d, and 11e). Successively, all layers are tested in pairs for merging, and, based on those pairwise results, we build the final motion segmentation for the given reference image.

In this merging stage, we need to consider how to set the thresholds. When we decide whether two layers are merged or not, their similarity measures are compared with certain thresholds, which are obtained empirically. For most sequences, a constant threshold for the average velocity error can be used because we roughly normalize the maximum velocity of objects when we initially generate the 5D points. Also, the threshold for the normal space error does not significantly affect the merging results, so a constant, empirically obtained threshold value can be used as well. In some cases, however, we might need to try several thresholds to get a reasonable result. For instance, if the given sequence has many moving objects with different sizes and speeds and the quality of the initial matching is poor, then it is hard to distinguish which layers are moving objects and which layers are sets of wrong matches.
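The normal-space test in particular is compact in code. The sketch below (ours; the names are hypothetical) computes both error measures for one boundary window from the tokens' velocities and normal vectors:

```python
import numpy as np

def window_errors(vel1, vel2, normals1, normals2):
    """Similarity errors between two layers within one boundary window.

    vel1, vel2: (N1, 2) and (N2, 2) token velocities from each layer.
    normals1, normals2: (N1, 2, 5) and (N2, 2, 5) normal vectors (e1, e2)
    of each token. Returns (velocity_error, normal_space_error).
    """
    vel_err = np.linalg.norm(vel1.mean(axis=0) - vel2.mean(axis=0))
    # Stack all normal vectors into E; compatible normal spaces keep E
    # close to rank two, so S3/S2 stays small.
    E = np.vstack([normals1.reshape(-1, 5), normals2.reshape(-1, 5)])
    S = np.linalg.svd(E, compute_uv=False)   # singular values, descending
    return vel_err, S[2] / S[1]
```

Two layers are merged only if both errors fall below their (empirically set) thresholds.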
Fig. 12. Various motion segmentation results. (a)-(d) The Blue car sequence. (e)-(h) The Truck sequence. (i)-(l) The Flower garden sequence. (a), (e), and (i) Reference images. (b) and (f) Ground-truth segmentation without shadows. (c) and (g) Ground-truth segmentation with shadows. (j) Ground-truth segmentation with thin branches. (k) Ground-truth segmentation without thin branches. (d), (h), and (l) Our motion segmentation results.

4 EXPERIMENTAL RESULTS

In this section, we present some experimental results. Section 4.1 focuses on our framework itself by showing various segmentation results. In Section 4.2, we demonstrate the importance of the temporal smoothness constraint on some challenging sequences that have repetitive patterns or severe occlusions. In Section 4.3, we present various vision applications and show how our segmentation and dense trajectory results can easily solve those problems. Section 4.4 provides all of the input parameters that are used for the experiments. The experiments were performed under the following system environment: a Pentium GHz CPU, 2 Gbytes of RAM, an Nvidia GeForce 7800 GTX GPU, and both the CPU-based and GPU-based motion analysis frameworks [23].

4.1 Basic Motion Segmentation

Various real image sequences. We first present basic motion segmentation results for three different sequences: the Blue car, Truck, and Flower garden sequences, shown in the columns of Fig. 12. In the cases of Blue car and Truck, there are shadows that move along with the moving cars, so the shadows are merged into the car regions because their motion is compatible. Thus, we separately show the ground-truth segmentation of the cars without and with the shadows in the second row (Figs. 12b and 12f) and the third row (Figs. 12c and 12g), respectively. Our motion segmentation results for those two sequences are shown in the last row (Figs. 12d and 12h).

TABLE 3. Segmentation Error Rates for the Sequences in Fig. 12

In the case of Flower garden, we see two erroneous segmentation results, at the thin branches and at the trunk of the tree. In fact, both are caused by the same undersegmentation problem. Although most segmentation tools produce good oversegmentation outputs, they sometimes fail in extreme cases. For instance, the thin branches in Flower garden are incorrectly included in the large sky segment and hence are classified as outliers and smoothed out during the densification. If the branches were separately segmented in the oversegmentation step, they would certainly appear in the final segmentation result. Another example can be found in the boundary regions between the tree and the flowers. Due to the very complex textures around those boundaries, some segments span both the tree and the flowers, resulting in noisy boundaries. The overall segmentation error rates, given as the percentage of wrong pixel assignments, for the three sequences are provided in Table 3.

Teddy bear and Cones sequences. In order to show the accuracy of the per-pixel motion estimation and to quantitatively compare our proposed approach with others, we tested two sequences, Teddy bear and Cones, which come with ground-truth disparity maps [1]. Figs. 13a and 13c show the ground-truth disparity maps of Teddy bear and Cones, respectively, and Figs. 13b and 13d show our disparity map results. Counting all pixels in the images and setting the error threshold to one, the error rates were 11.5 percent for Teddy bear and 8.47 percent for Cones. A comparison of the error rates with other methods is provided in Table 4. Unlike other methods that are optimized to handle stereo pairs only, our approach is more generic in that it can handle all kinds of motion, not just horizontal disparity. Also, our method is purely local, and it is the only local method that compares competitively with global methods. For instance, Graph Cuts, one of the popular global methods, shows a 12.4 percent error rate (Graph Cuts with occlusion handling [19]) on Cones [1].

Fig. 13. Results of Teddy bear and Cones. (a) Ground-truth disparity map of Teddy bear. (b) Our disparity map of Teddy bear. (c) Ground-truth disparity map of Cones. (d) Our disparity map of Cones.

TABLE 4. Error Rate Comparison for Teddy bear and Cones [1]

One of the most important properties of the tensor voting framework is its robustness to noise. In order to demonstrate this, we tested the same Teddy bear after adding synthetic noise to the original set of 5D input points. The first noisy input, Teddy bear 50, was generated by randomly selecting 50 percent of the input points and adding random real numbers between −5 and 5 to the points. The second noisy input, Teddy bear 80, was generated by adding the same kind of random real numbers to 80 percent of the original input points. Fig. 14 shows those noisy inputs and their final spatiotemporally smoothed layers. The disparity errors were computed in the same way, and the resulting error rates are 12.1 percent for Teddy bear 50 and 16.8 percent for Teddy bear 80. These low error rates demonstrate that our motion analysis framework is very robust to noisy inputs.
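For reference, the error measure used above is simple to state in code: a pixel counts as wrong when its estimated disparity differs from the ground truth by more than the threshold of one (our sketch of the standard metric from [1]):

```python
import numpy as np

def disparity_error_rate(estimate, ground_truth, threshold=1.0):
    """Percentage of pixels whose disparity error exceeds the threshold."""
    bad = np.abs(estimate - ground_truth) > threshold
    return 100.0 * bad.mean()
```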
4.2 Challenging Sequences

Walking person. This sequence consists of five frames, and the third frame is selected as the reference frame. With this sequence, we want to show how our 5D approach with temporal information outperforms the 4D approach in [25]. Fig. 15a shows the five input frames, with frame 3 as the reference frame. The two boxed regions in the reference frame show regions where it is difficult to find correspondences due to their periodic patterns. This problem is clearly seen in Fig. 15b, which shows the v_x components of the initial matches between the reference frame and the first frame: some groups of wrong matches for pixels in the boxed regions accidentally form smooth regions. If we use only two frames (the third and the first) in the 4D space, those wrong matches cannot be removed due to their smoothness, as can be seen in Fig. 15c. The wrong matches, marked with black ellipses, have survived the outlier rejection process, and they would eventually generate false motion layers after densification. On the other hand, Fig. 15d shows the correct outlier rejection result between the third and first frames obtained by our proposed 5D approach, where the wrong layers are clearly removed in spite of their smoothness. This is possible because the wrong matches are not temporally consistent in the 5D space.

Fig. 14. Testing Teddy bear after adding synthetic noise. (a) Original input. (b) Teddy bear 50: noise added to 50 percent of the input points. (c) Teddy bear 80: noise added to 80 percent of the input points. (d) Output of (a), error rate: 11.5 percent. (e) Output of (b), error rate: 12.1 percent. (f) Output of (c), error rate: 16.8 percent.

Fig. 15. Results of the Walking person sequence. (a) Input sequence. (b) Initial matches. (c) Outlier rejection by two frames in 4D. (d) Outlier rejection by multiple frames in 5D. (e) Final segmentation. (f) Some temporal trajectories.

In other words, the matches between the reference frame and the frames other than the first frame do not show those accidental smooth layers, so they are classified as outliers in the 5D space. Fig. 15e presents the final motion segmentation by coloring each segment, and Fig. 15f shows the temporal trajectories of some pixels.

Two pedestrians. With this sequence, we present the advantage of the temporal smoothness constraint in the presence of serious occlusion. As can be seen in Fig. 16a, two people are walking in opposite directions, so the one in the back suffers from serious occlusions in frames 4-6. Therefore, the matches for him between the reference frame (frame 2) and those three frames include a significant number of wrong matches, and these are mostly removed during the outlier rejection process due to their random distribution in the 5D space. Fig. 16b shows the result of the outlier rejection for frame 5, where most of the matches of the person in the back have been removed. On the other hand, he is mostly visible in the other frames, and the matches for those frames create smooth layers in the 5D space. During the densification process, those smooth layers, which consist of good matches, extend into the regions where the wrong matches were removed. This is possible due to the temporal smoothness constraint, and the final densification result for frame 5 is presented in Fig. 16c. Figs. 16d and 16e show the temporal trajectories of the pixels in a single scan line for the initial noisy matches and the spatiotemporally smoothed matches, respectively. In Fig. 16d, the pixels of the person in the back lose their temporal tracks due to occlusions, but the tracks are recovered by the spatiotemporal smoothing process, as can be seen in Fig. 16e.

Fig. 16. Results of the Two pedestrians sequence. (a) Input sequence. (b) Outlier rejection for frame 5. (c) Densification for frame 5. (d) Initial temporal trajectories. (e) Spatiotemporally smoothed trajectories.

4.3 Extended Applications

In this section, we present some extended vision applications where our results, motion segmentation and dense trajectories, allow us to solve problems with greater ease than other approaches.

Tracking in a long sequence. A ubiquitous problem in computer vision is tracking moving objects in a sequence. Tracking is especially difficult when the camera moves and multiple moving objects occlude each other. All of these difficulties are handled properly under our framework, in which each independently moving object is segmented, occlusion is appropriately processed (as can be seen in the Two pedestrians example), and so on.
Thus, tracking is a straightforward problem within our framework, and we even provide pixel-level tracking information. Fig. 17a shows an example sequence in which four cars and the camera move. In Fig. 17b, we illustrate the tracking results for the sequence by showing images reprojected from the computed dense temporal trajectories of the reference image. Since the cars are already segmented, tracking them is straightforward, as the sketch below illustrates.
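A minimal sketch of this reprojection (ours; the trajectory array layout is assumed, as in the earlier sketches) warps the reference image to frame t using the dense trajectories, so any segmented object can be followed pixel by pixel:

```python
import numpy as np

def reproject(reference, traj_t, shape):
    """Reproject the reference image to another frame via dense trajectories.

    reference: (H, W) or (H, W, 3) reference image.
    traj_t: (H, W, 2) positions (x, y) that each reference pixel reaches
    at frame t (reference coordinates plus accumulated velocities).
    """
    out = np.zeros(shape, dtype=reference.dtype)
    H, W = reference.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    tx = np.rint(traj_t[..., 0]).astype(int)
    ty = np.rint(traj_t[..., 1]).astype(int)
    ok = (tx >= 0) & (tx < shape[1]) & (ty >= 0) & (ty < shape[0])
    out[ty[ok], tx[ok]] = reference[ys[ok], xs[ok]]
    return out
```

Restricting the assignment to the pixels of one segment tracks just that object.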

Fig. 17. Tracking example 1. (a) Input sequence: four cars and a moving camera. (b) Pixel-level tracking with segmented objects.

Another tracking example is shown in Fig. 18. The temporal trajectories of the pixels in the reference image are presented in Fig. 18b, and the reprojected images generated from the trajectories are shown in Fig. 18c. We process about 10 frames at a time, so, for a long sequence, we divide the sequence into multiple overlapping frame sets and then process them separately. After processing all of the sets, we simply combine the results to obtain long temporal trajectories.

Fig. 18. Tracking example 2. (a) Input sequence: one car and a moving camera. (b) Temporal trajectories. (c) Pixel-level tracking with a segmented object.

Mosaic generation. A mosaic is created by geometrically aligning a set of images and stitching them together. It is one of the oldest research topics in computer vision, and many researchers have proposed various algorithms. For instance, Shum and Szeliski [27] applied both global and local alignment techniques to reduce accumulated registration errors and small motion parallax, Uyttendaele et al. [31] proposed methods to deal with blurry regions caused by moving objects and exposure changes across images, and Davis [14] focused on the registration problem in the presence of moving objects based on the Mellin transform. Brown and Lowe [8] used the popular SIFT features for robust image registration and provided a complete mosaic framework called Autostitch. Most of these algorithms, however, fail to generate accurate mosaics when an image set has relatively large moving objects or suffers from strong parallax in large areas because those independently moving parts cause poor parameter estimation for image registration. On the other hand, our framework allows us to generate very accurate mosaics even in the presence of moving objects, since they are already identified and separated from the background before the parameter estimation step [24]. Also, the resulting temporal trajectories (that is, correspondences) can be directly used for parameter estimation.

Fig. 19 compares our method with the Autostitch framework on the sequence in Fig. 18 (the moving-object case), and the comparison for the Teddy bear sequence (the strong-parallax case) is shown in Fig. 20. As can be seen in both examples, Autostitch generates distorted mosaics due to the moving car or the strong parallax of the foreground objects. In the case of the Teddy bear sequence, we can also produce a mosaic of the foreground objects. Those separate mosaics of foreground and background are possible thanks to our accurate motion segmentation framework. Another advantage of our motion segmentation framework for mosaics is that we can generate background-only mosaics, whereas the mosaics generated by most other approaches include moving objects because they do not distinguish the foreground from the background. This difference can be seen in Fig. 19.

Fig. 19. Mosaic example: moving object. (a) Our result. (b) The result of Autostitch. The dark trail in our result is caused by the shadows of the car on the ground; it is not a registration problem.

Fig. 20. Mosaic example: strong parallax. (a) The foreground mosaic generated by our method (the Teddy bear is another independent object, so it is not included). (b) The background mosaic generated by our method. (c) The mosaic generated by Autostitch.

Three-dimensional reconstruction.
In the case of rigid 3D motion, we can reconstruct the dense 3D structure directly from our results. Although the basic 3D reconstruction framework is simple, it is hard to implement in practice.

Fig. 21. Three-dimensional reconstruction example. (a) One of the given images. (b) Generated 3D reconstruction.

For instance, finding good feature points with accurate correspondences is not easy, and multiple moving objects together with a moving camera make the problem much harder. In our case, however, such problems have already been solved in the main framework, so we can use any standard 3D reconstruction method without difficulty. Fig. 21a shows an example image in which both the tissue box and the camera move. By applying the 3D reconstruction algorithm described in [30] to each of our segmented dense correspondence groups separately and combining the resulting 3D structures afterward, we obtain the overall 3D structure of the scene in Fig. 21b. Note that the structure consists of dense 3D points, unlike many other approaches, which obtain only sparse 3D points from selected image features.

4.4 Parameters for Each Experiment

Table 5 lists all of the image sequences used and their input parameters. The provided data includes the sequence names, corresponding figure numbers, image sizes, the number of input images (NI), the correlation window sizes for the initial candidate matches (CWS), σ for tensor voting, the parameters for the oversegmentation, the thresholds in the merging step, and the overall processing time. As can be seen from the table, different values are used for each image sequence. However, the overall outputs are not very sensitive to the selection of these values. For instance, we tested the Flower garden sequence with different σ values for segmentation. The resulting segmentation error rates (refer to Fig. 12), based on the ground truth without branches, are 2.22 percent, 3.15 percent, and 3.38 percent for σ = 3, 7, 11, respectively.

TABLE 5. List of Parameters and Processing Time. 1: Number of input images. 2: Correlation window size. 3: Parameters for oversegmentation [11]: spatial/color/minimum region. 4: Thresholds in the merging step: velocity error/normal space error. 5: Processing time (sec) by the GPU-based implementation.

Table 6 shows the average processing time of each step for the sequences in Table 5. Note that the One car sequence requires a batch process, so only one process time of the batch is counted in the average computation. Also, the table compares the processing time of the conventional CPU-based code with that of the new GPU-based code described in [23]. Roughly speaking, the new GPU-based code is about 10 times faster than the CPU-based code. Both the CPU and GPU codes use the same oversegmentation and initial matching codes.

TABLE 6. Average Processing Time of Each Step

5 CONCLUSION AND FUTURE WORK

We have presented a novel spatiotemporal smoothing process for motion analysis that preserves discontinuities in a 5D space. The outputs of the system are accurate motion boundaries and dense temporal trajectories. Through many experimental results, we demonstrated the accuracy of the outputs and showed the importance of the temporal smoothing process on some challenging sequences. Besides these direct outputs of our framework, we

One important technical issue in our framework is the initial oversegmentation step. Although the mean shift-based segmentation tool usually produces very good oversegmentations, it sometimes fails, violating our assumption that a segment contains pixels from only a single object. Unfortunately, our current system does not have a module that fixes such undersegmentation (see, for example, the results on the Flower garden sequence). Another issue concerns the initial candidate match step. In principle, our main motion analysis framework can handle any kind of motion; however, because the current initial matching algorithm is limited (a simple cross correlation), some classes of motion may not be processed correctly. We will investigate these issues in future research.

ACKNOWLEDGMENTS
The research has been funded in part by the Integrated Media Systems Center, a US National Science Foundation (NSF) Engineering Research Center, Cooperative Agreement EEC, and NSF Grant IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the NSF.

REFERENCES
[1] Middlebury College Stereo Evaluation Webpage, middlebury.edu/stereo/.
[2] E. Adelson and Y. Weiss, "A Unified Mixture Framework for Motion Segmentation: Incorporating Spatial Coherence and Estimating the Number of Models," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[3] T. Amiaz and N. Kiryati, "Dense Discontinuous Optical Flow via Contour-Based Segmentation," Proc. IEEE Int'l Conf. Image Processing, vol. 3.
[4] S. Ayer and H. Sawhney, "Layered Representation of Motion Video Using Robust Maximum-Likelihood Estimation of Mixture Models and MDL Encoding," Proc. Fifth Int'l Conf. Computer Vision.
[5] H. Baker and R. Bolles, "Generalizing Epipolar-Plane Image Analysis on the Spatiotemporal Surface," Int'l J. Computer Vision, vol. 3, no. 1.
[6] M. Bleyer and M. Gelautz, "A Layered Stereo Algorithm Using Image Segmentation and Global Visibility Constraints," Proc. IEEE Int'l Conf. Image Processing, vol. 5.
[7] R. Bolles, H. Baker, and D. Marimont, "Epipolar-Plane Image Analysis: An Approach to Determining Structure from Motion," Int'l J. Computer Vision, vol. 1, no. 1, pp. 7-56.
[8] M. Brown and D. Lowe, "Recognising Panoramas," Proc. Ninth IEEE Int'l Conf. Computer Vision.
[9] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High Accuracy Optical Flow Estimation Based on a Theory for Warping," Proc. Eighth European Conf. Computer Vision, vol. 4.
[10] T. Brox, A. Bruhn, and J. Weickert, "Variational Motion Segmentation with Level Sets," Proc. Ninth European Conf. Computer Vision.
[11] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5.
[12] J. Costeira and T. Kanade, "A Multibody Factorization Method for Independently Moving Objects," Int'l J. Computer Vision, vol. 29, no. 3.
[13] D. Cremers and S. Soatto, "Motion Competition: A Variational Approach to Piecewise Parametric Motion Segmentation," Int'l J. Computer Vision, vol. 62, no. 3.
[14] J. Davis, "Mosaics of Scenes with Moving Objects," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[15] H. Hirschmuller, "Stereo Vision in Structured Environments by Consistent Semi-Global Matching," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[16] D. Husemoller, Fibre Bundles, third ed., Springer.
[17] K. Kanatani, "Motion Segmentation by Subspace Separation and Model Selection," Proc. Eighth Int'l Conf. Computer Vision, vol. 2.
[18] A. Klaus, M. Sormann, and K. Karner, "Segment-Based Stereo Matching Using Belief Propagation and a Self-Adapting Dissimilarity Measure," Proc. 18th Int'l Conf. Pattern Recognition.
[19] V. Kolmogorov and R. Zabih, "Computing Visual Correspondence with Occlusions via Graph Cuts," Proc. Eighth Int'l Conf. Computer Vision, vol. 2.
[20] G. Medioni, M. Lee, and C. Tang, A Computational Framework for Segmentation and Grouping, first ed., Elsevier.
[21] E. Mémin and P. Pérez, "Hierarchical Estimation and Segmentation of Dense Motion Fields," Int'l J. Computer Vision, vol. 46, no. 2.
[22] C. Min and G. Medioni, "Motion Segmentation by Spatiotemporal Smoothness Using 5D Tensor Voting," Proc. Fifth IEEE Workshop Perceptual Organization in Computer Vision.
[23] C. Min and G. Medioni, "Tensor Voting Accelerated by Graphics Processing Units (GPU)," Proc. 18th Int'l Conf. Pattern Recognition.
[24] C. Min, Q. Yu, and G. Medioni, "Multi-Layer Mosaics in the Presence of Motion and Depth Effects," Proc. 18th Int'l Conf. Pattern Recognition.
[25] M. Nicolescu and G. Medioni, "Layered 4D Representation and Voting for Grouping from Motion," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 4, Apr. 2003.
[26] C. Schnörr, "Determining Optical Flow for Irregular Domains by Minimizing Quadratic Functionals of a Certain Class," Int'l J. Computer Vision, vol. 6, no. 1.
[27] H. Shum and R. Szeliski, "Systems and Experiment Paper: Construction of Panoramic Image Mosaics with Global and Local Alignment," Int'l J. Computer Vision, vol. 36, no. 2.
[28] J. Sun, Y. Li, S. Kang, and H. Shum, "Symmetric Stereo Matching for Occlusion Handling," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[29] C. Tomasi and T. Kanade, "Shape and Motion from Image Streams under Orthography: A Factorization Method," Int'l J. Computer Vision, vol. 9, no. 2.
[30] P. Torr and D. Murray, "The Development and Comparison of Robust Methods for Estimating the Fundamental Matrix," Int'l J. Computer Vision, vol. 24, no. 3.
[31] M. Uyttendaele, A. Eden, and R. Szeliski, "Eliminating Ghosting and Exposure Artifacts in Image Mosaics," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[32] R. Vidal and R. Hartley, "Motion Segmentation with Missing Data Using Power Factorization and GPCA," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[33] J. Wang and E. Adelson, "Layered Representation for Motion Analysis," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[34] J. Xiao and M. Shah, "Motion Layer Extraction in the Presence of Occlusion Using Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, Oct. 2005.
[35] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister, "Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation and Occlusion Handling," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2.
[36] L. Zelnik-Manor and M. Irani, "Degeneracies, Dependencies and Their Implications in Multi-Body and Multi-Sequence Factorizations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2003.

Changki Min received the BS degree in mathematics from Hanyang University, Korea, in 1997 and the MS and PhD degrees in electrical engineering from the University of Southern California in 2000 and 2006, respectively. Currently, he is working at Apple Inc. as a senior graphics software engineer. His research interests include motion analysis, segmentation, tensor voting, computer graphics, and GPGPU.

Gérard Medioni received the Diplôme d'Ingénieur from the Ecole Nationale Supérieure des Télécommunications (ENST), Paris, in 1977 and the MS and PhD degrees from the University of Southern California (USC) in 1980 and 1983, respectively. He has been with USC since then and is currently a professor of computer science and electrical engineering, a codirector of the Institute for Robotics and Intelligent Systems (IRIS), and a codirector of the USC Games Institute. He served as the chairman of the Computer Science Department beginning in 2001. He has made significant contributions to the field of computer vision. His research covers a broad spectrum of the field, such as edge detection, stereo and motion analysis, shape inference and description, and system integration. He has published 3 books, more than 50 journal papers, and 150 conference articles and is the holder of eight international patents. He is an associate editor of the Image and Vision Computing Journal, the Pattern Recognition and Image Analysis Journal, and the International Journal of Image and Video Processing. He served as a program cochair of the 1991 IEEE Computer Vision and Pattern Recognition (CVPR) Conference and the 1995 IEEE International Symposium on Computer Vision, a general cochair of the 1997 IEEE CVPR Conference, a conference cochair of the 1998 International Conference on Pattern Recognition, a general cochair of the 2001 IEEE CVPR Conference, a general cochair of the 2007 IEEE CVPR Conference, and a general cochair of the upcoming 2009 IEEE CVPR Conference. He is a fellow of the IEEE, IAPR, and AAAI and a member of the IEEE Computer Society.
