Bayesian Algorithms for Simultaneous Structure From Motion Estimation of Multiple Independently Moving Objects


Bayesian Algorithms for Simultaneous Structure From Motion Estimation of Multiple Independently Moving Objects

Gang Qian*, Member, IEEE, Rama Chellappa, Fellow, IEEE, and Qinfen Zheng, Member, IEEE

Center for Automation Research, Institute for Advanced Computer Studies, University of Maryland, College Park, MD

Abstract

In this paper, the problem of simultaneous structure from motion estimation for multiple independently moving objects from a monocular image sequence is addressed. Two Bayesian algorithms are presented for solving this problem using the sequential importance sampling (SIS) technique. The empirical posterior distribution of object motion and feature separation parameters is approximated by weighted samples. The first algorithm addresses the problem when only two moving objects are present: a singular value decomposition (SVD) based sample clustering algorithm is shown to be capable of separating samples related to different objects, and a pair of SIS procedures is used to track the posterior distributions of the motion parameters. In the second algorithm, a balancing step is added to the SIS procedure to preserve samples of low weight so that all objects have enough samples to propagate empirical motion distributions. Using the proposed algorithms, the relative motions of all moving objects with respect to the camera can be simultaneously estimated. Both algorithms have been tested on synthetic and real image sequences, and improved results have been achieved.

EDICS: -SEQP Image Sequence Processing; -ANAL Analysis; -SEGM Segmentation

I. INTRODUCTION

Visual motion analysis has been an active research area in the image/video understanding community for over two decades. A major issue in the area is to recover the motion and structure of moving objects from an image sequence recorded by a monocular camera, usually known as the structure from motion (SfM) problem. Although the geometry of the SfM problem is well understood, due to the nonlinearity of the camera projection model and the non-Gaussianity of image measurement noise, research efforts on solving the problem have mainly focused on scenarios where only one rigid object is present, e.g. a moving camera looking at a static object. In this paper, we address the SfM problem when multiple moving objects are involved concurrently. Basically, we want to solve the following problem: given an image sequence containing two or more independently moving (with respect to the camera) objects, simultaneously estimate the motion between each moving object and the camera as well as the 3D structure of each object. We refer to this as the Multiple Independently Moving Objects Structure From Motion (MIMO-SfM) problem. Even though multiple independently moving objects are ubiquitous in the real world and researchers have long tried to solve this problem, it remains a significant challenge, mainly because of the lack of robust algorithms for the basic static-scene SfM problem. Existing MIMO-SfM algorithms can be roughly categorized by whether optical flow or feature point correspondences are used.
The pioneering work on MIMO-SfM using optical flow was by Adiv [1]: under the assumption of a piecewise planar scene, the flow field is first partitioned into connected segments using the generalized Hough technique, and the segments are then grouped so that all segments in one group can be explained by a single rigid motion. Due to the assumption made about the scene structure, this algorithm has difficulties with image sequences containing complex 3D scene structure rich in higher-order curved surfaces. Regarding feature correspondences, Costeira and Kanade proposed a multi-body factorization method [] to recover the motion and structure of multiple objects. This algorithm was extended from the

factorization method for SfM by Tomasi and Kanade [3]. The shape interaction matrix is used to segment out multiple moving objects. Although this factorization-based algorithm is elegant, the assumed camera imaging model is quite restrictive. The extended Kalman filter (EKF) has also been used to solve the problem. In [], Darrell et al. proposed an EKF-based algorithm built on a recursive SfM algorithm presented by Azarbayejani and Pentland in [5]. The minimum description length (MDL) criterion is employed to determine the minimum number of models and their corresponding feature groups such that the models can interpret the feature trajectories in the associated groups. However, the recursive SfM algorithm in [5] is not guaranteed to converge to the true solution. Another MIMO-SfM algorithm, proposed by Soatto and Perona [], also uses an EKF. It is based on a motion estimation algorithm called the essential filter [7]. Because the relative motion of objects between consecutive image frames is computed, correspondences between adjacent time instants bear a very low signal-to-noise ratio (SNR). Due to inherent ambiguities present in the SfM problem under noisy observations, the essential filter can converge to false solutions. Certain restrictions on the moving objects' trajectories have also been introduced to help solve the problem. In the methods proposed by Han and Kanade [] and Shashua and Levin [9], the dynamic points (features on the moving objects) are assumed to move along straight-line paths with constant velocity. Although improvements can be achieved by restricting the object motion, the potential applications of such algorithms are greatly diminished. Recently, Bayesian algorithms for static-scene SfM have been developed to deal with nonlinearity and measurement errors. The Bayesian SfM algorithm using the SIS technique [] has been shown to be resistant to image measurement errors and also able to handle motion ambiguities.
An algorithm for moving object detection from a moving camera presented in [11] is also based on the SIS framework, and encouraging results have been reported. This research indicates that the SIS technique is a very promising way of solving the MIMO-SfM problem. SIS is a recently proposed technique for approximating the posterior distribution of the state parameters of a dynamic system [1]. It is a useful tool for solving problems with non-linearity, non-Gaussianity

and multi-modalities. In SIS, weighted samples are used to describe the posterior distribution of the state parameters, and these samples are propagated from one time instant to the next to capture changes in the target distribution. The samples and weights are used to compute statistics of the state variables of interest, e.g. the mean and the modes. In this paper, two algorithms are presented to handle different scenarios in the MIMO-SfM problem. The first approach is restricted to the case when only two objects are present; the second handles more general cases with two or more objects. Feature correspondences are the input to both algorithms. We assume the perspective camera projection model and rigid moving objects; no other assumptions about either the 3D scene structure or the object motion are made. Solving the MIMO-SfM problem using SIS is not as simple as it appears. Ideally, when an infinite number of samples is used, the moving object detection algorithm proposed in [11] is able to simultaneously estimate the motion parameters for each independently moving object. In practice, since only a finite number of samples can be used, samples of different moving objects compete against each other for offspring at the resampling step. It is very likely that one object does not receive enough samples to represent its motion, so that it is eliminated from the competition for samples and its motion and structure cannot be recovered, i.e. the object simply disappears from the SIS procedure. To deal with this issue, different strategies are suggested. In the first approach, for two moving objects, two separate SIS procedures are invoked, so that each of them is responsible for propagating the samples and weights of the motion parameters associated with one moving object.
In the second approach, only one SIS procedure is employed and an additional step called balancing is introduced, which keeps the numbers of offspring of all potential moving object candidates approximately equal. The organization of the paper is as follows. In Section 2, we briefly introduce the SIS method and review related work on SfM for static scenes. The two approaches are then discussed in Section 3. In Section 4, experimental results obtained using both synthetic and real image sequences are presented. Conclusions and future research directions are summarized in Section 5.

II. BACKGROUND

To solve the MIMO-SfM problem, we would like to find optimal estimates for object motion, structure and feature segmentation, which provide the best interpretation of the observations under criteria such as the maximum a posteriori (MAP) probability. In this paper, the posterior distribution of the parameters is approximated using random sampling methods. In this way, a much clearer picture of the structure of the solution space can be provided: not only can the optimal estimates be found, but the uncertainty of the estimates can also be captured.

A. Sequential Importance Sampling

The SIS method is a recently proposed technique for approximating the posterior distribution of the state parameters of a dynamic system [1]. Usually, the state space model of a dynamic system is described by observation and state equations. If the measurement is denoted by $y_t$ and the state parameter by $x_t$, the observation equation essentially provides the conditional distribution of the observation given the state, $p(y_t \mid x_t)$. Similarly, the state equation gives the Markov transition distribution from time $t$ to time $t+1$, $p(x_{t+1} \mid x_t)$. The goal is to find the posterior distribution of the states $X_t = (x_0, x_1, \ldots, x_t)$ given all the available observations up to $t$, $\pi_t(X_t) = p(X_t \mid Y_t)$, where $Y_t = (y_0, \ldots, y_t)$. One way to represent the approximation of the posterior distribution is by properly weighted samples [1]. Suppose $\{X_t^{(j)}\}_{j=1}^m$ is a set of random samples properly weighted by the set of weights $\{w_t^{(j)}\}_{j=1}^m$ with respect to $\pi_t$, and let $q_{t+1}$ be a trial distribution. Then the recursive SIS procedure to obtain random samples and weights properly weighting $\pi_{t+1}$ is as follows [1].

SIS steps: for $j = 1, \ldots, m$:
(A) Draw $x_{t+1}^{(j)}$ from $q_{t+1}(x_{t+1} \mid X_t^{(j)})$. Attach $x_{t+1}^{(j)}$ to form $X_{t+1}^{(j)} = (X_t^{(j)}, x_{t+1}^{(j)})$.
(B) Compute the incremental weight $u_{t+1}^{(j)}$ by
$$u_{t+1}^{(j)} = \frac{\pi_{t+1}(X_{t+1}^{(j)})}{\pi_t(X_t^{(j)})\, q_{t+1}(x_{t+1}^{(j)} \mid X_t^{(j)})}$$
and let $w_{t+1}^{(j)} = u_{t+1}^{(j)} w_t^{(j)}$.
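As an illustration of steps (A) and (B), the recursion can be sketched on a toy one-dimensional random-walk model; the model, noise levels, sample count and observations below are all hypothetical, and the transition density itself is used as the trial distribution so that the incremental weight reduces to the likelihood. A simple multinomial resampling helper, as commonly paired with SIS, is included as well.

```python
import numpy as np

rng = np.random.default_rng(0)

def sis_step(samples, weights, y, trans_std, obs_std):
    """One SIS recursion on a toy 1-D random-walk model.
    (A) draw x_{t+1}^{(j)} from the trial distribution, taken here to be
        the state-transition density itself;
    (B) multiply each weight by the incremental weight, which in this
        case reduces to the likelihood p(y_{t+1} | x_{t+1}^{(j)})."""
    new_samples = samples + rng.normal(0.0, trans_std, size=samples.shape)
    incremental = np.exp(-0.5 * ((y - new_samples) / obs_std) ** 2)
    new_weights = weights * incremental
    return new_samples, new_weights / new_weights.sum()

def resample(samples, weights):
    """Multinomial resampling: high-weight samples spawn many offspring,
    low-weight samples tend to die out; weights reset to 1/m."""
    m = len(samples)
    idx = rng.choice(m, size=m, p=weights)
    return samples[idx], np.full(m, 1.0 / m)

m = 2000
samples = rng.normal(0.0, 1.0, size=m)      # samples from the prior
weights = np.full(m, 1.0 / m)
for y in [0.7, 0.8, 0.9]:                   # three noisy observations
    samples, weights = sis_step(samples, weights, y,
                                trans_std=0.05, obs_std=0.2)
    samples, weights = resample(samples, weights)
posterior_mean = samples.mean()             # concentrates near the data
```

With observations clustered around 0.8, the weighted samples concentrate there, illustrating how the recursion tracks the posterior over time.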

It can be shown [1] that $\{X_{t+1}^{(j)}, w_{t+1}^{(j)}\}_{j=1}^m$ is properly weighted with respect to $\pi_{t+1}$. Hence, the above SIS steps can be applied recursively to obtain the properly weighted set for any future time instant when the corresponding observations are available. It is not difficult to show that, given the samples $\{X_t^{(j)}\}$ properly weighted with respect to the joint posterior distribution $p(X_t \mid Y_t)$, the marginal samples formed by the components $x_s^{(j)}$ of $X_t^{(j)}$ are properly weighted by the same set of weights with respect to the marginal posterior distribution $p(x_s \mid Y_t)$. Once the properly weighted samples of the joint distribution are obtained, the marginal distributions are approximated by the marginal samples weighted using the same set of weights.

The choice of the trial distribution $q_{t+1}$ is crucial in the SIS procedure since it directly affects the efficiency of the method. In our approach, we use $q_{t+1}(x_{t+1} \mid X_t) = p(x_{t+1} \mid x_t)$, where $p(x_{t+1} \mid x_t)$ is the one-step Markov transition probability density function (PDF). In the computation of camera motion, it describes the camera motion transition from one time instant to the next and is used to update the camera motion samples. The incremental weight $u_{t+1}$ is then proportional to $p(y_{t+1} \mid x_{t+1})$, the conditional probability density function of the observations given the state sample, also known as the likelihood function of the state samples. In SIS, an additional resampling step [1] often follows the sample weight evaluation after drawing new samples for the current state. A major benefit of resampling is to statistically suppress bad samples (with small weights) and encourage good samples, so that good samples produce enough offspring to describe the distribution of future states.

B. Structure from Motion Using SIS for Static Scenes

The SIS technique has been used in solving the SfM problem for static scenes.
In [], camera motion and scene structure parameters are estimated from 2D feature trajectories using SIS. To reduce the computational load and make the problem solvable using sampling-based methods, the camera motion and structure parameters were decoupled. A recursive SIS-based algorithm was first used to track the empirical posterior distribution of the camera motion parameters. Then the depth values of

the feature points on the static scene were computed using the camera motion estimates by triangulation. To make this paper self-contained, a brief review of [] is given below.

Fig. 1. Motion parameters of a moving camera.

Camera Motion Parameters: Assume that the camera is moving and the scene is static. We would like to compute the overall global motion of the camera in a world coordinate system. A second coordinate system is attached to the camera: its origin is the camera center and the positive direction of its $Z_C$ axis points along the camera looking direction. The camera and world coordinate systems initially coincide; when the camera moves, the camera system travels with the camera. As shown in Figure 1, the following parameters are used to represent the global camera motion:
$$\mathbf{m} = (\psi_x, \psi_y, \psi_z, \alpha, \beta) \qquad (1)$$
where $\psi_x, \psi_y, \psi_z$ are the camera rotation angles about the axes of the world system and $(\alpha, \beta)$ are the elevation and azimuth angles of the camera translation direction. The unit vector in the translation direction is the function $T(\alpha, \beta)$ of these two angles. Note also that the camera motion parameters are all relative to the world coordinate system; they do not describe the instantaneous camera motion between consecutive time instants. In this way, the translation can be accumulated over time and the scene structure can be accurately estimated.

Camera Motion and Scene Structure Estimation Using SIS: During the computation of the camera motion parameters, random samples are drawn from the camera motion parameter space. The epipolar constraint

[13] is then used to evaluate the sample weights. Recall that the epipolar constraint says that the perspective projections of a 3D point on two image planes taken from different viewpoints lie on their corresponding epipolar lines, which are the intersections of the two image planes with the epipolar plane containing the 3D point and the two camera centers before and after the camera movement. Given the image position of a point in one view and the camera motion parameters between the two views, the epipolar line related to this point in the other view can be easily determined and the epipolar distance can be computed. During the sample weight evaluation, the smaller the resultant epipolar distance, the larger the weight. After weight evaluation, new samples are drawn using kinematic models of the camera. The empirical camera motion distribution can then be propagated throughout the sequence. Once the camera motion is estimated, the scene structure is determined by triangulation. See [] for a full treatment of SfM using the SIS technique.

III. BAYESIAN MULTIPLE INDEPENDENTLY MOVING OBJECT STRUCTURE FROM MOTION

Two approaches for simultaneous estimation of the motion and structure of multiple independently moving objects are presented in this section. The first is designed for the case when only two independently moving objects are involved; the second deals with more complex scenarios when there are more than two moving objects. Before presenting the algorithms, we formulate a state space model for object motion and feature segmentation.

A. A State Space Model

In the MIMO-SfM problem, object motions, structures and feature segmentation need to be solved for. Similar to [], object motion and feature segmentation are solved first and then object structure is recovered.

Camera Motion Parameters: The relative motion between each object and the camera needs to be estimated. Either the object or the camera can be assumed to be static.
In our approach, we assume that the objects are static and estimate the camera motion relative to each individual object. Once the camera motion is

obtained, it is straightforward to compute the object motion relative to the camera if desired. The camera motion parameters used are given by (1).

Feature Segmentation Parameters: Trajectories of feature points are used as measurements. These points belong to different moving objects. The task of feature point segmentation is to cluster the entire feature set into multiple subsets, one for each object. To represent the membership of the feature points, we use the validity vector $v$ introduced in [11]. If $N$ feature points are used in total, $v$ is an $N$-dimensional vector. Each feature point has a corresponding entry in $v$, which indicates the membership of the related feature point. The complete state vector describing camera motion and feature segmentation is given by $x = (\mathbf{m}, v)$. Given a state sample, the motion parameters represent the camera motion relative to a particular moving object. Associations of feature points with this moving object are described by the validity vector in the same state sample: if the entry of a feature point in the validity vector is positive and high, it is very probable that this feature point belongs to this object.

Camera Kinematics and Projection Models: Given the above state vector, a dynamic system can be used to describe the behavior of a moving camera observing multiple independently moving objects:
$$x_{t+1} = x_t + n_x \qquad (2)$$
$$y_t = F(x_t) + n_y \qquad (3)$$
$F(\cdot)$ denotes the perspective projection, a function of camera motion, feature segmentation and the scene structure. $n_x$ denotes the dynamic noise in the system, describing the time-varying property of the state vector; it contains the dynamic noise of both the camera motion and the validity vector. Even though the associations of the feature points could be fixed if we assume that no object splitting/merging occurs during the observation period, validity vectors are still time-varying during the SIS procedure.
By allowing the validity vector to be time-varying, the resulting temporal integration of the feature segmentation makes the feature segmentation results more robust to measurement errors.
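As a minimal illustration of the observation equation, the perspective projection applied to scene points expressed in the camera frame can be sketched as below; the unit focal length (normalized image coordinates) and the sample points are assumptions made for the example.

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Perspective projection of 3-D points (camera frame, Z > 0) onto
    the image plane; unit focal length gives normalized coordinates."""
    points_3d = np.asarray(points_3d, dtype=float)
    z = points_3d[:, 2]
    return focal * points_3d[:, :2] / z[:, None]

pts = np.array([[1.0, 2.0, 4.0],
                [0.5, -1.0, 2.0]])
uv = project(pts)   # each row is (X/Z, Y/Z)
```

In the full model, $F(\cdot)$ also depends on the sampled motion and on which features the validity vector marks as belonging to the object.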

B. Two Independently Moving Object Structure from Motion

Recall that in the SIS procedure, there are two essential steps for sample and weight propagation over time: drawing new samples and updating weights.

Drawing Samples: Given state samples for the previous time instant, samples for the current time are drawn from a trial function. Although any trial function can be used, trial functions that describe the dynamics of the system are desired for an effective SIS algorithm. Since prior knowledge about the motion parameters is not available, the random walk model is a suitable choice for modeling the camera position and orientation. Therefore, the motion samples at $t+1$ are drawn using
$$\mathbf{m}_{t+1} = \mathbf{m}_t + n_m \qquad (4)$$
where $n_m$ represents the dynamic noise of the motion parameters. Similarly, the validity vector samples at $t+1$ are obtained as
$$v_{t+1} = \gamma v_t + g(\mathbf{m}_t, y_t) + n_v \qquad (5)$$
where $n_v$ is the dynamic noise in the validity vector and $\gamma$ is an exponential forgetting factor; both represent the possible time-varying nature of the validity vector. $g(\cdot)$ is a function used to update the current validity vector. In this paper, different forms of $g(\cdot)$ are applied, depending on the task. In the case of two moving objects, the elements of $g(\cdot)$ are computed as
$$g_i(\mathbf{m}, y) = \begin{cases} 1, & \text{if } d_i \le T \\ -1, & \text{otherwise} \end{cases} \qquad (6)$$
where $d_i(\mathbf{m}, y)$ is the distance from the $i$th feature point to its associated epipolar line given the motion parameters $\mathbf{m}$, and $T$ is a prechosen threshold for this distance set according to the feature tracking error level. In our experiments, its value is usually twice the standard deviation of the feature tracking errors.

Weight Evaluation: Since the state transition PDF is used for drawing new samples, the likelihood function of the observation given the state samples is used to compute the sample weights. It is given by
$$p(y_t \mid x_t) \propto \exp\Big(-\sum_{i \in S} \frac{d_i^2}{2\sigma^2}\Big) \qquad (7)$$
where the sum runs over the set $S$ of features whose validity entries are among the seven highest.
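The sample-weighting machinery just described can be sketched as follows; the scene, the motion sample, the thresholds and the noise levels are all hypothetical. Epipolar distances are computed from an essential matrix, the validity entries are pushed towards plus or minus one by the thresholded update with a forgetting factor, and the sample weight is formed from the features with the highest validity entries.

```python
import numpy as np

rng = np.random.default_rng(1)

def essential(R, t):
    """E = [t]_x R for calibrated (normalized) image coordinates."""
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])
    return tx @ R

def epipolar_distances(E, x1, x2):
    """Distance of each point in view 2 to the epipolar line E @ x1
    induced by its match in view 1 (rows are homogeneous points)."""
    lines = x1 @ E.T                   # one epipolar line per feature
    num = np.abs(np.sum(lines * x2, axis=1))
    return num / np.hypot(lines[:, 0], lines[:, 1])

def update_validity(v, d, T, gamma=0.9, noise_std=0.05):
    """Validity update in the spirit of (5)-(6): +1 if the epipolar
    distance is below the threshold T, -1 otherwise, with forgetting
    factor gamma; constants here are illustrative."""
    g = np.where(d < T, 1.0, -1.0)
    return gamma * v + g + rng.normal(0.0, noise_std, size=v.shape)

def sample_weight(d, v, sigma, n_best=7):
    """Likelihood-style weight over the n_best features with the
    highest validity entries (the paper keeps the seven highest)."""
    best = np.argsort(v)[-n_best:]
    return np.exp(-np.sum(d[best] ** 2) / (2 * sigma ** 2))

# Hypothetical scene: 10 features, the last two moving inconsistently
# with the sampled camera motion (pure x-translation, no rotation).
R, t = np.eye(3), np.array([1.0, 0.0, 0.0])
E = essential(R, t)
X = np.column_stack([rng.uniform(-1, 1, 10),
                     rng.uniform(-1, 1, 10),
                     rng.uniform(3, 6, 10)])   # 3-D points, camera frame
x1 = X / X[:, 2:3]                             # view 1, homogeneous
x2 = (X - t) / X[:, 2:3]                       # view 2 after motion t
x2[8:, :2] += 0.5                              # two outlier features

d = epipolar_distances(E, x1, x2)
v = np.zeros(10)
for _ in range(5):                             # temporal integration
    v = update_validity(v, d, T=0.05)
w = sample_weight(d, v, sigma=0.05)
```

After a few frames the validity entries of the consistent features grow positive while those of the outliers turn negative, and the weight stays high because the selected features all lie near their epipolar lines.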

The set $S$ in (7) collects the indices $i$ such that $v_i$ is among the seven highest values of the validity vector (8); membership is expressed through an indicator function (9), which returns one if its argument is true and zero otherwise.

Sample Clustering: Since the camera motion and feature segmentation samples related to different objects are mixed, we would like to cluster state samples belonging to the same object and then use different SIS procedures to propagate the samples of different objects.

Assume that two moving objects, $A$ and $B$, are present in the image sequence. There are two kinds of validity vector samples, one for each moving object. These two kinds of samples are symmetric in the sense that for samples of object $A$, the entries corresponding to the feature points on object $A$ have high positive values and the entries corresponding to the feature points on object $B$ have low negative values, and vice versa. Moreover, due to the symmetric nature of the validity vector update used in the trial function, the additive inverse of a sample related to object $A$ can be viewed as a sample of object $B$, and the sample means of the two sets of validity vector samples corresponding to objects $A$ and $B$, which can be approximated by the validity vectors obtained using the true object motions, are additive inverses of each other. This symmetric structure makes it possible to cluster the samples into two sets using the singular value decomposition (SVD). Let the samples of the validity vector after resampling be $\{v^{(j)}\}_{j=1}^m$, where $m$ is the number of samples, and let the $m \times N$ matrix $V = [v^{(1)}, \ldots, v^{(m)}]^T$ be the validity matrix.

Proposition 1: If $\{x_j\}_{j=1}^m$, the rows of an $m \times N$ matrix $A$, are either samples randomly drawn from a distribution $p(x)$ or the additive inverses of such samples, then the right singular vector corresponding

to the largest singular value is proportional to the mean of $p(x)$ when the number of samples goes to infinity.

Proof: The singular value decomposition of the matrix $A$ is
$$A = USV^T = \sum_i \sigma_i u_i v_i^T \qquad (10, 11)$$
where $U$ and $V$ are unitary matrices with column vectors $\{u_i\}$ and $\{v_i\}$, and $S$ is a diagonal matrix with non-negative decreasing singular values $\{\sigma_i\}$ along the diagonal. Determining the largest singular value $\sigma_1$ and its corresponding left and right singular vectors $u_1$ and $v_1$ is equivalent to solving the optimization problem
$$(u_1, v_1) = \arg\max_{\|u\| = \|v\| = 1} u^T A v \qquad (12)$$
Writing $u_1 = (p_1, \ldots, p_m)^T$, the right singular vector can be expressed as
$$v_1 = \frac{1}{\sigma_1} A^T u_1 = \frac{1}{\sigma_1} \sum_{j=1}^m p_j x_j \qquad (13)$$
Since each $x_j$ is either a sample randomly drawn from $p(x)$ or the additive inverse of such a sample, statistically the absolute values of the coefficients $\{p_j\}_{j=1}^m$ are equal. Therefore
$$v_1 \propto \sum_{j \in I_1} p_j x_j + \sum_{j \in I_2} p_j x_j \qquad (14)$$
where $I_1$ is the set of indices of rows containing samples drawn from $p(x)$ and $I_2$ is its complement. Noticing that $\{-x_j\}_{j \in I_2}$ are also samples from $p(x)$, (14) can be rewritten as
$$v_1 \propto \frac{1}{m} \sum_{j=1}^m \tilde{x}_j \qquad (15, 16)$$
where $\{\tilde{x}_j\}_{j=1}^m$ are samples from $p(x)$. From the Law of Large Numbers, it follows that as $m$ goes to infinity,
$$v_1 \propto E[x] \qquad (17)$$

Thus $v_1$ is proportional to the minimum mean square error (MMSE) estimate of $x$, which is given by the conditional mean $\hat{x} = E[x \mid \{\tilde{x}_j\}_{j=1}^m]$; when $m$ goes to infinity, it is equal to the unconditional mean $E[x] = \int x\, p(x)\, dx$.

Due to the symmetric structure of the samples belonging to the two moving objects, the validity matrix can be viewed as a matrix containing either samples from a distribution $p(v)$ or the additive inverses of samples from the same distribution. Therefore, according to the above proposition, when the number of samples is large enough, the right singular vector corresponding to the largest singular value approaches the mean of $p(v)$. As a result, we have the following algorithm for clustering the samples of the motion parameters and the validity vectors of two moving objects.

SVD-based Sample Clustering Algorithm
1) Construct the validity matrix $V$ by stacking the samples of the validity vector as its rows.
2) Compute the SVD of the validity matrix $V$. Find the largest singular value and its corresponding left and right singular vectors $u_1$ and $v_1$. Compute $-v_1$, the additive inverse of $v_1$, by changing the signs of all entries in $v_1$. Let $K$ be the minimum number of feature points required to solve SfM (e.g., in the case of a calibrated camera, seven points are sufficient to uniquely determine the motion and structure parameters [1]). If the number of positive entries in either $v_1$ or $-v_1$ is less than $K$, the samples cannot be correctly clustered at the current time instant: skip the remaining steps of the sample clustering algorithm and return to the main SIS procedure.
3) Separate the samples. For the $j$th row of the validity matrix $V$: if the number of positive entries it has in common with $v_1$ is larger than or equal to $K$, the $j$th motion and validity vector sample $x^{(j)}$ is assigned to cluster $S_1$; if the number of positive entries in common with $-v_1$ is larger than or equal to $K$, it is assigned to cluster $S_2$. A sample may belong to both clusters or to neither.
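A numerical check of Proposition 1 and of the clustering rule above can be sketched as follows; the feature counts, sample counts and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: 16 features, the first 9 on object A, the last 7
# on object B.  Validity samples of A are near +1 on A's features and
# -1 on B's; samples of B are (up to noise) their additive inverses.
mean_a = np.concatenate([np.ones(9), -np.ones(7)])
V = np.vstack([mean_a + rng.normal(0, 0.1, size=(60, 16)),
               -mean_a + rng.normal(0, 0.1, size=(40, 16))])

# Proposition 1: the top right singular vector approaches the sample
# mean direction (up to the arbitrary sign of the SVD).
_, _, Vt = np.linalg.svd(V, full_matrices=False)
v1 = Vt[0]
if v1[:9].sum() < 0:
    v1 = -v1                          # fix the arbitrary SVD sign

K = 7  # minimum feature count for SfM with a calibrated camera
# A row joins cluster 1 (resp. 2) if it shares at least K positive
# entries with v1 (resp. -v1); it may join both clusters or neither.
cluster1 = np.where(((V > 0) & (v1 > 0)).sum(axis=1) >= K)[0]
cluster2 = np.where(((V > 0) & (v1 < 0)).sum(axis=1) >= K)[0]
```

With well-separated validity samples the two recovered clusters coincide with the two generating objects, as the proposition predicts.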
The samples belonging to the two moving objects have thus been segmented into two clusters, $S_1$ and $S_2$. Once the sample population is clustered into two sets related to the two moving objects, the motion

and structure of each object can then be estimated separately. The overall algorithm for the case of two independently moving objects is described below.

SIS Procedure for Two Independently Moving Objects
1) Initialization. Draw samples of the motion parameters $\{\mathbf{m}_0^{(j)}\}_{j=1}^m$ from the initial distribution $p_0(\mathbf{m})$, which describes the distribution of the motion parameters before the camera moves. Although the rotation angles and the translational vector are all zero at that point, the translation direction angles are uniformly distributed. Hence, in the state samples, the rotation angles are set to zero and the translation direction angles are drawn from uniform distributions over their ranges. The validity vector samples are set to one. Assign equal weights to these samples.
For $t = 1, 2, \ldots$:
2) Sample generation. Draw samples of the motion parameters at time instant $t$, $\{\mathbf{m}_t^{(j)}\}_{j=1}^m$, from the distributions induced by $\{\mathbf{m}_{t-1}^{(j)}\}_{j=1}^m$. Since video sequences are used here as the image sources, rather than sets of image frames in arbitrary order, a random walk dynamic model is assumed, and a Gaussian distribution centered at the previous sample with prechosen positive standard deviations for the rotation and translation angles can be used as a good approximation:
$$\mathbf{m}_t^{(j)} \sim N(\mathbf{m}_{t-1}^{(j)}, \Sigma_m) \qquad (18)$$
Draw samples of the validity vector $\{v_t^{(j)}\}_{j=1}^m$ via (5), with the function $g(\cdot)$ defined by (6).
3) Weight computation and resampling. Compute the weights of the samples, $\{w_t^{(j)}\}$, from the observed feature correspondences according to the likelihood equation (7). The resulting samples and their corresponding weights $\{x_t^{(j)}, w_t^{(j)}\}$ are properly weighted with respect to $p(x_t \mid Y_t)$. Resample the above samples.
4) Sample clustering and invocation of a new SIS procedure. Apply the above SVD-based sample clustering algorithm to the sample set. If the samples belonging to the two moving objects cannot be clustered into two clusters, go to step 2). Otherwise, one more SIS procedure is invoked.
The current SIS procedure keeps propagating the samples and weights of the state parameters corresponding to the first object, whose samples lie in cluster $S_1$, and the new SIS procedure

will maintain the samples and weights of the state parameters of the other object, using the samples in cluster $S_2$. Thus, the motion/structure and validity vectors related to the two moving objects can be simultaneously estimated.

By using the above SIS procedure, the posterior distributions of the camera motion and feature segmentation parameters related to each object can be approximately described by the resulting samples and their corresponding weights. For a sample set related to a moving object, the empirical mean of the validity vector can be input to clustering methods such as the k-means algorithm to determine the final feature segmentation for this moving object.

Inference of Depth Distribution: The posterior depth distribution of the segmented feature points for each object can then be obtained. In [], two algorithms are presented to find the posterior distribution of feature depths. The discussion there about inferring $p(z \mid Y)$ from the results obtained for $p(X \mid Y)$ is still valid. Hence, the posterior distribution of the depths of the segmented feature points for each moving object can be directly inferred using the samples and weights properly weighted with respect to the posterior motion distribution. Both algorithms developed in [] can be used to find samples and weights properly weighted with respect to the posterior distribution of feature depths.

C. Multiple Independently Moving Object Structure from Motion

We now look at the more general case of MIMO-SfM when two or more moving objects are present. In this algorithm, the sample-drawing and weight-evaluation steps are very similar to those in the previous section. The only change is in the function for updating the validity vectors, which is modified as
$$g_i(\mathbf{m}_t, y_t) = 2\, e^{-d_i^2 / T^2} - 1 \qquad (19)$$
where $d_i$ and $T$ have the same meanings as in the case of two moving objects.

Sample Clustering: The purpose of sample clustering here is different from that in the previous algorithm. Here only one SIS procedure is used.
State samples corresponding to the same object are clustered and the camera motion and feature segmentation with respect to each individual object can then be inferred using the clustered samples. A balancing step is applied in the SIS procedure to keep the sample numbers

of all the candidate moving objects comparable. The state samples are clustered according to the signs of the entries of the validity vector samples.

Sign-based Sample Clustering Algorithm
1) Sample sorting. Let $\{x^{(j)}\}$ be the samples of motion and validity before the resampling step, and let $\{w^{(j)}\}$ be the corresponding weights of the state samples. Sort the samples according to their weights such that the first sample has the highest weight. Denote the sorted sample set and weight set by $\mathcal{X}$ and $\mathcal{W}$, respectively.
2) Sample cluster center generation. Starting from the first sample in $\mathcal{X}$, candidate cluster centers are generated by looking at the signs of the samples in the sorted sample set. Assume that, up to the current point, $L$ clusters have been formed and each cluster has a cluster center. For a given state sample, ignore the entries in the validity vector that have already been claimed by a positive entry of the validity vector of any existing candidate center. If the number of remaining positive entries of the validity vector in the current sample is greater than or equal to $K$ (the minimum required number of feature points for solving SfM), a new candidate cluster is generated with the current sample as the $(L+1)$st cluster center.
3) Sample clustering. The samples that have at least $K$ positive entries of the validity vector in common with the new cluster center are segmented into the new cluster, forming a new sample/weight cluster. Go to step 2) and look for new candidate cluster centers. Stop when all the samples in $\mathcal{X}$ have been visited by the center generation step.
4) Finalization. It is possible that a sample is not assigned to any cluster. To preserve such samples, the samples not included in any cluster are grouped into one additional cluster. Hence, all the samples in $\mathcal{X}$ are segmented into different clusters, each corresponding to a candidate moving object.

By using this clustering algorithm, the samples in $\mathcal{X}$
are segmented into different clusters corresponding to potential moving objects.
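As a concrete illustration, the clustering above can be sketched with NumPy. This is a hedged sketch rather than the authors' implementation: the minimum feature count `min_pts` is left symbolic in the paper, and the two-pass structure (generate all centers first, then assign samples) slightly simplifies the interleaved center-generation/assignment loop of steps 2 and 3.

```python
import numpy as np

def sign_based_clustering(weights, validity, min_pts=6):
    """Cluster SIS samples by the signs of their validity-vector entries.

    weights:  (N,) importance weights of the state samples.
    validity: (N, F) validity-vector samples; an entry > 0 hypothesizes
              that the corresponding feature lies on the candidate object.
    min_pts:  minimum number of feature points needed to solve SfM
              (illustrative value; the paper leaves it symbolic).
    Returns a list of sample-index arrays, one per cluster; a final
    cluster collects any leftover samples (finalization step).
    """
    order = np.argsort(weights)[::-1]            # 1) sort by descending weight
    occupied = np.zeros(validity.shape[1], bool) # features claimed by a center
    centers = []                                 # positive-sign masks of centers
    for i in order:                              # 2) candidate-center generation
        fresh = (validity[i] > 0) & ~occupied    # ignore already-claimed entries
        if fresh.sum() >= min_pts:               # enough new features: new center
            centers.append(validity[i] > 0)
            occupied |= fresh
    clusters = [[] for _ in centers]
    leftover = []
    for i in order:                              # 3) assign samples to clusters
        pos = validity[i] > 0
        for k, c in enumerate(centers):
            if (pos & c).sum() >= min_pts:       # shares >= min_pts positive entries
                clusters[k].append(i)
                break
        else:
            leftover.append(i)                   # 4) finalization cluster
    if leftover:
        clusters.append(leftover)
    return [np.asarray(c) for c in clusters]
```

Each returned cluster of sample indices then corresponds to one candidate moving object.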

Weight Balancing: To resolve the competition between different objects for the number of offspring during the resampling step, a balancing step is added to the SIS procedure. Define the cumulative weight of a sample set as follows.

Definition: For a sample-weight cluster C, the cumulative weight W_C of C is the summation of the weights in C, i.e. W_C = sum_{j in C} w^(j).

The following step is added into the SIS procedure.

Balancing Step

1) Cluster weight evaluation. Assume that after the implementation of the clustering procedure, K sample clusters are formed, including the one produced in the finalization step. Compute the cumulative weight of each sample cluster.

2) Weight modification. Since each cluster should be treated equally in the resampling step of the SIS, the cumulative weights of all the clusters should be equal. The sample weights in all the clusters are therefore modified so that the cumulative weight of each individual cluster equals 1/K. For a cluster C, the modification multiplies every weight in C by the scalar 1/(K W_C), where W_C is the cumulative weight of the cluster before modification. This step is called weight balancing, and the resulting clusters are the weight-balanced clusters.

Hence, we have the following complete SIS procedure for the multiple independently moving object SfM algorithm.

SIS Procedure for Multiple Independently Moving Objects

1) Initialization. (Same as the initialization in the SIS procedure for two moving objects.) Then, for each incoming frame:

2) Sample generation. Draw new samples of the motion parameters in the same way as described in the SIS procedure for two moving objects, and draw samples of the validity vector using the proposal function defined by (19).

3) Weight computation. Compute the weights of the samples from the observed feature correspondence according to the likelihood equation (7). The resulting samples and their corresponding weights are properly weighted with respect to the posterior of the state x.

4) Sample clustering. Segment all the samples and weights into multiple clusters using the above sign-based sample clustering algorithm.

5) Weight balancing. Modify the weights of the samples in every cluster as described in the above weight-balancing method so that the cumulative weights of the clusters are equal.

6) Resampling. Resample the samples in the clusters according to the balanced weights. Thus, the motion/structure and validity vectors related to all the moving objects can be simultaneously estimated. Go to step 2 if the feature correspondences from the next frame are available.

By using the above SIS procedure, the posterior distributions of the camera motion and feature segmentation parameters related to each object can be approximately described by the resulting samples and their corresponding weights. The samples associated with one moving object can be grouped using the sample clustering algorithm. As in the case of two moving objects, for a sample set related to a moving object, the empirical mean of the validity vector can be fed to a clustering method such as the k-means algorithm to determine the final segmentation of the points on that moving object. The posterior depth distribution of the segmented feature points for each moving object can then be obtained.

IV. PERFORMANCE ANALYSIS

The proposed methods for the MIMO-SfM problem have been tested using both synthetic and real image sequences.

A. Two Moving Objects

In this experiment, an image sequence with a walking person was used to test the first MIMO-SfM algorithm for two moving objects. Although there was only one walking person in the image sequence, the camera capturing the video was also moving, so the background scene serves as the second moving object with respect to the camera.
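Steps 5 and 6 of the procedure above (weight balancing followed by resampling) admit a compact sketch. This is an illustration under the assumption that `clusters` is a list of sample-index arrays such as the clustering step produces, not the paper's own code.

```python
import numpy as np

def balance_and_resample(weights, clusters, rng=None):
    """Weight-balance the clusters, then resample all samples jointly.

    Each cluster's weights are rescaled so that every cluster carries the
    same cumulative weight 1/K, where K is the number of clusters.  The
    subsequent multinomial resampling then gives every candidate object
    an equal share of offspring in expectation, so the samples of a
    lightly weighted object are not crowded out by the dominant one.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(clusters)
    balanced = np.zeros_like(weights, dtype=float)
    for idx in clusters:
        W = weights[idx].sum()                  # cumulative weight of the cluster
        balanced[idx] = weights[idx] / (K * W)  # cluster now sums to 1/K
    n = len(weights)
    offspring = rng.choice(n, size=n, p=balanced)  # multinomial resampling
    return offspring, balanced
```

Balancing before resampling is what keeps all candidate objects represented: without it, a cluster whose cumulative weight is small would receive almost no offspring.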
Feature points were detected and tracked by using the KLT feature tracker [15]. The trajectories of these features are displayed in Figure (a). Some of the features

are on the person and some on the background. By using the proposed algorithm, the two sets of motion and validity-vector samples related to the person and the background can be separated. The marginalized posterior distributions of the motion parameters associated with both moving objects are shown in plots (a)-(e) of Figures 3 (person) and 4 (background). In these plots, distributions of motion parameters at different time instants are listed column-wise, from the top (the starting time instant) to the bottom (the end). The solid lines in plots (f) of both figures show the empirical means of the validity vectors related to the person and the background, respectively; the dashed lines describe the true segmentation. The results of feature segmentation can be found in Figure . Feature points on the person are marked by circles in Figure (b), and those on the background in Figure (c). All fifteen feature points on the person are correctly segmented without any false alarms. Thirteen feature points are correctly detected out of the sixteen points on the background.

Fig. . The trajectories and segmentation result of feature points using the sequence with a walking person. (a) shows feature trajectories. Feature-point segmentation results using the samples of the validity vector from the SIS procedure corresponding to the scene background are shown in (b) and (c). The points belonging to the person and the background are marked by circles in (b) and (c), respectively.

Ambiguities in Structure from Motion: In the presence of feature tracking errors, ambiguities exist in the estimation of camera motion and object structure [16], [17]. For example, in this experiment, when estimating the motion between the background scene and the camera, ambiguities caused by the confusion between out-of-image-plane rotation and in-image-plane translation were observed. Such ambiguities result in multi-modality of the posterior.
Each mode represents a possible solution of the motion and structure parameters from the noisy feature correspondences. In our algorithm using SIS, this multi-modality in the

posterior is clearly pictured by distinct peaks in the resulting empirical posterior density. In this example, there are two major modes in the posterior of the camera translation direction angle, as shown in Figure 4(e). These two modes indicate two solutions for the translation direction whose projections on the image plane point in roughly opposite directions. By using the method for depth inference, the depth values of feature points on each object can be computed; hence, part of the structure of each object at the feature points can be recovered. Figure 5 shows the reconstructed 3D model of the upper body of the person. Since two solutions of the background motion exist, two sets of depth values were obtained for the feature points on the background scene. Most of the correctly segmented feature points on the background scene are around the bookshelf, so 3D models of the scene near the bookshelf can be reconstructed. They are shown in Figures 6 and 7.

Fig. 3. The marginalized posterior distributions of relative motion parameters between the walking person and the camera in the real video sequence. (a), (b) and (c) are the posterior distributions of the camera rotation angles about the x, y and z axes, respectively. (d) and (e) are the posterior distributions of the azimuth and elevation angles of the camera translation direction, respectively. (f) shows the sample mean of the validity vector obtained by using the samples from the SIS procedure corresponding to the walking person. The dashed line shows the true segmentation of feature points. Features with positive values belong to the person.
It can be seen that the model shown in Figure 6 is the true one.
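The multi-modality can also be detected numerically from the weighted samples, e.g. by scanning a weighted histogram of a scalar motion parameter (such as the translation azimuth) for local maxima. The sketch below is illustrative only; the bin count is an arbitrary choice, not a value from the paper.

```python
import numpy as np

def histogram_modes(samples, weights, bins=36, lo=-np.pi, hi=np.pi):
    """Return bin centers that are strict local maxima of the weighted
    empirical density of a scalar parameter (e.g. the azimuth angle)."""
    hist, edges = np.histogram(samples, bins=bins, range=(lo, hi),
                               weights=weights)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return [centers[i] for i in range(1, bins - 1)
            if hist[i] > hist[i - 1] and hist[i] > hist[i + 1]]
```

Applied to the azimuth samples of the background motion in this experiment, such a scan would be expected to report two well-separated modes, matching the two translation-direction solutions.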

Fig. 4. The marginalized posterior distributions of the relative motion parameters between the scene background and the camera in the real video sequence containing a walking person. (a), (b) and (c) are the posterior distributions of the camera rotation angles about the x, y and z axes, respectively. (d) and (e) are the posterior distributions of the azimuth and elevation angles of the camera translation direction, respectively. (f) shows the sample mean of the validity vector obtained by using the samples from the SIS procedure corresponding to the background. The dashed line shows the true segmentation of feature points. Features with positive values belong to the background.

B. Multiple Moving Objects

Example One: In the first example, a sequence with three walking people is used. Fifty feature points on the walking people were detected after background subtraction; their trajectories are shown in Figure 8(a), superimposed on the first frame of the sequence. Number the people in Figure 8(a) from left to right. Feature segmentation results for the different persons are shown in Figures 8(b)-(d). Several features on the second person's legs were not correctly marked because they were not correctly tracked during feature tracking; it can be seen in Figure 8(a) that these features drifted to the right foot of the person. Despite this small number of errors, the majority of the feature points are successfully clustered around the correct persons.

Example Two: In this example, a synthetic video sequence containing three moving objects was generated to test the performance of the second algorithm.
In this sequence, the first object moves along a circular track to the left part of the scene, the second object translates downward along the vertical axis, and

Fig. 5. Reconstructed 3D model of the walking person.

the third object translates to the left without rotation. In the experiment, 70 feature points on these three moving objects were used; the first block of features lies on the first object, the next block on the second object, and the rest on the third object. The trajectories of these features are shown in Figure 9(a). Figures 9(b)-(f) show the estimated motion distributions. Since the motion distributions of all three objects are mixed together, multiple modes can be observed. By using the second algorithm, the motion and structure of these three moving objects can be estimated simultaneously. Feature segmentation results for this sequence are shown in Figure 10, where the horizontal axis shows the indices of the feature points; the features belonging to the first and second objects are labeled by two in Figures 10(a) and (b), and those of the third object by one in Figure 10(c). There are no false alarms in the feature segmentation results; however, some feature points on the second and the third objects are not extracted. The marginalized motion distributions of these moving objects at different time instants and the structure estimates are shown in Figure 11. Each column of Figure 11 corresponds to the motion and structure estimates of one moving object; from left to right, the columns relate to the first, the second and the third moving object. The first five rows of Figure 11 show the motion distributions: from top to bottom, the distributions of the rotation angles about the x, y and z axes and of the translation direction angles. The last row of Figure 11 shows the structure estimates of the segmented feature points for each moving object.

Example Three: This example uses an image sequence containing two persons. One person was walking to

Fig. 6. One reconstructed 3D model of the background scene. It can be seen that this model is the correct estimate of the true one.

Fig. 7. The other 3D model of the background scene. It is a false reconstruction.

the right and the other was moving forward roughly along the optical axis of the camera. Since the camera was also moving, the background scene served as the third moving object with respect to the camera. Forty-four feature points were tracked throughout the sequence; the feature trajectories are displayed in Figure 12(a). By using the second MIMO-SfM algorithm, the features on different moving objects can be clustered and, at the same time, the motion of each moving object can be estimated. The results of feature segmentation are shown in Figures 12(b)-(d), where the feature points segmented out for each object are marked by circles.

Fig. 8. The trajectories and segmentation results using the image sequence with three walking people. (a) shows feature trajectories. Feature points belonging to the first, second and third persons are marked by circles in (b), (c) and (d), respectively.

Figure 13 shows the empirical means (thick curves) and the standard deviations (thin bars) of the validity vectors using the clustered samples related to the points on the background. Figures 13(a), (b) and (c) show these values at the initial, the middle and the last time instants. It can be seen that the mean of the validity vector gradually evolves to the desired structure, in which the entries of the features on the background take large positive values while the others are negative. The standard deviations of the validity-vector entries also decrease over time. Partial 3D models of these objects were also reconstructed. Figure 14 shows the face model of the first person. Although only a rough model can be recovered, because only a small number of feature points are available, the shape of the face and surrounding regions is correctly recovered. Motion and structure ambiguities were present when the motion and structure parameters of the second person and the background were estimated.
Two 3D models were computed for each of them: one close to the true model and the other a false reconstruction. Figure 15 shows the two models of the background; the pictures in each row show one model, viewed from different viewpoints. It can be seen

that the model shown in the first row is a correct estimate of the true model. The pictures in Figure 16 display the two models of the second person. Again, the model shown in the first row is the correct estimate, and the true surface reconstruction is considerably smoother than the false reconstruction.

Fig. 9. The observed feature trajectories and the mixed motion estimation results using a synthetic sequence with three moving objects. (a) shows the feature trajectories in the synthetic sequence. One object moves along a circular track to the left part of the scene, another object translates downward along the vertical axis, and the third object also translates to the left. The marginalized posterior distributions of the motion parameters of the three independently moving objects are shown together. (b), (c) and (d) are the posterior distributions of the camera rotation angles about the x, y and z axes, respectively. (e) and (f) are the posterior distributions of the azimuth and elevation angles of the camera translation direction, respectively.

V. CONCLUSIONS

The MIMO-SfM problem has been investigated in this paper using SIS. Two algorithms have been proposed to handle different scenarios: (1) when only two objects are present in the image sequence and (2) when two or more objects are present. The state vector contains both motion and feature segmentation parameters. In the first algorithm, for two objects, an SVD-based segmentation method is proposed to cluster the samples of the validity vectors of the different moving objects. In the second algorithm, a balancing step is added into the SIS procedure to preserve lightly weighted samples so that all objects might have

enough samples to propagate the empirical motion and validity-vector distributions. The two algorithms have been tested using both synthetic and real image sequences. A limitation of the second algorithm is that a large number of samples is required when many objects are present in the image sequence. So far, we have only considered fully calibrated cameras; for future research, we would like to investigate the MIMO-SfM problem with uncalibrated cameras.

Fig. 10. Feature segmentation results using the synthetic sequence. The horizontal axis shows the indices of the feature points. The feature points belonging to the first and second objects are labeled by two in (a) and (b), and those of the third object by one in (c), respectively.

ACKNOWLEDGMENT

The work presented in this paper was partially supported by the U.S. Army Research Laboratory (ARL) Collaborative Technology Alliance contract DAAD

REFERENCES

[1] G. Adiv, "Determining 3-D motion and structure from optical flow generated by several moving objects," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 7, no. 4, pp. 384-401, 1985.
[2] J.P. Costeira and T. Kanade, "A multibody factorization method for independently moving objects," International Journal of Computer Vision, vol. 29, no. 3, pp. 159-179, 1998.
[3] C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: A factorization method," International Journal of Computer Vision, vol. 9, no. 2, pp. 137-154, 1992.
[4] T. Darrell, A. Azarbayejani, and A. Pentland, "Robust estimation of multiple models in the structure from motion domain," Tech. Rep., MIT Media Lab, 1994.
[5] A. Azarbayejani and A. Pentland, "Recursive estimation of motion, structure, and focal length," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 562-575, 1995.

Fig. 11. Motion and structure estimates using the synthetic sequence. The first five figures at the top of the first column are the marginal posterior distributions of the motion parameters at different time instants, and the last one shows the structure estimates of the segmented feature points associated with the first object (moving along a circular track); the results related to the second object (moving vertically) and the third object (moving to the left) are shown in the second and third columns, respectively.

Fig. 12. The trajectories and segmentation results using the image sequence with two walking people, captured by a moving camera. (a) shows feature trajectories. Feature points belonging to different objects are marked by circles in (b), (c) and (d): (b) shows the features on the first person (walking to the right), (c) the features on the second person (walking along the optical axis), and (d) the features on the background scene.

Fig. 13. Temporal evolution of the validity vectors. These figures show the empirical means (thick curves) and the standard deviations (thin bars) of the validity vectors using the clustered samples related to the background in the first example. (a), (b) and (c) show these values at the initial, middle and last time instants.

[6] S. Soatto and P. Perona, "3D transparent structure segmentation and multiple motion estimation from monocular perspective image sequences," in Proc. of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 1994.
[7] S. Soatto, R. Frezza, and P. Perona, "Motion estimation on the essential manifold," in European Conference on Computer Vision, 1994.
[8] M. Han and T. Kanade, "Reconstruction of a scene with multiple linearly moving objects," in IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[9] A. Shashua and A. Levin, "Multi-frame infinitesimal motion model for the reconstruction of (dynamic) scenes with multiple linearly moving objects," in International Conference on Computer Vision, 2001.
[10] G. Qian and R. Chellappa, "Structure from motion using sequential Monte Carlo methods," in International Conference on Computer Vision, 2001.
[11] G. Qian and R. Chellappa, "Moving targets detection using sequential importance sampling," in IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, 2001.

Fig. 14. Reconstructed 3D face model of the first person.

Fig. 15. Two reconstructed models (one in each row) of the background scene. The model in the first row is the correct estimate.

Fig. 16. Two reconstructed models (one in each row) of the back of the second person. The model in the first row is the correct estimate.

[12] J.S. Liu and R. Chen, "Sequential Monte Carlo methods for dynamic systems," Journal of the American Statistical Association, vol. 93, pp. 1032-1044, 1998.
[13] O.D. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
[14] R.Y. Tsai and T.S. Huang, "Uniqueness and estimation of three-dimensional motion parameters of rigid objects with curved surfaces," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 6, no. 1, pp. 13-27, 1984.
[15] J. Shi and C. Tomasi, "Good features to track," in IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593-600.
[16] G.S. Young and R. Chellappa, "Statistical analysis of inherent ambiguities in recovering 3-D motion from a noisy flow field," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, no. 10, 1992.
[17] K. Daniilidis and H.H. Nagel, "The coupling of rotation and translation in motion estimation of planar surfaces," in IEEE Conference on Computer Vision and Pattern Recognition, 1993.


Visual Recognition: Image Formation

Visual Recognition: Image Formation Visual Recognition: Image Formation Raquel Urtasun TTI Chicago Jan 5, 2012 Raquel Urtasun (TTI-C) Visual Recognition Jan 5, 2012 1 / 61 Today s lecture... Fundamentals of image formation You should know

More information

Structure and motion in 3D and 2D from hybrid matching constraints

Structure and motion in 3D and 2D from hybrid matching constraints Structure and motion in 3D and 2D from hybrid matching constraints Anders Heyden, Fredrik Nyberg and Ola Dahl Applied Mathematics Group Malmo University, Sweden {heyden,fredrik.nyberg,ola.dahl}@ts.mah.se

More information

Structure from Motion

Structure from Motion Structure from Motion Outline Bundle Adjustment Ambguities in Reconstruction Affine Factorization Extensions Structure from motion Recover both 3D scene geoemetry and camera positions SLAM: Simultaneous

More information

Structure from Motion

Structure from Motion 11/18/11 Structure from Motion Computer Vision CS 143, Brown James Hays Many slides adapted from Derek Hoiem, Lana Lazebnik, Silvio Saverese, Steve Seitz, and Martial Hebert This class: structure from

More information

Stereo imaging ideal geometry

Stereo imaging ideal geometry Stereo imaging ideal geometry (X,Y,Z) Z f (x L,y L ) f (x R,y R ) Optical axes are parallel Optical axes separated by baseline, b. Line connecting lens centers is perpendicular to the optical axis, and

More information

CHAPTER 5 MOTION DETECTION AND ANALYSIS

CHAPTER 5 MOTION DETECTION AND ANALYSIS CHAPTER 5 MOTION DETECTION AND ANALYSIS 5.1. Introduction: Motion processing is gaining an intense attention from the researchers with the progress in motion studies and processing competence. A series

More information

Computer Vision I - Algorithms and Applications: Multi-View 3D reconstruction

Computer Vision I - Algorithms and Applications: Multi-View 3D reconstruction Computer Vision I - Algorithms and Applications: Multi-View 3D reconstruction Carsten Rother 09/12/2013 Computer Vision I: Multi-View 3D reconstruction Roadmap this lecture Computer Vision I: Multi-View

More information

Perception and Action using Multilinear Forms

Perception and Action using Multilinear Forms Perception and Action using Multilinear Forms Anders Heyden, Gunnar Sparr, Kalle Åström Dept of Mathematics, Lund University Box 118, S-221 00 Lund, Sweden email: {heyden,gunnar,kalle}@maths.lth.se Abstract

More information

Short on camera geometry and camera calibration

Short on camera geometry and camera calibration Short on camera geometry and camera calibration Maria Magnusson, maria.magnusson@liu.se Computer Vision Laboratory, Department of Electrical Engineering, Linköping University, Sweden Report No: LiTH-ISY-R-3070

More information

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Fall 2014, Prakash Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in

More information

Structure from Motion

Structure from Motion Structure from Motion Lecture-13 Moving Light Display 1 Shape from Motion Problem Given optical flow or point correspondences, compute 3-D motion (translation and rotation) and shape (depth). 2 S. Ullman

More information

CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt.

CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt. CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt. Session 4 Affine Structure from Motion Mani Golparvar-Fard Department of Civil and Environmental Engineering 329D, Newmark Civil Engineering

More information

Motion Capture & Simulation

Motion Capture & Simulation Motion Capture & Simulation Motion Capture Character Reconstructions Joint Angles Need 3 points to compute a rigid body coordinate frame 1 st point gives 3D translation, 2 nd point gives 2 angles, 3 rd

More information

Recursive Estimation of Motion and Planar Structure

Recursive Estimation of Motion and Planar Structure BU CS TR99-6, Dec., 999 (revised Mar., ). To appear in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR ). Recursive Estimation of Motion and Planar Structure Jonathan Alon and Stan Sclaroff

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribe: Haowei Liu LECTURE 16 Structure from Motion from Tracked Points 16.1. Introduction In the last lecture we learned how to track point features

More information

Epipolar Geometry and the Essential Matrix

Epipolar Geometry and the Essential Matrix Epipolar Geometry and the Essential Matrix Carlo Tomasi The epipolar geometry of a pair of cameras expresses the fundamental relationship between any two corresponding points in the two image planes, and

More information

A Summary of Projective Geometry

A Summary of Projective Geometry A Summary of Projective Geometry Copyright 22 Acuity Technologies Inc. In the last years a unified approach to creating D models from multiple images has been developed by Beardsley[],Hartley[4,5,9],Torr[,6]

More information

3D Model Acquisition by Tracking 2D Wireframes

3D Model Acquisition by Tracking 2D Wireframes 3D Model Acquisition by Tracking 2D Wireframes M. Brown, T. Drummond and R. Cipolla {96mab twd20 cipolla}@eng.cam.ac.uk Department of Engineering University of Cambridge Cambridge CB2 1PZ, UK Abstract

More information

Structure from Motion and Multi- view Geometry. Last lecture

Structure from Motion and Multi- view Geometry. Last lecture Structure from Motion and Multi- view Geometry Topics in Image-Based Modeling and Rendering CSE291 J00 Lecture 5 Last lecture S. J. Gortler, R. Grzeszczuk, R. Szeliski,M. F. Cohen The Lumigraph, SIGGRAPH,

More information

ActivityRepresentationUsing3DShapeModels

ActivityRepresentationUsing3DShapeModels ActivityRepresentationUsing3DShapeModels AmitK.Roy-Chowdhury RamaChellappa UmutAkdemir University of California University of Maryland University of Maryland Riverside, CA 9252 College Park, MD 274 College

More information

Feature Transfer and Matching in Disparate Stereo Views through the use of Plane Homographies

Feature Transfer and Matching in Disparate Stereo Views through the use of Plane Homographies Feature Transfer and Matching in Disparate Stereo Views through the use of Plane Homographies M. Lourakis, S. Tzurbakis, A. Argyros, S. Orphanoudakis Computer Vision and Robotics Lab (CVRL) Institute of

More information

International Journal of Advance Engineering and Research Development

International Journal of Advance Engineering and Research Development Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 11, November -2017 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Comparative

More information

MAPI Computer Vision. Multiple View Geometry

MAPI Computer Vision. Multiple View Geometry MAPI Computer Vision Multiple View Geometry Geometry o Multiple Views 2- and 3- view geometry p p Kpˆ [ K R t]p Geometry o Multiple Views 2- and 3- view geometry Epipolar Geometry The epipolar geometry

More information

Week 2: Two-View Geometry. Padua Summer 08 Frank Dellaert

Week 2: Two-View Geometry. Padua Summer 08 Frank Dellaert Week 2: Two-View Geometry Padua Summer 08 Frank Dellaert Mosaicking Outline 2D Transformation Hierarchy RANSAC Triangulation of 3D Points Cameras Triangulation via SVD Automatic Correspondence Essential

More information

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Javier Romero, Danica Kragic Ville Kyrki Antonis Argyros CAS-CVAP-CSC Dept. of Information Technology Institute of Computer Science KTH,

More information

C280, Computer Vision

C280, Computer Vision C280, Computer Vision Prof. Trevor Darrell trevor@eecs.berkeley.edu Lecture 11: Structure from Motion Roadmap Previous: Image formation, filtering, local features, (Texture) Tues: Feature-based Alignment

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

Expanding gait identification methods from straight to curved trajectories

Expanding gait identification methods from straight to curved trajectories Expanding gait identification methods from straight to curved trajectories Yumi Iwashita, Ryo Kurazume Kyushu University 744 Motooka Nishi-ku Fukuoka, Japan yumi@ieee.org Abstract Conventional methods

More information

Flexible Calibration of a Portable Structured Light System through Surface Plane

Flexible Calibration of a Portable Structured Light System through Surface Plane Vol. 34, No. 11 ACTA AUTOMATICA SINICA November, 2008 Flexible Calibration of a Portable Structured Light System through Surface Plane GAO Wei 1 WANG Liang 1 HU Zhan-Yi 1 Abstract For a portable structured

More information

Lecture 3: Camera Calibration, DLT, SVD

Lecture 3: Camera Calibration, DLT, SVD Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P

More information

Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter

Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter A. Strandlie and J. Wroldsen Gjøvik University College, Norway Outline Introduction A Gaussian-sum

More information

Multiple Views Geometry

Multiple Views Geometry Multiple Views Geometry Subhashis Banerjee Dept. Computer Science and Engineering IIT Delhi email: suban@cse.iitd.ac.in January 2, 28 Epipolar geometry Fundamental geometric relationship between two perspective

More information

CS 231A: Computer Vision (Winter 2018) Problem Set 2

CS 231A: Computer Vision (Winter 2018) Problem Set 2 CS 231A: Computer Vision (Winter 2018) Problem Set 2 Due Date: Feb 09 2018, 11:59pm Note: In this PS, using python2 is recommended, as the data files are dumped with python2. Using python3 might cause

More information

Integers & Absolute Value Properties of Addition Add Integers Subtract Integers. Add & Subtract Like Fractions Add & Subtract Unlike Fractions

Integers & Absolute Value Properties of Addition Add Integers Subtract Integers. Add & Subtract Like Fractions Add & Subtract Unlike Fractions Unit 1: Rational Numbers & Exponents M07.A-N & M08.A-N, M08.B-E Essential Questions Standards Content Skills Vocabulary What happens when you add, subtract, multiply and divide integers? What happens when

More information

Towards the completion of assignment 1

Towards the completion of assignment 1 Towards the completion of assignment 1 What to do for calibration What to do for point matching What to do for tracking What to do for GUI COMPSCI 773 Feature Point Detection Why study feature point detection?

More information

Combining Two-view Constraints For Motion Estimation

Combining Two-view Constraints For Motion Estimation ombining Two-view onstraints For Motion Estimation Venu Madhav Govindu Somewhere in India venu@narmada.org Abstract In this paper we describe two methods for estimating the motion parameters of an image

More information

An Overview of Matchmoving using Structure from Motion Methods

An Overview of Matchmoving using Structure from Motion Methods An Overview of Matchmoving using Structure from Motion Methods Kamyar Haji Allahverdi Pour Department of Computer Engineering Sharif University of Technology Tehran, Iran Email: allahverdi@ce.sharif.edu

More information

Identifying Car Model from Photographs

Identifying Car Model from Photographs Identifying Car Model from Photographs Fine grained Classification using 3D Reconstruction and 3D Shape Registration Xinheng Li davidxli@stanford.edu Abstract Fine grained classification from photographs

More information

Module 4F12: Computer Vision and Robotics Solutions to Examples Paper 2

Module 4F12: Computer Vision and Robotics Solutions to Examples Paper 2 Engineering Tripos Part IIB FOURTH YEAR Module 4F2: Computer Vision and Robotics Solutions to Examples Paper 2. Perspective projection and vanishing points (a) Consider a line in 3D space, defined in camera-centered

More information

Ground Plane Motion Parameter Estimation For Non Circular Paths

Ground Plane Motion Parameter Estimation For Non Circular Paths Ground Plane Motion Parameter Estimation For Non Circular Paths G.J.Ellwood Y.Zheng S.A.Billings Department of Automatic Control and Systems Engineering University of Sheffield, Sheffield, UK J.E.W.Mayhew

More information

Correspondence and Stereopsis. Original notes by W. Correa. Figures from [Forsyth & Ponce] and [Trucco & Verri]

Correspondence and Stereopsis. Original notes by W. Correa. Figures from [Forsyth & Ponce] and [Trucco & Verri] Correspondence and Stereopsis Original notes by W. Correa. Figures from [Forsyth & Ponce] and [Trucco & Verri] Introduction Disparity: Informally: difference between two pictures Allows us to gain a strong

More information

Lecture 6 Stereo Systems Multi-view geometry

Lecture 6 Stereo Systems Multi-view geometry Lecture 6 Stereo Systems Multi-view geometry Professor Silvio Savarese Computational Vision and Geometry Lab Silvio Savarese Lecture 6-5-Feb-4 Lecture 6 Stereo Systems Multi-view geometry Stereo systems

More information

Mathematics High School Geometry An understanding of the attributes and relationships of geometric objects can be applied in diverse contexts

Mathematics High School Geometry An understanding of the attributes and relationships of geometric objects can be applied in diverse contexts Mathematics High School Geometry An understanding of the attributes and relationships of geometric objects can be applied in diverse contexts interpreting a schematic drawing, estimating the amount of

More information

Parameterization. Michael S. Floater. November 10, 2011

Parameterization. Michael S. Floater. November 10, 2011 Parameterization Michael S. Floater November 10, 2011 Triangular meshes are often used to represent surfaces, at least initially, one reason being that meshes are relatively easy to generate from point

More information

Structure from Motion. Lecture-15

Structure from Motion. Lecture-15 Structure from Motion Lecture-15 Shape From X Recovery of 3D (shape) from one or two (2D images). Shape From X Stereo Motion Shading Photometric Stereo Texture Contours Silhouettes Defocus Applications

More information

Factorization Method Using Interpolated Feature Tracking via Projective Geometry

Factorization Method Using Interpolated Feature Tracking via Projective Geometry Factorization Method Using Interpolated Feature Tracking via Projective Geometry Hideo Saito, Shigeharu Kamijima Department of Information and Computer Science, Keio University Yokohama-City, 223-8522,

More information

55:148 Digital Image Processing Chapter 11 3D Vision, Geometry

55:148 Digital Image Processing Chapter 11 3D Vision, Geometry 55:148 Digital Image Processing Chapter 11 3D Vision, Geometry Topics: Basics of projective geometry Points and hyperplanes in projective space Homography Estimating homography from point correspondence

More information

Machine vision. Summary # 11: Stereo vision and epipolar geometry. u l = λx. v l = λy

Machine vision. Summary # 11: Stereo vision and epipolar geometry. u l = λx. v l = λy 1 Machine vision Summary # 11: Stereo vision and epipolar geometry STEREO VISION The goal of stereo vision is to use two cameras to capture 3D scenes. There are two important problems in stereo vision:

More information

Two-view geometry Computer Vision Spring 2018, Lecture 10

Two-view geometry Computer Vision Spring 2018, Lecture 10 Two-view geometry http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 10 Course announcements Homework 2 is due on February 23 rd. - Any questions about the homework? - How many of

More information

Basic Properties The Definition of Catalan Numbers

Basic Properties The Definition of Catalan Numbers 1 Basic Properties 1.1. The Definition of Catalan Numbers There are many equivalent ways to define Catalan numbers. In fact, the main focus of this monograph is the myriad combinatorial interpretations

More information

Stereo and Epipolar geometry

Stereo and Epipolar geometry Previously Image Primitives (feature points, lines, contours) Today: Stereo and Epipolar geometry How to match primitives between two (multiple) views) Goals: 3D reconstruction, recognition Jana Kosecka

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision

Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision Zhiyan Zhang 1, Wei Qian 1, Lei Pan 1 & Yanjun Li 1 1 University of Shanghai for Science and Technology, China

More information

Modeling 3D Human Poses from Uncalibrated Monocular Images

Modeling 3D Human Poses from Uncalibrated Monocular Images Modeling 3D Human Poses from Uncalibrated Monocular Images Xiaolin K. Wei Texas A&M University xwei@cse.tamu.edu Jinxiang Chai Texas A&M University jchai@cse.tamu.edu Abstract This paper introduces an

More information

This chapter explains two techniques which are frequently used throughout

This chapter explains two techniques which are frequently used throughout Chapter 2 Basic Techniques This chapter explains two techniques which are frequently used throughout this thesis. First, we will introduce the concept of particle filters. A particle filter is a recursive

More information

Principal Coordinate Clustering

Principal Coordinate Clustering Principal Coordinate Clustering Ali Sekmen, Akram Aldroubi, Ahmet Bugra Koku, Keaton Hamm Department of Computer Science, Tennessee State University Department of Mathematics, Vanderbilt University Department

More information

Epipolar Geometry in Stereo, Motion and Object Recognition

Epipolar Geometry in Stereo, Motion and Object Recognition Epipolar Geometry in Stereo, Motion and Object Recognition A Unified Approach by GangXu Department of Computer Science, Ritsumeikan University, Kusatsu, Japan and Zhengyou Zhang INRIA Sophia-Antipolis,

More information

Camera Calibration for Video See-Through Head-Mounted Display. Abstract. 1.0 Introduction. Mike Bajura July 7, 1993

Camera Calibration for Video See-Through Head-Mounted Display. Abstract. 1.0 Introduction. Mike Bajura July 7, 1993 Camera Calibration for Video See-Through Head-Mounted Display Mike Bajura July 7, 1993 Abstract This report describes a method for computing the parameters needed to model a television camera for video

More information

Edge and local feature detection - 2. Importance of edge detection in computer vision

Edge and local feature detection - 2. Importance of edge detection in computer vision Edge and local feature detection Gradient based edge detection Edge detection by function fitting Second derivative edge detectors Edge linking and the construction of the chain graph Edge and local feature

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

PART A Three-Dimensional Measurement with iwitness

PART A Three-Dimensional Measurement with iwitness PART A Three-Dimensional Measurement with iwitness A1. The Basic Process The iwitness software system enables a user to convert two-dimensional (2D) coordinate (x,y) information of feature points on an

More information

CS-9645 Introduction to Computer Vision Techniques Winter 2019

CS-9645 Introduction to Computer Vision Techniques Winter 2019 Table of Contents Projective Geometry... 1 Definitions...1 Axioms of Projective Geometry... Ideal Points...3 Geometric Interpretation... 3 Fundamental Transformations of Projective Geometry... 4 The D

More information

Recovering structure from a single view Pinhole perspective projection

Recovering structure from a single view Pinhole perspective projection EPIPOLAR GEOMETRY The slides are from several sources through James Hays (Brown); Silvio Savarese (U. of Michigan); Svetlana Lazebnik (U. Illinois); Bill Freeman and Antonio Torralba (MIT), including their

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who 1 in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Perceptual Grouping from Motion Cues Using Tensor Voting

Perceptual Grouping from Motion Cues Using Tensor Voting Perceptual Grouping from Motion Cues Using Tensor Voting 1. Research Team Project Leader: Graduate Students: Prof. Gérard Medioni, Computer Science Mircea Nicolescu, Changki Min 2. Statement of Project

More information

Uncertainties: Representation and Propagation & Line Extraction from Range data

Uncertainties: Representation and Propagation & Line Extraction from Range data 41 Uncertainties: Representation and Propagation & Line Extraction from Range data 42 Uncertainty Representation Section 4.1.3 of the book Sensing in the real world is always uncertain How can uncertainty

More information

Computer Vision I Name : CSE 252A, Fall 2012 Student ID : David Kriegman Assignment #1. (Due date: 10/23/2012) x P. = z

Computer Vision I Name : CSE 252A, Fall 2012 Student ID : David Kriegman   Assignment #1. (Due date: 10/23/2012) x P. = z Computer Vision I Name : CSE 252A, Fall 202 Student ID : David Kriegman E-Mail : Assignment (Due date: 0/23/202). Perspective Projection [2pts] Consider a perspective projection where a point = z y x P

More information

Capturing, Modeling, Rendering 3D Structures

Capturing, Modeling, Rendering 3D Structures Computer Vision Approach Capturing, Modeling, Rendering 3D Structures Calculate pixel correspondences and extract geometry Not robust Difficult to acquire illumination effects, e.g. specular highlights

More information