Articulated Structure from Motion through Ellipsoid Fitting


Int'l Conf. IP, Comp. Vision, and Pattern Recognition (IPCV'15)

Articulated Structure from Motion through Ellipsoid Fitting

Peter Boyi Zhang and Yeung Sam Hung
Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, China {byzhang,

Abstract - We present a new method to reconstruct non-rigid objects from orthographic projections based on the assumption of an articulated model. We introduce an ellipsoid property to identify points belonging to a rigid subset. This enables us to formulate the problem of motion segmentation as an ellipsoid fitting problem. The obtained rigid subsets are then linked as kinematic chains to constitute a 3D articulated structure. The method is practical, with computational complexity O(N), and is based mainly on linear least squares. We demonstrate the effectiveness of this method through experiments on real tracking data, motion capture data, and a challenging human dataset with missing data, in comparison with existing methods.

Keywords: non-rigid structure from motion, articulated structure, ellipsoid fitting, motion segmentation, orthographic camera

1 Introduction

The Non-Rigid Structure From Motion (NRSFM) problem, which aims at recovering dynamic 3D structure from a sequence of 2D image measurements, has attracted much attention over the years. As demonstrated in Figure 1, under affine projection the depth information for all feature points needs to be estimated in each frame of measurement. This is an ill-posed problem unless additional constraints (e.g. based on properties of the object) are introduced to regularize the estimation and achieve meaningful results. Bregler et al. [1] introduce a low-rank condition on the object, constraining the object's movement to be a mean shape plus some degrees of freedom. Akhter et al. [2] also propose a trajectory basis method based on this low-rank assumption.
Generally, low-rank methods are only effective when the object is relatively rigid and the movement is confined to be linear with few degrees of freedom. Rabaud and Belongie [3] tackle the problem assuming the shape of the object repeats itself at different times, and is thus captured in a few different frames. They identify these frames and recover the object in the same way as a rigid object.

In a real scenario, many objects fit an articulated model: they can be modeled as a few rigid subsets connected to each other. With this assumption, as shown in Figure 1, feature points can be grouped into rigid subsets. Points in the same subset are fixed relative to each other. Furthermore, the relative depth between subsets can be recovered based on knowledge of the linkage between subsets.

A major challenge of this problem is the grouping of feature points into rigid subsets, which amounts to motion segmentation. Tresadern and Reid [4] perform motion segmentation based on a rank condition. Yan and Pollefeys [5] project points to low-rank manifolds and group points sharing the same subspace. Ross et al. [6] employ a probabilistic model to divide and merge the subsets iteratively. Russell et al. [7] propose an energy-based approach to label points as belonging to different subsets. The rank condition is not strong enough to accurately classify points into different subsets, and the number of subsets has to be provided a priori as an input.

In this paper we show that, beyond the low-rank property, a group of points on a rigid structure also has the property of constituting an ellipsoid. The motion segmentation problem can thus be transformed into an ellipsoid model fitting problem. This helps achieve a more precise segmentation of feature points with low computational cost, and leads to better performance in 3D reconstruction. We describe the ellipsoid fitting method in section 2, and how to perform full 3D reconstruction in section 3.
We demonstrate the effectiveness of the method through experiments on challenging datasets in section 4. Some conclusions are drawn in section 5.

Figure 1. Without constraint, the feature points can have arbitrary depth. With the articulated assumption, the relative depths between feature points of the same rigid subset are fixed, and the relative depth between rigid subsets can be resolved by means of joint linkages. [8]

2 Motion segmentation

Motion segmentation is a critical step for recovering articulated objects. In this section, we describe the problem, reveal the ellipsoid property of rigid subsets, and propose an efficient method to perform motion segmentation through ellipsoid fitting.

2.1 Articulated object

For an articulated object consisting of N feature points measured over F frames by an orthographic camera, let K denote the number of rigid subsets (K is not known a priori); let W ∈ R^(2F×N) denote the measurements, and let W_f ∈ R^(2×N) denote the measurement of the f-th frame; let S_k ∈ R^(3×N_k) denote the 3D structure of the k-th subset, where N_k is the number of points in the k-th subset; and let R_fk ∈ R^(2×3) and t_fk ∈ R^2 denote the rotation and translation of the k-th subset in the f-th frame relative to a world coordinate frame. Letting W_fk denote the columns of W_f belonging to the k-th subset, we have

    W_fk = R_fk S_k + t_fk 1^T.  (1)

We denote the stacked quantities

    R_k = [R_1k; ...; R_Fk],  T_k = [t_1k 1^T; ...; t_Fk 1^T],  (2)

    W_k = [W_1k; ...; W_Fk] = R_k S_k + T_k.  (3)

Through motion segmentation, we aim to group the feature points into subsets so that each subset is rigid throughout all frames. To achieve that, firstly a property needs to be utilized to effectively classify the feature points into rigid subsets; secondly an efficient method needs to be proposed to apply the property to different groups of points. We describe the ellipsoid property in section 2.2, and the grouping method in section 2.3.

2.2 Ellipsoid fitting

The problem of grouping points into rigid subsets can be transformed into a problem of ellipsoid model fitting. In this section we reveal the ellipsoid property as a fundamental property of rigid structure under orthographic projection, and explain how ellipsoid fitting can help identify rigid subsets.

An ellipsoid in R^3 centered at the origin is defined as the set of points x satisfying

    x = r_1 s_1 u_1 + r_2 s_2 u_2 + r_3 s_3 u_3,  (4)

subject to

    r_1^2 + r_2^2 + r_3^2 = 1,  (5)

where u_1, u_2, u_3 is a set of orthonormal vectors specifying the axis-directions of the ellipsoid; s_1, s_2, s_3 are constants representing the lengths of the semi-axes; and r = (r_1, r_2, r_3) is a unit vector.

Let W_g denote the measurements of the g-th group of points. Subtracting the mean of each row of W_g from the entries of the same row, we obtain a registered measurement matrix W̃_g. Suppose this group is a rigid subset; then

    W̃_g = R̃ S̃_g,  (6)

where R̃ ∈ R^(2F×3) stacks the per-frame rotations and S̃_g is the registered (zero-mean) 3D structure. We show that the rows of W̃_g lie on an ellipsoid through the following analysis, and after that we describe how to perform ellipsoid fitting. Express S̃_g by SVD as

    S̃_g = U Σ V^T,  (7)

then

    W̃_g = (R̃ U) Σ V^T.  (8)

We absorb the orthogonal matrix U into R̃ and write R̂ = R̃ U. Let v_1^T, v_2^T, v_3^T be the first three rows of V^T, and s_1, s_2, s_3 be the diagonal elements of Σ. Denoting the entries of R̂ as r_ij, the i-th row w_i of W̃_g is

    w_i = r_i1 s_1 v_1^T + r_i2 s_2 v_2^T + r_i3 s_3 v_3^T,  (9)

with

    r_i1^2 + r_i2^2 + r_i3^2 = 1,  (10)

since each row of R̂ is a unit row of a rotation matrix. According to (4) and (5), the rows of W̃_g lie on a same ellipsoid in R^(N_g). Moreover, the two rows of R̂ belonging to the same frame satisfy

    r_(2f-1),1 r_(2f),1 + r_(2f-1),2 r_(2f),2 + r_(2f-1),3 r_(2f),3 = 0.  (11)

To fit an ellipsoid to W_g, we first project its row vectors to a 3D subspace. We register the rows of W_g to their centroid to get W̃_g. As illustrated in Figure 2, each row of W̃_g can be represented as a point in N_g-dimensional space. We then perform a rank-3 approximation on W̃_g, so that the points in N_g-dimensional space are projected to points in a 3D subspace, given by

    W̃_g ≈ Y B,  (12)

where Y ∈ R^(2F×3) and the rows of B ∈ R^(3×N_g) form an orthonormal basis of the 3D subspace. (12) can be expressed in the form of (9) subject to (10) if and only if there exists a positive definite matrix Q such that

    y_i Q y_i^T = 1 for every row y_i of Y.  (13)

This is equivalent to fitting an ellipsoid characterized by Q to the projected points. Since Q is positive definite, we may write

    Q = E Λ E^T,  (14)

where E is orthogonal and Λ is diagonal with positive entries. Then

    R̂ = Y E Λ^(1/2),  (15)

    W̃_g = R̂ (Λ^(-1/2) E^T B).  (16)

In our method, we fit the ellipsoid by least squares as in (13), and then estimate R̂ using (15). Although this method of first fitting a subspace and then an ellipsoid gives a sub-optimal solution compared to directly fitting an ellipsoid in R^(N_g), it greatly reduces the computational cost. The 3D shape of this subset can be recovered as

    Ŝ_g = Λ^(-1/2) E^T B.  (17)

We define the 2D reprojection error

    e_g = ||W̃_g - R̂ Ŝ_g||_F / ||W̃_g||_F  (18)

as a measurement of how much this group of points deviates from rigidity.

Figure 2. Ellipsoid fitting. Each row in subset 3 is represented as a point in 5D space. The 5D space is projected to its most significant 3D subspace, and an ellipsoid is fitted to the points.

2.3 Grouping of points as rigid subsets

Through the previous analysis, the problem of finding rigid subsets among the feature points can be transformed into a problem of finding appropriate ellipsoids parameterized by B and Q, so that each group of columns in W can be fitted to an ellipsoid as

    W̃_g ≈ Y B,  y_i Q y_i^T = 1.  (19)

To fit an ellipsoid to every possible group of points, there would be exponentially many groups to test. We propose a method to perform ellipsoid model fitting with only O(N) complexity. We first estimate the model by fitting an ellipsoid to a minimal seed, and then add inliers to the estimated model. A minimal seed contains 4 feature points, because registering each row by its mean reduces the rank of the measurement matrix by 1, and we need at least 3 points to fit an ellipsoid centered at the origin in a 3D subspace. In a real scenario points of a rigid subset are usually close to each other, so instead of randomly selecting groups of 4 points as seeds, we only consider points close to each other. This only requires O(N) ellipsoid fitting computations. The groups with ellipsoid fitting error below a threshold are chosen as seeds for rigid subsets.
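To make the fitting step concrete, the following sketch (our own illustrative code, not the authors' implementation; all names are hypothetical) registers one candidate group, projects its rows to their best rank-3 subspace, and solves (13) for the quadric Q by linear least squares. The algebraic fitting residual is used here as a simple rigidity measure; the paper's (18) uses the 2D reprojection error instead:

```python
import numpy as np

def ellipsoid_fit_residual(Wg):
    """Fit an origin-centered ellipsoid to the rows of a measurement
    block Wg (2F x Pg) of one candidate group.
    Returns the quadric Q (3x3), the subspace basis B (3 x Pg), the
    projected row coordinates Y (2F x 3), and the max fitting residual."""
    # Register each row to its mean (removes per-frame translation).
    Wt = Wg - Wg.mean(axis=1, keepdims=True)
    # Rank-3 approximation: project rows into their best 3D subspace.
    U, s, Vt = np.linalg.svd(Wt, full_matrices=False)
    Y = U[:, :3] * s[:3]        # 2F x 3 row coordinates
    B = Vt[:3]                  # orthonormal basis of the subspace
    # For a rigid group each row y satisfies y Q y^T = 1; solve for the
    # 6 free entries of the symmetric Q by linear least squares.
    y1, y2, y3 = Y[:, 0], Y[:, 1], Y[:, 2]
    A = np.column_stack([y1*y1, y2*y2, y3*y3, 2*y1*y2, 2*y1*y3, 2*y2*y3])
    q, *_ = np.linalg.lstsq(A, np.ones(len(Y)), rcond=None)
    Q = np.array([[q[0], q[3], q[4]],
                  [q[3], q[1], q[5]],
                  [q[4], q[5], q[2]]])
    residual = np.abs(A @ q - 1).max()
    return Q, B, Y, residual
```

For a noiseless rigid group the residual is numerically zero and Q comes out positive definite; for a non-rigid group the rows do not lie on a common ellipsoid and the residual grows, which is what makes the fit usable as a seed test.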
For a seed and the fitted model (B, Q), we evaluate the reprojection error e_j of each column j of W as in (18), and include the points with e_j below a threshold τ to expand the rigid subset. As the rigid subset expands, B and Q may be updated to minimize the error. In total, this process requires O(N) ellipsoid fittings and projections.

After we expand each rigid seed, we might obtain a number of rigid subsets overlapping with each other. This is because in a real scenario the object might not be a perfectly articulated object, and there may be ambiguity as to which subset the points at the boundary should belong to. We resolve this issue by first picking subsets with minimal overlap, and then assigning the points belonging to more than one rigid subset to the subset of minimal reprojection error. If some points are not grouped into any rigid subset, they tend to introduce large error. Either we can choose to discard them, or we can include each such point in the rigid subset for which it has minimal reprojection error at the final stage, after the recovery of motion and structure, to reduce their influence. The ability to distinguish outliers is an advantage of this algorithm.

As we shall show in our experiments, the range of choices of τ that provide an optimal result is wide. As τ increases from 0, the number of ungrouped points decreases, and the number of duplicate points (points occurring in more than one subset) increases. As τ reaches the value that well distinguishes the rigid subsets, the numbers of ungrouped and duplicate points are kept small and stable. The threshold τ should be small if the object can be well modeled as an articulated object; τ should be large if the object is not strictly articulated and/or there is much noise. We shall also show that the 2D reprojection error can be taken as an indicator of the 3D error, for there is a correlation between them.

3 Reconstruction of 3D structure

After segmenting the feature points into rigid subsets, we describe how to reconstruct the 3D structure in this section.
We connect the subsets through kinematic chains, eliminate mirror ambiguity based on the linkage between subsets, resolve the relative depth between subsets, and handle missing data.

3.1 Kinematic chain

Similar to Kirk et al. [9] and Yan and Pollefeys [5], we construct a graph recording the cost of connecting every two rigid subsets through joints (called the joint cost), and build kinematic chains by performing a minimum spanning tree search in the graph; however, we calculate the joint cost through a different method that only requires least squares computation.

If two rigid subsets A and B can be connected through a link, we should be able to find virtual points p_A and p_B attached to the coordinate frames of A and B, respectively, such that the distance between the trajectories of p_A and p_B is small (ideally zero). Placing the coordinate frame of A at the centroid of S_A, the translation t_fA can be recovered as the row mean of W_fA. Let c_A denote the coefficients combining the trajectories of the points of A into the trajectory of p_A, with the entries of c_A summing to one; the trajectory of p_A as observed in the measurements is

    w_f(p_A) = R_fA p_A + t_fA.  (20)

Substituting (6) into (20),

    w_f(p_A) = W_fA c_A.  (21)

In our model, the joint points p_A and p_B are close to each other, but may not strictly coincide with each other. This relaxed condition allows us to model objects in the physical world that are not perfectly articulated, e.g. the human body. By (21), the joint trajectory can be approximated directly from W, without recovery of R and S. The joint cost can be found through least squares minimization as

    d(A, B) = min_{c_A, c_B} ||W_A c_A - W_B c_B||_F  subject to 1^T c_A = 1^T c_B = 1.  (22)

We construct an undirected graph with the rigid subsets as nodes and the joint costs as edge weights. In this graph we perform a minimum spanning tree search to recover the connectivity among rigid subsets, and then kinematic chains from the connected rigid subsets.

3.2 Mirror ambiguity

R̂ and Ŝ_g can be recovered through (15) and (17). Yet a mirror ambiguity remains unresolved, since the mirrored pair obtained by negating the third column of R̂ and the third row of Ŝ_g is an equally acceptable solution. We resolve this ambiguity using the physical condition that two subsets linked by a joint generally have similar rotational transformations. For two subsets A and B linked by a joint, we align the rotations of B and of its mirror with those of A, and pick the camera motion with the smaller error. We arbitrarily pick one subset as reference and propagate the alignment to its neighbors, and so on.
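The joint-cost minimization of (22) is an equality-constrained linear least squares problem. A minimal sketch (our own illustrative code with hypothetical naming; the small ridge term is our addition for numerical stability, not part of the paper):

```python
import numpy as np

def joint_cost(Wa, Wb):
    """Joint cost between two rigid subsets, in the spirit of (22):
    find affine combinations of each subset's point tracks (weights
    summing to 1) whose 2D trajectories are as close as possible.
    Wa is 2F x Pa, Wb is 2F x Pb.  Returns (cost, ca, cb)."""
    Pa, Pb = Wa.shape[1], Wb.shape[1]
    n = Pa + Pb
    # Minimize ||Wa ca - Wb cb||^2  s.t.  1'ca = 1 and 1'cb = 1.
    A = np.hstack([Wa, -Wb])            # 2F x (Pa+Pb)
    C = np.zeros((2, n))                # the two affine constraints
    C[0, :Pa] = 1.0
    C[1, Pa:] = 1.0
    # KKT system for equality-constrained least squares; a tiny ridge
    # keeps it well-posed when A'A is rank deficient.
    K = np.zeros((n + 2, n + 2))
    K[:n, :n] = A.T @ A + 1e-9 * np.eye(n)
    K[:n, n:] = C.T
    K[n:, :n] = C
    rhs = np.concatenate([np.zeros(n), np.ones(2)])
    c = np.linalg.solve(K, rhs)[:n]
    ca, cb = c[:Pa], c[Pa:]
    cost = np.linalg.norm(Wa @ ca - Wb @ cb)
    return cost, ca, cb
```

Two subsets that truly share a joint admit weight vectors reproducing the common joint trajectory from either side, so the cost is near zero for linked subsets and large otherwise, which is what the minimum spanning tree search exploits.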
As a result, we will only have one overall mirror ambiguity left to be handled manually.

3.3 3D structure

As we have recovered R̂ and Ŝ, the exact joint location in the coordinate frame of each subset can be reevaluated by applying (20) and (22) to the subsets linked by a joint. Now the only parameter that remains to be estimated is the depth, the third element of the translation. For a pair of linked subsets A and B, we set the relative depth between A and B in each frame to minimize the joint distance; assigning one subset as the reference, the depth of its neighboring subsets can be solved through vector addition as shown in Figure 3.

Figure 3. Two subsets linked by joints. The circles represent recovered joint locations. (Best viewed in color.)

3.4 Missing data

Missing data does not present any problem and can be handled in a natural way in our method. For points that are visible in an incomplete set of frames, we only use the information in the observable frames to perform motion segmentation for these points. Once the points are grouped into rigid subsets, we can estimate the rigid substructure and its transformation in all frames.

4 Experimental Results

We evaluate our method on both motion capture data and real tracking data. The motion capture datasets include 3D ground truth, providing a foundation for quantitative measurement of the accuracy of the algorithm. We evaluate the 2D reprojection error against the measurement matrix, and the 3D error against the 3D ground truth. The 2D reprojection error is defined as the Frobenius norm of the difference between the 2D measurements and the 2D reprojection, divided by the Frobenius norm of the 2D measurements. The 3D error is defined as the Frobenius norm of the difference between the 3D ground truth and the reconstructed 3D points, divided by the Frobenius norm of the 3D ground truth. We evaluate our method qualitatively on the real scenes two cranes and toy truck from [5]. We also perform 3D reconstruction on toy truck. The robustness of our method in the presence of missing data is demonstrated through the motion capture data skin.
Depending on its size, each of these datasets takes from two seconds to a few minutes to process on an Intel i5 3.10 GHz PC.
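The two error measures defined above can be computed directly; the following is a short sketch (our own illustrative code, assuming normalization of the 2D error by the Frobenius norm of the measurements, and per-frame centroid alignment for the 3D error as described for Table 1):

```python
import numpy as np

def reprojection_error_2d(W, W_hat):
    """Relative 2D reprojection error: ||W - W_hat||_F / ||W||_F."""
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)

def error_3d(S_gt, S_rec):
    """Relative 3D error after shifting each frame of the ground truth
    and the reconstruction to its centroid.  Inputs are F x 3 x P."""
    S_gt = S_gt - S_gt.mean(axis=2, keepdims=True)
    S_rec = S_rec - S_rec.mean(axis=2, keepdims=True)
    return np.linalg.norm(S_gt - S_rec) / np.linalg.norm(S_gt)
```

The per-frame centering makes the 3D error invariant to the unresolvable global translation of each frame, so only shape and relative depth are scored.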

4.1 Motion capture data

The stepstool dataset, taken from subject 40, trial 6 of the Carnegie Mellon University Motion Capture Database [10], records a person who climbs, steps over, sits on, and jumps over a stepstool. The camera is fixed while the person performs substantial movements, making it a very challenging NRSFM problem. Before testing our algorithm, we downsampled the number of frames by a factor of 6 and removed the duplicate point tracks, resulting in 1097 frames and 149 points. From the 3D trajectories of all the tracks, we retain two of the coordinate axes as the 2D measurements. The dataset dance from [2] is a standard NRSFM dataset evaluated by many methods; it has 264 frames and 75 points. The camera rotates around the person horizontally for about 720 degrees throughout the frames. The threshold τ is set to 0.3% and 2.2% for stepstool and dance, respectively.

Table 1. 3D error.

             Trajectory basis   Kernel    Ellipsoid fitting
Stepstool    22.70%             20.73%    1.60%
Dance        18.99%             16.90%    8.85%

We list the results in comparison with the trajectory basis method [2] and the kernel method by Gotardo and Martinez [11] in Table 1. We calculate the 3D error after shifting each frame of the 3D ground truth to the centroid. Our method reduces the 3D error by more than an order of magnitude in stepstool, and by a factor of two in dance. This shows the limitation of methods based on low-rank assumptions for reconstructing highly flexible objects. Figure 4 (a) shows the effect of varying τ. The τ leading to the optimal result is chosen where the numbers of ungrouped and duplicate points are small, the error is robust to changes of τ, and the 3D error correlates with the 2D error. Figure 4 (b) and (c) show the recovered kinematic chains. We observe that as few as 4 points are enough to determine a rigid subset. For comparison, we include the result reported by Ross et al. [6] on stepstool to the right.
Our method results in 20 subsets, distinguishing the chest, the waist and the hip, and also the shoulder from the upper arm. Although [6] use 3D data instead of 2D measurements as input, they can only identify 15 subsets. Moreover, we recover the joints at more reasonable positions. Figure 7 shows 3D stepstool results in different frames.

4.2 Real tracking data

The data two cranes has 30 frames and 94 points. One crane rotates relative to the camera, while the other crane has two moving parts but little camera motion. As shown in Figure 5, we segment the data into 3 parts, while Fayad et al. [12] further segment one more part on each crane, at least one of which appears to be redundant. Compared with the result of Yan and Pollefeys [5], where two points are misclassified as belonging to the other crane, our result is also better. Because there is almost no camera motion to provide depth information, it is not possible to perform 3D reconstruction. Yet this experiment shows that our method is capable of separating the rigid subsets as long as there is relative motion between them.

Figure 4. (a) Effect of varying τ. Recovered kinematic structure of (b) dance and (c) stepstool, in comparison with the result by Ross et al. [6]. (Best viewed in color. See Figure 7 for the meaning of symbols and coloring.)

Figure 5. Experiments on two cranes (frames 1 and 30). From top to bottom are the results by our method, Fayad et al. [12], and Yan and Pollefeys [5].

The data toy truck has 60 frames and 83 points. The truck rotates slightly with respect to the camera, and its shovel moves up and down. Figure 6 shows that our method accurately differentiates the shovel from the rest of the truck. We also perform 3D reconstruction. As there is no ground truth for this type of real data, the reconstruction can only be assessed qualitatively. Figure 6 shows textured 3D views of frame 1 with the shovel down and of frame 50 with the shovel up from a different angle, which look reasonable.

Figure 6. Experiments on toy truck (frames 1 and 50). The first row shows the image and segmentation result; the second shows the 3D reconstruction result.

4.3 Data with missing observations

The skin dataset from [13] depicts a person flexing his muscles. Approximately 350 markers are placed on the subject to capture the subtle movement of human skin. But since the motion capture system sometimes loses track of the markers because of occlusion, 467 tracks are captured, some full and some partial. Thus we have 97 frames and 467 points. In total 31.75% of the data are missing in the measurements. The camera motion is also relatively small (about 45 degrees), making this very challenging data for the NRSFM problem. We process this data with τ set from 0.5% to 0.9%, resulting in 2D reprojection errors ranging from 0.60% to 1.02%, and 3D errors ranging from 2.85% to 3.94%. Figure 7 shows the segmentation and reconstruction with τ set to 0.8%. The left foot exhibits larger 3D error than the other rigid subsets because its motion relative to the camera is very small. Comparatively, the reconstruction by Fayad et al. [13] from the same dataset, based on the same input and the same measure of error, results in a 3D error of 7.13%. Our 3D error is about half of theirs. Our segmentation further divides the upper body and the arm, capturing the subtle movement of the skin.

Figure 7. Results on datasets stepstool (frames 81 climb, 185 sit, 387 stride, 456 pose, 643 jump; scene setup, camera view, side view) and skin (frames 16, 32, 52, 68, 92; scene setup [13], camera view, side view). The reconstructed points are colored according to different subsets, the gray circles represent ground truth, the colored circles are joint locations, and the black lines are recovered kinematic chains. (Best viewed in color.)

5 Conclusion

We have presented a method to systematically reconstruct dynamic 3D structure from orthographic 2D measurements based on the articulated model assumption. We introduce the ellipsoid property as a property that effectively distinguishes feature points belonging to different rigid subsets. This enables us to convert the motion segmentation problem into ellipsoid model fitting, and to propose a practical method with O(N) computational complexity. Then, joint constraints are used to build kinematic chains, and 3D articulated structures are recovered. The method is proven effective on both standard real tracking data and challenging human motion capture datasets with missing data, where the proposed method yields better motion segmentation and significant improvements in both 2D and 3D errors compared to existing methods.

6 Acknowledgements

The work described in this paper was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKU E) and the CRCG of the University of Hong Kong.
We thank João Fayad for his kindness in sending us the datasets necessary to carry out the experiments.

7 References

[1] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[2] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Nonrigid structure from motion in trajectory space. In Neural Information Processing Systems, 2008.
[3] V. Rabaud and S. Belongie. Re-thinking non-rigid structure from motion. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[4] P. Tresadern and I. Reid. Articulated structure from motion by factorization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[5] J. Yan and M. Pollefeys. A factorization-based approach for articulated non-rigid shape, motion and kinematic chain recovery from video. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[6] D. Ross, D. Tarlow, and R. Zemel. Learning articulated structure and motion. International Journal of Computer Vision, 88(2).
[7] C. Russell, J. Fayad, and L. Agapito. Energy based multiple model fitting for non-rigid structure from motion. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[8] Robot arm model.
[9] A. Kirk, J. O'Brien, and D. Forsyth. Skeletal parameter estimation from optical motion capture data. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[10] Carnegie Mellon University motion capture database.
[11] P. F. U. Gotardo and A. M. Martinez. Kernel non-rigid structure from motion. In Proceedings of IEEE International Conference on Computer Vision.
[12] J. Fayad, C. Russell, and L. Agapito. Automated articulated structure and 3D shape recovery from point correspondences. In Proceedings of IEEE International Conference on Computer Vision.
[13] S. I. Park and J. K. Hodgins. Capturing and animating skin deformation in human motion. In Proceedings of the ACM SIGGRAPH Conference on Computer Graphics.


More information

Occlusion-Aware Reconstruction and Manipulation of 3D Articulated Objects

Occlusion-Aware Reconstruction and Manipulation of 3D Articulated Objects Occlusion-Aware Reconstruction and Manipulation of 3D Articulated Objects Xiaoxia Huang, Ian Walker, and Stan Birchfield Department of Electrical and Computer Engineering Clemson University, Clemson, SC

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

CS 664 Structure and Motion. Daniel Huttenlocher

CS 664 Structure and Motion. Daniel Huttenlocher CS 664 Structure and Motion Daniel Huttenlocher Determining 3D Structure Consider set of 3D points X j seen by set of cameras with projection matrices P i Given only image coordinates x ij of each point

More information

Partial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems

Partial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems Partial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems Abstract In this paper we present a method for mirror shape recovery and partial calibration for non-central catadioptric

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information

Nonrigid Structure from Motion in Trajectory Space

Nonrigid Structure from Motion in Trajectory Space Nonrigid Structure from Motion in Trajectory Space Ijaz Akhter LUMS School of Science and Engineering Lahore, Pakistan akhter@lumsedupk Sohaib Khan LUMS School of Science and Engineering Lahore, Pakistan

More information

Perception and Action using Multilinear Forms

Perception and Action using Multilinear Forms Perception and Action using Multilinear Forms Anders Heyden, Gunnar Sparr, Kalle Åström Dept of Mathematics, Lund University Box 118, S-221 00 Lund, Sweden email: {heyden,gunnar,kalle}@maths.lth.se Abstract

More information

PERFORMANCE CAPTURE FROM SPARSE MULTI-VIEW VIDEO

PERFORMANCE CAPTURE FROM SPARSE MULTI-VIEW VIDEO Stefan Krauß, Juliane Hüttl SE, SoSe 2011, HU-Berlin PERFORMANCE CAPTURE FROM SPARSE MULTI-VIEW VIDEO 1 Uses of Motion/Performance Capture movies games, virtual environments biomechanics, sports science,

More information

The Template Update Problem

The Template Update Problem The Template Update Problem Iain Matthews, Takahiro Ishikawa, and Simon Baker The Robotics Institute Carnegie Mellon University Abstract Template tracking dates back to the 1981 Lucas-Kanade algorithm.

More information

Structure from Motion. Lecture-15

Structure from Motion. Lecture-15 Structure from Motion Lecture-15 Shape From X Recovery of 3D (shape) from one or two (2D images). Shape From X Stereo Motion Shading Photometric Stereo Texture Contours Silhouettes Defocus Applications

More information

Sequential Non-Rigid Structure-from-Motion with the 3D-Implicit Low-Rank Shape Model

Sequential Non-Rigid Structure-from-Motion with the 3D-Implicit Low-Rank Shape Model Sequential Non-Rigid Structure-from-Motion with the 3D-Implicit Low-Rank Shape Model Marco Paladini 1, Adrien Bartoli 2, and Lourdes Agapito 1 1 Queen Mary University of London, Mile End Road, E1 4NS London,

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Non-Rigid Structure from Motion with Complementary Rank-3 Spaces

Non-Rigid Structure from Motion with Complementary Rank-3 Spaces Non-Rigid Structure from Motion with Complementary Rank-3 Spaces Paulo FU Gotardo and Aleix M Martinez Department of Electrical and Computer Engineering The Ohio State University, Columbus, OH 43210, USA

More information

Affine-invariant shape matching and recognition under partial occlusion

Affine-invariant shape matching and recognition under partial occlusion Title Affine-invariant shape matching and recognition under partial occlusion Author(s) Mai, F; Chang, CQ; Hung, YS Citation The 17th IEEE International Conference on Image Processing (ICIP 2010), Hong

More information

Visual Odometry. Features, Tracking, Essential Matrix, and RANSAC. Stephan Weiss Computer Vision Group NASA-JPL / CalTech

Visual Odometry. Features, Tracking, Essential Matrix, and RANSAC. Stephan Weiss Computer Vision Group NASA-JPL / CalTech Visual Odometry Features, Tracking, Essential Matrix, and RANSAC Stephan Weiss Computer Vision Group NASA-JPL / CalTech Stephan.Weiss@ieee.org (c) 2013. Government sponsorship acknowledged. Outline The

More information

Non-Rigid Shape and Motion Recovery: Degenerate Deformations

Non-Rigid Shape and Motion Recovery: Degenerate Deformations Non-Rigid Shape and Motion Recovery: Degenerate Deformations Jing Xiao Takeo Kanade The Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 {jxiao, tk}@cscmuedu Abstract This paper studies

More information

Registration of Dynamic Range Images

Registration of Dynamic Range Images Registration of Dynamic Range Images Tan-Chi Ho 1,2 Jung-Hong Chuang 1 Wen-Wei Lin 2 Song-Sun Lin 2 1 Department of Computer Science National Chiao-Tung University 2 Department of Applied Mathematics National

More information

CS 231A Computer Vision (Winter 2014) Problem Set 3

CS 231A Computer Vision (Winter 2014) Problem Set 3 CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition

More information

Two-View Geometry (Course 23, Lecture D)

Two-View Geometry (Course 23, Lecture D) Two-View Geometry (Course 23, Lecture D) Jana Kosecka Department of Computer Science George Mason University http://www.cs.gmu.edu/~kosecka General Formulation Given two views of the scene recover the

More information

EECS 442: Final Project

EECS 442: Final Project EECS 442: Final Project Structure From Motion Kevin Choi Robotics Ismail El Houcheimi Robotics Yih-Jye Jeffrey Hsu Robotics Abstract In this paper, we summarize the method, and results of our projective

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribe: Haowei Liu LECTURE 16 Structure from Motion from Tracked Points 16.1. Introduction In the last lecture we learned how to track point features

More information

Local qualitative shape from stereo. without detailed correspondence. Extended Abstract. Shimon Edelman. Internet:

Local qualitative shape from stereo. without detailed correspondence. Extended Abstract. Shimon Edelman. Internet: Local qualitative shape from stereo without detailed correspondence Extended Abstract Shimon Edelman Center for Biological Information Processing MIT E25-201, Cambridge MA 02139 Internet: edelman@ai.mit.edu

More information

A Factorization Method for Structure from Planar Motion

A Factorization Method for Structure from Planar Motion A Factorization Method for Structure from Planar Motion Jian Li and Rama Chellappa Center for Automation Research (CfAR) and Department of Electrical and Computer Engineering University of Maryland, College

More information

Model-Based Human Motion Capture from Monocular Video Sequences

Model-Based Human Motion Capture from Monocular Video Sequences Model-Based Human Motion Capture from Monocular Video Sequences Jihun Park 1, Sangho Park 2, and J.K. Aggarwal 2 1 Department of Computer Engineering Hongik University Seoul, Korea jhpark@hongik.ac.kr

More information

10/11/07 1. Motion Control (wheeled robots) Representing Robot Position ( ) ( ) [ ] T

10/11/07 1. Motion Control (wheeled robots) Representing Robot Position ( ) ( ) [ ] T 3 3 Motion Control (wheeled robots) Introduction: Mobile Robot Kinematics Requirements for Motion Control Kinematic / dynamic model of the robot Model of the interaction between the wheel and the ground

More information

Learning a Manifold as an Atlas Supplementary Material

Learning a Manifold as an Atlas Supplementary Material Learning a Manifold as an Atlas Supplementary Material Nikolaos Pitelis Chris Russell School of EECS, Queen Mary, University of London [nikolaos.pitelis,chrisr,lourdes]@eecs.qmul.ac.uk Lourdes Agapito

More information

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation Obviously, this is a very slow process and not suitable for dynamic scenes. To speed things up, we can use a laser that projects a vertical line of light onto the scene. This laser rotates around its vertical

More information

CS 231A: Computer Vision (Winter 2018) Problem Set 2

CS 231A: Computer Vision (Winter 2018) Problem Set 2 CS 231A: Computer Vision (Winter 2018) Problem Set 2 Due Date: Feb 09 2018, 11:59pm Note: In this PS, using python2 is recommended, as the data files are dumped with python2. Using python3 might cause

More information

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation , pp.162-167 http://dx.doi.org/10.14257/astl.2016.138.33 A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation Liqiang Hu, Chaofeng He Shijiazhuang Tiedao University,

More information

Fast, Unconstrained Camera Motion Estimation from Stereo without Tracking and Robust Statistics

Fast, Unconstrained Camera Motion Estimation from Stereo without Tracking and Robust Statistics Fast, Unconstrained Camera Motion Estimation from Stereo without Tracking and Robust Statistics Heiko Hirschmüller, Peter R. Innocent and Jon M. Garibaldi Centre for Computational Intelligence, De Montfort

More information

On-line, Incremental Learning of a Robust Active Shape Model

On-line, Incremental Learning of a Robust Active Shape Model On-line, Incremental Learning of a Robust Active Shape Model Michael Fussenegger 1, Peter M. Roth 2, Horst Bischof 2, Axel Pinz 1 1 Institute of Electrical Measurement and Measurement Signal Processing

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Interpolation and extrapolation of motion capture data

Interpolation and extrapolation of motion capture data Interpolation and extrapolation of motion capture data Kiyoshi Hoshino Biological Cybernetics Lab, University of the Ryukyus and PRESTO-SORST, Japan Science and Technology Corporation Nishihara, Okinawa

More information

Occlusion Robust Multi-Camera Face Tracking

Occlusion Robust Multi-Camera Face Tracking Occlusion Robust Multi-Camera Face Tracking Josh Harguess, Changbo Hu, J. K. Aggarwal Computer & Vision Research Center / Department of ECE The University of Texas at Austin harguess@utexas.edu, changbo.hu@gmail.com,

More information

Sequential Non-Rigid Structure-from-Motion with the 3D-Implicit Low-Rank Shape Model

Sequential Non-Rigid Structure-from-Motion with the 3D-Implicit Low-Rank Shape Model Sequential Non-Rigid Structure-from-Motion with the 3D-Implicit Low-Rank Shape Model Marco Paladini 1, Adrien Bartoli 2, and Lourdes Agapito 1 1 Queen Mary University of London, Mile End Road, E1 4NS London,

More information

Measurement of 3D Foot Shape Deformation in Motion

Measurement of 3D Foot Shape Deformation in Motion Measurement of 3D Foot Shape Deformation in Motion Makoto Kimura Masaaki Mochimaru Takeo Kanade Digital Human Research Center National Institute of Advanced Industrial Science and Technology, Japan The

More information

CS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching

CS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix

More information

Epipolar Geometry in Stereo, Motion and Object Recognition

Epipolar Geometry in Stereo, Motion and Object Recognition Epipolar Geometry in Stereo, Motion and Object Recognition A Unified Approach by GangXu Department of Computer Science, Ritsumeikan University, Kusatsu, Japan and Zhengyou Zhang INRIA Sophia-Antipolis,

More information

Research Subject. Dynamics Computation and Behavior Capture of Human Figures (Nakamura Group)

Research Subject. Dynamics Computation and Behavior Capture of Human Figures (Nakamura Group) Research Subject Dynamics Computation and Behavior Capture of Human Figures (Nakamura Group) (1) Goal and summary Introduction Humanoid has less actuators than its movable degrees of freedom (DOF) which

More information

Hand-Eye Calibration from Image Derivatives

Hand-Eye Calibration from Image Derivatives Hand-Eye Calibration from Image Derivatives Abstract In this paper it is shown how to perform hand-eye calibration using only the normal flow field and knowledge about the motion of the hand. The proposed

More information

Image Coding with Active Appearance Models

Image Coding with Active Appearance Models Image Coding with Active Appearance Models Simon Baker, Iain Matthews, and Jeff Schneider CMU-RI-TR-03-13 The Robotics Institute Carnegie Mellon University Abstract Image coding is the task of representing

More information

Recognition Rate. 90 S 90 W 90 R Segment Length T

Recognition Rate. 90 S 90 W 90 R Segment Length T Human Action Recognition By Sequence of Movelet Codewords Xiaolin Feng y Pietro Perona yz y California Institute of Technology, 36-93, Pasadena, CA 925, USA z Universit a dipadova, Italy fxlfeng,peronag@vision.caltech.edu

More information

BIL Computer Vision Apr 16, 2014

BIL Computer Vision Apr 16, 2014 BIL 719 - Computer Vision Apr 16, 2014 Binocular Stereo (cont d.), Structure from Motion Aykut Erdem Dept. of Computer Engineering Hacettepe University Slide credit: S. Lazebnik Basic stereo matching algorithm

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Abstract We present a system which automatically generates a 3D face model from a single frontal image of a face. Our system consists of two component

Abstract We present a system which automatically generates a 3D face model from a single frontal image of a face. Our system consists of two component A Fully Automatic System To Model Faces From a Single Image Zicheng Liu Microsoft Research August 2003 Technical Report MSR-TR-2003-55 Microsoft Research Microsoft Corporation One Microsoft Way Redmond,

More information

Computer Vision in a Non-Rigid World

Computer Vision in a Non-Rigid World 2 nd Tutorial on Computer Vision in a Non-Rigid World Dr. Lourdes Agapito Prof. Adrien Bartoli Dr. Alessio Del Bue Non-rigid Structure from Motion Non-rigid structure from motion To extract non-rigid 3D

More information

Structure from Motion

Structure from Motion Structure from Motion Lecture-13 Moving Light Display 1 Shape from Motion Problem Given optical flow or point correspondences, compute 3-D motion (translation and rotation) and shape (depth). 2 S. Ullman

More information

Non-rigid stereo factorization

Non-rigid stereo factorization [Journal Name], [Volumn Number], 1 15 ([Volumn Year]) c [Volumn Year] Kluwer Academic Publishers, Boston Manufactured in The Netherlands Non-rigid stereo factorization AESSIO DE BUE AND OURDES AGAPITO

More information

Multi-stable Perception. Necker Cube

Multi-stable Perception. Necker Cube Multi-stable Perception Necker Cube Spinning dancer illusion, Nobuyuki Kayahara Multiple view geometry Stereo vision Epipolar geometry Lowe Hartley and Zisserman Depth map extraction Essential matrix

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Animation. CS 465 Lecture 22

Animation. CS 465 Lecture 22 Animation CS 465 Lecture 22 Animation Industry production process leading up to animation What animation is How animation works (very generally) Artistic process of animation Further topics in how it works

More information

International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXIV-5/W10

International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXIV-5/W10 BUNDLE ADJUSTMENT FOR MARKERLESS BODY TRACKING IN MONOCULAR VIDEO SEQUENCES Ali Shahrokni, Vincent Lepetit, Pascal Fua Computer Vision Lab, Swiss Federal Institute of Technology (EPFL) ali.shahrokni,vincent.lepetit,pascal.fua@epfl.ch

More information

Human Upper Body Pose Estimation in Static Images

Human Upper Body Pose Estimation in Static Images 1. Research Team Human Upper Body Pose Estimation in Static Images Project Leader: Graduate Students: Prof. Isaac Cohen, Computer Science Mun Wai Lee 2. Statement of Project Goals This goal of this project

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

Introduction to Computer Vision

Introduction to Computer Vision Introduction to Computer Vision Michael J. Black Nov 2009 Perspective projection and affine motion Goals Today Perspective projection 3D motion Wed Projects Friday Regularization and robust statistics

More information

Detection and recognition of moving objects using statistical motion detection and Fourier descriptors

Detection and recognition of moving objects using statistical motion detection and Fourier descriptors Detection and recognition of moving objects using statistical motion detection and Fourier descriptors Daniel Toth and Til Aach Institute for Signal Processing, University of Luebeck, Germany toth@isip.uni-luebeck.de

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

On the Dimensionality of Deformable Face Models

On the Dimensionality of Deformable Face Models On the Dimensionality of Deformable Face Models CMU-RI-TR-06-12 Iain Matthews, Jing Xiao, and Simon Baker The Robotics Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Abstract

More information

Rigging / Skinning. based on Taku Komura, Jehee Lee and Charles B.Own's slides

Rigging / Skinning. based on Taku Komura, Jehee Lee and Charles B.Own's slides Rigging / Skinning based on Taku Komura, Jehee Lee and Charles B.Own's slides Skeletal Animation Victoria 2 CSE 872 Dr. Charles B. Owen Advanced Computer Graphics Skinning http://www.youtube.com/watch?

More information

2D vs. 3D Deformable Face Models: Representational Power, Construction, and Real-Time Fitting

2D vs. 3D Deformable Face Models: Representational Power, Construction, and Real-Time Fitting 2D vs. 3D Deformable Face Models: Representational Power, Construction, and Real-Time Fitting Iain Matthews, Jing Xiao, and Simon Baker The Robotics Institute, Carnegie Mellon University Epsom PAL, Epsom

More information

Factorization Method Using Interpolated Feature Tracking via Projective Geometry

Factorization Method Using Interpolated Feature Tracking via Projective Geometry Factorization Method Using Interpolated Feature Tracking via Projective Geometry Hideo Saito, Shigeharu Kamijima Department of Information and Computer Science, Keio University Yokohama-City, 223-8522,

More information

Grasp Recognition using a 3D Articulated Model and Infrared Images

Grasp Recognition using a 3D Articulated Model and Infrared Images Grasp Recognition using a 3D Articulated Model and Infrared Images Koichi Ogawara Institute of Industrial Science, Univ. of Tokyo, Tokyo, Japan Jun Takamatsu Institute of Industrial Science, Univ. of Tokyo,

More information

Interactive Computer Graphics

Interactive Computer Graphics Interactive Computer Graphics Lecture 18 Kinematics and Animation Interactive Graphics Lecture 18: Slide 1 Animation of 3D models In the early days physical models were altered frame by frame to create

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

Generating Different Realistic Humanoid Motion

Generating Different Realistic Humanoid Motion Generating Different Realistic Humanoid Motion Zhenbo Li,2,3, Yu Deng,2,3, and Hua Li,2,3 Key Lab. of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing

More information

COMPUTER VISION > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE

COMPUTER VISION > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE COMPUTER VISION 2017-2018 > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE OUTLINE Optical flow Lucas-Kanade Horn-Schunck Applications of optical flow Optical flow tracking Histograms of oriented flow Assignment

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Motion. 1 Introduction. 2 Optical Flow. Sohaib A Khan. 2.1 Brightness Constancy Equation

Motion. 1 Introduction. 2 Optical Flow. Sohaib A Khan. 2.1 Brightness Constancy Equation Motion Sohaib A Khan 1 Introduction So far, we have dealing with single images of a static scene taken by a fixed camera. Here we will deal with sequence of images taken at different time intervals. Motion

More information

MOTION SEGMENTATION BASED ON INDEPENDENT SUBSPACE ANALYSIS

MOTION SEGMENTATION BASED ON INDEPENDENT SUBSPACE ANALYSIS MOTION SEGMENTATION BASED ON INDEPENDENT SUBSPACE ANALYSIS Zhimin Fan, Jie Zhou and Ying Wu Department of Automation, Tsinghua University, Beijing 100084, China Department of Electrical and Computer Engineering,

More information