Nonlinear Mean Shift for Robust Pose Estimation


Raghav Subbarao (ECE Department, Rutgers University, Piscataway, NJ 08854)
Yakup Genc (Real-time Vision and Modeling Department, Siemens Corporate Research, Princeton, NJ 08540)
Peter Meer (ECE Department, Rutgers University, Piscataway, NJ 08854)

Abstract

We propose a new robust estimator for camera pose estimation based on a recently developed nonlinear mean shift algorithm. This allows us to treat pose estimation as a clustering problem in the presence of outliers. We compare our method to RANSAC, which is the standard robust estimator for computer vision problems, and show that under fairly general assumptions our method is provably better than RANSAC. Synthetic and real examples are provided to support our claims.

1. Introduction

Real-time estimation of camera pose is an important problem in computer vision. Pose estimation along with scene structure estimation is known as the Structure-From-Motion (SFM) problem, which is the central goal of vision. It is widely accepted that once good estimates of the structure and motion are known, they can be improved using offline methods like bundle adjustment [19]. However, to get a starting point, a system needs to account for both noise and gross errors which do not satisfy the geometric constraints being enforced. Such errors are known as outliers.

Pose estimation is also a part of other applications such as augmented reality (AR). For AR only the pose of the camera is needed, although some structure may also be estimated. The pose is required in real time, so offline methods such as bundle adjustment are not applicable here. Random Sample Consensus (RANSAC) and its variations, which follow a hypothesise-and-test procedure, are the standard way of handling outliers in SFM. In this paper we propose a new robust estimator for camera pose estimation.
The estimator is based on the nonlinear mean shift algorithm of [15, 20], applied to the Special Euclidean Group, which is the set of all rigid body motions in 3D and is equivalent to the set of all camera poses. We show theoretically and experimentally that our method requires fewer hypotheses than any hypothesise-and-test algorithm for the same level of performance.

We discuss some of the previous work related to our approach in Section 2. In Section 3 we introduce the nonlinear mean shift algorithm. In Section 4 we develop a robust pose estimator based on this algorithm and outline a proof of why we expect the mean shift based estimator to be better than RANSAC. Finally, in Section 5 we present the results of experiments on synthetic and real data sets.

2. Previous Work

Classical methods reconstruct the scene by using correspondences across images to estimate the epipolar geometry between pairs of frames, or the trifocal tensor for three frames. These reconstructions are then stitched together into a single frame [14]. The Euclidean equivalent of this is the relative pose estimation problem, given image correspondences between two images [8]. Alternatively, the motion and structure can be estimated in a single coordinate frame [12]. Such methods require absolute camera pose estimation based on correspondences between 3D world points and 2D image points [1, 6].

An important aspect of these algorithms is that whenever any geometric constraint is being enforced, there will be outliers which do not satisfy the constraint. These outliers occur due to errors in lower level modules such as the image feature tracker. When estimating the motion and structure it is necessary to detect and remove these outliers. The standard way of handling outliers in computer vision is the RANSAC algorithm [4]. In RANSAC, parameter hypotheses are generated by randomly choosing a minimal number of elements required to generate a hypothesis.
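As a concrete illustration of the hypothesise-and-test procedure, the following is a minimal sketch of a RANSAC loop on a toy 2D line-fitting problem. It is not the pose estimator discussed in the paper; the inlier-count score, threshold, and function name are illustrative assumptions.

```python
import numpy as np

def ransac_line(points, n_trials=300, threshold=0.05, seed=0):
    """Minimal hypothesise-and-test (RANSAC) loop for 2D line fitting.

    A hypothesis is generated from a minimal elemental subset (two
    points); each hypothesis is scored by the size of its consensus
    set and only the best hypothesis is retained.
    """
    rng = np.random.default_rng(seed)
    best_line, best_score = None, 0
    for _ in range(n_trials):
        # Minimal elemental subset: two distinct points define a line.
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-12:
            continue  # degenerate sample, skip
        n = np.array([-d[1], d[0]]) / norm  # unit normal of the line
        residuals = np.abs((points - p) @ n)  # point-to-line distances
        score = int(np.sum(residuals < threshold))  # consensus set size
        if score > best_score:
            best_line, best_score = (p, n), score
    return best_line, best_score
```

The variations of RANSAC mentioned below differ mainly in how the `score` line is computed and in how many of the `n_trials` hypotheses are scored to completion.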
The hypotheses are scored based on their likelihood of having generated the observed data, and the best hypothesis is retained. Depending on the noise model assumed, different scoring functions have been proposed, giving rise to variations of RANSAC [17, 18]. Another important contribution has been the development of preemptive forms of RANSAC [2, 10], which allow RANSAC to be used in real-time SFM systems. In such methods, not all hypotheses are scored completely; some hypotheses are preemptively dropped. Unlike RANSAC, where a single hypothesis is generated and scored at a time while only the most likely hypothesis is retained, preemptive RANSAC [10] proceeds by generating all the hypotheses at the beginning. The likelihood of the hypotheses
is incrementally estimated while the number of hypotheses is continually reduced.

Different methods exist for generating a pose hypothesis given a set of point correspondences. The 5-point method [9] can be used for the relative pose between two frames, given image correspondences across the frames. Alternatively, given three 3D world to 2D image point correspondences, the 3-point method [6] gives up to four different estimates. The elemental subset can be augmented with another point to decide between the hypotheses, which gives a 4-point hypothesis generation algorithm [10].

All robust methods in the RANSAC family try to find the best hypothesis among the hypotheses generated. Consequently, other inlier hypotheses (hypotheses generated only from inliers), which lie close to the true pose, are neglected. Our robust estimator combines all the inlier hypotheses rather than simply trying to find the best one. In this way, we utilize the available information in a more complete manner than the hypothesise-and-test framework.

3. Nonlinear Mean Shift

We briefly discuss standard mean shift, which is applicable to vector spaces. We would like to work in the space of all pose estimates, which is not a vector space. This space is a standard geometrical space known as the Special Euclidean Group, denoted by SE(3). A mean shift algorithm for SE(3) is discussed in Section 3.2.

3.1. The Original Mean Shift

Given $n$ data points $\mathbf{x}_i$, $i = 1, \ldots, n$ lying in the Euclidean space $\mathbb{R}^d$, the kernel density estimate

  $\hat{f}_k(\mathbf{x}) = \frac{c_{k,h}}{n} \sum_{i=1}^{n} k\left( \|\mathbf{x} - \mathbf{x}_i\|^2 / h^2 \right)$   (1)

with bandwidth $h$ and profile function $k$ satisfying $k(z) \ge 0$ for $z \ge 0$, is a nonparametric estimator of the density at $\mathbf{x}$. The constant $c_{k,h}$ is chosen to ensure that $\hat{f}$ integrates to one. Let $g(z) = -k'(z)$. From [3], taking the gradient of (1) leads to

  $\mathbf{m}_h(\mathbf{x}) = \frac{\sum_{i=1}^{n} g\left( \|\mathbf{x} - \mathbf{x}_i\|^2 / h^2 \right) \mathbf{x}_i}{\sum_{i=1}^{n} g\left( \|\mathbf{x} - \mathbf{x}_i\|^2 / h^2 \right)} - \mathbf{x}$   (2)

where $\mathbf{m}_h(\mathbf{x})$ is the mean shift vector, which is proportional to the normalized density gradient estimate.
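The mean shift vector of Eq. (2), applied iteratively, translates directly into code. Below is a small numpy sketch using the Epanechnikov profile $k(z) = \max(1 - z, 0)$, for which $g$ is constant inside the window and zero outside; the function name, data, and stopping rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mean_shift_mode(x0, data, h, max_iter=100, tol=1e-8):
    """Iterate x_{j+1} = x_j + m_h(x_j), with m_h from Eq. (2).

    With the Epanechnikov profile, g(.) equals 1 inside the window of
    radius h and 0 outside, so the mean shift vector is simply the
    mean of the windowed points minus the current position.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        inside = np.sum((data - x) ** 2, axis=1) <= h ** 2  # g(.) weights
        if not inside.any():
            break  # empty window: x cannot move
        m = data[inside].mean(axis=0) - x  # mean shift vector, Eq. (2)
        x = x + m
        if np.linalg.norm(m) < tol:
            break  # reached a stationary point of the density
    return x
```

Started inside the basin of attraction of a cluster, the iteration converges to that cluster's mode and ignores points outside the window.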
The iteration $\mathbf{x}_{j+1} = \mathbf{m}_h(\mathbf{x}_j) + \mathbf{x}_j$ is a gradient ascent technique converging to a stationary point of the density [3]. Saddle points can be detected and removed to obtain the modes.

3.2. Mean Shift over SE(3)

Mean shift for Lie groups was proposed in [20]. This algorithm was generalized to the class of all analytic manifolds in [15]. The special Euclidean group is a matrix Lie group. Here we outline the details of the mean shift algorithm for SE(3); further details can be found in [15, 20].

The special Euclidean group consists of $4 \times 4$ matrices of the form

  $\mathbf{X} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix}$   (3)

where $\mathbf{t}$ is a 3-vector and $\mathbf{R}$ is a $3 \times 3$ orthogonal matrix, i.e. $\mathbf{R}^T \mathbf{R} = \mathbf{I}$. Each element of SE(3) contains 12 numbers, but due to orthogonality it has just 6 degrees of freedom. Note that SE(3) is not a vector space: given two points $\mathbf{X}_1, \mathbf{X}_2 \in SE(3)$, $\mathbf{X}_1 + \mathbf{X}_2$ does not lie in SE(3) but $\mathbf{X}_1 \mathbf{X}_2 \in SE(3)$. The group operation for elements of SE(3) is matrix multiplication, not matrix addition.

The group SE(3) has a closely associated vector space (Lie algebra) se(3). The correspondence between SE(3) and se(3) is established through the exponential operator $\exp : se(3) \to SE(3)$ and its inverse, the logarithm operator $\log : SE(3) \to se(3)$. The computational details of the exp and log operators for SE(3) can be found in [20]. Following standard notation in such cases, we denote elements of SE(3) by capital bold letters and elements of se(3) by small bold letters. The usage of the same letter indicates a correspondence, $\mathbf{x} = \log(\mathbf{X})$ and $\mathbf{X} = \exp(\mathbf{x})$. Elements of se(3) are $4 \times 4$ matrices of the form

  $\mathbf{x} = \begin{bmatrix} \Omega & \mathbf{u} \\ \mathbf{0} & 0 \end{bmatrix}$   (4)

where $\mathbf{u}$ is a 3-vector and $\Omega$ is skew-symmetric,

  $\Omega = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}$.   (5)

Note that elements of se(3) are defined by 6 distinct numbers. Although the elements of se(3) are organized in the form of a matrix, when we talk about vectors in se(3) we mean the vector $(\omega_x, \omega_y, \omega_z, \mathbf{u}^T)^T$. Therefore, se(3) is a six-dimensional vector space.
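To make the se(3) parameterization concrete, the sketch below builds the matrix of Eqs. (4)-(5) from a 6-vector and implements the exponential map with the standard Rodrigues-type closed form. This is an illustrative implementation under the conventions above, not the one used in [20]; the function names are assumptions.

```python
import numpy as np

def hat(v):
    """6-vector (wx, wy, wz, ux, uy, uz) -> 4x4 se(3) matrix, Eqs. (4)-(5)."""
    wx, wy, wz, ux, uy, uz = v
    return np.array([[0.0, -wz,  wy,  ux],
                     [ wz, 0.0, -wx,  uy],
                     [-wy,  wx, 0.0,  uz],
                     [0.0, 0.0, 0.0, 0.0]])

def exp_se3(v):
    """Exponential map se(3) -> SE(3), in closed form."""
    v = np.asarray(v, dtype=float)
    Om = hat(v)[:3, :3]              # skew-symmetric rotation part
    th = np.linalg.norm(v[:3])       # rotation angle
    if th < 1e-10:
        R, V = np.eye(3) + Om, np.eye(3)  # small-angle limit
    else:
        A = np.sin(th) / th
        B = (1.0 - np.cos(th)) / th ** 2
        C = (th - np.sin(th)) / th ** 3
        R = np.eye(3) + A * Om + B * Om @ Om  # Rodrigues formula
        V = np.eye(3) + B * Om + C * Om @ Om  # couples rotation and translation
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R, V @ v[3:]
    return X
```

The rotation block of the result is orthogonal by construction, so the output always satisfies the SE(3) structure of Eq. (3).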
We define a norm on se(3) through a $6 \times 6$ positive definite matrix $\mathbf{H}$ as [11]

  $\|\mathbf{x}\|_{\mathbf{H}}^2 = \mathbf{x}^T \mathbf{H} \mathbf{x}$.   (6)

This allows us to define the distance between $\mathbf{X}$ and $\mathbf{Y}$ as $\|\log(\mathbf{X}^{-1}\mathbf{Y})\|_{\mathbf{H}}$.

The exp and log operators have two useful properties. Firstly, a neighbourhood of the identity in SE(3) maps onto
a neighbourhood of the zero matrix in se(3), and there exists a neighbourhood on which these operators are one-to-one. Secondly, for $\mathbf{x}, \mathbf{y} \in se(3)$ [13]

  $\exp(\mathbf{x})\exp(\mathbf{y}) = \exp\left( \mathbf{x} + \mathbf{y} + O(\|(\mathbf{x}, \mathbf{y})\|^2) \right)$   (7)

and therefore, for small $\mathbf{x}, \mathbf{y} \in se(3)$,

  $\exp(\mathbf{x})\exp(\mathbf{y}) \approx \exp(\mathbf{x} + \mathbf{y})$.   (8)

Now, given a set of points $\mathbf{X}_i \in SE(3)$, $i = 1, \ldots, n$, we define the density estimator at $\mathbf{X} \in SE(3)$ as

  $\hat{f}_k(\mathbf{X}) = \frac{c_{k,h}}{n} \sum_{i=1}^{n} k\left( \|\log(\mathbf{X}^{-1}\mathbf{X}_i)\|^2 / h^2 \right)$   (9)

and we obtain the mean shift vector

  $\mathbf{m}_h(\mathbf{X}) = \frac{\sum_{i=1}^{n} g\left( \|\log(\mathbf{X}^{-1}\mathbf{X}_i)\|^2 / h^2 \right) \log(\mathbf{X}^{-1}\mathbf{X}_i)}{\sum_{i=1}^{n} g\left( \|\log(\mathbf{X}^{-1}\mathbf{X}_i)\|^2 / h^2 \right)}$   (10)

where $g$ is defined as before. Note that all the $g(\cdot)$ terms are scalars and the $\log(\mathbf{X}^{-1}\mathbf{X}_i)$ terms lie in se(3). Therefore, $\mathbf{m}_h(\mathbf{X})$ lies in se(3). To get back to the group SE(3), the mean shift iteration now becomes

  $\mathbf{X}_{j+1} = \mathbf{X}_j \exp\left( \mathbf{m}_h(\mathbf{X}_j) \right)$.   (11)

4. Robust Pose Estimation

The first step of our algorithm is hypothesis generation. These hypotheses are clustered over SE(3) using nonlinear mean shift, and the most dominant detected mode is retained as the pose. The pose hypotheses based on data from a single frame are plotted in Figure 1, using the rotation elements as coordinates.

We integrated this robust estimator into the camera tracking system of [16]. The world coordinate frame is based on a set of easily identifiable markers. Initially, the pose is estimated from these markers and there are no outliers. Features in the rest of the scene are triangulated using these pose estimates. The camera is then allowed to move freely without being required to keep the markers in view. Triangulated features are used to estimate pose while further features are constantly reconstructed. At this stage, the robust estimator is required to prevent mismatches in the image tracking from leading to erroneous pose estimates. In practice, a robust pose estimator alone is not sufficient for good results. Each robust fit is used to remove outliers, and the final pose is estimated using only the inliers.
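The clustering step of Eqs. (9)-(11) can be sketched compactly. The block below assumes $\mathbf{H} = \mathbf{I}$ in Eq. (6) and the Epanechnikov profile, and includes closed-form exp and log maps so that it is self-contained; the synthetic hypotheses, bandwidth, and function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def hat3(w):
    """3-vector -> 3x3 skew-symmetric matrix, Eq. (5)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])

def exp_se3(v):
    """Closed-form exponential map se(3) -> SE(3) from a 6-vector."""
    v = np.asarray(v, dtype=float)
    Om, th = hat3(v[:3]), np.linalg.norm(v[:3])
    if th < 1e-10:
        R, V = np.eye(3) + Om, np.eye(3)
    else:
        A, B, C = np.sin(th)/th, (1-np.cos(th))/th**2, (th-np.sin(th))/th**3
        R = np.eye(3) + A * Om + B * Om @ Om
        V = np.eye(3) + B * Om + C * Om @ Om
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R, V @ v[3:]
    return X

def log_se3(X):
    """Closed-form logarithm SE(3) -> se(3), returned as a 6-vector."""
    R, t = X[:3, :3], X[:3, 3]
    th = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if th < 1e-10:
        w, Vinv = np.zeros(3), np.eye(3)
    else:
        Om = th / (2.0 * np.sin(th)) * (R - R.T)
        w = np.array([Om[2, 1], Om[0, 2], Om[1, 0]])
        A, B = np.sin(th) / th, (1.0 - np.cos(th)) / th**2
        Vinv = np.eye(3) - 0.5 * Om + (1.0 - A / (2.0 * B)) / th**2 * Om @ Om
    return np.concatenate([w, Vinv @ t])

def mean_shift_se3(X0, hyps, h, max_iter=50, tol=1e-8):
    """Nonlinear mean shift of Eqs. (9)-(11), Epanechnikov profile, H = I."""
    X = X0.copy()
    for _ in range(max_iter):
        logs = np.array([log_se3(np.linalg.inv(X) @ Xi) for Xi in hyps])
        inside = np.sum(logs ** 2, axis=1) <= h ** 2  # g(.) weights, Eq. (10)
        if not inside.any():
            break
        m = logs[inside].mean(axis=0)  # mean shift vector in se(3)
        X = X @ exp_se3(m)             # map the step back to SE(3), Eq. (11)
        if np.linalg.norm(m) < tol:
            break
    return X
```

Initialized at any inlier hypothesis, the iteration converges to the mode of the inlier cluster while hypotheses outside the window are ignored, which is exactly the robust-averaging behaviour used by the estimator.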
[Figure 1. Sampled rotations mapped to the Lie algebra using data from a frame. The cluster around the true pose is clearly visible.]

When used in a SFM system, the mean shift estimator also allows us to take advantage of the continuity of the camera movement. Since the pose estimates of two consecutive frames will not be very different from each other, rather than starting a mean shift iteration at each hypothesis, we only try to find the mode closest to the previous frame's pose. Therefore, a single mean shift iteration is initialized at the previous pose estimate, and the point of convergence is taken as the next pose estimate.

4.1. Mean Shift versus RANSAC

We outline a simple proof of why mean shift performs better than hypothesise-and-test algorithms. Assume the data consists only of noisy inliers. With perfect data all hypotheses would lie at the true pose; for noisy data, the hypotheses $\mathbf{P}_i$, $i = 1, \ldots, m$ are distributed around the true pose. We assume the algorithm for hypothesis generation is unbiased, so the generated hypotheses form a unimodal distribution with the mode at the true pose. This mode is modeled as a Gaussian with mean at the true pose $\mathbf{P}_o$ and covariance $\Sigma$. Since SE(3) is a 6-dimensional manifold in 12-dimensional space, $\Sigma$ is a $12 \times 12$ matrix of rank 6 [7]. The squared Mahalanobis distances of the hypotheses from $\mathbf{P}_o$ follow a $\chi^2$ distribution with 6 degrees of freedom (dof). Let $f$ and $F$ be the density and distribution functions of a 6 dof $\chi^2$ distribution.

Let $\mathbf{P}_r$ be the RANSAC result and $\mathbf{P}_a$ be the average of the $m$ hypotheses. We compare the two estimates based on their Mahalanobis distances from $\mathbf{P}_o$. RANSAC will always return one of the generated hypotheses; ideally, it will return the hypothesis with the lowest Mahalanobis distance to $\mathbf{P}_o$. The probability of the lowest Mahalanobis distance being $d$ and all the others being greater than $d$ is

  $p(\|\mathbf{P}_r - \mathbf{P}_o\|_\Sigma^2 = d^2) = m f(d^2) \left(1 - F(d^2)\right)^{m-1}$.
(12)

The mean of $m$ Gaussian variables is a Gaussian random variable with the same mean and covariance reduced by a factor of $m$. Therefore, $\mathbf{P}_a$ is a Gaussian random variable with mean $\mathbf{P}_o$
and covariance $\Sigma / m$. Consequently, $m \|\mathbf{P}_a - \mathbf{P}_o\|_\Sigma^2$ is a $\chi^2$ variable, which gives

  $p(\|\mathbf{P}_a - \mathbf{P}_o\|_\Sigma^2 = d^2) = m f(m d^2)$.   (13)

[Figure 2. Comparison of the error densities for RANSAC and averaging, as given by (12) and (13): (a) m = 10 for both curves; (b) m = 100 for both curves; (c) m = 100 for RANSAC and m = 25 for averaging.]

The distributions for m = 10 and m = 100 are compared in the first two plots of Figure 2. The averaged estimates are closer to the true pose, and as m increases this difference becomes more pronounced. Therefore, averaging requires fewer hypotheses to perform as well as RANSAC.

In the presence of outliers, the hypotheses will no longer form a unimodal distribution around the true pose. However, the pose estimates generated using only inliers will still be distributed in the same way. Ideally, RANSAC will return the closest of these estimates, and the above analysis for RANSAC still holds. To prevent outlier hypotheses from affecting the averaging, the averaging needs to be robust. Mean shift (with the Epanechnikov kernel) is the mean of all the points lying within the basin of attraction [3]. For an appropriately chosen bandwidth, the mean shift estimate will be the average of all the inlier hypotheses, and the distance of this value from the true pose will follow the distribution (13). Since averaging requires fewer hypotheses for the same level of performance, and the major bottleneck in the hypothesise-and-test procedure is the generation of the hypotheses, less time is spent removing outliers.

In practice, the above assumptions may not hold. The hypotheses need not be normally distributed, although for low noise this does not lead to serious problems. More importantly, the bandwidth of the mean shift is usually conservative and not all inlier hypotheses are averaged. Therefore, the parameter m differs for mean shift and RANSAC.
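The two error densities can be checked numerically. The sketch below implements Eqs. (12) and (13) using the closed-form density and distribution function of the 6 dof chi-square distribution (for even dof the cdf has a finite-sum form); the quadrature helper and function names are illustrative, and the goal is the qualitative behaviour of Figure 2, not its exact curves.

```python
import math

def chi2_pdf_6(x):
    """Density of a chi-square variable with 6 dof: x^2 e^{-x/2} / 16."""
    return 0.0 if x < 0 else x * x * math.exp(-x / 2.0) / 16.0

def chi2_cdf_6(x):
    """Distribution function for 6 dof: 1 - e^{-x/2}(1 + x/2 + (x/2)^2 / 2)."""
    if x < 0:
        return 0.0
    t = x / 2.0
    return 1.0 - math.exp(-t) * (1.0 + t + t * t / 2.0)

def ransac_density(d2, m):
    """Eq. (12): density of the smallest squared distance among m hypotheses."""
    return m * chi2_pdf_6(d2) * (1.0 - chi2_cdf_6(d2)) ** (m - 1)

def averaging_density(d2, m):
    """Eq. (13): density of the squared distance of the m-hypothesis average."""
    return m * chi2_pdf_6(m * d2)

def integrate(fn, lo=0.0, hi=60.0, n=120000):
    """Trapezoidal quadrature, used to sanity-check the densities."""
    dx = (hi - lo) / n
    ys = [fn(lo + k * dx) for k in range(n + 1)]
    return dx * (sum(ys) - 0.5 * (ys[0] + ys[-1]))
```

For example, `integrate(lambda x: x * averaging_density(x, m))` recovers the expected squared distance 6/m of the averaged estimate, which can then be compared against the expectation under Eq. (12).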
In the third plot of Figure 2, we compare the RANSAC error density of (12) for m = 100 with the averaging error density of (13) for m = 25. As these densities are comparable, mean shift needs to average only 25 good hypotheses to be as good as RANSAC with 100 inlier hypotheses.

5. Results

In this section we compare the performance of our algorithm with RANSAC on real and synthetic data sets, and verify the claims made in the previous section.

5.1. Synthetic Experiments

We generated a random cloud of 80 3D world points. The points were projected to the image with the identity pose. To estimate the covariance of the RANSAC and mean shift estimates we used bootstrapping. In each trial, Gaussian noise of standard deviation 0.1 for the 3D points and standard deviation 0.01 for the 2D points was added to the data. Then 20 randomly generated outliers were also added to the data set. This was repeated 100 times, and RANSAC and mean shift were run on each data set.

The means of both the RANSAC and mean shift estimates are close to the true pose. The sample covariance matrix for both sets of estimates is of rank 6, and the singular vectors corresponding to nonzero singular values lie in the tangent plane of the manifold SE(3) [7]. Furthermore, the covariance of the RANSAC estimates is greater than the covariance of the mean shift estimates. The nonzero singular values are listed in Table 1.

Table 1. Nonzero singular values of covariance matrices.

  Mean Shift     RANSAC
  2.79           8.41
  2.27           3.86
  0.79           1.71
  2.83 x 10^-4   7.37 x 10^-4
  3.28 x 10^-5   5.82 x 10^-5
  2.42 x 10^-5   4.78 x 10^-5

[Figure 3. Comparison of mean shift and RANSAC for pose estimation on the Corridor sequence. The ground truth cameras are drawn with black dots and the robust pose estimates are drawn in solid red. The results of mean shift are on the left and RANSAC on the right.]

5.2. The Corridor Sequence

We tested the robust estimators on real data using the Corridor sequence from Oxford. The 409 visible points from the first image were taken as the initial data set; for this frame there are no outliers. The point matching system of [5] was used to track points across all the images. The number of outliers keeps increasing at each matching step, since the matcher makes errors and, more importantly, as points go out of view they get wrongly assigned to the best match available. To make the comparison between mean shift and RANSAC meaningful, the same elemental subsets were used in both cases.

For the first frame both methods gave good results. As the number of outliers increased, mean shift performed better than RANSAC. The mean shift estimate shows a visible error only for the last three frames, when the number of inliers falls sharply from 248 at the eighth frame to 119 at the ninth frame, while RANSAC breaks down much earlier. In Figure 3, the robust pose estimates are compared with the ground truth; the pose is used to render the frames so that the difference can be visualized.

The error between the robust estimates and the ground truth is also compared numerically. Let $\hat{\mathbf{R}}$ and $\hat{\mathbf{t}}$ be the ground truth pose and $\mathbf{R}$, $\mathbf{t}$ the robust estimates. The rotational error is given by the Frobenius norm of $\mathbf{R}^T \hat{\mathbf{R}} - \mathbf{I}$ and the translational error by the vector norm of $\mathbf{t} - \hat{\mathbf{t}}$. The rotational error, translational error and number of inliers are plotted versus the frame number in Figure 4.

[Figure 4. Errors in the mean shift and RANSAC robust estimates and the number of inliers, plotted as functions of the frame number for the Corridor sequence.]

We ran our experiments on a 2.4GHz Pentium 4 machine. RANSAC requires 100 hypotheses and takes 0.4ms to process them. Each hypothesis generation takes 0.05ms, leading to a total of 5.4ms for the robust estimator. The mean shift estimator requires 50 hypotheses for similar performance and takes 1.2ms, on average, to find the mode. This gives a total time of 3.7ms for the mean shift estimator.

5.3. Camera Tracking System

The workspace scene from [16] was used to test our system. An image of this scene, the camera path and the reconstructed point cloud for a sequence are shown in Figure 5. Initially the camera is moved in front of the markers to allow scene features to be reconstructed; this is the set of frames lying along a line in the top left of the image. Later, the camera is allowed to move away from the markers and the robust estimator is used.
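The error measures used for the Corridor evaluation are straightforward to compute. Below is a small numpy sketch; the rotation-builder helper is an illustrative assumption added for testing. For a relative rotation of angle theta, the Frobenius measure equals 2 * sqrt(2) * |sin(theta / 2)|.

```python
import numpy as np

def rotation_error(R, R_hat):
    """Frobenius norm of R^T R_hat - I; equals 2*sqrt(2)*|sin(theta/2)|
    when the relative rotation R^T R_hat has angle theta."""
    return np.linalg.norm(R.T @ R_hat - np.eye(3), ord='fro')

def translation_error(t, t_hat):
    """Vector norm of t - t_hat."""
    return np.linalg.norm(np.asarray(t, float) - np.asarray(t_hat, float))

def rot_z(theta):
    """Illustrative helper: rotation by `theta` about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

The small-angle identity makes the rotational error easy to interpret: for small theta it is approximately sqrt(2) * theta, so it grows linearly with the angular deviation from the ground truth.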

[Figure 5. Results of the camera tracking. The scene used is shown on the left. The reconstructed point cloud and camera frames are on the right.]

6. Conclusions

We proposed a new robust pose estimator based on the nonlinear mean shift algorithm, which shows better performance on real and synthetic data. As future work we would like to test the effect of different hypothesis generation schemes on the estimator. We would also like to extend the algorithm to handle cases where only image point correspondences are given [9]. In this case, the hypotheses no longer lie on SE(3), and this needs to be taken care of during the clustering.

References

[1] M. A. Ameller, L. Quan, and B. Triggs. Camera pose revisited: New linear algorithms. Machine Intelligence, 16(8):802-808, 2002.
[2] O. Chum and J. Matas. Randomized RANSAC with T(d,d) test. In British Machine Vision Conference, pages 448-457, 2002.
[3] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Machine Intell., 24:603-619, May 2002.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. Assoc. Comp. Mach., 24(6):381-395, 1981.
[5] B. Georgescu and P. Meer. Point matching under large image deformations and illumination changes. IEEE Trans. Pattern Anal. Machine Intell., 26:674-689, 2004.
[6] R. Haralick, C. Lee, K. Ottenberg, and M. Nolle. Analysis and solutions of the three point perspective pose estimation problem. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Maui, HI, pages 592-598, 1991.
[7] K. Kanatani. Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier, 1996.
[8] D. Nister. Preemptive RANSAC for live structure and motion estimation. In Proc. 9th Intl. Conf. on Computer Vision, Nice, France, volume I, pages 199-206, October 2003.
[9] D. Nister.
An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Machine Intell., 26(6):756-770, 2004.
[10] D. Nister. Preemptive RANSAC for live structure from motion. Machine Vision and Applications, 16(5):321-329, 2005.
[11] X. Pennec and N. Ayache. Uniform distribution, distance and expectation problems for geometric feature processing. Journal of Mathematical Imaging and Vision, 9(1):49-67, 1998.
[12] M. Pollefeys. Self calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters. International J. of Computer Vision, 32:7-25, 1999.
[13] W. Rossmann. Lie Groups: An Introduction through Linear Groups. Oxford University Press, 2003.
[14] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, volume 1, pages 414-431, 2002.
[15] R. Subbarao and P. Meer. Nonlinear mean shift for clustering over analytic manifolds. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, New York, NY, volume I, pages 1168-1175, 2006.
[16] R. Subbarao, P. Meer, and Y. Genc. A balanced approach to 3D tracking from image streams. In Proc. IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 70-78, October 2005.
[17] B. Tordoff and D. Murray. Guided sampling and consensus for motion estimation. In 7th European Conference on Computer Vision, Copenhagen, Denmark, volume I, pages 82-96, May 2002.
[18] P. H. S. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78:138-156, 2000.
[19] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle adjustment: A modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, pages 298-372. Springer, 2000.
[20] O. Tuzel, R. Subbarao, and P. Meer.
Simultaneous multiple 3D motion estimation via mode finding on Lie groups. In Proc. 10th Intl. Conf. on Computer Vision, Beijing, China, volume 1, pages 18-25, 2005.