Estimation of Camera Pose with Respect to Terrestrial LiDAR Data

Wei Guan        Suya You        Guan Pang
Computer Science Department
University of Southern California, Los Angeles, USA
e-mail: wguan@usc.edu, suya@usc.edu, gpang@usc.edu

Abstract

In this paper, we present an algorithm that estimates the pose of a hand-held camera with respect to terrestrial LiDAR data. Our input is a set of 3D range scans with intensities and one or more 2D uncalibrated camera images of the scene. The algorithm, which automatically registers the range scans and 2D images, is composed of the following steps. In the first step, we project the terrestrial LiDAR onto 2D images according to several preselected viewpoints. Intensity-based features such as SIFT are extracted from these projected images, and the features are projected back onto the LiDAR data to obtain their 3D positions. In the second step, we estimate the initial pose of a given 2D image from the feature correspondences. In the third step, we refine the coarse camera pose obtained in the previous step through iterative matching and an optimization process. We present results from experiments in several different urban settings.

1. Introduction

This paper deals with the problem of automatic pose estimation of a 2D camera image with respect to 3D LiDAR data of an urban scene, which is an important problem in computer vision. Its applications include urban modeling, robot localization and augmented reality. One way to solve this problem is to extract features from both types of data and find 2D-to-3D feature correspondences. However, since the structures of these two types of data are so different, features extracted from one type of data are usually not repeatable in the other (except for very simple features such as lines or corners). Instead of extracting features directly in 3D space, features can be extracted on 2D projections of the data, and a 2D-to-2D-to-3D matching scheme can be used.

As remote sensing technology develops, most recent LiDAR data has an intensity value for each point in the cloud. Some LiDAR data also contains color information. The intensity information is obtained by measuring the strength of surface reflectance, and the color information is provided by an additional co-located optical sensor that captures visible light. This information is very helpful for matching 3D range scans with 2D camera images: unlike geometry-only LiDAR data, intensity-based features can be applied in the pose estimation process.

Figure 1. (a) The 3D LiDAR data with color information (subsampled for fast rendering). (b) A 2D image of the same scene taken at ground level.

Figure 1 shows the colored LiDAR data and a camera image taken on the ground. As we can observe, the projected LiDAR looks similar to an image taken by an optical camera. In fact, if the point cloud is dense enough, the projected 2D image can be treated in the same way as a normal camera image.

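To make the idea of a projected LiDAR image concrete, the sketch below renders an intensity point cloud through a pinhole camera with a simple z-buffer. It is a minimal illustration under assumed inputs (the array layout, the intrinsic matrix K, and the pose R, t are placeholders), not the implementation used in this paper.

```python
import numpy as np

def render_view(points, intensities, K, R, t, width, height):
    """Project N x 3 points (with N intensities) into an image for a camera (K, R, t)."""
    cam = (R @ points.T + t.reshape(3, 1)).T        # world -> camera coordinates
    in_front = cam[:, 2] > 0.1                      # discard points behind the camera
    cam, vals = cam[in_front], intensities[in_front]

    proj = (K @ cam.T).T                            # pinhole projection
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, vals = u[inside], v[inside], cam[inside, 2], vals[inside]

    image = np.zeros((height, width), dtype=np.float32)   # unfilled pixels remain holes
    zbuf = np.full((height, width), np.inf, dtype=np.float32)
    for ui, vi, zi, val in zip(u, v, z, vals):
        if zi < zbuf[vi, ui]:                       # z-buffer: keep only the nearest point
            zbuf[vi, ui] = zi
            image[vi, ui] = val
    return image, zbuf
```

Pixels that receive no point stay at zero, which is exactly the source of the holes discussed next.
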
However, there are several differences between a projected image and an image taken by a camera. First, there are many holes in the projected image due to missing data, which are usually caused by non-reflecting surfaces and occlusions in the scene. Second, if the point cloud intensity is measured by reflectance strength, the reflectance properties under invisible light are different from those under visible light. Even when visible light is used to obtain the LiDAR intensities, the lighting conditions could differ from the lighting of a camera image. In this paper, we propose an algorithm that can handle LiDAR with both types of intensity information.

The intensity information of LiDAR data is useful for camera pose initialization. However, due to intensity differences, occlusions, etc., not many correspondences are available, and a small displacement in any of these matching points will cause large errors in the computed camera pose. Moreover, most urban scenes contain many repeated patterns, which cause many features to fail in the matching process. With the initial pose, we can estimate the locations of corresponding features and limit the search range to a region within which repeated patterns do not appear. Therefore, we can generate more correspondences and refine the pose. After several matching iterations, we further refine the camera pose by minimizing the differences between the projected image and the camera image. The estimated camera pose is more stable after these two steps of refinement.

The contributions of this paper are summarized as follows.

1. We propose a framework for camera pose estimation with respect to 3D terrestrial LiDAR data that contains intensity values. No prior knowledge about the camera position is required.

2. We design a novel algorithm that refines the camera pose in two steps. Both intensity and geometric information are used in the refinement process.

3. We have tested the proposed framework in different urban settings. The results show that the estimated camera pose is accurate and that the framework can be applied in many applications such as mixed reality.

The remainder of this paper presents the proposed algorithm in more detail. We first discuss related work in Section 2. Section 3 describes the camera pose initialization process. Following that, Section 4 discusses the algorithm that refines the camera pose. We show experimental results in Section 5 and conclude the paper in the last section.

2. Related Work

There has been a considerable amount of research on registering images with LiDAR data. The registration methods vary from keypoint-based matching [3, 1] and structure-based matching [20, 13, 14, 21] to mutual-information-based registration [24]. There are also methods that are specially designed for registering aerial LiDAR with aerial images [7, 22, 5, 23, 16].

When the LiDAR data contains intensity values, keypoint-based matching [3, 1], based on the similarity between the LiDAR intensity image and the camera intensity image, can be applied. Feature points such as SIFT [15] are extracted from both images, and a matching strategy is used to determine the correspondences and thus the camera parameters. The drawback of intensity-based matching is that it usually generates very few correspondences, so the estimated pose is not accurate or stable. Najafi et al. [18] also created an environment map that represents object appearance and geometry using SIFT features.
Vasile et al. [22] used LiDAR data to generate a pseudo-intensity image with shadows that is matched against aerial imagery. They used GPS for the initial pose and applied an exhaustive search to obtain the translation, scale, and lens distortion. Ding et al. [5] registered oblique aerial images based on 2D and 3D corner features extracted from the 2D images and the 3D LiDAR model, respectively. The correspondences between extracted corners are generated through a Hough transform and a generalized M-estimator, and the corner correspondences are then used to refine the camera parameters. In general, a robust feature extraction and matching scheme is the key to successful registration for this type of approach.

Instead of point-based matching, structural features such as lines and corners have been utilized in many studies. Stamos and Allen [20] used matching of rectangles from building facades for alignment. Liu et al. [13, 14, 21] extracted line segments to form rectangular parallelepipeds, composed of vertical or horizontal 3D rectangular parallelepipeds in the LiDAR and 2D rectangles in the images. The matching of parallelepipeds as well as vanishing points is used to estimate the camera parameters. Yang et al. [25] used feature matching to align ground images, but they worked with a very detailed 3D model. Wang and Neumann [23] proposed an automatic registration method between aerial images and aerial LiDAR based on matching 3CS (3 Connected Segments), in which each linear feature contains three connected segments. They used a two-level RANSAC algorithm to refine putative matches and estimated the camera pose from the correspondences.

Given a set of 3D-to-2D point or line correspondences, there are many approaches to solve the pose recovery problem [17, 12, 4, 19, 11, 8]. The same problem also appears in pose recovery with respect to a point cloud generated from image sequences [9, 10]. In both cases, a probabilistic RANSAC method [6] has also been introduced for automatically computing matching 3D and 2D points and removing outliers.

Figure 2. The virtual cameras placed around the LiDAR scene. They are distributed uniformly in viewing direction and logarithmically in distance.

In this paper, we apply a keypoint-based method to estimate the initial camera pose, and then use iterative matching with RANSAC, utilizing both intensity and geometric information, to obtain the refined pose.

3. Camera Pose Initialization

3.1. Synthetic Views of 3D LiDAR

To compute the pose of an image taken from an arbitrary viewpoint, we first create synthetic views that cover a large range of viewing directions. Z-buffers are used to handle occlusions. Our application is to recover the camera poses of images taken in urban environments, so we can restrict the placement of virtual cameras to eye-level height to simplify the problem; in general, the approach is not limited to such images. We place the cameras around the LiDAR over roughly 180 degrees. The cameras are placed uniformly in viewing angle and logarithmically in distance, as shown in Figure 2. The density of locations depends on the type of feature used for matching. If the feature can handle rotation, scale, and wide baselines, fewer virtual cameras are needed to cover most cases. In contrast, if the feature is neither rotation-invariant nor scale-invariant, we need to select as many viewpoints as possible and rotate the camera at each viewpoint. Furthermore, it should be noted that the viewpoints cannot be too close to the point cloud; otherwise, the quality of the projected image is not good enough to generate initial feature correspondences. In our work, we use SIFT [15] features, which are scale- and rotation-invariant and robust to moderate viewing angle changes. We select 6 viewing angles uniformly and 3 distances for each viewing angle, spaced logarithmically. The synthetic views are shown in Figure 3.

Figure 3. The synthetic views of the LiDAR data. 2D features are extracted from each synthetic view.

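The following sketch illustrates one way to generate such virtual viewpoints: positions spread uniformly over roughly 180 degrees of viewing direction and logarithmically in distance, at eye-level height, all looking at the scene center. The 6 angles and 3 distances follow the text above, but the look-at construction, the eye height of 1.7 m, and the specific distances are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """World-to-camera rotation R and translation t for a camera at `eye` looking at `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)                 # image y axis points downward
    R = np.stack([right, down, forward])            # rows are the camera axes in world coordinates
    t = -R @ eye
    return R, t

def virtual_cameras(scene_center, eye_height=1.7, n_angles=6, distances=(10.0, 30.0, 90.0)):
    """6 viewing angles over ~180 degrees, 3 logarithmically spaced distances for each angle."""
    center = np.asarray(scene_center, dtype=float)
    poses = []
    for theta in np.linspace(-np.pi / 2, np.pi / 2, n_angles):
        for d in distances:                         # log-spaced: each step multiplies the distance by 3
            eye = center + d * np.array([np.cos(theta), np.sin(theta), 0.0])
            eye[2] = eye_height                     # virtual cameras stay at eye level
            poses.append(look_at(eye, center))
    return poses                                    # 6 x 3 = 18 (R, t) pairs, as in Figure 3
```

Each returned (R, t), combined with a renderer such as the render_view sketch earlier, yields one synthetic view from which 2D features can be extracted.
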
3.2. Generation of 3D Feature Cloud

We extract 2D SIFT features from each synthetic view. Once the features are extracted, we project them back onto the point cloud by finding the intersection with the first plane obtained through plane segmentation using the method of [20]. It is possible that the same feature is reprojected onto different points through different synthetic views. To handle this problem, we post-process the feature points so that close points with similar descriptors are merged into one feature. Note that we could also obtain the 3D features by triangulation; however, such a method depends on matching pairs, so it generates far fewer features for potential matches. The obtained positions of the 3D keypoints are not accurate due to projection and reprojection errors, but they are good enough to provide an initial pose; we optimize their positions and the camera pose at a later stage. The generated 3D feature cloud is shown in Figure 4. Each point is associated with one descriptor.

Figure 4. SIFT features in 3D space. The 3D positions are obtained by reprojecting 2D features onto the 3D LiDAR data.

For a given camera image, we extract SIFT features and match them against the feature cloud. A direct 3D-to-2D matching and RANSAC method is used to estimate the pose and remove outliers. When we apply RANSAC, rather than maximizing the number of inliers that are in consensus with the hypothesized pose, we make the following modification. We cluster the inliers according to their normal directions, so that inliers with close normal directions are grouped into the same cluster. Let N_1 and N_2 be the numbers of inliers in the two largest clusters. Among all the hypothesized poses, we want to maximize the value of N_2, i.e.,

$$[R\ T] = \arg\max_{[R\ T]} N_2. \qquad (1)$$

This ensures that not all the inliers lie in the same plane, in which case the calculated pose would be unstable and sensitive to position errors.

4. Camera Pose Refinement

4.1. Refinement by More Point Correspondences

With the estimated initial pose, we can generate more feature correspondences by limiting the similarity search space. For the first iteration, we still use SIFT features. From the second iteration on, we can use less distinctive features to generate more correspondences. In our work, we use Harris corners as the keypoints; for each corner point, a normalized intensity histogram within an 8x8 patch is computed as the descriptor. The corresponding point of a feature will most likely lie within a neighborhood of H by H pixels around its predicted location. Initially, H is set to 64 pixels. At each iteration, the window size is halved, since a more accurate pose has been obtained; we keep the minimum search size at 16 pixels. Figure 5 shows a few iterations and the matching results within the reduced search space.

Figure 5. (a) The initial camera pose. (b) 3D-to-2D matching based on the initial pose. (c) Camera pose after the 1st iteration. (d) 3D-to-2D matching based on the refined pose. (e) Camera pose after the 2nd iteration.

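Below is a minimal sketch of the refinement loop just described: Harris-type corners, a normalized intensity histogram over an 8x8 patch as descriptor, and an H x H search window around the location predicted by the current pose, halved after each pass down to 16 pixels. The OpenCV-based corner detection, the simple pinhole helper, and the parameter values are illustrative assumptions; this is not the authors' code.

```python
import numpy as np
import cv2

def project_point(K, R, t, P):
    """Pinhole projection of a single 3D point (simplified assumed camera model)."""
    p = K @ (R @ np.asarray(P, dtype=float) + t)
    return p[0] / p[2], p[1] / p[2]

def patch_descriptor(gray, x, y, patch=8, bins=16):
    """Normalized intensity histogram of a patch x patch window around (x, y)."""
    h = patch // 2
    window = gray[int(y) - h:int(y) + h, int(x) - h:int(x) + h]
    hist, _ = np.histogram(window, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def refine_matches(gray_cam, gray_proj, points3d, K, R, t, H=64, H_min=16):
    """One round of correspondence search; call repeatedly, halving H each round."""
    corners = cv2.goodFeaturesToTrack(gray_cam, maxCorners=2000, qualityLevel=0.01,
                                      minDistance=5, useHarrisDetector=True)
    corners = corners.reshape(-1, 2)
    matches = []
    for P in points3d:
        u, v = project_point(K, R, t, P)                   # predicted pixel under the current pose
        near = corners[np.max(np.abs(corners - (u, v)), axis=1) < H / 2]
        if len(near) == 0:
            continue
        d_ref = patch_descriptor(gray_proj, u, v)          # descriptor from the projected LiDAR image
        dists = [np.linalg.norm(patch_descriptor(gray_cam, x, y) - d_ref) for x, y in near]
        matches.append((P, near[int(np.argmin(dists))]))   # best corner inside the search window
    return matches, max(H // 2, H_min)                     # shrink the window for the next pass
```

The pose would then be re-estimated from the returned matches (for example with a RANSAC-based PnP solver) before the next, tighter pass.
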
4.2. Geometric Structures and Alignment

The purpose of geometric structure extraction is not to form features for generating correspondences. Instead, the structures are used to align the 3D range scans with 2D structures in the camera image. In our work, line segments are used for this alignment, so we need to define a distance between line segments. There are two types of lines in the 3D LiDAR data. The first type is generated from the geometric structure and can be computed at the intersections between segmented planar regions and at the borders of the segmented planar regions [20]. The other type of line is formed by intensities; these lines can be detected on the projected synthetic image with the method of [2] and reprojected onto the 3D LiDAR to obtain their 3D coordinates. For each hypothesized pose, the 3D lines are projected onto the 2D image and we measure the alignment error as follows:

$$E_{line} = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N} K(l_i, L_j)\,\max\big(D(l_{i1}, L_j),\, D(l_{i2}, L_j)\big), \qquad (2)$$

where l_i is the ith 2D line segment with l_i1 and l_i2 as its two endpoints, and L_j is the jth projected 3D line segment. M and N are the numbers of 2D and 3D segments, respectively. K(l_i, L_j) is a binary function deciding whether the two line segments have similar slopes, and D(l_i1, L_j) describes the distance from the endpoint l_i1 to the projected line segment L_j. The functions K and D are defined as follows:

$$K(l, L) = \begin{cases} 0 & \text{for } \angle(l, L) < K_{th} \\ 1 & \text{for } \angle(l, L) \geq K_{th} \end{cases} \qquad (3)$$

$$D(l_{12}, L) = \begin{cases} 0 & \text{for } d(l_{12}, L) \geq D_{th} \\ d(l_{12}, L) & \text{for } d(l_{12}, L) < D_{th} \end{cases} \qquad (4)$$

where ∠(l, L) represents the angle difference between the two line segments, and d(l_12, L) is the distance from endpoint l_1 or l_2 to the projected line segment L. K_th and D_th are two thresholds deciding whether the two segments are potential matches. In our experiments, we set K_th = π/6 and D_th = W/20, where W is the image width.

4.3. Refinement by Minimizing the Error Function

Figure 6. (a) The refined pose from iterative matchings. (b) Camera pose after minimizing the error function.

Figure 7. Intensity differences between the projected image and the camera image: (a) errors after iterative refinements; (b) errors after optimization.

Once we have obtained the camera pose from the iterative refinements, we can further refine it by minimizing the differences between the LiDAR-projected image and the camera image. The differences are represented by an error function composed of two parts: line differences and intensity differences. The line differences were discussed above. The intensity error function is defined as

$$E_{intensity} = \frac{1}{|\{i\}|} \sum_i \big(s\, I_{3D}(i) - I_{2D}(i)\big)^2, \qquad (5)$$

where I_3D(i) and I_2D(i) are the intensity values of the ith pixel in the projected image and the camera image, respectively, |{i}| is the number of projected pixels, and s is a scale factor that compensates for reflectance or lighting differences. Since s takes the value that minimizes the intensity errors, the above error function is equivalent to

$$E_{intensity} = \frac{1}{|\{i\}|} \left( \sum_i I_{2D}(i)^2 - \frac{\big(\sum_i I_{3D}(i)\, I_{2D}(i)\big)^2}{\sum_i I_{3D}(i)^2} \right). \qquad (6)$$

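To make Eqs. (2)-(6) concrete, here is a minimal sketch (not the authors' code) of the two error terms. Here lines2d holds detected 2D segments as endpoint pairs, proj_lines3d holds the 3D segments already projected into the image under the pose being scored, and I_3D, I_2D are the projected LiDAR image and the camera image sampled at the same projected pixels; these names and the data layout are assumptions.

```python
import numpy as np

def seg_angle(p1, p2):
    d = np.asarray(p2, float) - np.asarray(p1, float)
    return np.arctan2(d[1], d[0]) % np.pi              # undirected line orientation in [0, pi)

def point_to_segment(p, a, b):
    a, b, p = (np.asarray(x, float) for x in (a, b, p))
    ab = b - a
    s = np.clip(np.dot(p - a, ab) / max(np.dot(ab, ab), 1e-9), 0.0, 1.0)
    return np.linalg.norm(p - (a + s * ab))

def line_error(lines2d, proj_lines3d, width, K_th=np.pi / 6):
    """E_line of Eq. (2), with K and D as in Eqs. (3) and (4)."""
    D_th = width / 20.0
    N = max(len(proj_lines3d), 1)
    err = 0.0
    for (l1, l2) in lines2d:
        for (L1, L2) in proj_lines3d:
            dtheta = abs(seg_angle(l1, l2) - seg_angle(L1, L2))
            dtheta = min(dtheta, np.pi - dtheta)
            K = 0.0 if dtheta < K_th else 1.0          # binary slope test, Eq. (3)
            d1 = point_to_segment(l1, L1, L2)
            d2 = point_to_segment(l2, L1, L2)
            D1 = d1 if d1 < D_th else 0.0              # truncated endpoint distances, Eq. (4)
            D2 = d2 if d2 < D_th else 0.0
            err += K * max(D1, D2)                     # summand of Eq. (2)
    return err / N

def intensity_error(I_3D, I_2D):
    """Eq. (5) evaluated at the optimal scale s, which is exactly Eq. (6)."""
    a = I_3D.astype(float).ravel()
    b = I_2D.astype(float).ravel()
    s = np.dot(a, b) / max(np.dot(a, a), 1e-9)         # optimal reflectance/lighting scale
    return np.mean((s * a - b) ** 2)
```

The overall score of Eq. (7) below is then just the weighted sum alpha * line_error + (1 - alpha) * intensity_error, minimized over the pose.
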
The overall error function is a weighted combination of the two error functions,

$$E^{pose} = \alpha E_{line}^{pose} + (1 - \alpha) E_{intensity}^{pose}, \qquad (7)$$

where the pose is determined by the rotation R and translation T, or equivalently by the 3D positions of the keypoints P. We set α = 0.5 in our experiments; since the intensity errors usually have larger scales, this gives the intensity term a larger effect on the overall error function. The relative pose is refined by minimizing the above error function:

$$(R, T, P) = \arg\min_{R, T, P} E^{pose}. \qquad (8)$$

The refinement results are shown in Figures 6 and 7.

5. Experimental Results

We have tested more than 10 sets of scenes with camera images taken from different viewpoints. Figure 8 shows an example of pose recovery through iterations and optimization. After a series of refinements through iterative matching and optimization, we obtain an accurate view for a given camera image. Figure 9 shows an image of the same scene taken from another view. By blending the two images together, it can easily be observed that the virtual image is well aligned with the real image.

We have also measured the projection errors for each refinement step. The results are shown in Figure 10 and Figure 11. As shown in Figure 10, the errors stay constant after the 3rd refinement. This is because we have usually obtained sufficient correspondences after the 3rd iteration to get a stable camera pose. The remaining errors are caused by errors in calculating the 3D positions of the keypoints. This can be further improved by adjusting the pose to obtain even smaller projection errors, as shown in Figure 11. However, due to moving pedestrians, occlusions, lighting conditions, etc., there are always some errors between a projected image and a camera image.

Figure 10. The errors after each iterative refinement (errors vs. iteration number).

Figure 11. The errors before and after optimization (errors after iterative refinements vs. after optimization, per image number).

6. Conclusion

We have proposed a framework for camera pose estimation with respect to 3D terrestrial LiDAR data that contains intensity information. We first project the LiDAR onto several pre-selected viewpoints and compute SIFT features. These features are reprojected back onto the LiDAR data to obtain their positions in 3D space, and the resulting 3D features are used to compute the initial camera pose. In the next stage, we iteratively refine the camera pose by generating more correspondences. After that, we further refine the pose by minimizing the proposed objective function, which is composed of two components: errors from intensity differences and errors from geometric structure displacements between the projected LiDAR image and the camera image. We have tested the proposed framework in different urban settings. The results show that the estimated camera pose is stable and that the framework can be applied in many applications such as augmented reality.

References

[1] D. G. Aguilera, P. R. Gonzalvez, and J. G. Lahoz. An automatic procedure for co-registration of terrestrial laser scanners and digital cameras. ISPRS Journal of Photogrammetry and Remote Sensing, 64(3):308-316, 2009.

[2] N. Ansari and E. J. Delp. On detecting dominant points. Pattern Recognition, 24(5):441-451, 1991.

[3] S. Becker and N. Haala. Combined feature extraction for facade reconstruction. In ISPRS Workshop on Laser Scanning, 2007.

Figure 8. (a) The camera image. (b) The initial view calculated from a limited number of matches. (c) The refined view after generating more correspondences. (d) No further refinement (from more matches) occurs after 2 or 3 iterations. (e) The pose refined by minimizing the proposed error function. (f) The virtual building is well aligned with the real image for the calculated view.

[4] S. Christy and R. Horaud. Iterative pose computation from line correspondences. 73(1):137-144, 1999.

[5] M. Ding, K. Lyngbaek, and A. Zakhor. Automatic registration of aerial imagery with untextured 3D LiDAR models. In Computer Vision and Pattern Recognition (CVPR), 2008.

[6] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.

[7] C. Frueh, R. Sammon, and A. Zakhor. Automated texture mapping of 3D city models with oblique aerial imagery. In Symposium on 3D Data Processing, Visualization and Transmission, pages 396-403, 2004.

[8] W. Guan, L. Wang, M. Jonathan, S. You, and U. Neumann. Robust pose estimation in untextured environments for augmented reality applications. In ISMAR, 2009.

[9] W. Guan, S. You, and U. Neumann. Recognition-driven 3D navigation in large-scale virtual environments. In IEEE Virtual Reality, 2011.

[10] W. Guan, S. You, and U. Neumann. Efficient matchings and mobile augmented reality. ACM TOMCCAP, 2012.

[11] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.

[12] R. Horaud, F. Dornaika, B. Lamiroy, and S. Christy. Object pose: The link between weak perspective, paraperspective and full perspective. 22(2), 1997.

Figure 9. The estimated camera pose with respect to the same scene as in Figure 8, but from a different viewpoint. The right image shows the mixed reality combining the virtual and real worlds.

[13] L. Liu and I. Stamos. Automatic 3D to 2D registration for the photorealistic rendering of urban scenes. In Computer Vision and Pattern Recognition, pages 137-143, 2005.

[14] L. Liu and I. Stamos. A systematic approach for 2D-image to 3D-range registration in urban environments. In International Conference on Computer Vision, pages 1-8, 2007.

[15] D. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, 1999.

[16] A. Mastin, J. Kepner, and J. Fisher. Automatic registration of LiDAR and optical images of urban scenes. In Computer Vision and Pattern Recognition (CVPR), pages 2639-2646, 2009.

[17] D. Oberkampf, D. DeMenthon, and L. Davis. Iterative pose estimation using coplanar feature points. In CVGIP, 1996.

[18] H. Najafi, Y. Genc, and N. Navab. Fusion of 3D and appearance models for fast object detection and pose estimation. In ACCV, pages 415-426, 2006.

[19] L. Quan and Z. Lan. Linear n-point camera pose determination. In PAMI, 1999.

[20] I. Stamos and P. K. Allen. Geometry and texture recovery of scenes of large scale. 88(2):94-118, 2002.

[21] I. Stamos, L. Liu, C. Chen, G. Wolberg, G. Yu, and S. Zokai. Integrating automated range registration with multiview geometry for the photorealistic modeling of large-scale scenes. pages 237-260, 2008.

[22] A. Vasile, F. R. Waugh, D. Greisokh, and R. M. Heinrichs. Automatic alignment of color imagery onto 3D laser radar data. In Applied Imagery and Pattern Recognition Workshop, 2006.

[23] L. Wang and U. Neumann. A robust approach for automatic registration of aerial images with untextured aerial LiDAR data. In Computer Vision and Pattern Recognition (CVPR), pages 2623-2630, 2009.

[24] R. Wang, F. Ferrie, and J. Macfarlane. Automatic registration of mobile LiDAR and spherical panoramas. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 33-40, 2012.

[25] G. Yang, J. Becker, and C. Stewart. Estimating the location of a camera with respect to a 3D model. In 3D Digital Imaging and Modeling, 2007.