Real-Time RGB-D Registration and Mapping in Texture-less Environments Using Ranked Order Statistics

Khalid Yousif 1, Alireza Bab-Hadiashar 2, Senior Member, IEEE, and Reza Hoseinnezhad 3

Abstract — In this paper, we present a real-time 3D SLAM system for texture-less scenes using only the depth information provided by a low-cost RGB-D sensor. The proposed registration method is based on a novel informative sampling scheme that extracts the points carrying the most useful information from two consecutive frames. Those points are assigned 3D feature descriptors and matched with their correspondences from the other frame. A robust estimator is then used to refine the matches and estimate a rigid transformation between the frames. A global pose for the camera and the associated 3D positions of the points are computed by concatenating the relative transformations, and a map is constructed. As a result, we are able to achieve very accurate registration in real time using the sparse keypoints. Experimental evaluations show that our proposed method outperforms state-of-the-art registration algorithms in terms of accuracy and efficiency.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) [1] is the problem of simultaneously estimating the pose of a mobile robot or a moving camera while mapping an unknown environment. The SLAM problem has been intensely studied over the past couple of decades and various solutions have been developed for specific applications [2]. The recent availability of cheap RGB-D sensors (e.g. the Microsoft Kinect) has sparked intense interest in obtaining dense 3D models of the environment instead of sparse 2D maps [3][4][5]. The main motivation is that dense 3D maps are rich in information and can be effectively used to improve object recognition and manipulation, navigation, collision avoidance and path planning algorithms. A 3D representation of the environment is also very useful in augmented reality applications that facilitate interactions between the real and virtual environments [3].

Despite the above advantages, a number of challenges arise when using RGB-D cameras for 3D registration and mapping applications. For instance, the registration of frames in a texture-less scene is very difficult because the available visual information is not sufficient to align the corresponding 3D points between different frames. In addition, the huge amount of data provided by RGB-D sensors poses significant challenges for acquiring, processing and visualizing data in real time. Many types of environments in which mobile robots are used (e.g. offices, warehouses and homes) predominantly include texture-less objects (plain walls, floors, ceilings, desks, etc.).

Authors are with the School of Aerospace, Mechanical and Manufacturing Engineering, RMIT University, Melbourne, VIC 3001, Australia. 1 s @student.rmit.edu.au 2 abh@rmit.edu.au 3 rezah@rmit.edu.au

Fig. 1: Constructed 3D map of an RMIT University postgraduate office using the proposed 3D registration and mapping algorithm.

Fig. 2 shows a typical corridor in an office building. Registration in these environments using visual feature extraction methods would often fail, as there are not enough constraining visual features. In this paper, we outline the problem of 3D registration and mapping in an environment that may contain texture-less scenes and propose a fast and accurate method that does not rely on extracting any visual information from the RGB image.
Instead, we extract geometric 3D features from sequential frames, assign a descriptor to each feature, match the features using their descriptors and, finally, refine the matches and align them using a robust estimator [6]. Due to the hard real-time constraints and the high computational cost of feature descriptor matching, this is a challenging task. Our main contribution here is the development of an informative sampling based 3D feature extraction technique. The method exploits the geometric information of the points and their neighbors to identify the points that carry the most useful information. We call the points resulting from our informative sampling scheme Ranked Order Statistics (ROS) keypoints. We will show that the proposed feature extraction method is able to obtain a subset of the original point cloud that delivers registration as accurate as that obtained using a point cloud containing 10 times more points. The main advantage of using this sampling technique is that it reduces the computational time significantly. In fact, we will show that our method outperforms the state-of-the-art registration methods in both accuracy and computational efficiency.

The rest of the paper is organized as follows: Section II discusses the related work in this area. We present our informative sampling based feature extraction method in Section III and our 3D registration method in Section IV. Results are presented in Section V, followed by a conclusion in Section VI.

Fig. 2: (a) A typical corridor in an office building with limited texture information. (b) Alignment using the proposed method. (c) Alignment using our previous method [7], which relies on extracting visual features.

II. RELATED WORK

A. RGB-D Registration and Mapping

RGB-D based mapping was first introduced by Henry et al. [4], who employed a Microsoft Kinect to capture RGB and depth information in sequential frames in order to estimate the camera pose and obtain a 3D reconstruction of the environment. They extracted and matched FAST features [8] from sequential RGB images, and used RANSAC to refine those matches and estimate an initial transformation. This transformation was refined using a Generalized-ICP [9] algorithm. Global consistency was achieved by applying a Sparse Bundle Adjustment method, and loop closure was detected by matching the current frame to previously collected keyframes. Endres et al. [5] proposed a similar method, called RGB-D-SLAM, in which SIFT, SURF [10] or ORB [11] features were used in place of FAST features, and pose-graph optimization was used instead of Bundle Adjustment for global optimization. Du et al. [12] extended Henry et al.'s method by incorporating on-line user interaction and feedback during the map building step. User input enabled the system to recover from registration failures, which may be caused by a fast moving camera.

Newcombe et al. [3] proposed a GPU-based RGB-D mapping algorithm called KinectFusion. The method was designed to avoid using the RGB images; instead, they proposed an ICP variant for registration in which the current measurement is matched with the full growing surface model, as opposed to matching sequential frames. They also implemented a segmentation step which divided the scene into foreground and background and was able to detect moving humans present in the scene. This allowed user interaction during the map building procedure without deteriorating the accuracy of the estimated transformations and the map. The major downside of their approach is that it is computationally expensive, because of the requirement to store and process the full dense volumetric representation of the constructed model in GPU memory; as such, it is not scalable to large environments. Whelan et al. [13] proposed an extension to KinectFusion that addressed those problems and allowed the region of the environment mapped by the KinectFusion algorithm to vary dynamically, enabling their approach to map larger scenes. Our approach is similar in that we also do not rely on the availability of RGB features for registration. However, since ICP only works when there is a good initial alignment, we advocate the use of geometric 3D features for registration.

Hu et al. [14] proposed a heuristic switching algorithm that chooses between an RGB-D mapping approach and an 8-point RANSAC monocular SLAM, based on the availability of depth information. A 3D Iterative Sparse Local Submap Joining Filter (I-SLSJF) was then used to merge the two generated maps. In our previous work [7], we outlined the problem of 3D registration in dark environments and proposed a method that switches between RGB and IR images for visual feature extraction and matching, based on the brightness of the RGB image. A highly robust estimator [6] was then employed to estimate the transformation between frames. The estimate was then refined using an ICP algorithm. This method fails in cases where both the RGB and the IR images contain insufficient visual information, since ICP alone fails to provide an accurate estimate without a good initial guess. The method proposed in this paper is designed to overcome this limitation by relying solely on depth information.

B. 3D Features

The well known Harris corner detector [15] has been implemented in 3D and is available in the Point Cloud Library (PCL) [16]. In the 3D implementation, surface normals are used in place of the image gradients used in the original algorithm [17]. SIFT [18] is a blob detector which extracts image patterns that differ from their immediate neighborhood in terms of intensity, color and texture. SIFT has also been implemented in 3D and is available in PCL. The main difference between the original method and the 3D implementation is that instead of comparing a pixel with its neighboring pixels in the image plane, a 3D point in a point cloud is compared to its Euclidean-distance-based nearest neighbors, found using a kd-tree search. Gelfand et al. [19] proposed a sampling strategy that groups the point cloud data into stable points (constraining points) and non-stable points (that have no effect on the transformation estimation). This is achieved by analyzing the contribution of the force (translational) and torque (rotational) components of each pair of points using a point-to-plane error metric.

The main shortcoming of this method is that it is optimized for a specific metric and becomes suboptimal when other metrics (such as the squared Euclidean distance error metric employed here) are used. An example of this method is illustrated in Fig. 3 (c), which shows some points with significant local curvature being omitted from the point cloud (such as the top part of a chair). Zhong [20] proposed a 3D feature detector and descriptor called Intrinsic Shape Signatures (ISS). This method exploits the information provided by the covariance matrix that is constructed using surface normal estimates. The method extracts keypoints based on two measures: points that correspond to the smallest eigenvalues of the covariance matrix, representing those with large variations, and the ratio between successive eigenvalues, used to discard redundant points that have a similar spread.

Despite the availability of 3D feature extraction methods, the above mentioned implementations seem to be focused on general 3D modeling applications and none of them appears to be particularly optimized for SLAM purposes. For instance, our experiments showed that Zhong's method [20] finds many feature points on commonly encountered planar surfaces such as office walls and doors (see Fig. 3 (b)). Those points are ill-conditioned for registration purposes and can deteriorate the estimation outcome.

In this work, we address the problem of aligning RGB-D images in an environment that may not contain sufficient visual information for 3D registration using approaches that rely on visual feature extraction and matching. Our proposed method extracts unique 3D features from sequential frames using a novel ranked order statistics based informative sampling method. The extracted 3D features are then matched using SHOT descriptors [21] and, finally, the modified selective statistical estimator (MSSE) [6] is employed to segment the good matches and estimate a 6 degree of freedom (6DOF) transformation between the two frames. The estimated transformations are concatenated up to the current time in order to obtain a global transformation of the camera and a global map. These steps are outlined in the following sections.

III. EXTRACTING GEOMETRIC FEATURES USING RANKED ORDER STATISTICS

The aim of our proposed feature extraction method is to informatively down-sample a point cloud into a subset of points that geometrically differ from their immediate neighborhood. This is somewhat similar to the idea behind geometrically stable sampling for ICP [19]. However, we present an efficient method that finds the samples with the most information directly, using a statistical analysis of their flatness. The main idea is to segment the points into two main groups (a third group is defined later): points that are locally flat and points that have significant local curvature. The details of the implementation of this method are presented in Algorithm 1. The input to this method is a point cloud with a normal vector calculated at every point. We then calculate the average of the squared angles between each point's normal and the normals of its N nearest neighbors, which are found using a kd-tree search. The angles are calculated using the following equations:

\[ \phi_j = \arccos(\mathbf{p}_i \cdot \mathbf{q}_j) \tag{1} \]

\[ \theta_i = \frac{\sum_{j=1}^{N} \phi_j^2}{N}; \quad i = 1, \dots, n \tag{2} \]

where φ_j is the angle between the two normal vectors p_i (at a query point) and q_j (at a neighboring point), n is the number of points in the point cloud and θ_i is the average squared angle between the normals of the query point and all its neighbors.

Fig. 3: A comparison between different sampling techniques. The point clouds after: (a) uniform sampling, (b) ISS, (c) covariance keypoints, (d) ROS.

The average squared angle θ_i, which is calculated for every point in the point cloud, is considered an error measure that specifies the similarity of a point's curvature to that of its neighbors. Therefore, for points associated with a θ_i value close to zero, the orientations of the normals of the point and its neighbors are similar (such as points lying on a flat plane), regardless of the normal vectors' orientation at those points (i.e. regardless of which plane they lie on). On the other hand, points associated with higher residual values differ from their neighborhood and carry more useful information for registration purposes. To further clarify this point, let us assume that there are two planes perpendicular to each other. The points with the smallest residuals correspond to points lying on either one of the two planes. However, we are mainly looking for the points that lie close to the intersection between the two planes; in this scheme, those correspond to the points with higher residual values.

In the next step, we store all the θ_i values in the residual vector r_θ and sort them in ascending order.
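For concreteness, the residual computation of (1) and (2) can be sketched in a few lines of Python. This is our own illustrative reconstruction rather than the authors' implementation: points and normals are assumed to be n x 3 NumPy arrays of unit normals, and SciPy's cKDTree stands in for the kd-tree search.

import numpy as np
from scipy.spatial import cKDTree

def ros_residuals(points, normals, N=10):
    # Average squared angle between each point's normal and the normals
    # of its N nearest neighbors -- equations (1) and (2).
    tree = cKDTree(points)
    _, idx = tree.query(points, k=N + 1)        # first neighbor is the point itself
    neighbor_normals = normals[idx[:, 1:]]      # shape (n, N, 3)
    dots = np.einsum('nd,nkd->nk', normals, neighbor_normals)
    # Clipping guards arccos against floating-point noise outside [-1, 1].
    phi = np.arccos(np.clip(dots, -1.0, 1.0))   # equation (1)
    return (phi ** 2).mean(axis=1)              # theta_i, equation (2)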

The final step is to find the transition point that separates the two aforementioned groups. A ranked order statistics technique (MSSE) [6] is employed for this segmentation. The segmentation is conducted iteratively, by first calculating the standard deviation of the sorted residuals using the first k points (the initial k-th order corresponds to the assumed minimum percentage of points included in the first segment):

\[ \sigma_k^2 = \frac{\sum_{i=1}^{k} r_{\theta_i}^2}{k - p} \tag{3} \]

where r_θi is the i-th residual in the sorted residual vector r_θ and p is the dimension of the model. The transition point k* is found when the following condition is met:

\[ r_{\theta_{k+1}} > T \sigma_k \tag{4} \]

where T is a constant factor which is set to 2.5 and includes around 99% of the population of a normal distribution. We finally construct a third group, chosen as the first 2% of the sorted residuals (i.e. the points associated with the 2% smallest residuals in the first group). The points included in groups 2 and 3 are the extracted 3D feature points. We found that including the points of group 3 improves the accuracy of the registration, since adding a few of those points, which share similar normal orientations and carry some useful information, helps constrain the transformation estimation. We chose the first 2% for repeatability purposes, since the points with the lowest residuals are more likely to be the same points in sequential overlapping frames and, as such, to be appropriately matched with their correspondences.

Fig. 4: Classification of points based on the calculated residuals using the MSSE constraint (4). The plot of sorted residuals r_θ(i) against the sorted point index i shows Group 1 (points with normal vector orientations similar to their neighbors'), Group 3 (points with the lowest residuals), the transition point k* at the Tσ threshold, and Group 2 (points with significant local curvature).

To meet the stringent time constraints of the registration procedure, the selected keypoints are down-sampled to around 350 keypoints. Our extensive experiments showed that having around 350 keypoints provides the right balance between registration accuracy and speed. Fig. 3 shows a comparison between our proposed method and other feature extraction techniques.

Algorithm 1 Step-by-Step Algorithm of the Proposed Feature Extraction Method
1: Input: Point cloud with normals P_n
2: Output: Sampled point cloud P_d
3: Initialize and clear vector α;
4: for i = 0 to length(P_n) do
5:   Find the N nearest neighbors of query point i;
6:   for j = 1 to N do
7:     Calculate the angle between the normals (φ_j) of the query point and its neighbor j;
8:     Add φ_j² to the vector α;
9:   end for
10:  Calculate the average of the angles in vector α (θ);
11:  Add θ to the residuals vector (r_θ);
12: end for
13: Sort r_θ in ascending order;
14: Apply the MSSE constraint (4) to find the transition point (k*) that separates the 2 groups;
15: Store the indices of points with sorted indices > k* (group 2) in the indices vector K_Indices;
16: Store the indices of the points associated with the first 2% of residuals (group 3) in K_Indices;
17: Down-sample the selected keypoint indices until the number of extracted keypoints is less than 350;
18: Store the points from P_n associated with K_Indices into P_d;
19: return P_d;
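A minimal sketch of the MSSE segmentation in (3)-(4) and the grouping steps of Algorithm 1 follows, under the same assumptions as the previous snippet. The model dimension p and the assumed minimum fraction of locally flat points are illustrative placeholders, as the paper does not fix their values here.

import numpy as np

def ros_keypoint_indices(residuals, T=2.5, p=1, min_flat_fraction=0.5):
    # p and min_flat_fraction are placeholders, not values from the paper.
    order = np.argsort(residuals)       # ascending sort (step 13 of Algorithm 1)
    r = residuals[order]
    n = len(r)
    k = max(int(min_flat_fraction * n), p + 1)   # assumed starting k-th order
    while k < n - 1:
        sigma_k = np.sqrt(np.sum(r[:k] ** 2) / (k - p))   # equation (3)
        if r[k] > T * sigma_k:                            # equation (4)
            break                                         # transition point k* found
        k += 1
    group2 = order[k:]                  # points with significant local curvature
    group3 = order[:int(0.02 * n)]      # 2% smallest residuals (group 3)
    return np.concatenate([group2, group3])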
IV. 3D REGISTRATION AND MAPPING

In the following sections, we discuss the steps of our proposed 3D registration and mapping method, which are outlined in Algorithm 2.

A. Preprocessing steps

To improve the efficiency and accuracy of the registration, we exclude points that are more than 5 meters away from the sensor (due to the decrease in depth precision with increasing distance from the camera). We then uniformly sample the point cloud (containing around 300,000 points) in such a way that all points lying inside a predefined radius are reduced to a single point (their centroid). As a result, the minimum distances between points in the point cloud are constrained and the total number of points is reduced. The spatial resolution of the point cloud decreases the further the scene is from the RGB-D sensor. To ensure that the sampled points are distributed fairly uniformly in depth, we assign a variable search radius to every frame, based on the average distance between each point in the point cloud and its nearest neighbors. In our experiments, the above procedure reduces the number of points to around 3,500.

B. Normal vector estimation

We estimate the normal to a surface at a given point by fitting a plane to the point and its neighbors within a search area that is calculated in a similar way to the method explained in Section IV-A. The search area is intentionally large so as to include points from adjacent surfaces, thereby affecting the surface normals near edges and corners. By doing so, we differentiate between the angles of normals lying on either plane and those near corners and edges. This concept is illustrated in Fig. 5, in which the effect of using a large search area is demonstrated in the right image.
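A common way to realize such a plane fit is through the eigen-decomposition of the local covariance matrix, taking the eigenvector with the smallest eigenvalue as the normal. The sketch below follows that route under the same array conventions as before; the radius value and the orientation convention are illustrative placeholders rather than values given in the paper.

import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, radius=0.1):
    # Fit a plane to each point's neighborhood and take the direction of
    # least variance (smallest-eigenvalue eigenvector) as the normal.
    # A large radius deliberately mixes adjacent surfaces near edges.
    tree = cKDTree(points)
    normals = np.zeros_like(points)
    for i, pt in enumerate(points):
        idx = tree.query_ball_point(pt, r=radius)
        neigh = points[idx] - points[idx].mean(axis=0)
        _, vecs = np.linalg.eigh(neigh.T @ neigh)   # ascending eigenvalues
        normals[i] = vecs[:, 0]
    # Orient all normals toward the sensor at the origin (a convention).
    flip = np.einsum('ij,ij->i', normals, points) > 0
    normals[flip] *= -1.0
    return normals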

Algorithm 2 Step-by-Step Algorithm of the Proposed 3D Registration and Mapping Method
1: Input: Point cloud (previous frame) P_t, point cloud (current frame) P_s
2: Output: Transformation between frames T, global pose of the camera G_n, map M
3: Filter the point clouds;
4: Uniformly down-sample P_t and P_s to approximately 3,500 points;
5: Calculate the normal vectors at each point in P_t and P_s;
6: Extract ROS features from P_t and P_s as described in Algorithm 1;
7: Assign SHOT descriptors to each feature point in P_t and P_s;
8: Initially match the extracted features between P_t and P_s using their descriptors;
9: Obtain the inliers (good matches) using MSSE and estimate the 6DOF transformation T between the two consecutive frames;
10: Concatenate the estimated transformations to obtain a global pose of the camera G_n;
11: Transform and map (M) the points with respect to a global reference frame;
12: return T, G_n and M;

Fig. 5: Example of estimated surface normals for a subset of points using a small search radius (left) and a large search radius (right). (Courtesy of and permission granted by Radu B. Rusu [22].)

C. 3D Feature Extraction and Matching

The process of constructing a descriptor vector, even for only around 3,500 points, and matching those with their corresponding points from the previous frame is still computationally expensive. To improve the computational efficiency, we need a way to reduce the number of points without losing the information required for registration. To achieve this, we developed an informative sampling scheme based on ranked order statistics (described in Section III). After applying this method, the resulting point cloud contains around 300 to 400 points. A feature descriptor is then computed for each extracted keypoint. We found that SHOT descriptors [21] provided the most accurate matches, while being only slightly more computationally expensive than other descriptors. Matching was performed using a mutual consistency check: the nearest neighbors in the descriptor vector space are found from the source point cloud (at time t) to the target point cloud (at time t-1), and vice versa. Only pairs of corresponding points that are mutually matched to each other are considered as initial correspondences.
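The mutual consistency check itself is independent of the particular descriptor; a minimal sketch, assuming desc_src and desc_tgt are (n, d) descriptor arrays (e.g. SHOT vectors) for the keypoints of the source and target frames:

import numpy as np
from scipy.spatial import cKDTree

def mutual_matches(desc_src, desc_tgt):
    # Nearest neighbor in descriptor space, source -> target and back.
    _, s2t = cKDTree(desc_tgt).query(desc_src)
    _, t2s = cKDTree(desc_src).query(desc_tgt)
    # Keep only pairs that are each other's nearest neighbor.
    return [(i, j) for i, j in enumerate(s2t) if t2s[j] == i]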
D. Inlier Detection and Initial Transformation Estimation

Matching 3D features between consecutive images using their descriptors usually results in many false matches (outliers). To remove the effect of these false matches, we refine the matches using a high breakdown point robust estimator (MSSE [6]). MSSE is an extension of the robust least k-th order statistical estimator and can tolerate a much higher ratio of outliers than RANSAC (we previously showed that MSSE outperforms RANSAC in terms of registration accuracy [7]). MSSE comprises two main steps: first, it fits a given model to a set of data that contains outliers; second, it estimates the scale of noise from these data. In contrast to RANSAC, MSSE is flexible and does not rely on the strong fixed error threshold assumption that forms the basis of the RANSAC algorithm [7]. Instead, MSSE assumes that the minimum number of inliers required to define a structure is known a priori. After setting the minimum number of inliers (ḱ), the implementation of MSSE is very straightforward and proceeds as follows.

One thousand sets of 3-tuples of corresponding initial matches are randomly chosen (although the minimum number of points required to constrain the transformation estimation is 2, we select 3 to obtain a more accurate estimate [23]) and, for each set, the transformation is estimated using the following equation:

\[ \hat{T} = \operatorname*{argmin}_{T} \sum_{i=1}^{A_d} w_i \left\| T(p_s^i) - p_t^i \right\|^2 \tag{5} \]

where T̂ is the estimated transformation, p_s and p_t are the 3D coordinates of the matched feature points of the source and target frames respectively, A_d is the number of correspondences (here A_d = 3) and w_i is a weighting factor that takes into account the distance between the point in the source point cloud and the camera (due to the decrease in depth precision and resolution with increasing distance from the sensor). The estimated transformation is applied to all the features in the first frame, the distances between the transformed feature points and the corresponding points in the target frame are calculated (these are called residuals, r), and their squared values are sorted in ascending order. Having set ḱ, the transformation associated with the least ḱ-th order residual (among the 1000 hypotheses) is chosen as the best initial transformation. Using the residuals of that transformation, the inlier group members are chosen by applying (4) iteratively, starting at k = ḱ and incrementing this value at each iteration until the condition is met. Finally, the transformation is re-calculated using all the obtained inliers.
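Each 3-point hypothesis in (5) admits a closed-form weighted least-squares solution (e.g. via the SVD-based Kabsch method). The sketch below illustrates the hypothesize-and-rank loop under the same assumptions as the earlier snippets; the depth-based weighting w_i is simplified to an illustrative 1/depth placeholder, not the paper's exact weighting.

import numpy as np

def fit_rigid(src, tgt, w):
    # Closed-form weighted solution of equation (5) via SVD (Kabsch).
    w = w / w.sum()
    cs, ct = w @ src, w @ tgt                        # weighted centroids
    H = (src - cs).T @ ((tgt - ct) * w[:, None])     # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    T = np.eye(4)
    T[:3, :3] = Vt.T @ D @ U.T                       # proper rotation (det = +1)
    T[:3, 3] = ct - T[:3, :3] @ cs
    return T

def best_initial_transform(src, tgt, k, n_hyp=1000, rng=np.random):
    # Pick, among n_hyp random 3-tuples, the transformation with the
    # smallest k-th order residual (the least k-th order statistic).
    best_T, best_score = None, np.inf
    for _ in range(n_hyp):
        sel = rng.choice(len(src), 3, replace=False)
        w = 1.0 / src[sel][:, 2]                     # placeholder depth weighting
        T = fit_rigid(src[sel], tgt[sel], w)
        res = np.linalg.norm(src @ T[:3, :3].T + T[:3, 3] - tgt, axis=1)
        score = np.sort(res ** 2)[k - 1]             # k-th sorted squared residual
        if score < best_score:
            best_T, best_score = T, score
    return best_T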

TABLE I: Comparison of the accuracy of 3D registration using different keypoint extraction methods (ROS, SIFT3D, ISS keypoints, covariance sampling and uniform sampling), in terms of translational error (in meters), rotational error (in degrees), extraction time and registration time (both in seconds), and number of points.

E. Global Pose Estimation and Mapping

The transformation calculated in the previous section describes the motion between two RGB-D frames. In order to obtain a global pose of the camera with respect to a fixed reference frame (the initial frame), we concatenate all the transformations up to the current time using:

\[ G_n = G_{n-1} \, T_{n,n-1} \tag{6} \]

where G_{n-1} is the previous global transformation with respect to the pose of the initial reference frame G_0 at n = 1, and T_{n,n-1} is the 6DOF transformation between the current frame and the previous frame. The map is then constructed by aligning the 3D points of the current frame with the previously constructed map (initially, the map consists of the first frame) using the above global pose estimate. Fig. 1 shows a reconstructed 3D map of an office environment using the proposed method.
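With 4 x 4 homogeneous transformation matrices, the concatenation in (6) and the mapping step reduce to a matrix product and a point transform; a minimal sketch under the same array conventions as above:

import numpy as np

G = np.eye(4)            # G_0: the initial reference frame
global_map = []          # frames expressed in the global frame

def integrate_frame(points, T):
    # Equation (6): G_n = G_{n-1} T_{n,n-1}, then map the new frame.
    global G
    G = G @ T
    global_map.append(points @ G[:3, :3].T + G[:3, 3])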
V. EXPERIMENTAL RESULTS

We evaluate the performance of our proposed 3D registration and mapping method by comparing it with state-of-the-art techniques. All methods were implemented on a Dell Latitude E6320, powered by an Intel i processor and 8 GB of RAM, running Ubuntu. The Robot Operating System (ROS Hydro) [24] and PCL 1.7 [16] were used for perception and 3D geometry processing. We used a Microsoft Kinect, operating at a frame rate of 30 fps in VGA resolution mode, to capture the RGB-D data. All experiments were conducted in typical university offices, which contain many texture-less objects (e.g. plain walls). In all of our experiments, the Microsoft Kinect was rotated counter-clockwise until it was back at its original position (360 degrees). During this process, the point clouds were acquired, registered and mapped, and a global pose of the camera was estimated online. We performed 6 trials for each method and averaged the calculated results and errors. We evaluated the accuracy of all methods by calculating the average translational error (in the x, y and z directions) in meters, the average rotational error (roll, pitch and yaw angles) in degrees and the average time of the registration process in seconds.

A. Comparison between keypoint extraction methods

In this experiment, we compared the registration accuracy obtained using our proposed keypoint extraction method and some of the best available alternatives: SIFT3D [16], ISS keypoints [20], geometrically stable points [19] (covariance sampling) and uniform sampling (with a large search radius set so as to obtain around 350 points). For this evaluation, we used the registration procedure explained in Section IV. In addition to the previously mentioned measures, we also compared the average keypoint extraction time (in seconds) and the average number of extracted keypoints. The results of these experiments are outlined in Table I. They show that our proposed method significantly outperforms the other methods in terms of both translational and rotational accuracy. In terms of computational efficiency, our method was slightly slower than ISS keypoints, covariance sampling and uniform sampling. However, this does not significantly affect the overall registration efficiency, as all methods required similar times for registration (covariance sampling was slightly slower, as it requires a larger number of constraining feature points). This is due to the fact that the keypoint extraction step accounts for only around 10% of the total processing time.

B. Sparse keypoints vs. a denser point cloud

In this experiment, we demonstrate the effectiveness of our informative sampling keypoint extraction method by comparing its registration accuracy and computational efficiency with those of a method that uses a much denser point cloud for registration. The point cloud was processed as described in Section IV-A. Table III shows the results of this experiment. Our method achieved very similar accuracy despite using only 10% of the processed point cloud points and 0.11% of the original point cloud, while the computational time for registration was improved significantly (by a factor of 12).

C. Comparison between registration methods

In the final experiment, we evaluate the performance of our registration method by comparing it to state-of-the-art registration techniques. In the first of those methods, SIFT3D keypoints were extracted and matched using SHOT descriptors; SAC-IA [25] (a sampling consensus method similar to RANSAC, except that instead of selecting random samples between the source and target clouds, samples are selected based on the points with the most similar feature histograms) was then applied to reject false matches and estimate an initial transformation, followed by ICP to refine the initial transformation. In the second method, the same procedure is repeated, replacing the ICP refinement with a 3D normal distribution transform (NDT) [26] refinement step. The last method uses the same keypoint extraction and matching, followed by a pre-rejective RANSAC algorithm [27]. This method adds a verification step to the standard RANSAC algorithm which eliminates some false matches by analyzing their geometry. Table II shows the results of this comparison.

TABLE II: Comparison of the accuracy of different 3D registration methods (the proposed registration, SIFT3D+SAC-IA+ICP, SIFT3D+SAC-IA+NDT and SIFT3D+pre-rejective RANSAC), in terms of translational error (m), rotational error (degrees) and extraction time (s).

TABLE III: Sparse vs. dense registration (the proposed registration vs. uniform sampling), in terms of translational error (m), rotational error (degrees), registration time (s) and number of points.

Our proposed method outperforms the other methods both in terms of accuracy and computational efficiency.

VI. CONCLUSIONS

In this paper, we presented a novel real-time 3D registration and mapping system that uses only the depth information provided by RGB-D sensors, without relying on any visual information. As such, our method is well suited to performing localization and mapping in environments with insufficient visual texture information, as long as there is sufficient constraining geometric information in the scene. Since registration using the dense point cloud is a computationally expensive operation, we proposed a novel sampling scheme that informatively selects the points carrying the most useful information. We showed that our proposed registration method outperforms other well known registration methods in terms of accuracy and computational efficiency. Having said that, depth-only registration methods struggle when registering scenes with very little structure; as such, we plan to aid the depth information with visual information when available. Another aspect our method struggled with was registration in the presence of multiple motions (such as people walking in the scene). Since we use a high-breakdown estimator in our method, we believe it should be possible to segment different motions explicitly and register them separately. This is part of our future work.

ACKNOWLEDGMENT

The first author would like to acknowledge the financial support provided by the Australian Postgraduate Award (APA) scholarship.

REFERENCES

[1] H. Durrant-Whyte, D. Rye, and E. Nebot, "Localization of autonomous guided vehicles," Robotics Research: International Symposium, vol. 7.
[2] S. Thrun, "Robotic mapping: A survey," Exploring Artificial Intelligence in the New Millennium, pp. 1-35.
[3] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2011.
[4] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments," The International Journal of Robotics Research, vol. 31, no. 5, 2012.
[5] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard, "An evaluation of the RGB-D SLAM system," in 2012 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2012.
[6] A. Bab-Hadiashar and D. Suter, "Robust segmentation of visual data using ranked unbiased scale estimate," Robotica, vol. 17, no. 6, 1999.
[7] K. Yousif, A. Bab-Hadiashar, and R. Hoseinnezhad, "3D registration in dark environments using RGB-D cameras," in 2013 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2013.
[8] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Computer Vision - ECCV 2006.
[9] A. Segal, D. Haehnel, and S. Thrun, "Generalized-ICP," in Robotics: Science and Systems, vol. 2, 2009, p. 4.
[10] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Computer Vision - ECCV 2006.
[11] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011.
[12] H. Du, P. Henry, X. Ren, M. Cheng, D. Goldman, S. Seitz, and D. Fox, "Interactive 3D modeling of indoor environments with a consumer depth camera," in Proceedings of the 13th International Conference on Ubiquitous Computing. ACM, 2011.
[13] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald, "Kintinuous: Spatially extended KinectFusion," 2012.
[14] G. Hu, S. Huang, L. Zhao, A. Alempijevic, and G. Dissanayake, "A robust RGB-D SLAM algorithm," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012.
[15] C. Harris and M. Stephens, "A combined corner and edge detector," in Alvey Vision Conference, vol. 15. Manchester, UK, 1988, p. 50.
[16] R. B. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in 2011 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2011.
[17] S. Filipe and L. A. Alexandre, "A comparative evaluation of 3D keypoint detectors."
[18] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, 2004.
[19] N. Gelfand, L. Ikemoto, S. Rusinkiewicz, and M. Levoy, "Geometrically stable sampling for the ICP algorithm," in Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM). IEEE, 2003.
[20] Y. Zhong, "Intrinsic shape signatures: A shape descriptor for 3D object recognition," in 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2009.
[21] F. Tombari, S. Salti, and L. Di Stefano, "Unique signatures of histograms for local surface description," in Computer Vision - ECCV 2010. Springer, 2010.
[22] R. B. Rusu, "Semantic 3D object maps for everyday manipulation in human living environments," KI - Künstliche Intelligenz, vol. 24, no. 4, 2010.
[23] F. Fraundorfer and D. Scaramuzza, "Visual odometry: Part II: Matching, robustness, optimization, and applications," IEEE Robotics & Automation Magazine, vol. 19, no. 2, 2012.
[24] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: An open-source robot operating system," in ICRA Workshop on Open Source Software, 2009.
[25] R. B. Rusu, N. Blodow, and M. Beetz, "Fast point feature histograms (FPFH) for 3D registration," in 2009 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2009.
[26] M. Magnusson, A. Lilienthal, and T. Duckett, "Scan registration for autonomous mining vehicles using 3D-NDT," Journal of Field Robotics, vol. 24, no. 10, 2007.
[27] A. G. Buch, D. Kraft, J.-K. Kämäräinen, H. G. Petersen, and N. Krüger, "Pose estimation using local structure-specific shape and appearance context," in ICRA (accepted), 2013.


Personal Navigation and Indoor Mapping: Performance Characterization of Kinect Sensor-based Trajectory Recovery Personal Navigation and Indoor Mapping: Performance Characterization of Kinect Sensor-based Trajectory Recovery 1 Charles TOTH, 1 Dorota BRZEZINSKA, USA 2 Allison KEALY, Australia, 3 Guenther RETSCHER,

More information

Visual Bearing-Only Simultaneous Localization and Mapping with Improved Feature Matching

Visual Bearing-Only Simultaneous Localization and Mapping with Improved Feature Matching Visual Bearing-Only Simultaneous Localization and Mapping with Improved Feature Matching Hauke Strasdat, Cyrill Stachniss, Maren Bennewitz, and Wolfram Burgard Computer Science Institute, University of

More information

Automatic Image Alignment (feature-based)

Automatic Image Alignment (feature-based) Automatic Image Alignment (feature-based) Mike Nese with a lot of slides stolen from Steve Seitz and Rick Szeliski 15-463: Computational Photography Alexei Efros, CMU, Fall 2006 Today s lecture Feature

More information

Instance-level recognition

Instance-level recognition Instance-level recognition 1) Local invariant features 2) Matching and recognition with local features 3) Efficient visual search 4) Very large scale indexing Matching of descriptors Matching and 3D reconstruction

More information

arxiv: v2 [cs.ro] 19 Sep 2016

arxiv: v2 [cs.ro] 19 Sep 2016 : A Novel Rotation, Illumination, Scale Invariant Appearance and Shape Feature Kanzhi Wu, Xiaoyang Li 2, Ravindra Ranasinghe, Gamini Dissanayake and Yong Liu 2 arxiv:6.44v2 [cs.ro] 9 Sep 26 Abstract This

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information

Feature Detection and Matching

Feature Detection and Matching and Matching CS4243 Computer Vision and Pattern Recognition Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (CS4243) Camera Models 1 /

More information

BRAND: A Robust Appearance and Depth Descriptor for RGB-D Images

BRAND: A Robust Appearance and Depth Descriptor for RGB-D Images BRAND: A Robust Appearance and Depth Descriptor for RGB-D Images Erickson R. Nascimento Gabriel L. Oliveira Mario F. M. Campos Antônio W. Vieira William Robson Schwartz Abstract This work introduces a

More information

A Systems View of Large- Scale 3D Reconstruction

A Systems View of Large- Scale 3D Reconstruction Lecture 23: A Systems View of Large- Scale 3D Reconstruction Visual Computing Systems Goals and motivation Construct a detailed 3D model of the world from unstructured photographs (e.g., Flickr, Facebook)

More information

Determinant of homography-matrix-based multiple-object recognition

Determinant of homography-matrix-based multiple-object recognition Determinant of homography-matrix-based multiple-object recognition 1 Nagachetan Bangalore, Madhu Kiran, Anil Suryaprakash Visio Ingenii Limited F2-F3 Maxet House Liverpool Road Luton, LU1 1RS United Kingdom

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

3D Perception. CS 4495 Computer Vision K. Hawkins. CS 4495 Computer Vision. 3D Perception. Kelsey Hawkins Robotics

3D Perception. CS 4495 Computer Vision K. Hawkins. CS 4495 Computer Vision. 3D Perception. Kelsey Hawkins Robotics CS 4495 Computer Vision Kelsey Hawkins Robotics Motivation What do animals, people, and robots want to do with vision? Detect and recognize objects/landmarks Find location of objects with respect to themselves

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

A New Approach For 3D Image Reconstruction From Multiple Images

A New Approach For 3D Image Reconstruction From Multiple Images International Journal of Electronics Engineering Research. ISSN 0975-6450 Volume 9, Number 4 (2017) pp. 569-574 Research India Publications http://www.ripublication.com A New Approach For 3D Image Reconstruction

More information

Instance-level recognition

Instance-level recognition Instance-level recognition 1) Local invariant features 2) Matching and recognition with local features 3) Efficient visual search 4) Very large scale indexing Matching of descriptors Matching and 3D reconstruction

More information

A System of Image Matching and 3D Reconstruction

A System of Image Matching and 3D Reconstruction A System of Image Matching and 3D Reconstruction CS231A Project Report 1. Introduction Xianfeng Rui Given thousands of unordered images of photos with a variety of scenes in your gallery, you will find

More information

K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors

K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors Shao-Tzu Huang, Chen-Chien Hsu, Wei-Yen Wang International Science Index, Electrical and Computer Engineering waset.org/publication/0007607

More information

IRON: A Fast Interest Point Descriptor for Robust NDT-Map Matching and its Application to Robot Localization*

IRON: A Fast Interest Point Descriptor for Robust NDT-Map Matching and its Application to Robot Localization* IRON: A Fast Interest Point Descriptor for Robust NDT-Map Matching and its Application to Robot Localization* Thomas Schmiedel, Erik Einhorn and Horst-Michael Gross 1 Abstract This work introduces the

More information

Incremental compact 3D maps of planar patches from RGBD points

Incremental compact 3D maps of planar patches from RGBD points Incremental compact 3D maps of planar patches from RGBD points Juan Navarro and José M. Cañas Universidad Rey Juan Carlos, Spain Abstract. The RGBD sensors have opened the door to low cost perception capabilities

More information

Department of Electrical and Electronic Engineering, University of Peradeniya, KY 20400, Sri Lanka

Department of Electrical and Electronic Engineering, University of Peradeniya, KY 20400, Sri Lanka WIT: Window Intensity Test Detector and Descriptor T.W.U.Madhushani, D.H.S.Maithripala, J.V.Wijayakulasooriya Postgraduate and Research Unit, Sri Lanka Technological Campus, CO 10500, Sri Lanka. Department

More information

Data-driven Depth Inference from a Single Still Image

Data-driven Depth Inference from a Single Still Image Data-driven Depth Inference from a Single Still Image Kyunghee Kim Computer Science Department Stanford University kyunghee.kim@stanford.edu Abstract Given an indoor image, how to recover its depth information

More information

Instance-level recognition part 2

Instance-level recognition part 2 Visual Recognition and Machine Learning Summer School Paris 2011 Instance-level recognition part 2 Josef Sivic http://www.di.ens.fr/~josef INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire d Informatique,

More information

Fast Image Matching Using Multi-level Texture Descriptor

Fast Image Matching Using Multi-level Texture Descriptor Fast Image Matching Using Multi-level Texture Descriptor Hui-Fuang Ng *, Chih-Yang Lin #, and Tatenda Muindisi * Department of Computer Science, Universiti Tunku Abdul Rahman, Malaysia. E-mail: nghf@utar.edu.my

More information

BendIT An Interactive Game with two Robots

BendIT An Interactive Game with two Robots BendIT An Interactive Game with two Robots Tim Niemueller, Stefan Schiffer, Albert Helligrath, Safoura Rezapour Lakani, and Gerhard Lakemeyer Knowledge-based Systems Group RWTH Aachen University, Aachen,

More information

CSC494: Individual Project in Computer Science

CSC494: Individual Project in Computer Science CSC494: Individual Project in Computer Science Seyed Kamyar Seyed Ghasemipour Department of Computer Science University of Toronto kamyar@cs.toronto.edu Abstract One of the most important tasks in computer

More information

RGB-D Edge Detection and Edge-based Registration

RGB-D Edge Detection and Edge-based Registration 23 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) November 3-7, 23. Tokyo, Japan RGB-D Edge Detection and Edge-based Registration Changhyun Choi, Alexander J. B. Trevor, and

More information

Large Scale 3D Reconstruction by Structure from Motion

Large Scale 3D Reconstruction by Structure from Motion Large Scale 3D Reconstruction by Structure from Motion Devin Guillory Ziang Xie CS 331B 7 October 2013 Overview Rome wasn t built in a day Overview of SfM Building Rome in a Day Building Rome on a Cloudless

More information

Visual Odometry. Features, Tracking, Essential Matrix, and RANSAC. Stephan Weiss Computer Vision Group NASA-JPL / CalTech

Visual Odometry. Features, Tracking, Essential Matrix, and RANSAC. Stephan Weiss Computer Vision Group NASA-JPL / CalTech Visual Odometry Features, Tracking, Essential Matrix, and RANSAC Stephan Weiss Computer Vision Group NASA-JPL / CalTech Stephan.Weiss@ieee.org (c) 2013. Government sponsorship acknowledged. Outline The

More information

Memory Management Method for 3D Scanner Using GPGPU

Memory Management Method for 3D Scanner Using GPGPU GPGPU 3D 1 2 KinectFusion GPGPU 3D., GPU., GPGPU Octree. GPU,,. Memory Management Method for 3D Scanner Using GPGPU TATSUYA MATSUMOTO 1 SATORU FUJITA 2 This paper proposes an efficient memory management

More information

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Intelligent Control Systems Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Shingo Kagami Graduate School of Information Sciences, Tohoku University swk(at)ic.is.tohoku.ac.jp http://www.ic.is.tohoku.ac.jp/ja/swk/

More information

Motion illusion, rotating snakes

Motion illusion, rotating snakes Motion illusion, rotating snakes Local features: main components 1) Detection: Find a set of distinctive key points. 2) Description: Extract feature descriptor around each interest point as vector. x 1

More information

Building a Panorama. Matching features. Matching with Features. How do we build a panorama? Computational Photography, 6.882

Building a Panorama. Matching features. Matching with Features. How do we build a panorama? Computational Photography, 6.882 Matching features Building a Panorama Computational Photography, 6.88 Prof. Bill Freeman April 11, 006 Image and shape descriptors: Harris corner detectors and SIFT features. Suggested readings: Mikolajczyk

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

Appearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization

Appearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization Appearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization Jung H. Oh, Gyuho Eoh, and Beom H. Lee Electrical and Computer Engineering, Seoul National University,

More information

Application questions. Theoretical questions

Application questions. Theoretical questions The oral exam will last 30 minutes and will consist of one application question followed by two theoretical questions. Please find below a non exhaustive list of possible application questions. The list

More information