Kinect 3D Reconstruction
Mahsa Mohammadkhani
Computing Science, University of Alberta, Edmonton, Canada

Abstract — This report presents a Kinect-based framework that can reconstruct a scene or object either in real time or offline, according to the user's preference. The user can capture images of the scene or object from different viewpoints, or capture all the viewpoints as a video sequence and give it to the framework as input to obtain the 3D model. Through a simple, user-friendly interface, the user can observe the RGB color and depth of the scene or object and then start creating its 3D model. The resulting reconstruction is produced as a textured point cloud.

Keywords: 3D reconstruction, Kinect, scene reconstruction, object reconstruction.

I. INTRODUCTION

Kinect [1, 2, 3] is a depth camera that has become an important 3D sensor. Its low cost, reliability, and speed in measuring depth have made it one of the primary 3D measuring devices for 3D scene reconstruction [4, 5, 6] and object recognition [7]. In computer vision there are several well-studied methods for multi-view matching and reconstruction that aim to reconstruct complicated real indoor and outdoor scenes and objects with an accurate representation; nevertheless, this remains an open research problem. Methods that reconstruct from a single image commonly use cues such as shading, silhouette shape, texture, and vanishing points, but these schemes place constraints on the features of the reconstructed objects and restrict the allowable 3D models. Moreover, there are various approaches in computer vision and graphics that use active sensors [8], passive RGB cameras [9, 10], online images [11, 12], or image matching [13].
There is also research in the augmented reality and robotics communities on Simultaneous Localization and Mapping (SLAM), which builds a map of the surroundings while the user or robot moves around. In addition, some existing reconstruction systems support offline tracking [13]: they detect image features and match them so as to find corresponding features using feature-matching algorithms. In [5, 6], a new GPU-based implementation is introduced by a Microsoft research group. They used Kinect to capture an indoor environment, and their application reconstructs it using the depth values provided by Kinect. There is also software called ReconstructMe [14] that can reconstruct a real 3D model by moving a Kinect around the desired object; the result of this application can be seen in Fig. 1. However, the available methods use GPU implementations that require a very powerful graphics card, so not every computer can execute and test these programs. Moreover, they represent the model in mesh or surfel formats with no texture on it. Given this broad topic, I decided to implement a framework that can reconstruct a scene or object using Kinect in such a way that the program runs on all types of desktop computers. The model can also be textured, because Kinect captures both the depth and the RGB color of an image. To reduce the amount of processing during reconstruction, I added a feature that lets the user choose appropriate viewpoints of a scene, take a picture from each, and then give
them as input to the framework to start reconstructing the scene or object.

Figure 1. ReconstructMe result [15]. This model is my 3D reconstruction.

II. RELATED WORK

Multi-view matching and reconstruction is a key element in acquiring object and scene models from several images or video sequences; this process is called image-based modeling. Multi-view matching algorithms can be categorized into four groups depending on how they represent the final model. The first is voxel-based approaches, used by many methods such as [15], [16], and [17]. The second group uses deformable polygonal meshes, which need a good initialization such as a visual hull model [18] for the optimization process (a limitation of these algorithms). Some approaches are based on multiple depth maps, such as [19], and the last category comprises patch-based techniques such as [20] and [21]. These methods use small patches such as surfels to represent scene surfaces, and they output the reconstructed model or scene as a point cloud, which needs post-processing to be converted to a mesh format.

While most 3D reconstruction efforts are based on images or video sequences, with the emergence of the Kinect camera, a sensor-based device, some researchers decided to use the new depth data that Kinect can provide. Kinect consists of an RGB camera and an infrared sensor that produces real depth information about the surrounding scene. However, the depth is quite noisy and needs filtering to yield better information. Several methods try to use the depth information from Kinect to build a 3D reconstruction of an object. One piece of software that performs 3D scene reconstruction is KinectFusion [5, 6], which captures a scene by moving the Kinect around.
First, it applies filters to the depth data produced in real time by the Kinect. Then it finds vertices and normals for the surface of the 3D model. The method uses the iterative closest point (ICP) algorithm [22] to find corresponding points between frames of the incoming video sequence. It then uses a ray-casting process to obtain new vertices and normals from the scene and provide new information for updating it; in each iteration the algorithm updates the result, adding more information to it. The implementation is GPU-based and needs a very powerful CUDA GPU (such as an NVIDIA GeForce GTX 560 or higher) to work well, which limits this software to powerful systems that are not available everywhere. Although KinectFusion is a real-time application, its method is very expensive and complex.

Methods referred to as multi-view stereo (MVS) reconstruction use multiple images to reconstruct 3D models. They rely on different techniques, such as using the visual hull of the object, reproducing its photo hull, or optimizing a cost function over the surface shape [23]. Another reconstruction method, the visual hull, uses the silhouette information of an object and intersects visual cones to generate a 3D representation of the object. The method introduced in [24] uses the visual hull in multi-camera environments and is popular for its simplicity and efficiency. Several methods build on it, such as [25], which has a parallel processing pipeline, fuses multi-view silhouette cues in a space occupancy grid, and recovers shape by means of rotating slicing planes.
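The frame-to-frame alignment that ICP performs can be illustrated compactly. The following Python/NumPy sketch is purely illustrative (KinectFusion's actual implementation is GPU-based and uses a point-to-plane error metric): it alternates brute-force nearest-neighbour matching with a closed-form SVD (Kabsch) alignment.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst,
    assuming row i of src corresponds to row i of dst (Kabsch/SVD)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, c_dst - R @ c_src

def icp(src, dst, iters=30):
    """Minimal point-to-point ICP aligning cloud src onto cloud dst."""
    cur = src.copy()
    for _ in range(iters):
        # brute-force nearest neighbour in dst for every point of cur
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matches = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matches)
        cur = cur @ R.T + t
    return cur
```

Real systems replace the brute-force search with a k-d tree or, as in KinectFusion, with projective data association, since the quadratic search dominates the cost.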
Furthermore, space-carving reconstruction [26] repeatedly sweeps a plane through the scene volume, testing the photo-consistency of each voxel on the current plane. Because the method is voxel-dependent, if a voxel is removed by error, the other voxels will be removed in a cascading way; the resulting 3D reconstruction has holes in it, which makes the result less acceptable. Using surface-integral minimization, one can optimize the surface integral of a consistency function of the surface shape; level-set methods like [27] provide a way to minimize this cost function. By combining 2D images and 3D data, one can reconstruct a surface as in [28]; this approach is more robust than methods that use 2D information or 3D stereo information only. Another reconstruction method is the marching cubes algorithm, which creates a polygonal representation of surfaces from 3D data [29]; it relies on conventional graphics-rendering algorithms, implemented in software or hardware, to display its result. Some reconstruction methods are feature-based [30] and use edge information. Approaches based on constructing a mesh gradually refine the mesh surface in each iteration until it represents the object surface well. One advantage of feature-based surface reconstruction schemes is that they can represent the scene model more efficiently than stereo matching algorithms that use images or quasi-dense reconstruction.

III. METHOD

In this section I describe how I implemented the framework for this project. The libraries I used are the Point Cloud Library (PCL) [31], Nestk [32], and OpenCV. Nestk is a library mainly created for the RGBDemo application [33], on which I based my framework.
I changed the library code according to my needs in order to achieve the final goal of this project. One of the main challenges was installing these libraries on my laptop. First I tried to install them on Windows, but I encountered an error I could not solve, so I installed them on Linux (Ubuntu 10.04), where I finally succeeded. Unfortunately, some of these libraries lack proper documentation, so I learned how to use them by reading through the code.

A. Theory: Description of the Method

1) Input: The input for this framework is prepared using Kinect. The framework is designed so that the Kinect device faces down to capture a scene or object from above; I made a ramp for this purpose (Fig. 2). If one wants to capture all aspects of a scene or object, Kinect should face down. The program produces acceptable results when there is little shake while capturing, so the user should move the Kinect slowly with minimal vibration.

Figure 2. The position of Kinect while capturing images or video.

2) Camera Calibration: Because the Kinect RGB camera and depth sensor are not in the same position (Fig. 3) [34], we need to map and calibrate the depth data to the RGB frame. For the translation and rotation matrices, we can use the calibration information available in OpenNI [35]. However, for more precise work such as robotic vision, one can calibrate one's own Kinect. For this project, I used the calibration information in OpenNI.

Figure 3. Kinect elements.
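In code, this mapping back-projects a depth pixel with the depth camera's intrinsics, moves the 3D point into the RGB camera's frame with the rotation and translation, and reprojects it with the RGB intrinsics. The numbers below are made-up but plausible placeholders; the real values come from OpenNI or a per-device calibration.

```python
import numpy as np

# Hypothetical intrinsics and extrinsics (placeholders, not calibrated values).
K_DEPTH = np.array([[580.0, 0.0, 320.0], [0.0, 580.0, 240.0], [0.0, 0.0, 1.0]])
K_RGB   = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)                      # rotation: depth frame -> RGB frame
T = np.array([0.025, 0.0, 0.0])    # ~2.5 cm baseline between the two sensors

def depth_pixel_to_rgb(u, v, z):
    """Map depth pixel (u, v) with depth z (metres) to RGB image coordinates."""
    p_depth = z * (np.linalg.inv(K_DEPTH) @ np.array([u, v, 1.0]))  # back-project
    p_rgb = R @ p_depth + T                                         # change frame
    x, y, w = K_RGB @ p_rgb                                         # re-project
    return x / w, y / w
```

For the centre pixel at 1 m, the column shifts by the projected baseline, 525 × 0.025 = 13.125 pixels, which is why depth and colour must be registered before texturing.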
3) Extracting RGB Features: In this step the program extracts features from the series of images captured over time; each time a new image arrives, the program repeats the same scenario for it. To extract features, it first uses a detector to find the features within the images, then applies a descriptor algorithm to find the best matches between two images.

4) Finding the Closest View and Best Matches: By means of feature matching, the program finds the closest view among all the image data collected so far.

5) Optimization: After finding the closest view, we need the transformation matrix between the closest view and the current view, so we need a good optimizer to minimize the error between the two images. For this purpose, I compared the two optimization algorithms below. The result of the optimization is accepted if the error is less than a threshold; otherwise, the current image is discarded.

a) RANSAC (RANdom SAmple Consensus): This algorithm is useful for 2D RGB images; it iteratively finds corresponding points (features, in this project) and minimizes the error between matching points (Fig. 4).

b) ICP (Iterative Closest Point): This algorithm can be used for point clouds; it finds corresponding points iteratively.

6) Rendering 3D Points: Having found the transformation matrix between the current view and the closest view, the program computes the normal for each point, and some filtering is executed in order to render the applicable points in the 3D viewer (Fig. 5).

B. Comparison

Many algorithms have been introduced for detecting features; I checked whether they are useful for 3D reconstruction of a scene or object using Kinect. As the resolution of the Kinect camera is limited, the quality of the images is not very high.
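Before comparing detectors, it helps to fix the matching rule they all feed into. The matching in steps 3 and 4 reduces to nearest-neighbour search in descriptor space; a standard acceptance rule is Lowe's ratio test, applied below with brute-force distances. This is a generic NumPy illustration, not the framework's exact matcher (which relies on OpenCV).

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.75):
    """Brute-force descriptor matching with the ratio test: accept a match
    only when the best distance is clearly smaller than the second best."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc1))
    keep = d[rows, best] < ratio * d[rows, second]
    return [(int(i), int(best[i])) for i in rows if keep[i]]
```

Ambiguous features, whose two best candidates are nearly equally distant, are discarded, which matters for low-resolution Kinect images.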
Therefore, we should compare the available algorithms to see whether they are useful for extracting features from these images. I used FAST (Features from Accelerated Segment Test) [36], SIFT (Scale-Invariant Feature Transform) [37], and SURF (Speeded-Up Robust Features) [38]. FAST is useful for corner detection; SIFT is useful for local features; and SURF is similar to SIFT but more robust to different image transformations. After detecting the features in the images, we need to find the best matches between two images, so as descriptors I compared SIFT, SURF, and BRIEF (Binary Robust Independent Elementary Features) [39]. The results of this comparison can be found in Table 1 (Fig. 5).

Table 1. Comparison of detector and descriptor for feature matching.

Detector   | Descriptor | Result
FAST/SURF  | BRIEF      | Slow / incomplete reconstruction / wrong matching
SURF       | SURF       | Fast / good result
SIFT/FAST  | SIFT/SURF  | Slow / incomplete reconstruction

As Table 1 shows, the SURF detector with the SURF descriptor is the best option for this project, so I used this combination for feature matching in scene and object reconstruction.

Figure 4. RANSAC 3D representation.

C. Final Reconstruction

The program is able to reconstruct the 3D model of a scene or object according to the user's preference.

1) Scene Reconstruction: The program reconstructs a point cloud representation of the scene using the method described earlier.
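Since Kinect provides a depth value at each matched feature, the optimization of step 5 can treat the matches as 3D–3D correspondences and fit a rigid transform with RANSAC: sample three matches, solve in closed form, count inliers, and refit on the best inlier set. The sketch below illustrates this idea; it is not the framework's exact estimator.

```python
import numpy as np

def kabsch(src, dst):
    """Closed-form least-squares rigid transform mapping src onto dst."""
    c_s, c_d = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - c_s).T @ (dst - c_d))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, c_d - R @ c_s

def ransac_rigid(src, dst, iters=200, thresh=0.02, rng=None):
    """RANSAC over putative 3D matches (row i of src matches row i of dst)."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)   # minimal sample
        R, t = kabsch(src[idx], dst[idx])
        err = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = err < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    R, t = kabsch(src[best_inliers], dst[best_inliers])     # final refit
    return R, t, best_inliers
```

The inlier threshold (2 cm here) would in practice be tuned to the Kinect's depth noise, which grows with distance.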
2) Object Reconstruction: The program reconstructs an object by capturing images of it from different viewpoints and gradually building up the model. It detects the object first in the 2D image and then in 3D space, so points that do not belong to the object are not reconstructed.

Figure 5. The top image shows the result of SIFT-SIFT matching; the second image shows the wrong matching caused by BRIEF.

IV. EXPERIMENTAL RESULTS

A. Final Results for Scene Reconstruction: The result for scene reconstruction can be seen in Fig. 6.

B. Final Results for Object Reconstruction: The result for object reconstruction can be seen in Fig. 7.

Figure 6. Scene reconstruction using ICP.

Figure 7. Object reconstruction.

V. CONCLUSION AND FUTURE WORK

In this project, we presented a method that uses Kinect and its depth data to reconstruct a 3D model of a scene or object. The program simply captures some images, or all the viewpoints, of the scene or object. As future work, this framework could be extended so that the algorithm becomes more robust to movement of the Kinect while acquiring data from a scene or object.

ACKNOWLEDGMENT

I thank Dr. Martin Jagersand for his helpful advice in defining the course project.

REFERENCES
[1] Microsoft: Kinect for X-BOX (2010).
[2] Wikipedia: Kinect.
[3] Freedman, B., Shpunt, A., Machline, M., Arieli, Y.: Depth mapping using projected patterns. US Patent (2010).
[4] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: using Kinect-style depth cameras for dense 3D modeling of indoor environments. Int. J. Robot. Res. (2012).
[5] Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., Fitzgibbon, A.: KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera.
In Proceedings of the 24th annual ACM symposium on User interface software
and technology (UIST '11). ACM, New York, NY, USA.
[6] Newcombe, R.A., Davison, A.J., Izadi, S., Kohli, P., Hilliges, O., Shotton, J., Molyneaux, D., Hodges, S., Kim, D., Fitzgibbon, A.: KinectFusion: real-time dense surface mapping and tracking. In: IEEE International Symposium on Mixed and Augmented Reality (ISMAR).
[7] Lai, K., Bo, L., Ren, X., Fox, D.: Sparse distance learning for object recognition combining RGB and depth information. In: IEEE International Conference on Robotics and Automation (2011).
[8] Levoy, M., et al.: The digital Michelangelo project: 3D scanning of large statues. ACM Trans. Graph.
[9] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press.
[10] Merrell, P., et al.: Real-time visibility-based fusion of depth maps. In: Proc. Int. Conf. on Computer Vision (ICCV).
[11] Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proc. Eurographics Symposium on Geometry Processing.
[12] Zhou, K., Gong, M., Huang, X., Guo, B.: Data-parallel octrees for surface reconstruction. IEEE Trans. on Visualization and Computer Graphics 17.
[13] Frahm, J., et al.: Building Rome on a cloudless day. In: Proc. European Conf. on Computer Vision (ECCV).
[14] ReconstructMe.
[15] Faugeras, O., Keriven, R.: Variational principles, surface evolution, PDEs, level set methods and the stereo problem. IEEE Trans. Image Processing 7(3).
[16] Paris, S., Sillion, F., Quan, L.: A surface reconstruction method using global graph cut optimization. In: ACCV.
[17] Pons, J.-P., Keriven, R., Faugeras, O.: Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. IJCV 72(2).
[18] Baumgart, B.: Geometric modeling for computer vision. Ph.D. dissertation, Stanford University.
[19] D. Bradley, T. Boubekeur, and W.
Heidrich, Accurate multi-view reconstruction using robust binocular stereo and surface meshing. In: CVPR.
[20] Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(8).
[21] Kobbelt, L., Botsch, M.: A survey of point-based techniques in computer graphics. Computers & Graphics 28(6).
[22] Besl, P., McKay, N.: A method for registration of 3-D shapes.
[23] Herrera C., D., Kannala, J., Heikkilä, J.: Joint depth and color camera calibration with distortion correction. IEEE Trans. Pattern Analysis and Machine Intelligence 34(10), 2058-2064.
[24] Pons, J., Keriven, R., Faugeras, O.: Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. IJCV, Springer, 72(2) (2007).
[25] Grauman, K., Shakhnarovich, G., Darrell, T.: A Bayesian approach to image-based visual hull reconstruction. In: CVPR, 187-194 (2003).
[26] Franco, J., Boyer, E.: Fusion of multi-view silhouette cues using a space occupancy grid. In: ICCV, vol. 2 (2005).
[27] Kutulakos, K., Seitz, S.: A theory of shape by space carving. International Journal of Computer Vision 38(3), 199-218 (2000).
[28] Osher, S., Sethian, J.: Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi equations. J. of Computational Physics 79, 12-49 (1988).
[29] Lhuillier, M., Quan, L.: A quasi-dense approach to surface reconstruction from uncalibrated images. TPAMI 27(3), 418-433 (2005).
[30] Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface construction algorithm. ACM.
[31] Point Cloud Library (PCL).
[32] Nestk.
[33] RGBDemo.
[34] Herrera C., D., Kannala, J., Heikkilä, J.: Joint depth and color camera calibration with distortion correction. IEEE Trans. Pattern Analysis and Machine Intelligence 34(10), 2058-2064.
[35] OpenNI.
[36] E. Rosten and T. Drummond (2006).
Machine learning for high-speed corner detection. In: European Conference on Computer Vision.
[37] Lowe, D.G.: Object recognition from local scale-invariant features. In: Proc. International Conference on Computer Vision, vol. 2 (1999).
[38] Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU) 110(3).
[39] Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: binary robust independent elementary features. In: Proc. 11th European Conference on Computer Vision (ECCV'10), Part IV. Springer-Verlag, Berlin, Heidelberg.
More informationIntrinsic3D: High-Quality 3D Reconstruction by Joint Appearance and Geometry Optimization with Spatially-Varying Lighting
Intrinsic3D: High-Quality 3D Reconstruction by Joint Appearance and Geometry Optimization with Spatially-Varying Lighting R. Maier 1,2, K. Kim 1, D. Cremers 2, J. Kautz 1, M. Nießner 2,3 Fusion Ours 1
More informationCS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching
Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix
More informationLocal Patch Descriptors
Local Patch Descriptors Slides courtesy of Steve Seitz and Larry Zitnick CSE 803 1 How do we describe an image patch? How do we describe an image patch? Patches with similar content should have similar
More informationComputer Vision Lecture 17
Announcements Computer Vision Lecture 17 Epipolar Geometry & Stereo Basics Seminar in the summer semester Current Topics in Computer Vision and Machine Learning Block seminar, presentations in 1 st week
More informationWhen Can We Use KinectFusion for Ground Truth Acquisition?
When Can We Use KinectFusion for Ground Truth Acquisition? Stephan Meister 1, Shahram Izadi 2, Pushmeet Kohli 3, Martin Hämmerle 4, Carsten Rother 5 and Daniel Kondermann 6 Abstract KinectFusion is a method
More informationVirtualized Reality Using Depth Camera Point Clouds
Virtualized Reality Using Depth Camera Point Clouds Jordan Cazamias Stanford University jaycaz@stanford.edu Abhilash Sunder Raj Stanford University abhisr@stanford.edu Abstract We explored various ways
More informationA New Approach For 3D Image Reconstruction From Multiple Images
International Journal of Electronics Engineering Research. ISSN 0975-6450 Volume 9, Number 4 (2017) pp. 569-574 Research India Publications http://www.ripublication.com A New Approach For 3D Image Reconstruction
More informationAppearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization
Appearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization Jung H. Oh, Gyuho Eoh, and Beom H. Lee Electrical and Computer Engineering, Seoul National University,
More informationPassive 3D Photography
SIGGRAPH 2000 Course on 3D Photography Passive 3D Photography Steve Seitz Carnegie Mellon University University of Washington http://www.cs cs.cmu.edu/~ /~seitz Visual Cues Shading Merle Norman Cosmetics,
More informationMulti-View Reconstruction Preserving Weakly-Supported Surfaces
Multi-View Reconstruction Preserving Weakly-Supported Surfaces Michal Jancosek and Tomas Pajdla Center for Machine Perception, Department of Cybernetics Faculty of Elec. Eng., Czech Technical University
More informationCS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching
Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix
More information3D Editing System for Captured Real Scenes
3D Editing System for Captured Real Scenes Inwoo Ha, Yong Beom Lee and James D.K. Kim Samsung Advanced Institute of Technology, Youngin, South Korea E-mail: {iw.ha, leey, jamesdk.kim}@samsung.com Tel:
More informationMultiple View Depth Generation Based on 3D Scene Reconstruction Using Heterogeneous Cameras
https://doi.org/0.5/issn.70-7.07.7.coimg- 07, Society for Imaging Science and Technology Multiple View Generation Based on D Scene Reconstruction Using Heterogeneous Cameras Dong-Won Shin and Yo-Sung Ho
More informationRemoving Moving Objects from Point Cloud Scenes
Removing Moving Objects from Point Cloud Scenes Krystof Litomisky and Bir Bhanu University of California, Riverside krystof@litomisky.com, bhanu@ee.ucr.edu Abstract. Three-dimensional simultaneous localization
More informationLarge Scale 3D Reconstruction by Structure from Motion
Large Scale 3D Reconstruction by Structure from Motion Devin Guillory Ziang Xie CS 331B 7 October 2013 Overview Rome wasn t built in a day Overview of SfM Building Rome in a Day Building Rome on a Cloudless
More informationA Stable and Accurate Marker-less Augmented Reality Registration Method
A Stable and Accurate Marker-less Augmented Reality Registration Method Qing Hong Gao College of Electronic and Information Xian Polytechnic University, Xian, China Long Chen Faculty of Science and Technology
More information3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.
3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction
More informationStructured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov
Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter
More informationarxiv: v1 [cs.cv] 28 Sep 2018
Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,
More informationQuasi-Dense Wide Baseline Matching Using Match Propagation
Quasi-Dense Wide Baseline Matching Using Match Propagation Juho Kannala and Sami S. Brandt Machine Vision Group University of Oulu, Finland {jkannala,sbrandt}@ee.oulu.fi Abstract In this paper we propose
More informationReconstruction of 3D Models Consisting of Line Segments
Reconstruction of 3D Models Consisting of Line Segments Naoto Ienaga (B) and Hideo Saito Department of Information and Computer Science, Keio University, Yokohama, Japan {ienaga,saito}@hvrl.ics.keio.ac.jp
More informationMultiple View Geometry
Multiple View Geometry Martin Quinn with a lot of slides stolen from Steve Seitz and Jianbo Shi 15-463: Computational Photography Alexei Efros, CMU, Fall 2007 Our Goal The Plenoptic Function P(θ,φ,λ,t,V
More informationEECS 442 Computer vision. Announcements
EECS 442 Computer vision Announcements Midterm released after class (at 5pm) You ll have 46 hours to solve it. it s take home; you can use your notes and the books no internet must work on it individually
More informationA NEW AUTOMATIC SYSTEM CALIBRATION OF MULTI-CAMERAS AND LIDAR SENSORS
A NEW AUTOMATIC SYSTEM CALIBRATION OF MULTI-CAMERAS AND LIDAR SENSORS M. Hassanein a, *, A. Moussa a,b, N. El-Sheimy a a Department of Geomatics Engineering, University of Calgary, Calgary, Alberta, Canada
More information3D Computer Vision. Dense 3D Reconstruction II. Prof. Didier Stricker. Christiano Gava
3D Computer Vision Dense 3D Reconstruction II Prof. Didier Stricker Christiano Gava Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de
More informationRobot localization method based on visual features and their geometric relationship
, pp.46-50 http://dx.doi.org/10.14257/astl.2015.85.11 Robot localization method based on visual features and their geometric relationship Sangyun Lee 1, Changkyung Eem 2, and Hyunki Hong 3 1 Department
More informationShape from Silhouettes I CV book Szelisky
Shape from Silhouettes I CV book Szelisky 11.6.2 Guido Gerig CS 6320, Spring 2012 (slides modified from Marc Pollefeys UNC Chapel Hill, some of the figures and slides are adapted from M. Pollefeys, J.S.
More informationAccurate and Dense Wide-Baseline Stereo Matching Using SW-POC
Accurate and Dense Wide-Baseline Stereo Matching Using SW-POC Shuji Sakai, Koichi Ito, Takafumi Aoki Graduate School of Information Sciences, Tohoku University, Sendai, 980 8579, Japan Email: sakai@aoki.ecei.tohoku.ac.jp
More informationTalk plan. 3d model. Applications: cultural heritage 5/9/ d shape reconstruction from photographs: a Multi-View Stereo approach
Talk plan 3d shape reconstruction from photographs: a Multi-View Stereo approach Introduction Multi-View Stereo pipeline Carlos Hernández George Vogiatzis Yasutaka Furukawa Google Aston University Google
More informationIMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES
IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES Pin-Syuan Huang, Jing-Yi Tsai, Yu-Fang Wang, and Chun-Yi Tsai Department of Computer Science and Information Engineering, National Taitung University,
More informationA Comparison of SIFT, PCA-SIFT and SURF
A Comparison of SIFT, PCA-SIFT and SURF Luo Juan Computer Graphics Lab, Chonbuk National University, Jeonju 561-756, South Korea qiuhehappy@hotmail.com Oubong Gwun Computer Graphics Lab, Chonbuk National
More informationarxiv: v1 [cs.cv] 1 Jan 2019
Mapping Areas using Computer Vision Algorithms and Drones Bashar Alhafni Saulo Fernando Guedes Lays Cavalcante Ribeiro Juhyun Park Jeongkyu Lee University of Bridgeport. Bridgeport, CT, 06606. United States
More informationarxiv: v1 [cs.cv] 28 Sep 2018
Extrinsic camera calibration method and its performance evaluation Jacek Komorowski 1 and Przemyslaw Rokita 2 arxiv:1809.11073v1 [cs.cv] 28 Sep 2018 1 Maria Curie Sklodowska University Lublin, Poland jacek.komorowski@gmail.com
More informationA Integrated Depth Fusion Algorithm for Multi-View Stereo
A Integrated Depth Fusion Algorithm for Multi-View Stereo Yongjian Xi, Ye Duan University of Missouri at Columbia Abstract. In this paper, we propose a new integrated depth fusion algorithm for multi-view
More informationPatch Descriptors. CSE 455 Linda Shapiro
Patch Descriptors CSE 455 Linda Shapiro How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar
More informationImage Matching Using SIFT, SURF, BRIEF and ORB: Performance Comparison for Distorted Images
Image Matching Using SIFT, SURF, BRIEF and ORB: Performance Comparison for Distorted Images Ebrahim Karami, Siva Prasad, and Mohamed Shehata Faculty of Engineering and Applied Sciences, Memorial University,
More informationDense 3D Reconstruction from Autonomous Quadrocopters
Dense 3D Reconstruction from Autonomous Quadrocopters Computer Science & Mathematics TU Munich Martin Oswald, Jakob Engel, Christian Kerl, Frank Steinbrücker, Jan Stühmer & Jürgen Sturm Autonomous Quadrocopters
More information3D Colored Model Generation Based on Multiview Textures and Triangular Mesh
3D Colored Model Generation Based on Multiview Textures and Triangular Mesh Lingni Ma, Luat Do, Egor Bondarev and Peter H. N. de With Department of Electrical Engineering, Eindhoven University of Technology
More informationLocal Image Features
Local Image Features Ali Borji UWM Many slides from James Hayes, Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Overview of Keypoint Matching 1. Find a set of distinctive key- points A 1 A 2 A 3 B 3
More informationSemantic 3D Reconstruction of Heads Supplementary Material
Semantic 3D Reconstruction of Heads Supplementary Material Fabio Maninchedda1, Christian Ha ne2,?, Bastien Jacquet3,?, Amae l Delaunoy?, Marc Pollefeys1,4 1 ETH Zurich 2 UC Berkeley 3 Kitware SAS 4 Microsoft
More informationChaplin, Modern Times, 1936
Chaplin, Modern Times, 1936 [A Bucket of Water and a Glass Matte: Special Effects in Modern Times; bonus feature on The Criterion Collection set] Multi-view geometry problems Structure: Given projections
More informationNonrigid Surface Modelling. and Fast Recovery. Department of Computer Science and Engineering. Committee: Prof. Leo J. Jia and Prof. K. H.
Nonrigid Surface Modelling and Fast Recovery Zhu Jianke Supervisor: Prof. Michael R. Lyu Committee: Prof. Leo J. Jia and Prof. K. H. Wong Department of Computer Science and Engineering May 11, 2007 1 2
More informationSIMPLE ROOM SHAPE MODELING WITH SPARSE 3D POINT INFORMATION USING PHOTOGRAMMETRY AND APPLICATION SOFTWARE
SIMPLE ROOM SHAPE MODELING WITH SPARSE 3D POINT INFORMATION USING PHOTOGRAMMETRY AND APPLICATION SOFTWARE S. Hirose R&D Center, TOPCON CORPORATION, 75-1, Hasunuma-cho, Itabashi-ku, Tokyo, Japan Commission
More informationLive Metric 3D Reconstruction on Mobile Phones ICCV 2013
Live Metric 3D Reconstruction on Mobile Phones ICCV 2013 Main Contents 1. Target & Related Work 2. Main Features of This System 3. System Overview & Workflow 4. Detail of This System 5. Experiments 6.
More informationShape from Silhouettes
Shape from Silhouettes Schedule (tentative) 2 # date topic 1 Sep.22 Introduction and geometry 2 Sep.29 Invariant features 3 Oct.6 Camera models and calibration 4 Oct.13 Multiple-view geometry 5 Oct.20
More informationSpecular 3D Object Tracking by View Generative Learning
Specular 3D Object Tracking by View Generative Learning Yukiko Shinozuka, Francois de Sorbier and Hideo Saito Keio University 3-14-1 Hiyoshi, Kohoku-ku 223-8522 Yokohama, Japan shinozuka@hvrl.ics.keio.ac.jp
More informationABSTRACT. KinectFusion is a surface reconstruction method to allow a user to rapidly
ABSTRACT Title of Thesis: A REAL TIME IMPLEMENTATION OF 3D SYMMETRIC OBJECT RECONSTRUCTION Liangchen Xi, Master of Science, 2017 Thesis Directed By: Professor Yiannis Aloimonos Department of Computer Science
More informationReview on Feature Detection and Matching Algorithms for 3D Object Reconstruction
Review on Feature Detection and Matching Algorithms for 3D Object Reconstruction Amit Banda 1,Rajesh Patil 2 1 M. Tech Scholar, 2 Associate Professor Electrical Engineering Dept.VJTI, Mumbai, India Abstract
More informationComputer Vision. I-Chen Lin, Assistant Professor Dept. of CS, National Chiao Tung University
Computer Vision I-Chen Lin, Assistant Professor Dept. of CS, National Chiao Tung University About the course Course title: Computer Vision Lectures: EC016, 10:10~12:00(Tues.); 15:30~16:20(Thurs.) Pre-requisites:
More information3D PLANE-BASED MAPS SIMPLIFICATION FOR RGB-D SLAM SYSTEMS
3D PLANE-BASED MAPS SIMPLIFICATION FOR RGB-D SLAM SYSTEMS 1,2 Hakim ELCHAOUI ELGHOR, 1 David ROUSSEL, 1 Fakhreddine ABABSA and 2 El-Houssine BOUYAKHF 1 IBISC Lab, Evry Val d'essonne University, Evry, France
More informationMonocular Tracking and Reconstruction in Non-Rigid Environments
Monocular Tracking and Reconstruction in Non-Rigid Environments Kick-Off Presentation, M.Sc. Thesis Supervisors: Federico Tombari, Ph.D; Benjamin Busam, M.Sc. Patrick Ruhkamp 13.01.2017 Introduction Motivation:
More informationAcquisition and Visualization of Colored 3D Objects
Acquisition and Visualization of Colored 3D Objects Kari Pulli Stanford University Stanford, CA, U.S.A kapu@cs.stanford.edu Habib Abi-Rached, Tom Duchamp, Linda G. Shapiro and Werner Stuetzle University
More informationA consumer level 3D object scanning device using Kinect for web-based C2C business
A consumer level 3D object scanning device using Kinect for web-based C2C business Geoffrey Poon, Yu Yin Yeung and Wai-Man Pang Caritas Institute of Higher Education Introduction Internet shopping is popular
More informationPatch Descriptors. EE/CSE 576 Linda Shapiro
Patch Descriptors EE/CSE 576 Linda Shapiro 1 How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar
More information