A COMPREHENSIVE TOOL FOR RECOVERING 3D MODELS FROM 2D PHOTOS WITH WIDE BASELINES

Yuzhu Lu, Shana Smith
Virtual Reality Applications Center, Human Computer Interaction Program, Iowa State University, Ames, IA, USA.
yuzhu@iastate.edu, sssmith@iastate.edu

ABSTRACT
Recovering 3D objects from 2D photos is an important application in the areas of computer vision, computer intelligence, feature recognition, and virtual reality. This paper describes an innovative and systematic method that integrates automatic feature extraction, automatic feature matching, manual revision, feature recovery, and model reconstruction into an effective recovery tool. The method has proven to be a convenient and inexpensive way to recover 3D scenes and models directly from 2D photos. We have developed a new automatic key point selection and hierarchical matching algorithm for matching 2D photos that have less similarity. Our method uses a universal camera intrinsic matrix estimation method to omit the camera calibration experiment. We have also developed a new automatic texture-mapping algorithm to find the best textures from the 2D photos. In this paper, we include examples and results to show the capability of the developed tool.

KEY WORDS: 3D recovery, stereo matching, computer modeling, wide baseline.

1. Introduction
With the rapid and wide application of virtual models in many areas, creating 3D models from real scenes is greatly needed. Traditional manual model building is labor-intensive and expensive. Thus, automatically constructing 3D computer models has recently received much attention [1][2]. Much work has been conducted on recovering existing 3D environments. These recovery efforts can be classified into two categories: using scanning devices [3][4] and using cameras [5][6][7][8][9]. Scanning devices can automatically reconstruct objects precisely, but they are very expensive and inconvenient to carry, especially in an outdoor environment.
Here, we chose to focus on 3D model recovery from 2D images taken by cameras. This technology uses two or more photos of the same objects to recover the 3D information of the overlapped areas and, subsequently, reconstruct the model. The process includes four steps: key feature selection, feature matching, recovery computation, and model reconstruction. The features of an image are often expressed as discontinuities in image signals. In prior research, these discontinuities were extracted as corner points [1][8][9][10][11][12], edges [13][14], or regions [7][15][16] by using the first or second derivative information of the image signals. Feature matching is both the focus and the bottleneck of recent research in the area of recovering 3D information from 2D images. Matching processes are applied according to the different attributes of the detected features: corner points [1][8][9][11][12][17], line edges [10][18], curved edges [19], and regions [5][7][20]. Point matching methods have been most widely used in stereovision research because corners are easy to detect
and they are more stable and robust when the perspective changes. Almost all of these point-matching algorithms are designed according to image similarity, uniqueness, continuity, and epipolar information [1][2][5][9][11][17][21]. Recovery computation (also called stereo triangulation) is relatively stable and well understood when the user knows the camera's parameters. However, if these parameters are not given, it is necessary to calibrate the camera [1][2][21], an operation which is inconvenient for many common users. Thus, camera calibration and self-calibration research has become another major research focus. Even though there have been many studies in this area, problems still arise because of the lack of sophistication of current methods. These problems include difficulties in completing 3D recovery automatically and difficulties in dealing with images having wide baselines. In response to these problems, we have developed a systematic semi-automated method to recover 3D models directly from 2D photos with less similarity. We have also developed an automatic feature information extraction method and a hierarchical matching algorithm for images with less similarity, as well as a tool for users to edit key points, revise possible mismatches, and select triangles to reconstruct a model with surfaces. A universal camera intrinsic matrix estimated from statistical analysis is used to recover 3D information without camera calibration, and finally, a new texture-mapping algorithm is developed to automatically select the better textures from the different photos.

2. Methodology

2.1. Key Point Extraction
Feature points are those holding the main characteristics of a 2D image. In our application, geometric information is the main characteristic to be recovered. Here, we consider the widely used Harris corner detection method [1][9][10] and the Canny edge detection method [8][13].
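As a rough illustration of the edge-extraction step, a simplified gradient-magnitude detector can be sketched as below. This is only a stand-in for the full Canny detector named in the text (which adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding); the threshold and the tiny synthetic image are assumptions for illustration.

```python
# Illustrative sketch only: gradient-magnitude edge test (a simplified
# stand-in for the Canny detector; no smoothing or hysteresis).
def sobel_edges(img, threshold=2.0):
    """Return (row, col) pixels whose Sobel gradient magnitude exceeds threshold."""
    h, w = len(img), len(img[0])
    edges = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # Horizontal and vertical Sobel responses.
            gx = (img[r-1][c+1] + 2*img[r][c+1] + img[r+1][c+1]
                  - img[r-1][c-1] - 2*img[r][c-1] - img[r+1][c-1])
            gy = (img[r+1][c-1] + 2*img[r+1][c] + img[r+1][c+1]
                  - img[r-1][c-1] - 2*img[r-1][c] - img[r-1][c+1])
            if (gx*gx + gy*gy) ** 0.5 > threshold:
                edges.append((r, c))
    return edges

# Tiny synthetic image: left half dark, right half bright -> one vertical edge.
img = [[0]*4 + [9]*4 for _ in range(6)]
print(sobel_edges(img))  # pixels along the vertical intensity boundary
```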
We have chosen to utilize the Canny method to extract segment information for two reasons: first, edge detection retains more complete geometric information; and second, edge segments can be represented by only two end points, which can easily be edited and revised manually.

2.2. Hierarchical Matching Algorithm
Epipolar constraints make a great contribution to stereo matching [1][2][9][10][11]. Unlike other matching criteria, epipolar constraints are more robust. Only eight well-matched point pairs are needed to compute the epipolar geometry [11]. However, obtaining these eight well-matched points is a challenging problem. It is almost impossible to check all possible combinations of the extracted feature points. Therefore, an initial seed matching is necessary to provide candidate matches for the epipolar geometry computation. The most widely used method for obtaining an initial matching set is the classical cross correlation method [11]. Although there are other methods which could be used to obtain the initial seed matching, these algorithms usually do not work very well when applied to images with less similarity, because high similarity is a fundamental requirement for those matching methods. We present a new hierarchical method to obtain the initial correspondence set, as shown in figure 1. First, segments are matched. Because segments have more attributes than points (such as length, position, direction, and background color information), matching accuracy can be increased, particularly when using wide baseline images with less similarity. Second, the two end points of the segments are matched based on the segments matched in the first step. If the segments from the first step are well matched, the accuracy of the second step will be very high. Our approach uses four indexes for segment matching. The first index is the relative position of the center points of the extracted segments, and is represented
by the summation of the vectors from the segment center to all other segment centers. For example, the index vector for p1 is found by adding up all the vectors from p1 to the other points, as shown in figure 2(a), and the index vector for p2 is found by adding up all the vectors from p2 to the other points, as shown in figure 2(b). The second index is the length of a segment, represented by the distance between its two end points. The third index is the background information of each segment, represented by the mean color value of the neighborhood of the segment's center point. The fourth index is the direction of a segment, represented by the angle of the segment vector.

Figure 1. Hierarchical matching algorithm

(a) Relative position of p1 (b) Relative position of p2
Figure 2. Index one for p1 and p2 - relative position

The four indexes for each segment in the two images are compared, and the differences between the index values for each pair of segments are computed and added together, as shown in equation 1. The potential matched segments are the ones having the least difference between index values.

CP = a(I_1x - α·I'_1x) + b(I_2x - α·I'_2x) + c(I_3x - I'_3x) + d(I_4x - I'_4x)    (1)

In equation 1, I_1x, I_2x, I_3x, and I_4x are the four index values for a segment in the first image, and I'_1x, I'_2x, I'_3x, and I'_4x are the index values for a candidate segment in the second image; a, b, c, and d are the weights of the four indexes. These weights are determined by the relative importance of the four indexes. Through our study, we have found that the relative position and segment length indexes are more crucial when matching images with wide baselines, so the weights of these two indexes are larger than the others. α is the estimated scale parameter between the two photos, estimated as the ratio of the bounding box sizes of the object in the two images; this factor helps to solve the scaling problem between the photos.
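The weighted cost of equation 1 can be sketched as below. This is a hedged illustration, not the paper's implementation: the position index is simplified to a scalar (the paper uses a summed vector), and the weights a, b, c, d and the dictionary encoding are assumed values chosen only to show the mechanics.

```python
# Hedged sketch of the level-one segment-matching cost (equation 1).
# Index encodings and weights are illustrative assumptions; position and
# length are scale-dependent, so the second image's values are multiplied
# by the estimated scale parameter alpha before comparison.
def segment_cost(s1, s2, alpha, a=2.0, b=2.0, c=1.0, d=1.0):
    """Weighted index difference between segment s1 (image 1) and
    candidate segment s2 (image 2)."""
    return (a * abs(s1["pos"] - alpha * s2["pos"])
            + b * abs(s1["length"] - alpha * s2["length"])
            + c * abs(s1["color"] - s2["color"])
            + d * abs(s1["angle"] - s2["angle"]))

def best_match(segment, candidates, alpha):
    """Pick the candidate segment with the smallest weighted difference."""
    return min(candidates, key=lambda s: segment_cost(segment, s, alpha))

# Example: with alpha = 2, a candidate at half the position/length values
# matches well despite the scale change between the photos.
s = {"pos": 10.0, "length": 6.0, "color": 120.0, "angle": 0.5}
good = {"pos": 5.0, "length": 3.0, "color": 118.0, "angle": 0.5}
bad = {"pos": 1.0, "length": 9.0, "color": 40.0, "angle": 2.0}
print(best_match(s, [bad, good], alpha=2.0) is good)
```

Larger weights on position and length mirror the paper's observation that these two indexes matter most for wide baselines.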
In the first level of the matching process, we can find potentially matched segments, but we cannot determine their matching directions. Consequently, we use the classical cross correlation method in level two to match the end points of the matched segments. If the segments are matched correctly in level one, level two matching is easier because there are few candidate points. Finally, the correspondences for the initial set of key feature points are obtained from the proposed hierarchical matching algorithm. A least squares method is used to find the eight best-matched points. The eight best-matched points are then used to calculate the fundamental matrix, which is useful for finding the most reasonable correspondences. The fundamental matrix is the algebraic representation of the epipolar geometry. After the fundamental matrix is calculated, it is used to find the inliers among the seed matches found in the level two matching and to reject the outliers. Figure 3 shows a comparison of the final matching results of the proposed hierarchical matching method and the classical cross correlation method [1][8][9][11][12][17]. The result shows that our algorithm gives more correct corresponding key points (16 matches) than the classical cross correlation algorithm (9 matches). In figure 3(b), there is an obvious mismatch resulting from the classical cross correlation algorithm.
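The inlier/outlier test described above can be sketched as follows: once F is estimated, a correspondence (p1, p2) is an inlier when the epipolar residual p2ᵀ F p1 is near zero. The particular F, tolerance, and point values below are assumed examples (F here corresponds to a rectified, horizontally shifted pair), not data from the paper.

```python
# Sketch of the epipolar inlier test applied to seed matches once the
# fundamental matrix F has been estimated from the eight best matches.
def epipolar_residual(F, p1, p2):
    """p1, p2: (u, v) pixel coordinates; homogeneous third component is 1.
    Returns |x2^T F x1|, which is ~0 for a correct correspondence."""
    x1 = (p1[0], p1[1], 1.0)
    x2 = (p2[0], p2[1], 1.0)
    # l = F x1 is the epipolar line of p1 in the second image.
    l = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return abs(sum(x2[i] * l[i] for i in range(3)))

def filter_inliers(F, matches, tol=1e-3):
    """Keep only seed matches consistent with the epipolar geometry."""
    return [m for m in matches if epipolar_residual(F, *m) < tol]

# Assumed example F for a rectified pair: residual reduces to |v1 - v2|.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
print(filter_inliers(F, [((10, 5), (12, 5)), ((10, 5), (12, 9))]))
```

In practice a robust variant (e.g. a distance-to-epipolar-line threshold) is used, but the algebraic residual is enough to show how outliers are rejected.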
(a) Our matching method (b) Classical cross correlation method
Figure 3. Matching result comparison

To improve the results and accuracy, various strategies could be implemented, such as the relaxation process and searching for point correspondences a second time under the constraint of the epipolar geometry, as suggested by Zhang [11]. However, this would again cost more time and resources.

2.3. 3D Recovery
The relationship of a 3D point coordinate to its image plane coordinate through a camera is shown in equations 2 and 3, where s is a scaling factor, (x, y, z) is a 3D point coordinate, and (u, v) is its corresponding camera image coordinate. P is a three-by-four perspective projection matrix, which can be decomposed into the camera intrinsic matrix A and the extrinsic matrix containing rotation and translation information (R, T), as shown in equation 3.

s [u v 1]^T = P [x y z 1]^T    (2)

P = A [R T]    (3)

Thus, the relationship between a 3D point and its two image coordinates on the two image planes of two cameras can be expressed by equation 4, where A_1 and A_2 are the two cameras' intrinsic matrices. Here, we set the first camera at the origin and the second camera as the transformed one.

s_1 [u_1 v_1 1]^T = A_1 [I 0] [x y z 1]^T
s_2 [u_2 v_2 1]^T = A_2 [R T] [x y z 1]^T    (4)

These two equations work as the principle of stereo triangulation to recover 3D coordinates when all parameters are known. After the corresponding key points are matched and triangular surfaces are constructed, the triangulation method is carried out to recover the 3D information [17][21]. Prior research has shown that camera calibration (or self-calibration) should be carried out to obtain the camera's intrinsic parameter matrix [1][2][6]. The intrinsic matrix A (shown in equation 5) and the fundamental matrix F obtained in the matching process can then be used to calculate the rotation and translation parameters between the two cameras, with which we can calculate the 3D information of the object.
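The stereo triangulation of equation 4 can be sketched with a standard linear (DLT-style) solve: each measurement (u, v) with projection matrix P contributes two linear equations in the unknown point (x, y, z). The pure-Python normal-equations/Cramer's-rule solve and the example camera matrices below are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of linear triangulation from equation 4.
def _det3(m):
    """Determinant of a 3x3 matrix (for Cramer's rule)."""
    return (m[0][0]*(m[1][1]*m[2][2] - m[1][2]*m[2][1])
          - m[0][1]*(m[1][0]*m[2][2] - m[1][2]*m[2][0])
          + m[0][2]*(m[1][0]*m[2][1] - m[1][1]*m[2][0]))

def triangulate(observations):
    """observations: list of ((u, v), P) pairs, P a 3x4 projection matrix.
    Each coordinate gives one equation: u*(P row 3)·X = (P row 1)·X, etc.
    Returns the least-squares 3D point via the normal equations."""
    rows, rhs = [], []
    for (u, v), P in observations:
        for coord, r in ((u, 0), (v, 1)):
            rows.append([coord*P[2][j] - P[r][j] for j in range(3)])
            rhs.append(P[r][3] - coord*P[2][3])
    # Normal equations (M^T M) X = M^T b, solved with Cramer's rule.
    n = len(rows)
    N = [[sum(rows[k][i]*rows[k][j] for k in range(n)) for j in range(3)]
         for i in range(3)]
    y = [sum(rows[k][i]*rhs[k] for k in range(n)) for i in range(3)]
    det, X = _det3(N), []
    for i in range(3):
        Ni = [row[:] for row in N]
        for r in range(3):
            Ni[r][i] = y[r]
        X.append(_det3(Ni) / det)
    return X

# Assumed example: identity intrinsics, second camera shifted along x.
P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]     # A1 [I | 0]
P2 = [[1, 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]]    # A2 [R | T]
print(triangulate([((0.25, 0.5), P1), ((0.0, 0.5), P2)]))
```

With the example matrices, the two projections of the point (1, 2, 4) triangulate back to it exactly; with noisy matches, the same solve returns the least-squares point.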
However, the camera calibration experiment is very inconvenient and time consuming, and is sometimes even impossible, for example, when we attempt to recover a historical scene from old photographs.

A = [ f·k_u   f·k_u·cotθ   u_0
      0       f·k_v/sinθ   v_0
      0       0            1   ]    (5)

There are six intrinsic parameters: the focal length f of the camera, the aspect ratios k_u and k_v, the angle θ between the retinal axes, and the coordinates of the principal point, u_0 and v_0 [1][2][21]. Xu, Terai, and Shum [21] suggest that if high precision is not required, we can assume that the angle between the retinal axes is π/2 (θ = π/2), that the aspect ratio is 1 (k_u = k_v = 1), and that the principal point is at the image's center. Thus, the only unknown parameter left is the focal length f. The intrinsic matrix can then be rewritten as equation 6:

A = [ f   0   pixel_x/2
      0   f   pixel_y/2
      0   0   1          ]    (6)
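Equation 6 reduces the intrinsic matrix to a single free parameter, which is easy to express directly; the function name and the example image size below are assumptions for illustration.

```python
# Sketch of the simplified intrinsic matrix of equation 6: unit aspect
# ratio, right angle between the retinal axes, principal point at the
# image center, leaving the focal length f as the only free parameter.
def estimated_intrinsics(f, width_px, height_px):
    return [[float(f), 0.0, width_px / 2.0],
            [0.0, float(f), height_px / 2.0],
            [0.0, 0.0, 1.0]]

# Example: f = 1000 (within the suggested range) for a 1024x768 photo.
A = estimated_intrinsics(1000, 1024, 768)
print(A)
```

Because f is the only unknown, a user can sweep it over a plausible range and keep the value that yields the best-looking reconstruction, which is exactly the adjustment the tool exposes.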
After surveying and analyzing the focal lengths used in different research areas and collecting sample focal lengths, we found that most focal lengths (80%) vary within a narrow range [700-1300]. Thus, it is possible to estimate a camera's intrinsic matrix when recovering 3D information from photos taken by a normal camera. Our 3D recovery tool allows the user to adjust the focal length to find the best results. Figure 4 shows an example of recovering a 3D scene using a focal length of 1000.

Figure 4. Results by using f = 1000

2.4. Reconstruction by Texture Mapping
After we obtain all the 3D information of the feature points, a new model can be constructed based on the connectivity information given by the triangles created above. Since the reconstructed solid model does not contain surface texture, texture mapping is necessary to make the model more realistic. However, since we have at least two photos, and since each photo is taken from a different angle (and therefore shows certain details in a slightly different way), the question arises: which photo should be used? Normally, a better surface texture comes from the camera that captures a larger area of that surface of the object and therefore contains clearer details. Using this principle, we have designed an algorithm that compares every corresponding triangle texture and selects the texture with the larger area. Figure 5 shows an example of our algorithm.

Figure 5. Reconstruction with texture mapping

3. Conclusion and Discussion
We have proposed a hierarchical feature-matching algorithm for wide baseline images with less similarity.
1. We have presented a universal camera intrinsic matrix, with which the camera calibration experiment can be omitted (saving time and resources).
2. We have presented a new texture-mapping algorithm which automatically selects the better and clearer triangle textures to be mapped onto the reconstructed model.
3.
We have presented an integrated and comprehensive process for recovering a 3D model from 2D images, whereas almost all previous research focused on only one or two parts of this process.

Since our method recovers a 3D scene from only two 2D photos, areas unseen in either photo cannot be recovered. Objects with simple geometric shapes (like buildings) are easier to recover than more complicated objects like trees and grass. Because the estimated camera intrinsic matrix does affect the recovered results, users can adjust the focal length within the suggested range. In the near future, more work will be done to register and fuse model parts retrieved from more photos to recover a complete model. The results will also be output as VRML models to be used in more applications.

References
[1] K. Cornelis, M. Pollefeys, M. Vergauwen, & L. Van Gool, Augmented Reality using Uncalibrated Video Sequences, 2nd European Workshop on 3D Structure
from Multiple Images of Large-Scale Environments (SMILE 2000), Dublin, Ireland, 2000, 144-160.
[2] M. Pollefeys, Self-Calibration and Metric 3D Reconstruction from Uncalibrated Image Sequences, Ph.D. thesis, Katholieke Universiteit Leuven, Heverlee, Belgium, 1999.
[3] M. Reed & P. Allen, 3-D Modeling from Range Imagery: An Incremental Method with a Planning Component, Image and Vision Computing, 17, 1999, 99-111.
[4] I. Stamos & P. Allen, 3-D Model Construction Using Range and Image Data, Computer Vision & Pattern Recognition Conf. (CVPR), 2000, 531-536.
[5] H. Shum & R. Szeliski, Stereo Reconstruction from Multiperspective Panoramas, 7th International Conf. on Computer Vision (ICCV'99), Kerkyra, Greece, 1999, 14-21.
[6] T. Jebara, A. Azarbayejani, & A. Pentland, 3D Structure from 2D Motion, IEEE Signal Processing Magazine, 16(3), 1999, 66-84.
[7] S. Baker, R. Szeliski, & P. Anandan, A Layered Approach to Stereo Reconstruction, IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR98), 1998, 434-441.
[8] C. Baillard & A. Zisserman, A Plane-Sweep Strategy for the 3D Reconstruction of Buildings from Multiple Images, International Archives of Photogrammetry and Remote Sensing, 32(2), 2000, 56-62.
[9] A. W. Fitzgibbon, G. Cross, & A. Zisserman, Automatic 3D Model Construction for Turn-Table Sequences, Proc. European Workshop on 3D Structure from Multiple Images of Large-Scale Environments, 1998, 155-170.
[10] C. G. Harris & M. J. Stephens, A Combined Corner and Edge Detector, Proc. 4th Alvey Vision Conf., Manchester, England, 1988, 147-151.
[11] Z. Zhang, R. Deriche, O. Faugeras, & Q. Luong, A Robust Technique for Matching Two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry, Artificial Intelligence, 78, 1995, 87-119.
[12] P. Tissainayagam & D. Suter, Assessing the Performance of Corner Detectors for Point Feature Tracking Applications, Image and Vision Computing, 22(8), 2004, 663-679.
[13] J. F.
Canny, Finding Edges and Lines in Images, Master's thesis, MIT AI Lab, 1983.
[14] R. Gonzalez & R. Woods, Digital Image Processing (2nd edition, Prentice Hall, 2002).
[15] J. Gao, A. Kosaka, & A. Kak, A Deformable Model for Human Organ Extraction, Proc. IEEE International Conf. on Image Processing, 1998, 3: 323-327.
[16] D. L. Pham, C. Xu, & J. L. Prince, A Survey of Current Methods in Medical Image Segmentation, Annual Review of Biomedical Engineering, 2, 1998, 315-337.
[17] Y. Ma, S. Soatto, J. Kosecka, & S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models (Springer-Verlag, 2003).
[18] H. Loaiza, J. Triboulet, & S. Lelandais, Matching Segments in Stereoscopic Vision, IEEE Instrumentation & Measurement Magazine, 4(1), 2001, 37-42.
[19] R. Deriche & O. Faugeras, 2-D Curve Matching Using High Curvature Points: Application to Stereo Vision, Proc. International Conf. on Pattern Recognition, New Jersey, USA, 1990, 240-242.
[20] T. Tuytelaars, M. Vergauwen, M. Pollefeys, & L. Van Gool, Image Matching for Wide Baseline Stereo, Proc. International Conf. on Forensic Human Identification, 1999.
[21] G. Xu, J. Terai, & H. Shum, A Linear Algorithm for Camera Self-Calibration, Motion and Structure Recovery for Multi-Planar Scenes from Two Perspective Images, Computer Vision & Pattern Recognition Conf. (CVPR), 2000, 2474-2479.