Photo-Consistency Based Registration of an Uncalibrated Image Pair to a 3D Surface Model Using Genetic Algorithm
Zsolt Jankó and Dmitry Chetverikov
Computer and Automation Research Institute and Eötvös Loránd University, Budapest, Hungary
{janko,csetverikov}@sztaki.hu

Abstract

We consider the following data fusion problem. A 3D object with a textured Lambertian surface is measured and independently photographed, yielding a triangulated model of the object and two uncalibrated images. The goal is to precisely register the images to the model. Solving this problem is necessary for building a geometrically accurate, photorealistic model from laser-scanned 3D data and high-quality images. Recently, we proposed a novel method that generalises the photo-consistency approach of Clarkson et al. [2] to the case of uncalibrated cameras, when both intrinsic and extrinsic parameters are unknown. This gives the user the freedom of taking the pictures with a conventional digital camera, from arbitrary positions and with varying zoom. The method is based on manual pre-registration followed by a genetic optimisation algorithm. A brief description of the pilot version of the method [8] has been given together with the results of a few initial tests. In this paper, we report on significant new developments in this project. The critical issue of robustness against illumination changes is addressed, and various colour representations and cost functions are tested and compared. Natural constraints are introduced and experimentally validated to simplify the camera model and accelerate the algorithm. Finally, we present synthetic and real data with ground truth, apply the improved method to the data and measure the quality of the results.

1. Introduction

Precise registration of images to a 3D surface model is needed in a number of computer vision areas.
An important application is building and visualising photorealistic 3D models of real-world objects based on multimodal sensor data. In our view, a photorealistic model has three major components: geometry, appearance and dynamics. These components call for, respectively, precision, continuity and high-level description (geometry); texture, realistic surface models and presentation at varying levels of detail (appearance); and motion and deformable shapes (dynamics). In this paper, the problem of combining precise geometry with high-quality images is addressed.

One of the most important application areas of registration is medical imaging. Registering optical images to 3D models obtained from CT or MRI images helps the surgeon plan and perform an operation. This application, called image-guided surgery, has been shown to improve the accuracy of operations and to reduce operation time [2]. Another promising application, which is closer to our approach, is visualising the exhibits of museums in fine detail. When an expert wishes to examine the exhibits through a web site, providing realistic textures on the surface is as important as providing precise geometry.

There are several approaches to the registration problem. The task is frequently referred to as pose estimation, which assumes calibrated cameras, so that only the pose of the object in the world needs to be estimated. This can be achieved by extracting features on the 3D model as well as in the images, and searching for the corresponding feature pairs [11, 3, 5]. Clarkson et al. [2] approached the problem in a different way: they presented an algorithm based on photo-consistency. Their method needs a calibrated setup. In [8] we proposed a novel method which generalises this approach to the case of uncalibrated cameras, when both intrinsic and extrinsic parameters are unknown. The method is based on manual pre-registration followed by a genetic optimisation algorithm.
In this paper we improve the method presented in [8] and apply it to real and synthetic data with ground truth. We consider the following scenario: The surface of an object is measured by an accurate 3D laser scanner and a dense point set is captured. Triangulation of the point set yields a 3D triangular mesh with the surface normals. Furthermore, an uncalibrated digital camera is used to acquire high quality
images of the object. The user has the freedom of taking the pictures from arbitrary positions and with varying zoom. The task is to register the images to the 3D model to obtain a geometrically accurate, photorealistic model.

The contributions of the paper are as follows. The critical issue of robustness against illumination changes is addressed, and various colour representations and cost functions are tested and compared. Without loss of generality, constraints are introduced and experimentally validated to simplify the camera model and make the algorithm more efficient. Finally, we present synthetic and real data with ground truth, apply the improved method to the data and measure the quality of the results.

The structure of the paper is the following. Section 2 summarises our method introduced in [8]; the cost function and the selected optimisation strategy are presented. Section 3 is devoted to the improvement of the original method. The improved method is tested on real data and on synthetic data with ground truth in section 4. Finally, section 5 sums up the results.

2. Method

In this section we give a short overview of the method presented in [8]. Although we formulate the task in the special case of a 3D model and two images, the approach can easily be extended to multiple images.

2.1. Cost function

The input data consist of two colour images, I_1 and I_2, and a 3D model. An example is shown in figure 1. The images and the model represent the same object. Fixed lighting conditions and identical sensitivities of the cameras are assumed. All other camera parameters may differ and are unknown. Furthermore, we assume that the surface of the object is textured and Lambertian. The 3D model consists of a 3D point set P and a set of normal vectors. P is obtained by a hand-held 3D scanner and then triangulated by the robust algorithm of [9], which provides the normal vectors.
The finite projective camera model [6] is used to represent the projection of the object surface to the image plane: u ≃ P X, where u is an image point, P is the 3×4 projection matrix and X is a surface point [6]. (≃ means that the projection is defined up to an unknown scale.) The task of the registration is to determine the precise projection matrices, P_1 and P_2, for both images.

The projection matrix P has 12 elements but only 11 degrees of freedom (DOFs), since it is defined up to a scale factor. Decomposing P as P = K [R | Rt] shows the meaning of these DOFs [6]: K is the 3×3 upper triangular camera matrix, R the 3×3 rotation matrix and t the 3×1 translation vector. The elements of K are the intrinsic camera parameters, while R and t are the extrinsic camera parameters, namely the orientation and the position of the camera. Let us denote by p the collection of the 11 unknown parameters (5 intrinsic and 6 extrinsic); p represents the projection matrix P as an 11-dimensional parameter vector.

We search for values of p_1 and p_2 such that the images are consistent, that is, corresponding points (different projections of the same 3D point) have the same colour value. The definition is valid only when the surface is Lambertian. Formally, we say that images I_1 and I_2 are consistent by P_1 and P_2 (or p_1 and p_2) if for each X ∈ P: u_1 = P_1 X, u_2 = P_2 X and I_1(u_1) = I_2(u_2). (Here I_i(u_i) is the colour value at point u_i of image I_i.) This type of consistency is called photo-consistency [10, 2]. Photo-consistency holds for accurate estimates of p_1 and p_2; conversely, misregistered projection matrices yield much less photo-consistent images.

The cost function introduced in [8] is the following:

    C_φ(p_1, p_2) = (1/|P|) Σ_{X ∈ P} ||I_1(P_1 X) − I_2(P_2 X)||²    (1)

Here φ stands for photo-inconsistency and |P| is the number of points in P. The colour difference I_1 − I_2 will be defined later.
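To make the cost concrete, here is a minimal sketch of equation (1) in Python/NumPy. All names are hypothetical (the paper's implementation is not available); `params_to_matrix` stands in for the mapping from the 11-dimensional vector p to the 3×4 matrix P, and points are assumed to project inside both images.

```python
import numpy as np

def project(P, X):
    """Project homogeneous 3D points X (N x 4) with a 3x4 matrix P; return pixel coords (N x 2)."""
    x = X @ P.T                      # N x 3 homogeneous image points
    return x[:, :2] / x[:, 2:3]      # divide out the unknown scale

def photo_inconsistency(p1, p2, points, image1, image2, params_to_matrix):
    """Mean squared colour difference over the surface points, as in eq. (1).

    params_to_matrix maps a parameter vector p to a 3x4 projection matrix;
    image1/image2 are H x W x 3 arrays; points is N x 4 (homogeneous)."""
    P1, P2 = params_to_matrix(p1), params_to_matrix(p2)
    u1 = np.rint(project(P1, points)).astype(int)   # round to pixel centres
    u2 = np.rint(project(P2, points)).astype(int)
    c1 = image1[u1[:, 1], u1[:, 0]].astype(float)   # colour at each projection
    c2 = image2[u2[:, 1], u2[:, 0]].astype(float)
    return np.mean(np.sum((c1 - c2) ** 2, axis=1))
```

A perfect registration of two identical, constant-colour views gives cost 0; misregistration pairs up pixels from different surface patches and drives the cost up, which is exactly what the genetic search below exploits.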
Finding the minimum of the cost function (1) over p_1 and p_2 yields estimates for the projection matrices.

2.2. Optimisation strategy

Because of the 22-dimensional parameter space and the unpredictable shape of C_φ(p_1, p_2), finding its minimum is a difficult task. We have attempted to approach the problem in different ways. Auto-calibration [6] is a widely used process to determine internal camera parameters directly from multiple uncalibrated images. The method is based on the absolute conic Ω, which is fixed under rigid motion of the camera. Determining Ω from the images yields the metric geometry of the model. Ω can be computed by using constraints on the internal or external camera parameters. There are several methods for auto-calibration, for instance estimating the absolute dual quadric Q*, or using the Kruppa equations. However, all of these methods assume that the projective reconstruction or the fundamental matrix has already been computed from point correspondences across the image set. In the registration problem considered here, precise initial point correspondences are not available. The estimated projective reconstruction, and hence the constraints and the computed internal parameters, can only be approximations. This means that auto-calibration cannot be used to eliminate the internal parameters from the parameter space.
Figure 1. The Shell dataset. Centre: 3D model. Sides: Image pair.

As a result, we preferred to minimise the cost function over all 22 parameters. To initialise the local search, manual pre-registration was introduced. Several standard nonlinear optimisation methods were tested. However, due to the non-smoothness of C_φ(p_1, p_2), gradient-based methods (such as the Levenberg-Marquardt algorithm [6]) failed to provide reliable results. Finally, we decided to apply a global search strategy, namely a genetic algorithm.

The proposed two-stage method is illustrated in figure 2. The rough estimates P_1^0 and P_2^0 provided by the manual pre-registration are refined by minimising C_φ(p_1, p_2). Note that human assistance to initialise the search is reasonable because this operation is simple and fast compared to the 3D scanning, which is also done manually. The task of the photo-consistency based registration is to make the method more accurate. In section 4 we demonstrate by tests with ground truth that the gain in accuracy is essential.

Figure 2. Block diagram of the proposed method.

For the implementation of the genetic algorithm we have chosen the GAlib genetic algorithm package [14] written by Matthew Wall at MIT. Different settings were tested and the best ones were chosen. The steady-state algorithm proved to converge faster than the simple one, hence we use it, with the default elitist option. The algorithm applies uniform crossover and a Gaussian mutation operator. Instead of the default roulette-wheel selection method we use the tournament selector. To avoid premature convergence (when the population becomes homogeneous before finding the minimum) we set the mutation probability to 0.1. For the same reason the algorithm runs until 5×100 generations are created, instead of 1×500.
This means that the algorithm starts 5 times from the beginning, preserving only the best individual from the previous population. The population is set to contain 500 or 1000 individuals. The intervals of the genes are set to the pre-registered values plus a margin of ±ε. (The concrete values are given in section 3.2.)

3. Improving the algorithm

3.1. Robustness and colour model

Several preliminary experiments were run to study the robustness of the method, which is a critical issue in registration, as it is in correspondence. It is clear that the cost function (1) is not robust, due to the inconsistencies produced by outliers, typically occluded points. In [2], visibility is checked by ray tracing, but in [8] we use surface normals for this purpose. Our implementation is less accurate but much faster, which is more important in this case. The essence of our algorithm is to discard a point when the scalar product of the normal vector and the unit vector pointing towards the camera falls below a threshold. This product is the cosine of the angle between the two vectors. Typically, the threshold is set at 0.5. Since this algorithm cannot guarantee that all outliers will be filtered out, in [8] we modified the cost function (1) to remove the remaining outliers. The Trimmed Squares (TS) and the α-trimmed mean [12] techniques were applied. Both techniques have a single parameter, α. In TS, α is the
rate of the largest squares which are discarded. In the α-trimmed mean, both the smallest and the largest values are rejected; when α is close to 0.5, the median is used. In our experiments, we used α = 0.2. The α-trimmed mean performed slightly better. In attempts to improve the initial method [8], we later tested a few other cost functions. However, the variance of the colour values [2] and the Modified Normalised Cross-Correlation [13] yielded worse results than the robust least-squares described above.

Another important question related to robustness is how to compare two colour images, that is, which colour difference measure should be used to eliminate the influence of illumination changes. We have tested various colour models: CIE XYZ ITU [1], HSI [4] and CIE LUV [7]. In the literature, CIE LUV is usually used to compute colour differences as the simple sum of squared differences in the three components L, U and V. This model proved to be the best in our experiments as well. However, it should be emphasised that in our experimental data the illumination changes were small.

It seemed reasonable to reduce the image size and apply image pyramids for the registration. We tried this as well, but the results did not improve significantly.

3.2. Constraints on the camera model

As already mentioned, our original method applied optimisation in the full 22-dimensional parameter space. The size of the space and the non-smoothness of the cost function are two critical problems that make the search difficult and time-consuming, despite the restrictions due to the manual pre-registration. To improve the efficiency of the optimisation process, we impose several reasonable constraints on the camera model, as suggested in [6]. Note that using the finite projective camera model without camera distortion is already a constraint which works well in practice.
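The two robustness devices of section 3.1, the normal-based visibility test and the α-trimmed mean, can be sketched as follows. This is a minimal sketch assuming NumPy arrays of unit surface normals and 3D points; the names are illustrative, not the paper's code.

```python
import numpy as np

def visible(normals, points, cam_center, cos_threshold=0.5):
    """Keep a point only if the cosine between its (unit) normal and the
    direction towards the camera exceeds the threshold; back-facing or
    grazing points are treated as occluded and discarded."""
    view = cam_center - points                              # surface-to-camera vectors
    view /= np.linalg.norm(view, axis=1, keepdims=True)     # normalise to unit length
    return np.sum(normals * view, axis=1) > cos_threshold   # dot product = cosine

def alpha_trimmed_mean(squared_diffs, alpha=0.2):
    """Discard the alpha fraction of the smallest and of the largest
    values, then average the rest (robust replacement for the plain mean)."""
    s = np.sort(np.asarray(squared_diffs))
    k = int(alpha * len(s))
    trimmed = s[k:len(s) - k] if len(s) > 2 * k else s
    return trimmed.mean()
```

With α = 0.2 and residuals [0, 1, 2, 3, 100], the trimmed mean averages only [1, 2, 3], so a single occluded point contributing the value 100 no longer dominates the cost.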
As mentioned above, the projection matrix can be decomposed as P = K [R | Rt], where the calibration matrix can be expressed in the form

    K = [ α_x   s    x_0 ]
        [ 0     α_y  y_0 ]    (2)
        [ 0     0    1   ]

Here α_x = f m_x and α_y = f m_y represent the focal length of the camera in terms of pixel dimensions in the x and y directions, respectively, s is the skew parameter and (x_0, y_0) is the principal point [6]. For most cameras the skew parameter is zero. It is also usual to assume that the pixels are square, that is, the ratio of m_x and m_y, referred to as the aspect ratio, is equal to 1. Thus the calibration matrix can be simplified:

    K = [ f  0  p_x ]
        [ 0  f  p_y ]    (3)
        [ 0  0  1   ]

where p_x = x_0/m_x and p_y = y_0/m_y. This simplified model is usually called the pinhole camera model. These simplifications reduce the number of DOFs from 22 to 18. Although the decrease is not large, in this case every reasonable reduction is important. Hence we also applied a commonly used assumption, namely that the principal point is close to the origin. This does not reduce the number of parameters, but the search space becomes more restricted.

To exploit the relation between the internal and the external camera parameters, a technical simplification is implemented as well. If we use the simplified camera model (3) and assume that the principal point is close to the origin, then one may give a rough estimate for the ratio of the focal length f and the distance d of the camera from the centroid of the object:

    f / d ≈ w / W,    (4)

where the width of the object in the image is denoted by w and in the 3D world by W (figure 3). Although w and W are unknown, the initial state of the camera provided by the manual pre-registration gives a good approximation to them.

Figure 3. The relation among the distances. C is the camera centre, O the object's centroid.

It is clear that the estimate in (4) is quite rough. Nevertheless, applying this constraint to determine the initial
population of the genetic algorithm makes the method much more efficient.

In section 2.2 we did not specify the ε values for the intervals of the genes. Considering the simplifications detailed above, the values we use are the following: focal length: ±2%; principal point: ±0.5%; camera translation: ±3%; camera rotation: ±1°.

4. Tests

In [8], pilot experimental results were presented to demonstrate the feasibility of the method. In this section, we run the improved method on the same input data as well as on new synthetic data with ground truth. When registration results are visualised, the differences between the original and the improved methods are not easy to perceive. These differences become clear when one uses the metric derived from the ground truth.

We obtain the synthetic ground-truthed data by covering the triangular mesh of the original Shell dataset with a synthetic texture. Two views of this object produced by a visualisation program provide the two input images, for which the projection matrices are completely known. To quantitatively assess the registration results, the projection error is measured: the 3D point set P is projected onto the image planes by both the ground-truth and the estimated projection matrices, then the average distance between the corresponding image points is calculated. Formally,

    E(P_1, P_2) = (1/2) Σ_{i=1,2} (1/|P|) Σ_{X ∈ P} ||P_i^G X − P_i X||,    (5)

where P_i^G are the ground-truth and P_i the estimated projection matrices. Evaluating the result of the manual pre-registration by this metric, we obtained that the average error is pixels. The original method [8] brought it down to pixels. The results of the improved method presented in this paper are shown in table 1 and figure 4.
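The projection-error metric of equation (5) is straightforward to compute. A minimal sketch, assuming NumPy, homogeneous N×4 points and lists of 3×4 matrices (the function name is ours, not the paper's):

```python
import numpy as np

def projection_error(P_gt, P_est, points):
    """Mean pixel distance between ground-truth and estimated projections
    (eq. 5), averaged over the two cameras. P_gt and P_est are lists of
    3x4 projection matrices, points is N x 4 in homogeneous coordinates."""
    def proj(P):
        x = points @ P.T
        return x[:, :2] / x[:, 2:3]          # perspective divide
    errs = [np.linalg.norm(proj(g) - proj(e), axis=1).mean()
            for g, e in zip(P_gt, P_est)]    # one mean distance per camera
    return float(np.mean(errs))
```

Identical matrices give an error of 0; an estimate whose projections are shifted by one pixel everywhere gives an error of 1, matching the pixel units quoted in Table 1.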
The algorithm was tested 10 times with 500 individuals in the population and 10 times with 1000. The average error of 5-6 pixels is acceptable, considering the dimensions of the images. The typical running time with a 3D model containing 1000 points was 15 minutes for 500 individuals and 40 minutes for 1000. The test was performed on a 2.40 GHz PC with 1 GB memory. (3D models generated by laser scanners usually contain tens of thousands of points; however, for registration a less dense point set is sufficient.)

Table 1. The projection error. 10 runs are performed with 500 individuals and 10 runs with 1000 individuals.
    500 individuals: average 6.54
    1000 individuals: average 5.63

Figure 4. Plot of the projection error.

Finally, in figures 5-7 we visualise typical registration results for the Shell and the Frog datasets. The precision of the registration can be best judged at the feet, the mouth and the tie of the Frog and at the stripes of the Shell. Note that the accuracy is not uniform over the surface: the most accurate areas are those which are clearly visible from both viewpoints.

5. Summary

We have presented an improved method for registering a 3D model and two high-quality images assuming uncalibrated cameras. After manual pre-registration, the method uses a genetic algorithm to minimise a photo-consistency based cost function. To ensure robustness, different cost functions and colour models were tested. By imposing reasonable constraints on the camera model, the 22-dimensional parameter space could be reduced, thereby increasing the efficiency of the search.
Figure 5. Shell registration. (See figure 1.) Left: textured model. Right: textureless model.

Figure 6. The Frog dataset. Centre: 3D model. Sides: Image pair.

Figure 7. Frog registration. Left: textured model. Right: textureless model.

Tests have shown that the projection error of the registration can be decreased to 5-6 pixels. In the course of the runs, worse results have also occurred occasionally; in these cases the genetic algorithm failed to come close to the global optimum. The genetic algorithm is a global search strategy, but due to the non-smoothness of the cost function, it may stop in a local minimum far from the global one. To increase the probability of good results, the number of individuals must be set higher; on the other hand, this slows the algorithm down. Although the genetic algorithm is not deterministic, the variance of the projection error is quite small. In our experiments the algorithm ran until a fixed number of generations were created. One may also set the algorithm to run until a given level of the cost function is reached. This is possible since the projection error and the cost function correlate: the smaller
the cost function, the smaller the error (see figure 8).

Figure 8. Cost function vs. projection error.

The presented method works with image pairs. Further research will show whether the approach should be extended to more images. Using more images allows one to impose more constraints on the intrinsic parameters, but the number of extrinsic parameters would also grow; thus the advantages of using more images are not obvious. Further tests are also needed to prove the robustness of the method against changes of lighting conditions and shadows.

Acknowledgement. This work was supported by the Hungarian Scientific Research Fund (OTKA) under grants T and M28078 and the EU Network of Excellence MUSCLE (FP ).

References

[1] CIE Publ. No. Colorimetry. Second Edition.
[2] M. J. Clarkson, D. Rueckert, D. L. Hill, and D. J. Hawkes. Using photo-consistency to register 2D optical images of the human face to a 3D surface model. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23.
[3] P. David, D. DeMenthon, R. Duraiswami, and H. Samet. SoftPOSIT: Simultaneous pose and correspondence determination. 7th European Conference on Computer Vision, Copenhagen, Denmark.
[4] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison-Wesley Publishing Company.
[5] R. Haralick, H. Joo, C.-N. Lee, X. Zhuang, V. Vaidya, and M. Kim. Pose estimation from corresponding point data. IEEE Trans. Systems, Man, and Cybernetics, 19.
[6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.
[7] R. Hunt. Measuring Colour. Ellis Horwood Limited, 1987.
[8] Z. Jankó and D. Chetverikov. Precise registration of an uncalibrated image pair to a 3D surface model. Proc. 17th International Conference on Pattern Recognition. Accepted for publication.
[9] G. Kós. An algorithm to triangulate surfaces in 3D using unorganised point clouds. Computing Suppl., 14.
[10] K. Kutulakos and S. Seitz. A theory of shape by space carving. International Journal of Computer Vision.
[11] M. Leventon, W. Wells III, and W. Grimson. Multiple view 2D-3D mutual information registration. Image Understanding Workshop.
[12] I. Pitas. Digital Image Processing Algorithms. Prentice Hall.
[13] R. Sara. Finding the largest unambiguous component of stereo matching. 7th European Conference on Computer Vision, vol. 2.
[14] M. Wall. The GAlib genetic algorithm package. URL:
Simultaneous surface texture classification and illumination tilt angle prediction X. Lladó, A. Oliver, M. Petrou, J. Freixenet, and J. Martí Computer Vision and Robotics Group - IIiA. University of Girona
More informationComputational Optical Imaging - Optique Numerique. -- Single and Multiple View Geometry, Stereo matching --
Computational Optical Imaging - Optique Numerique -- Single and Multiple View Geometry, Stereo matching -- Autumn 2015 Ivo Ihrke with slides by Thorsten Thormaehlen Reminder: Feature Detection and Matching
More informationCamera calibration. Robotic vision. Ville Kyrki
Camera calibration Robotic vision 19.1.2017 Where are we? Images, imaging Image enhancement Feature extraction and matching Image-based tracking Camera models and calibration Pose estimation Motion analysis
More informationToday. Stereo (two view) reconstruction. Multiview geometry. Today. Multiview geometry. Computational Photography
Computational Photography Matthias Zwicker University of Bern Fall 2009 Today From 2D to 3D using multiple views Introduction Geometry of two views Stereo matching Other applications Multiview geometry
More information3D Sensing and Reconstruction Readings: Ch 12: , Ch 13: ,
3D Sensing and Reconstruction Readings: Ch 12: 12.5-6, Ch 13: 13.1-3, 13.9.4 Perspective Geometry Camera Model Stereo Triangulation 3D Reconstruction by Space Carving 3D Shape from X means getting 3D coordinates
More information3D Modeling of Objects Using Laser Scanning
1 3D Modeling of Objects Using Laser Scanning D. Jaya Deepu, LPU University, Punjab, India Email: Jaideepudadi@gmail.com Abstract: In the last few decades, constructing accurate three-dimensional models
More informationMETRIC PLANE RECTIFICATION USING SYMMETRIC VANISHING POINTS
METRIC PLANE RECTIFICATION USING SYMMETRIC VANISHING POINTS M. Lefler, H. Hel-Or Dept. of CS, University of Haifa, Israel Y. Hel-Or School of CS, IDC, Herzliya, Israel ABSTRACT Video analysis often requires
More informationStep-by-Step Model Buidling
Step-by-Step Model Buidling Review Feature selection Feature selection Feature correspondence Camera Calibration Euclidean Reconstruction Landing Augmented Reality Vision Based Control Sparse Structure
More informationVisual Recognition: Image Formation
Visual Recognition: Image Formation Raquel Urtasun TTI Chicago Jan 5, 2012 Raquel Urtasun (TTI-C) Visual Recognition Jan 5, 2012 1 / 61 Today s lecture... Fundamentals of image formation You should know
More informationHartley - Zisserman reading club. Part I: Hartley and Zisserman Appendix 6: Part II: Zhengyou Zhang: Presented by Daniel Fontijne
Hartley - Zisserman reading club Part I: Hartley and Zisserman Appendix 6: Iterative estimation methods Part II: Zhengyou Zhang: A Flexible New Technique for Camera Calibration Presented by Daniel Fontijne
More informationCamera Registration in a 3D City Model. Min Ding CS294-6 Final Presentation Dec 13, 2006
Camera Registration in a 3D City Model Min Ding CS294-6 Final Presentation Dec 13, 2006 Goal: Reconstruct 3D city model usable for virtual walk- and fly-throughs Virtual reality Urban planning Simulation
More informationStructured Light. Tobias Nöll Thanks to Marc Pollefeys, David Nister and David Lowe
Structured Light Tobias Nöll tobias.noell@dfki.de Thanks to Marc Pollefeys, David Nister and David Lowe Introduction Previous lecture: Dense reconstruction Dense matching of non-feature pixels Patch-based
More informationIdentifying Car Model from Photographs
Identifying Car Model from Photographs Fine grained Classification using 3D Reconstruction and 3D Shape Registration Xinheng Li davidxli@stanford.edu Abstract Fine grained classification from photographs
More informationCHAPTER 3. Single-view Geometry. 1. Consequences of Projection
CHAPTER 3 Single-view Geometry When we open an eye or take a photograph, we see only a flattened, two-dimensional projection of the physical underlying scene. The consequences are numerous and startling.
More informationMultiple View Geometry in Computer Vision Second Edition
Multiple View Geometry in Computer Vision Second Edition Richard Hartley Australian National University, Canberra, Australia Andrew Zisserman University of Oxford, UK CAMBRIDGE UNIVERSITY PRESS Contents
More informationMiniature faking. In close-up photo, the depth of field is limited.
Miniature faking In close-up photo, the depth of field is limited. http://en.wikipedia.org/wiki/file:jodhpur_tilt_shift.jpg Miniature faking Miniature faking http://en.wikipedia.org/wiki/file:oregon_state_beavers_tilt-shift_miniature_greg_keene.jpg
More informationReal-time surface tracking with uncoded structured light
Real-time surface tracking with uncoded structured light Willie Brink Council for Scientific and Industrial Research, South Africa wbrink@csircoza Abstract A technique for tracking the orientation and
More informationLecture 10: Multi-view geometry
Lecture 10: Multi-view geometry Professor Stanford Vision Lab 1 What we will learn today? Review for stereo vision Correspondence problem (Problem Set 2 (Q3)) Active stereo vision systems Structure from
More informationTEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA
TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA Tomoki Hayashi 1, Francois de Sorbier 1 and Hideo Saito 1 1 Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi,
More informationAn Overview of Matchmoving using Structure from Motion Methods
An Overview of Matchmoving using Structure from Motion Methods Kamyar Haji Allahverdi Pour Department of Computer Engineering Sharif University of Technology Tehran, Iran Email: allahverdi@ce.sharif.edu
More informationGeometric camera models and calibration
Geometric camera models and calibration http://graphics.cs.cmu.edu/courses/15-463 15-463, 15-663, 15-862 Computational Photography Fall 2018, Lecture 13 Course announcements Homework 3 is out. - Due October
More information3D Sensing. 3D Shape from X. Perspective Geometry. Camera Model. Camera Calibration. General Stereo Triangulation.
3D Sensing 3D Shape from X Perspective Geometry Camera Model Camera Calibration General Stereo Triangulation 3D Reconstruction 3D Shape from X shading silhouette texture stereo light striping motion mainly
More informationMultiple View Geometry
Multiple View Geometry Martin Quinn with a lot of slides stolen from Steve Seitz and Jianbo Shi 15-463: Computational Photography Alexei Efros, CMU, Fall 2007 Our Goal The Plenoptic Function P(θ,φ,λ,t,V
More information1 Projective Geometry
CIS8, Machine Perception Review Problem - SPRING 26 Instructions. All coordinate systems are right handed. Projective Geometry Figure : Facade rectification. I took an image of a rectangular object, and
More informationMulti-View 3D-Reconstruction
Multi-View 3D-Reconstruction Cedric Cagniart Computer Aided Medical Procedures (CAMP) Technische Universität München, Germany 1 Problem Statement Given several calibrated views of an object... can we automatically
More informationUnit 3 Multiple View Geometry
Unit 3 Multiple View Geometry Relations between images of a scene Recovering the cameras Recovering the scene structure http://www.robots.ox.ac.uk/~vgg/hzbook/hzbook1.html 3D structure from images Recover
More information3D Model Acquisition by Tracking 2D Wireframes
3D Model Acquisition by Tracking 2D Wireframes M. Brown, T. Drummond and R. Cipolla {96mab twd20 cipolla}@eng.cam.ac.uk Department of Engineering University of Cambridge Cambridge CB2 1PZ, UK Abstract
More informationFast, Unconstrained Camera Motion Estimation from Stereo without Tracking and Robust Statistics
Fast, Unconstrained Camera Motion Estimation from Stereo without Tracking and Robust Statistics Heiko Hirschmüller, Peter R. Innocent and Jon M. Garibaldi Centre for Computational Intelligence, De Montfort
More informationVisual Hulls from Single Uncalibrated Snapshots Using Two Planar Mirrors
Visual Hulls from Single Uncalibrated Snapshots Using Two Planar Mirrors Keith Forbes 1 Anthon Voigt 2 Ndimi Bodika 2 1 Digital Image Processing Group 2 Automation and Informatics Group Department of Electrical
More informationCSE 252B: Computer Vision II
CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribe: Sameer Agarwal LECTURE 1 Image Formation 1.1. The geometry of image formation We begin by considering the process of image formation when a
More informationA Desktop 3D Scanner Exploiting Rotation and Visual Rectification of Laser Profiles
A Desktop 3D Scanner Exploiting Rotation and Visual Rectification of Laser Profiles Carlo Colombo, Dario Comanducci, and Alberto Del Bimbo Dipartimento di Sistemi ed Informatica Via S. Marta 3, I-5139
More informationLecture 10: Multi view geometry
Lecture 10: Multi view geometry Professor Fei Fei Li Stanford Vision Lab 1 What we will learn today? Stereo vision Correspondence problem (Problem Set 2 (Q3)) Active stereo vision systems Structure from
More informationChapter 7. Conclusions and Future Work
Chapter 7 Conclusions and Future Work In this dissertation, we have presented a new way of analyzing a basic building block in computer graphics rendering algorithms the computational interaction between
More informationQuasi-Euclidean Uncalibrated Epipolar Rectification
Dipartimento di Informatica Università degli Studi di Verona Rapporto di ricerca Research report September 2006 RR 43/2006 Quasi-Euclidean Uncalibrated Epipolar Rectification L. Irsara A. Fusiello Questo
More informationPerformance Evaluation Metrics and Statistics for Positional Tracker Evaluation
Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation Chris J. Needham and Roger D. Boyle School of Computing, The University of Leeds, Leeds, LS2 9JT, UK {chrisn,roger}@comp.leeds.ac.uk
More informationJacobian of Point Coordinates w.r.t. Parameters of General Calibrated Projective Camera
Jacobian of Point Coordinates w.r.t. Parameters of General Calibrated Projective Camera Karel Lebeda, Simon Hadfield, Richard Bowden Introduction This is a supplementary technical report for ACCV04 paper:
More information3D Editing System for Captured Real Scenes
3D Editing System for Captured Real Scenes Inwoo Ha, Yong Beom Lee and James D.K. Kim Samsung Advanced Institute of Technology, Youngin, South Korea E-mail: {iw.ha, leey, jamesdk.kim}@samsung.com Tel:
More informationA COMPREHENSIVE TOOL FOR RECOVERING 3D MODELS FROM 2D PHOTOS WITH WIDE BASELINES
A COMPREHENSIVE TOOL FOR RECOVERING 3D MODELS FROM 2D PHOTOS WITH WIDE BASELINES Yuzhu Lu Shana Smith Virtual Reality Applications Center, Human Computer Interaction Program, Iowa State University, Ames,
More informationStereo Image Rectification for Simple Panoramic Image Generation
Stereo Image Rectification for Simple Panoramic Image Generation Yun-Suk Kang and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712 Korea Email:{yunsuk,
More informationVOLUMETRIC MODEL REFINEMENT BY SHELL CARVING
VOLUMETRIC MODEL REFINEMENT BY SHELL CARVING Y. Kuzu a, O. Sinram b a Yıldız Technical University, Department of Geodesy and Photogrammetry Engineering 34349 Beşiktaş Istanbul, Turkey - kuzu@yildiz.edu.tr
More informationEstimation of common groundplane based on co-motion statistics
Estimation of common groundplane based on co-motion statistics Zoltan Szlavik, Laszlo Havasi 2, Tamas Sziranyi Analogical and Neural Computing Laboratory, Computer and Automation Research Institute of
More informationFreehand Voxel Carving Scanning on a Mobile Device
Technion Institute of Technology Project in Image Processing and Analysis 234329 Freehand Voxel Carving Scanning on a Mobile Device Author: Student Number: 305950099 Supervisors: Aaron Wetzler, Yaron Honen,
More informationTwo-view geometry Computer Vision Spring 2018, Lecture 10
Two-view geometry http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 10 Course announcements Homework 2 is due on February 23 rd. - Any questions about the homework? - How many of
More informationProjective Geometry and Camera Models
/2/ Projective Geometry and Camera Models Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem Note about HW Out before next Tues Prob: covered today, Tues Prob2: covered next Thurs Prob3:
More informationProjector Calibration for Pattern Projection Systems
Projector Calibration for Pattern Projection Systems I. Din *1, H. Anwar 2, I. Syed 1, H. Zafar 3, L. Hasan 3 1 Department of Electronics Engineering, Incheon National University, Incheon, South Korea.
More informationCOMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION
COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA
More informationCompact and Low Cost System for the Measurement of Accurate 3D Shape and Normal
Compact and Low Cost System for the Measurement of Accurate 3D Shape and Normal Ryusuke Homma, Takao Makino, Koichi Takase, Norimichi Tsumura, Toshiya Nakaguchi and Yoichi Miyake Chiba University, Japan
More informationCS 565 Computer Vision. Nazar Khan PUCIT Lectures 15 and 16: Optic Flow
CS 565 Computer Vision Nazar Khan PUCIT Lectures 15 and 16: Optic Flow Introduction Basic Problem given: image sequence f(x, y, z), where (x, y) specifies the location and z denotes time wanted: displacement
More informationMultiview Stereo COSC450. Lecture 8
Multiview Stereo COSC450 Lecture 8 Stereo Vision So Far Stereo and epipolar geometry Fundamental matrix captures geometry 8-point algorithm Essential matrix with calibrated cameras 5-point algorithm Intersect
More informationSegmentation and Tracking of Partial Planar Templates
Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract
More informationCamera Calibration and 3D Scene Reconstruction from image sequence and rotation sensor data
Camera Calibration and 3D Scene Reconstruction from image sequence and rotation sensor data Jan-Michael Frahm and Reinhard Koch Christian Albrechts University Kiel Multimedia Information Processing Hermann-Rodewald-Str.
More informationGenetic Algorithms for Vision and Pattern Recognition
Genetic Algorithms for Vision and Pattern Recognition Faiz Ul Wahab 11/8/2014 1 Objective To solve for optimization of computer vision problems using genetic algorithms 11/8/2014 2 Timeline Problem: Computer
More informationProjective Geometry and Camera Models
Projective Geometry and Camera Models Computer Vision CS 43 Brown James Hays Slides from Derek Hoiem, Alexei Efros, Steve Seitz, and David Forsyth Administrative Stuff My Office hours, CIT 375 Monday and
More informationFast Outlier Rejection by Using Parallax-Based Rigidity Constraint for Epipolar Geometry Estimation
Fast Outlier Rejection by Using Parallax-Based Rigidity Constraint for Epipolar Geometry Estimation Engin Tola 1 and A. Aydın Alatan 2 1 Computer Vision Laboratory, Ecóle Polytechnique Fédéral de Lausanne
More informationLocal Image Registration: An Adaptive Filtering Framework
Local Image Registration: An Adaptive Filtering Framework Gulcin Caner a,a.murattekalp a,b, Gaurav Sharma a and Wendi Heinzelman a a Electrical and Computer Engineering Dept.,University of Rochester, Rochester,
More information3D shape from the structure of pencils of planes and geometric constraints
3D shape from the structure of pencils of planes and geometric constraints Paper ID: 691 Abstract. Active stereo systems using structured light has been used as practical solutions for 3D measurements.
More informationStereo vision. Many slides adapted from Steve Seitz
Stereo vision Many slides adapted from Steve Seitz What is stereo vision? Generic problem formulation: given several images of the same object or scene, compute a representation of its 3D shape What is
More informationLecture 9: Epipolar Geometry
Lecture 9: Epipolar Geometry Professor Fei Fei Li Stanford Vision Lab 1 What we will learn today? Why is stereo useful? Epipolar constraints Essential and fundamental matrix Estimating F (Problem Set 2
More informationImage Formation. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania
Image Formation Antonino Furnari Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania furnari@dmi.unict.it 18/03/2014 Outline Introduction; Geometric Primitives
More informationOverview. Augmented reality and applications Marker-based augmented reality. Camera model. Binary markers Textured planar markers
Augmented reality Overview Augmented reality and applications Marker-based augmented reality Binary markers Textured planar markers Camera model Homography Direct Linear Transformation What is augmented
More informationEasy to Use Calibration of Multiple Camera Setups
Easy to Use Calibration of Multiple Camera Setups Ferenc Kahlesz, Cornelius Lilge, and Reinhard Klein University of Bonn, Institute of Computer Science II, Computer Graphics Group Römerstrasse 164, D-53117
More informationCS 664 Structure and Motion. Daniel Huttenlocher
CS 664 Structure and Motion Daniel Huttenlocher Determining 3D Structure Consider set of 3D points X j seen by set of cameras with projection matrices P i Given only image coordinates x ij of each point
More informationFundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision
Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision What Happened Last Time? Human 3D perception (3D cinema) Computational stereo Intuitive explanation of what is meant by disparity Stereo matching
More information