3D reconstruction from uncalibrated images

Christian Edberg, Anders Ericsson

December 17, 2000

Abstract

This thesis will show in theory and in practice how to implement a 3D reconstruction algorithm. It is possible, using only images as input and allowing general camera motion and varying camera parameters, to reconstruct a scene up to an unknown scale factor. Feature extraction and matching, epipolar configuration (the fundamental matrix), projective reconstruction and self-calibration are the main topics, with the emphasis on self-calibration. We discovered that the method for self-calibration we have implemented (see [5]) is not robust enough for real images. However, we will present results for synthetic images, where the algorithm performs well. We will also briefly discuss other methods for image based modeling and rendering.

Sammanfattning

Denna rapport beskriver i teorin och i praktiken hur man kan implementera en algoritm för 3D-rekonstruktion. Det är möjligt att återskapa en scen upp till en okänd skalfaktor, utgående enbart från bilder tagna från godtyckliga kamerapositioner och med varierande kameraparametrar. Extrahering och matchning av featurepunkter, epipolär konfiguration (fundamentalmatris), projektiv rekonstruktion och automatisk kalibrering är de huvudsakliga delarna, med automatisk kalibrering som tyngdpunkt. Våra tester visar att den metod vi använt för automatisk kalibrering av kamerorna (se [5]) inte är stabil nog för riktiga bilder, tagna med riktiga kameror. Vi kommer dock att presentera resultat för syntetiska bilder, där algoritmen uppför sig bra. Vi kommer också att kortfattat beskriva andra metoder för bildbaserad modellering och rendering.

Preface

This thesis is the final part of the Master's program in Mathematics and Electrical Engineering at Chalmers University of Technology, Gothenburg, Sweden. Most of the work has been carried out at Summus Ltd. in Raleigh, North Carolina, USA, during the period June until late October. We would like to thank Prasanjit Panda and Yu Chen, our supervisors at Summus; Jiangying Zhou, head of research; Johan Råde, for reading and correcting this report; and the rest of the staff at Summus, especially Henrik Storm and Mona McCall, for taking care of us during our stay. We would also like to thank our supervisor at Chalmers, Vilhelm Adolfsson.

Notation

Coordinates

X = [X Y Z W]^T - homogeneous 3D coordinate
x = [x y w]^T - homogeneous 2D coordinate
u = [u v t]^T - homogeneous 2D coordinate in the pixel coordinate system
{X_1 .. X_N} - set of vectors (coordinates)
{x_1 .. x_N} - set of scalars
l - line in 2D space (3-vector)
\Pi - plane in 3D space (4-vector)
T - 3D translation
t - 2D translation
e_{ij} - epipole in image j (projection of camera center C_i into image j)

Cameras

P = K [R^T  -R^T C] = [M  m] - 3x4 camera projection matrix
R - 3x3 rotation matrix
K = [f_u s u_0; 0 f_v v_0; 0 0 1] - camera calibration matrix
[u_0 v_0]^T - principal point
f_u = f/p_x - horizontal scale factor
f_v = f/p_y - vertical scale factor
f - focal length
p_x - camera pixel width
p_y - camera pixel height
s = f_v tan(\alpha) - skewness
C - camera center

Special matrices

I - identity matrix
H - homography
H_{\Pi i} - homography from plane \Pi to image i
H^\Pi_{ij} - homography through plane \Pi from view i to view j
F - fundamental matrix
\Pi_\infty - plane at infinity
\Omega - absolute conic
\Omega^* - absolute dual quadric
\Omega_\infty - absolute conic embedded in the plane at infinity
\Omega^*_\infty - dual absolute conic embedded in the plane at infinity
\omega - image of the absolute conic
\omega^* - dual image of the absolute conic
G_P - projective transform
G_A - affine transform
G_M - metric transform
G_E - Euclidean transform
q_i - column vector i in matrix Q
q^j - row vector j in matrix Q

Operators

\sim - equivalence up to scale
\times - vector cross product
[a]_\times - antisymmetric cross product matrix, [a]_\times b = a \times b, with [a]_\times = [0 -a_3 a_2; a_3 0 -a_1; -a_2 a_1 0]
[.]^T - transpose
|.| - determinant
\doteq - equivalence in sign

Others

P^3 - projective 3D space
A^3 - affine 3D space
M^3 - metric 3D space
R^2 - Euclidean 2D space
R^3 - Euclidean 3D space

Contents

1 Image based modeling and rendering
  1.1 Introduction
  1.2 Interpolation
  1.3 Using the trilinear tensor
  1.4 3D reconstruction

2 3D modeling overview
  2.1 Introduction
  2.2 Overview of the method
  2.3 Relating the images
  2.4 Calibrating the cameras
  2.5 Dense correspondence
  2.6 Building the model

3 Projective geometry
  3.1 Introduction
  3.2 Homogeneous coordinates
  3.3 Transformations
  3.4 Stratification of 3D geometry (self-calibration; projective, affine, metric and Euclidean strata; conics and quadrics)
  3.5 Camera model (definitions, properties)
  3.6 Multiview relations (homographies; two views; more views; projection matrices and the fundamental matrix)

4 Extracting and matching feature points
  4.1 Introduction
  4.2 Method (4.2.1 Feature extraction, 4.2.2 Matching)
  4.3 Results

5 Estimating the fundamental matrix
  5.1 Introduction
  5.2 Linear methods for estimating the fundamental matrix
  5.3 Non-linear minimization
  5.4 RANSAC
  5.5 Results

6 Creating a projective reconstruction
  Introduction; How to make a projective model; Estimation of the initial cameras; Building an initial reconstruction; The triangulation problem (linear triangulation, mid-point method, the polynomial method); Updating the initial reconstruction; Estimating the cameras from the initial reconstruction; Dealing with outliers; Results

7 Self-calibration
  Introduction; Self-calibration in general (direct methods, stratified approach); The plane at infinity; Calibrating when the plane has been located (Hartley's approach, Pollefeys' approach); Different ways to locate the plane at infinity (finding the plane from parallel lines, Pollefeys' method, Hartley's method); Results

Chapter 1

Image based modeling and rendering

1.1 Introduction

There are plenty of commercial applications where views are created by rendering an existing 3D model. This is a well-studied problem, but even so, the results are often not photorealistic. Suppose we have modeled an object by measuring each item and then building an identical structure in a 3D rendering program. Then, we add textures and lighting effects. When rendering, we enter the coordinates of a virtual camera and the rendering program produces a 2D image of the object. However, knowing the position of the virtual camera, we could instead take a picture with a real camera from the same angle. This would produce a far better result and would by definition be photorealistic. Further, we would not have to go through the time-consuming process of building the 3D structure in the rendering program. On the other hand, if we want to create novel views, it takes time and effort to take another picture. It would be much easier to enter new coordinates for the virtual camera in the 3D rendering program and let the computer do the rest. Also, using a real camera, we must have access to the object for as long as we are creating new views. One solution to this problem could be to take hundreds of pictures from different angles. For every novel view we want to create, we choose the image with the most similar angle. Still, we do not get exactly the correct angle, and hundreds of pictures might be difficult to store in an efficient way.

Image based modeling and rendering tries to solve this problem. With a limited number of images of a static object, we want to be able to create as many new views as possible, including views that are not present in the input images. Most algorithms work in two phases. First, there is a registration phase, where data is collected, processed and stored in a suitable way. This is often the time-consuming phase and the most crucial for the outcome. However, it needs to be done only once. The next phase is called rendering. The data from the registration phase is used to create novel views, as specified by the user.

1.2 Interpolation

View synthesis using image interpolation is an appealing method, since it avoids some of the most common problems in image based modeling and rendering. This can be exploited to produce fast and stable algorithms. Here, we will discuss physically valid interpolation, as described in [8].

The main concern, when producing physically valid in-between views using interpolation, is to enforce the monotonicity constraint (see figure 1.1). Suppose that u_1 is a pixel in image 1, u_2 is the corresponding pixel in image 2, and u represents the corresponding pixel in a synthesized view. If another pixel u'_1 appears on the same side of u_1 as u'_2 does with respect to u_2, then u' must appear on that same side with respect to u for all synthesized views. This is generally not true when dealing with rotating cameras, but the constraint might be relaxed and still produce acceptable results.

Figure 1.1: The monotonicity constraint.

Another limitation, when using interpolation, is that in-between views can only be synthesized along the baseline of the two cameras (see figure 1.2). The baseline is defined as the line connecting the two camera centers. A possible solution to this problem could be to use several input images and interpolate between intermediate synthesized views. Suppose P_1, P_2 and P_3 are three cameras with corresponding images I_1, I_2 and I_3, used as input to the algorithm. The three cameras span a triangular area. Any view inside that area could be synthesized by first interpolating between I_1 and I_2 and then interpolating between the synthesized view and I_3. The main drawback when doing several consecutive interpolation operations is that the images are resampled several times and pixel errors propagate through each operation. Also, the positions of the cameras would have to be known. That would require camera calibration, and the algorithm would have a much higher complexity.

Figure 1.2: The baseline.

When creating physically valid synthesized views using interpolation, it is important to rectify the images first. Rectification means that the epipolar lines (see section 3.6.2) are aligned with the scanlines of the image, i.e. all pixels occurring along a scanline in one image must have their corresponding pixels (see chapter 4) occurring along the same scanline in the second image. Combined with the monotonicity constraint, interpolation along the scanline will then produce physically valid views; a minimal sketch of this step is given after this section's figures.

In order to calculate the epipolar lines, the authors of [8] suggest that the user manually define at least 4 correspondences, typically somewhere between 5 and 10. Since the scope of this thesis is to investigate methods that do not need any user interaction, this is an unacceptable restriction. Instead, the correspondences would have to be retrieved automatically. There are several different ways of doing so, but they all introduce new problems and the complexity of the algorithm would increase (see chapter 4).

Figure 1.3: Example of interpolation from [8]. Images a and c are the original input views. Image b is an interpolated view.
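To make the scanline interpolation step concrete, the following is a minimal NumPy sketch, not the implementation from [8]. It assumes rectified grayscale images of equal size and a given integer disparity map relating the two views; hole filling and occlusion handling are ignored.

```python
import numpy as np

def interpolate_view(img1, img2, disp, lam):
    """Synthesize an in-between view from a rectified image pair.

    img1, img2 : (H, W) grayscale images from rectified cameras.
    disp       : (H, W) integer array; pixel (r, c) in img1 corresponds
                 to pixel (r, c + disp[r, c]) in img2 (same scanline).
    lam        : position along the baseline, 0 -> img1, 1 -> img2.
    """
    h, w = img1.shape
    out = np.zeros((h, w), dtype=float)
    for r in range(h):
        for c in range(w):
            c2 = c + disp[r, c]                               # corresponding column in img2
            if 0 <= c2 < w:
                c_new = int(round((1.0 - lam) * c + lam * c2))  # interpolated position
                # Blend the two intensities; the monotonicity constraint keeps
                # this forward mapping from folding over along the scanline.
                out[r, c_new] = (1.0 - lam) * img1[r, c] + lam * img2[r, c2]
    return out
```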

1.3 Using the trilinear tensor

One approach to image based rendering, without building the entire 3D model but still using 3D geometry, is presented in [2]. The authors show that, given two images as input, it is possible to calculate the trilinear tensor and obtain a third view in an algebraic way. Since the trilinear tensor describes correspondences of points and lines in three views, all three views should be used in the calculations (see section 3.6.3). However, if two of the three views are set equal, only two views are necessary.

The registration part of the algorithm consists of calculating a dense correspondence map and a seed tensor. A novel view is created by computing a new tensor, based on the seed tensor. The strength of the algorithm is its simplicity. The intrinsic camera parameters (see section 3.5) are set to fixed values. The extrinsic camera parameters depend on a small rotation of the second camera, and by letting the translation of the second camera be small, an accurate dense correspondence map can be acquired. While these approximations promote speed and simplicity, they also introduce limitations and instability. It is not trivial to cover a wider area of the scene, since both rotation and translation are confined to small values. Also, there is no redundancy when using only two cameras, and the process of calculating image correspondences is often too delicate without redundancy.

Figure 1.4: Example from [2] using the trilinear tensor. Images a and e are the original input views. The rest are rendered novel views.

1.4 3D reconstruction

Several methods exist that take a shortcut towards the goal of automated image based rendering. They all have their unique advantages, but also unavoidable restrictions. Two prominent difficulties in image based modeling and rendering are feature extraction and camera calibration. Developing fast and efficient methods often means approximating the solution or bypassing one or both of these difficulties. Our goal has been to investigate general methods that do not require user interaction or extra hardware. Such methods must deal with both difficulties.

3D reconstruction, as described in this thesis, is automatic and enables every user with access to a camera to construct 3D models without any knowledge of 3D geometry and camera hardware. There are obvious advantages to obtaining a complete 3D model. For example, the computer game industry often uses large 3D models. An object could first be photographed and reconstructed, then refined and altered to fit a certain purpose. A general 3D reconstruction algorithm consists of the following steps:

General 3D reconstruction

1. Relate images
   (a) Feature extraction
   (b) Feature matching
   (c) Calculate constraints
2. Calibrate cameras
   (a) Initial reconstruction
       i. Estimate projection matrices
       ii. Triangulate 3D points
   (b) Refine reconstruction
       i. Estimate remaining projection matrices
       ii. Triangulate and update 3D points
   (c) Self-calibration (projective to metric)
3. Dense matching
4. Build the 3D model
   (a) Expand set of 3D points
   (b) Fit surfaces
   (c) Add texture

Algorithm 1: A general scheme for 3D reconstruction.

Most steps in algorithm 1 are well studied and reasonably stable solutions exist; one of them is sketched below. Camera calibration is an exception, and much research is currently taking place in that area.
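To make one of the steps in algorithm 1 concrete, here is a minimal NumPy sketch of linear triangulation (step 2(a)ii): given two projection matrices and a pair of corresponding pixels, the 3D point is recovered as the null vector of a small linear system. This is the standard DLT-style construction shown only as an illustration with arbitrary example cameras, not the implementation developed in this thesis; chapter 6 discusses triangulation (linear, mid-point and polynomial methods) in detail.

```python
import numpy as np

def triangulate(P1, P2, u1, u2):
    """Linear triangulation of one point from two views.

    P1, P2 : 3x4 projection matrices.
    u1, u2 : corresponding pixel coordinates (x, y) in the two images.
    Returns the homogeneous 3D point minimizing the algebraic error.
    """
    A = np.vstack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X / X[3]

# Two simple example cameras and a known 3D point, as a self-check.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # first camera [I | 0]
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])   # shifted one unit along x
X_true = np.array([0.3, 0.2, 4.0, 1.0])
u1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
u2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate(P1, P2, u1, u2))                               # ~ [0.3, 0.2, 4.0, 1.0]
```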

Figure 1.5: Example of 3D reconstruction from [7]. These five images are used as input to the algorithm.

Figure 1.6: Reconstruction made from the images in figure 1.5. The only difference between the images is that texture has been added in the first image.

Chapter 2

3D modeling overview

2.1 Introduction

The goal in 3D reconstruction is, given a set of images of a scene, to create a 3D model that can be used to render new views. This can be done if the relation between the images and the calibration of the cameras are known. A related image pair consists of two images where at least a number of pixels (preferably all) have been related (matched). Knowing the calibration of a camera means that all information about the position and rotation of the camera is known, and also that the intrinsic parameters, such as focal length and principal point, are known.

One way of solving the 3D reconstruction problem is to work with calibrated cameras. This, however, means that specialized hardware must be used to generate the sequence, which would result in high costs for the development of the model. Many new applications require robust low-cost acquisition systems. To fulfill these requirements, we must assume that the images were taken by an arbitrary camera and that nothing is known about the calibration. The relation between the images and the calibration of the cameras must therefore be extracted from the image sequence. These are the main steps in building a 3D model.

This chapter describes how to build a 3D model. First, an overview is presented in section 2.2. Then, every step of the algorithm is briefly described. For details, see chapters 3 to 7.

2.2 Overview of the method

To achieve our goal of building a 3D model, the relation between the images and the calibration of all cameras must be retrieved. To say anything about the camera, something about the image relation must be known. Therefore, the algorithm must first relate the images. When the images are related, calibration of the cameras is possible. On the other hand, relating the images becomes easier when the calibration is known.

For that reason, the next step in the algorithm is another, more detailed relation. This part of the relation is called dense correspondence. The dense correspondence part relates each pixel in every image to pixels in the other images. The 3D reconstruction algorithm can be divided into the following categories:

1. Relating the images
2. Calibrating the cameras
3. Dense correspondence
4. Building the 3D model

Compare this to algorithm 1 in section 1.4. After establishing the relation and calibration, the last step is to build the 3D model. When the dense correspondence and the cameras are known, it is possible to calculate the depth of every pixel, i.e. the 3D point of every pixel. This is the most computationally expensive part of the algorithm. The result is called the dense depth map of the image sequence. To obtain a solid 3D model from the dense depth map, one can either interpolate linearly between the 3D points or fit parametric surfaces through the 3D points. Fitting surfaces through the 3D points is the better choice, but also the more complex one. Once a surfaced 3D model has been obtained, the graphics in the images are mapped onto these surfaces using texture mapping, and novel views can be rendered.

2.3 Relating the images

In this section, the relating procedure is described briefly. The details are given in chapter 4 and chapter 5. The relating procedure can be divided into the following steps:

1. Finding points of interest (features)
2. Matching points
3. Calculating the fundamental matrices between image pairs

The purpose of this part of the algorithm is to relate the images. The relation is called the epipolar configuration. This is a relation between points in two images and can be expressed by the fundamental matrix. The properties and behavior of this matrix are described in chapter 3, and how to estimate it is described in chapter 5. The fundamental matrix is a 3x3 matrix which encodes the epipolar configuration between two images. To estimate this matrix, seven correct matches are necessary.

First, feature points have to be extracted in both images. Extracting feature points, which are points with certain properties, is therefore the first step in the relation algorithm. These points are found by using some feature point extractor, such as the Harris and Stephens corner detector. To study this in more detail, see chapter 4.

After the feature points have been extracted, the next step is to match these points. The problem of matching points, i.e. finding the same feature in different views of the same scene, is difficult. There are several algorithms to do this. It is important that the matches are reliable and do not contain any mismatches. Since mismatches may occur (in practice, maybe 10%-20% are mismatches), it is desirable to choose the best seven matches to use for calculation of the fundamental matrix. We used the RANSAC method (Random Sample Consensus). See section 5.4 for more details.

This part of the algorithm relates the images and calculates the fundamental matrix between image pairs. If the fundamental matrix is given, weakly calibrated cameras can be calculated. The point correspondences, together with weakly calibrated cameras, make it possible to build a projective model. Details are given in chapter 6. A projective model is the first step of calibrating the cameras. The fundamental matrix is also used to guide the relating procedure in both the matching part and the dense correspondence part.

2.4 Calibrating the cameras

Calibrating the cameras can be done in many ways. Most methods known today use a projective model as input to the calibration phase. To read more about how this is done, see chapter 7. A projective 3D model is a model in projective space. In metric space, which we are used to in real life, distances and angles are invariant. This is not the case in projective space. Lines that would be parallel in metric space are not parallel in projective space (see chapter 3). The projective model can be calculated from the matched points and the weakly calibrated cameras (see chapter 6). This model could be of use in robot vision, but is not useful for visualization. In self-calibration the projective model is of interest, since it is possible to transform a projective model to a metric one (see chapter 7).

Self-calibration is about finding a transformation that brings a non-metric model to metric space. Since distances and angles are invariant in metric space, we seek a transformation that transforms the model so that it fulfills these requirements. Finding this transformation is the most difficult problem in 3D reconstruction.

2.5 Dense correspondence

Suppose that an image sequence, taken by a camera that only translates along the horizon, is given. The sequence is numbered so that image 1 is the leftmost image and image N is the rightmost image. In a sequence like that, a point in image i is always to the left of its corresponding point in image j, if i < j.

Since the camera is only translating along the x-axis, the correspondences of a point throughout the sequence can be found by searching along horizontal lines. This simplifies the matching procedure a great deal. Suppose that a point correspondence is found in image i and image k. Then, if i < j < k, the correspondence in image j will lie in between the correspondences in image i and image k. Thus, it will lie to the right of the correspondence in image i and to the left of the correspondence in image k. Also, all pixels in between two point correspondences in image i must lie in between the same point correspondences in image j. Using these facts, it is possible to set up a dynamic programming scheme, as described in [26], to find the dense correspondence for a sequence with cameras purely translating along the x-axis.

Assume that the fundamental matrix between an image pair is known. Then, the epipoles in the two images are known. Now, transforming the epipoles so that the x-part of the epipoles lies at infinity would make all epipolar lines horizontal. This procedure is called rectification. In a rectified image pair, it is only necessary to search along horizontal lines in order to find corresponding points. When an image sequence has been rectified, it appears to have been taken with a camera purely translated along the horizon. The requirements for using the proposed dynamic programming scheme for dense correspondence are therefore fulfilled. Thus, rectification followed by the dynamic programming scheme makes dense correspondence possible.

2.6 Building the model

At this stage, the cameras have been recovered and the dense correspondence has been found. Using the dense correspondence together with the cameras, it is possible to estimate the depth of each correspondence. Thus, a 3D model consisting of thousands of points can be retrieved. A 3D model consisting of only points would not be satisfying. Somehow, the image data must be mapped onto the model. By fitting parametric surfaces through the points, it is possible to obtain a model consisting of surfaces, instead of just points. If all these surfaces are sampled in a fine-scale triangular pattern, a realistic 3D model can be obtained. The photo-realism is accomplished by mapping the graphics from the images onto these triangles. This would give us a complete 3D model. All known 3D effects could be applied to the model, e.g. light sources, textures and ray-tracing. Then, new views could be rendered from all angles and directions.

Chapter 3

Projective geometry

3.1 Introduction

This chapter discusses the theoretical foundation of this thesis. Projective geometry provides the necessary mathematical framework. Together with knowledge of the cameras, we show theoretically how automated 3D reconstruction can be achieved, using only 2D images as input.

3.2 Homogeneous coordinates

It is often more convenient to use a homogeneous representation of coordinates. Such a coordinate is of one dimension higher than its normal representation. For example, a point [x y]^T has the homogeneous representation [x' y' w]^T, where x = x'/w and y = y'/w.

A 2D point x = [x y]^T lies on a 2D line l = [l_1 l_2 l_3]^T if and only if x l_1 + y l_2 + l_3 = 0. This can also be written as the inner product [x y 1][l_1 l_2 l_3]^T = 0. Multiplication with a non-zero scale factor \mu does not change the expression, and [\mu x \; \mu y \; \mu][l_1 l_2 l_3]^T = 0 still holds. Any coordinate x = [x y]^T in the 2D plane can be represented by its homogeneous version x = [x y w]^T, where x = [x/w \; y/w]^T. Thus, a multiplication with a non-zero scalar does not affect the coordinate:

\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim w \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx \\ wy \\ w \end{bmatrix}

The same reasoning applies to 3D coordinates. A homogeneous representation of a 3D point can be written as X = [X Y Z W]^T, and it lies in a plane \Pi = [\Pi_1 \Pi_2 \Pi_3 \Pi_4]^T if and only if \Pi^T X = 0. If the homogeneous factor is equal to one, the coordinate is said to be normalized.
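As a concrete illustration of this scale invariance, the short NumPy sketch below (an illustrative example, not part of the original text) converts between homogeneous and normalized 2D coordinates and checks the point-on-line condition.

```python
import numpy as np

def to_homogeneous(p):
    """[x, y] -> [x, y, 1]"""
    return np.append(np.asarray(p, dtype=float), 1.0)

def normalize(x):
    """[x, y, w] -> [x/w, y/w, 1], valid for w != 0."""
    return x / x[-1]

p = np.array([3.0, 4.0])
x = to_homogeneous(p)              # [3, 4, 1]
x_scaled = 5.0 * x                 # [15, 20, 5] represents the same 2D point
assert np.allclose(normalize(x), normalize(x_scaled))

# A point lies on a line l iff the inner product is zero (up to scale).
l = np.array([1.0, -1.0, 1.0])     # the line x - y + 1 = 0
print(np.isclose(l @ to_homogeneous([2.0, 3.0]), 0.0))   # True: (2, 3) lies on the line
```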

3.3 Transformations

Throughout this chapter, a variety of different entities will be introduced. Transforms often apply to each entity differently, and due to the number of different rules, a small reference might be useful. Most of the transformation rules presented here will be used and referenced later in this chapter.

X → GX - 3D point
P → PG^{-1} - camera
\Pi^T → \Pi^T G^{-1} - 3D plane
C → G^{-T} C G^{-1} - conic
C^* → G C^* G^T - dual conic
Q → G^{-T} Q G^{-1} - quadric
Q^* → G Q^* G^T - dual quadric

Using the transformation rules described above, we see for example that PX is unchanged under a transformation G:

PX → (PG^{-1})(GX) = PX

3.4 Stratification of 3D geometry

Stratification of 3D geometry means that different strata are defined based on invariants. For every stratum, there is a set of invariants and a set of transforms that preserve the invariants. The stratum with the largest set of transforms (and therefore the smallest set of invariants) is the projective stratum. The affine stratum is a subset of the projective stratum, the metric stratum is a subset of the affine stratum, and the Euclidean stratum is a subset of the metric stratum. Upgrading the 3D model from one stratum to another means that invariants are identified and the set of transforms is shrunk (see figure 3.1).

Figure 3.1: Relation between different strata. The projective stratum has the largest set of transforms and the Euclidean stratum has the smallest set of transforms.

The affine stratum (perhaps even the projective stratum) might be good enough for some robot vision applications, but not for human vision. Humans see the world as if it were in the metric stratum. We are good at spotting parallel lines and estimating angles, but we cannot perceive absolute length. An object might be small and up close, or large and far away. It will still look the same to us.

The only difference between the Euclidean stratum and the metric stratum is that scale is not invariant in the metric stratum. To understand why the metric stratum is good enough for human vision, one can look at special effects in movies and realize that scale is not a relevant factor when creating a realistic reconstruction of a scene. Small-scale models, filmed with high speed cameras, look very realistic on screen.

3.4.1 Self-calibration

A variety of different approaches to self-calibration exist, and the majority of them use a projective reconstruction in order to find the missing camera parameters. The difference between a metric and a projective reconstruction can be seen in figure 3.2. It is clear that a projective reconstruction is not nearly good enough for visualization. The model has to be upgraded to something the human visual system can understand.

Figure 3.2: Difference between metric and projective space.

Self-calibration is the process of upgrading the model from projective to metric space. A general approach is given below.

1. Obtain a projective reconstruction {{P_1..P_N}, {X_1..X_M}}
2. Find a transform G_{P→M} that upgrades the reconstruction to metric: {{P_1..P_N} G_{P→M}^{-1}, G_{P→M} {X_1..X_M}}

However, in a stratified approach, a transform is first identified that upgrades the reconstruction to affine. Then a transform is found that upgrades it to metric.

1. Obtain a projective reconstruction {{P_1..P_N}, {X_1..X_M}}
2. Find a transform G_{P→A} that upgrades the reconstruction to affine: {{P_1..P_N} G_{P→A}^{-1}, G_{P→A} {X_1..X_M}}
3. Find a transform G_{A→M} that upgrades the reconstruction to metric: {{P_1..P_N} G_{P→A}^{-1} G_{A→M}^{-1}, G_{A→M} G_{P→A} {X_1..X_M}}

Upgrading the model from projective to metric is similar to how we normally think of camera calibration, i.e. using calibration objects and references present in the scene and the images. Often a grid is placed in front of the camera. Knowing how the grid should project onto the images, camera projection matrices can be defined that satisfy this a priori information.

The most difficult problem when dealing with calibration objects is probably to find the calibration objects in the images. Once they are found, it is often easy to find the camera parameters. Self-calibration, in the context of projective geometry, uses invariants instead of calibration objects. The most common way to calibrate the cameras is to search for the absolute conic (see section 3.4.6) that is embedded in the plane at infinity (see section 3.4.3). In other words, if the absolute conic is considered to be the calibration object, we must first find it by locating the plane at infinity. Since we know how the absolute conic should project onto the images, we can calculate the missing camera parameters.

Figure 3.3: Different strata.

3.4.2 Projective stratum

Projective three-dimensional space will be referred to as P^3. The projective stratum has the smallest number of invariants and the largest set of transforms.

A projective transform in P^3 can be represented by a 4x4 invertible (full rank) matrix. Scale is undefined, so the transform has 15 degrees of freedom (d.o.f.).

Degrees of freedom: 15
Transforms: G_P = [p_{11} p_{12} p_{13} p_{14}; p_{21} p_{22} p_{23} p_{24}; p_{31} p_{32} p_{33} p_{34}; p_{41} p_{42} p_{43} 1]
Invariants: intersection and tangency of surfaces, the sign of Gaussian curvature, the cross-ratio

The fact that the sign of Gaussian curvature is invariant means that saddle points on a surface cannot be transformed to extreme points and vice versa. Another interesting property in P^3 is the cross-ratio. Assume four collinear points X_1, X_2, X_3 and X_4 on a line defined by two points X and X'. If no point is coincident with X or X', then they can be represented by X_i = X + \lambda_i X'. The cross-ratio is defined as

cross-ratio = \frac{\lambda_1 - \lambda_3}{\lambda_1 - \lambda_4} \cdot \frac{\lambda_2 - \lambda_4}{\lambda_2 - \lambda_3}

A projective transformation is any invertible 4x4 matrix that preserves all the projective invariants. Since there are 15 degrees of freedom and undefined scale, any invertible 4x4 matrix will suffice.

3.4.3 Affine stratum

Affine three-dimensional space will be referred to as A^3. The affine stratum is a subset of the projective stratum. It has more invariants and therefore a smaller set of transforms. It differs from the projective stratum by identifying a plane called the plane at infinity. A plane has three degrees of freedom, which implies that the affine stratum has three degrees of freedom less than the projective stratum, namely 12 compared to 15.

Degrees of freedom: 12
Transforms: G_A = [a_{11} a_{12} a_{13} a_{14}; a_{21} a_{22} a_{23} a_{24}; a_{31} a_{32} a_{33} a_{34}; 0 0 0 1]
Invariants: projective invariants, parallelism of planes, volume ratios, centroids, the plane at infinity

An affine transformation is any invertible 4x4 matrix that preserves all the affine invariants. Note that the projective invariants constitute a subset of the affine invariants.

Plane at infinity

The plane at infinity \Pi_\infty = [\Pi_1 \; \Pi_2 \; \Pi_3 \; \Pi_4]^T is the plane where parallel planes and lines intersect.

It has three degrees of freedom and is fixed under affine transformations. Points in the plane remain in the plane under affine transformations, but their positions within it are not fixed.

Figure 3.4: The plane at infinity is defined by vanishing points. Vanishing points lie at the intersection of parallel lines.

Once the projective coordinates of the plane have been found, a transform G_{P→A} that moves the plane to its canonical position [0 \; 0 \; 0 \; 1]^T in affine space can be applied. This is how the reconstruction is upgraded from projective to affine space.

G_{P→A} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \Pi_1/\Pi_4 & \Pi_2/\Pi_4 & \Pi_3/\Pi_4 & 1 \end{bmatrix}
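To illustrate this upgrade step, here is a minimal NumPy sketch (an illustrative example, not code from the thesis) that builds G_{P→A} from hypothetical projective coordinates of the plane at infinity and applies it to a reconstruction, following the rules X → GX and P → PG^{-1} from section 3.3.

```python
import numpy as np

def projective_to_affine(pi_inf):
    """Build G_PA from the plane at infinity Pi = [Pi1, Pi2, Pi3, Pi4]^T (Pi4 != 0)."""
    G = np.eye(4)
    G[3, :3] = pi_inf[:3] / pi_inf[3]    # last row becomes [Pi1/Pi4, Pi2/Pi4, Pi3/Pi4, 1]
    return G

pi_inf = np.array([0.2, -0.1, 0.4, 1.0])         # hypothetical projective plane coordinates
G = projective_to_affine(pi_inf)

# The plane at infinity is mapped to its canonical position [0, 0, 0, 1]^T
# (planes transform as Pi^T -> Pi^T G^-1).
print(pi_inf @ np.linalg.inv(G))                  # proportional to [0, 0, 0, 1]

# Upgrade a reconstruction: points as X -> G X, cameras as P -> P G^-1.
X = np.array([1.0, 2.0, 3.0, 1.0])                # a homogeneous 3D point
P = np.hstack([np.eye(3), np.zeros((3, 1))])      # the canonical first camera P = [I | 0]
X_affine = G @ X
P_affine = P @ np.linalg.inv(G)
assert np.allclose(P @ X, P_affine @ X_affine)    # projections are unchanged
```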

3.4.4 Metric stratum

Metric three-dimensional space will be referred to as M^3. The metric stratum is a subset of the affine stratum. The metric stratum identifies an entity called the absolute conic. It is a symmetric 3x3 matrix, defined up to scale, and therefore has 5 degrees of freedom. Since the affine stratum has 12 degrees of freedom, the metric stratum must have 7 degrees of freedom.

Degrees of freedom: 7
Transforms: G_M = [\mu r_{11} \mu r_{12} \mu r_{13} T_1; \mu r_{21} \mu r_{22} \mu r_{23} T_2; \mu r_{31} \mu r_{32} \mu r_{33} T_3; 0 0 0 1]
Invariants: affine invariants, the absolute conic

Rotation (3 d.o.f.), translation (3 d.o.f.) and scale (1 d.o.f.) are the allowed transforms in the metric stratum. In order to upgrade the reconstruction to metric, the absolute conic has to be identified. Once the affine coordinates of the absolute conic have been found, a transform is applied that moves the conic to its canonical form in the metric stratum. Conics are described in more detail in section 3.4.6.

3.4.5 Euclidean stratum

In the Euclidean stratum, only translation and rotation are allowed transforms. It is of little interest here, since the metric stratum is sufficient for visualization. Further, upgrading from metric to Euclidean space would require either user input or calibration objects present in the scene. Knowing the length between two points in the scene would be sufficient to upgrade the entire model to the Euclidean stratum.

3.4.6 Conics and quadrics

Conics and quadrics play a very important part in self-calibration. The absolute conic is a conic embedded in the plane at infinity, and it is invariant under metric and Euclidean transforms. Recovering the affine coordinates of the absolute conic and then applying a transform that brings the absolute conic to its canonical form in metric space (the identity matrix) will upgrade the reconstruction to metric.

Conics

The conics discussed here exist in P^2 and can be represented by 3x3 matrices. They are symmetric and defined up to a non-zero scale factor, which results in 5 independent parameters.

Definition. A conic in P^2 consists of all points x satisfying

x^T C x = 0    (3.1)

where C is a 3x3 symmetric matrix, defined up to a non-zero scale factor.

Absolute conic

The absolute conic \Omega is embedded in the plane at infinity.

In the metric stratum, where \Pi_\infty = [0 \; 0 \; 0 \; 1]^T, the absolute conic has its canonical form \Omega_M: points on \Omega_M in the metric frame satisfy

X_1^2 + X_2^2 + X_3^2 = 0, \qquad X_4 = 0    (3.2)

On the plane at infinity (X_4 = 0), equation (3.2) can be written as

\begin{bmatrix} X_1 & X_2 & X_3 \end{bmatrix} \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\Omega_M} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = 0

where \Omega_M corresponds to a conic C = I_{3\times 3}.

Figure 3.5: The absolute conic embedded in the plane at infinity.

Dual conics

Definition. A dual conic in P^2 consists of all lines l satisfying

l^T C^* l = 0    (3.3)

where C^* is a 3x3 symmetric matrix, defined up to a non-zero scale factor.

Figure 3.6: Illustration of a conic and a dual conic.

Theorem. Suppose the conic C is of full rank and the tangents of C belong to the dual conic C^*. Then C^* is the inverse of C:

C^* \sim C^{-1}    (3.4)

Proof. Let S(x) = x^T C x and S(x, x') = x^T C x'. Further, let x and x' define a line. All points along this line can be expressed in \lambda as x + \lambda x'. Points along this line lie on the conic C iff

S(x + \lambda x') = 0

This can also be written as

S(x) + 2\lambda S(x, x') + \lambda^2 S(x') = 0    (3.5)

This is a second degree polynomial with two solutions. Therefore, a line in general intersects with a conic in two points. These two points coincide if the discriminant of equation (3.5) is zero:

S(x, x')^2 - S(x) S(x') = 0    (3.6)

Suppose x is fixed to lie on the conic. Then S(x) = 0 and equation (3.6), describing the tangents, is simplified:

S(x, x') = x^T C x' = 0

This is linear in x', which implies that there is only one tangent at each point belonging to the conic. This tangent can be represented by

l \sim C^T x = C x    (3.7)

Using this result in equation (3.3), we get

l^T C^* l = (Cx)^T C^* (Cx) = x^T C^T C^* C x = x^T C C^* C x

where C = C^T. If C^* = C^{-1}, then

l^T C^* l = x^T C x = 0

Quadrics

Similar to conics in P^2, quadrics exist in P^3. They are represented by symmetric 4x4 matrices, defined up to a non-zero scale factor. Thus, they have 9 independent parameters.

Definition. A quadric in P^3 consists of all points X satisfying

X^T Q X = 0    (3.8)

where Q is a 4x4 symmetric matrix, defined up to a non-zero scale.
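As a quick numerical illustration of theorem (3.4) (this example is not from the thesis), one can take the unit circle as the conic C, compute a tangent line with equation (3.7), and check that it satisfies the dual conic equation (3.3) with C^* = C^{-1}.

```python
import numpy as np

# The unit circle x^2 + y^2 - 1 = 0 as a conic matrix.
C = np.diag([1.0, 1.0, -1.0])
C_star = np.linalg.inv(C)                 # dual conic, theorem (3.4)

x = np.array([0.6, 0.8, 1.0])             # a homogeneous point on the conic
assert np.isclose(x @ C @ x, 0.0)         # x^T C x = 0, equation (3.1)

l = C @ x                                 # tangent line at x, equation (3.7)
print(np.isclose(l @ C_star @ l, 0.0))    # True: l^T C* l = 0, equation (3.3)
```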

Dual quadric

Definition. A dual quadric in P^3 consists of all planes \Pi satisfying

\Pi^T Q^* \Pi = 0    (3.9)

where Q^* is a 4x4 symmetric matrix, defined up to a non-zero scale.

Consider an ellipsoid that is made infinitely flat. The planes belonging to Q^* are the tangent planes at the contour of the infinitely flat ellipsoid. In a metric frame, where \Pi_\infty = [0 \; 0 \; 0 \; 1]^T, the contour of the infinitely flat ellipsoid is equal to the absolute conic (see figure 3.7).

Figure 3.7: The dual absolute quadric embedded in the plane at infinity.

Proposition. Suppose the quadric Q is of full rank and the tangent planes of Q belong to the dual quadric Q^*. Then Q^* is the inverse of Q:

Q^* \sim Q^{-1}    (3.10)

Similar to equation (3.7), the tangent plane \Pi can be written as \Pi = QX.

Image of the absolute conic and the dual absolute conic

3D points on the plane at infinity \Pi_\infty may be written as X = [x^T \; 0]^T. If they are projected onto the pixel plane of a camera P_i, we get

u_i = P_i X = K_i [R_i^T \; -R_i^T C_i] \begin{bmatrix} x \\ 0 \end{bmatrix} = K_i R_i^T x    (3.11)

We see that K_i R_i^T maps points on the plane at infinity to the pixel plane of camera P_i. An important property of equation (3.11) is that the mapping (actually a planar homography, see section 3.6.1) is independent of the camera position C_i.

Figure 3.8: Image of the absolute conic.

Since the absolute conic is located on the plane at infinity (see figure 3.8), we can apply the transform in equation (3.11) and calculate an image of the absolute conic \omega (see section 3.3):

\omega = (K_i R_i^T)^{-T} \, \Omega \, (K_i R_i^T)^{-1}    (3.12)

In metric and Euclidean space, the absolute conic is represented by its canonical form I_{3\times 3}. Equation (3.12) then becomes

\omega_i = (K_i R_i^T)^{-T} I_{3\times 3} (K_i R_i^T)^{-1} = K_i^{-T} R_i^{-1} R_i K_i^{-1} = K_i^{-T} K_i^{-1}    (3.13)

where R^T = R^{-1}. This is an important result, since it shows that the image of the absolute conic can be directly related to the calibration matrix. The same reasoning can be applied to the image of the dual absolute conic; the difference is how transformations apply to the dual absolute conic (see section 3.3):

\omega^*_i = (K_i R_i^T) I_{3\times 3} (K_i R_i^T)^T = K_i R_i^T R_i K_i^T = K_i K_i^T    (3.14)

This shows that the image of the dual absolute conic is the inverse of the image of the absolute conic:

\omega^*_i = \omega_i^{-1}    (3.15)
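The practical value of equations (3.13)-(3.15) is that, once the image of the dual absolute conic has been estimated, the calibration matrix follows from a Cholesky-type factorization of \omega^* = K K^T. Below is a minimal NumPy sketch of that step, shown under the assumption that \omega^* is known exactly with arbitrary example values; it is not the estimation procedure discussed in chapter 7.

```python
import numpy as np

def calibration_from_dual_conic(omega_star):
    """Recover K (upper triangular, positive diagonal) from omega* = K K^T."""
    # np.linalg.cholesky gives a lower-triangular factor; factor the row/column
    # reversed matrix to obtain an upper-triangular K instead.
    J = np.fliplr(np.eye(3))                 # permutation that reverses rows/columns
    L = np.linalg.cholesky(J @ omega_star @ J)
    K = J @ L @ J                            # upper triangular, K K^T = omega*
    return K / K[2, 2]                       # fix the arbitrary scale so K[2, 2] = 1

# Ground-truth calibration matrix (arbitrary example values).
K_true = np.array([[800.0,   0.5, 320.0],
                   [  0.0, 810.0, 240.0],
                   [  0.0,   0.0,   1.0]])
omega_star = K_true @ K_true.T               # equation (3.14)
print(np.allclose(calibration_from_dual_conic(omega_star), K_true))   # True
```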

3.5 Camera model

Research in the area of 3D reconstruction is very much concentrated around camera calibration. The goal is to find all twelve elements of the projection matrix P. Given a camera model, the projection matrix describes how 3D points are projected by the camera. The most commonly used camera model is the projective camera model, which corresponds to an ideal pinhole camera. This section will define and explain the necessary parameters of the projective camera model.

3.5.1 Definitions

The camera coordinate system has its origin in the camera center C. Orthogonal to its Z-axis, at distance Z = f, where f represents the focal length, lies the image plane. The Z-axis will from here on be referred to as the principal ray. The intersection between the principal ray and the image plane is called the principal point [u_0 \; v_0]^T. There are two coordinate systems in the image plane, spanned by orthogonal base vectors (x, y) and non-orthogonal base vectors (u, v). The first system has its origin in the principal point and is called the image coordinate system. The latter has its origin in the upper left corner of the pixel area and is called the pixel coordinate system. Parallel to the image plane lies the principal plane, which contains the camera center.

Consider the simplest case, where the camera coordinate system is aligned with the world coordinate system and the focal length is set to one. A projected point in the image coordinate system is then given by

[x \; y]^T = [X/Z \; Y/Z]^T

The projective camera model tells us how to map coordinates in the image coordinate system to the pixel coordinate system. How the mapping is done depends on the intrinsic parameters of the camera. Table 3.1 explains what parameters are covered by the projective camera model. Usually, the parameters f, p_x and p_y are not given explicitly. Instead, they are given as f_u = f/p_x and f_v = f/p_y. The advantages are that this is a shorter notation and that the quantities are dimensionless.

Figure 3.9: Camera model.

A point (x, y) in the image coordinate system can be described in the pixel coordinate system as u = [u \; v]^T, where

u = f_u x + s y + u_0
v = f_v y + v_0    (3.16)

Using homogeneous coordinates and introducing the calibration matrix K, the mapping from the image plane to the pixel plane in equation (3.16) can be written as

\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} f_u & s & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}}_{K} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}    (3.17)

So far, it has been assumed that the camera coordinate system coincides with the world coordinate system. The camera might be translated by a vector T and rotated by a rotation matrix R. In a camera coordinate system transformed with a translation and a rotation (see figure 3.10), it is possible to apply the inverse transformation and be back in the simple case. This is possible since it does not change the relation between the cameras and the 3D points; it merely redefines the coordinate system. A 3D point expressed in the camera coordinate system is then

\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}_{\mathrm{cam}} = R^T \begin{bmatrix} X - T_1 \\ Y - T_2 \\ Z - T_3 \end{bmatrix}    (3.18)

f - Focal length: the distance between the projection center and the retinal plane. Focal length corresponds to zoom.
p_x - Pixel width.
p_y - Pixel height.
[u_0 v_0]^T - Principal point: the intersection between the principal ray and the image plane, in the pixel coordinate system.
s - Skewness: a pixel does not have to be rectangular, but can be a parallelogram. The skewness is defined by s = (f/p_y) tan(\alpha), where \alpha is the angle between the pixel base and the pixel side. A common assumption, however, is that pixels are square, i.e. s is zero and p_x = p_y.

Table 3.1: Intrinsic parameters.

Equations (3.17) and (3.18) combined result in

\begin{bmatrix} u \\ v \\ t \end{bmatrix} = K R^T \begin{bmatrix} X - T_1 \\ Y - T_2 \\ Z - T_3 \end{bmatrix}    (3.19)

Using the homogeneous notation for the 3D coordinate X, equation (3.19) can be written as a 3x4 matrix multiplied with the homogeneous coordinate:

u = K R^T \begin{bmatrix} X - T_1 \\ Y - T_2 \\ Z - T_3 \end{bmatrix} = \begin{bmatrix} K R^T & -K R^T T \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \underbrace{K \begin{bmatrix} R^T & -R^T T \end{bmatrix}}_{P} X    (3.20)

This important equation describes how 3D points are projected onto the pixel plane, using the projection matrix

P = K \begin{bmatrix} R^T & -R^T C \end{bmatrix}    (3.21)

where C is the camera center.
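The following short NumPy sketch (an illustration with arbitrary example values, not code from the thesis) builds a projection matrix according to equation (3.21) and projects a homogeneous 3D point as in equation (3.20).

```python
import numpy as np

def projection_matrix(K, R, C):
    """P = K [R^T | -R^T C], equation (3.21)."""
    return K @ np.hstack([R.T, -R.T @ C.reshape(3, 1)])

def project(P, X):
    """Project a homogeneous 3D point and normalize to pixel coordinates."""
    u = P @ X
    return u[:2] / u[2]

K = np.array([[800.0,   0.0, 320.0],     # calibration matrix (square pixels, no skew)
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                            # camera aligned with the world frame
C = np.array([0.0, 0.0, -2.0])           # camera center two units behind the origin

P = projection_matrix(K, R, C)
X = np.array([0.5, 0.25, 2.0, 1.0])      # homogeneous 3D point in front of the camera
print(project(P, X))                     # pixel coordinates: [420., 290.]
```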

Figure 3.10: Translated and rotated camera.

It is common to divide the projection matrix into one 3x3 matrix M and one 3x1 vector m:

P = [M \; m]

3.5.2 Properties

Principal plane and principal ray

The principal plane is the plane that contains the camera center and is perpendicular to the principal ray. All points X_{pp} located on the principal plane project to [u \; v \; 0]^T:

\begin{bmatrix} u \\ v \\ 0 \end{bmatrix} = P X_{pp}    (3.22)

Equation (3.22) suggests that the principal plane is defined by the last row of P:

0 = \begin{bmatrix} p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = p^3 X_{pp}    (3.23)

A plane is represented by the general form AX + BY + CZ + D = 0, where [A \; B \; C]^T is the normal vector of the plane. In the case of the principal plane, the normal vector coincides with the principal ray. The normal vector is described by the first three coefficients [p_{31} \; p_{32} \; p_{33}]^T of the last row of the projection matrix P.

Camera center

Theorem. Let p_j be column vector j in P = [p_1 .. p_4], and form the 3x3 matrix Q_i = [.. p_j ..]_{j \neq i} consisting of all columns of P except column i. Then the coefficients of the camera center C = [c_1 \; c_2 \; c_3 \; c_4]^T of a camera with projection matrix P can be written as

c_i = (-1)^i |Q_i|    (3.24)

Proof. The camera center projects onto the pixel plane as

P \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

This means that for every row p^i of the camera matrix, the following equation must hold:

\begin{bmatrix} p_{i1} & p_{i2} & p_{i3} & p_{i4} \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} = 0    (3.25)

Now form a 4x4 matrix A consisting of P and a 1x4 row vector v added at the bottom:

A = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \\ v_1 & v_2 & v_3 & v_4 \end{bmatrix}

Cofactor expansion of the determinant |A| along the last row results in

|A| = v_1 (-1)^1 |Q_1| + v_2 (-1)^2 |Q_2| + v_3 (-1)^3 |Q_3| + v_4 (-1)^4 |Q_4| = v \begin{bmatrix} (-1)^1 |Q_1| \\ (-1)^2 |Q_2| \\ (-1)^3 |Q_3| \\ (-1)^4 |Q_4| \end{bmatrix}

Now, by setting v equal to p^i for some i, we know that |A| is equal to zero, since two rows are exactly the same:

p^i \begin{bmatrix} (-1)^1 |Q_1| \\ (-1)^2 |Q_2| \\ (-1)^3 |Q_3| \\ (-1)^4 |Q_4| \end{bmatrix} = 0    (3.26)

Combining equations (3.25) and (3.26) shows that

C = \begin{bmatrix} (-1)^1 |Q_1| \\ (-1)^2 |Q_2| \\ (-1)^3 |Q_3| \\ (-1)^4 |Q_4| \end{bmatrix}

and c_i = (-1)^i |Q_i|.
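A small NumPy check of theorem (3.24) (an illustrative example with arbitrary camera values, not from the thesis): the camera center can be computed either with the cofactor formula or, equivalently, as the right null vector of P obtained from the SVD, since PC = 0.

```python
import numpy as np

def center_from_cofactors(P):
    """Camera center via c_i = (-1)^i |Q_i|, equation (3.24); i is 1-based."""
    return np.array([(-1) ** (i + 1) * np.linalg.det(np.delete(P, i, axis=1))
                     for i in range(4)])

def center_from_nullspace(P):
    """Camera center as the right null vector of P."""
    _, _, Vt = np.linalg.svd(P)
    return Vt[-1]

K = np.diag([800.0, 800.0, 1.0])
R = np.eye(3)
C = np.array([1.0, -2.0, 3.0])
P = K @ np.hstack([R.T, -R.T @ C.reshape(3, 1)])   # P = K [R^T | -R^T C]

c1 = center_from_cofactors(P)
c2 = center_from_nullspace(P)
# Both are homogeneous vectors proportional to [1, -2, 3, 1].
print(c1 / c1[-1], c2 / c2[-1])
```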

Definition 3.5.1 A camera P = [M \; m] is said to be normalized if |M| = 1.

Recall equation (3.24), where it is shown that the homogeneous factor of the camera center can be written as

c_4 = (-1)^4 |Q_4| = |M|

Thus, for a normalized camera, the homogeneous factor is equal to one. In order to normalize a camera, we have to divide P by |M|^{1/3}, since |\mu M| = \mu^3 |M|.

Cheirality

The word cheirality is Greek and means hand or side. Hartley introduced cheirality in [3] as the property of a 3D point that specifies whether it lies in front of or behind a camera. Though it might seem obvious that all 3D points seen by a camera must lie in front of that camera, that is rarely the case in projective space. Enforcing cheirality on a reconstruction in projective space limits the search for the camera parameters in P.

Proposition. If |M| > 0 for a camera P = [M \; m], then X lies in front of the camera iff t > 0 in PX = [u \; v \; t]^T.

Equation (3.23) shows that [p_{31} \; p_{32} \; p_{33}]^T is the normal vector of the principal plane. It coincides with the principal ray, but is only defined up to a non-zero scale factor. The normal vector is the last row of M. Recall definition 3.5.1, where a camera is said to be normalized if |M| = 1. Multiplication of the last row with a scale factor would mean that the whole determinant is multiplied by the same scale factor:

\begin{vmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ \mu m_{31} & \mu m_{32} & \mu m_{33} \end{vmatrix} = \mu \begin{vmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{vmatrix}

Thus, the sign of the normal vector is determined by the value of |M|. The proposition above says that, for a normalized camera (or any camera that can be normalized with a positive scale factor), all 3D points that project to [u \; v \; t]^T with t > 0 lie in front of that camera. The proposition is only valid for cameras with |M| > 0. We now define a function for which this assumption does not have to be fulfilled.

Definition. Let X = [X \; Y \; Z \; W]^T be a 3D point and u = [u \; v \; t]^T = P X its projection onto the pixel plane of camera P = [M \; m]. The cheirality of X and P is defined as

\chi(X, P) = |M|^{1/3} \, W / t    (3.27)

Note that scaling of X, u or P does not change the value of \chi. A multiplication of X or P with a scalar would multiply u (and t) with the same scalar. If P and X are normalized, then |M| and W are equal to one and \chi(X, P) = 1/t \doteq t. Thus, according to the proposition above, we obtain the following.

Proposition. A 3D point X lies in front of a camera P iff \chi(X, P) > 0.

It is also true that \chi is negative for a 3D point behind a camera, zero for a 3D point on the plane at infinity, and infinite for a 3D point on the principal plane.

3.6 Multiview relations

A single projection of a scene does not contain any 3D information. Neither do several independent projections. But if we know that the images are projections of the same scene, and which pixels are projections of the same 3D points, it is possible to triangulate and determine the positions of the 3D points. The first part of most image modeling and rendering algorithms relates the images. That is, given a number of projections of a static scene, we want to group pixels according to what 3D points they belong to. Pixels that are projections of the same 3D point are said to be corresponding (matching) pixels. When finding corresponding pixels, we use constraints that restrict how the position of matching pixels can move from one image to another. There are different constraints, depending on the number of images. Usually, the images are related in groups of two, three or four images.

3.6.1 Homographies

The ultimate goal, when relating images, is to find a function that uniquely relates every pixel in every image. Such information is encoded in the homography between the images. Here, we will use homographies as point-to-point transformations P^2 → P^2 from one plane to another. A homography mapping points {u_{1i}..u_{Ni}} in plane \Pi_i to points {u_{1j}..u_{Nj}} in plane \Pi_j is a 3x3 matrix with {u_{1j}..u_{Nj}} = H_{ij} {u_{1i}..u_{Ni}}. If the mapping is done through an intermediate plane \Pi, first from \Pi_i to \Pi and then from \Pi to \Pi_j, the homography is written as H_{\Pi j} H_{\Pi i}^{-1} = H^\Pi_{ij}. If \Pi is a special plane, for example the plane at infinity \Pi_\infty, the homography is represented by H^\infty_{ij}.

Theorem. Let X be a 3D point in the plane \Pi_0 = [0 \; 0 \; 0 \; 1]^T, corresponding to the plane at infinity \Pi_\infty in affine, metric and Euclidean space. X projects onto the pixel planes \Pi_i and \Pi_j of the two cameras P_i and P_j, so that u_i \sim P_i X and u_j \sim P_j X.

The relation between u_i and u_j can be described by u_j \sim H^0_{ij} u_i, where H^0_{ij} is the homography defined by

H^0_{ij} \sim K_j R_{ij}^T K_i^{-1}, \qquad R_{ij} = R_i^T R_j    (3.28)

Proof. Let \Pi = [\pi^T \; 1]^T be an arbitrary plane and X = [x^T \; 1]^T. X belongs to \Pi iff \Pi^T X = 0. Since \Pi = [\pi^T \; 1]^T, we can write \Pi^T X = \pi^T x + 1 = 0. Thus,

X = \begin{bmatrix} x \\ 1 \end{bmatrix} = \begin{bmatrix} x \\ -\pi^T x \end{bmatrix} = \begin{bmatrix} I_{3\times 3} \\ -\pi^T \end{bmatrix} x

Let P_i = [M_i \; m_i]. Then

u_i \sim P_i X = [M_i \; m_i] \begin{bmatrix} I_{3\times 3} \\ -\pi^T \end{bmatrix} x = (M_i - m_i \pi^T) x

This means that the homography between points in plane \Pi and in plane \Pi_i can be written

H_{\Pi i} \sim (M_i - m_i \pi^T)    (3.29)

The homography between plane \Pi_i and plane \Pi_j can be seen as relating the two planes through a third plane \Pi. Thus,

H^\Pi_{ij} \sim H_{\Pi j} H_{\Pi i}^{-1}    (3.30)

Let \Pi = \Pi_0 = [0 \; 0 \; 0 \; 1]^T. This corresponds to the canonical form of the plane at infinity in affine, metric and Euclidean space (see section 3.4.3). Then

H^0_{ij} \sim M_j M_i^{-1} = K_j R_j^T (K_i R_i^T)^{-1} = K_j R_{ij}^T K_i^{-1}    (3.31)

where R_{ij} = R_i^T R_j.

In a projective reconstruction, it is common to assume the first camera to be P_1 = [I \; 0] (see section 3.6.4). It is interesting to note that for all homographies H_{\Pi 1}, equation (3.29) says that H_{\Pi 1} = I_{3\times 3}. Further, equation (3.30) yields

P_1 = [I \; 0] \;\Rightarrow\; H^\Pi_{1i} = H_{\Pi i}    (3.32)

3.6.2 Two views

Calculating the homography between two images is an ill-posed problem and can rarely be solved. However, it is possible to make use of the fact that corresponding pixels cannot move arbitrarily from one image to another. This puts a constraint on the correspondences. For two views, the constraint is called the fundamental matrix. How the fundamental matrix is related to homographies will be shown in section 3.6.4. First, a geometrical interpretation will be given.

Figure 3.11: A back-projected ray, defining an epipolar line.

Consider a point u_1 in the pixel plane of a camera P_1. The point u_1 could be the projection of any 3D point along the ray L_1, which intersects both u_1 and the camera center C_1. Now, introduce a second camera P_2. If u_1 has a corresponding point in the pixel plane of camera P_2, it must be the projection of the same 3D point. Since there is no 3D information at this stage, all we know is that the 3D point must lie somewhere along the ray L_1. The projection of all possible points along L_1 forms a line in the pixel plane of P_2. This line is called the epipolar line of u_1. Thus, we know that any potential corresponding point to u_1 must lie somewhere along the epipolar line of u_1 in camera P_2 (see figure 3.11).

Figure 3.12: A plane, defining an epipolar line.

The epipolar line of u_1 can also be seen as the intersection between the image plane of camera P_2 and the plane that contains u_1, C_1 and C_2 (the camera centers of P_1 and P_2, see figure 3.12). Note that u_1 is considered to be both a 2D point in a plane and a 3D point defined by the coordinates of the plane.

Figure 3.13: Planes defining a set of epipolar lines that intersect in the epipole.

What happens if u_1 lies along the line that connects the two camera centers C_1 and C_2? What we get is not a unique plane, but a set of planes that all contain the line between the camera centers. The intersections between this set of planes and the image planes of P_1 and P_2 form a pencil of lines in both image planes. In each image plane, these lines have one common point, where they all intersect. This is the same point as where the line connecting the camera centers intersects the camera planes. It is called the epipole. The epipole e_{ij}, in the pixel plane of camera P_j, is the projection of camera center C_i into camera P_j (see figure 3.13):

e_{ij} = P_j C_i    (3.33)

Suppose there exists a 3x3 matrix F_{ij} so that, given a 2D point u_i in the pixel plane of P_i, the epipolar line in the pixel plane of P_j can be written as

l^{epi}_j = F_{ij} u_i

If a point u_j is a projection of the same 3D point as u_i, then u_j must lie on the epipolar line l^{epi}_j, and u_j^T l^{epi}_j = 0 must be satisfied. Thus,

u_j^T F_{ij} u_i = 0    (3.34)

Equation (3.34) is called the epipolar constraint and F_{ij} is the fundamental matrix of views i and j. If we take the transpose of equation (3.34), we get

(u_j^T F_{ij} u_i)^T = u_i^T F_{ij}^T u_j = 0

Thus, if F_{ij} is the fundamental matrix for the pair of views (i, j), then F_{ij}^T is the fundamental matrix for the pair in the opposite order (j, i).

Since all the epipolar lines in view j intersect in the epipole, the equation (e_{ij}^T F_{ij}) u_i = 0 must be satisfied for all u_i. It follows that e_{ij}^T F_{ij} = 0, i.e. F_{ij}^T e_{ij} = 0, so e_{ij} must be the null space of F_{ij}^T. Similarly, e_{ji} is the null space of F_{ij}:

(e_{ij}^T F_{ij}) u_i = 0 \;\; \forall u_i \quad \Rightarrow \quad F_{ij}^T e_{ij} = 0
u_j^T (F_{ij} e_{ji}) = 0 \;\; \forall u_j \quad \Rightarrow \quad F_{ij} e_{ji} = 0

Since all the epipolar lines intersect in the epipole, F_{ij} cannot be of full rank and the determinant must be zero:

|F_{ij}| = 0    (3.35)

Calculation of the fundamental matrix will be described in chapter 5.

3.6.3 More views

Constraints also exist for three and four views. It has been shown that no further constraints can be defined for more than four views. The trifocal constraint relates points and lines in three views. It is relatively well studied, but will only be briefly discussed here. The quadrifocal constraint relates four views. The quadrifocal constraint is a new concept and will not be discussed here.

Trifocal constraint

Consider three cameras P_1, P_2 and P_3. The cameras project a particular 3D point to the pixel planes as u_1, u_2 and u_3, respectively. Section 3.6.2 showed that, if u_2 and u_3 correspond to the same 3D point as u_1, they must lie on the epipolar lines of u_1 in the pixel planes of P_2 and P_3. If u_2 is fixed, then view 3 contains two epipolar lines, one resulting from u_1 and one from u_2. The corresponding point u_3 in view 3 is then uniquely determined by the intersection of the two epipolar lines (see figure 3.14). This does not apply if the 3D point is located in the trifocal plane, defined as the plane containing the three camera centers C_1, C_2 and C_3.

A line l_1 in view 1 corresponds to the projection of a line lying in the plane defined by l_1 and C_1. If the line is present in view 2, then the two planes of back-projected lines intersect in a unique line. This line can be projected onto the image plane of camera 3. In other words, two corresponding lines in views 1 and 2 define a corresponding line in view 3 (see figure 3.15). All the constraints discussed here can be expressed in the form of a tensor called the trilinear tensor.

Figure 3.14: Points and the trifocal constraint.

Figure 3.15: Lines and the trifocal constraint.

3.6.4 Projection matrices and the fundamental matrix

If the point correspondences and the cameras were fully known, reconstruction could be achieved by simple triangulation. But usually, only the correspondences for a set of points are known, often containing mismatches.

Section 3.6.2 showed that the fundamental matrix can be recovered from a set of point correspondences, using equations (3.34) and (3.35). The question is: can the cameras be recovered, knowing the correspondences for a set of points and the fundamental matrices between pairs of views? The answer is that the cameras can only be determined up to a projective transformation. A reconstruction made from cameras estimated using only the fundamental matrices would end up in the projective stratum described in section 3.4.2.

The theorem in section 3.6.1 showed that

H^0_{ij} \sim K_j R_{ij}^T K_i^{-1}, \qquad R_{ij} = R_i^T R_j

It is always possible to apply a projective transform so that one of the cameras can be written as $P_i = [\,I_{3\times3} \mid 0\,]$. This corresponds to a camera with $K R^T$ equal to $I_{3\times3}$, centered at the origin. For such a camera, equation (3.31) shows that $H^0_{ij} = K_j R_j^T$. Since $P_j = [\,M_j \mid m_j\,] = K_j [\,R_j^T \mid -R_j^T C_j\,]$, camera $P_j$ can be written as

\[ P_j \sim [\, H^0_{ij} \mid m_j \,] \quad (3.36) \]

The epipole $e_{ij}$ in view $j$ is the projection of the camera center $C_i$ (see equation (3.33)). Since $C_i$ is at the origin, its homogeneous coordinates are $[\,0\ 0\ 0\ 1\,]^T$, and

\[ e_{ij} = P_j C_i = [\,M_j \mid m_j\,] \begin{bmatrix} 0\\0\\0\\1 \end{bmatrix} = m_j \]

Put into equation (3.36), we get

\[ P_j \sim [\, H^0_{ij} \mid e_{ij} \,] \quad (3.37) \]

From equations (3.29) and (3.32), we see that

\[ H^\Pi_{ij} \sim M_j - m_j \pi^T \quad (3.38) \]

which, using $M_j = H^0_{ij}$ and $m_j = e_{ij}$, can be written

\[ H^\Pi_{ij} \sim H^0_{ij} - e_{ij}\pi^T \quad (3.39) \]

Now, equation (3.37) becomes

\[ P_j \sim [\, H^\Pi_{ij} + e_{ij}\pi^T \mid e_{ij} \,] \quad (3.40) \]

Theorem. Given a fundamental matrix $F_{ij}$, a homography between views $i$ and $j$, induced by some plane $\Pi$, can be written as $H^\Pi_{ij} \sim [e_{ij}]_\times F_{ij}$.

Proof. Take an arbitrary plane $\Pi$ that intersects the image plane $\Pi_j$ of camera $P_j$ in a line $l_j$. A point $X$ in the plane $\Pi$ projects onto the image plane $\Pi_i$ of camera $P_i$ as $u_i \sim P_i X$. The epipolar line in view $j$, corresponding to the 2D point $u_i$ in view $i$, is given by $l^{epi}_j \sim F_{ij} u_i$. If $u_j$ is the projection of $X$ in view $j$, then $u_j$ must lie at the intersection of the lines $l_j$ and $l^{epi}_j$:

\[ u_j \sim l_j \times l^{epi}_j \]

Substituting $l^{epi}_j$ with $F_{ij} u_i$ and $u_j$ with $H^\Pi_{ij} u_i$ results in

\[ H^\Pi_{ij} u_i \sim [l_j]_\times F_{ij} u_i \;\Rightarrow\; H^\Pi_{ij} \sim [l_j]_\times F_{ij} \]

Since we do not want $l_j$ and $l^{epi}_j$ to coincide, we choose $l_j$ so that it does not pass through the epipole $e_{ij}$. Since $e_{ij}^T e_{ij} \neq 0$ for any $e_{ij}$ with at least one non-zero element, we can choose $l_j \sim e_{ij}$, which implies $l_j^T e_{ij} \neq 0$. This results in

\[ H^\Pi_{ij} \sim [e_{ij}]_\times F_{ij} \quad (3.41) \]

Figure 3.16: Homography in relation to the fundamental matrix.

Finally, knowing that $P_i = [\,I \mid 0\,]$ and using equations (3.40) and (3.41), we can express camera $P_j$ in terms of $F_{ij}$ as

\[ P_i = [\,I \mid 0\,] \]
\[ P_j \sim [\, [e_{ij}]_\times F_{ij} + e_{ij}\pi^T \mid e_{ij} \,] \quad (3.42) \]

Note that both $P_j$ and $F_{ij}$ are defined only up to an unknown scale factor.
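As a concrete illustration of equations (3.33), (3.41) and (3.42), the sketch below recovers the epipole as the left null vector of $F$ and assembles a canonical projective camera pair. It is a minimal NumPy sketch written for this report's conventions; the function names are ours, and the plane vector $\pi$ is simply set to zero, which is one valid choice of the free projective parameters.

```python
import numpy as np

def epipole_left(F):
    """Epipole e_ij in the second image: the vector with e^T F = 0,
    i.e. the left null vector of F, taken from the SVD."""
    U, _, _ = np.linalg.svd(F)
    e = U[:, -1]                       # column of U for the smallest singular value
    return e / e[-1] if abs(e[-1]) > 1e-12 else e

def skew(v):
    """Cross-product matrix [v]_x such that [v]_x a = v x a."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def canonical_cameras(F, pi=np.zeros(3)):
    """Projective camera pair consistent with F (eq. 3.42):
    P_i = [I | 0],  P_j ~ [[e]_x F + e pi^T | e]."""
    e = epipole_left(F)
    P_i = np.hstack([np.eye(3), np.zeros((3, 1))])
    P_j = np.hstack([skew(e) @ F + np.outer(e, pi), e.reshape(3, 1)])
    return P_i, P_j
```

Any other choice of $\pi$ (and of the overall scale) gives another member of the same projective equivalence class of reconstructions.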

Chapter 4

Extracting and matching feature points

4.1 Introduction

A 3D reconstruction algorithm has to deal with two major difficulties. Matching feature points is one of them; self-calibration is the other (see chapter 7). A feature is defined by a pixel and its surrounding area. Two features are equal if and only if the pixels and their surrounding areas have identical values. Saying that two features match (correspond) is the same as saying that they are projections of the same 3D point.

The ultimate goal of feature matching is to match every pixel in every image. That is, however, often not possible because of the aperture problem (see below). Instead, we choose a small subset of features and try to match them using assumptions about how the pictures were taken.

Aperture problem

The aperture problem arises when an image contains similar features. It is not possible to find a match if there is no way of differentiating between features. This is often the case with man-made objects. Houses, for example, usually have similar windows, and it can be difficult to tell one window from another. There are often uniform areas between the windows, so features within those areas are also difficult to separate. The ideal situation would be that each feature is unique within the image and identical to exactly one feature in each of the images where its corresponding 3D point is visible.

4.2 Method

4.2.1 Feature extraction

Knowing that there are good and bad features, we want a method that extracts only the good ones. A good feature is unique within the image and resistant to translation, rotation and scaling.

We used the Harris and Stephens corner detector described in [1]. It detects features with high derivatives in orthogonal directions, making it robust against rotation and translation. The first step consists of calculating an image-sized matrix $R$:

\[ R(u, v) = \det(M^{uv}) - k\,\big(M^{uv}_{11} + M^{uv}_{22}\big)^2 \]

$M^{uv}$ is a $2 \times 2$ matrix, calculated for each pixel:

\[ M^{uv} = \begin{bmatrix} \overline{\left(\frac{\partial I^{uv}}{\partial u}\right)^{2}} & \overline{\frac{\partial I^{uv}}{\partial u}\frac{\partial I^{uv}}{\partial v}} \\[4pt] \overline{\frac{\partial I^{uv}}{\partial u}\frac{\partial I^{uv}}{\partial v}} & \overline{\left(\frac{\partial I^{uv}}{\partial v}\right)^{2}} \end{bmatrix} \]

where the bars indicate Gaussian filtering. Both the derivation kernel and the Gaussian kernel are parameters of the algorithm.

Figure 4.1: Part of an image and its corresponding R-matrix.

A surface is then fitted to $R$ and corners are taken to be maxima of that surface. The surface fitting is done to achieve sub-pixel resolution. It is usually a good idea to use a threshold and only pick the features with the highest maxima in $R$. Also, features tend to end up in clusters if the image contains an area with high derivatives, so it is better to extract features that cover a large area.

4.2.2 Matching

The feature extraction described in section 4.2.1 results in a set of features for each image, with sub-pixel accuracy. Now it is time to relate the images by finding matches between the feature sets. Often a feature is only visible in one image. It is then impossible to find a match, and such features must be eliminated.

Our method assumes that features do not move too far between the images. A feature in one image is only compared to features in the other images that lie

within a radius from the pixel position in the first image. This is done to reduce the complexity. Another assumption is that features in a small neighborhood move in a similar way. If $u$ in image 1 and $u'$ in image 2 are considered to be a match, then we can define a displacement vector that runs from the pixel position of $u$ to the pixel position of $u'$. All the neighboring displacement vectors must have similar length and orientation (see figure 4.2).

Figure 4.2: Example showing several displacement vectors.

Correlation

The first step of matching consists of correlating the features. This can be quite time consuming. Therefore, features are only correlated within a radius, preferably the same radius used later in the matching process. This is the only part of the relating process that uses image data. The correlation $c$, between feature $u = [\,u_x\ u_y\ 1\,]^T$ in image $I$ and feature $u' = [\,u'_x\ u'_y\ 1\,]^T$ in image $I'$, can be written as

\[ c = \sum_{k=-K/2}^{K/2} \sum_{l=-L/2}^{L/2} \left( \frac{I(k+u_x,\, l+u_y) - \bar{I}_c}{\sigma_{I_c}} \right) \left( \frac{I'(k+u'_x,\, l+u'_y) - \bar{I}'_c}{\sigma_{I'_c}} \right) \]

where $K$ and $L$ give the size of the correlation window, $\bar{I}_c$ and $\bar{I}'_c$ are the mean values of the correlation windows, and $\sigma_{I_c}$ and $\sigma_{I'_c}$ are their standard deviations. This can be done both for grey-scale images and for color images, correlating each color layer separately.

Strength of match

The correlation only gives an estimate of how features should be matched. In order to improve the result, it is necessary to use the second assumption, saying

that feature displacement vectors are similar in a small neighborhood. Using different weight functions, a strength of match is calculated for the most plausible combinations. The features with the highest strength of match are said to be corresponding features.

Figure 4.3: Strength of match.

1. For each feature $u$ in image $I_i$...

   (a) Select the $K$ features $\{u'_1 .. u'_K\}$ within a radius in image $I_j$ with the highest correlation to $u$. For each possible match $u \leftrightarrow u'_k$, $k = 1..K$...

      i. In each image, select all neighbors $\{u_1 .. u_M\}$ and $\{u'_1 .. u'_N\}$ within a radius from $u$ and $u'_k$ respectively. For each neighbor $u_m$ in image $I_i$, $m = 1..M$...

         A. Calculate a weight value $w_n = c_{mn} \lambda_n$ for each feature pair $u_m \leftrightarrow u'_n$, $n = 1..N$, where $c_{mn}$ is the correlation between features $u_m$ and $u'_n$. $\lambda_n$ is a value between 0 and 1, based on the direction and length of the displacement vector $(u_m u'_n)$ compared to $(u\, u'_k)$, weighted with the distance between $u$ and $u_m$. See figure 4.4 for more information.

      ii. The feature pair $u_m \leftrightarrow u'_n$ with the highest weight value $w_m^{max} = \max(\{w_1 .. w_N\})$ is considered to be the most plausible match for the neighboring feature $u_m$.

   (b) The strength of match for the feature pair $u \leftrightarrow u'_k$ is the sum of weight

values for all the plausible matches of neighbors to $u$:

\[ SM_k = c_k \sum_{m=1}^{M} w_m^{max} \quad (4.1) \]

The scalar $c_k$ is the correlation between features $u$ and $u'_k$.

   (c) Feature $u$ in image $I_i$ is considered to be a possible match with the feature $u'$ in image $I_j$ that shows the highest strength of match $SM = \max(\{SM_1 .. SM_K\})$. The match $u \leftrightarrow u'$ is added to a list of possible matches.

2. Final matches are selected from the list of possible matches. If two or more possible matches share a common feature, the match with the highest strength of match is kept and the others are removed from the list.

Figure 4.4: Weight functions used in the matching procedure. The displacement vector is represented by m and norm(m) represents the length of m.
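The normalized correlation used in the first matching step can be sketched as below. This is a minimal NumPy illustration under our own assumptions: grey-scale images stored as 2D arrays, integer feature positions away from the image border, and a square correlation window; it is not the exact implementation used in this work.

```python
import numpy as np

def normalized_correlation(I1, I2, u, up, half=5):
    """Correlation c between feature u in image I1 and feature u' (up) in
    image I2, using a (2*half+1)^2 window around each feature position."""
    x, y = int(u[0]), int(u[1])          # (u_x, u_y) in image I1
    xp, yp = int(up[0]), int(up[1])      # (u'_x, u'_y) in image I2
    W1 = I1[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    W2 = I2[yp - half:yp + half + 1, xp - half:xp + half + 1].astype(float)
    W1 = (W1 - W1.mean()) / (W1.std() + 1e-12)   # subtract mean, divide by std
    W2 = (W2 - W2.mean()) / (W2.std() + 1e-12)
    return float(np.sum(W1 * W2))
```

In practice this function is only evaluated for candidate pairs within the search radius, which keeps the cost of the correlation step manageable.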

4.3 Results

The results here are based on the wolf-sequence, presented in chapter 5. The sequence consists of two pictures, taken with only a small translation along the x-axis and almost no rotation. The ideal matching result would therefore be purely horizontal displacement vectors. However, the sequence is severely affected by the aperture problem described in section 4.1.

Feature points were extracted using the Harris and Stephens corner detector. A Gaussian filter with $\sigma = 1$ and kernel size 9 was used in the feature extraction step. Features were correlated and matched within a radius of 75 pixels, and 10 neighbors were used to calculate the strength of match.

Figure 4.5: Views a and b are input to the algorithm. Views c and d are the results after feature extraction. View e shows the matches and their displacement vectors, from image a to image b.
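For reference, the Harris and Stephens response of section 4.2.1 can be computed along the following lines. This is a schematic sketch using SciPy's Gaussian filtering with the $\sigma$ quoted above; the constant $k$ and the simple derivative kernel are our own assumptions, and the sub-pixel surface fitting and thresholding steps are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(I, sigma=1.0, k=0.04):
    """Corner response R = det(M) - k*(M11 + M22)^2 at every pixel, where M is
    the Gaussian-filtered matrix of image-derivative products (section 4.2.1)."""
    I = I.astype(float)
    Iv, Iu = np.gradient(I)                  # derivatives along rows (v) and columns (u)
    M11 = gaussian_filter(Iu * Iu, sigma)    # the bars in the text denote Gaussian filtering
    M22 = gaussian_filter(Iv * Iv, sigma)
    M12 = gaussian_filter(Iu * Iv, sigma)
    return M11 * M22 - M12 ** 2 - k * (M11 + M22) ** 2
```

Local maxima of this response, refined by the surface fitting described earlier, give the sub-pixel feature positions used in the matching step.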

Chapter 5

Estimating the Fundamental matrix

5.1 Introduction

Figure 5.1: The epipolar configuration of two image planes.

The fundamental matrix is a key concept in computer vision when working with uncalibrated images. It encodes the epipolar geometry associated with the camera motion, which may be used for recovering projective structure or for rectification of image pairs.

The fundamental matrix describes the epipolar configuration between two images, i.e. the configuration of epipolar lines and epipoles. Suppose that $X$ is a point in 3D space and that $x_i$ and $x_j$ are the projections of that point onto images $i$ and $j$. These three points define a plane. The intersections between this plane and the image planes define two lines, called epipolar lines. Now consider a new 3D point and its projections. These three points, the 3D point and its projections, define a new plane and two new epipolar lines, one in each image. Any 3D

point together with two projection matrices will in the same way define a plane. Since all these planes must pass through the optical centers of the cameras, the intersections of the planes with the image planes meet in two points, one in each image: the epipoles. This configuration of epipolar lines and epipoles is encoded in the fundamental matrix. The theoretical background is described in chapter 3.

The fundamental matrix has proven complicated to estimate, since enforcing the constraint $\det(F) = 0$ means that a set of non-linear equations has to be solved. This chapter describes the method we used to estimate the fundamental matrix. It begins with some linear methods that are easy to understand and continues with a non-linear approach. After that, the robust RANSAC method is described. Finally, some results obtained with this algorithm are presented at the end of the chapter.

5.2 Linear methods for estimating the fundamental matrix

Let $u_{ki}$ be the projection of the 3D point $X_k$ by camera $P_i$ in image $i$ and $u_{kj}$ the projection by camera $P_j$ in image $j$. Then the fundamental matrix $F_{ij}$ describes the epipolar configuration between image $i$ and image $j$. $F_{ij}$ is defined by the equation

\[ u_{kj}^T F_{ij} u_{ki} = 0 \quad (5.1) \]

for any pair of corresponding points $u_{ki} \leftrightarrow u_{kj}$ in the two images. Given a sufficient number of known correspondences, it is possible to compute the unknown matrix $F_{ij}$. Writing the known matches in homogeneous coordinates, $u_{ki} = [\,u_{ki}\ v_{ki}\ 1\,]^T$ and $u_{kj} = [\,u_{kj}\ v_{kj}\ 1\,]^T$, equation (5.1) can be rewritten in the unknown entries of $F$ in terms of the known coordinates as

\[ u_{kj} u_{ki} f_{11} + u_{kj} v_{ki} f_{12} + u_{kj} f_{13} + v_{kj} u_{ki} f_{21} + v_{kj} v_{ki} f_{22} + v_{kj} f_{23} + u_{ki} f_{31} + v_{ki} f_{32} + f_{33} = 0 \]

where

\[ F = \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix} \]

Let $f$ be the vector containing the unknown entries of $F$ in row-major order. Then equation (5.1) can be expressed as a vector inner product

\[ [\,u_{kj} u_{ki}\ \ u_{kj} v_{ki}\ \ u_{kj}\ \ v_{kj} u_{ki}\ \ v_{kj} v_{ki}\ \ v_{kj}\ \ u_{ki}\ \ v_{ki}\ \ 1\,]\, f = 0 \]

A linear equation system is obtained from a set of $n$ correspondences, and the system can be written in matrix form as

\[ A f = 0 \]
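As a concrete sketch of this linear step, the snippet below stacks one row of $A$ per correspondence and takes the right singular vector of $A$ belonging to the smallest singular value as the solution for $f$ (the homogeneous least-squares solution discussed next). It is a minimal NumPy illustration under our own naming; coordinate normalization, which is commonly used to improve the conditioning of $A$, is omitted for brevity, and the rank-2 constraint is not enforced here.

```python
import numpy as np

def estimate_F_linear(ui, uj):
    """Linear (eight-point style) estimate of F from corresponding points.
    ui, uj: (n, 2) arrays of pixel coordinates in image i and image j,
    satisfying u_j^T F u_i = 0. Returns a 3x3 matrix defined up to scale."""
    n = ui.shape[0]
    A = np.zeros((n, 9))
    for k in range(n):
        x, y = ui[k]            # u_ki, v_ki
        xp, yp = uj[k]          # u_kj, v_kj
        A[k] = [xp * x, xp * y, xp, yp * x, yp * y, yp, x, y, 1.0]
    # least-squares solution of A f = 0: right singular vector of the
    # smallest singular value
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    return F / np.linalg.norm(F)       # note: det(F) = 0 is NOT enforced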

Since the equation system is written in homogeneous coordinates, $F$ can only be determined up to scale. This means that, for an exact solution to exist, $A$ must not have rank higher than 8. Since noise is always present in measured correspondences, the data is never exact and $A$ typically has rank 9. Therefore the equation system is solved in the least-squares sense. Solving by least squares is equivalent to minimizing the algebraic residuals

\[ \min_F \sum_k r_k^2 = \sum_k \left( u_{kj}^T F_{ij} u_{ki} \right)^2 \quad (5.2) \]

The method of estimating the fundamental matrix described here is called the eight-point algorithm. It is fast, but a major drawback is that the $\det(F) = 0$ constraint is not enforced. That is an important property of the fundamental matrix, so we have to come up with something better.

The minimum case - seven correspondences

Suppose that we use the method described above, but with only seven matches. The rank of $A$ is then seven. From linear algebra it is known that the dimension of the solution space of the equation system $Ax = 0$, where $A$ has $m$ linearly independent rows and $n$ columns, is $n - m$. In our case we have nine columns and seven rows, so the solution space is two-dimensional. If $f_1$ and $f_2$ are generators of the null-space, all solutions can, up to scale, be written as $f = \alpha f_1 + (1-\alpha) f_2$, where $\alpha$ is a scalar. $F_1$ and $F_2$ are obtained directly from $f_1$ and $f_2$, and $F$ is given as

\[ F = \alpha F_1 + (1-\alpha) F_2 \]

The constraint $\det(F) = 0$ may now be enforced as

\[ \det(F) = \det\big(\alpha F_1 + (1-\alpha) F_2\big) = 0 \]

which leads to a cubic polynomial in $\alpha$. This polynomial has either one or three real solutions (complex solutions are discarded). Therefore one or three possible solutions for $F$ are retrieved with this method. If seven exact matches could be found, this linear method would give us the true fundamental matrix. RANSAC is an iterative method where seven matches are chosen randomly and, among all estimated $F$, the one that fits best to all correspondences is kept. More about this method in section 5.4.

5.3 Non-linear minimization

The gradient criterion

The eight-point method that we have already discussed minimizes the algebraic residuals. This is not a statistically optimal estimate of $F$, because the distribution of the residuals is not being considered. If the residuals were Gaussian distributed, all with the same variance, equation (5.2) would lead to an optimal estimate of $F$. But the residuals have different variances and are not Gaussian distributed, so it is not an optimal estimate. Sampson showed that, assuming the measured correspondences are Gaussian distributed, these residuals can be weighted so that an optimal estimate can be calculated. He

also showed that the weights should be chosen so that the contribution of each term to the total criterion is inversely proportional to its variance:

\[ \min_F \sum_k r_k^2 / \sigma_{r_k}^2 = \sum_k \frac{(u_{kj}^T F_{ij} u_{ki})^2}{\sigma_{r_k}^2} \quad (5.3) \]

Compare this to equation (5.2). We will now deduce the expression for $\sigma_{r_k}^2$. Since our correspondences are measured in different images, $u_{ki}$ and $u_{kj}$ are uncorrelated. Therefore, the classical assumption that their covariance is isotropic and uniform is made:

\[ \mathrm{Cov}(u_{ki}) = \mathrm{Cov}(u_{kj}) = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix} \]

The variance of $r_k$, seen as a function of the points $u_{ki}$ and $u_{kj}$, can then be written as

\[ \sigma_{r_k}^2 = \nabla r_k^T \begin{bmatrix} \mathrm{Cov}(u_{ki}) & 0 \\ 0 & \mathrm{Cov}(u_{kj}) \end{bmatrix} \nabla r_k = \sigma^2 \, \| \nabla r_k \|^2 \]

$\nabla r_k$ denotes the gradient of $r_k$ with respect to the four-dimensional vector $[\,u_{ki}\ v_{ki}\ u_{kj}\ v_{kj}\,]^T$ (the entries of $u_{ki}$ and $u_{kj}$). Thus, the gradient is

\[ \nabla r_k = \left[\, (F_{ij}^T)_1 u_{kj}\ \ \ (F_{ij}^T)_2 u_{kj}\ \ \ (F_{ij})_1 u_{ki}\ \ \ (F_{ij})_2 u_{ki} \,\right]^T \]

where $A_i$ means the $i$:th row of matrix $A$. This, together with equation (5.3), gives the following criterion:

\[ \min_F \sum_k r_k^2 / \sigma_{r_k}^2 = \sum_k \frac{(u_{kj}^T F_{ij} u_{ki})^2}{((F_{ij})_1 u_{ki})^2 + ((F_{ij})_2 u_{ki})^2 + ((F_{ij}^T)_1 u_{kj})^2 + ((F_{ij}^T)_2 u_{kj})^2} \quad (5.4) \]

Under the assumption that $u_{ki}$ and $u_{kj}$ are Gaussian distributed, this is an optimal criterion for estimating the fundamental matrix.

The distance to epipolar lines

Luong and Faugeras [9] examined Sampson's weighting function $1/\sigma_{r_k}^2$ and suggested that marginally better results can be obtained by using the distance of a point to its epipolar line as the error to be minimized. If $F_{ij}$ describes the epipolar relationship from image $i$ to image $j$, then $F_{ij}^T$ describes the relation from image $j$ to image $i$. This yields the following criterion, minimizing the sum of distances from all correspondences to their respective epipolar lines:

\[ \min_F \sum_k \mathrm{dist}(u_{kj}, F_{ij} u_{ki})^2 + \mathrm{dist}(u_{ki}, F_{ij}^T u_{kj})^2 \quad (5.5) \]

The squared distance $\mathrm{dist}(p_1, l_1)^2$ from a point $p_1 = [\,x_1\ y_1\ 1\,]^T$ to a line $l_1 = [\,l_1\ l_2\ l_3\,]^T$ is

\[ \mathrm{dist}(p_1, l_1)^2 = \frac{(l_1 x_1 + l_2 y_1 + l_3)^2}{l_1^2 + l_2^2} \]

Figure 5.2: Luong and Faugeras suggested minimizing the distances between points and their epipolar lines as a cost criterion when estimating the fundamental matrix.

This, and the fact that $u_{kj}^T F_{ij} u_{ki} = u_{ki}^T F_{ij}^T u_{kj}$, means that expression (5.5) can be written

\[ \min_F \sum_k \frac{(u_{kj}^T F_{ij} u_{ki})^2}{((F_{ij})_1 u_{ki})^2 + ((F_{ij})_2 u_{ki})^2} + \frac{(u_{kj}^T F_{ij} u_{ki})^2}{((F_{ij}^T)_1 u_{kj})^2 + ((F_{ij}^T)_2 u_{kj})^2} \quad (5.6) \]

Note the similarity between this criterion and the first one, derived from Sampson's theory.

The methods described here for estimating the fundamental matrix result in non-linear minimization problems. To solve these kinds of problems, iterative non-linear minimization algorithms have to be used. We have been using the Levenberg-Marquardt algorithm to minimize this function. The theory and implementation details of this and many other useful algorithms can be found in Numerical Recipes in C [14].

Iterative methods need an initial estimate to start from. To avoid getting stuck at a local minimum, it is important that the initial estimate is close to the correct solution. It is possible to use any of the linear methods to find the initial estimate for the minimization procedure. A very robust method for computing the fundamental matrix is the RANSAC method, which will be described in the next section. A good strategy is to use RANSAC to find an initial estimate and then refine $F$ using either the gradient minimization criterion or the method suggested by Luong and Faugeras (the distance to epipolar lines).
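The cost of equation (5.6) is straightforward to evaluate; the sketch below computes the per-correspondence residuals (the two point-to-epipolar-line distances), in the form a Levenberg-Marquardt routine typically expects. It is a minimal NumPy illustration with our own function name; the rank-2 parameterization of $F$ discussed in the next section is not included here.

```python
import numpy as np

def epipolar_distance_residuals(F, ui, uj):
    """Residuals for criterion (5.6): for each correspondence, the distance
    from u_j to the epipolar line F u_i and from u_i to the line F^T u_j.
    ui, uj: (n, 3) arrays of homogeneous points; F: 3x3."""
    lj = (F @ ui.T).T              # epipolar lines in image j
    li = (F.T @ uj.T).T            # epipolar lines in image i
    num = np.abs(np.sum(uj * lj, axis=1))            # |u_j^T F u_i|
    d_j = num / np.sqrt(lj[:, 0]**2 + lj[:, 1]**2)   # distance in image j
    d_i = num / np.sqrt(li[:, 0]**2 + li[:, 1]**2)   # distance in image i
    return np.concatenate([d_j, d_i])

# The sum of squared residuals equals the cost in (5.6):
# cost = np.sum(epipolar_distance_residuals(F, ui, uj) ** 2)
```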

Parameterization

To enforce the rank-2 constraint during the non-linear minimization, a parameterization of the fundamental matrix is required (see [9]). One way of doing this is to let the third row of the fundamental matrix be linearly dependent on the first two:

\[ F_{ij} = \begin{bmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ \alpha a_1 + \beta a_4 & \alpha a_2 + \beta a_5 & \alpha a_3 + \beta a_6 \end{bmatrix} \]

This implies that $F_{ij}$ has rank 2 and that the epipole, spanning the null-space of $F_{ij}^T$, is proportional to $[\,\alpha\ \ \beta\ \ -1\,]^T$. This parameterization has a total of eight parameters. To reach the minimum of seven parameters, one of them should be normalized to 1. Using this parameterization, the constraint $\det(F) = 0$ is automatically fulfilled. There are many different ways to parameterize $F$, all with their own advantages and drawbacks. The parameterization above is the one we have been using, since it enforces the constraint $\det(F) = 0$.

5.4 RANSAC

With only seven exact matches it is possible to calculate the fundamental matrix (up to the one-or-three-solution ambiguity). This can be done linearly using the method described in section 5.2. The problem is that we cannot say for certain that we have seven correct matches. If seven correct matches could somehow be retrieved, an accurate estimate of the fundamental matrix could be calculated. If time were not an issue, the best approach would be an exhaustive search through our data set of correspondences, testing all possibilities and choosing the best solution. The problem with this method is that it becomes very time consuming even for quite small data sets, since the number of possible seven-point samples grows as $\binom{n}{7}$, where $n$ is the size of the data set.

A common and very successful way of reducing the complexity is RANSAC, which is short for Random Sample Consensus. This method can be applied to a number of different minimization problems and is useful when other minimization algorithms are likely to get stuck at local minima. Two things are needed when using the RANSAC method: a set of samples and a cost function. In our case the sample set is the correspondences $u_{ki} \leftrightarrow u_{kj}$ (one sample is seven point correspondences) and the cost function is the inverse of the percentage of inliers. An inlier is a corresponding pair of points where each point lies within a certain distance of its epipolar line. This is how the RANSAC method works: pick a random sample from the sample set and retrieve a cost value from the cost function. If this is the lowest cost so far, keep the estimate as the best one. Repeat this N times, or until an acceptable error is achieved. This is summarized in the algorithm below.

RANSAC

Repeat for N samples, or stop when a certain threshold for the cost value is reached. This threshold is set to what would be an acceptable error; in our case, when the percentage of inliers is larger than some $\eta$.

1. Pick a sample from the sample set, i.e. select a random sample of 7 correspondences from the set of correspondences.

2. Calculate a cost value for the cost function:

   (a) Compute the fundamental matrix $F$ as described in section 5.2. There will be one or three real solutions.

   (b) For all these solutions, compute the percentage of inliers (see next paragraph) consistent with $F$. Since the percentage of inliers is high for an accurate $F$, the percentage of inliers corresponds to an inverse cost function.

3. Pick the solution corresponding to the smallest cost value. Since we are using the inverse of a cost function, choose the $F$ with the largest percentage of inliers. In the case of ties, choose the one with the lowest standard deviation.

Algorithm 2: The RANSAC algorithm for robust estimation of the fundamental matrix.

Calculate the percentage of inliers

A corresponding pair of points is considered an outlier if either of the points has a distance to its epipolar line larger than some threshold. The squared distance $\mathrm{dist}(u, l)^2$ from a point $u = [\,u\ v\ 1\,]^T$ to its corresponding epipolar line $l = [\,l_1\ l_2\ l_3\,]^T$ is

\[ \mathrm{dist}(u, l)^2 = \frac{(l_1 u + l_2 v + l_3)^2}{l_1^2 + l_2^2} \]

In order to determine whether a point $u$ is an inlier, the following test is made:

\[ \text{Inlier:} \quad \mathrm{dist}(u, l) = \sqrt{\frac{(l_1 u + l_2 v + l_3)^2}{l_1^2 + l_2^2}} < u_\alpha \, \sigma \]
\[ \text{Outlier:} \quad \mathrm{dist}(u, l) = \sqrt{\frac{(l_1 u + l_2 v + l_3)^2}{l_1^2 + l_2^2}} > u_\alpha \, \sigma \]

where $\sigma$ is the standard deviation of the distance to the epipolar line and $u_\alpha$ is the corresponding quantile of the N(0, 1) distribution (e.g. $u_{0.95}$). The test must be true for both points in a pair if the pair is to be considered an inlier. If $u_i \leftrightarrow u_j$ is a corresponding pair and $l_i$, $l_j$ are the

Figure 5.3: A point is considered to be an inlier if the distance between the point and its epipolar line is less than the threshold.

corresponding epipolar lines, it is also possible to do a single test:

\[ \text{Inlier:} \quad \mathrm{dist}(u_i, l_i)^2 + \mathrm{dist}(u_j, l_j)^2 = \frac{(l_{i1} u_i + l_{i2} v_i + l_{i3})^2}{l_{i1}^2 + l_{i2}^2} + \frac{(l_{j1} u_j + l_{j2} v_j + l_{j3})^2}{l_{j1}^2 + l_{j2}^2} < x_\alpha \, \sigma^2 \]
\[ \text{Outlier:} \quad \mathrm{dist}(u_i, l_i)^2 + \mathrm{dist}(u_j, l_j)^2 = \frac{(l_{i1} u_i + l_{i2} v_i + l_{i3})^2}{l_{i1}^2 + l_{i2}^2} + \frac{(l_{j1} u_j + l_{j2} v_j + l_{j3})^2}{l_{j1}^2 + l_{j2}^2} > x_\alpha \, \sigma^2 \]

where $x_\alpha$ is calculated from the chi-square distribution. To compute the percentage of inliers, apply this test to all matches and calculate the percentage of true tests.

Robust determination of the standard deviation

This section describes how to estimate $\sigma$. The standard deviation is related to the characteristics of the image, the feature detector and the matching procedure. Often the value of $\sigma$ is unknown and must be estimated from the data. An estimate of $\sigma$ can be derived from the median squared error. It can be shown that

\[ \sqrt{\mathrm{med}_i\, \mathrm{dist}(u_i, l_i)^2} \,/\, \Phi^{-1}(0.75) \]

is an asymptotically consistent estimator of $\sigma$, if the distances are distributed as N(0, $\sigma^2$). $\Phi$ is the cumulative distribution function of the standard Gaussian. Noting that $1/\Phi^{-1}(0.75) \approx 1.4826$, we get the estimate

\[ \hat{\sigma} \approx 1.4826 \sqrt{\mathrm{med}_i\, \mathrm{dist}(u_i, l_i)^2} \]
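A compact sketch of the RANSAC loop above is given below, in NumPy and under our own assumptions: the chi-square threshold value 5.99 (the 95% quantile with two degrees of freedom), the fixed iteration count and the function names are our choices, the tie-breaking rule on the standard deviation is omitted, and $\sigma$ is here passed in rather than re-estimated with the median-based formula. The seven-point solver follows the cubic construction of section 5.2.

```python
import numpy as np

def seven_point_F(ui, uj):
    """One or three real F candidates from exactly 7 correspondences
    (homogeneous (7, 3) arrays), cf. 'the minimum case' in section 5.2."""
    A = np.array([[xj*xi, xj*yi, xj, yj*xi, yj*yi, yj, xi, yi, 1.0]
                  for (xi, yi, _), (xj, yj, _) in zip(ui, uj)])
    _, _, Vt = np.linalg.svd(A)
    F1, F2 = Vt[-1].reshape(3, 3), Vt[-2].reshape(3, 3)
    # det(a*F1 + (1-a)*F2) is a cubic in a: recover its coefficients by
    # sampling it at four points and fitting a degree-3 polynomial
    xs = np.array([0.0, 1.0, 2.0, 3.0])
    ys = [np.linalg.det(a * F1 + (1 - a) * F2) for a in xs]
    coeffs = np.polyfit(xs, ys, 3)
    return [r.real * F1 + (1 - r.real) * F2
            for r in np.roots(coeffs) if abs(r.imag) < 1e-8]

def ransac_F(ui, uj, n_iter=500, x_alpha=5.99, sigma=1.0,
             rng=np.random.default_rng(0)):
    """RANSAC over seven-point samples, keeping the F with the most inliers
    under the symmetric test dist_i^2 + dist_j^2 < x_alpha * sigma^2."""
    best_F, best_inliers = None, -1
    n = len(ui)
    for _ in range(n_iter):
        idx = rng.choice(n, 7, replace=False)
        for F in seven_point_F(ui[idx], uj[idx]):
            lj = (F @ ui.T).T                 # epipolar lines in image j
            li = (F.T @ uj.T).T               # epipolar lines in image i
            num = np.sum(uj * lj, axis=1) ** 2
            d2 = num / (lj[:, 0]**2 + lj[:, 1]**2) + num / (li[:, 0]**2 + li[:, 1]**2)
            inliers = np.count_nonzero(d2 < x_alpha * sigma**2)
            if inliers > best_inliers:
                best_F, best_inliers = F, inliers
    return best_F, best_inliers
```

The returned matrix would then serve as the initial estimate for the non-linear refinement of section 5.3, after which the inlier/outlier separation below is applied.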

Discarding all outliers

Once the fundamental matrix has been estimated accurately, it can be used to separate inliers from outliers. All matches that are not consistent with the calculated epipolar configuration should be deleted. This is done after the RANSAC method and can be repeated after the minimization procedure.

5.5 Results

Wolf-sequence

The results illustrated in figures 5.4, 5.5 and 5.6 are based on the wolf-sequence, presented in section 4.3. The threshold used to separate inliers from outliers was estimated to be 1.05 pixels. 77.3% of the matches lie within this threshold from their epipolar lines. Figures 5.4 and 5.5 contain all feature points, both inliers and outliers.

The wolf-sequence was chosen not only to demonstrate results of the algorithm, but also to illustrate some of the problems that arise with real image sequences. Since there is almost no rotation between the images, the epipolar lines should be parallel. Every configuration of parallel lines, regardless of direction, would satisfy the epipolar constraint. However, the RANSAC method tries to find the fundamental matrix that has the largest support among the correspondences, which means that outliers can influence the outcome of the estimation. Since all sets of parallel epipolar lines are equally correct for a sequence without camera rotation, the set that gains the most support, including support from outliers, will be chosen. In other words, the RANSAC method will always produce a sub-optimal solution for a sequence similar to the wolf-sequence.

An epipolar configuration consisting of parallel lines implies that the epipole is situated at infinite distance from the center of the image. The wolf-sequence is severely affected by the aperture problem (see section 4.1) in the areas outside the wolf itself. The support for the chosen epipolar configuration is therefore concentrated in the center of the image. The estimation process then tolerates epipolar lines that are not exactly parallel, since they can be approximated as parallel where the support is. Thus, the epipole ends up closer to the image center than it would for an optimal solution.

Cube-sequence

Figure 5.7 shows the configuration of the cube-sequence. It is a synthetic model, consisting of a set of 3D points and 27 cameras. The 3D points were perfectly projected onto the pixel planes of the cameras, resulting in perfect correspondences (except for rounding errors). Gaussian noise was added to the feature points and the epipolar configuration was estimated. The results are presented in a table.

Figure 5.4: Epipolar lines for the wolf-sequence. 77% of the matches are inliers and lie within slightly more than one pixel from their epipolar lines.

Figure 5.5: Epipolar lines for the wolf-sequence. Same as figure 5.4, but showing only the nose.


More information

A General Expression of the Fundamental Matrix for Both Perspective and Affine Cameras

A General Expression of the Fundamental Matrix for Both Perspective and Affine Cameras A General Expression of the Fundamental Matrix for Both Perspective and Affine Cameras Zhengyou Zhang* ATR Human Information Processing Res. Lab. 2-2 Hikari-dai, Seika-cho, Soraku-gun Kyoto 619-02 Japan

More information

Stereo Image Rectification for Simple Panoramic Image Generation

Stereo Image Rectification for Simple Panoramic Image Generation Stereo Image Rectification for Simple Panoramic Image Generation Yun-Suk Kang and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712 Korea Email:{yunsuk,

More information

Hand-Eye Calibration from Image Derivatives

Hand-Eye Calibration from Image Derivatives Hand-Eye Calibration from Image Derivatives Abstract In this paper it is shown how to perform hand-eye calibration using only the normal flow field and knowledge about the motion of the hand. The proposed

More information