Line Based Estimation of Object Space Geometry and Camera Motion


Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Panu Srestasathiern, B.E., M.S.

Graduate Program in Geodetic Science
The Ohio State University
2012

Dissertation Committee:
Alper Yilmaz, Advisor
Alan Saalfeld
Ralph von Frese

Copyright © by Panu Srestasathiern 2012

Abstract

In this dissertation, two problems of 3D structure and camera motion recovery are addressed. The first is the 3D reconstruction problem using multiple images; in particular, the estimation of 3D lines from multiple views is researched. The second is line-based bundle adjustment, for which a novel cost function is proposed. For line-based 3D structure and camera motion recovery, the first problem is the 3D line estimation, which provides an initial solution for the bundle adjustment process. To facilitate this, I represent the 3D line by its Plücker coordinates. A typical requirement of this representation is the use of the Plücker constraint. I advance the state of the art by waiving the Plücker constraint and propose two streamlined solutions to the 3D line estimation problem. The first proposed 3D line estimation model is based on the preservation of coincidence in the dual projective space. The second method is based on averaging a set of 3D lines generated by intersecting the back-projection planes from multiple images viewing the estimated 3D line. The second component of my proposal is a new bundle adjustment model. More precisely, a new line-based cost function that defines a geometric error in the object space is proposed. The proposed cost function is derived by using the equivalence between the image plane and the unit Gaussian sphere centered at the optical center of the camera. In particular, the geometric error is defined as the integrated squared distance between the projection plane of a 3D line estimate and the points on the perimeter of the circular sector equivalent to the image of the 3D line estimate.

This is dedicated to my family.

Acknowledgments

First of all, I would like to give my sincere appreciation to my advisor, Dr. Alper Yilmaz, for his encouragement throughout this research and all of the time and energy he spent in meeting with me and discussing my research. Also, I would like to thank Dr. Alan Saalfeld and Dr. Ralph von Frese for serving on my dissertation defense committee. I would like to thank the faculty in the Department of Civil and Environmental Engineering and Geodetic Science and the Department of Earth Science for making The Ohio State University one of the best places to study geodetic science. It is a great honor to join the geodetic science program at The Ohio State University. I would like to thank the members of the Photogrammetric Computer Vision Laboratory and all of my friends at The Ohio State University for making the time I have studied here one of the best times in my life. My appreciation also goes to the Thai Government, whose support gave me the most valuable chance to pursue the Ph.D. degree at The Ohio State University. To my parents: your encouragement, love and support have guided me through this process. To my sister: thank you for having good conversations with me every weekend.

Vita

February: Born in Bangkok, Thailand
May: B.Eng. Electrical Engineering, Chulalongkorn University, Bangkok, Thailand
December: M.S. Geodetic Sciences, The Ohio State University, Columbus, OH, USA

Publications

Research Publications

P. Srestasathiern and A. Yilmaz. View invariant object recognition. In ICPR, pages 1-4.
P. Srestasathiern and A. Yilmaz. Planar shape representation and matching under projective transformation. Comput. Vis. Image Underst., 115(11), November.
G. Barsai, A. Yilmaz, S. Nagarajan, and P. Srestasathiern. Registration of Images to LIDAR and GIS Data without Establishing Explicit Correspondences. Photogrammetric Engineering & Remote Sensing, submitted (2nd round of revision).
P. Srestasathiern and A. Yilmaz. A Line Based Approach to Recovering the 3D Scene Geometry and Camera Motion. IEEE Trans. Image Process., submitted.

Fields of Study

Major Field: Geodetic Science and Surveying

Studies in:
Digital photogrammetry
Multiple view geometry
Mathematical image analysis

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   Scope of dissertation
   Organization of the dissertation
2. Background
   Projective geometry
      Representation of geometric entities
      Pinhole camera model
      The equivalence between the image plane and the Gaussian sphere
      Duality principle
   Riemannian manifold
   Non-linear mean shift on Riemannian manifold
      Ordinary mean shift
      Non-linear mean shift
3. 3D line estimation
   Literature review
   Overview of the proposed methods
   3D line triangulation by incidence property in dual space
   3D line triangulation by averaging approach
      Effect of scaling on 3D line representation
      Generating 3D line samples
      3D line averaging by averaging in special orthogonal groups
4. Line based Bundle Adjustment
   Literature review
   Overview of this chapter
   Geometric error in object space
   Parameterization of camera motion and 3D line
      Parameterization of camera motion
      Parameterization of 3D line
   Numerical optimization
5. Experimental results
   Performance evaluation
   Experiments on synthetic data
   Experiment on real data
      The model house sequence
      The book sequence
      Fish eye camera
6. Conclusion
Bibliography

List of Tables

2.1 The comparison between operators in vector spaces and manifolds
5.1 Comparative results from the experiment with the model house sequence
5.2 Comparative results from the experiment with the book sequence

List of Figures

2.1 A 3D line L can be represented by its Plücker coordinates L = [L_h^T L_o^T]^T, which is a 6-vector. The first 3-vector, L_h, is the direction of the line and the second is the moment of the line, which is the normal vector of the plane containing the origin and the line.
2.2 Back-projection planes from a stereo pair intersect at a line in space.
2.3 A scene point X is projected to an image point x.
2.4 The back-projection plane from the image line l. It is also known as the interpretation plane.
2.5 The mapping between the Gaussian sphere S^2 and the 2D projective space P^2. A line l on P^2 is equivalent to a great circle with normal vector N on the Gaussian sphere. If a point x on the image line l is equivalent to a point V on the Gaussian sphere, the point V must be on the great circle equivalent to the image line l.
2.6 A line segment on the image plane is equivalent to a circular arc, which is a segment of the circumference of the great circle equivalent to the line passing through x_1 and x_2.
2.7 The dual entity of a point x in 2D projective space is a line x* in the dual projective space, and vice versa. A proposition also has a dual. This figure demonstrates the dual of collinearity, which is intersection in the dual projective space. Three collinear points on the line l become three concurrent lines in the dual projective space, where the intersection point is the dual of the line l.
2.8 Mapping between manifold and tangent space. The logarithm operator log_x maps a point y on the manifold to a point on the tangent space T_x, while the exponential operator exp_x maps the point back onto the manifold. Note that these two operators depend on the point x on the manifold.
2.9 The non-linear mean shift on a manifold. To compute the mean shift vector, all data points are projected to the tangent space of the current mode estimate. On the tangent space, the mean shift vector is computed. The updated mode estimate is obtained by projecting the mean shift vector back to the manifold.
3.1 3D line estimation using two views. The location of the estimated line is not well constrained because the back-projection planes from two views always intersect in space.
3.2 Estimated 3D lines on the Klein quadric denoting the proposed final error. Red denotes the estimated 3D line, and other colors denote pairwise intersections of back-projected lines. (a) Initial solution, (b) final solution.
3.3 The dual projective space. Given two planes intersecting at a line L, the dual elements of these two planes are incident with the dual line L*.
3.4 The plane containing the origin of the object space coordinate system and the 3D line L intersects the back-projection plane at the 3D line. As a result, the cross product between the moment L_o and the normal of the back-projection plane is parallel to the line direction L_h.
3.5 Given points in R^2 that all lie on a 1-D manifold, i.e., a circular arc, the average of these points computed by the barycentric mean does not lie on the circular arc.
4.1 An image line feature can be represented by a raw edge, which is a sequence of points. A cost function for bundle adjustment is then defined as a geometric relation between an image point and its corresponding 3D line, e.g., the collinearity equation.
4.2 The proposed object space error is defined as the integrated squared distance between the projection plane and points on the boundary of the sector formed by O, V_1 and V_2. Points V_1 and V_2 are obtained by mapping the endpoints x_1 and x_2 of an image line segment onto the Gaussian sphere's surface, where the line is the image of the 3D line L. The projection plane of the 3D line L has normal vector M in the camera coordinate system.
4.3 A 3D line under a change of coordinate system from object space to the camera coordinate system. The moment of the 3D line in the camera coordinate system, L_o, becomes the normal of the projection plane.
4.4 Local parameterization of the parameter manifold. Each point on the manifold has a Euclidean structure. The current estimate is updated locally such that the solution is still on the manifold and minimizes the cost function g.
5.1 The similarity measure between the re-projection of the estimated line and its corresponding line segment is the orthogonal distance from the line segment endpoints to the re-projected line.
5.2 Camera setup for the wide baseline case.
5.3 Experiment result on the wide baseline case.
5.4 Camera setup for the short baseline case.
5.5 Experiment result on the short baseline case.
5.6 Two images from the model house sequence overlaid with image lines (yellow dashed lines) and re-projected line estimates (black solid lines).
5.7 3D line reconstruction and estimated camera motions from the proposed method after bundle adjustment. 5.7a shows the 3D structure and camera poses. 5.7b shows the side view of the model house.
5.8 Reconstructed 3D scene viewed from two different viewpoints.
5.9 Top view of the reconstructed model house sequence from different methods.
5.10 Reprojection of estimated 3D lines onto two sample images of the book sequence. The manually detected image lines are plotted as yellow dashed lines and re-projected line estimates as black solid lines.
5.11 Zoom on an area of an image from the book sequence overlaid with image lines (yellow dashed lines) and re-projected line estimates (black solid lines). The result from the initial 3D line triangulation is improved by the bundle adjustment process.
5.12 The reconstructed 3D scene and camera poses.
5.13 Top and front views of the reconstructed scene.
5.14 Reconstructed 3D scene viewed from different angles.
5.15 Re-projection of the estimated lines to the image. The re-projected lines are shown as yellow solid lines.
5.16 Reconstructed 3D scene and camera poses. Figure 5.16b shows the side view of the scene.
5.17 Top view of the reconstructed calibration pattern.
5.18 The re-projection of initial and adjusted results. The re-projected lines are shown as yellow solid lines.

Chapter 1: Introduction

A main objective in the photogrammetry and computer vision fields is to recover the scene geometry and the camera motions. Broadly speaking, the goal is to infer 3D information from 2D data, i.e., a set of images of a static or moving scene. The determination of camera motions and 3D scene reconstruction can also be applied to other applications. For example, in video augmentation, the known camera motion and 3D scene provide the ability to place an artificial object in the reference frame. Another application is orthophoto generation, for which the camera motion recovery is crucial. In robotics, the recovery of camera motions and 3D scene geometry is used to simultaneously build a map of the environment and localize the robot within that environment. The 3D scene recovery deals with the estimation of unknown geometric entities, such as points, while the recovery of camera motion deals with the estimation of the optical centers and orientations of cameras in an object coordinate system. Broadly speaking, 3D structure and camera motion recovery can be solved by the two following approaches:

- solving the camera motions first and then estimating the 3D structure, or
- solving the 3D structure and the camera motions simultaneously.

The cameras used for the 3D information recovery can be either non-calibrated or calibrated. When a camera is calibrated, its interior orientation is known. The interior orientation of a camera refers to the geometric transformation between the 3D image coordinate system and the camera's perspective center. In particular, the interior orientation of a digital perspective camera consists of the position of the image center, the scaling factors for row and column pixels, and the skew factor. These parameters, which are referred to as the calibration parameters, are compactly embedded in a calibration matrix. Lens distortions such as the radial and tangential distortion can be taken into account in order to obtain a more accurate 3D information recovery. In order to obtain 3D metric information on both the recovered 3D structure and the camera motion, cameras are usually assumed to be calibrated. In other words, non-calibrated cameras only give a solution projectively equivalent to the metric one. However, this does not rule out obtaining a metric recovery from non-calibrated cameras. A technique called auto-calibration can be performed to upgrade the solution from projective to metric [1, page 272]. Auto-calibration is the process of directly estimating the calibration parameters from images of un-calibrated cameras with no or few assumptions on the scene structure. Moreover, while traditional camera calibration methods require a special calibration pattern or object, auto-calibration methods do not. After the calibration matrix is obtained, both the camera motion and the scene geometry can be upgraded to metric space. In terms of the input data, traditionally, most 3D scene and camera motion recovery methods use point features [2, 3, 4]. For example, in photogrammetric activities, the collinearity equation is predominantly used. The model is formulated based on the collinearity of rays emanating from the optical center of the camera and passing through an

image point and its corresponding 3D point. In computer vision, the linear camera model is originally formulated for point projection. The point based approach in both fields is very popular due to the fact that the geometric notion of a point is well defined. However, the disadvantage of the point based approach is that point features can be easily occluded. This problem can be addressed by using higher order features such as lines and conic sections. In a typical 3D recovery framework, the camera motions should first be estimated. The estimation of the camera motions is known as the pose estimation problem. The motion can be defined either as the displacement of cameras with respect to a global coordinate system (absolute orientation), or as the relative motion between the cameras (relative orientation). A simple setup for the relative motion recovery is to fix a local coordinate system to the coordinate system of a selected camera. For example, let us consider an image sequence. The origin of the local coordinate system can be set to coincide with the projection center of the first image. The orientation of the local coordinate system's axes is then the same as that of the first image's coordinate system. In some applications, the reference to the global coordinate system is not necessary, so that only the relative orientation is required. Moreover, in some scenarios, only the relative orientation can be recovered because no or insufficient control information is available. With the emergence of the Global Positioning System (GPS) and Inertial Navigation System (INS), the absolute orientation of the camera viewpoints with respect to the world coordinate system can be obtained directly. This approach is called direct orientation or direct georeferencing. The direct orientation approach has been popular, especially in photogrammetry. The principle of this approach is to attach a navigation unit to the imaging system. The GPS is used to determine the optical center of the camera at each

exposure by establishing pseudoranges from at least four satellites. The position, orientation and velocity are obtained from the INS. The INS motion information can also facilitate the interpolation of position information because the positioning information from GPS is discrete. The main limitation of this approach is its susceptibility to jammed GPS signals and GPS-denied environments. An alternative approach, which is called indirect orientation in photogrammetry, can be used to recover the absolute orientation with respect to a specific local or global coordinate system. This approach requires the availability of control information or features in that coordinate system. For example, ground control points from a field survey can be used to compute camera poses. In an urban scene, most architectural buildings have 3 orthogonal line pairs. Their images are used to compute vanishing points in 3 directions, which are subsequently used to estimate the displacement of a camera's viewpoint [5]. Alternatively, the absolute orientation of camera viewpoints can be obtained by first determining the relative orientation and then transforming the reference frame defined for the relative orientation to the object coordinate system. Moreover, without control information in 3D object space, the relative orientation is used instead. The most fundamental problem in computing the relative motion is stereo vision [6]. Since it is a versatile method for the computation of scene geometry and most similar to the biological visual system, it has been one of the most studied and ongoing research topics in computer vision, photogrammetry, computer graphics and robotics. Moreover, it can be used as a fundamental building block for the computation of multiple camera motions. A well-known geometric description of the stereo view is the epipolar geometry. The expression of the epipolar geometry contains the translation and rotation between the two camera coordinate systems. In computer vision, the algebraic expression describing the epipolar

geometry is called the fundamental matrix, which is also referred to as the essential matrix when the cameras are calibrated [7]. Precisely, the fundamental matrix expresses the relation between a point on one image and its conjugate point on the second view. One of the main disadvantages of stereo based vision is that some scene features can be occluded and there may be some degenerate or critical configurations. In other words, estimating the camera motions using more views reduces the chance of encountering a critical configuration. An important critical condition, called a degenerate configuration, must be avoided in the estimation of camera motion because the motion cannot be recovered uniquely. For example, let us consider the trifocal tensor or trilinear constraint [8], which is a projective geometric relationship describing the relative motion between three viewpoints. Namely, the relative motions between 3 cameras are compactly embedded in a tensor, which is a cube of numbers. While the degenerate configuration for 3 views is just the case where all image features used to estimate the camera motion arise from a common plane in 3D, some degenerate configurations for two views are the cases where the 3D points corresponding to the image points lie on a quadric surface passing through the two optical centers, or the image points lie on a common plane or a line. Another advantage of using more than 2 views is the ability to transfer geometric entities from one view to another. After the motions of the cameras are determined, the next step is to recover the 3D structure. This process is also known as triangulation or structure from motion [9, 10]. Given the images of geometric entities in space taken with cameras of known pose, the goal is to find the unknown geometric entities, e.g., points, lines, planes or curves in space. For instance, point reconstruction is the simplest case of the reconstruction problem. The basic principle of point triangulation is to find the intersection of converging lines in space. Such lines are obtained by back-projecting the matched points on the images to object space.
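To make the triangulation principle concrete, the following is a minimal two-view linear (DLT-style) triangulation sketch; the function name and the way the two rows per view are formed are illustrative assumptions, not code from this dissertation.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linearly triangulate a 3D point from two views.

    P1, P2 : 3x4 camera projection matrices.
    x1, x2 : matching image points as (u, v) pairs.
    Each view contributes two rows of the homogeneous system A X = 0; the
    least-squares solution is the right singular vector of A associated with
    the smallest singular value, i.e., the classic DLT triangulation.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize
```

With noise-free data the four rows are consistent and the back-projected rays intersect exactly; with noisy data the same solution gives an algebraic least-squares compromise, which is the situation discussed next.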

In the noiseless case, finding the intersection of converging lines in space is trivial, with no significant difficulties. In contrast, in the presence of noise, the back-projection lines generally do not intersect in space. Under an assumed noise model, the triangulation problem then becomes finding the point that most likely represents the point of intersection. To choose the most likely point of intersection [9], using calibrated cameras is less difficult than using uncalibrated ones. Precisely, with a calibrated camera the reference frame is already Euclidean, and metric information such as distance or angle is meaningful and can be used to define the triangulation model. In contrast, when cameras are not calibrated, the camera motions are in a projective reference frame; the notions of distance and angle have no meaning. Therefore, most reconstruction methods using uncalibrated cameras use the concept of re-projection error, which is invariant to projective transformations of the reference frame. The basic camera setup for 3D structure reconstruction is the stereo view, where two viewpoints see the same scene or have an overlapping area in the image. However, using only two views is limited to the reconstruction of points in space. By using more views, higher order geometric entities such as lines, planes, conic sections or algebraic curves can be reconstructed. In other words, higher order entities have more degrees of freedom, which requires more redundancy to estimate; using more views increases the redundancy in the computation. Moreover, using more views can lead to greater accuracy in the reconstruction because of the added stability from multiple views and the stricter constraints on the position of the reconstructed entities in space. Additionally, redundancy contributes to the elimination of error in the reconstruction. Another approach to recovering the camera motions and 3D scene geometry is to estimate them simultaneously. This approach itself can be classified into linear and

non-linear methods. Both of these methods locally minimize a cost function. A well studied linear method for recovering camera motion and 3D scene geometry is factorization [2, 11], which is a popular approach in computer vision and robotics. The principle of the factorization method is to factorize the observation matrix, whose elements are matching image points, into the matrix of camera motions, which are embedded in the camera projection matrices, and the matrix of the 3D scene geometry, such as the 3D points. The advantage of this approach compared to the non-linear approach is that it waives the requirement of an initial solution. In spite of this advantage, the factorization method is limited to the case when the reconstructed points are visible in all images. Moreover, the minimized cost function is not geometrically meaningful and is usually non-linear. The most common non-linear method for recovering both 3D structure and camera motion is bundle adjustment [12, 4, 3]. It was first developed in photogrammetry and has been widely used elsewhere, e.g., in computer vision and robotics. The concept of this method is to adjust the bundle of rays between each 3D point and the optical centers of the cameras viewing that point. In contrast to the linear approach, it does not require any 3D point to be visible in all images. The process minimizes a non-linear cost function by iteratively solving its linearized version. Since the iterative computation requires a starting estimate, this initial solution can, for instance, be obtained from the factorization approach. Since its objective is to determine 3D geometric features, camera poses and sometimes camera calibrations, the bundle adjustment becomes a large sparse optimization problem. The complexity of the bundle adjustment depends on the parameterization of the 3D geometric entities and camera poses, and also on the constraints that must be taken into account in order to get a valid estimate. For example, the orthogonality constraint needs to be imposed in order to get a valid rotation matrix, or some additional constraints may need to be taken into

account when higher order geometric entities are estimated. This is also an advantage of the bundle adjustment because it allows a wide range of parameterizations. Moreover, the choices of the camera's projection model and of the cost function are not limited. Although, in terms of the computational model, the bundle adjustment problem is classically formulated as a non-linear least squares model assuming a quadratic cost function, it has been adapted to a variety of models, e.g., non-quadratic or robust cost functions. Thus, bundle adjustment has been shown to be very versatile and applicable in a wide range of applications.

1.1 Scope of dissertation

In traditional photogrammetry and computer vision activities, most algorithms for 3D scene and camera motion recovery are developed based on point features. Precisely, point matches between images are primarily used as input data. In some applications, using higher order features is more suitable than using points. For example, in urban scenes, most meaningful features are higher order features, e.g., straight lines or conics. Using such higher order features can alleviate the occlusion problem that is very critical in point based techniques. Many researchers have investigated the feasibility of using lines in computer vision and photogrammetry. In this dissertation, two problems in structure and camera motion recovery using line features are addressed; in particular, the 3D line estimation and the line-based cost function for bundle adjustment are studied. The aim of this research is to develop new methods for 3D line estimation and a novel cost function for bundle adjustment. For the first problem, either relative or absolute camera motions are assumed to be known ahead of time. The initial camera motion can be obtained from existing linear algorithms or, alternatively, from an INS solution. Two possible methods for estimating a

3D line, which are not limited to either calibrated or un-calibrated cameras, are introduced. The proposed methods estimate a 3D line by using the back-projection planes of image lines. The first method estimates a 3D line by minimizing a cost function based on the duality property of the projective space. Specifically, in the dual projective space, the duals of back-projection planes are points, while the duals of estimated lines are lines. Since the coincidence property is preserved, it is hypothesized that the dual of the optimal line estimate is coincident with the duals of the back-projection planes. Namely, the dual of the optimal line estimate minimizes the orthogonal distances to the duals of the back-projection planes. Since the estimated line is represented by Plücker coordinates, the Plücker constraint traditionally needs to be imposed in the estimation model. However, by formulating the cost function in the dual projective space, the requirement of the Plücker constraint can be waived. The second line estimation method is based on a non-parametric statistical method, namely the mean shift method. It is a non-parametric method for finding the mode of sampled data. In terms of geometry, it seeks the mode of a set of discrete points in space. In the proposed line estimation method, the sampled set of lines in space is generated by the intersection of back-projection planes from multiple images viewing the estimated line. By representing a line using rotation matrices in 2D and 3D, a mean shift clustering method for 3D line estimation can be developed, because the distance metric in the space of rotation matrices is well defined. In order to develop the line-based cost function, a line segment, which is represented by its two end-points, is used instead of an infinite analytical line. The use of endpoints provides the location of the estimated 3D line. An important fact used to derive the cost function is the equivalence between two models of the 2D projective space. The first model is the image plane, which is commonly used in many applications; the second is the Gaussian sphere centered at the optical center

of the camera. Under this equivalence property, a line segment is equivalent to a circular segment on the plane containing the optical center and the line on the image plane. That is, a line segment is represented by its equivalent circular segment instead. It is then hypothesized that the optimal camera motions and 3D scene geometry are those that minimize the integral of the squared distance between all points on the boundary of the circular segment and the plane containing the optical center and the estimated line.

1.2 Organization of the dissertation

The rest of this dissertation is organized as follows. The background on projective geometry, camera models and mean shift clustering is reviewed in Chapter 2. The proposed 3D line estimation methods are presented in Chapter 3. Chapter 4 is dedicated to the proposed line based bundle adjustment model. The experimental results are reported in Chapter 5 and the conclusion is given in Chapter 6.

Chapter 2: Background

In order to facilitate the discussion on the development of the proposed methods, in this chapter I review the necessary background information. This chapter starts with the introduction of related concepts in projective geometry, including the representations of geometric entities in projective space and the camera model, which is followed by the projection and back-projection using camera matrices, and the duality principle in projective space. More details on projective geometry as used in computer vision can be found in [1, 13, 14].

2.1 Projective geometry

Projective geometry is the branch of geometry studying the transformations of geometric entities, especially the projective transformation. An example of a projective transformation is the imaging of a 3D point, which is projected onto an image plane. From a global viewpoint, the Euclidean transformation is a special case of the projective transformation. There are also two geometric transformations between them, the similarity and affine transformations, such that Euclidean ⊂ similarity ⊂ affine ⊂ projective transformations. One of the drawbacks of using projective geometry is that many geometric quantities are not invariant under the transformation. For instance, orthogonality of lines and parallelism of lines are not preserved under a projective transformation. Despite this drawback, an interesting property of projective geometry is the duality principle. For example, in

2D projective space, this principle makes points and lines equivalent. Moreover, with the existence of homogeneous coordinates, the algebra of projective transformations is linear and has simpler analytical forms.

2.1.1 Representation of geometric entities

I first introduce the notation used in this review of projective geometry. A vector is written in boldface, i.e., x or l, while a matrix is in typewriter font, e.g., R. The notation [a]_× represents the skew-symmetric matrix of the cross product between two 3-vectors, i.e., [a]_× b = a × b. Geometric primitives are represented in the projective space by their homogeneous coordinates. The homogeneous coordinates of a point in n-dimensional Euclidean space R^n form an (n+1)-vector. The coordinates of a point in n-dimensional Euclidean space can be converted to homogeneous coordinates by concatenating the coordinates with 1. Two homogeneous coordinate vectors are equivalent up to scale, i.e., x_1 ~ x_2 if and only if x_1 = λ x_2 for some non-zero λ ∈ R. For example, the homogeneous coordinates of a 3D point [x y z]^T are [x y z 1]^T, which is equivalent to [λx λy λz λ]^T. Broadly speaking, a point in n-dimensional Euclidean space can be represented as a point in the n-dimensional projective space, P^n, by an (n+1)-vector.

The most significant projective entity used in this dissertation is the straight line. In the 2D projective space, e.g., the image plane, a line is represented by a 3-vector, i.e., l = [a b c]^T, where ax + by + c = 0 is the line equation. The line l passing through two points x_1 and x_2 can be computed as l ~ [x_1]_× x_2.

Let a 3D line be defined by two points X_1 and X_2. Among several other choices, this line can be represented by its Plücker coordinates, which form a point in P^5:

$$\mathbf{L} \sim \begin{bmatrix} \mathbf{X}_2 - \mathbf{X}_1 \\ [\mathbf{X}_1]_\times \mathbf{X}_2 \end{bmatrix} = \begin{bmatrix} l_1 & l_2 & l_3 & l_4 & l_5 & l_6 \end{bmatrix}^\top = \begin{bmatrix} \mathbf{L}_h \\ \mathbf{L}_o \end{bmatrix}, \qquad (2.1)$$

where L_h and L_o are, respectively, the direction and the moment of the line; the moment is the normal vector of the plane containing the origin and the 3D line, see Figure 2.1.
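As a small, hedged illustration of these definitions (not code from the dissertation), the sketch below builds the homogeneous line through two image points and the Plücker coordinates of equation (2.1); the function names are hypothetical.

```python
import numpy as np

def line_through_points_2d(x1, x2):
    """Homogeneous 2D line l ~ x1 x x2 through two homogeneous image points."""
    return np.cross(x1, x2)

def plucker_from_points(X1, X2):
    """Plücker coordinates L = (L_h, L_o) of the 3D line through the
    inhomogeneous points X1 and X2, following equation (2.1):
    direction L_h = X2 - X1 and moment L_o = X1 x X2."""
    return np.concatenate([X2 - X1, np.cross(X1, X2)])

# Because the moment is orthogonal to the direction, the Plücker constraint
# L_h . L_o = 0, introduced below as equation (2.3), holds by construction:
L = plucker_from_points(np.array([0., 0., 1.]), np.array([1., 2., 3.]))
assert abs(L[:3] @ L[3:]) < 1e-12
```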

The Plücker coordinates can also be written in the form of a Plücker matrix:

$$\mathtt{M} = \begin{bmatrix} 0 & l_6 & -l_5 & -l_1 \\ -l_6 & 0 & l_4 & -l_2 \\ l_5 & -l_4 & 0 & -l_3 \\ l_1 & l_2 & l_3 & 0 \end{bmatrix}. \qquad (2.2)$$

Here, M contains the column vectors M = [L_x L_y L_z L_∞], each of which is the point of intersection between the 3D line and the x, y, z planes and the plane at infinity, respectively.

Figure 2.1: A 3D line L can be represented by its Plücker coordinates L = [L_h^T L_o^T]^T, which is a 6-vector. The first 3-vector, L_h, is the direction of the line and the second is the moment of the line, which is the normal vector of the plane containing the origin and the line.

Due to the unknown scale, a line L in P^3 becomes a point in P^5. In other words, points in P^5 are used to represent lines in P^3. One should bear in mind that not all points in P^5 are equivalent to a line in P^3; L in (2.1) represents a 3D line if and only if a bi-linear

constraint, referred to as the Plücker constraint, is satisfied:

$$0 = \zeta(\mathbf{L}) = \mathbf{L}_h^\top \mathbf{L}_o = l_1 l_4 + l_2 l_5 + l_3 l_6. \qquad (2.3)$$

Points in P^5 satisfying the Plücker constraint lie on a special surface referred to as the Klein quadric. Given a point X = [X̃^T 1]^T in P^3, where X̃ is the inhomogeneous 3-vector, the (Euclidean) perpendicular distance between the point and a line L is [13, page 78]:

$$d(\mathbf{X}, \mathbf{L}) = \frac{\left\| \begin{bmatrix} [\tilde{\mathbf{X}}]_\times & -\mathtt{I}_3 \end{bmatrix} \mathbf{L} \right\|}{\|\mathbf{L}_h\|}. \qquad (2.4)$$

As an alternative to using L_h and L_o as a 6-vector, their directions and magnitudes can be represented by the special orthogonal groups SO(3) and SO(2), where SO(n) is the group of rotation matrices in R^n. This 3D line representation is also called the orthonormal representation [15]. That is, the direction and moment of the line are embedded in rotation matrices. Therefore, any 3D line can also be represented by:

$$\mathbf{L} \sim (\mathtt{S}, \mathtt{T}) \in SO(3) \times SO(2), \qquad (2.5)$$

where

$$\mathtt{S} = \begin{bmatrix} \mathbf{s}_1 & \mathbf{s}_2 & \mathbf{s}_3 \end{bmatrix} \qquad (2.6)$$

$$= \begin{bmatrix} \dfrac{\mathbf{L}_h}{\|\mathbf{L}_h\|} & \dfrac{\mathbf{L}_o}{\|\mathbf{L}_o\|} & \dfrac{[\mathbf{L}_h]_\times \mathbf{L}_o}{\|[\mathbf{L}_h]_\times \mathbf{L}_o\|} \end{bmatrix}, \qquad (2.7)$$

$$\mathtt{T} = \frac{1}{\|\mathbf{L}\|} \begin{bmatrix} \|\mathbf{L}_h\| & -\|\mathbf{L}_o\| \\ \|\mathbf{L}_o\| & \|\mathbf{L}_h\| \end{bmatrix} \qquad (2.8)$$

$$= \begin{bmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{bmatrix}. \qquad (2.9)$$

Given the orthonormal representation of a 3D line, we can easily convert it back to Plücker coordinates by:

$$\mathbf{L} \sim \begin{bmatrix} t_{11} \mathbf{s}_1 \\ t_{21} \mathbf{s}_2 \end{bmatrix}. \qquad (2.10)$$
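The sketch below, assuming the conventions of equations (2.5) through (2.10) and using hypothetical function names, converts Plücker coordinates to the orthonormal representation (S, T) and back.

```python
import numpy as np

def plucker_to_orthonormal(L):
    """Map Plücker coordinates L = (L_h, L_o) to (S, T) in SO(3) x SO(2),
    following equations (2.6)-(2.9). Assumes the line does not pass through
    the origin, so that the moment L_o is non-zero."""
    L_h, L_o = L[:3], L[3:]
    s1 = L_h / np.linalg.norm(L_h)
    s2 = L_o / np.linalg.norm(L_o)
    s3 = np.cross(L_h, L_o)
    s3 /= np.linalg.norm(s3)
    S = np.column_stack([s1, s2, s3])
    nh, no = np.linalg.norm(L_h), np.linalg.norm(L_o)
    T = np.array([[nh, -no], [no, nh]]) / np.linalg.norm(L)
    return S, T

def orthonormal_to_plucker(S, T):
    """Inverse map of equation (2.10): L ~ (t11 * s1, t21 * s2)."""
    return np.concatenate([T[0, 0] * S[:, 0], T[1, 0] * S[:, 1]])
```

The round trip recovers the original coordinates up to the homogeneous scale, which is exactly why the Plücker constraint never has to be imposed explicitly in this parameterization.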

In this orthonormal representation, the Plücker constraint is implicitly embedded in S and T due to the orthogonality property of rotation matrices, which can easily be verified by taking the dot product between the first and the second columns of the matrix S:

$$0 = \frac{\mathbf{L}_h^\top \mathbf{L}_o}{\|\mathbf{L}_h\|\,\|\mathbf{L}_o\|} \qquad (2.11)$$

$$= \mathbf{L}_h^\top \mathbf{L}_o, \qquad (2.12)$$

and those of the matrix T:

$$0 = \frac{\mathbf{L}_h^\top \mathbf{L}_o + \mathbf{L}_o^\top \mathbf{L}_h}{\|\mathbf{L}\|^2} \qquad (2.13)$$

$$= \mathbf{L}_h^\top \mathbf{L}_o. \qquad (2.14)$$

In our approach, we will make explicit use of this representation in the bundle adjustment in order to waive the imposition of the Plücker constraint. A review of various 3D line representations can be found in [15, 16].

A plane in 3-space is represented by a homogeneous 4-vector whose elements are the plane parameters. Precisely, a plane AX + BY + CZ + D = 0 is represented by the vector [A B C D]^T. Given two 3D planes Π and Π′, such as the back-projection planes of image lines from a stereo pair, they always intersect at a 3D line (see Figure 2.2 for an illustration), which can be computed by:

$$\mathbf{L} \sim \boldsymbol{\Pi} \wedge \boldsymbol{\Pi}' = \Xi(\boldsymbol{\Pi})\, \boldsymbol{\Pi}', \qquad (2.15)$$

where Ξ is a construction matrix [17]:

$$\Xi(\boldsymbol{\Pi}) = \begin{bmatrix} [\tilde{\boldsymbol{\Pi}}]_\times & \mathbf{0}_{3\times 1} \\ S\,\mathtt{I}_3 & -\tilde{\boldsymbol{\Pi}} \end{bmatrix}, \qquad (2.16)$$

and Π = [Π̃^T S]^T, with Π̃ the normal part and S the scalar part of the plane.
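As a hedged sketch of the construction in equations (2.15) and (2.16), the code below intersects two planes given as homogeneous 4-vectors; the helper names are illustrative.

```python
import numpy as np

def skew(a):
    """3x3 matrix [a]_x such that skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def construction_matrix(Pi):
    """6x4 construction matrix Xi(Pi) of equation (2.16) for a plane
    Pi = (n, d), with normal part n and scalar part d."""
    n, d = Pi[:3], Pi[3]
    return np.vstack([np.hstack([skew(n), np.zeros((3, 1))]),
                      np.hstack([d * np.eye(3), -n.reshape(3, 1)])])

def line_from_planes(Pi1, Pi2):
    """Plücker coordinates of the intersection line of two planes, eq. (2.15)."""
    return construction_matrix(Pi1) @ Pi2

# The planes x = 0 and y = 0 meet in the z-axis: direction (0, 0, 1), zero moment.
L = line_from_planes(np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]))
```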

Figure 2.2: Back-projection planes from a stereo pair intersect at a line in space.

The coordinates of geometric entities change under a change of coordinate system. Let the rotation matrix and the translation vector between the original coordinate system and the transformed one be denoted by R and T. A point X in the original coordinate system is then transformed by:

$$\mathbf{X}' = \mathtt{R}(\mathbf{X} - \mathbf{T}). \qquad (2.17)$$

Let a line L in the original coordinate system be defined by two points X_1 and X_2. The Plücker coordinates of the transformed line L′ are then:

$$\mathbf{L}' \sim \begin{bmatrix} \mathbf{X}'_2 - \mathbf{X}'_1 \\ [\mathbf{X}'_1]_\times \mathbf{X}'_2 \end{bmatrix} = \begin{bmatrix} \mathtt{R}(\mathbf{X}_2 - \mathbf{X}_1) \\ [\mathtt{R}(\mathbf{X}_1 - \mathbf{T})]_\times\, \mathtt{R}(\mathbf{X}_2 - \mathbf{T}) \end{bmatrix}. \qquad (2.18)$$

By using the identity [Ba]_× B = adj(B)^T [a]_×, the above equation becomes:

$$\mathbf{L}' \sim \begin{bmatrix} \mathtt{R}(\mathbf{X}_2 - \mathbf{X}_1) \\ \mathtt{R}[\mathbf{X}_1 - \mathbf{T}]_\times (\mathbf{X}_2 - \mathbf{T}) \end{bmatrix} = \begin{bmatrix} \mathtt{R}(\mathbf{X}_2 - \mathbf{X}_1) \\ -\mathtt{R}[\mathbf{T}]_\times(\mathbf{X}_2 - \mathbf{X}_1) + \mathtt{R}[\mathbf{X}_1]_\times \mathbf{X}_2 \end{bmatrix} \qquad (2.19)$$

$$\sim \begin{bmatrix} \mathtt{R} & \mathtt{0}_{3\times3} \\ -\mathtt{R}[\mathbf{T}]_\times & \mathtt{R} \end{bmatrix} \begin{bmatrix} \mathbf{X}_2 - \mathbf{X}_1 \\ [\mathbf{X}_1]_\times \mathbf{X}_2 \end{bmatrix} = \begin{bmatrix} \mathtt{R} & \mathtt{0}_{3\times3} \\ -\mathtt{R}[\mathbf{T}]_\times & \mathtt{R} \end{bmatrix} \mathbf{L}. \qquad (2.20)$$
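A minimal sketch of the 6x6 line motion matrix of equation (2.20), assuming the convention X' = R(X - T) used above; the function names are illustrative.

```python
import numpy as np

def skew(a):
    """3x3 matrix [a]_x with skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def line_motion_matrix(R, T):
    """6x6 matrix of equation (2.20) mapping Plücker coordinates from the
    original frame to the frame rotated by R and centered at T."""
    Z = np.zeros((3, 3))
    return np.block([[R, Z],
                     [-R @ skew(T), R]])

def transform_line(L, R, T):
    """Transform Plücker coordinates L = (L_h, L_o) into the new frame."""
    return line_motion_matrix(R, T) @ L
```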

2.1.2 Pinhole camera model

As is traditionally assumed, I use the pinhole camera model, which is composed of exterior and interior orientation parameters. The exterior orientation parameters consist of the rotation matrix R (3×3) and the optical center T (3×1), while the interior orientation parameters are the coefficients of the calibration matrix K (3×3):

$$\mathtt{K} = \begin{bmatrix} \alpha & \gamma & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (2.21)$$

where α and β are the scales in the x and y directions, (u_0, v_0) are the coordinates of the principal point, and γ is the skew parameter. Precisely, the interior orientation parameters describe the geometry between the sensor and image planes. Combining these parameters into a single projection matrix P results in a model for projecting a 3D point X to its image x, see Figure 2.3, where both are in homogeneous coordinates:

$$\mathbf{x} \sim \mathtt{P}\mathbf{X} \quad \text{with} \quad \mathtt{P} \sim \mathtt{K}\mathtt{R}\,[\mathtt{I}_3\ \ {-\mathbf{T}}], \qquad (2.22)$$

where I_3 is the 3×3 identity matrix. The camera projection matrix P back-projects an image line l to a 3D plane Π called the back-projection plane, which is also known as the interpretation plane. It contains the optical center and the image line, see Figure 2.4, and can be computed by:

$$\boldsymbol{\Pi} \sim \mathtt{P}^\top \mathbf{l}. \qquad (2.23)$$

Note that Π is the plane in the object space coordinate system, not in the camera coordinate system.
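The following hedged sketch assembles the projection matrix of equation (2.22), projects a point, and back-projects an image line to its interpretation plane as in equation (2.23); the names are illustrative, not code from the dissertation.

```python
import numpy as np

def projection_matrix(K, R, T):
    """P = K R [I | -T] of equation (2.22); T is the optical center."""
    return K @ R @ np.hstack([np.eye(3), -T.reshape(3, 1)])

def project_point(P, X):
    """Project an inhomogeneous 3D point X to inhomogeneous image coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def back_projection_plane(P, l):
    """Back-projection (interpretation) plane of the image line l, eq. (2.23)."""
    return P.T @ l
```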

2.1.3 The equivalence between the image plane and the Gaussian sphere

Under the equivalence relation in projective geometry, the 2-dimensional projective plane P^2 can be modeled as the unit Gaussian sphere S^2 in the 3-dimensional vector space R^3.

Figure 2.3: A scene point X is projected to an image point x.

This equivalence is based on the fact that two points are equivalent if and only if they are on the same projection ray. A point can be mapped back and forth between S^2 and P^2 via the gnomonic projection. Therefore, a point x ∈ P^2 can be mapped to a point V on S^2 by

$$\mathbf{V} = \frac{\mathbf{x}}{\|\mathbf{x}\|}. \qquad (2.24)$$

Equation (2.24) can be interpreted as the normalization of the point x to a unit vector. Therefore, V is a point on the unit sphere in R^3, and both V and x are on the same projection ray but on different surfaces. An infinite line l on the 2D projective plane is equivalent to a great circle on S^2, which is the intersection between the sphere S^2 and the interpretation plane of the 2D line l. Hence, the great circle can be represented by the normal vector N of the interpretation plane. A point on the line l is mapped to a point on the great circle equivalent to the line, see Figure 2.5. Consequently, a line segment is equivalent to a circular arc on the interpretation plane.

Figure 2.4: The back-projection plane from the image line l. It is also known as the interpretation plane.

Figure 2.5: The mapping between the Gaussian sphere S^2 and the 2D projective space P^2. A line l on P^2 is equivalent to a great circle with normal vector N on the Gaussian sphere. If a point x on the image line l is equivalent to a point V on the Gaussian sphere, the point V must be on the great circle equivalent to the image line l.

In order to explain the concept, let the segment of the line l be defined by two endpoints x_1 and x_2, which are equivalent to V_1 and V_2 on S^2, respectively; see Figure 2.6 for an illustration. The circular arc connecting V_1 and V_2 is a segment of the circumference of the great circle equivalent to the line l. For an alternative explanation, let us consider the tracing of the projection ray. The projection ray traced from the point x_1 to the point x_2 along the line l intersects the Gaussian sphere, and the locus of the intersection points is the great circle with normal vector N passing through the points V_1 and V_2, because the tracing path is on the interpretation plane. Note that the line l is equivalent to the great circle with normal vector N.

Figure 2.6: A line segment on the image plane is equivalent to a circular arc, which is a segment of the circumference of the great circle equivalent to the line passing through x_1 and x_2.

Given the calibration matrix K of the camera, an image point x can be mapped to the unit Gaussian sphere in the object space with its center positioned at the optical center by:

$$\mathbf{V} = \frac{\mathtt{K}^{-1}\mathbf{x}}{\|\mathtt{K}^{-1}\mathbf{x}\|}, \qquad (2.25)$$

where K^{-1} x is the homogeneous coordinate of the point x̄ = [x̃^T 1]^T on the normalized image plane, i.e., the image plane at unit focal length. In other words, a point on the image plane can be mapped onto the unit Gaussian sphere by first mapping it to the normalized image plane and then normalizing it to unit norm. To compute the normal vector of the great circle equivalent to a line l, let the line l be defined by two points x_1 and x_2. The normal vector of the great circle can be computed as the cross product between their equivalent points on the Gaussian sphere:

$$\mathbf{N} = \frac{[\mathtt{K}^{-1}\mathbf{x}_1]_\times\, \mathtt{K}^{-1}\mathbf{x}_2}{\|\mathtt{K}^{-1}\mathbf{x}_1\|\, \|\mathtt{K}^{-1}\mathbf{x}_2\|}. \qquad (2.26)$$

By using the fact that [Ba]_× B = det(B) B^{-T} [a]_× (det(B) ≠ 0) and l ~ [x_1]_× x_2, one obtains:

$$\mathbf{N} = \frac{\mathtt{K}^{\top}[\mathbf{x}_1]_\times \mathbf{x}_2}{\det(\mathtt{K})\, \|\mathtt{K}^{-1}\mathbf{x}_1\|\, \|\mathtt{K}^{-1}\mathbf{x}_2\|} \qquad (2.27)$$

$$\sim \mathtt{K}^{\top}\mathbf{l}. \qquad (2.28)$$

Note that the normal vector N is in the camera coordinate system.
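A small sketch of equations (2.25) and (2.28): mapping an image point onto the Gaussian sphere and computing the unit great-circle normal of an image line in camera coordinates. It assumes a known calibration matrix K; the names are illustrative.

```python
import numpy as np

def to_gaussian_sphere(K, x):
    """Map a homogeneous image point x onto the unit Gaussian sphere, eq. (2.25)."""
    v = np.linalg.solve(K, x)   # K^{-1} x, the point on the normalized image plane
    return v / np.linalg.norm(v)

def great_circle_normal(K, l):
    """Unit normal of the great circle equivalent to the image line l,
    i.e., the direction K^T l of equation (2.28), in camera coordinates."""
    n = K.T @ l
    return n / np.linalg.norm(n)
```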

2.1.4 Duality principle

One of the important concepts in projective geometry is the duality principle. It is the formulation used to explain the symmetry in the roles played by geometric entities in projective space. Moreover, geometric relations in the projective space also have their duals. For instance, the dual of collinearity is intersection, and vice versa. Therefore, given a proposition in projective space, the dual of the proposition can be formed by replacing the geometric entities and geometric relations by their duals. For example, let us consider the 2-dimensional projective space. The dual entity of a point is a line, and vice versa. Let x be a point in 2D projective space; its dual in the dual projective space is x*. Both of them have the same coordinates, but x is interpreted as a point in the projective space while x* is interpreted as a line in the dual projective space. Similarly, the dual of a line l is a point l* in the dual projective space. A proposition also has a dual. Let x_1, x_2 and x_3 be points in 2D projective space that are collinear on the line l. In the dual projective space, they are interpreted as concurrent lines whose point of intersection is the dual of the line l, see Figure 2.7.

Figure 2.7: The dual entity of a point x in 2D projective space is a line x* in the dual projective space, and vice versa. A proposition also has a dual. This figure demonstrates the dual of collinearity, which is intersection in the dual projective space. Three collinear points on the line l become three concurrent lines in the dual projective space, where the intersection point is the dual of the line l.
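A tiny numerical illustration of this duality, with made-up coordinates: three collinear points, read as lines in the dual plane, are concurrent in the dual of their common line.

```python
import numpy as np

# Three collinear points on the image line l: x + y - 2 = 0.
l = np.array([1.0, 1.0, -2.0])
p1, p2, p3 = np.array([2.0, 0.0, 1.0]), np.array([0.0, 2.0, 1.0]), np.array([1.0, 1.0, 1.0])

# Read in the dual plane, the same coordinate triples are lines; any two of
# them intersect in a point that is the dual of l (equal to l up to scale).
dual_of_l = np.cross(p1, p2)
assert np.allclose(np.cross(dual_of_l, l), 0)   # dual_of_l ~ l
assert abs(dual_of_l @ p3) < 1e-12              # the third dual line passes through it
```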

In 3-dimensional projective space, the dual entity of a point is a plane. Let X be a point in the projective space; its dual in the dual projective space is X*. Similarly, the dual of a plane Π is a point Π*. For a 3D line L, its dual L* is still a line in the dual projective space, but their coordinates differ by:

$$\mathbf{L}^* = \begin{bmatrix} \mathbf{L}_h^* \\ \mathbf{L}_o^* \end{bmatrix} = \begin{bmatrix} \mathtt{0} & \mathtt{I}_3 \\ \mathtt{I}_3 & \mathtt{0} \end{bmatrix} \mathbf{L} = \begin{bmatrix} \mathbf{L}_o \\ \mathbf{L}_h \end{bmatrix}. \qquad (2.29)$$

2.2 Riemannian manifold

For better understanding and for the sake of completeness, this section is dedicated to the Riemannian manifold, which will be used for explaining the non-linear mean shift algorithm and the computation in the proposed 3D line estimation method. A manifold is a topological space in which the neighborhood region of each point has a Euclidean-like structure. A smooth manifold can be thought of as a continuous surface lying in Euclidean space. A manifold is called Riemannian if and only if it is smooth (differentiable) and equipped with an inner product. Hence, this allows one to define a metric on the manifold. Unless stated otherwise, the term manifold refers to a Riemannian manifold. An important notion of the manifold is the tangent space. Let us consider the infinitesimally small neighborhood region of a point x on the manifold M, i.e., x ∈ M. Such an infinitesimally small neighborhood region can be regarded as a flat space, which is a linear approximation of the manifold M around the point x [18]. The flat space at the point x is called the tangent space at x and is denoted by T_x. In other words, the tangent space T_x is a plane tangent to the manifold M at the point x. An example of a tangent space is shown in Figure 2.8, and a vector on T_x is called a tangent vector. Since the tangent space T_x is a vector space, the inner product of tangent vectors can then be defined.

Figure 2.8: Mapping between manifold and tangent space. The logarithm operator log_x maps a point y on the manifold to a point on the tangent space T_x, while the exponential operator exp_x maps the point back onto the manifold. Note that these two operators depend on the point x on the manifold.

A point on the manifold can be mapped to a tangent space and vice versa. The mapping of a point Δ on the tangent space T_x to the manifold M is defined by the exponential map, exp_x. Inversely, the logarithm map log_x = exp_x^{-1} maps a point y on the manifold to the tangent space T_x. Note that these operators depend on the point x, because different points on the manifold have different tangent spaces. The comparison between the addition and subtraction operators in a vector space and the exponential and logarithm operators on a manifold is given in Table 2.1.

Table 2.1: The comparison between operators in vector spaces and manifolds

              Vector spaces     Manifold
Addition      y = x + Δ         y = exp_x(Δ)
Subtraction   Δ = y - x         Δ = log_x(y)

As mentioned earlier, the Riemannian manifold

is equipped with the notion of a metric, which is given in terms of the length of the shortest curve between two points. The curve connecting two points with the shortest distance is called the geodesic, and the Riemannian distance (metric) between two points on the manifold is the length of the geodesic. In Figure 2.8, the dashed line is the geodesic between the points x and y, and the initial velocity of the geodesic is Δ [19]. The relations between y and Δ are expressed as follows:

$$\exp_x(\Delta) = y, \qquad (2.30)$$

$$\log_x(y) = \Delta. \qquad (2.31)$$

Note that the specific forms of the exponential and logarithm operators depend on the manifold. Let d(x, y) be the Riemannian distance between points x and y on the manifold M. The gradient of the squared Riemannian distance is given by [20]:

$$\nabla_x\, d^2(x, y) = -2 \log_x(y). \qquad (2.32)$$

As mentioned earlier, the expressions for the exponential and logarithm operators depend on the manifold. For example, the logarithm and exponential operators of a matrix manifold are different from those of the Grassmann manifold. Since the matrix manifold frequently occurs as a parameter space in photogrammetry and computer vision and is used in this work, only the case of the matrix manifold is discussed. Let Δ and Y be square matrices. The matrix exponential and logarithm operators defined about the identity matrix are expressed as follows:

$$\exp(\Delta) = \sum_{i=0}^{\infty} \frac{1}{i!} \Delta^i, \qquad (2.33)$$

$$\log(\mathtt{Y}) = \sum_{i=1}^{\infty} \frac{(-1)^{i-1}}{i} (\mathtt{Y} - \mathtt{I})^i. \qquad (2.34)$$

Namely, the above expressions can be computed accurately when their arguments are close to the identity matrix. The manifold operators at a point X, i.e., exp_X and log_X, are then defined as:

$$\exp_{\mathtt{X}}(\Delta) = \mathtt{X}\exp(\mathtt{X}^{-1}\Delta), \qquad (2.35)$$

$$\log_{\mathtt{X}}(\mathtt{Y}) = \mathtt{X}\log(\mathtt{X}^{-1}\mathtt{Y}), \qquad (2.36)$$

where X and Y are points on the matrix manifold and Δ is on T_X. Thus, the distance between two points on the matrix manifold is expressed as follows:

$$d(\mathtt{X}, \mathtt{Y}) = \left\| \log(\mathtt{X}^{-1}\mathtt{Y}) \right\|_F, \qquad (2.37)$$

where ‖·‖_F is the Frobenius norm of a matrix. A first order approximation of the gradient of the squared distance d^2(X, Y) is obtained as [21]:

$$\nabla d^2(\mathtt{X}, \mathtt{Y}) \approx -2\log_{\mathtt{X}}(\mathtt{Y}). \qquad (2.38)$$

2.3 Non-linear mean shift on Riemannian manifold

In order to introduce the non-linear mean shift on a Riemannian manifold, the ordinary mean shift is first discussed, followed by the non-linear mean shift.

2.3.1 Ordinary mean shift

Mean shift is a non-parametric statistical analysis technique. Strictly speaking, it is an iterative procedure based on the weighted average in a local region for mode detection/clustering of given discrete data {x_i}. Originally, its purpose is to locate the peaks of a density function given n data points in a vector space, where the multivariate kernel density estimate at a point x can be computed as follows:

$$\hat{f}_k(\mathbf{x}) = \frac{c_{k,h}}{n} \sum_{i=1}^{n} k\!\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right), \qquad (2.39)$$

with the kernel function k satisfying:

$$k(z) > 0 \quad \forall z \geq 0. \qquad (2.40)$$

Here, c_{k,h} is the normalization term such that f̂ integrates to 1. The bandwidth h is introduced as the scaling of the distance function, i.e., the Euclidean distance in the vector space. By computing the gradient of the kernel density estimate (2.39), the mean shift vector at a point x can be calculated as follows:

$$\mathbf{m}_h(\mathbf{x}) = C\, \frac{\nabla \hat{f}_k}{\hat{f}_k} \qquad (2.41)$$

$$= \frac{\displaystyle\sum_{i=1}^{N} \mathbf{x}_i\, g\!\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)}{\displaystyle\sum_{i=1}^{N} g\!\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)} - \mathbf{x}, \qquad (2.42)$$

where

$$g(x) = -k'(x). \qquad (2.43)$$

The expression for the mean shift vector is actually proportional to the normalized density gradient estimate. The first term on the right hand side of (2.42) is the weighted average of the points in the local region around the point x. Starting at a point x_j, the procedure iteratively converges to a stationary point by a gradient ascent technique [22]:

$$\mathbf{x}_{j+1} = \mathbf{m}_h(\mathbf{x}_j) + \mathbf{x}_j. \qquad (2.44)$$

That is, the current mode estimate is shifted to the weighted average of the data points in its neighborhood.
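A compact, hedged sketch of the iteration of equations (2.42) and (2.44), using a Gaussian kernel profile so that g is also an exponential; the bandwidth and stopping threshold are illustrative.

```python
import numpy as np

def mean_shift_mode(points, x0, h=1.0, eps=1e-6, max_iter=500):
    """Seek a mode of the kernel density estimate by iterating x <- x + m_h(x),
    where m_h(x) is the weighted average of the data minus x (eqs. 2.42 and 2.44).

    points : (n, d) array of samples, x0 : (d,) starting point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        w = np.exp(-np.sum((points - x) ** 2, axis=1) / h ** 2)  # g(||(x - x_i)/h||^2)
        shift = w @ points / w.sum() - x                          # mean shift vector m_h(x)
        x = x + shift
        if np.linalg.norm(shift) < eps:                           # converged to a stationary point
            break
    return x
```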

2.3.2 Non-linear mean shift

In [21] and [19], the mean shift algorithm was generalized to the case where the set of discrete points is restricted to lie on a manifold, e.g., the manifold of n-dimensional rotation matrices SO(n), also known as the special orthogonal group, or the manifold of Euclidean transformations in 3-space SE(3), also known as the special Euclidean group. That is, the set of points is in a non-linear space, and the mean shift vector (2.42) can shift the current mode estimate off the manifold. For instance, let us consider the manifold of 3D rotation matrices. The iterative update (2.44) cannot be used to update the current mode estimate because the sum of two rotation matrices is not a rotation matrix. The useful characteristic of a manifold for deriving the non-linear mean shift is that the tangent space at each point on the manifold is well defined. Furthermore, such a tangent space is a Euclidean space. It is hence possible to modify the original mean shift algorithm to work on a non-linear manifold, because the tangent space at the current mode estimate x_j is a vector space. Particularly, the mean shift vector is computed on the tangent space at the current mode estimate. The mode estimate is then updated by mapping the mean shift vector from the tangent space back to the manifold. Another important characteristic of the Riemannian manifold is the notion of a metric, given by the geodesic distance between two points on the manifold. With the existence of the metric, the calculation of the direction and size of the mean shift vector is thus possible. To modify the linear mean shift algorithm, let us consider a manifold M equipped with a distance metric d. The notion of Euclidean distance in (2.39) is then replaced by the distance metric d, and the term c_{k,h} is dropped because the position of the mode is not affected by the global scaling [19].

Figure 2.9: The non-linear mean shift on a manifold. To compute the mean shift vector, all data points are projected to the tangent space of the current mode estimate. On the tangent space, the mean shift vector is computed. The updated mode estimate is obtained by projecting the mean shift vector back to the manifold.

The gradient of f̂_k at x is then:

$$\nabla \hat{f}_k = \frac{1}{n} \sum_{i=1}^{n} \nabla k\!\left( \frac{d^2(\mathbf{x}, \mathbf{x}_i)}{h^2} \right) \qquad (2.45)$$

$$= -\frac{1}{n} \sum_{i=1}^{n} \frac{\nabla d^2(\mathbf{x}, \mathbf{x}_i)}{h^2}\, g\!\left( \frac{d^2(\mathbf{x}, \mathbf{x}_i)}{h^2} \right). \qquad (2.46)$$

Note that the gradient of d is taken with respect to x. The expression for the non-linear mean shift vector for discrete points on a non-linear manifold can be obtained analogously to the linear case in (2.42) [19, 20]:

$$\mathbf{m}_h(\mathbf{x}) = -\frac{\displaystyle\sum_{i=1}^{N} \frac{\nabla d^2(\mathbf{x}, \mathbf{x}_i)}{2}\, g\!\left( \frac{d^2(\mathbf{x}, \mathbf{x}_i)}{h^2} \right)}{\displaystyle\sum_{i=1}^{N} g\!\left( \frac{d^2(\mathbf{x}, \mathbf{x}_i)}{h^2} \right)}. \qquad (2.47)$$

The above mean shift vector is the weighted average of the gradient terms ∇d^2(x, x_i) on the tangent space at the point x. The mode estimate is then updated by projecting the mean shift vector back to the manifold:

$$\mathbf{x}_{j+1} = \exp_{\mathbf{x}_j}\!\big( \mathbf{m}_h(\mathbf{x}_j) \big). \qquad (2.48)$$

Namely, the mode estimate moves along a geodesic on the manifold [19]. The complete algorithm is presented in Algorithm 1.

Algorithm 1: Non-linear mean shift algorithm on a manifold
Data: A set of discrete points on the manifold {x_i}, i = 1, ..., n
Result: The mode estimates of the given set of points
for iter = 1 to w do
    x ← x_iter
    repeat
        m_h(x) ← −[ Σ_{i=1}^{N} (∇d^2(x, x_i)/2) g(d^2(x, x_i)/h^2) ] / [ Σ_{i=1}^{N} g(d^2(x, x_i)/h^2) ]
        x ← exp_x(m_h(x))
    until ‖m_h(x)‖ < ε
end
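To make Algorithm 1 concrete, here is a hedged sketch of one mode-seeking run on the rotation manifold SO(3), implementing the operators of equations (2.35)-(2.37) with scipy's matrix exponential and logarithm; the bandwidth, tolerance and function names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm, logm

def log_X(X, Y):
    """Tangent vector at X pointing toward Y: log_X(Y) = X log(X^{-1} Y), eq. (2.36)."""
    return X @ np.real(logm(X.T @ Y))          # X^{-1} = X^T on SO(3)

def exp_X(X, D):
    """Map a tangent vector D at X back onto the manifold, eq. (2.35)."""
    return X @ expm(X.T @ D)

def dist(X, Y):
    """Geodesic distance d(X, Y) = ||log(X^{-1} Y)||_F, eq. (2.37)."""
    return np.linalg.norm(np.real(logm(X.T @ Y)), 'fro')

def mean_shift_so3(samples, X0, h=0.5, eps=1e-6, max_iter=100):
    """One mode-seeking run of the non-linear mean shift (Algorithm 1) started at X0.

    samples : list of 3x3 rotation matrices, X0 : 3x3 rotation matrix."""
    X = X0.copy()
    for _ in range(max_iter):
        tangents = np.array([log_X(X, Xi) for Xi in samples])
        w = np.array([np.exp(-dist(X, Xi) ** 2 / h ** 2) for Xi in samples])
        m = (w[:, None, None] * tangents).sum(axis=0) / w.sum()   # mean shift on T_X
        X = exp_X(X, m)                                           # back onto the manifold
        if np.linalg.norm(m, 'fro') < eps:
            break
    return X
```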

Chapter 3: 3D line estimation

3.1 Literature review

3D scene geometry reconstruction is an ongoing research topic in many fields, e.g., photogrammetry, robotics and computer vision. It is used in many applications, including the digital archiving of cultural heritage [23], 3D urban scene modeling [24], face recognition [25] and medical diagnosis [26]. The aim of 3D reconstruction is to infer the most likely locations of geometric entities in space. The most commonly used feature in the reconstruction of scene geometry is the point feature. Moreover, it is also the feature used in the early work on scene geometry recovery [27]. Since then, most of the existing 3D scene geometry recovery methods have been developed for 3D point estimation [9, 28, 29, 30, 31]. Although research on 3D scene point recovery is the mainstream, recovering 3D scene points may not be suitable in applications that require the reconstruction of higher order features such as lines, planes or curves. For instance, plane estimation is required for reconstructing building roofs [32, 33]. Polyhedral objects can be reconstructed by 3D line estimation [34]. In particular, to reconstruct high-rise or man-made buildings in an urban environment, 3D line reconstruction is better suited than point reconstruction [35, 36], whereas estimating space curves is better suited for reconstructing arbitrary objects [37, 38, 39].

To estimate a 3D line, at least three images are required, because two images do not provide adequate constraints for locating the 3D line in space [40]. To illustrate, let there be only two images, and let the intersection of the back-projection planes of the matched image lines give the estimated 3D line. Without loss of generality, suppose that the first camera is fixed while the second camera moves (see Figure 3.1); the back-projection planes always intersect in space. Therefore, the position of the 3D line is not unique. To constrain the position of the 3D line, a third view is needed, because three planes in space do not necessarily intersect at a line.

Figure 3.1: 3D line estimation using two views. The location of the estimated line is not well constrained because the back-projection planes from two views always intersect in space.

In the literature, there are several 3D line estimation methods that use the minimal configuration of 3 views. Hartley [8] and Weng et al. [40] used three views to define linear algorithms for the 3D line estimation problem. In [40], only a set of image lines was used to derive a closed form solution for estimating a 3D line from three monocular perspective views. These approaches begin with estimating the relative camera motion and then estimate the 3D line. While this method assumes calibrated cameras, Hartley [8] used un-calibrated cameras and estimated the relative camera motions by the trifocal tensor. The 3D line is then reconstructed by the intersection of back-projection planes. Since un-calibrated cameras are used, only a projective equivalent of the 3D line is obtained. Similarly, the 3D line estimation method proposed in [41] concurrently uses three back-projection planes to estimate the relative camera motion. Once the relative camera motions are computed, the 3D line is estimated from a pair of points on the intersection of the three planes. While the previous methods use only image line features, Oskarsson et al. [42] use a combination of points and lines. They proposed two methods, for reconstructing four points and three lines in three views, and two points and six lines in three views. In contrast, some methods are generalized to the N-view case. An advantage of using many views in the 3D line estimation is the increased redundancy. Namely, the estimation model becomes overdetermined. Moreover, the location of the estimated 3D line can be well constrained. Hartley and Zisserman [1, page 323] formulated a 3D line estimation model by using the fact that all of the back-projection planes from matched lines intersect at a single line in space. Therefore, the back-projection planes form a pencil of planes whose axis is the intersection line. Since a 3D line in space can be represented by two planes that intersect at the line, they suggested that those two planes can be obtained from two basis vectors spanning the back-projection planes. The two basis vectors can be computed by

using Singular Value Decomposition (SVD). The concept of using back-projection planes is also used by Taillandier and Deriche [43]. They formulated a cost function for 3D line estimation by using the fact that points on the estimated line must be coincident with the back-projection planes. Petsa and Karras [44] proposed a method for 3D line estimation from a stereo pair. Since using only two views cannot provide enough constraints on the location of the estimated line in space, this method constrains the estimated line to lie on a model plane. In other words, this method hypothesizes that the back-projection planes from a stereo pair and the model plane intersect at the estimated line. This may be impractical in situations where the equation of the model plane is not known beforehand. Heuel [10] and Heuel and Förstner [45] proposed a probabilistic framework for reconstructing a 3D line by direct reconstruction. Strictly speaking, a statistical model for 3D line reconstruction by the intersection of back-projection planes was proposed. Another approach to 3D line estimation is the minimization of the re-projection error. The concept of this approach is to re-project the estimated 3D line back to the image plane. The optimal 3D line estimate is the one that minimizes a cost function defined in the image space. Bartoli and Sturm [15] presented a quasi-linear method for line estimation by representing a 3D line by its Plücker coordinates with a cost function defined in the image space. Their cost function is composed of the orthogonal distances from the image line end-points to the re-projection of the estimated line. This 3D line estimation is quasi-linear because the non-linear Plücker constraint is imposed in the estimation model. Instead of using the orthogonal distance from the image line end-points to the re-projection of the estimated line, Schindler et al. [35] estimate a 3D line by minimizing the total squared distance between the image line segments and the re-projections of the estimated line. The disadvantage of

this method is that the 3D line representation is limited to vertical lines and horizontal lines with fixed directions. The aforementioned methods use the re-projection of an infinite straight line to define the cost function. An alternative way of defining the re-projection error is to use the re-projection of points defining the estimated line. Park [46] proposed an image based rendering method which generates 3D polygons. The basic concept is to estimate the 3D straight edges of a polygon from multiple images. Both image line features and estimated 3D lines are represented by their endpoints. That is, the endpoints of the 3D straight edges are the polygon's vertices. The cost function is then defined by the area of a quadrilateral whose vertices are the four endpoints of the two line segments, i.e., the image line segment and the projected polygon edge. The re-projection of 3D line segments is also used to define the probability distribution of the location of the estimated 3D line in [47]. The distribution function is generated by sweeping the 3D line end-points along the back-projection rays of the image line endpoints. The probability of the 3D line at each sweep is computed from the total re-projection error of the 3D line end-points over all images, where the re-projection error is formulated using the image gradient value. In contrast to most 3D line estimation methods, this method does not need an explicit 2D line match. However, computing the probability of the 3D line location in space is more expensive than 2D line matching. Josephson and Kahl [30] proposed a unified framework for estimating points, lines and planes. They estimate the 3D line by finding bounds on the coordinates of the two points defining the estimated line. They choose the coordinate system such that the two points defining the estimated line are located on the planes z = 0 and z = 1. The problem of 3D line estimation then becomes finding the x and y coordinates of those two points by minimizing their re-projection

errors. The optimal x and y coordinates are solved for by using branch and bound optimization [48]. Instead of using all of the image data at once, a 3D line can be reconstructed sequentially by using sequential adjustment, e.g., Kalman filtering [49, 50, 51, 52]. That is, the estimated line is updated when a new frame is added. Seo and Hong [49] proposed a sequential line reconstruction method based on the Kalman filtering technique. The reconstruction starts with computing the camera projection matrices of the first three frames, which are then used to estimate 3D lines. To update the estimated 3D line, new frames are added one by one. After a new camera matrix is computed, the 3D line is updated by using an Iterative Extended Kalman Filter (IEKF). A drawback of this method is that it is biased toward the first three frames used to initialize the camera projection matrices. Similarly, Gee and Mayol-Cuevas [50] proposed a Simultaneous Localization and Mapping (SLAM) system using unscented Kalman filters. The system can generate 3D line segments and estimate the camera location in real-time. The rest of this chapter is organized as follows. The overview of the two proposed 3D line estimation methods is presented in Section 3.2. The 3D line estimation method based on the incidence property in the dual projective space is then presented in Section 3.3. The averaging approach for 3D line estimation is presented in Section 3.4.

3.2 Overview of the proposed methods

In the proposed methods, the initial camera motions are assumed to be known. They can be initialized by some linear methods or given by a navigation device such as an inertial navigation system. Among the many choices of 3D line representation, the Plücker coordinates are adopted to represent a 3D line because of their homogeneity and ease of

manipulation in the projective space. As discussed in Chapter 2, the Plücker coordinates form a 6-vector in 5-dimensional projective space satisfying the Plücker constraint. This creates a difficulty in 3D line estimation because an estimation model needs to take the non-linear constraint into account in order to obtain valid Plücker coordinates. The first proposed method is based on the incidence property of the dual line and dual plane in the dual projective space. Specifically, the cost function used in this line triangulation method is the orthogonal distance between the dual line and the duals of the back-projection planes. That is, all of the image lines corresponding to the estimated 3D line are back-projected to the object space. The 3D line can then be estimated by fitting a line to the set of duals of those back-projection planes. The Plücker constraint is naturally embedded in the solution of the cost function. A 3D line can then be estimated by solving the fitting model, which is a set of linear equations, without explicitly taking the Plücker constraint into account. The second 3D line estimation method is based on the averaging of a set of lines. For example, the intersection of the back-projection planes from a line match across two images is a 3D line. By considering all stereo pairs viewing the estimated line, a set of lines can be generated for use in the 3D line estimation. For the illustrated 3-image case in Figure 3.2, the plane intersections generate $C^3_2 = 3$ lines, shown as dashed lines in Figure 3.2a. We use these geometrically incorrect line estimates to compute the likelihood of the true 3D line, shown as a red circle lying on the Klein quadric in Figure 3.2b. This can be done by averaging the set of 3D lines. The difficulty of averaging the set of 3D lines is that the naive barycentric mean is not guaranteed to give valid Plücker coordinates. The Plücker coordinates are therefore converted to the orthonormal representation, which is a pair of 3D and 2D rotation matrices, as discussed in Chapter 2. The 3D line averaging can then be computed by averaging in the space of 3D and 2D rotation matrices.

Figure 3.2: Estimated 3D lines on the Klein quadric denoting the proposed final error. Red denotes the estimated 3D line, and other colors denote pairwise intersections of back-projected lines. (a) Initial solution, (b) final solution.

The result of the averaging is guaranteed to satisfy the Plücker constraint. In contrast to image space based 3D line estimation methods that estimate the 3D line by minimizing a geometric or algebraic error on the image plane, the proposed line estimation methods are more generic since they are independent of the camera type. Namely, the estimation model does not need to be redefined for different camera types, e.g., conventional camera, fisheye lens or catadioptric system, because the proposed 3D line estimation methods are defined based on error minimization in the object space. By back-projecting line features to the object space, the estimation methods become independent of the camera models.

3.3 3D line triangulation by incidence property in dual space

Given the initial estimates of K, R and C of conventional pinhole cameras, or the inverse projection functions of omnidirectional cameras, 3D line estimation deals with back-projecting

image lines using these estimates to locate the 3D lines in the object space. This process, however, does not guarantee projectively correct line estimates. As discussed in Chapter 2, the estimated line should lie on the Klein quadric, which suggests imposing the Plücker constraint on the estimation model $AL = 0$. Using this constraint, the 3D line estimation model can be formulated as a least squares problem subject to a non-linear constraint as follows:

$$\hat{L} = \operatorname*{argmin}_{L} \|AL\|_2^2 \qquad (3.1)$$

subject to

$$\zeta(L) = L^\top D L = 0 \quad \text{and} \quad \|L\| = 1. \qquad (3.2)$$

Imposing the Plücker constraint and the additional unit norm constraint in the above estimation model makes the estimation process difficult. Only the linear least squares problem (3.1) together with the unit norm constraint can be easily solved by the SVD technique, because the unit norm constraint is implicitly enforced. With the Plücker constraint, which is non-linear, the 3D line estimation problem needs to be solved iteratively using a linearized estimation model. Recently, Barreto et al. [53] presented a linear 3D line estimation for a medical endoscope without imposing the Plücker constraint. A 3D line is triangulated by fitting the line to a set of 3D points. Although the Plücker constraint is not required, this method is not practical for a traditional camera because it requires points in 3-space, which are not available if they are not measured prior to estimation. Hartley and Zisserman [1, page 323], on the other hand, use a set of back-projection planes and estimate the 3D line from the subspace spanned by the two eigenvectors with the largest eigenvalues. This work is motivated by both [53] and [1, page 323] in that a 3D line is estimated from the duals of the back-projection planes, which are points in the dual space. That is, the dual line is fitted to

the duals of the back-projection planes in the dual projective space. The analysis of the estimation model shows that the Plücker constraint is not required during the estimation. An advantage of our 3D line estimation over [53] is that the proposed method is more practical for central and non-central projection cameras, because no 3D points are required and the cost function is defined in object space. The useful property of the dual projective space for deriving the proposed 3D line estimation model is the preservation of the incidence between geometric entities. By the duality principle, the dual elements of the plane $\Pi$ and the line $L$ are the point $\Pi^*$ and the line $L^*$ such that $L^* = [L_o^\top\ L_h^\top]^\top$. Let there be two planes in 3D projective space, i.e., $\Pi_a$ and $\Pi_b$, intersecting at a line $L$, see Figure 3.3. Then, in the dual projective space, the dual line $L^*$ is incident with the duals of the planes $\Pi_a$ and $\Pi_b$, see Figure 3.3. With this observation, a 3D line $L$ can be estimated by finding the dual $L^*$ minimizing the distance from the duals of the planes, which are points in the dual space.

Figure 3.3: The dual projective space. Given two planes intersecting at a line $L$, the dual elements of these two planes are incident with the dual line $L^*$.

The first step in the estimation process is to back-project the line matches to object space. For the pinhole camera, let the projection matrix of camera i be $P_i$ and the image of line $L$ in view i be $l_i$, with $i = 1, \ldots, N$. The back-projection plane containing the projection center and the image of line $L$ in view i is computed by:

$$\Pi_i = P_i^\top l_i = [\,\bar{\Pi}_i^\top \ \ S_i\,]^\top. \qquad (3.3)$$

Geometrically, the back-projection planes must intersect at a single line in object space. However, in practice, those back-projection planes cannot intersect at a single line in space due to errors in the camera motions and image lines. In terms of the dual projective space, the duals of the back-projection planes are not collinear. To estimate a 3D line from the back-projection planes, the dual line minimizing the orthogonal distance from the duals of the back-projection planes is suggested. By way of explanation, the sum of the squared orthogonal distances from the dual $L^* = [L_o^\top\ L_h^\top]^\top$ to all duals $\Pi_i^*$ is used to form the cost function for estimation, see (2.4):

$$\sum_{i=1}^{N} d^2(L^*, \Pi_i^*) = \sum_{i=1}^{N} \left\| \frac{\left[\,[\bar{\Pi}_i]_\times \ \ S_i I_3\,\right] L^*}{S_i \|L_h\|} \right\|^2, \qquad (3.4)$$

or more compactly,

$$E(L^*, \{\Pi_i\}) = \left\| A_{3N\times 6}\, L^* \right\|_2^2, \qquad (3.5)$$

where

$$A = \begin{bmatrix} \vdots \\ \frac{1}{w_i}\left[\,[\bar{\Pi}_i]_\times \ \ S_i I_3\,\right] \\ \vdots \end{bmatrix}, \qquad w_i = S_i \|L_h\|. \qquad (3.6)$$

This is a non-linear cost function because of the weighting term $w_i$. By setting the weight to a constant, the dual $L^*$ satisfying the Plücker constraint can be estimated from model (3.6), without imposing the constraint, by using the SVD method.

The estimation model in (3.5) is a homogeneous system. Its non-trivial solution can be obtained by using the SVD technique. Consider the SVD of the design matrix A:

$$A = USV^\top, \qquad (3.7)$$

where V contains the eigenvectors of $A^\top A$. The non-trivial solution of the estimation model in (3.5) is the column of V corresponding to the smallest eigenvalue of $A^\top A$. That this SVD solution satisfies the Plücker constraint can easily be shown by using the fact that the matrix U is orthonormal and V contains the eigenvectors of $A^\top A$; we have:

$$A^\top A V = \left(\sum_i \frac{1}{w_i^2}\begin{bmatrix} -[\bar{\Pi}_i]_\times^2 & -S_i[\bar{\Pi}_i]_\times \\ S_i[\bar{\Pi}_i]_\times & S_i^2 I_3 \end{bmatrix}\right) V = V S^2. \qquad (3.8)$$

Following the property of the vector cross product that the cross product of two vectors produces a vector orthogonal to both of them, the upper and lower 3-vectors of each column of V are therefore orthogonal:

$$v_{u,i}^\top v_{l,i} = 0, \qquad i = 1, \ldots, 6, \qquad (3.9)$$

where $v_{u,i}$ and $v_{l,i}$ are the upper and lower 3-vectors of column i of the matrix V. This implies that the solution of model (3.5) satisfies the Plücker constraint. The dual $L^* = [L_o^\top\ L_h^\top]^\top$ can be iteratively estimated by re-weighting the model in each iteration. Note that the estimated 3D line is in the dual space; we need to convert it back to the projective space, i.e., $L = [L_h^\top\ L_o^\top]^\top$. The complete algorithm for estimating the dual line $L^*$ is shown in Algorithm 2. After the 3D lines are triangulated, they provide the initial solution for the bundle adjustment that refines the estimation. Be aware that the cost function (3.6) has a singularity at $S_i = 0$. This can happen when a back-projection plane contains the origin of the object space coordinate system. This situation can be avoided by selecting the object space coordinate system to be different from any camera coordinate system.
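To make this estimation step concrete before the pseudocode in Algorithm 2 below, a compact numerical sketch is given here (a sketch only, assuming numpy and pinhole projection matrices; the function and variable names are mine, not part of the text). It stacks the blocks of (3.6) into the design matrix, takes the right singular vector of the smallest singular value, and then iterates the re-weighting $w_i = S_i\|L_h\|$:

```python
import numpy as np

def skew(v):
    """Cross-product (skew-symmetric) matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def triangulate_line_dual(P_list, l_list, iters=5):
    """Fit the dual line L* = [L_o; L_h] to the duals of the back-projection planes.

    P_list: 3x4 projection matrices, l_list: matched image lines as 3-vectors.
    Returns the Plucker coordinates L = [L_h; L_o] (up to scale).
    """
    planes = [P.T @ l for P, l in zip(P_list, l_list)]    # Pi_i = P_i^T l_i, eq. (3.3)
    w = np.ones(len(planes))                              # unit weights for the first pass
    for _ in range(iters):
        rows = [np.hstack([skew(Pi[:3]), Pi[3] * np.eye(3)]) / wi
                for Pi, wi in zip(planes, w)]             # blocks of eq. (3.6)
        A = np.vstack(rows)                               # 3N x 6 design matrix
        L_star = np.linalg.svd(A)[2][-1]                  # smallest right singular vector
        L_o, L_h = L_star[:3], L_star[3:]
        # w_i = S_i ||L_h||; the absolute value keeps the weight positive.
        w = np.array([abs(Pi[3]) * np.linalg.norm(L_h) for Pi in planes])
    return np.hstack([L_h, L_o])                          # convert the dual back to L
```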

Data: A set of image lines $\{l_i\}$ and camera projection matrices $\{P_i\}$.
Result: Triangulated line $L$.
Generate the set of back-projection planes $\{P_i^\top l_i\} = \{[\,\bar{\Pi}_i^\top\ S_i\,]^\top\}$;
Set unit weights $w_i = 1$;
Form the design matrix $A$ in (3.6);
$L^* \leftarrow V(:,6)$ where $A = USV^\top$;
Choose the number of iterations MaxIter;
for iter $\leftarrow$ 1 to MaxIter do
  Compute the weights $w_i = S_i\|L_h\|$;
  Form the design matrix $A$ in (3.6);
  $L^* \leftarrow V(:,6)$ where $A = USV^\top$;
end
$L \leftarrow [L_h^\top\ L_o^\top]^\top$
Algorithm 2: 3D line triangulation by incidence in the dual space

In the projective space, the cost function (3.6) can be interpreted as the incidence between a plane and a line. Let a line $L$ be incident with a back-projection plane $P^\top l = \Pi = [\,\bar{\Pi}^\top\ S\,]^\top$, i.e., $L \subset \Pi$. The points $\mathbf{L}_x$, $\mathbf{L}_y$, $\mathbf{L}_z$ and $\mathbf{L}_w$ on the line $L$, see (2.2), must be incident with the plane $\Pi$:

$$l^\top P\,\mathbf{L}_x = l^\top P\,\mathbf{L}_y = l^\top P\,\mathbf{L}_z = l^\top P\,\mathbf{L}_w = 0. \qquad (3.10)$$

Alternatively, the incidence relations of these four points can be written by using the Plücker matrix given in (2.2) as:

$$\Pi^\top M = [\,0\ \ 0\ \ 0\ \ 0\,], \qquad (3.11)$$

where M is the Plücker matrix given in (2.2). After some algebraic manipulation, the above relation can be rewritten as:

$$\underbrace{\begin{bmatrix} S I_3 & [\bar{\Pi}]_\times \\ \bar{\Pi}^\top & 0_{1\times 3} \end{bmatrix}}_{G} \begin{bmatrix} L_h \\ L_o \end{bmatrix} = \mathbf{0}. \qquad (3.12)$$

It can be observed that the first three rows of the matrix G are the cost function (3.6) without the weighting term. A line $L^*$ minimizing the cost function (3.6) in the dual projective space is therefore an element of the null space of the first three rows of the matrix G. In general, it is sufficient to use only two of the four points above to check the incidence between the plane $\Pi$ and the line $L$. In other words, the rank of the matrix G is 2. Since $L^*$ lies in the null space of the first three rows of G, and G has only 2 degrees of freedom (rank 2), the line hence satisfies the relation (3.12).

Figure 3.4: The plane containing the origin of the object space coordinate system and the 3D line L intersects the back-projection plane at the 3D line. As a result, the cross product between the moment $L_o$ and the normal of the back-projection plane is parallel to the line direction $L_h$.

In terms of geometry, the direction of the line formed by the intersection of two planes is the cross product of the planes' normals. As discussed in Chapter 2, the moment $L_o$ of the 3D line L is the normal of the plane containing the origin of the object space coordinate

system and the 3D line; hence this plane intersects each back-projection plane at the line L, see Figure 3.4. As a result, the cross product between $L_o$ and the normal of each back-projection plane must be parallel to the direction vector $L_h$:

$$L_h \parallel [\bar{\Pi}_i]_\times L_o, \qquad i = 1, \ldots, N, \qquad (3.13)$$

or

$$\lambda L_h = [\bar{\Pi}_i]_\times L_o \qquad (\lambda \in \mathbb{R}), \qquad (3.14)$$

which is equivalent to the proposed cost function (3.6).

3.4 3D line triangulation by averaging approach

Let there be a 3D line L imaged by N cameras. The corresponding back-projection planes intersect at a 3D line. The intersection lines from different combinations of the back-projection planes therefore form a set of non-intersecting candidate 3D lines because of errors in both the image lines and the camera matrices. For instance, the intersection lines of back-projection planes from all stereo pairs are not unique, see Figure 3.2a. We conjecture that the 3D line can be estimated by averaging the candidate 3D lines. Intuitively, the mean of samples $\{L_i\}$, where $i = 1, \ldots, M$, can be estimated by the barycentric mean:

$$\bar{L} = \operatorname*{argmin}_{L} \sum_{i=1}^{M} \|L_i - L\|^2 \qquad (3.15)$$

$$\bar{L} = \frac{1}{M}\sum_{i=1}^{M} L_i. \qquad (3.16)$$

The barycentric mean is the best linear unbiased estimator; it gives a point in a vector space minimizing the sum of Euclidean distances to the sampled points. The barycentric mean cannot provide a valid mean for a set of discrete points that does not lie on a vector space but on a non-linear manifold.

Figure 3.5: Given a set of points in $\mathbb{R}^2$ that all lie on a 1-D manifold, i.e., a circular arc, the average of these points computed by the barycentric mean does not lie on the circular arc.

Namely, the mean may not be on the manifold of the data. For example, suppose a set of data points in $\mathbb{R}^2$ is given and these points all live on a non-linear 1-D manifold, e.g., a circular arc, see Figure 3.5. By using the barycentric mean to estimate the mean of these points, their mean may not lie on the manifold. By representing lines by their Plücker coordinates, using the barycentric mean to find the average of a set of 3D lines is not guaranteed to give a valid line, due to the fact that the space of 3D lines is not a vector space but a non-linear manifold. The space of 3D lines is a non-linear manifold because of the Plücker constraint, as discussed in Chapter 2. This limitation can be circumvented by enforcing the Plücker constraint on the barycentric mean based estimation:

$$\bar{L} = \operatorname*{argmin}_{L} \sum_{i=1}^{M} \|L_i - L\|^2 \quad \text{subject to } \zeta(L) = L^\top D L = 0 \text{ and } \|L\| = 1. \qquad (3.17)$$

The model (3.17) with this constraint is a linear least squares model with non-linear constraints. The solution of this model is a point on the Klein quadric with minimal total Euclidean distance to the sample points on the manifold. An alternative choice for computing the 3D line from candidates is to estimate the mean on the manifold using the Karcher mean [54, 55]. The Karcher mean is the generalized average of a set of discrete points in a space other than the vector space $\mathbb{R}^n$ and is defined as a local minimizer of the cost function:

$$\frac{1}{M}\sum_{i=1}^{M} d^2_{\mathrm{geodesic}}(L_i, L), \qquad (3.18)$$

which is a least squares problem with a Riemannian metric. Compared to the barycentric mean with the quadric constraint, the Karcher mean works directly on the manifold and does not require any additional constraints. Due to this property, we use the Karcher mean to compute the 3D line as the average of a set of candidate 3D lines. The metric between two 3D lines on the line manifold, however, is not well-defined. In order to compute the Karcher mean of a set of 3D lines, we resort to the aforementioned orthonormal 3D line representation $SO(3)\times SO(2)$, which lets us compute the geodesic on a Riemannian manifold. Hence, we proceed by computing the Karcher mean as averaging in the groups SO(3) and SO(2).

3.4.1 Effect of scaling on 3D line representation

Let a line L be represented in Plücker coordinates as in (2.1). To compute the average, we have to convert the Plücker coordinates to the orthonormal representation. Let us consider the sign scaling of the line L, i.e., $L \rightarrow -L$. Although both $L$ and $-L$ represent the same 3D line, their orthonormal representations are not the same, which makes the metric between the two orthonormal representations non-zero. For example, the metric between $L$

and $-L$ has to be zero. However, the metric between their orthonormal representations does not vanish. Therefore, the sign scaling has to be eliminated prior to the mean computation. To analyze the effect of scaling, let L be scaled by a positive value, i.e., $aL$ where $a > 0$. The orthonormal representation of $aL$ given in (2.5) is identical to that of L. The invariance to positive scaling can easily be demonstrated by first considering the orthonormal representation of the line $L = [\,L_h^\top\ L_o^\top\,]^\top \leftrightarrow (S, T)$, see (2.5), scaled by a positive real constant a, i.e., $aL \leftrightarrow (S_+, T_+)$. The orthonormal representation of $aL$ is then:

$$S_+ = \begin{bmatrix} \dfrac{aL_h}{\|aL_h\|} & \dfrac{aL_o}{\|aL_o\|} & \dfrac{a^2[L_h]_\times L_o}{\|a^2[L_h]_\times L_o\|} \end{bmatrix} \qquad (3.19)$$

$$\phantom{S_+} = \begin{bmatrix} \dfrac{L_h}{\|L_h\|} & \dfrac{L_o}{\|L_o\|} & \dfrac{[L_h]_\times L_o}{\|[L_h]_\times L_o\|} \end{bmatrix} \qquad (3.20)$$

$$\phantom{S_+} = S, \qquad (3.21)$$

$$T_+ = \frac{1}{\|aL\|}\begin{bmatrix} \|aL_h\| & -\|aL_o\| \\ \|aL_o\| & \|aL_h\| \end{bmatrix} \qquad (3.22)$$

$$\phantom{T_+} = \frac{1}{\|L\|}\begin{bmatrix} \|L_h\| & -\|L_o\| \\ \|L_o\| & \|L_h\| \end{bmatrix} \qquad (3.23)$$

$$\phantom{T_+} = T. \qquad (3.24)$$

The orthonormal representation is hence proved to be invariant to positive scaling of the Plücker coordinates. In contrast, negative scaling does change the orthonormal representation of the line L. This observation can be demonstrated by considering the negative scaling of the line L, i.e., $bL \leftrightarrow (S_-, T_-)$ where $b < 0$. The

orthonormal representation of $bL$ is then:

$$S_- = \begin{bmatrix} \dfrac{bL_h}{\|bL_h\|} & \dfrac{bL_o}{\|bL_o\|} & \dfrac{b^2[L_h]_\times L_o}{\|b^2[L_h]_\times L_o\|} \end{bmatrix} \qquad (3.25)$$

$$\phantom{S_-} = \begin{bmatrix} -\dfrac{L_h}{\|L_h\|} & -\dfrac{L_o}{\|L_o\|} & \dfrac{[L_h]_\times L_o}{\|[L_h]_\times L_o\|} \end{bmatrix} \qquad (3.26)$$

$$\phantom{S_-} \ne S, \qquad (3.27)$$

$$T_- = \frac{1}{\|bL\|}\begin{bmatrix} \|bL_h\| & -\|bL_o\| \\ \|bL_o\| & \|bL_h\| \end{bmatrix} \qquad (3.28)$$

$$\phantom{T_-} = \frac{1}{\|L\|}\begin{bmatrix} \|L_h\| & -\|L_o\| \\ \|L_o\| & \|L_h\| \end{bmatrix} \qquad (3.29)$$

$$\phantom{T_-} = T. \qquad (3.30)$$

Consequently, negative scaling of a Plücker vector makes the line averaging unstable, unless a suitable pre-processing is applied. In order to eliminate this problem, we employ the fact that a negative scaling of the Plücker coordinates flips the direction-vector $L_h$ in (2.1) to the opposite direction. A simple method to fix this problem is to scale the direction-vectors of all 3D lines into the same predefined half-space of $\mathbb{R}^3$. In order to facilitate this, let n be the normal vector of a plane containing the origin and separating $\mathbb{R}^3$ into two sub-spaces. The direction-vectors of the lines in the set $\{L_i\}$ can be sign-scaled to the same half-space defined by n. This is achieved by scaling $L_i$ with $\phi(n^\top L_h^i)$:

$$L_i \leftarrow \phi\!\left(n^\top L_h^i\right) L_i, \qquad (3.31)$$

where $L_h^i$ is the direction-vector of the line $L_i$ and $\phi$ is the function:

$$\phi(x) = \begin{cases} 1 & \text{when } x \ge 0 \\ -1 & \text{otherwise.} \end{cases} \qquad (3.32)$$

To find the normal vector n, let us define a set of line direction vectors:

$$\mathcal{L} = \{L_h^i, -L_h^i\}_{i=1,\ldots,M}. \qquad (3.33)$$

Thus, an intuitive choice for the normal vector n is the eigenvector corresponding to the largest eigenvalue of the scatter matrix of $\mathcal{L}$. Geometrically, this eigenvector is the normal vector of the plane passing through the origin and maximizing the sum of distances from the plane to the set of line directions $\mathcal{L}$.

3.4.2 Generating 3D line samples

As a reminder, the basic concept of the averaging based 3D line estimation is to find the mean of the 3D line samples. Each sample indicates a possible location of the correctly estimated line in space. Let there be N images of the estimated line. In the proposed method, a line in the sample set is constructed by the intersection of k back-projection planes from a subset of the N images ($k \le N$). To compute the intersection of the back-projection planes, Hartley and Zisserman's line estimation method [1, page 323] is adopted due to its simplicity. This method hypothesizes that the back-projections of a line match intersect at the 3D line estimate. Since a 3D line can be represented by two planes that intersect at the line, the goal is to determine those two planes from the set of back-projection planes:

$$A = [\,\Pi_1, \ldots, \Pi_i, \ldots, \Pi_k\,], \qquad (3.34)$$

where $\Pi_i$ is a 4-vector representing a back-projection plane. Those two planes can be estimated by using two basis vectors spanning the best rank-2 approximation of A, which can be computed by the SVD technique. Let the SVD of A be $A = UDV^\top$. Those two planes are then the columns of U corresponding to the two largest singular values. Let those two planes be denoted by $\Pi'$ and $\Pi''$. The estimated line can be computed by the intersection of those two planes using (2.15).
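A sketch of this sampling step is given below, assuming numpy; the plane convention used is $[\,n^\top\ d\,]^\top$ with $n^\top X + d = 0$, and the returned Plücker vector uses the moment $L_o = X \times L_h$, so its overall sign may differ from the convention of (2.15), which the sign normalization of Section 3.4.1 absorbs:

```python
import numpy as np
from itertools import combinations

def line_sample_from_planes(planes):
    """One 3D line sample from k >= 2 back-projection planes (4-vectors), via the
    rank-2 approximation of the stacked planes [1, page 323]."""
    A = np.column_stack(planes)              # 4 x k, planes as columns, eq. (3.34)
    U = np.linalg.svd(A)[0]
    P1, P2 = U[:, 0], U[:, 1]                # the two planes spanning the rank-2 fit
    n1, d1 = P1[:3], P1[3]
    n2, d2 = P2[:3], P2[3]
    L_h = np.cross(n1, n2)                   # direction of the plane intersection
    # A point X on the line solves n1.X = -d1 and n2.X = -d2 (min-norm solution).
    X = np.linalg.lstsq(np.vstack([n1, n2]), np.array([-d1, -d2]), rcond=None)[0]
    L_o = np.cross(X, L_h)                   # moment of the line
    return np.hstack([L_h, L_o])

def generate_line_samples(planes, k):
    """All C(N, k) intersection-line samples from subsets of k back-projection planes."""
    return [line_sample_from_planes([planes[i] for i in idx])
            for idx in combinations(range(len(planes)), k)]
```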

3D line samples are then generated by estimating a line using k images, where $k \le N$. That is, all combinations of k images are used to generate line samples. Therefore, the set of 3D line samples is:

$$\Gamma = \bigcup_{k \le N} \Gamma_k, \qquad (3.35)$$

where $\Gamma_k$ is the set of lines estimated from all k-combinations of the N views. Therefore, the cardinality of $\Gamma_k$ is $C^N_k$. Since, in practice, the number of images N can be large, generating line samples starting from a small number of views (small k) can lead to huge memory requirements and slow computation; that is, the cardinality of $\Gamma$, $n(\Gamma)$, becomes large. Moreover, using a small number of views does not provide enough constraints on the location of the 3D lines. For example, suppose we start to generate line samples from 2 views. The cardinality of $\Gamma$ can then be calculated by:

$$n(\Gamma) = \sum_{k=2}^{N} n(\Gamma_k) \qquad (3.36)$$

$$\phantom{n(\Gamma)} = \sum_{k=2}^{N} C^N_k \qquad (3.37)$$

$$\phantom{n(\Gamma)} = 2^N - N - 1. \qquad (3.38)$$

This suggests that the line samples should be generated starting from a sufficiently large k ($k \le N$).

3.4.3 3D line averaging by averaging in special orthogonal groups

Given a set of 3D lines $L_i \leftrightarrow (S_i, T_i)$, where $i = 1, 2, \ldots, M$, the average of the set of 3D lines $\hat{L} \leftrightarrow (\hat{S}, \hat{T})$ can be computed by estimating the mean $\hat{S}$ of $\{S_i\}$ on SO(3) and $\hat{T}$ of $\{T_i\}$ on SO(2) using numerous methods [56, 57, 58]. In this work, we adopt the non-linear mean shift over analytical manifolds proposed by Subbarao and Meer [21], because the mean shift computation is based on a weighted average in a local area (kernel

profile). Moreover, the mean shift algorithm is easy to compute and robust. Following the model presented in [21], I will briefly introduce the non-linear mean shift over SO(n). The non-linear mean shift was proposed for computing the mean of a set of points on an analytical manifold on which a metric is defined. The weighted sum of points on the manifold is not well defined because the weighted sum may not be a point on the manifold; thus, the mean shift vector may not be valid. The basic concept of the non-linear mean shift is to map points on the manifold to the tangent space, which is locally well defined at a point on the manifold and has a Euclidean structure. The mapped points on the tangent space are used to compute the mean shift vector, which is then mapped back to the manifold. For the manifold of a rotation group, the mapping from the manifold to its associated tangent space so(n) is the logarithm operator. In particular, the tangent space so(3) of the 3D rotation group SO(3) is the set of $3\times3$ skew-symmetric matrices:

$$\log(R) = \Omega = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}, \qquad (3.39)$$

where $R \in SO(3)$. The logarithm mapping of R can be computed by:

$$\log(R) = \frac{\theta}{2\sin(\theta)}\left(R - R^\top\right), \qquad (3.40)$$

where $\operatorname{trace}(R) = 1 + 2\cos(\theta)$. Note that this method fails if $\theta = \pi$. The inverse of the logarithm operator, which maps a point from so(3) back to SO(3), is the exponential mapping:

$$\exp(\Omega) = I_3 + \frac{\sin\|\omega\|}{\|\omega\|}\,\Omega + \frac{1 - \cos\|\omega\|}{\|\omega\|^2}\,\Omega^2, \qquad (3.41)$$

which is also known as Rodrigues' rotation formula, where $\omega = [\omega_x\ \omega_y\ \omega_z]^\top$ is the vector in so(3) associated with $\Omega$. For the 2D rotation group SO(2), the computation of the logarithm and exponential mappings is straightforward. Let R be a 2D rotation matrix with rotation angle

α. The logarithm and exponential mappings for the 2D rotation group SO(2) are then:

$$\log\!\left(\begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}\right) = \begin{bmatrix} 0 & -\alpha \\ \alpha & 0 \end{bmatrix}, \qquad (3.42)$$

$$\exp\!\left(\begin{bmatrix} 0 & -\alpha \\ \alpha & 0 \end{bmatrix}\right) = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}. \qquad (3.43)$$

The logarithm and exponential mappings for the rotation groups SO(3) and SO(2) presented above are used in the non-linear mean shift computation. To estimate the starting point for the non-linear mean shift averaging, the linear rotation averaging scheme proposed by Gramkow [57] is adopted. Let $S_1, S_2 \in SO(3)$. By the closure axiom, $S_1^\top S_2$ is also an element of SO(3), with eigenvalues $(1, e^{i\theta}, e^{-i\theta})$, where the angle θ is referred to as the angular distance between $S_1$ and $S_2$ and is the geodesic distance between two elements of SO(3):

$$d_{\mathrm{geodesic}}(S_1, S_2) = \theta(S_1^\top S_2). \qquad (3.44)$$

Since the eigenvalues of $S_1^\top S_2$ are 1, $e^{i\theta}$ and $e^{-i\theta}$, the trace of $S_1^\top S_2$ is:

$$\operatorname{trace}(S_1^\top S_2) = 1 + e^{i\theta} + e^{-i\theta} \qquad (3.45)$$

$$\phantom{\operatorname{trace}(S_1^\top S_2)} = 1 + 2\cos(\theta). \qquad (3.46)$$

By using the second order Taylor expansion of the cosine without the remainder term, one has

$$\cos(\theta) \approx 1 - \frac{\theta^2}{2}, \qquad (3.47)$$

and after some algebraic manipulation, we have:

$$\theta^2(S_1^\top S_2) \approx 3 - \operatorname{trace}(S_1^\top S_2). \qquad (3.48)$$

Using (3.48) and extending this to all $S_i$, the Karcher mean $\hat{S}$ can be computed by:

$$\hat{S} = \operatorname*{argmin}_{S\in SO(3)} \sum_{i=1}^{M} \theta^2(S^\top S_i) \qquad (3.49)$$

$$\phantom{\hat{S}} = \operatorname*{argmax}_{S\in SO(3)} \operatorname{trace}\!\left(S^\top \sum_{i=1}^{M} S_i\right). \qquad (3.50)$$

A compact solution for the mean is given using the singular value decomposition of $\sum_{i=1}^{M} S_i$:

$$\sum_{i=1}^{M} S_i = U_S D_S V_S^\top; \qquad (3.51)$$

the mean of $\{S_i\}$ is then:

$$\hat{S} = U_S\, \operatorname{diag}\!\left(1, 1, \det(U_S)\det(V_S)\right) V_S^\top. \qquad (3.52)$$

Following the same steps, the Karcher mean of $\{T_i\}$ on SO(2) can be estimated by:

$$\hat{T} = U_T\, \operatorname{diag}\!\left(1, \det(U_T)\det(V_T)\right) V_T^\top, \qquad (3.53)$$

where $U_T D_T V_T^\top$ is the singular value decomposition of $\sum_{i=1}^{M} T_i$. Using both (3.52) and (3.53), the initial solution for the line averaging is then given by:

$$\hat{L} \leftrightarrow (\hat{S}, \hat{T}) \in SO(3)\times SO(2). \qquad (3.54)$$

The important formulas and the estimation of the initial solution for the non-linear mean shift for 3D line averaging have now been discussed. Following the non-linear mean shift over analytical manifolds presented in [21], as discussed in Chapter 2, $\{S_i\} \subset SO(3)$ and $\{T_i\} \subset SO(2)$ are mapped to the tangent spaces so(3) and so(2), respectively, by the logarithm mapping. The mean shift vector is computed on the tangent space using the normal kernel profile $k_N(s) = e^{-\frac{1}{2}s}$ and then mapped back to the manifolds SO(3) and SO(2) by the exponential mapping. Algorithm 3 summarizes the steps for estimating $\hat{L}$.

Data: A set of lines $\{L_i\}$ where $i = 1, \ldots, M$.
Result: The mean $\hat{L} \leftrightarrow (\hat{S}, \hat{T})$ of the lines $\{L_i\}$.
$S \leftarrow 0_{3\times3}$, $T \leftarrow 0_{2\times2}$;
Define the normal vector n of a plane, e.g., the direction vector of a line in $\{L_i\}$;
for $i \leftarrow 1$ to M do
  $L_h^i \leftarrow$ direction vector of line $L_i$;
  $L_i \leftarrow \phi(n^\top L_h^i)\, L_i$;
  Represent line $L_i$ by $(S_i, T_i)$, see (2.7) and (2.9);
  $S \leftarrow S + S_i$, $T \leftarrow T + T_i$;
end
Compute the SVDs $S = UDV^\top$ and $T = U'D'V'^\top$;
$\hat{S} \leftarrow U\,\operatorname{diag}(1, 1, \det(U)\det(V))\,V^\top$;
$\hat{T} \leftarrow U'\,\operatorname{diag}(1, \det(U')\det(V'))\,V'^\top$;
Choose the number of iterations MaxIter;
Choose the kernel bandwidth h;
for iter $\leftarrow 1$ to MaxIter do
  $m_h(S) \leftarrow \dfrac{\sum_{i=1}^{M} \log(\hat{S}^\top S_i)\, g_N\!\left(\|\log(\hat{S}^\top S_i)\|^2 / h^2\right)}{\sum_{i=1}^{M} g_N\!\left(\|\log(\hat{S}^\top S_i)\|^2 / h^2\right)}$;
  $m_h(T) \leftarrow \dfrac{\sum_{i=1}^{M} \log(\hat{T}^\top T_i)\, g_N\!\left(\|\log(\hat{T}^\top T_i)\|^2 / h^2\right)}{\sum_{i=1}^{M} g_N\!\left(\|\log(\hat{T}^\top T_i)\|^2 / h^2\right)}$;
  $\hat{S} \leftarrow \hat{S}\exp(m_h(S))$;
  $\hat{T} \leftarrow \hat{T}\exp(m_h(T))$;
end
$\hat{L} \leftrightarrow (\hat{S}, \hat{T})$;
Algorithm 3: Algorithm for 3D line averaging. The function $g_N$ is defined as $g_N(s) = -k_N'(s)$, where $k_N$ is the normal kernel profile.
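The group-specific primitives used in Algorithm 3 — the logarithm and exponential maps of (3.39)–(3.43) and the SVD-based initial means of (3.52)–(3.53) — can be sketched as follows (a sketch only, assuming numpy; the mean-shift loop itself would add the kernel weighting of Algorithm 3 on top of these functions):

```python
import numpy as np

def log_so3(R):
    """Logarithm map SO(3) -> so(3), eq. (3.40); assumes the rotation angle is not pi."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return np.zeros((3, 3))
    return theta / (2.0 * np.sin(theta)) * (R - R.T)

def exp_so3(Omega):
    """Exponential map so(3) -> SO(3) (Rodrigues' formula), eq. (3.41)."""
    w = np.array([Omega[2, 1], Omega[0, 2], Omega[1, 0]])
    t = np.linalg.norm(w)
    if np.isclose(t, 0.0):
        return np.eye(3)
    return np.eye(3) + np.sin(t) / t * Omega + (1.0 - np.cos(t)) / t**2 * Omega @ Omega

def initial_mean_so3(S_list):
    """Chordal initialization of the Karcher mean on SO(3), eqs. (3.51)-(3.52)."""
    U, _, Vt = np.linalg.svd(sum(S_list))
    return U @ np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)]) @ Vt

def initial_mean_so2(T_list):
    """The analogous projection onto SO(2), eq. (3.53)."""
    U, _, Vt = np.linalg.svd(sum(T_list))
    return U @ np.diag([1.0, np.linalg.det(U) * np.linalg.det(Vt)]) @ Vt
```

With these in place, one mean-shift iteration maps each $S_i$ into the tangent space at the current estimate via log_so3, forms the kernel-weighted average there, and maps the result back with exp_so3, exactly as in the loop of Algorithm 3.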

Chapter 4: Line based Bundle Adjustment

4.1 Literature review

Bundle adjustment is a powerful technique for recovering 3D scene geometry and camera motions given images of the scene from different viewpoints. It was first developed in the photogrammetry community during the 1950s and became a standard technique in other fields, including computer vision and robotics. Considering the bundle of all rays between each 3D point and the optical centers viewing that point, the 3D points and the camera motions are simultaneously adjusted by minimizing a cost function. Since the bundle of all rays is adjusted, the procedure is referred to as bundle adjustment. The cost function used in bundle adjustment is usually non-linear. Hence, an optimal solution is solved for iteratively by numerical computation techniques, e.g., the Levenberg-Marquardt [59] or Dog Leg algorithm [60]. Since the iterative computation requires a starting estimate, this initial solution can be obtained from linear approaches, or by first solving for the motion linearly and then estimating the geometric entities, which are then used as the initial solution. Therefore, bundle adjustment is usually used as the last step of 3D scene geometry and camera motion recovery, refining an initial solution. The optimal solution of the bundle adjustment process is a local minimizer of the cost function; hence, the final solution depends on the initial solution.

Because of its versatility, a wide range of data can be used in bundle adjustment, such as points, lines, planes and space curves. The most basic and most commonly used features in bundle adjustment are point features [3, 12, 61, 4, 62, 63, 64, 65]. In photogrammetry, the collinearity equation for point features is usually exploited and has been extensively used by many researchers for more than two decades. Assuming a central projection camera, the collinearity equation describes the geometric relationship between the coordinates of a point in the object space and the coordinates of its projection on the image plane. That is, a 3D point and its image lie on a straight line, called the projection ray, passing through the camera's optical center. In computer vision, this concept is reformulated in the language of projective geometry. As a result, the projection of a 3D point to the image plane becomes a linear function in homogeneous coordinates. In order to estimate camera motion and scene geometry using point features, a set of point matches is required [66, 67, 11]. In terms of the cost function, researchers collectively adopt cost functions that minimize the symmetric transfer error [68], the re-projection error [66] and the Sampson error [69], which are computed in the 2D image space. However, cost functions can also be defined in object space [70]. In this dissertation, in contrast to points, I use the alternative but less exploited line features. An advantage of using line features over points is that line features are less prone to occlusion. In addition, since line features show discontinuities of the grey value function in one direction while point features do not, line features are easier to extract with sub-pixel accuracy [71]. Moreover, line matching is conjectured to be easier than point matching in many cases [72]. An image line can be represented by either a straight edge or a raw edge. As a straight edge, an image line can be represented by two points on the line, or by a point and the line direction.

Alternatively, it can also be represented by a line equation, e.g., the Hessian form, which can be obtained by fitting a line to the raw edge or by using the Hough transform [73]. For the raw edge representation, an image line feature is represented as a sequence of points tracing the edge. An advantage of the straight edge representation over the raw edge is the memory usage, because the raw edge representation has to store all of the points on the edge. However, representing an image feature by a straight edge may not be applicable in some cases. For example, an image obtained from a linear array scanner consists of strip images in digital form. Representing an image line feature as a straight edge is then not suitable, because points on the image line lie on different strip images having different viewpoints. To the best of my knowledge, a 3D line is always represented as an infinite straight line or by two points on the line. By representing an image line by a raw edge, a cost function for bundle adjustment is defined as a geometric relation between a point on the raw edge and the 3D line, see Figure 4.1. One approach using raw edges for line based bundle adjustment is the collinearity approach. Schenk [74] extended the collinearity equation to raw edges. A 3D line is represented by minimal parameters, i.e., 4 independent parameters, and any point on the 3D line can be parameterized by one additional parameter. That is, any point on a 3D line can be represented by 5 parameters. The collinearity equation between a point on the raw edge and an unknown point on the 3D line can then be formed. A similar method for uncalibrated cameras was proposed by Bartoli [75], where 3D lines are represented by two points. The obtained solution is then the projective equivalent one. A drawback of the collinearity approach is the increased number of parameters, because not only the camera motion and 3D line parameters but also the parameters used to specify each point on the 3D lines need to be solved for.

Figure 4.1: An image line feature can be represented by a raw edge, which is a sequence of points. A cost function for bundle adjustment is then defined as a geometric relation between an image point and its corresponding 3D line, e.g., the collinearity equation.

In contrast to the collinearity approach, the co-planarity approach uses the coplanarity between points on the raw edge and object space features. Mulawa and Mikhail [76], Karjalainen et al. [77] and Zielinski [78] represent a 3D line by its direction and a point on the line. A co-planarity constraint based cost function is then defined by using the fact that a point on the projection of the object line is co-planar with the plane defined by the optical center, the 3D line direction and a point on the 3D line. Habib et al. [79] represent a 3D line by two points. The mathematical model for estimation is formulated based on the co-planarity between an intermediate point along the image line (raw edge) and the plane formed by the perspective center and the end-points of the corresponding 3D line. Since a 3D line is represented by its two end-points, the estimated 3D line cannot be unique. This ambiguity can be fixed by enforcing the projection of the line end-points to be close to

the end-points of its corresponding image line, e.g., the longest image line segment. A similar idea was applied to the determination of camera motion and 3D scene geometry from airborne hyperspectral imagery in [80], where the ambiguity in the determination of the 3D line endpoints is eliminated by fixing one of the coordinates (x or y) to its initial value. In order to reduce the amount of memory used for storing image line features, an image line can be analytically represented by a line equation or by two points on the image line. Habib et al. [81] represent an image line by its line equation; hence the normal vector of the back-projection plane can be obtained directly from the line equation. The cost function is formulated based on the coplanarity between a point on the 3D line and the back-projection plane. Bartoli and Sturm [15] represent an image line segment by its two endpoints and a 3D line by its Plücker coordinates. With the Plücker coordinate representation, a 3D line can be directly projected to the image plane by the line projection matrix. The cost function is defined as the orthogonal distances between the image line endpoints and the projected lines. Similarly, Asai et al. [82] used the orthogonal distance between image line endpoints and the re-projected 3D line estimate to formulate a bundle adjustment model for road reconstruction. Alternatively, the integrated squared distance between the re-projected line and its corresponding image line segment is used as the cost function for bundle adjustment in [83]. In contrast to the aforementioned methods, which simultaneously optimize the motion and structure parameters, Tang et al. [84] use the orthogonal distances computed from the projections of points on a 3D line and the analytical image lines as the cost function, and split the optimization process into distinct steps. Strictly speaking, the camera motion and 3D structure parameters are optimized independently using a non-linear optimizer such as the Levenberg-Marquardt algorithm.

4.2 Overview of this chapter

In this chapter, I introduce a new geometric error in object space for bundle adjustment based on line features, along with the derivation of the cost function. The proposed object space cost function was inspired by the work of Taylor and Kriegman [83], in which the geometric error is defined by the integrated squared distance between the image line segment and the re-projected line. The proposed geometric cost function is derived based on the equivalence between the 2D projective space $P^2$, i.e., the image plane, and the Gaussian sphere $S^2$ centered at the optical center of the camera. To perform bundle adjustment, the camera matrix and the 3D line need to be parameterized. In classical optimization techniques, the parameter space is modeled as a Euclidean space [85, 86]. In contrast to the classical optimization techniques, a modern approach does not define the parameter space as a Euclidean space or vector space but as a non-linear manifold, as frequently occurs in constrained problems [87, 88, 89]. By using this concept, the geometric structure of the parameter space can be exploited and a numerical optimization technique can provide a solution satisfying the constraints at every iteration. Strictly speaking, constrained problems can be treated as unconstrained problems. Because of this favorable property, the manifold approach is adopted. In this chapter, the parameterization of the camera motion and the 3D line structure by non-linear manifolds will be discussed. The rest of this chapter is organized as follows. The derivation of the geometric error for bundle adjustment is first presented. The parameterization of the camera motion and 3D line structure for bundle adjustment by using non-linear manifolds is then discussed.

4.3 Geometric error in object space

The recovery of the 3D scene and camera pose is an inverse problem dealing with converting observed measurements into system parameters. The basic approach to solving the inverse problem is to find the model parameters minimizing a cost function. The cost function is an operator that either explicitly or implicitly describes the relationship between the observed data and the model parameters. In this work, the observed data are the image line segments and the model parameters are the camera motions and 3D lines. The proposed cost function is formulated as a geometric error in object space, which is derived based on the equivalence between the image plane and the Gaussian sphere centered at the optical center of the camera. With this equivalence, a line segment on the image plane is equivalent to a circular arc on the Gaussian sphere (see Chapter 2). Geometrically, if the camera motion and 3D line structures are correct, the circular arc must be coplanar with the projection plane of the 3D line corresponding to the line segment. For example, let the image of a 3D line L be the image line segment with the two endpoints $x_1$ and $x_2$. These two endpoints are represented by their homogeneous coordinates and are equivalent to the points $V_1$ and $V_2$, respectively, see Figure 4.2. With correct camera motions and estimated 3D lines, the sector $OV_1V_2$ should be coplanar with the projection plane of the line L. It is hence hypothesized that the correct camera motion and 3D line estimates minimize the integrated squared distance from the points on the boundary of the sectors to the projection planes. To derive the proposed geometric error, let us consider an image line segment defined by the two endpoints $x_1$ and $x_2$. Given calibrated cameras, the derivation of the object space cost function starts from mapping these endpoints $x_1$ and $x_2$ to the unit Gaussian sphere in object space using (2.25).

Figure 4.2: The proposed object space error is defined as the integrated squared distance between the projection plane and the points on the boundary of the sector formed by O, $V_1$ and $V_2$. Points $V_1$ and $V_2$ are obtained by mapping the endpoints $x_1$ and $x_2$ of an image line segment onto the Gaussian sphere's surface, where the line segment is the image of the 3D line L. The projection plane of the 3D line L has normal vector M in the camera coordinate system.

On the sphere, the endpoints become

$$V_1 = \frac{K^{-1}x_1}{\|K^{-1}x_1\|}, \qquad (4.1)$$

$$V_2 = \frac{K^{-1}x_2}{\|K^{-1}x_2\|}. \qquad (4.2)$$

These two points form a circular sector $OV_1V_2$, see Figure 4.2. In other words, the two endpoints of a line segment define a sector. With accurate camera and 3D line parameter estimates, it is hypothesized that the integrated squared (Euclidean) distance between the projection planes of the 3D line estimates and the points on the boundary of the sectors defined by the two endpoints of their corresponding image line segments is minimized. The boundary C of the sector $OV_1V_2$ is a piecewise smooth curve consisting of three curves: $C_1$, the straight line between the optical center and $V_1$; $C_2$, the circular arc between $V_1$ and $V_2$; and $C_3$, the straight line between the optical center and $V_2$. That is, $C = C_1 \cup C_2 \cup C_3$. A point V on the straight lines $C_1$ or $C_3$ can easily be parameterized by the distance r from the optical center O:

$$C_1 := \{V \mid V = rV_1,\ 0 \le r \le 1\}, \qquad (4.3)$$

$$C_3 := \{V \mid V = rV_2,\ 0 \le r \le 1\}. \qquad (4.4)$$

Points V on the circular arc between $V_1$ and $V_2$ can be generated by revolving the point $V_1$ about the normal vector N of the great circle, which is defined by:

$$N = \frac{[V_1]_\times V_2}{\|[V_1]_\times V_2\|}. \qquad (4.5)$$

By using Rodrigues' formula, we can then generate every point on the great circle by:

$$V(\theta) = V_1\cos(\theta) + ([N]_\times V_1)\sin(\theta) + N(N^\top V_1)(1 - \cos(\theta)), \qquad (4.6)$$

where θ is the rotation angle with $\theta \in [0, \cos^{-1}(V_1^\top V_2)]$, because we want to parameterize a point between $V_1$ and $V_2$ on the great circle. That is, a point on the circular arc

between $V_1$ and $V_2$ is parameterized by the rotation angle about the normal vector N in the right hand convention. Since the point $V_1$ is orthogonal to the plane normal vector N, the right-most term on the right side of (4.6) vanishes:

$$V(\theta) = V_1\cos(\theta) + ([N]_\times V_1)\sin(\theta) \in C_2. \qquad (4.7)$$

Given a plane $\Pi = [\,M^\top\ 0\,]^\top$ in the camera coordinate system, which is the projection plane containing the optical center and the 3D line corresponding to the line segment with endpoints $x_1$ and $x_2$, the shortest distance from a point V on the curve C to the plane $\Pi$ is simply:

$$d(\Pi, V) = \frac{V^\top M}{\|M\|}, \qquad d^2(\Pi, V) = \frac{M^\top V V^\top M}{\|M\|^2}. \qquad (4.8)$$

Hence the object space error is formulated as the integral of the squared distance of the points on the curve C to the projection plane with normal vector M:

$$\text{object space error} = g(M) = \frac{1}{\|M\|^2}\int_C M^\top V V^\top M\, ds \qquad (4.9)$$

$$\phantom{\text{object space error}} = \frac{1}{\|M\|^2}\, M^\top\!\left(\int_C V V^\top ds\right)\! M, \qquad (4.10)$$

where ds is an elementary arc length. Since the curve C is piecewise smooth, the integral (4.10) can be broken into the sum of three line integrals:

$$\int_C V V^\top ds = \int_{C_1} V V^\top dr + \int_{C_2} V V^\top d\theta + \int_{C_3} V V^\top dr. \qquad (4.11)$$

From (4.3) and (4.4), the line integrals on the curves $C_1$ and $C_3$ are simply:

$$\int_{C_1} V V^\top dr = \int_0^1 r^2\, V_1 V_1^\top dr = \frac{V_1 V_1^\top}{3}, \qquad (4.12)$$

$$\int_{C_3} V V^\top dr = \int_0^1 r^2\, V_2 V_2^\top dr = \frac{V_2 V_2^\top}{3}. \qquad (4.13)$$

With the point parameterization of the circular arc $C_2$ in (4.7), the line integral on $C_2$ can be computed by:

$$\begin{aligned}
\int_{C_2} V V^\top d\theta &= \int_0^{\bar\theta} \left(V_1\cos(\theta) + [N]_\times V_1\sin(\theta)\right)\left(V_1^\top\cos(\theta) - V_1^\top[N]_\times\sin(\theta)\right) d\theta \\
&= V_1 V_1^\top \int_0^{\bar\theta}\cos^2(\theta)\,d\theta - [N]_\times V_1 V_1^\top [N]_\times \int_0^{\bar\theta}\sin^2(\theta)\,d\theta \\
&\quad + \left([N]_\times V_1 V_1^\top - V_1 V_1^\top [N]_\times\right)\int_0^{\bar\theta}\cos(\theta)\sin(\theta)\,d\theta \\
&= V_1 V_1^\top\left(\frac{\bar\theta}{2} + \frac{\sin(2\bar\theta)}{4}\right) - [N]_\times V_1 V_1^\top [N]_\times\left(\frac{\bar\theta}{2} - \frac{\sin(2\bar\theta)}{4}\right) \\
&\quad + \left([N]_\times V_1 V_1^\top - V_1 V_1^\top [N]_\times\right)\frac{\sin^2(\bar\theta)}{2},
\end{aligned} \qquad (4.14)$$

where $\bar\theta = \cos^{-1}(V_1^\top V_2)$ is the opening angle of the arc. By substituting the line integrals (4.12)–(4.14) into (4.10), we obtain the closed form of the object space error of an image line. The normal vector M of the projection plane is obviously a function of the optical center and the 3D line parameters. By transforming the object space coordinate system to a camera coordinate system, the moment vector of the 3D line, see Figure 2.1, expressed in the camera coordinate system is the normal vector M of the projection plane $\Pi$. Let L be the Plücker coordinates of a 3D line in the object space coordinate system, and R and T the rotation matrix and translation vector of the camera. The Plücker coordinates of the line in the camera coordinate system are then (2.20):

$$L' = \begin{bmatrix} L'_h \\ L'_o \end{bmatrix} = \begin{bmatrix} R & 0 \\ -R[T]_\times & R \end{bmatrix} L. \qquad (4.15)$$

The normal vector M of the projection plane is then the moment $L'_o$ of the 3D line $L'$:

$$M = R\left[\, -[T]_\times \ \ I_3 \,\right] L, \qquad (4.16)$$

or alternatively, in terms of the orthonormal representation:

$$M = R\left[\, -[T]_\times \ \ I_3 \,\right]\begin{bmatrix} t_{11}\, s_1 \\ t_{21}\, s_2 \end{bmatrix}, \qquad (4.17)$$

where $L \leftrightarrow (S, T)$, see (2.10).

Figure 4.3: A 3D line under a change of coordinate system from object space to the camera coordinate system. The moment of the 3D line in the camera coordinate system, $L'_o$, becomes the normal of the projection plane.

4.4 Parameterization of camera motion and 3D line

In classical optimization techniques, the parameter space is usually a vector space. Namely, the Euclidean space $\mathbb{R}^n$ is used to model the space of parameters. Classical optimization techniques have long been developed based on the use of Euclidean space. The cost function, or objective function, is defined as a mapping from the Euclidean space to the real line $\mathbb{R}$. To solve an optimization problem using the vector space approach, the first step is to parameterize the estimated entities by a vector in the Euclidean space $\mathbb{R}^n$. If necessary, either linear or non-linear constraints are imposed to guarantee valid estimated entities. For example, a rotation matrix can be parameterized by a quaternion. The unit norm constraint then needs to be taken into the estimation model in order to obtain a quaternion that represents a valid rotation matrix.

A modern approach to solving an optimization problem is to model the parameter space as a non-linear manifold $\mathcal{M}$, especially a Riemannian manifold. The main characteristic of the Riemannian manifold is that the tangent space at each point on the manifold has a Euclidean structure. The concept of this approach is not to use parametric expressions of the estimated entities but expressions for adjusting the estimated entities. In other words, the elements of the parameter manifold are the estimated entities, not the parameters of parameterized entities. For example, a point on the parameter manifold of the rotation matrix estimation problem is a matrix, not the Euler angle or quaternion parameterization of the rotation matrix. Therefore, the solution from an optimization technique is used to directly update the matrix, not the parameters of a parameterized rotation matrix.

Figure 4.4: Local parameterization of the parameter manifold. Each point on the manifold has a Euclidean structure. The current estimate is updated locally such that the solution is still on the manifold and minimizes the cost function g.

The objective function is defined as a smooth real function $g: \mathcal{M} \to \mathbb{R}$. The goal of the optimization is to find an element ξ of the parameter manifold $\mathcal{M}$ minimizing g. Given a

starting point $\xi_0 \in \mathcal{M}$, the (local) minimizer of g can be found by iterative algorithms on the manifold $\mathcal{M}$, which can be visualized as iteratively walking on the manifold, see Figure 4.4. Namely, the sub-optimal solution is iteratively corrected to a locally optimal one. The step size and direction for the iterative computation are computed in the tangent space, which is a Euclidean space, at the current solution on the manifold $\mathcal{M}$, and then projected back to the manifold to adjust the estimated entities. In the rest of this section, parameterizations of the camera motion and 3D line in terms of non-linear manifolds are discussed.

4.4.1 Parameterization of camera motion

The camera motion consists of two entities, namely the translation vector (the position of the optical center) and the rotation matrix. The expression for updating the translation vector is discussed first. Let the translation vector from the last iteration be denoted by $T_0$ and $[T_x\ T_y\ T_z]^\top$ the updating term for the translation vector. The new translation vector T is then:

$$T = T_0 + \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}, \qquad (4.18)$$

because the manifold of the translation vector is linear. Taking the derivative of the translation vector with respect to its updating term, evaluated at the point $[T_x\ T_y\ T_z]^\top = [0\ 0\ 0]^\top$, leads to:

$$\frac{\partial T}{\partial T_x} = \begin{bmatrix}1\\0\\0\end{bmatrix}, \qquad \frac{\partial T}{\partial T_y} = \begin{bmatrix}0\\1\\0\end{bmatrix}, \qquad \frac{\partial T}{\partial T_z} = \begin{bmatrix}0\\0\\1\end{bmatrix}. \qquad (4.19)$$

For the camera's rotation matrix, there are various parameterizations, e.g., the unit quaternion, the SO(3) manifold, Euler angles, the Cayley-Klein parameterization or spinors. A rotation

matrix $R = [r_1\ r_2\ r_3]$ has 9 elements with 6 constraints:

$$\|r_i\| = 1 \ \ \text{where}\ i = 1, 2, 3, \qquad r_i^\top r_j = 0 \ \ \text{where}\ i \ne j. \qquad (4.20)$$

Thus, its minimal representation has 3 degrees of freedom. With the above constraints, the manifold of the rotation matrix is non-linear, and a point on the manifold is a matrix satisfying the above constraints. According to the concept of optimization on a manifold, the parametric expression of a rotation matrix does not need to be known, only the expression for updating the rotation matrix. Let the rotation matrix from the last iteration be denoted by $R_0$. Following the technique presented in [90], the rotation matrix R in the local region of $R_0$ is defined by:

$$R(\omega) = R_0\exp([\omega]_\times), \qquad (4.21)$$

where $\omega = [\omega_x\ \omega_y\ \omega_z]^\top$. Note that the vector ω is used for updating $R_0$, not as the parameters of a parameterized rotation matrix. It can be observed that the above equation is the first order change of the rotation matrix $R_0$. The term $\exp([\omega]_\times)$ is actually a rotation matrix, where $\omega/\|\omega\|$ and $\|\omega\|$ are the axis and angle of rotation of the axis-angle representation of a 3D rotation. The above equation can be interpreted as the update of $R_0$ by $\exp([\omega]_\times)$. In terms of the Riemannian manifold, the vector ω lies in the tangent space at $R_0$, and $\exp([\omega]_\times)$ is the projection from the tangent space back to the manifold SO(3). In other words, the rotation matrix is updated to a distinct point on the manifold, in a different way from parameterizations in a vector space, e.g., the Euler angle or quaternion parameterizations.

The derivative of the rotation matrix can be evaluated at $R_0$ with $\omega = 0$ as follows:

$$\frac{\partial R}{\partial \omega_x}\bigg|_{\omega=0} = R_0\begin{bmatrix} 0&0&0\\ 0&0&-1\\ 0&1&0\end{bmatrix} = \left[\, 0_{3\times1}\ \ \ r_{0,3}\ \ \ -r_{0,2}\,\right], \qquad (4.22\text{--}4.23)$$

$$\frac{\partial R}{\partial \omega_y}\bigg|_{\omega=0} = R_0\begin{bmatrix} 0&0&1\\ 0&0&0\\ -1&0&0\end{bmatrix} = \left[\, -r_{0,3}\ \ \ 0_{3\times1}\ \ \ r_{0,1}\,\right], \qquad (4.24\text{--}4.25)$$

$$\frac{\partial R}{\partial \omega_z}\bigg|_{\omega=0} = R_0\begin{bmatrix} 0&-1&0\\ 1&0&0\\ 0&0&0\end{bmatrix} = \left[\, r_{0,2}\ \ \ -r_{0,1}\ \ \ 0_{3\times1}\,\right], \qquad (4.26\text{--}4.27)$$

where $r_{0,i}$, $i = 1, 2, 3$, is column i of the rotation matrix $R_0$. Although this approach is not well known in computer vision and photogrammetry, where Euler angle and quaternion representations are popular, it is a standard approach in physics and robotics for optimization involving rotation. It can be observed that the updated rotation matrix always satisfies the six constraints of a rotation matrix. Moreover, it is simple and straightforward because the expression of the rotation matrix need not be known.

4.4.2 Parameterization of 3D line

There are various parameterizations for a 3D line in space. The line parameterization chosen for bundle adjustment should not make the line over-parameterized or have a

singularity. A simple representation is two points on the line. However, it makes solving for a solution difficult because the solution is not unique. Similarly, representing a line by a point on the line and the line direction does not have a unique solution. Although this obstruction can be fixed by using the point on the line closest to the origin, an orthogonality constraint then needs to be imposed in the estimation model and the complexity of the problem is increased. Although the Plücker coordinates are a good candidate, because uniqueness of the solution can be obtained through the Plücker and unit norm constraints, their manifold is not Riemannian. In this work, the orthonormal representation proposed in [15] is adopted because the Plücker and unit norm constraints are naturally satisfied. As discussed in Chapter 2, the concept of the orthonormal representation is to embed the Plücker coordinates into a 3D and a 2D rotation matrix. As a result, the Plücker coordinates represented by the orthonormal representation are parameterized by a (minimal) 4 parameters, which are the rotation angles of the 3D and 2D rotations. Moreover, the space of 3D lines can then be modeled by the manifolds of rotation matrices. By representing the line by 3D and 2D rotation matrices, 3D lines can be locally updated in the same way as rotation matrices, which preserves the orthogonality and unit norm constraints according to the properties of a rotation matrix. In order to demonstrate this process, let a line $L_0$ from the last iteration be represented by $(S_0, T_0) \in SO(3)\times SO(2)$. The new line $L \leftrightarrow (S, T)$ is then

S = S_0 exp([v]_×),   where v = [v_x v_y v_z]^T,   (4.29)

T = T_0 exp([0, -θ; θ, 0]) = T_0 [cos θ, -sin θ; sin θ, cos θ].   (4.30)

That is, S and T are locally defined in the neighborhoods of S_0 and T_0. At each iteration of the estimation process, 3D lines can be locally updated by the parameter vector [v_x v_y v_z θ]^T through the above equations. Since the orthonormal representation is based on 3D and 2D rotation matrices, the Jacobian matrix of the Plücker coordinates L = (t_11 s_1^T, t_21 s_2^T)^T can be evaluated with respect to these four parameters by using the derivatives of the 3D and 2D rotation matrices.
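The update (4.29)-(4.30) can be sketched in a few lines. The code below is illustrative Python/NumPy (the function names are introduced here), following the encoding L = (t_11 s_1, t_21 s_2) stated above; the key point is that the updated pair stays in SO(3) × SO(2), so the Plücker and unit-norm constraints hold by construction.

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def pluecker_from_orthonormal(S, T):
    """Plücker 6-vector (up to scale) encoded by S in SO(3) and T in SO(2),
       following L = (t11*s1, t21*s2) as stated above."""
    return np.hstack([T[0, 0] * S[:, 0], T[1, 0] * S[:, 1]])

def update_line(S0, T0, v, theta):
    """Local update of the orthonormal representation by (v, theta); the
       updated pair remains in SO(3) x SO(2), so the Plücker and unit-norm
       constraints remain satisfied by construction."""
    S = S0 @ expm(skew(v))
    c, s = np.cos(theta), np.sin(theta)
    T = T0 @ np.array([[c, -s], [s, c]])
    return S, T
```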

The Jacobian matrix of S with respect to v, evaluated at v = 0, is

∇_v S = ( ∂S/∂v_x |_{v=0}   ∂S/∂v_y |_{v=0}   ∂S/∂v_z |_{v=0} ),   (4.31)

where

∂S/∂v_x = S_0 [e_1]_× = [0_{3×1}  s_{0,3}  -s_{0,2}],   (4.32)-(4.33)
∂S/∂v_y = S_0 [e_2]_× = [-s_{0,3}  0_{3×1}  s_{0,1}],   (4.34)-(4.35)
∂S/∂v_z = S_0 [e_3]_× = [s_{0,2}  -s_{0,1}  0_{3×1}],   (4.36)-(4.38)

and s_{0,i}, i = 1, 2, 3, is column i of the rotation matrix S_0. The derivative of T with respect to θ at θ = 0 is simply:

dT/dθ = T_0 [0, -1; 1, 0] = [t_{0,12}, -t_{0,11}; t_{0,22}, -t_{0,21}],   (4.39)

where t_{0,ij}, i, j = 1, 2, is the element at row i and column j of the rotation matrix T_0. As a result, the Jacobian matrix of the line L with respect to [v_x v_y v_z θ]^T, evaluated at 0, is:

( ∂L/∂v_x  ∂L/∂v_y  ∂L/∂v_z  ∂L/∂θ ) = [0_{3×1}, -t_{0,11} s_{0,3}, t_{0,11} s_{0,2}, -t_{0,21} s_{0,1}; t_{0,21} s_{0,3}, 0_{3×1}, -t_{0,21} s_{0,1}, t_{0,11} s_{0,2}].   (4.40)
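A compact way to evaluate (4.40) at a current estimate (S_0, T_0) is shown below. This is an illustrative NumPy sketch whose sign pattern follows the reconstructed form of (4.40) above; the function name line_jacobian is introduced here.

```python
import numpy as np

def line_jacobian(S0, T0):
    """6x4 Jacobian of L = (t11*s1, t21*s2) with respect to (vx, vy, vz, theta),
       evaluated at the current estimate (S0, T0), following (4.40)."""
    s1, s2, s3 = S0[:, 0], S0[:, 1], S0[:, 2]
    t11, t21 = T0[0, 0], T0[1, 0]
    zero = np.zeros(3)
    top = np.column_stack([zero, -t11 * s3, t11 * s2, -t21 * s1])     # d(t11*s1)
    bottom = np.column_stack([t21 * s3, zero, -t21 * s1, t11 * s2])   # d(t21*s2)
    return np.vstack([top, bottom])
```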

4.5 Numerical optimization

Given initial estimates of the camera motion, i.e. the translation vector and rotation matrix, and of the structure, i.e. the 3D lines, bundle adjustment is used to refine these entities and obtain optimal ones. Since the proposed geometric error presented in Section 4.3 is a non-linear function of the camera motion and the 3D line, the bundle adjustment can be performed by solving a non-linear least squares problem. Hence, the structure and motion parameters can be iteratively adjusted by numerical optimization techniques. Given N cameras viewing M 3D lines, the bundle adjustment model can be formulated in the least squares sense as:

F = Σ_{i=1}^{N} Σ_{j=1}^{M} w_{i,j} g(l_{i,j}, T_i, R_i, L_j) = r^T r,   (4.41)

where w_{i,j} is one if line j is visible in view i, and r is the residual vector of dimension (r × 1). The optimal 3D lines and camera motions are the minimizers of the function F. The optimal solution is obtained by adjusting the given sub-optimal solution. To iteratively adjust the estimated entities, in particular the camera motions and the 3D lines, we seek the correction parameters for:

optical center: [T_x T_y T_z], see (4.18);
camera rotation: [ω_x ω_y ω_z], see (4.21);
3D line: [v_x v_y v_z θ], see (4.30).

To obtain these correction parameters, let δ be the vector of correction parameters and ξ be a point in the parameter space in the neighborhood of the sub-optimal solution ξ_0, see Figure 4.4. Recall that ξ_0 contains the sub-optimal translations, rotation matrices and 3D lines.
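Before linearizing, the residual vector r of (4.41) has to be assembled from the visible observations only. The sketch below is illustrative (the argument names and the placeholder g are assumptions, not the dissertation's implementation); it simply skips observations with w_ij = 0 and stacks the remaining error terms so that r·r equals F.

```python
import numpy as np

def stack_residuals(g, image_lines, cameras, lines_3d, visibility):
    """Assemble a residual vector r such that r.dot(r) equals F in (4.41).
       'g' is a placeholder for the geometric error of Section 4.3; w_ij is
       one when line j is visible in view i, so invisible observations are
       simply skipped."""
    r = []
    for i, (T_i, R_i) in enumerate(cameras):
        for j, L_j in enumerate(lines_3d):
            if visibility[i][j]:
                r.append(np.sqrt(g(image_lines[i][j], T_i, R_i, L_j)))
    return np.asarray(r)
```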

For small correction parameters δ, the Taylor series expansion of the residual is:

r(ξ) ≈ r(ξ_0) + Jδ,   (4.42)

where J is the Jacobian matrix of the residual vector r with respect to the correction parameters δ:

J_{r×p} = ∂r/∂δ.   (4.43)

The above Taylor series expansion is the basis for solving the non-linear optimization problem, in that the minimization problem (4.41) can be solved through a sequence of approximations of the original problem [62]. Namely, we seek the solution of the following linear least squares problem:

min_δ ||r(ξ_0) + Jδ||_2^2.   (4.44)

ξ_0 is corrected by the solution δ of the above linear least squares problem, which can be obtained by solving the following normal equation:

Nδ = -J^T r(ξ_0),   (4.45)

where N is the Gauss-Newton approximation of the Hessian matrix:

N_{p×p} = J^T J.   (4.46)

The normal matrix N in (4.45) can be rank deficient. One approach for solving rank deficient systems is the Levenberg-Marquardt algorithm. Its basis is to solve the augmented normal equation [91, 62]:

(N + H(λ))δ = -J^T r(ξ_0),   (4.47)

where λ > 0 is called the damping parameter. That is, the normal matrix N is regularized by a symmetric positive definite matrix H(λ), the so-called damping matrix [62], which is often chosen as

H(λ) = λ I_{p×p}   or   H(λ) = λ diag(N).
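One iteration of (4.47) is a single regularized linear solve. The sketch below is an illustrative NumPy implementation of that step (the function name lm_step is introduced here); it supports both damping-matrix choices mentioned above.

```python
import numpy as np

def lm_step(J, r, lam, use_diag=True):
    """Solve the augmented normal equation (4.47):
       (N + H(lambda)) delta = -J^T r, with N = J^T J and
       H(lambda) = lambda*diag(N) (or lambda*I when use_diag is False)."""
    N = J.T @ J
    H = lam * (np.diag(np.diag(N)) if use_diag else np.eye(N.shape[0]))
    return np.linalg.solve(N + H, -J.T @ r)
```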

The damping parameter λ needs to be tuned at each iteration. A basic strategy for tuning λ is to divide it by a constant, often 10, when the error decreases after the correction of ξ_0, i.e. ||r(ξ)|| < ||r(ξ_0)||, in which case the step is accepted. Otherwise, the damping parameter is multiplied by the constant and the step is rejected. A summary of the Levenberg-Marquardt algorithm is presented in Algorithm 4.

Data: Initial estimates of the camera translations T_0, rotations R_0 and 3D lines L_0.
Result: Adjusted camera translations T, rotations R and 3D lines L.
Step 1: Compute the correction parameters from (N + H(λ))δ = -J^T r(ξ_0).
Step 2: Adjust the camera translations, rotations and 3D lines using the obtained correction parameters, see (4.18), (4.21), (4.30).
Step 3: If the new estimation error is greater than the initial estimation error, the adjusted camera translations, rotations and 3D lines are declined and the damping parameter is multiplied by a constant. If the new estimation error is less than the initial estimation error, the adjusted camera translations, rotations and 3D lines are accepted, the damping parameter is divided by the constant, and the initial estimation error is reset to this new estimation error.
Step 4: Repeat until convergence.
Algorithm 4: Levenberg-Marquardt algorithm
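A generic accept/reject loop mirroring Algorithm 4 is sketched below. It is illustrative Python/NumPy; for brevity the state is corrected additively (x + δ), whereas the actual model applies the manifold updates (4.18), (4.21) and (4.30), and the callback names are assumptions.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, lam=1e-3, factor=10.0,
                        max_iter=100, tol=1e-12):
    """Damping loop of Algorithm 4: accept the step and shrink lambda when the
       error decreases, otherwise reject the step and grow lambda."""
    x = np.asarray(x0, dtype=float)
    err = np.sum(residual(x) ** 2)
    for _ in range(max_iter):
        r, J = residual(x), jacobian(x)
        N = J.T @ J
        delta = np.linalg.solve(N + lam * np.diag(np.diag(N)), -J.T @ r)
        x_new = x + delta                 # stands in for the manifold updates
        err_new = np.sum(residual(x_new) ** 2)
        if err_new < err:                 # step accepted: shrink the damping
            x, err, lam = x_new, err_new, lam / factor
        else:                             # step rejected: grow the damping
            lam *= factor
        if np.linalg.norm(delta) < tol:
            break
    return x
```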

In order to solve for the camera motions and 3D lines by bundle adjustment, and in particular to compute the Jacobian matrix, the proposed cost function needs to be differentiated with respect to the correction parameters. Let δ = [T_x T_y T_z ω_x ω_y ω_z v_x v_y v_z θ]^T be the vector of updating parameters. We therefore seek the Jacobian matrix J of the cost function (4.10) with respect to δ at the given camera translation vector T_0, rotation matrix R_0 and 3D line L_0 ↔ (S_0, T_0):

J = dg/dξ = (dg/dM)(dM/dξ).   (4.48)

Starting by differentiating the cost function (4.10) with respect to the normal vector M, we obtain:

dg/dM = (1/||M||^2) d(M^T D M)/dM + (M^T D M) d(1/||M||^2)/dM   (4.49)
      = (D + D^T) M / ||M||^2 - 2 (M^T D M) M / ||M||^4,   (4.50)

where D is the integral in (4.11) and M_0 = R_0 [ [T_0]_× I_3 ] L_0. The derivative of M (4.17) with respect to the updating parameters of the optical center [T_x T_y T_z] is:

∂M/∂T_x = R_0 [ [[1, 0, 0]^T]_×  0_{3×3} ] L_0,   (4.51)
∂M/∂T_y = R_0 [ [[0, 1, 0]^T]_×  0_{3×3} ] L_0,   (4.52)
∂M/∂T_z = R_0 [ [[0, 0, 1]^T]_×  0_{3×3} ] L_0.   (4.53)

Using the formulae in (4.22)-(4.28), the derivative of the normal vector M with respect to the updating terms ω_x, ω_y and ω_z is simply:

∂M/∂ω_x = R_0 [[1, 0, 0]^T]_× [ [T_0]_× I_3 ] L_0,   (4.54)
∂M/∂ω_y = R_0 [[0, 1, 0]^T]_× [ [T_0]_× I_3 ] L_0,   (4.55)
∂M/∂ω_z = R_0 [[0, 0, 1]^T]_× [ [T_0]_× I_3 ] L_0.   (4.56)

The derivative of the normal vector M with respect to the updating terms v_x, v_y, v_z and θ can be computed by using (4.40):

( ∂M/∂v_x  ∂M/∂v_y  ∂M/∂v_z  ∂M/∂θ ) = R_0 [ [T_0]_× I_3 ] [0_{3×1}, -t_{0,11} s_{0,3}, t_{0,11} s_{0,2}, -t_{0,21} s_{0,1}; t_{0,21} s_{0,3}, 0_{3×1}, -t_{0,21} s_{0,1}, t_{0,11} s_{0,2}],   (4.57)

where s_{0,i}, i = 1, 2, 3, is column i of the rotation matrix S_0 and t_{0,ij}, i, j = 1, 2, is the element at row i and column j of the rotation matrix T_0.
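Assembling the ten columns (4.51)-(4.57) into dM/dδ can be done as sketched below. This is an illustrative NumPy sketch written against the reconstructed forms above (in particular the block [ [T_0]_× I_3 ] and the 6x4 line Jacobian of (4.40), which is passed in as JL); the function names are assumptions.

```python
import numpy as np

def skew(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def dM_ddelta(R0, T0, L0, JL):
    """3x10 Jacobian of the projection-plane normal M = R [[T]_x I3] L with
       respect to delta = (Tx, Ty, Tz, wx, wy, wz, vx, vy, vz, theta),
       assembled from (4.51)-(4.57). 'JL' is the 6x4 line Jacobian of (4.40)
       (e.g. from the line_jacobian sketch given earlier)."""
    E = np.eye(3)
    P0 = np.hstack([skew(T0), np.eye(3)])                       # [[T0]_x  I3]
    cols = [R0 @ np.hstack([skew(E[:, k]), np.zeros((3, 3))]) @ L0
            for k in range(3)]                                   # (4.51)-(4.53)
    cols += [R0 @ skew(E[:, k]) @ P0 @ L0 for k in range(3)]     # (4.54)-(4.56)
    dM_line = R0 @ P0 @ JL                                       # (4.57), 3x4
    return np.column_stack(cols + [dM_line[:, k] for k in range(4)])
```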

As a result, we obtain the derivative of the normal vector M with respect to all the updating parameters:

∇_δ M = ( ∂M/∂T_x  ∂M/∂T_y  ∂M/∂T_z  ∂M/∂ω_x  ∂M/∂ω_y  ∂M/∂ω_z  ∂M/∂v_x  ∂M/∂v_y  ∂M/∂v_z  ∂M/∂θ ).   (4.58)
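The remaining factor of the chain rule (4.48), dg/dM, can be checked numerically. The sketch below assumes (from the reconstructed (4.49)-(4.50)) that the cost has the normalized quadratic form g(M) = M^T D M / ||M||^2 and uses an arbitrary symmetric D standing in for the integral of (4.11); it is an illustrative test, not the dissertation's code.

```python
import numpy as np

def cost_g(M, D):
    """Normalized quadratic cost g(M) = M^T D M / ||M||^2, the form
       differentiated in (4.49)-(4.50)."""
    return (M @ D @ M) / (M @ M)

def grad_g(M, D):
    """Analytic gradient (4.50): (D + D^T)M/||M||^2 - 2(M^T D M) M/||M||^4."""
    n2 = M @ M
    return (D + D.T) @ M / n2 - 2.0 * (M @ D @ M) * M / n2 ** 2

# Finite-difference check with an arbitrary symmetric D
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
D, M = A + A.T, rng.standard_normal(3)
eps = 1e-6
numeric = np.array([(cost_g(M + eps * e, D) - cost_g(M - eps * e, D)) / (2 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(numeric - grad_g(M, D))))   # agreement to ~1e-9
```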

Chapter 5: Experimental results

This chapter is dedicated to demonstrating the performance of the proposed line based 3D scene and camera motion recovery methods, including 3D line estimation and bundle adjustment. Experiments were conducted using both synthetic and real data. The synthetic case consists of a set of forty known 3D lines bounded by endpoints and camera projection matrices designed to simulate a camera motion. The real data is composed of the model house sequence from the Oxford multi-view dataset (vgg/data/data-mview.html) and homemade datasets including images from pinhole and fish eye cameras. The performance of the proposed methods was compared with Hartley and Zisserman's line triangulation method [1, page 323] and the line-based bundle adjustment models of Taylor and Kriegman [83] and Bartoli and Sturm [15].

5.1 Performance evaluation

An approach for evaluating the performance of 3D scene and camera motion recovery methods is to measure the similarity between the image lines and the re-projections of the estimated 3D lines. Since image line features are represented as line segments and estimated lines as infinite lines, a suitable similarity metric between them is the orthogonal distance between the image line endpoints and the re-projections of the estimated lines, see Figure 5.1. Let R and T be the estimated rotation matrix and translation vector of a camera.

The estimated line L can be projected to the image plane by first transforming the line to the camera's coordinate system using (2.20):

L̄ = [L̄_h; L̄_o] = [R, 0_{3×3}; R[T]_×, R] L.   (5.1)

The projection plane of the estimated line then has normal vector L̄_o in the camera coordinate system, see Figure 4.3. By the relation between the image line and the normal of the projection plane (2.28), the image of the estimated line can be computed by:

l ≃ K^{-T} [R[T]_×, R] L = Q L.   (5.2)

The matrix Q is called the line projection matrix. The similarity measure is then the orthogonal distance from the image line segments to the re-projections of the estimated 3D lines, i.e. Q L.

Unlike the real image datasets, the synthetic data are generated from a controlled 3D structure. It is hence possible to evaluate and compare the performance of both the proposed methods and the reference ones by using errors in 3D. Since, in the estimation, the object space coordinate system is attached to the first camera's coordinate system, the estimated 3D structure needs to be transformed to the actual object space coordinate system before computing the 3D error. The appropriate transformation is a Euclidean one because the cameras used in the estimation are calibrated. Such a transformation can be determined by estimating the rotation and translation that minimize the orthogonal distances from the 3D line segment endpoints to the estimated 3D lines. Therefore, the 3D reconstruction error, which is the orthogonal distance from the estimated lines to the endpoints of the 3D line segments, is computed after least squares Euclidean alignment with the ground truth.
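The 2D evaluation metric described above (re-project each estimated line with Q and measure the point-to-line distance at the segment endpoints) can be sketched as follows. This is illustrative NumPy code; the assembly of Q follows the reading of (5.2) given above, i.e. Q = K^{-T}[R[T]_×, R], which is an assumption consistent with the normal M = R[[T]_× I_3]L of Chapter 4, and the function names are introduced here.

```python
import numpy as np

def skew(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def line_projection_matrix(K, R, T):
    """Assumed line projection matrix Q with l ~ Q L for a Plücker 6-vector L:
       Q = K^(-T) R [ [T]_x  I3 ] = K^(-T) [ R[T]_x  R ]."""
    return np.linalg.inv(K).T @ R @ np.hstack([skew(T), np.eye(3)])

def point_to_line_distance(x, l):
    """Orthogonal distance from an image point x = (u, v) to a 2D line l."""
    return abs(l[0] * x[0] + l[1] * x[1] + l[2]) / np.hypot(l[0], l[1])

def reprojection_rms(segments, lines_3d, K, R, T):
    """RMS of the orthogonal distances from line-segment endpoints to the
       re-projections of the estimated 3D lines (the metric of Figure 5.1)."""
    Q = line_projection_matrix(K, R, T)
    d = []
    for (x1, x2), L in zip(segments, lines_3d):
        l = Q @ L
        d += [point_to_line_distance(x1, l), point_to_line_distance(x2, l)]
    return np.sqrt(np.mean(np.square(d)))
```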

Figure 5.1: The similarity measure between the re-projection of the estimated line and its corresponding line segment is the orthogonal distance from the line segment endpoints to the re-projected line.

5.2 Experiments on synthetic data

To test the performance of the proposed methods on a controlled setup, experiments on synthetic data were conducted. Each 3D line was generated by 2 random points in the unit cube [-1, 1] × [-1, 1] × [-1, 1], and each camera had an identical calibration matrix (the calibration matrix of the camera in the Middlebury multi-view stereo dataset, middlebury.edu/mview/data/, was used). Each camera pointed to the center of the scene, which is the origin of the object space. The optical centers of the cameras lie on a circle of radius 4 meters on the plane Z = 3; that is, the camera optical centers were revolved about the Z axis. The 3D line segments were projected to the image planes by projecting their endpoints.
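A setup of this kind can be generated as sketched below. The code is an illustrative NumPy sketch under stated assumptions (a standard look-at construction is used for the camera orientation, which the dissertation does not spell out, and the function names are introduced here).

```python
import numpy as np

def look_at_rotation(center, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Rotation whose rows are the camera axes, with the third row (viewing
       direction) pointing from the optical center toward the target; a common
       look-at construction assumed here."""
    z = target - center
    z = z / np.linalg.norm(z)
    x = np.cross(z, up)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    return np.vstack([x, y, z])

def synthetic_setup(n_lines=40, n_cams=8, radius=4.0, height=3.0,
                    step_deg=45.0, seed=0):
    """Random 3D line segments in the unit cube and cameras revolved about the
       Z axis looking at the origin; step_deg=45 matches the wide-baseline
       layout of Figure 5.2, step_deg=9 the short-baseline one of Figure 5.4."""
    rng = np.random.default_rng(seed)
    segments = rng.uniform(-1.0, 1.0, size=(n_lines, 2, 3))   # two endpoints each
    cameras = []
    for i in range(n_cams):
        a = np.deg2rad(i * step_deg)
        center = np.array([radius * np.cos(a), radius * np.sin(a), height])
        cameras.append((look_at_rotation(center), center))
    return segments, cameras
```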

To analyze the effect of observation noise in the image space on the estimation result, Gaussian random noise with standard deviation σ was added to the endpoints of the line segments on the images, and the true camera matrices were perturbed and used in the experiments. The performance was measured by the Root Mean Squared (RMS) 2D re-projection and 3D reconstruction errors.

The first experiment on synthetic data was the wide baseline case, using 40 lines and 8 cameras where the distance between the optical centers of two consecutive cameras is constant, see Figure 5.2. For the averaging approach to 3D line estimation, the 3D line sample generation starts from using 5 images. The initial 3D line estimates from the dual projective space approach were used as the initial solution for all bundle adjustment models, including the proposed and the reference ones. The estimation process for all methods was repeated 50 times with different random noise, and the averaged RMS of the 2D and 3D errors are reported in Figure 5.3. The plot shows the performance of both the proposed line estimation methods and the bundle adjustment, along with the alternative ones, as functions of the standard deviation of the noise. It can be observed that the errors in line estimation and bundle adjustment increase linearly with the standard deviation of the observation noise. The estimation errors were reduced after bundle adjustment.

The second experiment on synthetic data was the short baseline case. 40 lines and 8 cameras were used. Camera poses were generated by rotating the current camera by 9 degrees about the Z axis, see Figure 5.4, and the estimation process was again repeated over random-noise trials. The RMS errors are reported in Figure 5.5. It can be observed that the reconstruction errors in line estimation and bundle adjustment for the short baseline case are higher than those of the wide baseline case. The increase in reconstruction error is natural when the baseline between cameras is reduced because of insufficient parallax between viewpoints.

Figure 5.2: Camera setup for the wide baseline case. (a) Camera setup parameters (45 degrees between consecutive cameras, 4 meter circle radius, 3 meter height). (b) Cameras in 3D.

Figure 5.3: Experiment results on the wide baseline case. (a) Re-projection error [pixel] versus noise standard deviation [pixel]. (b) Reconstruction error [meter] versus noise standard deviation [pixel]. The compared methods are the dual projective space approach, the averaging approach, Hartley and Zisserman, the proposed bundle adjustment model, Taylor and Kriegman, and Bartoli and Sturm.

Comparing the 3D line estimation methods, it can be observed that the dual projective space based 3D line estimation gives the lowest 2D re-projection error among all 3D line estimation methods in both the wide and short baseline cases.

In contrast, the averaging approach for 3D line estimation gives the best result in terms of 3D reconstruction error.

5.3 Experiments on real data

The proposed methods were also tested on two real datasets, i.e. the model house and book sequences. The performance of the proposed methods is compared with the alternative ones using only the re-projection error because of the lack of scene ground truth.

5.3.1 The model house sequence

In this experiment, a sequence called the model house sequence from Oxford's visual geometry group (vgg/data/data-mview.html) was used. This dataset was generated by taking images of a model house placed on a revolvable table. Equivalently, by considering the model house as stationary, the camera was moved around the model house in order to obtain views of its different sides. With the provided camera matrices, the cameras were assumed to be calibrated; that is, only the calibration matrices were used as initial information. The camera motions were then initialized by the factorization method [2] using the provided point matches. The calibration matrices were then used to stratify the camera motion to Euclidean space. The provided line matches and the initial camera matrix estimates were used to estimate lines. Note that most lines were not visible in all views. At this step, the dual projective space approach to 3D line estimation reported an RMS orthogonal distance between the re-projected line estimates and their corresponding line segments of 0.85 pixel, while the averaging approach reported 0.63 pixel and Hartley and Zisserman's approach [1, page 323] 0.62 pixel.

The estimated lines from the dual projective space approach and the initial camera matrices were then subjected to the bundle adjustment process proposed in this dissertation and to the two alternative models proposed in [83] and [15]. For quantitative justification, the performance of the proposed method was measured using the orthogonal distance between line segment endpoints and projected line estimates, as discussed in Section 5.1. The performance of all the methods is tabulated in Table 5.1. In this table, the RMS, minimum, maximum and standard deviation of the orthogonal distance are reported. The RMS of the orthogonal distance from the proposed method is closest to that of [15], which directly minimizes the orthogonal distance between line segment endpoints and re-projected line estimates. However, our proposed method shows the smallest standard deviation.

For visual justification of the proposed bundle adjustment model, the estimated 3D lines are re-projected onto the image planes using (5.2). Figure 5.6 shows the re-projections of the 3D line estimates on two images of the model house sequence, where the endpoints of each line segment were orthogonally projected to the re-projection of their corresponding 3D line estimate. In Figure 5.7, the reconstructed 3D structure of the model house seen from different viewpoints is demonstrated. In Figure 5.9, the top view of the model house from the 3D data provided by the dataset is visualized along with the bundle adjustment results from [83, 15] and the proposed method. Although the 3D line data coming with the dataset does not look reasonable, those 3D lines project to their correct positions on the image plane using the provided camera matrices. Similarly, the reference [15] gives the best re-projection error, but its reconstructed scene does not look reasonable. In contrast, the proposed method has the visually best reconstructed scene and an acceptable re-projection error compared to the best re-projection error. Figures 5.7b and 5.9d show the side and top views of the reconstructed model house from the proposed bundle adjustment model.

The co-planar features are nearly coplanar in the reconstruction, e.g. the windows and doors on the wall. Notice that the proposed bundle adjustment model could not provide accurate estimates for the positions of the line segments on the roof, since the cameras were revolved around the model house; hence there is not enough information from the top of the model house.

Table 5.1: Comparative results from the experiment with the model house sequence.
  Proposed approach   Reference [83]   Reference [15]
RMS of orthogonal distance (pixel)
Minimum orthogonal distance (pixel)
Maximum orthogonal distance (pixel)
Standard deviation of orthogonal distance (pixel)

5.3.2 The book sequence

The book sequence consists of 9 images taken by a calibrated camera. There are 25 manually detected lines, which are visible in all images. The camera motions were initialized by the homography decomposition method [92]. The homographies between the ground plane and the image planes were computed using just the four corners of a standard letter size paper laid flat on the ground plane. The initial camera matrix estimates were used to estimate lines using the proposed methods and Hartley and Zisserman's method.

At this step, the dual projective space approach to 3D line estimation reported an RMS orthogonal distance between the re-projected line estimates and their corresponding line segments of 3.21 pixels, while the averaging approach reported 4.05 pixels and Hartley and Zisserman's approach [1, page 323] 3.46 pixels. The 2D re-projection errors for this sequence are larger than those of the model house sequence because less information was used to initialize the camera matrices; that is, only four points were used to estimate the homography between the image planes and the reference plane in the scene. The obtained calibration matrix was used in the bundle adjustment model without any further refinement.

Table 5.2: Comparative results from the experiment with the book sequence.
  Proposed approach   Reference [83]   Reference [15]
RMS of orthogonal distance (pixel)
Minimum orthogonal distance (pixel)
Maximum orthogonal distance (pixel)
Standard deviation of orthogonal distance (pixel)

The re-projection errors after bundle adjustment from the proposed method and the references [83, 15] are reported in Table 5.2. The RMS of the orthogonal distance from the proposed method is closest to that of [15]. Although the cost function of [15] is directly the orthogonal distance between the endpoints of the line segments and the projected line estimates, our proposed method shows the smallest standard deviation.

For visual justification of the bundle adjustment result, the estimated 3D lines from the proposed model are re-projected onto two sample images of the book sequence in Figure 5.10. The zoom on an area of an image is illustrated in Figure 5.11. Note the significant improvement of the bundle adjustment over the line estimation method. In Figure 5.14, the reconstructed 3D scene of the book sequence seen from different views is demonstrated. Note the orthogonality, verticality and parallelism of the line features from the book, bookend and ruler in the top view of the reconstructed 3D scene in Figure 5.13a. Figure 5.13b illustrates the front view of the reconstructed scene. Coplanar lines on the actual object appear in the same configuration in the reconstruction. In contrast to the model house data, the book sequence gives a better reconstruction because of the variation in viewpoints and because the estimated lines are visible in all views.

5.3.3 Fish eye camera

In order to demonstrate the versatility of the proposed methods, an experiment using a fish eye camera was conducted. The Omnidirectional Camera Calibration Toolbox for Matlab (OCamCalib) developed by Scaramuzza was used to calibrate the camera, and checkerboards were used as the calibration pattern. The calibration toolbox estimated the polynomial function and affine transformation of the camera, as defined by Micusik and Pajdla [93], and also the camera motions. Corners of the squares were used as line segment endpoints. Instead of using the back-projection of points on the image plane to object space by (2.25), which is defined for the pinhole camera, the points were mapped onto the Gaussian sphere by first projecting them to the object space using the inverse projection function of the camera model and then normalizing them to unit vectors.
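The mapping just described amounts to a lift-and-normalize step. The sketch below is illustrative Python; 'cam2world' is a placeholder for the calibrated inverse projection function (e.g. the one estimated by an OCamCalib-style calibration) and is not an actual API name.

```python
import numpy as np

def to_gaussian_sphere(pixels, cam2world):
    """Map fisheye image points onto the unit (Gaussian) sphere: lift each
       pixel to a 3D ray with the calibrated inverse projection function and
       normalize. 'cam2world' is a placeholder for that inverse model."""
    rays = np.asarray([cam2world(p) for p in pixels], dtype=float)
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)
```

Once the observations live on the unit sphere, the same projection-plane normals used for the pinhole case can be formed, so the line estimation and bundle adjustment machinery of Chapters 3 and 4 applies unchanged.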

To show the recovery performance, the camera motions from the calibration were randomly perturbed. Given the perturbed camera motions, the 3D line estimation and bundle adjustment can then be performed following the methods proposed in Chapters 3 and 4, respectively. In this experiment, the line-based bundle adjustment models from Taylor and Kriegman [83] and Bartoli and Sturm [15] are not used for comparison because their models were proposed for conventional cameras. To visualize the re-projection of the estimated 3D lines, the line segment endpoints were projected back onto the estimated lines. At this step, endpoints of all estimated 3D lines were obtained. Points between the 3D line endpoints were then re-projected to the images. Figure 5.15 illustrates the re-projections of the 3D line segments. The reconstructed calibration pattern and the camera motions are illustrated in Figure 5.16. Notice the co-planarity of the reconstructed 3D lines in Figure 5.16b and the orthogonality and parallelism between lines in Figure 5.17a. The comparison between the initial and adjusted results is shown in Figure 5.18; the solution was improved after bundle adjustment.

Figure 5.4: Camera setup for the short baseline case. (a) Camera setup parameters (9 degrees between consecutive cameras, 4 meter circle radius, 3 meter height). (b) Cameras in 3D.

Figure 5.5: Experiment results on the short baseline case. (a) Re-projection error [pixel] versus noise standard deviation [pixel]. (b) Reconstruction error [meter] versus noise standard deviation [pixel]. The compared methods are the same as in Figure 5.3.

Figure 5.6: Two images from the model house sequence overlaid with image lines (yellow dashed lines) and re-projected line estimates (black solid lines).

Figure 5.7: 3D line reconstruction and estimated camera motions from the proposed method after bundle adjustment. (a) The 3D structure and camera poses. (b) The side view of the model house.

Figure 5.8: Reconstructed 3D scene viewed from two different viewpoints.

Figure 5.9: Top view of the reconstructed model house sequence from different methods. (a) Triangulated 3D lines provided with the dataset. (b) Reference [83]. (c) Reference [15]. (d) Proposed method.

Figure 5.10: Re-projection of estimated 3D lines onto two sample images of the book sequence. The manually detected image lines are plotted as yellow dashed lines and re-projected line estimates as black solid lines.

Figure 5.11: Zoom on an area of an image from the book sequence overlaid with image lines (yellow dashed lines) and re-projected line estimates (black solid lines). (a) Initial solution. (b) After bundle adjustment. The result from the initial 3D line triangulation is improved by the bundle adjustment process.

Figure 5.12: The reconstructed 3D scene and camera poses.

Figure 5.13: Top and front views of the reconstructed scene.

Figure 5.14: Reconstructed 3D scene viewed from different angles.

Figure 5.15: Re-projection of the estimated lines onto the images. The re-projected lines are shown as yellow solid lines.
