Estimating Pose and Motion using Bundle Adjustment and Digital Elevation Model Constraints. Gil Briskin



Estimating Pose and Motion using Bundle Adjustment and Digital Elevation Model Constraints

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Gil Briskin

Submitted to the Senate of the Technion - Israel Institute of Technology
Sivan 5773, Haifa, May 2013


This research was carried out under the supervision of Prof. Ehud Rivlin and Dr. Hector P. Rotstein, in the Faculty of Computer Science.


Contents

List of Figures
Abstract
1 Introduction
  1.1 Related Work
  1.2 Current Work
  1.3 Thesis Organization
2 Bundle Adjustment
  2.1 Introduction
  2.2 Coordinates System
  2.3 BA equations
  2.4 Cost Function and Numeric Solver
    2.4.1 Gauss-Newton's Method
    2.4.2 Levenberg Marquardt
  2.5 Sparse Bundle Adjustment
3 Digital Terrain Map
  3.1 Introduction
  3.2 Definition and Properties
  3.3 Degrees of Freedom
4 Bundle Adjustment With Digital Terrain Model Constraints
  4.1 Introduction
  4.2 Coordinate System Conversion
  4.3 Surface Approximation
    4.3.1 Plane Approximation
    4.3.2 Second order Approximation
  4.4 Bundle Adjustment Framework
    4.4.1 Image Dilution
    4.4.2 Feature Extraction
    4.4.3 Outlier Detection
    4.4.4 Parameters Initial Guess
    4.4.5 Solving the Bundle Adjustment
  4.5 Influence on the Bundle Adjustment Structure and Computation Time
    4.5.1 Bundle Adjustment Structure
    4.5.2 Computation Time
  4.6 Handling Errors
5 Iterative Closest Point
  5.1 Introduction
  5.2 Iterative Closest Point
6 Experimental Results
  6.1 Introduction
  6.2 Synthetic Experiments
  6.3 Experiments on Small Scale Model
    6.3.1 Experiment Outline
    6.3.2 Evaluation methods
    6.3.3 Experiment Flow
    6.3.4 Results
7 Conclusion and Future Work
Hebrew Abstract

List of Figures

2.1 The Bundle Adjustment weight matrix
2.2 The Bundle Adjustment Jacobian matrix
2.3 The Bundle Adjustment Hessian matrix
3.1 Digital Terrain Model
3.2 Examples for degenerated scenes
The input generation in the synthetic experiments
The synthetic scenario
Camera's Location Error on Angular Drift
Camera's Location Error on Velocity Error
Camera's Location Error Norm
Slopes Comparison
Images from the first experiment
Images from the second experiment
The estimated and measured camera path
The estimated location error
The estimated angles error
The reconstructed DTM Heights
Ill-constrained point example


Abstract

Pose and motion estimation of a calibrated camera is a common application in the photogrammetric world. In most cases it is solved by combining motion estimation with connections between the images taken by the camera and external geographical data such as the Global Positioning System (GPS), orthophotos, or a Digital Terrain Model. While the motion estimation can be solved automatically using only the connections between the images, the connections to the geographical data are generated manually or by using geographical measurements, such as GPS readings along the camera's motion.

The contribution of this thesis is a new constraint added to a pose and motion estimation algorithm. We propose to integrate the Digital Terrain Model (DTM) into the Bundle Adjustment framework. To that end, the terrain is approximated by a differentiable second-order function, and new constraints are added to the Bundle Adjustment that minimize the distance between the 3D points and the terrain approximation. A framework that solves the pose and motion estimation with the new constraints from an inaccurate initial guess is presented. We show that under certain conditions the proposed method can replace other constraints based on geographical measurements, such as GPS readings of the camera's motion, and thereby become the first method that solves the pose and motion estimation of a sequence of images using only a DTM, without an additional geographical source.

Our method has several advantages. The generated 3D structure is more accurate when the DTM constraints are added to the Bundle Adjustment, specifically for ill-constrained points. In addition, a DTM is available worldwide, can be acquired offline, and is resistant to signal distortion and blocking, as opposed to other measurements such as GPS.


Chapter 1

Introduction

In spite of the effort placed in finding efficient and robust estimates for the pose and motion of a calibrated camera from multiple image views, the problem continues to attract extensive attention in the photogrammetric and computer vision communities. Perhaps the main reason for this continuous attention is the well-known fact that pose and motion cannot be uniquely solved from a series of images. Some of the limitations are obvious from the start: one cannot expect to obtain absolute information about pose with respect to an external coordinate system from a sequence of images alone; other limitations are more subtle and relate to the specifics of the visual information. One such limitation is pose estimation when the images are taken from a small baseline relative to the scene. The lack of uniqueness makes the problem hard and ill-conditioned, and hence additional assumptions, external information and specially designed algorithms are required to produce robust and reliable estimates. This thesis presents a new approach to the pose and motion problem based on adding external absolute information, namely information with respect to an external coordinate system, to the data obtained from the sequence of images.

Pose and motion estimation is sometimes referred to as the navigation problem, understood here as the problem of determining the position, orientation and velocity of an object (in the present case, a camera) with respect to an external reference system. In particular, if the scene is outdoors then the reference system will usually be attached to earth global coordinates, while indoors the reference will somehow be attached to the building of interest. Details are discussed below. In some applications, a solution to the motion estimation problem suffices. This is the case, for instance, when dealing with applications that require the transformation of information from one image to another, such as tracking and relative navigation. On the other hand, for applications that transform information from images to an external scene coordinate system and backwards, such as mapping and absolute navigation, pose estimation is required, possibly in addition to the relative solution. The pose and motion estimation problem can be specified as follows: given a sequence of images $I_t$ taken by a calibrated camera at several time instants and from different

locations and orientations, find an estimate for the pose of the camera at the time instants when the images were acquired, together with the relative translation and rotation of the camera between time instants. The approach followed in this work is to attempt to solve this navigation problem using optimization tools; hence one can conceptually consider an objective function $L(\text{pose}, \text{motion}, \text{structure}, \text{data})$ and solve:

$$\arg\min_{\text{pose},\,\text{motion},\,\text{structure}} L(\text{pose}, \text{motion}, \text{structure}, \text{data})$$

Here pose, motion and structure denote parametrizations of the location of the camera with respect to a global coordinate system, of the linear and rotational motion of the camera between time instants, and of the structure of the scene, respectively. Moreover, data denotes all the known data used for formulating the function, including information extracted from the images, calibration parameters of the camera, external information available and additional assumptions that are used. An example of the latter is the rigidity assumption of the scene used throughout. In some instances of the problem formulation the scene structure is eliminated, either totally or partially; however, as will be shown below, the scene structure and its estimation is an important component of the present work.

1.1 Related Work

Among the various approaches to the motion estimation problem, Bundle Adjustment (BA) [FZ98, TMHF99] is currently the method of choice for many applications that require a solution based on images taken under realistic conditions. To solve the pose estimation problem, an external source in addition to the images is needed. The common source is typically a GPS, an Inertial Navigation System (INS), or a combination of both [Ell06, McG04, Lhu11]. BA together with GPS and INS is also used in outdoor vehicle navigation [KA, SKLP06]. Alternatively, the location of one or more of the tracked features is sometimes assumed to be known, e.g., geo-referenced landmarks or manual control points. [DMH04] used a laser sensor, such as LIDAR, to produce automatic control points. Another type of external source is the Digital Terrain Model (DTM), also known as Digital Elevation Model (DEM). [LRR06, LR11] used a DTM to recover the orientation of a pair of images by approximating the DTM surface and using matches between the images to formulate a constraint on the pose parameters. The disadvantage of this method compared to Bundle Adjustment is that the latter solves all the images as a block and thereby achieves better epipolar geometry between the images. Another approach to solving the problem is a coordinate descent approach, namely solving sequentially:

1. $[\text{motion}^k, \text{structure}^k] = \arg\min_{\text{motion},\,\text{structure}} L(\text{pose}^{k-1}, \text{motion}^{k-1}, \text{structure}^{k-1}, \text{data})$
2. $[\text{pose}^k] = \arg\min_{\text{pose}} L(\text{pose}^{k-1}, \text{motion}^{k}, \text{structure}^{k}, \text{data})$

The first step can be solved by using, for example, the epipolar or trifocal tensor geometries [SZB95, YWCO06] or the Bundle Adjustment. The second step can be solved by using scene matching, the Iterated Closest Point (ICP) algorithm [BM92, Zha94], or any other approach that exploits the additional data available in the formulation [HY12]. The solutions to each of the steps are indexed to express the fact that the coordinate descent can be iterated in an attempt to refine the solution. Since the ICP solves for the rigid transformation between the point cloud generated by the BA and the DTM, it is very sensitive to errors in the point cloud structure. For example, velocity errors in the INS will produce a scale error in the estimated point cloud. [ZSN05] introduced a variation on the ICP method that also estimates the scale between the point sets. [SSH+08, JB11] also used star observations, INS measurements and Doppler measurements as absolute information about the camera location and orientation. The solution of the BA using only those geographical measurements produced a drift in the Z axis (height values); to solve that, they added a second step, in which simple constraints between the 3D points and the DTM were added to the BA. All the methods mentioned above have advantages and disadvantages: GPS may or may not be available [BPHR11], the accuracy of the INS solution diverges with time, and landmarks need to be in the field of view and then identified.

1.2 Current Work

The combination of BA with some sort of external information is the field of the current work; however, instead of assuming the availability of absolute information about position, motion or structure, it will be shown that the BA can be extended by using a Digital Elevation or Terrain Model (DEM/DTM), which is a digital representation of the ground surface topography and hence can be used as a source of absolute information on the scene. It can also be used to correct the INS drift by extracting the transformation between the 3D points generated by the BA and the DTM. Although plausible and intuitively appealing, it is well known that if the objective function does not meet some stringent assumptions then coordinate descent will not necessarily produce an optimal solution to the overall optimization problem, and may not even converge to a stationary point. Consequently, one is interested in finding a single-step solution to the problem, namely one that solves for motion, structure and pose simultaneously. The main purpose of our work is to present one such algorithm, and to compare its performance with respect to a more standard variation of coordinate descent. In this thesis, we propose to solve both the relative and absolute pose in the BA framework by adding the DTM to the BA constraints, using a second-order approximation of the terrain. By using the DTM information directly in the BA framework, we solve the pose and motion estimation problem without the need for additional measurements, such as GPS and INS, or an additional solver, such as the ICP.

1.3 Thesis Organization

The rest of the thesis is organized as follows: Chapter 2 gives a general overview of the Bundle Adjustment. Chapter 3 describes the DTM properties and its limitations. Chapter 4 describes our proposed method in detail. Chapter 5 briefly describes the ICP algorithm to which we compared our results, and in Chapter 6 the evaluation of our proposed method on synthetic and real images is presented. The conclusion and future work are detailed in Chapter 7.

Chapter 2

Bundle Adjustment

2.1 Introduction

Bundle Adjustment is an algorithm that uses information extracted from a set of images taken by a moving camera to compute the 3D structure of the observed scene together with the relative motion of the camera. BA has effectively become the algorithm of choice for reconstructing large and complex scenes, under the assumption that the relative motion between images is totally or partially measured using an additional sensor, e.g. a differential GPS. BA was originally formulated in the photogrammetric literature and later re-invented as a structure-from-motion method in [FZ98]; a comprehensive tutorial including pointers to the relevant literature is provided in [TMHF99].

The name Bundle Adjustment refers to the fact that the method considers the bundles of light rays emanating from a camera at several time instants. Each light ray represents the possible 3D locations of a feature seen in the image the light ray emanates from. The light ray originates from the location of the camera's pinhole at the time the image was taken, and passes through the projection of that feature on the image. Light rays from 2D observations of the same feature should ideally intersect at a unique 3D point, as they represent the same 3D object. The relative pose of the camera at each imaging location can be adjusted so as to meet the constraint that every set of rays representing a single feature intersects at a single 3D point.

In practice, BA can be formulated as a large sparse geometric parameter estimation problem that can be efficiently solved using modifications of standard optimization algorithms. The optimization problem is formulated using measurements, parameters and equations (constraints) that tie the measurements and parameters together. In classical BA, the measurements consist of a collection of 2D locations of points extracted from the images, where a set of 2D locations on different images that represent the same feature is called a track. The parameters are the camera poses at the times the images were taken and the 3D location representing the intersection point of the light rays of each track. The cost function to be minimized is the squared sum of the differences between the measured 2D locations and the projections

of the estimated 3D locations of those features using the estimated camera poses. The cost function is formed by weighting each constraint by a weight matrix that represents the predicted noise of the measurement. Alternatively, the weights can also be thought of as normalizing factors, so that different types of measurements can be combined into a single cost function.

In the standard BA the cost function consists only of constraints resulting from the 2D projections. As a consequence, only the relative solution can be obtained, namely a solution up to 7 degrees of freedom. In order to solve for the absolute pose, new information and constraints are required, and a number of options have been proposed in the literature. For example, a source of measurements can be the solution of an Inertial Navigation System (INS). The new constraint would be the Euclidean distance between the INS location of the camera at time t and its estimated location. Although those measurements resolve the remaining degrees of freedom, they are sensitive to drift and velocity errors in the INS system. Another type of measurement is the 3D location of a track, also known as a control point. The constraint in this case would be the distance between the estimated point and the measured 3D location. The disadvantage of control points is that the matching between the measured 3D location and the 2D locations is done manually.

In this chapter we briefly present the Bundle Adjustment algorithm without the new constraints proposed in this thesis, which are described in detail in Chapter 4. The rest of the chapter is organized as follows: In Section 2.2 the conversion between the different coordinate systems embedded in the equations is detailed. Section 2.3 describes the BA equations and parameters used in this work. Section 2.4 formulates the cost function and the numeric solver used to minimize it. Section 2.5 describes the sparse nature of the Bundle Adjustment and how it is used to significantly reduce the complexity in time and space.

2.2 Coordinates System

In order to solve for absolute pose, global coordinate systems with respect to the earth are required. For example, GPS measurements of the camera location are given in either Longitude-Latitude-Height or in Earth-Centered-Earth-Fixed (ECEF) coordinates, while a DTM is usually referenced to Universal Transverse Mercator (UTM) coordinates, possibly due to its defense-oriented origins. Except for ECEF, none of the above is a Cartesian coordinate system. On the other hand, the pinhole camera model [HZ04] used in the BA to forward-project an estimated 3D point onto the camera plane is Cartesian. Moreover, since each estimated 3D point is projected onto several images (otherwise it is rejected, since a single light ray does not provide enough information to determine its location), the ability to convert between the Cartesian systems of the images is needed in order to construct the BA cost function.

Another type of projection used in this thesis is the backward-projection from the

camera to the DTM surface. This projection is described in detail in Chapter 4, but it is clear that the ability to convert between the Cartesian system of the camera and the DTM is needed. In addition, when the DTM is referenced to UTM coordinates, the backward-projection function would suffer from the geographical distortions resulting from the UTM approximation.

In order to address all the issues raised above, we use a single Cartesian system in our Bundle Adjustment, and convert all the geographical information from the different coordinate systems to that single Cartesian system at the beginning of the process, before building the cost function. All the measurements, parameters and constraints are then defined and solved in that coordinate system. Obviously, the parameters can be transformed back to their original coordinate systems using the reverse transformation.

This method has several advantages. First, it simplifies all the constraints and equations, since there is no need for conversions due to coordinate system differences. For example, no conversion of the GPS measurements of the camera location is added to the forward-projection constraints between the estimated 3D point and the image plane. Secondly, if the DTM is referenced to UTM coordinates, it removes the geographical distortion in the backward-projection function resulting from the UTM approximation. Another advantage is the convenience of having all the parameters in the same coordinate system, so that, for example, calculating the distance between two points amounts to taking the difference between their values, with no additional conversion.

We use a Local-Level Local-North (LLLN) system, also known as a North-East-Down (NED) system, as our coordinate system. To avoid numeric overflow and to accurately represent the problem, we prefer the parameters to have small values. Therefore, we chose the origin to be the measured location of the first camera, so that all other parameters only represent the offset from the first camera. For the same reason we did not choose the ECEF coordinate system, since its origin is set to the center of the earth. In our experiments, there were three types of geographical measurements: camera locations in Longitude-Latitude-Height, camera angles in an LLLN coordinate system originated at the measured camera location, and the DTM in a UTM projection. The camera locations were converted to the main coordinate system by first converting them to ECEF and then transforming them to the main coordinate system. Since the angles are given in an LLLN system in the first place, we converted them to the main coordinate system by simply rotating them with the rotation matrix that transforms between the two LLLN coordinate systems. The conversion of the DTM to our coordinate system is detailed in Section 4.2.
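To make the conversion pipeline above concrete, the following Python sketch (not the thesis implementation) chains the geodetic-to-ECEF and ECEF-to-NED transformations using standard WGS84 constants. Angles are assumed to be in radians, and all function names are illustrative.

```python
import numpy as np

WGS84_A = 6378137.0                      # semi-major axis [m]
WGS84_F = 1.0 / 298.257223563            # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)     # first eccentricity squared

def geodetic_to_ecef(lat, lon, h):
    """Longitude-Latitude-Height (radians, meters) to ECEF coordinates."""
    n = WGS84_A / np.sqrt(1.0 - WGS84_E2 * np.sin(lat) ** 2)
    return np.array([(n + h) * np.cos(lat) * np.cos(lon),
                     (n + h) * np.cos(lat) * np.sin(lon),
                     (n * (1.0 - WGS84_E2) + h) * np.sin(lat)])

def ecef_to_ned_matrix(lat0, lon0):
    """Rotation from ECEF to a local North-East-Down (LLLN) frame at (lat0, lon0)."""
    sl, cl = np.sin(lat0), np.cos(lat0)
    so, co = np.sin(lon0), np.cos(lon0)
    return np.array([[-sl * co, -sl * so,  cl],
                     [     -so,       co, 0.0],
                     [-cl * co, -cl * so, -sl]])

def geodetic_to_local_ned(lat, lon, h, lat0, lon0, h0):
    """Convert a geodetic measurement into the main LLLN frame originated at the first camera."""
    origin = geodetic_to_ecef(lat0, lon0, h0)
    return ecef_to_ned_matrix(lat0, lon0) @ (geodetic_to_ecef(lat, lon, h) - origin)
```

With the first measured camera location as (lat0, lon0, h0), all other geographical measurements are expressed as small offsets in the same local frame, which is exactly the property used to avoid numeric overflow.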

2.3 BA equations

The measurements together with the parameters below define the set of constraints (equations) solved by the Bundle Adjustment. Since measurements are based on physical observations, they are not accurate and include errors and noise from different sources. In particular, the noise can be modeled as a white Gaussian process. Let $\tilde{m}_i \in \mathbb{R}^n$ be a noisy measurement vector, and $m_i \in \mathbb{R}^n$ the true, ideal measurement vector. Then:

$$\tilde{m}_i = m_i + n_i, \qquad n_i \sim \mathcal{N}(0, \Sigma_i) \tag{2.1}$$

where $n_i \in \mathbb{R}^n$ is modeled as a Gaussian noise vector with zero mean and covariance $\Sigma_i \in \mathbb{R}^{n \times n}$. A covariance matrix with large singular values indicates that the error vector $n_i$ may have large values and therefore the measurement might be noisy. In the same way, a covariance matrix with small singular values indicates that with high probability the error vector is small and the measurement is close to its true value. The covariance matrix is used to weight the different equations in the cost function defined in Section 2.4, so that errors on measurements with large singular values contribute less to the cost function than the same errors on measurements with small singular values.

The common measurements used in the Bundle Adjustment are the 2D observations of the 3D features extracted from the images. The 2D observations are usually generated by an image processing tool, such as the Scale Invariant Feature Transform [Low04], or can even be set manually. We denote such an observation on an image taken at time $t$ by $\tilde{p}^t_i \in \mathbb{R}^2$. That observation is then matched, manually or automatically, to 2D observations on other images. A set of such observations is called a track:

$$\text{track}_i = \{\tilde{p}^{t_1}_i, \tilde{p}^{t_2}_i, \ldots, \tilde{p}^{t_h}_i\}, \qquad [t_1, t_2, \ldots, t_h] \subseteq [1..n], \quad 3 \le h \le n$$

For each track $i$ we denote by $P_i$ the 3D point representing the location of the feature to which all of the track's 2D observations refer. For each observation $\tilde{p}^t_i$ a covariance matrix $\Sigma^t_i \in \mathbb{R}^{2 \times 2}$ is defined such that:

$$\tilde{p}^t_i = p^t_i + n^t_i \tag{2.2}$$

where $p^t_i$ is the true location of the feature and $n^t_i \sim \mathcal{N}(0, \Sigma^t_i)$ is the Gaussian noise vector. The desired situation is that each track contains 2D observations from as many images as possible. This has two main advantages:

- The estimation of the 3D point $P_i$ will be more robust to bad matches in the track.

- The longer the track, the wider the baseline angle between the extreme images, which increases the accuracy of the estimation of the 3D point. See the discussion of ill-constrained points in Chapter 7 for more details.

Another type of measurement is the camera poses, locations and orientations, generated by an Inertial Navigation System (INS) and a Global Positioning System (GPS) at different time stamps. An INS is based on accelerometers, gyroscopes, or other motion-sensing devices, and together with GPS is able to generate a location at any time stamp by integrating the motion-sensing measurements. As such, the location errors are not independent, since an error in the accelerometer or gyroscope at time $t$ will influence all further locations and angles due to the integration process. This type of error is also referred to as a drift error. For such errors there is a strong correlation between the poses at different time stamps, and the Gaussian noise error model used in this thesis is not suitable. Since we did not conduct experiments using real GPS and INS data (see Chapter 6), and since in Chapter 4 we present a method that solves the absolute pose problem using only the DTM, we preferred to use the simple representation of the error model. To formulate our pose error model, the pose measurements at the different time stamps are denoted by:

$$\tilde{T}^t, \tilde{\Psi}^t, \qquad t \in [1..n]$$

In addition, the covariance matrix $\Sigma^t_{T,\Psi}$ is given such that:

$$\begin{bmatrix} \tilde{T}^t \\ \tilde{\Psi}^t \end{bmatrix}_{6\times1} = \begin{bmatrix} T^t \\ \Psi^t \end{bmatrix}_{6\times1} + n^t_{T,\Psi}, \qquad n^t_{T,\Psi} \sim \mathcal{N}(0, \Sigma^t_{T,\Psi}) \tag{2.3}$$

where $n^t_{T,\Psi}$ is the Gaussian error vector and $\begin{bmatrix} T^t \\ \Psi^t \end{bmatrix}_{6\times1}$ represents the true location and orientation of the camera in the form of Euler angles.

While the measurements define the equations of the cost function, the parameters are the variables of the cost function. These are the types of parameters used in this thesis:

- $T^t_{3\times1}, \Psi^t_{3\times1}$ - 6 parameters representing the camera pose at time $t$: $T^t$ represents the camera location and $\Psi^t$ represents its Euler angles. Both $T^t$ and $\Psi^t$ are relative to the main coordinate system defined in Section 2.2. The transformation from the Euler angles to the rotation matrix used in Equation 2.4 can be found in [Sla99].
- $P_i$ - 3 parameters representing the 3D location of feature $i$.

To model the camera projection, we use a pinhole camera model [HZ04]. The 2D projection of feature $i$ on the image taken at time $t$, $p^t_i$, can be calculated using the

camera pose at time $t$, $T^t, \Psi^t$, and the 3D location of feature $i$, $P_i$, by:

$$p^t_i = \begin{bmatrix} V^t_i(0) / V^t_i(2) \\ V^t_i(1) / V^t_i(2) \end{bmatrix}_{2\times1}, \qquad V^t_i = K P^t_i = K R(\Psi^t)(P_i - T^t) \tag{2.4}$$

It is assumed that the camera internal parameters, represented by the matrix $K$, are known and accurate. In case this assumption is not valid, the internal parameters can be estimated by adding them as parameters to the BA.

2.4 Cost Function and Numeric Solver

As presented by [TMHF99], the BA problem can be expressed by looking at the measurements and their error model, and finding the parameters that maximize the likelihood of the solution:

$$\arg\max_{P_i, \Psi^t, T^t} \prod_{i,t} \Pr(n^t_i = \tilde{p}^t_i - p^t_i) \prod_t \Pr(n^t_\Psi = \tilde{\Psi}^t - \Psi^t) \prod_t \Pr(n^t_T = \tilde{T}^t - T^t)$$

Substituting the Gaussian densities of the error model and taking the negative logarithm of the product gives the equivalent minimization:

$$\arg\min_{P_i, \Psi^t, T^t} \sum_{i,t} (\tilde{p}^t_i - p^t_i)^T (\Sigma^t_i)^{-1} (\tilde{p}^t_i - p^t_i) + \sum_t (\tilde{T}^t - T^t)^T (\Sigma^t_T)^{-1} (\tilde{T}^t - T^t) + \sum_t (\tilde{\Psi}^t - \Psi^t)^T (\Sigma^t_\Psi)^{-1} (\tilde{\Psi}^t - \Psi^t)$$

In summary, the BA is the problem of minimizing the following cost function:

$$q(P_i, \Psi^t, T^t) = \sum_{i,t} (\tilde{p}^t_i - p^t_i)^T (\Sigma^t_i)^{-1} (\tilde{p}^t_i - p^t_i) + \sum_t (\tilde{T}^t - T^t)^T (\Sigma^t_T)^{-1} (\tilde{T}^t - T^t) + \sum_t (\tilde{\Psi}^t - \Psi^t)^T (\Sigma^t_\Psi)^{-1} (\tilde{\Psi}^t - \Psi^t)$$
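For illustration, here is a small Python sketch of the forward-projection of Equation (2.4) and of the reprojection part of the cost; the Euler-angle convention, data layout and function names are assumptions made for the example, not the thesis code.

```python
import numpy as np

def rotation_from_euler(psi):
    """Rotation matrix from Euler angles (roll, pitch, yaw); Z-Y-X composition is an illustrative choice."""
    r, p, y = psi
    cr, sr = np.cos(r), np.sin(r)
    cp, sp = np.cos(p), np.sin(p)
    cy, sy = np.cos(y), np.sin(y)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(K, T, psi, P):
    """Pinhole projection of a 3D point P for a camera at (T, psi), Eq. (2.4)."""
    V = K @ rotation_from_euler(psi) @ (P - T)
    return V[:2] / V[2]

def reprojection_cost(K, poses, points, observations):
    """Weighted SSE over 2D observations; each observation is (track_id, frame_id, p_tilde, Sigma_inv)."""
    cost = 0.0
    for i, t, p_obs, Sigma_inv in observations:
        r = p_obs - project(K, poses[t][0], poses[t][1], points[i])
        cost += r @ Sigma_inv @ r
    return cost
```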

The above cost function is a weighted Sum of Squared Errors (SSE) when the inverses of the covariance matrices, $(\Sigma^t_i)^{-1}, (\Sigma^t_T)^{-1}, (\Sigma^t_\Psi)^{-1}$, are considered as the weights. Each term in the cost function, such as $(\tilde{p}^t_i - p^t_i)^T (\Sigma^t_i)^{-1} (\tilde{p}^t_i - p^t_i)$, can be thought of as an equation that ties the measurement to the parameters. It can easily be seen that the cost function described above is not necessarily convex, and therefore a numeric solver is needed. We use the Levenberg Marquardt (LM) algorithm, which is a step-control algorithm that interpolates between the Gauss-Newton and gradient methods.

2.4.1 Gauss-Newton's Method

Given some non-convex function $h(x)$, Gauss-Newton's method [Bjo96, HZ04] attempts to construct a sequence $x_n$ from an initial guess $x_0$ that converges towards the minimum of the function. First, $h(x)$ is approximated by its quadratic Taylor expansion around $x_0$:

$$h(x_0 + \delta x) \approx h(x_0) + g^T \delta x + \tfrac12 \delta x^T H \delta x, \qquad g \equiv \frac{dh}{dx}(x_0), \quad H \equiv \frac{d^2h}{dx^2}(x_0) \tag{2.5}$$

This approximation is convex and has a simple global minimum, which can be found explicitly using linear algebra. Setting

$$\frac{d\,h(x_0 + \delta x)}{d\,\delta x} = g + H \delta x = 0 \tag{2.6}$$

gives the Gauss-Newton step:

$$\delta x = -H^{-1} g \tag{2.7}$$

The next value of the variable is $x_1 = x_0 + \delta x$. Iterating on the Gauss-Newton step (by recalculating the Taylor approximation around the point $x_n = x_{n-1} + \delta x$ and solving (2.6)) gives Gauss-Newton's method, which hopefully converges to the minimum.

2.4.2 Levenberg Marquardt

For large steps, the Taylor approximation (2.5) might be inaccurate, and the Gauss-Newton step may not decrease $h(x)$. On the other hand, the gradient descent method guarantees a decrease but may be slow.

A combination of the two methods is used:

$$\delta x = -(H + \lambda I)^{-1} g \tag{2.8}$$

where $\lambda$ is the parameter that weights between the two steps. When $\lambda$ is large, $H$ can be neglected and the step is a small gradient step: $\delta x \approx -\frac{1}{\lambda} g$. When $\lambda$ is small, $\lambda I$ can be neglected and the step is a Gauss-Newton step: $\delta x \approx -H^{-1} g$.

Levenberg Marquardt (LM) [PTVF92] is a step-control algorithm that interpolates between Gauss-Newton's method and the gradient method. The algorithm changes the $\lambda$ factor through the iterations in order to achieve fast and accurate convergence. There are many variants of the LM algorithm. The following one was used in this work:

1. $\lambda$ = initial value, iter = 1
2. while iter < maximum number of iterations && improve > minimum improve
   (a) while true
       i. $\delta x = -(H + \lambda I)^{-1} g$
       ii. if $h(x_i + \delta x) < h(x_i)$
           A. $\lambda = \lambda / 10$
           B. break
       iii. else
           A. $\lambda = \lambda \cdot 10$
   (b) $x_{i+1} = x_i + \delta x$
   (c) iter = iter + 1
   (d) improve = $h(x_i) - h(x_{i+1})$
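A minimal Python sketch of the step-control loop listed above, for a generic cost h(x) with user-supplied gradient and Hessian approximation; the naming and stopping thresholds are illustrative choices, not the thesis code.

```python
import numpy as np

def levenberg_marquardt(h, grad, hess, x0, lam=1e-3, max_iter=100, min_improve=1e-8):
    """Interpolate between Gauss-Newton and gradient steps by adapting lambda."""
    x = np.asarray(x0, dtype=float)
    improve = np.inf
    it = 1
    while it < max_iter and improve > min_improve:
        g, H = grad(x), hess(x)
        while True:
            dx = -np.linalg.solve(H + lam * np.eye(len(x)), g)
            if h(x + dx) < h(x):     # step accepted: trust the quadratic model more
                lam /= 10.0
                break
            lam *= 10.0              # step rejected: fall back towards a gradient step
        x_new = x + dx
        improve = h(x) - h(x_new)
        x = x_new
        it += 1
    return x
```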

To describe the numeric step in the Bundle Adjustment, let us first represent the cost function of Section 2.4 in matrix form:

$$q(x) = \tfrac12 (f(x) - b)^T W (f(x) - b) \tag{2.9}$$

where $f(x)$ is the projection function mapping the parameter vector $x$ into the space of the measurement vector $b$, and $W$ is the weight matrix composed from the measurement covariances. Therefore:

$$g = \frac{dq}{dx} = \frac{d}{dx}\,\tfrac12 (f(x) - b)^T W (f(x) - b) = J^T W (f(x) - b) \tag{2.10}$$

where $J = \frac{d(f(x) - b)}{dx} = \frac{df(x)}{dx}$. The Hessian is therefore:

$$H = \frac{d^2q}{dx^2} = \frac{dg}{dx} = J^T W \frac{d(f(x) - b)}{dx} + \sum_i \Big(\frac{d^2 f(x)_i}{dx^2}\Big)\big(W(f(x) - b)\big)_i = J^T W J + \sum_i \Big(\frac{d^2 f(x)_i}{dx^2}\Big)\big(W(f(x) - b)\big)_i$$

In practice, the term $\sum_i \big(\frac{d^2 f(x)_i}{dx^2}\big)\big(W(f(x) - b)\big)_i$ is relatively small in comparison to $J^T W J$ (since either the prediction error $f(x) - b$ is small or the model is nearly linear, $\frac{d^2 f(x)}{dx^2} \approx 0$). Dropping the second term gives the Gauss-Newton approximation to the least squares Hessian:

$$H \approx J^T W J \tag{2.11}$$

Now, the Levenberg Marquardt step becomes:

$$\delta x = -(H + \lambda I)^{-1} g = -(J^T W J + \lambda I)^{-1} J^T W (f(x) - b) \tag{2.12}$$

2.5 Sparse Bundle Adjustment

One of the main strengths of the Bundle Adjustment is its sparse structure, which enables it to solve large problems with a large number of constraints and parameters in real time. It boils down to the fact that the calculation time of a single step in the numeric solver is proportional to the number of pose parameters and not to the number of 3D point parameters. To show that, let us look at a simple problem with eight 3D points seen by 3 images. The parameter vector is then:

$$x = [P_1, \ldots, P_8, T^1, \Psi^1, \ldots, T^3, \Psi^3]_{42\times1}$$

and the measurement vector is:

$$b = [\{\tilde{p}^1_1, \ldots, \tilde{p}^3_1\}, \ldots, \{\tilde{p}^1_7, \tilde{p}^2_7\}, \{\tilde{p}^1_8, \ldots, \tilde{p}^3_8\}, \{\tilde{T}^1, \tilde{\Psi}^1\}, \ldots, \{\tilde{T}^3, \tilde{\Psi}^3\}]_{62\times1}$$

where in this case the 7th point is not seen by the 3rd camera. If all the measurements are independent, which is usually the case, then the weight matrix $W$ is block diagonal and its order is defined by the measurement vector $b$. Figure 2.1 shows the weight matrix for the example above.

Figure 2.1: The Bundle Adjustment weight matrix.

The Jacobian matrix is defined by $J = \frac{df(x)}{dx}$. The rows correspond to the measurement vector $b$, and the columns correspond to the parameter vector $x$. To evaluate the structure of the Jacobian, let us first examine the derivatives of the constraints. The derivatives of the 2D observation constraints are:

$$\frac{d(\tilde{p}^t_i - p^t_i)}{dP_j} = 0,\ j \ne i \qquad \frac{d(\tilde{p}^t_i - p^t_i)}{d\Psi^j} = 0,\ j \ne t \qquad \frac{d(\tilde{p}^t_i - p^t_i)}{dT^j} = 0,\ j \ne t$$

The derivatives of the camera constraints are:

$$\frac{d(\tilde{T}^t - T^t)}{dP_j} = \frac{d(\tilde{\Psi}^t - \Psi^t)}{dP_j} = 0,\ \forall j \qquad \frac{d(\tilde{T}^t - T^t)}{d\Psi^j} = \frac{d(\tilde{\Psi}^t - \Psi^t)}{dT^j} = 0,\ \forall j \qquad \frac{d(\tilde{T}^t - T^t)}{dT^j} = \frac{d(\tilde{\Psi}^t - \Psi^t)}{d\Psi^j} = 0,\ j \ne t$$

In Figure 2.2, the sparseness of the Jacobian can easily be seen for the example defined above.

Figure 2.2: The Bundle Adjustment Jacobian matrix.

As was seen in Equation 2.11, the Hessian can be approximated by $H = J^T W J$. Since $W$ is block diagonal it does not change the structure of $J$, and therefore the structure of $H$ is equivalent to the structure of $J^T J$. The entry $H_{i,j}$ is therefore non-empty if the inner product of column $J_i$ with column $J_j$ is not zero. The structure of $H$ is called arrowhead and is divided into 4 parts: $N_1$ represents the second derivatives of

points by points, $N_2$ represents the second derivatives of points by camera parameters, $N_3$ is the transpose of $N_2$, $N_3 = N_2^T$, and $N_4$ represents the second derivatives of camera parameters by camera parameters. Figure 2.3 shows the Hessian matrix for the example defined above.

Figure 2.3: The Bundle Adjustment Hessian matrix.

Note that the step control $\lambda I$ added to $H$ does not change the structure of the matrix, as the diagonal is already assumed to be full. To further analyze the calculation needed to solve the numeric step defined in

Equation 2.12, we denote the right-hand side of the step equation by $g = -J^T W (f(x) - b)$ (the negative gradient), so that the step solves $(H + \lambda I)\,\delta x = g$. Dividing $g$ together with $\delta x$ into two parts, $g = \begin{bmatrix} g_1 \\ g_2 \end{bmatrix}$ and $\delta x = \begin{bmatrix} \delta x_1 \\ \delta x_2 \end{bmatrix}$, Equation 2.12 can be written as a set of two equations:

$$\begin{pmatrix} N_1 & N_2 \\ N_3 & N_4 \end{pmatrix} \begin{pmatrix} \delta x_1 \\ \delta x_2 \end{pmatrix} = \begin{pmatrix} g_1 \\ g_2 \end{pmatrix} \tag{2.13}$$

From the first equation, $N_1 \delta x_1 + N_2 \delta x_2 = g_1$, we can write $\delta x_1$ in terms of $\delta x_2$:

$$\delta x_1 = N_1^{-1}(g_1 - N_2 \delta x_2) \tag{2.14}$$

By applying it to the second equation, $N_3 \delta x_1 + N_4 \delta x_2 = g_2$, we get:

$$N_3 N_1^{-1}(g_1 - N_2 \delta x_2) + N_4 \delta x_2 = g_2$$

and $\delta x_2$ can be extracted as:

$$\delta x_2 = (N_4 - N_2^T N_1^{-1} N_2)^{-1}(g_2 - N_2^T N_1^{-1} g_1) \tag{2.15}$$

and $\delta x_1$ can then be extracted from $\delta x_2$ using Equation 2.14.

Now, let us analyze the computational time needed for computing $\delta x_1$ and $\delta x_2$. For that, let us denote by $k$ the number of features and by $l$ the number of images. Then:

- Calculating $g$ - $O(l \cdot k)$, as the worst case is when all the features are seen by all the images.
- Calculating $N_1^{-1}$ - $O(k)$.
- Calculating $(g_2 - N_2^T N_1^{-1} g_1)$ - $O(l \cdot k)$.
- Calculating $(N_4 - N_2^T N_1^{-1} N_2)^{-1}$ - $O(l^3)$.

Under the assumption that $k < O(l^2)$, the overall computation is $O(l^3)$ and is proportional to the number of images and not to the number of features.
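A sketch of the reduced (Schur complement) solve of Equations (2.13)-(2.15) is shown below; the block layout (3x3 point blocks for N1, 6 parameters per camera) follows the text, while the dense NumPy representation and the function name are illustrative assumptions.

```python
import numpy as np

def schur_step(N1_blocks, N2, N4, g1, g2):
    """Solve the arrowhead system (2.13) via the Schur complement (2.14)-(2.15).

    N1_blocks: list of k 3x3 point blocks (N1 is block diagonal),
    N2: (3k x 6l) point-camera block, N4: (6l x 6l) camera block,
    g1, g2: point and camera parts of the right-hand side.
    """
    # Invert N1 block by block - O(k); this is what makes the step cheap.
    N1_inv = np.zeros((len(g1), len(g1)))
    for i, B in enumerate(N1_blocks):
        N1_inv[3 * i:3 * i + 3, 3 * i:3 * i + 3] = np.linalg.inv(B)

    # Reduced camera system (Schur complement), inverted in O(l^3).
    S = N4 - N2.T @ N1_inv @ N2
    rhs = g2 - N2.T @ N1_inv @ g1
    dx2 = np.linalg.solve(S, rhs)          # camera update, Eq. (2.15)
    dx1 = N1_inv @ (g1 - N2 @ dx2)         # point update, Eq. (2.14)
    return dx1, dx2
```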

Chapter 3

Digital Terrain Map

3.1 Introduction

As described in Chapter 2, a simple Bundle Adjustment based only on a sequence of images can be solved only up to 7 degrees of freedom. In order to resolve these degrees of freedom, additional constraints on geographical data are needed. One type of global information is the Digital Terrain Model (DTM). A DTM is a digital model, or 3D representation, of a terrain's surface. Before describing the new constraints added to the Bundle Adjustment, the properties and limitations of the DTM are described in this chapter. Section 3.2 describes the DTM and Section 3.3 describes some degenerate cases in which solving the absolute pose using a DTM is limited.

3.2 Definition and Properties

A Digital Terrain Map (DTM) is a model of the surface of the earth. More specifically, given a two-dimensional (2D) parametrization of the horizontal location on the earth surface, the DTM is the mapping $DTM : \mathbb{R}^2 \to \mathbb{R}$ providing the altitude of the earth surface at a particular horizontal location: $h(x, y) = DTM_{x,y}$. The parametrization $x, y$ may denote the geographical coordinates, with $x$ being the latitude and $y$ the longitude. Alternatively, the two variables can denote the northing and easting of a UTM projection. Likewise, $h$ may denote the altitude of the terrain above sea level or above a reference ellipsoid, typically WGS84. In practice, the earth surface cannot be modeled by a simple function, and instead the DTM is given by altitude values over a discrete grid:

$$h_{i,j} = DTM_{x_i, y_j} = DTM_{i,j}$$

The resolution of the DTM grid ranges from 5 meters to 90 meters. Dense DTMs (5 meter resolution) are usually available only to official authorities, but the sparser DTMs are freely available: the Shuttle Radar Topography Mission (SRTM) [12] has

global coverage with 90 m resolution, whereas the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) [AST] has coverage of 99% and a resolution of 30 m. The accuracy of the DTM height values ranges from one meter to 15 meters. The accuracy depends on the source of the elevation data, the terrain roughness and the sampling density. For example, [RMB06] states that 90% of the height errors in SRTM are below 5-9 meters, depending on the location. The errors might be correlated and obviously influence the quality of the registration to the DTM. While some of the errors can be handled, other errors, such as a constant drift in all the height values, cannot be resolved when the DTM is the only geographical source. The different error types and the methods used to handle them are detailed in Section 4.6.

Figure 3.1: Digital Terrain Model - a terrain's surface and its Digital Terrain Model.

In terms of memory, the DTM is very compact. For example, a region of 10x10 km with a grid resolution of 25 m has 160,000 sampled points, and assuming each point requires 4 bytes, the whole DTM can be represented with roughly 640 KB.

3.3 Degrees of Freedom

The basic idea behind solving the absolute pose estimation problem using Bundle Adjustment and a DTM is the ability to solve the pose problem by comparing the reconstructed scene with the DTM using the 3D structure of the scene. In some degenerate cases, the terrain is such that it is only possible to solve the absolute estimation problem up to some degrees of freedom. In those cases, additional geographical information is needed, such as GPS.

For example, when the surface is represented as a horizontal plane, there are 3 degrees of freedom: the matched solution can vary along the XY plane and can be scaled without affecting the matching to the DTM. Figure 3.2 describes some additional examples.

Figure 3.2: Examples of degenerate scenes. The top example has 3 degrees of freedom, as the solution can vary along the XY plane and can be scaled. In the first example in the second row, the solution has 2 degrees of freedom, as it can vary along the X axis and can be scaled. In the third example, the solution has a single degree of freedom, as the solution can only be scaled, and in the last example there are no degrees of freedom.


Chapter 4

Bundle Adjustment With Digital Terrain Model Constraints

4.1 Introduction

Our novel approach is to combine the DTM with the BA equations. To add the DTM to the BA, a set of parameters, measurements and constraints must be defined. The requirements imposed by the BA framework are that the new measurements must have covariance matrices and that the new constraints be differentiable; otherwise they could not fit into the numeric solver introduced in Chapter 2. Our method does not introduce new parameters beyond those described in Chapter 2, which is important since, as described earlier, the calculation time of the numeric step is proportional to the number of parameters. As for the constraints, besides adding new constraints that tie the reconstructed structure to the DTM, we also removed the constraints based on the INS measurements, for two reasons. First, we wanted to simplify the equation system so that the absolute solution is determined only by the DTM. Secondly, we wanted to show the strength of the proposed constraints, i.e. that the absolute solution can be obtained without INS and GPS signals. By that, we extend the ability to solve the absolute position problem to platforms without geographical sensors, such as INS and GPS, or to cases where those signals are not available. Having said that, it is obvious that in case other geographical measurements exist, they can be used together with the new DTM constraints.

Our basic idea is to add a constraint on the distance between the 3D point $P_i$ and the DTM. Ideally, we would like all the 3D points to lie on the DTM. In order to define a constraint between $P_i$ and the DTM, the DTM surface is approximated near the predicted location of $P_i$ by a simple differentiable function $h_i(x, y) : \mathbb{R}^2 \to \mathbb{R}$. Section 4.3 describes the two possible approximations we examined.

In contrast to the measurements described so far, the surface approximation does not have an error model. On the other hand, the BA scheme is based on the assumption that the measurements have a Gaussian error model. Therefore, in order to add the constraint to the BA, we assume that the distance between the point and the surface is modeled by a Gaussian error:

$$d_i = \text{dist}(P_i, h_i(x, y)), \qquad d_i \sim \mathcal{N}(0, var_i)$$

The variance of the constraint, $var_i$, is set a priori by the scenario properties and is changed according to the accuracy of the DTM. See Section 4.6 for more details.

4.2 Coordinate System Conversion

As described in Section 2.2, a single coordinate system is used in our framework and all the other geographical information is converted to it. We use an LLLN coordinate system originated at the first measured camera location. To add the DTM to the BA cost function, we first need to convert the DTM to the main coordinate system. As mentioned in Chapter 3, the DTM is given in a UTM projection and is constructed as a 2D grid of points with height values. Due to the geographical distortions in the UTM projection, a grid of points defined in the UTM projection is not converted to a grid in a true Cartesian system, such as ECEF or LLLN. Since a grid of points is used in the DTM backward-projection function, a simple conversion of the DTM grid is not enough. Instead, we first define the 2D grid in the main coordinate system to which we want the DTM to be converted. Then, we transform the new grid to UTM coordinates, and sample the height values at those points by interpolating the DTM values. Since the images are usually taken over a relatively small area compared to the earth surface, the transformed DTM is still a valid mapping $\mathbb{R}^2 \to \mathbb{R}$.

4.3 Surface Approximation

One of the basic ideas in the approach presented here is to approximate the true surface around a given location by a simple function of the horizontal parametrization. For each 3D point $P_i$ defined in the BA parameters, an approximated surface is calculated and used to define a new constraint. Since the surface approximation is local, we first need to define the location on the surface where the surface approximation is needed. To do that before an estimate of $P_i$ even exists, we use backward-projection onto the DTM, as sketched below. Given a 2D observation $\tilde{p}^t_i$ of point $P_i$ and the camera pose approximation $T^t, \Psi^t$, the estimated point on the surface can be calculated using ray tracing [App68]. Let us denote the intersection point by $P_i^{DTM}$.
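A possible Python sketch of this backward-projection, assuming a local frame in which z is altitude and a regularly gridded DTM. The bilinear height lookup, ray-marching step size and bisection refinement are illustrative choices, not necessarily those of [App68] or of the thesis implementation.

```python
import numpy as np

def dtm_height(dtm, x0, y0, step, x, y):
    """Bilinear interpolation of the terrain height h(x, y) from a regular grid.
    dtm[i, j] is the height at (x0 + i*step, y0 + j*step)."""
    fi, fj = (x - x0) / step, (y - y0) / step
    i, j = int(fi), int(fj)
    di, dj = fi - i, fj - j
    return ((1 - di) * (1 - dj) * dtm[i, j] + di * (1 - dj) * dtm[i + 1, j] +
            (1 - di) * dj * dtm[i, j + 1] + di * dj * dtm[i + 1, j + 1])

def backproject_to_dtm(cam_pos, ray_dir, dtm, x0, y0, step, ds=5.0, max_range=20000.0):
    """March along the viewing ray until it drops below the terrain, then refine by bisection."""
    ray_dir = ray_dir / np.linalg.norm(ray_dir)
    prev = cam_pos
    for s in np.arange(ds, max_range, ds):
        p = cam_pos + s * ray_dir
        if p[2] < dtm_height(dtm, x0, y0, step, p[0], p[1]):   # ray went below the surface
            lo, hi = prev, p
            for _ in range(20):                                 # bisection refinement
                mid = 0.5 * (lo + hi)
                if mid[2] < dtm_height(dtm, x0, y0, step, mid[0], mid[1]):
                    hi = mid
                else:
                    lo = mid
            return 0.5 * (lo + hi)
        prev = p
    return None  # no intersection within max_range
```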

To approximate the true surface around $P_i^{DTM}$, we consider the DTM grid as a sample of points from the true surface. The grid of points is then used to construct the surface approximation.

There are several considerations when choosing the size of the grid to use in the surface approximation. First, we need to decide the size of the area over which we want to approximate the surface. On the one hand, a large area should be chosen so that the topography of that area is captured; on the other hand, our surface approximation can model only simple geometric shapes, such as a single mountain or valley in the case of the second-order approximation. In our experiments we found that choosing a large area produced a bad surface approximation. We chose the size of the area a priori, according to the surface topography. Since the topography usually does not change rapidly inside the region of interest (ROI) of the camera, this can also be done automatically before the estimation begins, by examining the derivatives of the height values inside the ROI. Given that the size of the area was defined, we used all the grid points in that area, so that our approximation is less sensitive to errors in the grid samples. Another element that influences the quality of the surface approximation is the DTM resolution. For very sparse DTM samples, small changes in the surface are not modeled by the DTM and therefore not modeled by the surface approximation. This can be handled by increasing the constraint variance, see Section 4.6.

The surface approximation depends on the point $P_i^{DTM}$, which is determined by ray tracing from the camera pose estimate $T^t, \Psi^t$. Since the camera pose is very inaccurate at the beginning of the process, and consequently so are $P_i^{DTM}$ and the surface approximation, we re-calculate $P_i^{DTM}$ and the surface approximation $h_i(x, y)$ throughout the BA convergence. A plane approximation is described in Subsection 4.3.1 and an extension of it to a second-order function can be found in Subsection 4.3.2. In our experiments we found the second-order approximation to be more precise and to speed up the convergence.

4.3.1 Plane Approximation

We approximate the surface around the point $P_i^{DTM}$ by a plane $(\tilde{n}_i, \tilde{d}_i)$ such that points on the plane satisfy:

$$\tilde{n}_i^T p + \tilde{d}_i = 0 \tag{4.1}$$

The plane parameters are calculated by taking the 3x3 height values from the DTM surrounding the approximated point $P_i^{DTM}$, and solving the least squares problem:

$$Ax = b \tag{4.2}$$

where

$$A = \begin{bmatrix} X_1^{DTM} & Y_1^{DTM} & Z_1^{DTM} \\ X_2^{DTM} & Y_2^{DTM} & Z_2^{DTM} \\ \vdots & \vdots & \vdots \\ X_9^{DTM} & Y_9^{DTM} & Z_9^{DTM} \end{bmatrix}_{9\times3}, \qquad b = \mathbf{1}_{9\times1}$$

and $(X_i^{DTM}, Y_i^{DTM}, Z_i^{DTM})$ are the nine DTM points around $P_i^{DTM}$, which in our experiments reflects an area of 75x75 meters. By multiplying both sides of the equation by $A^T$, we get:

$$(A^T A)_{3\times3}\, x = (A^T b)_{3\times1} \tag{4.3}$$

This set of equations can be solved explicitly, since it has only 3 variables. The normalized plane parameters are then:

$$\tilde{n}_i = \frac{x}{\|x\|}, \qquad \tilde{d}_i = -\frac{1}{\|x\|}$$

For each 3D point $P_i$ and its plane approximation $(\tilde{n}_i, \tilde{d}_i)$ we add the constraint:

$$\tilde{n}_i^T P_i + \tilde{d}_i = 0 \tag{4.4}$$

Now, the BA is the problem of minimizing the following cost function:

$$\underset{P_i, \Psi^t, T^t}{\text{minimize}} \;\sum_{i,t} (\tilde{p}^t_i - p^t_i)^T (\Sigma^t_i)^{-1} (\tilde{p}^t_i - p^t_i) + \sum_i \frac{(\tilde{n}_i^T P_i + \tilde{d}_i)^2}{var_i} \tag{4.5}$$

4.3.2 Second order Approximation

We approximate the surface around the point $P_i^{DTM}$ by a second-order surface such that all the points $p = (x, y, z)$ on the surface satisfy:

$$z = \tilde{a}x^2 + \tilde{b}y^2 + \tilde{c}xy + \tilde{d}x + \tilde{e}y + \tilde{f} \tag{4.6}$$

where $(\tilde{a}, \tilde{b}, \tilde{c}, \tilde{d}, \tilde{e}, \tilde{f})$ are the surface parameters. The surface parameters are calculated by taking the 7x7 height values from the DTM around the approximated point $P_i^{DTM}$, which in our experiments reflects an area of 175x175 meters, and solving the least squares problem $Ax = b$, where

$$A = \begin{bmatrix} (X_1^{DTM})^2 & (Y_1^{DTM})^2 & X_1^{DTM} Y_1^{DTM} & X_1^{DTM} & Y_1^{DTM} & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ (X_{49}^{DTM})^2 & (Y_{49}^{DTM})^2 & X_{49}^{DTM} Y_{49}^{DTM} & X_{49}^{DTM} & Y_{49}^{DTM} & 1 \end{bmatrix}_{49\times6}$$

$$x = \begin{bmatrix} \tilde{a} \\ \tilde{b} \\ \tilde{c} \\ \tilde{d} \\ \tilde{e} \\ \tilde{f} \end{bmatrix}_{6\times1}, \qquad b = \begin{bmatrix} Z_1^{DTM} \\ Z_2^{DTM} \\ \vdots \\ Z_{49}^{DTM} \end{bmatrix}_{49\times1}$$

and $(X_i^{DTM}, Y_i^{DTM}, Z_i^{DTM})$ are the 49 DTM points around $P_i^{DTM}$. By multiplying both sides of the equation by $A^T$, we get:

$$(A^T A)_{6\times6}\, x = (A^T b)_{6\times1} \tag{4.7}$$

Now that the matrix $(A^T A)_{6\times6}$ is square and positive semi-definite, we can solve the set of equations using LU factorization. To simplify the derivatives of the new constraint, we used as a distance function the difference between the height value of the estimated point $P_i = (X_i, Y_i, Z_i)$ and the surface at the point $(X_i, Y_i)$:

$$Z_i - (\tilde{a}X_i^2 + \tilde{b}Y_i^2 + \tilde{c}X_i Y_i + \tilde{d}X_i + \tilde{e}Y_i + \tilde{f}) = 0 \tag{4.8}$$

In our experiments we found that this distance function was a good approximation of the real distance between a 3D point and the surface, while significantly simplifying the derivatives of the constraint. For each 3D point $P_i$ estimated in the Bundle Adjustment and its surface approximation $(\tilde{a}, \tilde{b}, \tilde{c}, \tilde{d}, \tilde{e}, \tilde{f})$ we add the above constraint. The combined cost function is now:

$$\underset{P_i, \Psi^t, T^t}{\text{minimize}} \;\sum_{i,t} (\tilde{p}^t_i - p^t_i)^T (\Sigma^t_i)^{-1} (\tilde{p}^t_i - p^t_i) + \sum_i \frac{\big(Z_i - (\tilde{a}X_i^2 + \tilde{b}Y_i^2 + \tilde{c}X_i Y_i + \tilde{d}X_i + \tilde{e}Y_i + \tilde{f})\big)^2}{var_i} \tag{4.9}$$
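The second-order fit and the height residual can be sketched in Python as follows; this uses a direct normal-equations solve rather than the LU factorization mentioned above, and the function names are illustrative.

```python
import numpy as np

def fit_quadric_surface(pts):
    """Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f, Eqs. (4.6)-(4.7).
    pts: (N, 3) array of DTM points around P_i^DTM (N = 49 for a 7x7 window)."""
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    A = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
    coeffs = np.linalg.solve(A.T @ A, A.T @ z)   # 6x6 normal equations
    return coeffs                                # (a, b, c, d, e, f)

def dtm_residual(coeffs, P):
    """Height residual of Eq. (4.8) for an estimated 3D point P = (X, Y, Z)."""
    a, b, c, d, e, f = coeffs
    X, Y, Z = P
    return Z - (a * X**2 + b * Y**2 + c * X * Y + d * X + e * Y + f)
```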

4.4 Bundle Adjustment Framework

In this section, the framework of the proposed method is described. As mentioned before, the input to our method contains a sequence of images, a DTM and an initial guess for the first image. Before we solve the optimization problem, we first need to extract 2D observations, i.e. tracks, and set an initial value for all the parameters: camera locations and orientations and 3D locations of the features.

4.4.1 Image Dilution

When images are taken from adjacent locations, as in a video stream, using all the images significantly increases the number of BA parameters, and specifically the number of image pose parameters, which, as described in Section 2.5, are critical to the time performance. In those cases, some dilution of the images to be estimated is needed. The dilution should be such that the tracks are long enough; as a rule of thumb, a 65% overlap between the images is sufficient. A simple method for dilution is marking an image for estimation every X frames. This method assumes that the camera is moving at a constant speed. A more precise method is choosing an image that has X degrees of baseline, e.g. 3 degrees, relative to the last estimated image. In both cases, all the images can be used to help matching tracks between the estimated images. In our experiments, the images were sparse enough and no dilution was needed.

4.4.2 Feature Extraction

There are various methods to extract 2D features; we used a tracking method that is based on the Scale Invariant Feature Transform (SIFT). In the first stage, we extract a predefined number of features, e.g. 200, from the first image and try to match them to 2D features that were extracted in the second image. Assuming that some features are not successfully matched, new features are extracted in the second image so that the total number of features does not change. In the third image, the existing features are first matched to extend the existing tracks, and new features are then extracted to compensate for the failed matches, and so on. Features that are not matched, i.e. have a track of length one, are rejected, as there is no baseline for them and their 3D location cannot be estimated.

4.4.3 Outlier Detection

Some of the tracks contain wrong matches, i.e. 2D observations of different 3D features. Since the BA is eventually an SSE algorithm, such outliers may drift the algorithm away from the desired estimate. Therefore, we apply several outlier detections after the features are extracted. We used outlier detection based on RANdom SAmple Consensus (RANSAC) [FB81] with Fundamental and Affine models on the 2D observations. In addition, we also used outlier detection based on the Root Mean Square (RMS) error of the 2D constraints of the Bundle Adjustment: after each iteration of the Bundle Adjustment, we rejected 2D observations whose error was more than 3σ above the average RMS.

4.4.4 Parameters Initial Guess

Before starting to solve the Bundle Adjustment, we first need to set initial values for the parameters $P_i, T^t, \Psi^t$. The initial values for $T^1, \Psi^1$ are set to the initial guess given

as an input. The initial guesses for the 3D locations of the tracks seen by the first image are calculated by ray tracing the DTM using $T^1, \Psi^1$ and the 2D observations of those features on the first image. Then, the location and orientation of the second image, $T^2, \Psi^2$, are calculated using features whose 3D location $P_i$ was calculated in the previous stage, i.e. features that are common to the first and second images. This can be done using a Bundle Adjustment in which the only parameters are the second image pose parameters, $T^2, \Psi^2$, the measurements are the 2D observations on that image, and the feature locations are fixed. This BA has only 6 parameters and is solved very quickly. The previous steps are repeated for all the images and features.

4.4.5 Solving the Bundle Adjustment

As described in Section 4.3, the surface approximation depends on the estimated point on the DTM, $P_i^{DTM}$, which itself depends on the camera estimate. Since at the beginning of the process the camera estimate is very poor, and so is $P_i^{DTM}$, the surface approximation and obviously the solution of the BA will be inaccurate as well. Therefore, we repeat the surface approximation several times as the camera pose becomes more accurate. After each time the Bundle Adjustment converges, we use the new camera poses to recalculate $P_i^{DTM}$ and the surface approximation, and repeat the Bundle Adjustment estimation. Since at the beginning the surface approximation is not good, there is no point in letting the BA fully converge, since we only want its coarse step; it can converge in a later iteration anyway. Therefore, we start with a small number of BA iterations, and as the surface approximation gets better, we increase the number of BA iterations. In our experiments, we started with 20 BA iterations, and increased them by 10 iterations for every surface approximation. We found that on average, after 7 surface approximations, the whole solution converges.
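The outer loop just described might look as follows; `ba` and `dtm` are hypothetical objects standing in for the tracker, optimizer and terrain model, so every method called on them is an assumed, illustrative API rather than the thesis implementation.

```python
def solve_with_dtm_constraints(ba, dtm, num_outer=7, iters0=20, iters_step=10):
    """Outer loop of Section 4.4.5: alternate surface re-approximation with partial BA runs."""
    max_iters = iters0
    for _ in range(num_outer):
        # Re-approximate the local surface h_i(x, y) around each point, using the current poses.
        for track in ba.tracks:
            p_dtm = dtm.backproject(ba.pose_of(track.first_view), track.first_observation)
            track.surface = dtm.fit_quadric_around(p_dtm)   # Eqs. (4.6)-(4.7)
        # Run a limited number of LM iterations; full convergence is only needed in later rounds.
        ba.optimize(max_iterations=max_iters)
        max_iters += iters_step
    return ba.solution()
```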

4.5 Influence on the Bundle Adjustment Structure and Computation Time

4.5.1 Bundle Adjustment Structure

First, let us evaluate the number of constraints added. For every 3D point we add a single constraint. On the other hand, for every 2D observation of that feature there are 2 constraints, one per image axis. For example, if a feature is seen by 10 images, it has 20 constraints from 2D observations, and therefore, in terms of matrix sizes, our new constraint is quite negligible. As mentioned before, no new parameters are added. Now let us examine the influence of our constraint on the sparseness of the Bundle Adjustment by examining the change in Equation 2.12, where we denote by $\text{constraint}^{DTM}_i$ the new constraint for $P_i$:

- Jacobian $J$ - We add a single row for every $\text{constraint}^{DTM}_i$. The number of columns does not change, since no new parameters were added. As for the sparseness, since the new constraint depends only on the 3D location of its feature, i.e. only $\frac{d\,\text{constraint}^{DTM}_i}{dP_i}$ is nonzero, the new row is very sparse.
- Weight matrix $W$ - For every new constraint a single row and column are added to the weight matrix. This is negligible in terms of the matrix size, for the same reason given in the previous item. Since the new constraint is independent of the other constraints, the only value filled in the new row and column is the value on the diagonal of the weight matrix, representing the variance of the constraint.
- Hessian $J^T W J$ - Since the number of columns of $J$ is unchanged, the size of the Hessian matrix is unchanged as well. As for the sparseness, the constraint depends only on its feature location, so the only second derivative that is not zero is $\frac{d^2\,\text{constraint}^{DTM}_i}{dP_i^2}$; equivalently, the only non-zero inner products involving the new row occur between the columns of $P_i$ and themselves. Therefore the new constraints only contribute values to the diagonal blocks of the Hessian (the $N_1$ part), which are already filled by other constraints.

To conclude, our new constraints do not influence the structure of the Bundle Adjustment.

4.5.2 Computation Time

There are three stages in our proposed method that influence the computational time. The first is the computation of the approximated point $P_i^{DTM}$ by ray tracing from an estimated camera. This stage is relatively time consuming, but since it is done only once for every feature and prior to the Bundle Adjustment estimation, it is negligible relative to the estimation of the Bundle Adjustment. This stage can also be accelerated using GPU computation. The second stage is the surface approximation given $P_i^{DTM}$. Each surface approximation takes in the worst case $O(mn^2)$, where $m$ is the number of DTM points used and $n$ is the number of parameters in the surface approximation. As mentioned in Chapter 2, the bottleneck of computing a single step in the BA is inverting a matrix of size $6k$, where $k$ is the number of cameras. So, for a scenario with 100 images and 1000 features, solving a single step in the BA optimization method will take at least 200M cycles, whereas the approximation of 1000 features will take about 1M cycles. Therefore, the additional computational time added by the approximation stage is quite negligible. The third stage where we might add computational time is in the numeric step calculation, due to the new constraints. There are two main stages in calculating the numeric step: calculating $J^T W J$ and inverting it. Since the new constraints do not change the size or the sparseness of the Hessian, they can only influence the first stage - calculating $J^T W J$. To evaluate that, let us look again at the example above, where

4.5.2 Computation Time

There are three stages in our proposed method that influence the computation time. The first is the computation of the approximated point $P_i^{DTM}$ by ray tracing from an estimated camera. This stage is relatively time consuming, but since it is performed only once per feature and prior to the Bundle Adjustment estimation, it is negligible relative to the Bundle Adjustment itself. It can also be accelerated using GPU computation.

The second stage is the surface approximation given $P_i^{DTM}$. Each surface approximation takes in the worst case $O(mn^2)$, where $m$ is the number of DTM points used and $n$ is the number of parameters in the surface approximation. As mentioned in Chapter 2, the bottleneck of computing a single step of the BA is inverting a matrix of size $6k$, where $k$ is the number of cameras. So, for a scenario with 100 images and 1000 features, solving a single step of the BA optimization takes at least 200M cycles, whereas approximating the surfaces of 1000 features takes about 1M cycles. Therefore, the additional computation time added by the approximation stage is negligible.

The third stage where we might add computation time is the numeric step calculation due to the new constraints. There are two main parts in calculating the numeric step: computing $J^T W J$ and inverting it. Since the new constraints change neither the size nor the sparseness of the Hessian, they can only influence the first part - computing $J^T W J$. To evaluate that, let us look again at the example above, where there are 100 images and 1000 features, and let us assume that every feature is seen by 10 images. There are therefore 1000 x 10 x 2 = 20k constraints from 2D observations and 600 constraints on GPS and INS measurements (6 per image - 3 for location and 3 for angles). Our proposed method adds a single constraint per feature, 1000 altogether. Even if we do not remove the GPS and INS constraints, we increase the number of constraints by less than 10%. Given that the main bottleneck is inverting the Hessian, our constraints add very little to the computation time, if anything. In our experiments, we did not see any increase in the computation time.

4.6 Handling Errors

This section describes the different methods we used to handle DTM inaccuracy and outliers. Our surface approximation might not be an accurate representation of the surface for several reasons:

- The DTM represents only the topographic surface, and in many scenarios, such as wooded areas, this does not represent the surface correctly.
- When the DTM has low resolution, subtle changes between two sampled points are not represented by the DTM.
- Inaccuracy of the height values of the DTM.
- The surface approximation is not good enough, due to the topography or due to the limitations of our approximation.

To handle the surface approximation inaccuracy, we change the covariance of the distance constraint: the more inaccurate the surface approximation, the higher the covariance, and vice versa. For example, in a desert area, where the DTM should be accurate and the surface approximation should suffice, the covariance will be low. In a wooded area, on the other hand, the covariance will be high, as the DTM does not model the trees. The value of the covariance can be decided offline, before the estimation begins, manually or automatically, by examining the region of interest (ROI). The covariance can also be updated during the estimation, as sketched below. For example, if after convergence most of the 3D points do not lie close to the surface (and the error cannot be explained by a constant drift, in which case the BA is expected to solve it), the covariance can be increased in the next iteration. On the other hand, if the distances between the surface and the points are small, the covariance can be reduced in the next iteration.
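One possible form of such an update rule is sketched below. This is an assumption made for illustration only; the thesis does not prescribe a specific formula. The idea is to rescale the constraint variance from the point-to-surface distances measured after convergence, while ignoring a constant offset that the BA itself can absorb.

    import numpy as np

    def update_dtm_variance(distances, current_var, min_var=0.25, max_var=100.0):
        """Hypothetical update of the DTM-constraint variance (squared meters)
        from the point-to-surface distances measured after the BA converged."""
        d = np.asarray(distances, dtype=float)
        # Ignore a constant offset (a uniform height bias can be absorbed by the
        # BA) and look only at the scatter of the residuals around their mean.
        spread = np.std(d - np.mean(d))
        # Blend the previous variance with the observed scatter and keep it
        # inside sensible bounds.
        new_var = 0.5 * current_var + 0.5 * spread ** 2
        return float(np.clip(new_var, min_var, max_var))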

Another type of error is outliers. An outlier is a 3D point whose distance from the approximated surface is significantly larger than that of the other 3D points. Outliers can occur for several reasons:

- 3D points that lie on edges in the scene. For example, consider a point on a cliff edge. A small error in the camera location will generate, at the ray tracing stage, an approximated DTM point far away from the correct point, for example in the valley beneath the cliff. Since the 2D observations force the point to be at the height of the cliff while the DTM constraint pulls the point towards the valley, the error on the constraint will be very large.
- Urban areas, where there are large buildings that are not modeled by the DTM and therefore produce large errors. In addition, these errors might not be decorrelated. For some urban areas there are data sources, such as Digital Surface Models (DSM), that map the urban surface; these models can be used to better approximate the surface in urban areas (see Chapter 7).

Since the error model we use for the new constraints cannot handle this kind of errors, we perform an outlier detection after each convergence of the Bundle Adjustment, as sketched below: we sort the 3D points by their error values and remove the DTM constraints of the top x% (we used 3% in our experiments). The other constraints on these points, such as the 2D observations, are not removed.

Another type of error is a bias in the DTM height values. This kind of error cannot be corrected by our new constraints, as they are insensitive to a bias common to all points. We further assume that there is no local drift in the DTM values, i.e. no areas where the DTM height values suddenly drift.
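A minimal sketch of this rejection step, assuming the per-point distances to the approximated surface are already collected in an array (illustrative code, not the thesis implementation):

    import numpy as np

    def keep_after_dtm_outlier_rejection(residuals, fraction=0.03):
        """Indices of the points whose DTM constraint is kept after removing the
        `fraction` of points with the largest point-to-surface residuals.
        Only the DTM constraints are dropped; the 2D observations remain."""
        residuals = np.asarray(residuals, dtype=float)
        n_remove = int(np.ceil(fraction * residuals.size))
        if n_remove == 0:
            return np.arange(residuals.size)
        order = np.argsort(residuals)        # ascending; largest residuals last
        return np.sort(order[:-n_remove])    # constraints that survive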

Chapter 5

Iterative Closest Point

5.1 Introduction

To compare against our proposed method, we used the Iterative Closest Point (ICP) algorithm, which is a common method for calculating the transformation between a DTM and a reconstructed scene. The Iterative Closest Point algorithm was presented independently by Besl and McKay [BM92] and by Chen and Medioni [CM92]. An overview of different extensions and an experimental evaluation can be found in [SMFF07]. All the different methods receive as input two sets of 3D points, not necessarily of the same size, and return the transformation between them, represented by a rotation matrix $R \in \mathbb{R}^{3\times 3}$ and a translation vector $t \in \mathbb{R}^{3\times 1}$. Some versions require additional information, such as Chen's method, which requires the points' normals. Chen and Besl assumed that the sets have the same scale. Zinber et al. [ZSN05] presented an ICP variant that calculates the scale in addition to the transformation between the sets. We found Zinber's method to be important since in some cases the reconstructed scene has scale errors, mainly due to velocity errors in the INS. The next section describes the ICP algorithm in some detail.

5.2 Iterative Closest Point

The basic structure of the algorithm repeats the following two steps until convergence: matching points and calculating the transformation. In the first step, every point in the first (source) set is matched to a point in the second (target) set by minimizing the Euclidean distance; outlier detection is used to reject badly matched pairs. In the second step, the best motion that aligns the matched points is calculated and applied to the source set. The algorithm stops when the registration change is below a specified threshold.

The simplest strategy for the matching stage is to find, for each point in the source set, the closest point in the target set by minimizing the Euclidean distance. Other methods select a mixture of source and target points by sampling the sets.

Since this operation of finding the closest point is time-consuming, an optimized data structure is used; most methods use an optimized KD-tree. At the end of this step, a set of matched pairs is generated. Denoting the source set by $A = \{a_1, \dots, a_n\}$ and the target set by $B = \{b_1, \dots, b_m\}$, the generated pairs set is:

$$S = \{(a_i, b_j) \mid a_i \in A,\ b_j \in B\}$$

Pairs whose distance is large compared to the others are rejected.

Besl and Chen differ in the transformation calculation. While Besl uses only the points, Chen's method requires the points' normals. Zinber extended Besl's method by calculating the scale as well. Here is a brief description of the methods.

Besl uses the sum of squared distances of the corresponding point pairs as the error measure, and therefore minimizes the following cost function:

$$\arg\min_{R,t} \sum_{(i,j)\in S} \| b_j - R a_i - t \|^2 \qquad (5.1)$$

A comparison of different methods for solving this problem can be found in [ELF97]. We chose the SVD method since it provides stability, a high level of accuracy and speed. First, the centers of mass of the two sets are found:

$$\bar{a} = \frac{1}{|S|} \sum_{(i,j)\in S} a_i, \qquad \bar{b} = \frac{1}{|S|} \sum_{(i,j)\in S} b_j \qquad (5.2)$$

Centering the two sets yields the following minimization problem:

$$R = \arg\min_{R} \sum_{(i,j)\in S} \left\| (b_j - \bar{b}) - R(a_i - \bar{a}) \right\|^2 \qquad (5.3)$$

and the problem is solved by computing the SVD:

$$U D V^T = \sum_{(i,j)\in S} (b_j - \bar{b})(a_i - \bar{a})^T \qquad (5.4)$$

The solution is set to be:

$$R = U V^T, \qquad t = \bar{b} - R\bar{a} \qquad (5.5)$$

In case $R$ is not a proper rotation matrix, i.e. it represents a reflection, its third column is multiplied by $-1$.
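A compact implementation of this SVD-based alignment step (Equations 5.2-5.5) is shown below for matched pairs arranged as two equally sized arrays. It is a minimal illustration of the update performed inside each ICP iteration, not the exact code used in this work; the reflection fix shown is one common way of enforcing a proper rotation.

    import numpy as np

    def rigid_alignment(src, dst):
        """Least-squares R, t mapping the source points onto the target points,
        for matched pairs given as (n, 3) arrays (Equations 5.1-5.5)."""
        src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
        src_c, dst_c = src - src_mean, dst - dst_mean
        H = dst_c.T @ src_c                   # cross-covariance of Eq. (5.4)
        U, D, Vt = np.linalg.svd(H)
        R = U @ Vt
        if np.linalg.det(R) < 0:              # reflection: enforce det(R) = +1
            U[:, -1] *= -1
            R = U @ Vt
        t = dst_mean - R @ src_mean           # Eq. (5.5)
        return R, t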

Zinber estimated the scale, $s$, in addition to $R, t$. The cost function is set to be:

$$\arg\min_{R,t,s} \sum_{(i,j)\in S} \| b_j - sRa_i - t \|^2 \qquad (5.6)$$

and equation (5.3) becomes:

$$R, s = \arg\min_{R,s} \sum_{(i,j)\in S} \left\| (b_j - \bar{b}) - sR(a_i - \bar{a}) \right\|^2 \qquad (5.7)$$

Adding the scale only changes the matrix $D$ in the SVD and therefore does not change the calculation of $R$. $s$ can then be calculated directly from:

$$s = \arg\min_{s} \sum_{(i,j)\in S} \left\| (b_j - \bar{b}) - sR(a_i - \bar{a}) \right\|^2 \qquad (5.8)$$

and is set to be:

$$s = \frac{\sum_{(i,j)\in S} (b_j - \bar{b})^T R (a_i - \bar{a})}{\sum_{(i,j)\in S} (a_i - \bar{a})^T (a_i - \bar{a})} \qquad (5.9)$$

The translation vector is updated by $t = \bar{b} - sR\bar{a}$.
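Continuing the previous sketch, the scale of Equation 5.9 can be computed from the same centered sets once the rotation is known (again an illustration only, reusing rigid_alignment and numpy from the block above):

    def similarity_alignment(src, dst):
        """R, t and scale s for matched (n, 3) point pairs, Eqs. (5.6)-(5.9)."""
        R, _ = rigid_alignment(src, dst)          # rotation is unaffected by the scale
        src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
        src_c, dst_c = src - src_mean, dst - dst_mean
        rotated = src_c @ R.T                     # R (a_i - a_mean) for every pair
        s = np.sum(dst_c * rotated) / np.sum(src_c * src_c)   # Eq. (5.9)
        t = dst_mean - s * (R @ src_mean)         # updated translation
        return R, t, s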

Chen's method differs in the cost function. Denoting $\tilde{a}_i = [a_i\ 1]^T$, $\tilde{b}_j = [b_j\ 1]^T$ and $n_j$ as the normal of point $b_j$, Chen's method minimizes the following cost function:

$$\arg\min_{M} \sum_{(i,j)\in S} \left( (M\tilde{a}_i - b_j) \cdot n_j \right)^2 \qquad (5.10)$$

where $M = [R\ \ t]_{3\times 4}$ is the transformation between the source set and the target set. By approximating $R$ for small angles, $M$ can be written as

$$M = \begin{bmatrix} 1 & -\gamma & \beta & t_x \\ \gamma & 1 & -\alpha & t_y \\ -\beta & \alpha & 1 & t_z \end{bmatrix}$$

The set of equations in 5.10 is now linear in $\alpha, \beta, \gamma, t$ and can easily be solved. Similarly to Zinber, we added the scale parameter, $s$, to the cost function:

$$\arg\min_{M} \sum_{(i,j)\in S} \left( (M\tilde{a}_i - b_j) \cdot n_j \right)^2, \qquad \text{where } M = [sR\ \ t]_{3\times 4} = \begin{bmatrix} s & -s\gamma & s\beta & t_x \\ s\gamma & s & -s\alpha & t_y \\ -s\beta & s\alpha & s & t_z \end{bmatrix}$$

Denoting $\gamma' = s\gamma$, $\beta' = s\beta$, $\alpha' = s\alpha$, $M$ can be rewritten as:

$$M = \begin{bmatrix} s & -\gamma' & \beta' & t_x \\ \gamma' & s & -\alpha' & t_y \\ -\beta' & \alpha' & s & t_z \end{bmatrix}$$

After calculating $M$, the angles can be extracted by normalizing $\alpha', \beta', \gamma'$ by $s$.
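Under the small-angle approximation, one step of this point-to-plane alignment reduces to an ordinary linear least-squares problem. The sketch below solves the unscaled form of Equation 5.10 for matched points and target normals; it illustrates the linearization and is not necessarily the implementation used here.

    import numpy as np

    def point_to_plane_step(src, dst, normals):
        """One linearized point-to-plane update: returns (alpha, beta, gamma, t)
        minimizing Eq. (5.10) for M = [R t] with the small-angle rotation."""
        # With R ~ I + [omega]x and omega = (alpha, beta, gamma), each pair gives
        # one linear equation:  (a x n) . omega + n . t = (b - a) . n
        A = np.hstack([np.cross(src, normals), normals])     # (n, 6) design matrix
        rhs = np.einsum('ij,ij->i', dst - src, normals)      # (n,) right-hand side
        x, *_ = np.linalg.lstsq(A, rhs, rcond=None)
        alpha, beta, gamma = x[:3]
        return alpha, beta, gamma, x[3:]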

Chapter 6

Experimental Results

6.1 Introduction

We evaluated our work in two types of experiments: synthetic and real images. The first was conducted to evaluate our work in a controlled environment, where different types of errors were added with known error models and magnitudes. We also compared it with a regular Bundle Adjustment followed by a transformation to the DTM using the Iterative Closest Point (ICP) algorithm. In the second experiment, we wanted to evaluate our proposed method on a real sequence of images. While we wanted to use images taken by an airplane, we could not find images with ground truth data as described in Section 6.3. Instead, we used a small scale model and images taken from a robotic arm with known pose and orientation. In Section 6.2 the synthetic experiment and its results are detailed, and in Section 6.3 the results of the image sequence experiment are described.

6.2 Synthetic Experiments

In this section, we evaluated our proposed method on synthetic data and compared it to a Bundle Adjustment with geographical measurements of the camera pose and orientation, as described in Chapter 2, followed by a transformation to the DTM calculated by ICP. We also evaluated a third method, in which we checked whether the remaining error in our proposed method could be further reduced using ICP as a follow-up step.

The input to our proposed method contained 2D observations, i.e. tracks, and an initial guess for the first image pose and orientation. The output is the final camera poses together with the 3D locations of the tracked features. While the output of the BA with GPS and INS is the same as in our proposed method, the input is different: in addition to the 2D observations, it also contained pose measurements for every image.

The input to the ICP algorithm, in both cases it was used, contained two sets of points: the 3D locations of the tracked features calculated by the BA, and the DTM grid points. Each 3D point in the DTM point set represents a point in the DTM grid together with its height value. The output of the ICP algorithm is the transformation between the BA and the DTM. The final output of the methods that use ICP is the BA's output transformed by the ICP's transformation.

The generation of the input data is described in Figure 6.1 and contained the following steps:

1. Generating the true trajectory of the camera over the given DTM.
2. Sampling the DTM. Each sample represents the true 3D location of a feature.
3. Adding Gaussian noise to the DTM samples. In real scenarios, the DTM does not represent the true surface and contains errors. Some of the errors are due to DTM inaccuracy and some are due to objects in the scene that do not lie on the DTM surface; see Section 4.6 for more details. These errors were modeled by adding Gaussian noise to the points sampled from the DTM.
4. Projecting the sampled points onto the cameras using the true trajectory of the camera to generate the 2D observations. These are the accurate projections of the features.
5. Adding Gaussian noise to the 2D observations to model the errors of tracking algorithms such as the Scale Invariant Feature Transform.
6. Adding Gaussian noise to the camera pose, location and orientation.
7. Adding drift errors to the camera pose, location and orientation. This models errors in the Inertial Navigation System.
8. Running all methods - BA followed by ICP, the proposed method, and the proposed method followed by ICP - on the noisy data.
9. Evaluating the performance by measuring the error between the estimated data (camera pose and 3D locations of the features) and the true values.

In all the experiments listed below, the specified errors were in addition to a basic set of errors:

Error Type              Value
Camera location STD     10 m
Camera angles STD       1 deg
Camera location drift   20 m
Point noise STD         1 m
Pixel STD               0.5 pixel
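Steps 3-7 amount to adding Gaussian perturbations and an accumulating drift, with the values in the table used as standard deviations. The snippet below is a schematic illustration with assumed array shapes and units (meters, degrees, pixels); it is not the actual experiment code.

    import numpy as np

    rng = np.random.default_rng(0)

    def perturb_inputs(dtm_samples, observations, cam_locations, cam_angles,
                       point_std=1.0, pixel_std=0.5, loc_std=10.0, ang_std=1.0,
                       drift_total=20.0):
        """Add Gaussian noise and a linear location drift to the synthetic inputs.
        Assumed shapes: dtm_samples (p, 3), observations (k, p, 2),
        cam_locations (k, 3), cam_angles (k, 3)."""
        noisy_points = dtm_samples + rng.normal(0.0, point_std, dtm_samples.shape)
        noisy_obs = observations + rng.normal(0.0, pixel_std, observations.shape)
        noisy_locs = cam_locations + rng.normal(0.0, loc_std, cam_locations.shape)
        noisy_angs = cam_angles + rng.normal(0.0, ang_std, cam_angles.shape)
        # Location drift growing linearly along the trajectory, mimicking an
        # INS velocity error that accumulates from image to image.
        k = cam_locations.shape[0]
        drift_dir = rng.normal(size=3)
        drift_dir /= np.linalg.norm(drift_dir)
        noisy_locs += np.outer(np.linspace(0.0, drift_total, k), drift_dir)
        return noisy_points, noisy_obs, noisy_locs, noisy_angs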

Figure 6.1: The input generation in the synthetic experiments.

Figure 6.2: The synthetic scenario. The DTM used in the synthetic experiment and the camera locations, marked with blue pluses.

We used a DTM from the Shuttle Radar Topography Mission (SRTM) [12] project in an area near Haifa, Israel. The camera was located 3000 meters above the ground, with 10 images 40 meters apart, see Figure 6.2. We increased the grid density from 90 meters to 25 meters so that it represents an average DTM. We ran Monte Carlo simulations for increasing velocity and angular drift errors. For every error configuration, 1000 samples of noisy input were generated and the output of all the algorithms was recalculated and evaluated. Figures 6.3 and 6.4 show the average camera location error of the three methods for the different axes and noise levels. Figure 6.5 shows the average norm error of the camera location. Some observations from the figures:

- The method of BA with GPS and INS followed by ICP did not converge for some errors. This can be seen in the large STD of the X axis error of that method in Figure 6.3. In contrast, the BA based on the DTM converged for all the errors.
- The error of our proposed method did not increase as the noise increased, in contrast to BA with GPS and INS followed by ICP.
- The best method is our proposed method followed by ICP, the second is our proposed method, and BA with GPS and INS followed by ICP is third. In addition, the improvement between the first and second methods is less dramatic than the improvement between the second and third methods.

The first two points can be explained by the fact that our method uses only the initial guess of the first image, and is therefore built to overcome large initial errors. On the other hand, the structure of the scene reconstructed by the regular BA is sensitive to the camera measurement errors, especially drift errors, and the ICP algorithm cannot fix errors in the scene structure. The improvement the ICP algorithm added to our proposed method can be explained by the inaccuracy of our surface approximation and the approximation we made to the distance function. This should be further examined as part of the future work.

6.3 Experiments on Small Scale Model

In addition to the synthetic experiments, we also wanted to test our proposed method on real images captured in a scenario as close as possible to common photogrammetric scenarios. One of our most important considerations in choosing the experiment framework was our ability to evaluate the performance of our proposed method, similarly to what we did in the synthetic experiment. We wanted to make sure we had true measurements of the pose of the camera at the time the images were taken. Therefore, we decided to perform the experiments using a small scale model of a surface and a camera assembled on a robotic arm able to record its pose in space.

Figure 6.3: Camera's Location Error vs. Angular Drift. The camera's location error in each axis. Three types of BA are compared: BA with GPS together with ICP (blue), BA with DTM (green) and BA with DTM and ICP (red). As can be seen in the second graph, BA with GPS (blue) did not always converge (hence the large STD values), which can indicate that the method is sensitive to noise in the initial starting point. Therefore, the median error (rather than the average) was calculated for that method, and in addition the STD graphs of that method were removed from the other graphs.

Figure 6.4: Camera's Location Error vs. Velocity Error. The camera's location error in each axis. Three types of BA are compared: BA with GPS together with ICP (blue), BA with DTM (green) and BA with DTM and ICP (red). As can be seen in the second graph, BA with GPS (blue) did not always converge (hence the large STD values), which can indicate that the method is sensitive to noise in the initial starting point. Therefore, the median error (rather than the average) was calculated for that method, and in addition the STD graphs of that method were removed from the other graphs.

Figure 6.5: Camera's Location Error Norm. The norm of the camera's location error for velocity and angular drift errors. Three types of BA are compared: BA with GPS together with ICP (blue), BA with DTM (green) and BA with DTM and ICP (red). One can see that for velocity errors, BA+GPS+ICP drifts in an unbounded manner while the two methods based on BA+DTM do not. This is due to the fact that the BA+DTM methods use only the location of the first image and are therefore less sensitive to velocity drift.

In Figure 6.6, we compare the slopes of the model with the slopes of a real scenario to show that the surface is realistic. By simultaneously capturing still images of the surface and the robotic arm orientation, we were able to record the true measurements of the camera orientation needed for the algorithm evaluation.

Figure 6.6: Slopes Comparison. To verify that the model is realistic, we compared the model's slope (blue) with the slope of an area in southern Israel (red). It can be seen that they have the same magnitude. The slopes were calculated from the directional derivatives of the model and of the SRTM of southern Israel.

6.3.1 Experiment Outline

Here is a short description of the different components used in the experiments:

Camera - We used a camera with a field of view of X degrees and an image size of 1280X1204.

Surface model - We used a 53x79x11 cm sandbox. The DTM of the model was generated using depth scans from a Kinect device. The scale of the model is one centimetre to 25 meters, i.e. 1:2500. It was calculated by comparing the average width of a mountain in the model (20 cm) with real mountains (500 m). The accuracy of the DTM height values is around 0.5 cm, which is equivalent to 12.5 m. To make the model a little more realistic, we added 3D objects (a car, a bridge, vegetation) to the model after it was scanned, so they are not modeled in the DTM.

Robotic arm - The camera was assembled on a robotic arm that was able to capture its 6 degrees of freedom with the following accuracy: 1 cm (10 m after scaling) in location and 0.5 degree in angle.

6.3.2 Evaluation methods

To evaluate the performance of our proposed method in the experiments, we used several comparison and evaluation methods:

- Comparing the estimated camera pose, location and angles, with the ground truth. The advantage of this method is that it directly compares the pose of the camera, as opposed to the other methods, which check the alignment of the Bundle Adjustment points to the surface. A good alignment of the Bundle Adjustment points does not necessarily indicate accurate camera estimates.
- Reconstructing the height values of the surface and comparing them to the DTM (a small sketch of this comparison is given below). The height values were reconstructed from a dense 3D point set that was generated by triangulating dense 2D features extracted from the images. The triangulation was done using the camera orientations estimated by the Bundle Adjustment.
- Re-projecting the images onto the surface. By coloring the surface with the original images using the estimated camera orientations, we were also able to verify the accuracy of the solution with the naked eye.

6.3.3 Experiment Flow

The initial guess used in the experiments was the orientation of the first image only, estimated manually by roughly guessing the distance and pitch angle of the first image. The inputs to the Bundle Adjustment are therefore: the images, the initial guess of the first image and the Digital Terrain Model. The output of the Bundle Adjustment was then evaluated using the methods described above.

6.3.4 Results

We ran two experiments that represent different flight patterns. The first experiment illustrated a photogrammetric flight, where the camera passed over the model in a straight line, see Figure 6.7. The second experiment illustrated a camera approaching the surface, similar to a landing flight or a missile approaching its target, see Figure 6.8. Both patterns started from a distance of 1 m from the surface (equivalent to 2.5 km). Figures 6.7 and 6.8 show several of the images used in the experiments, and Figure 6.9 shows the estimated and measured paths relative to the DTM. Figures 6.10 and 6.11 show the location and angle errors of the estimated cameras. Figure 6.12 shows the error between the reconstructed height values and the DTM. Some observations from the experiments:

- The errors in the experiment are of the magnitude of the robotic arm accuracy, below 1 cm in location and 1 degree in angles, which indicates that our method converged.

Figure 6.7: Images from the first experiment.

Figure 6.8: Images from the second experiment.

Figure 6.9: The estimated and measured camera path. The red and blue crosses represent the measured and estimated camera locations, respectively. The blue dots represent the locations of the Bundle Adjustment 3D points. Note that the Bundle Adjustment points lie on the DTM.
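A minimal sketch of the height-value comparison (the second evaluation method in Section 6.3.2), assuming the dense reconstructed points are available as an (n, 3) array and the DTM as a regular grid; scipy's grid interpolator looks up the DTM height under each reconstructed point:

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    def height_errors(points, grid_x, grid_y, grid_z):
        """Difference between the reconstructed point heights and the DTM height
        interpolated at the same planimetric position.
        points: (n, 3) reconstructed [x, y, z]; grid_z: (len(grid_x), len(grid_y))."""
        dtm = RegularGridInterpolator((grid_x, grid_y), grid_z,
                                      bounds_error=False, fill_value=np.nan)
        dtm_heights = dtm(points[:, :2])      # DTM height under each point
        return points[:, 2] - dtm_heights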
