
ESA ExoMars Rover PanCam System Geometric Modeling and Evaluation

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Ding Li

Graduate Program in Geodetic Science and Surveying
The Ohio State University
2015

Dissertation Committee:
Alper Yilmaz, Advisor
Alan Saalfeld
Ralph R. B. von Frese

Copyright by
Ding Li
2015

Abstract

The ESA ExoMars rover, planned to be launched to the Martian surface in 2018, will carry a drill and a suite of instruments dedicated to exobiology and geochemistry research. To fulfill its scientific role, high-precision rover localization and topographic mapping will be important for traverse path planning, safe planetary surface operations, and accurate embedding of scientific observations into a global spatial context. For these purposes, the ExoMars rover PanCam system will acquire an imagery network providing vision information for photogrammetric algorithms to localize the rover and generate 3-D mapping products. Since the design of the PanCam will influence the localization and mapping accuracy, quantitative error analysis of the PanCam design will improve scientists' awareness of the achievable accuracy and enable the PanCam design team to optimize the design for higher localization and mapping accuracy. In addition, a prototype camera system that meets the formalized PanCam specifications is needed to demonstrate the attainable localization accuracy of the PanCam system over long-range traverses. Therefore, this research has two goals. The first goal is to develop a rigorous mathematical model to estimate the localization accuracy of the PanCam system based on photogrammetric principles and error propagation theory. The second goal is to assemble a PanCam prototype according to the system specifications and to develop a complete vision-based rover localization method, from camera calibration and image capture through motion estimation and localization refinement. The vision-based rover localization method presented here is split into two stages: visual odometry processing, which serves as the initial estimate of the rover's movement, and the bundle adjustment technique, which further improves the localization through posterior refinement. A theoretical error analysis model for each of the localization stages has been established accordingly to simulate the rover localization error with respect to the traverse length. Additionally, a PanCam prototype was assembled with parameters similar to the latest technical specifications in order to systematically test and evaluate the ExoMars PanCam localization and mapping capabilities. The entire processing path, from system assembly, calibration, feature extraction and matching, to rover localization from field experiments, has been carried out in this research.

Dedication

To my family

Acknowledgments

First and foremost, I would like to express my gratitude to Dr. Alper Yilmaz, my advisor, whose unwavering support and guidance in the ever-changing frontier of science and technology has been essential to this work. I also wish to acknowledge the National Aeronautics and Space Administration for their financial support and interest in this project. Many thanks to the people who assisted in this project, particularly to Dr. Rongxing Li and Dr. Kaichang Di, for their previous guidance on this work. I am also grateful to the members of my dissertation committee, Dr. Alan Saalfeld and Dr. Ralph R. B. von Frese. I would also like to thank my fellow students in the Geodetic Science and Surveying program at The Ohio State University. Finally, I would like to thank my wife, Xia Wu, for her love, her companionship, and for helping me stick it out to the hard and yet sweet end of this graduate study.

Vita

B.E. Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China
M.S. Signal and Information Processing, Graduate University of Chinese Academy of Sciences, Beijing, China
M.S. Geodetic Science & Surveying, The Ohio State University
2011 to present: Ph.D. Candidate, Geodetic Science & Surveying, The Ohio State University

Publications

D. Li, R. Li, and A. Yilmaz. ESA ExoMars: Pre-Launch PanCam Geometric Modeling and Accuracy Assessment. ISPRS Archives, Technical Commission III Symposium, Volume XL-3, Zurich, Switzerland, September 5-7.

R. Li, D. Li, L. Lin, X. Meng, K. Di, G. Paar, A. Coates, J.P. Muller, A. Griffiths, J. Oberst, and D.P. Barnes. ExoMars: Pre-Launch PanCam Modeling and Accuracy Assessment. Abstract (2 pages) and Poster, 43rd Lunar and Planetary Science Conference, The Woodlands, Texas, March 19-23.

D. Li, C. Wang, H. Zhang, and B. Zhang. Automatic Ground Control Points Matching in Spaceborn SAR Imagery Using Phase-only Correlation. Chinese Journal of Scientific Instrument, Supplement, 2008, Vol. 4.

Fields of Study

Major Field: Geodetic Science and Surveying

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Publications
Table of Contents
List of Tables
List of Figures

1. INTRODUCTION
   1.1. Motivation and Research Objectives
   1.2. Relevant Studies
        1.2.1. Pipeline for vision-based rover localization
        1.2.2. Feature extraction and matching
        1.2.3. Motion estimation and refinement from corresponding features
   1.3. Organization of the Dissertation
2. CAMERA MODELS AND VISION-BASED STEREO VISUAL ODOMETRY
   2.1. Camera Models
        2.1.1. Linear camera model
        2.1.2. Nonlinear camera model
        2.1.3. Camera rotation matrix
   2.2. Two-view Geometry
        2.2.1. Epipolar geometry and the fundamental matrix
        2.2.2. Stereo calibration and rectification
        2.2.3. 3-D localization from stereo vision
   2.3. Feature Matching and Visual Odometry
        2.3.1. Feature extraction and description
        2.3.2. Stereo visual odometry based on local bundle adjustment
3. GEOMETRIC MODELING OF EXOMARS PANCAM
   3.1. ESA ExoMars PanCam System Components
   3.2. PanCam Geometric Modeling
        3.2.1. Mapping error analysis of PanCam stereo WAC
        3.2.2. Error analysis of stereo VO based on simplified two-stops model
        3.2.3. Improvement of mapping and localization by incorporating HRC
        3.2.4. Bundle adjustment of PanCam imagery network
   3.3. Simulated Results of PanCam Mapping and Localization
        3.3.1. Feature correspondences mapping accuracy from stereo triangulation
        3.3.2. Error propagation from feature points mapping to motion recovery in VO
        3.3.3. Accuracy improvement by adding HRC to stereo WAC
        3.3.4. BA processing for PanCam image network
4. PANCAM PROTOTYPE DESIGN AND CONTROL
   4.1. Selection of Vision Sensors and Rotary Platform
   4.2. PanCam Prototype System Calibration
        4.2.1. Feature point extraction and recognition for single camera calibration
        4.2.2. Boresight calibration for stereo rectification
   4.3. Stereo Matching with Constraints
        4.3.1. Stereo matching aided by angle information from rotary platform
        4.3.2. High-resolution HRC and low-resolution WAC image matching
   4.4. Long-Range Rover Localization through Visual Odometry and Bundle Adjustment
        4.4.1. Initial rover localization through stereo WAC data
        4.4.2. Refinement of rover's position by incorporating HRC
        4.4.3. Further improvement of rover's position by bundle adjustment
5. CONCLUSIONS AND FUTURE WORK
   5.1. Summary
   5.2. Future Work
REFERENCES

List of Tables

Table 1.1 Comparison of commonly used feature detectors: types and characteristics
Table 3.1 ExoMars and MER vision system camera specifications comparison
Table 3.2 ExoMars and MER vision system BA-based localization comparison
Table 4.1 Statistical information on rover trajectories based on various approaches

List of Figures

Figure 1.1 An artist's view of the ExoMars rover and onboard instruments (courtesy of ESA)
Figure 2.1 Pinhole camera illustration
Figure 2.2 Pinhole camera geometric model
Figure 2.3 Image plane and principal point
Figure 2.4 Radial distortion effect: (a) no distortion; (b) pincushion / negative radial distortion; (c) barrel / positive radial distortion; (d) fisheye distortion
Figure 2.5 Rotation of angle ω about the X axis
Figure 2.6 Rotation of angle ϕ about the Y axis
Figure 2.7 Rotation of angle κ about the Z axis
Figure 2.8 Epipolar geometry: (a) coplanarity condition for the camera centers, the object point, and its projections on the left and right image planes; (b) corresponding points constrained by the epipolar line
Figure 2.9 Stereo vision rectification based on relative orientation
Figure 2.10 Epipolar geometry for rectified stereo vision
Figure 2.11 Three image patches representing different types of regions
Figure 2.12 A series of DoG images for SIFT feature detection
Figure 2.13 Neighbor scan order in non-maximum suppression: (a) raster scan order; (b) spiral scan order
Figure 2.14 Stereo VO based on 6-DoF rigid transformation
Figure 3.1 ExoMars PanCam system configuration
Figure 3.2 Stereo VO configuration in rover traverses
Figure 3.3 Stereo configuration for HRC and WAC
Figure 3.4 Panoramas at adjacent traverse sites
Figure 3.5 Feature point distribution and associated error ellipses for the ExoMars WAC
Figure 3.6 Feature point distribution and associated error ellipses for the MER Navcam
Figure 3.7 Feature point distribution and associated error ellipses for the MER Pancam
Figure 3.8 Standard deviation of rover localization vs. number of feature points
Figure 3.9 Standard deviation of rover localization vs. distance the rover has travelled using only the stereo WAC
Figure 3.10 Feature point mapping accuracy improvement by adding the HRC
Figure 3.11 Comparison of rover localization results between using the stereo WAC only and the stereo WAC plus HRC
Figure 3.12 Feature point / landmark distribution between adjacent sites
Figure 3.13 Relative localization error vs. various FOVs in BA
Figure 3.14 Relative localization error with respect to traverse segment lengths
Figure 4.1 Workflow of rover localization using the PanCam prototype
Figure 4.2 PanCam prototype and mounting platform
Figure 4.3 3-D calibration field for camera geometric calibration
Figure 4.4 Three types of target pattern, from left to right: circle, circular checkerboard, and checkerboard
Figure 4.5 Detected targets in the 3-D calibration facility
Figure 4.6 Jacobian matrix for solving the boresight parameters
Figure 4.7 Stereo matching aided by the pan/tilt rotary platform
Figure 4.8 Feature points detected at a fixed scale for the low-resolution WAC (left) compared with feature points detected at varying scales for the high-resolution HRC (right)
Figure 4.9 PanCam prototype field experiment near Denver, Colorado (top) and the highlighted traverses (bottom) (courtesy of Google Maps)
Figure 4.10 Trajectories constructed from the stereo WAC compared with the GPS ground truth reference
Figure 4.11 Feature matching results at turns: feature matching available at small turns (left); feature matching failing at large turns (right)
Figure 4.12 Relationship between accumulated distances from stereo VO and GPS
Figure 4.13 Modified trajectories after removing the scale difference
Figure 4.14 Relationship between the elevation profile and accumulated traverse distance
Figure 4.15 Trajectories constructed from the stereo WAC and HRC compared with the GPS ground truth reference
Figure 4.16 Trajectories constructed from BA compared with the GPS ground truth reference

1. INTRODUCTION

1.1. Motivation and Research Objectives

The ExoMars Programme was established by the European Space Agency (ESA) to search for signs of present and past life on Mars, to study the surface environment and identify hazards to future human missions, and to demonstrate new technologies in preparation for a Mars sample return mission in the 2020s (ESA, 2014a). This program consists of two Mars missions: one is an Orbiter and an Entry, Descent and Landing (EDL) module to be launched in 2016, and the other is the ExoMars rover planned to be launched to the Martian surface in 2018 (ESA, 2014b). The rover will carry a drill and other instruments dedicated to exobiology and geochemistry research (Figure 1.1). As the ExoMars rover is designed to travel dozens of kilometers over the Martian surface, high-precision rover localization will be critical for traverse path planning, safe planetary surface operations, and accurate embedding of scientific observations into a global spatial context.

Figure 1.1 An artist's view of the ExoMars rover and onboard instruments (courtesy of ESA)

For such purposes, the ExoMars rover Panoramic Camera (PanCam) system will acquire images that are processed into an imagery network providing visual information for both computer vision and photogrammetry based algorithms to localize the rover. Since the design of the ExoMars PanCam will influence localization and mapping accuracy, quantitative error analysis of the PanCam design will improve scientists' awareness of the achievable level of accuracy and potentially enable the PanCam team to optimize its design to achieve the highest possible level of localization and mapping accuracy. Moreover, a prototype camera system built in accordance with the PanCam specifications is also needed to demonstrate the attainable localization accuracy of the PanCam system over long-range traverses. Therefore, this research has the following two goals:

1. Develop a rigorous mathematical model to estimate the localization accuracy of the PanCam system based on photogrammetric principles and error propagation theory.

2. Assemble a PanCam prototype according to the system specifications and develop a complete vision-based rover localization method, from camera calibration and image capture to motion estimation and localization refinement.

1.2. Relevant Studies

Generally, there are two types of practical rover localization techniques in a GPS-denied environment such as the Martian surface (Cheng et al., 2006; Maimone et al., 2007). The first relies on dead reckoning devices, which use an onboard inertial measurement unit (IMU) and wheel odometry to measure attitude and position changes. The second estimates the rover's motion based on the images of a single camera or multiple cameras mounted on the rover platform. Nister first introduced the term Visual Odometry (VO) to describe this second type of rover positioning technique (Nister et al., 2004).

However, wheel odometry can accumulate large errors over slippery surfaces of loose sand, mud, or steep slopes. In an experiment conducted by the Jet Propulsion Laboratory (JPL) in Johnson Valley, California, the test rover experienced substantial slippage of as much as 85% of the total traverse distance of 2.45 m over sandy slopes. The accumulated relative error from the dead reckoning devices went far beyond the desired level of 10%, whereas the error from VO remained less than 1% of the total traverse distance (Cheng et al., 2006).

In the National Aeronautics and Space Administration (NASA) Mars Exploration Rover (MER) 2003 mission, the rover Spirit overcame significant slippage when climbing Husband Hill in Gusev Crater using images from the rover's Navcam and Pancam (Li et al., 2008). In a practical rover mission, the dead reckoning method is often used in conjunction with VO to bridge gaps in featureless areas or where features are difficult to detect (Maimone et al., 2007).

1.2.1. Pipeline for vision-based rover localization

The vision-based rover motion estimation pipeline can be split into two stages (Matthies, 1989). The first stage finds corresponding features across adjacent image frames. Here the term feature refers to an image pattern that differs from its immediate neighborhood (Tuytelaars and Mikolajczyk, 2007): it is often a corner, an edge, or a blob. Typically, two approaches are used to find feature correspondences across adjacent frames. The first is to track features across multiple frames based on local search techniques such as correlation or least-squares (LS) methods (Moravec, 1980; Matthies and Shafer, 1987; Lacroix et al., 1999; Olson et al., 2003; Yilmaz et al., 2006). The second is to independently detect all the features on adjacent frames and match them based on the similarity of their local appearance (Mouragnon et al., 2006; Tardif et al., 2008; Scaramuzza et al., 2009). The feature tracking method is more suitable for situations where only a small motion change occurs between adjacent frames, whereas the feature matching method works well when large changes in motion or viewpoint are involved (Fraundorfer and Scaramuzza, 2012).

After detecting the features and finding their correspondences across adjacent frames, the second stage of vision-based motion estimation is to deduce the rover's attitude and position changes from these feature correspondences. Depending on the need to triangulate the features' positions in 3-D space, the motion estimation stage can be categorized into three different cases (Scaramuzza and Fraundorfer, 2011). The first case recovers the 6 degrees of freedom (DoF) of the rover's attitude and position changes directly from the feature correspondences. Most monocular VO systems that use a single camera opt for this method (Nister, 2004). The camera is normally calibrated to remove any lens distortion before use. The essential matrix governing the geometric relationship between feature correspondences across adjacent frames is then obtained after rejecting outliers in the feature correspondences using the RANSAC technique (Fischler and Bolles, 1981; Nister et al., 2004, 2006; Lhuillier, 2008; Tardif et al., 2008). Based on the singular value decomposition (SVD) of the essential matrix or the least-squares technique, the 6-DoF attitude and position changes between adjacent frames can be recovered (Horn, 1990; Hartley and Zisserman, 2004). However, the recovered position change is a relative estimate containing an unknown scale factor. This unknown scale can then be determined from absolute measurements of features in the scene or by integrating other sensors, such as the IMU or other ranging devices (Millnert et al., 2007).

For the binocular case involving a stereo camera, the use of a quadrifocal tensor has been proposed to describe the relationship between feature correspondences in adjacent stereo pairs. The 6-DoF motion can be recovered from the quadrifocal tensor directly in the image coordinates of the adjacent stereo pairs without triangulation (Comport et al., 2007). When the 6-DoF trajectory estimation involves triangulating the 3-D positions of the detected features, the two other motion estimation methods are introduced, i.e. motion from 3-D structure matching and motion from correspondences between 3-D structures and 2-D image features (Scaramuzza and Fraundorfer, 2011). The former finds the 6-DoF transformation between the two sets of 3-D features triangulated from adjacent stereo pairs by minimizing the Euclidean distance of the 3-D structure correspondences (Milella and Siegwart, 2006). The popular iterative closest point (ICP) algorithm recovers the 6-DoF transformation between two sets of 3-D point structures (Besl and McKay, 1992). The latter method minimizes the reprojection error in the image coordinates, i.e. the coordinate differences between the projections onto the current frames of the 3-D structure triangulated from previous frames and their correspondences obtained from feature matching. The popular algorithm used in this approach is known as perspective-n-point (PnP) in computer vision (Dhome et al., 1989; Gao et al., 2003; Moreno-Noguer et al., 2007) or space resection in photogrammetry (Keller and Tewinkel, 1966; DeWitt and Wolf, 2000; McGlone et al., 2004). The 3-D structure can also be refined based on the least-squares technique when solving for the optimal 6-DoF motion parameters (Mouragnon et al., 2006).

Motion from 3-D-to-2-D correspondences turns out to be more accurate than motion from 3-D-to-3-D correspondences, since 3-D triangulation introduces more error than reprojection in image space (Nister et al., 2004).

1.2.2. Feature extraction and matching

Obviously, features play a vital role in vision-based motion estimation. The motion estimation algorithm would not work if a sufficient number of feature correspondences could not be obtained. The general procedure for finding feature correspondences in the context of large viewpoint changes in VO applications involves the following three steps (Li and Allinson, 2008):

1) Independently find distinct features in each image;
2) Calculate a descriptor for each feature based on its immediate neighborhood;
3) Apply algorithms to find the closest matches between features in different images.

There have been a number of studies on feature extraction and description that focus on finding features that are both distinctive and invariant to geometric changes such as scaling, rotation, and affine transformation. Mikolajczyk and Tuytelaars did a comprehensive review of various kinds of invariant feature detectors (Mikolajczyk et al., 2005; Tuytelaars and Mikolajczyk, 2007).

The early work on feature extraction/detection goes back to corner detectors such as the Moravec corner detector (Moravec, 1977), the Forstner corner detector (Forstner, 1986), the Harris corner detector (Harris and Stephens, 1988), and the Shi-Tomasi corner detector (Tomasi and Shi, 1994). In this context, a corner means the intersection of two edges (Harris and Stephens, 1988). These corner detectors are fast to compute and easy to implement, but they all suffer under scale changes, because the detected corner locations are not consistent if the image is re-scaled to a different level (Schmid and Mohr, 1997). The last two decades of research on feature detection have mainly focused on scale invariant detectors, including the scale invariant feature transform (SIFT) detector (Lowe, 1999), the Harris-Affine and Hessian-Affine detectors (Mikolajczyk and Schmid, 2004), the speeded-up robust features (SURF) detector (Bay et al., 2006), and the maximally stable extremal region (MSER) detector (Matas et al., 2002). The rest of this section focuses on how these different types of scale invariant feature detectors are constructed and compares each detector's characteristics with respect to the others.

For both scale invariant and scale variant features, two steps are usually needed to detect a feature in an image. The first step is to construct a certain type of filter kernel to be convolved with the entire image, such as the Difference of Gaussian (DoG) operator in the SIFT detector or a corner kernel in corner detection. The result from the first step is a response function with many local maxima or minima denoting the possible feature locations.

The next step is then to apply non-maximum suppression to select all the potential features. In order to obtain scale invariant features in SIFT feature detection, the image is rescaled to different levels to obtain a series of images with different local scales to be convolved with the DoG kernel filter. Finally, scale invariant features are obtained from the series of DoG images. Based on each feature detector's type (i.e. corner or blob/region detector) and its performance under transformations such as rotation, scale, and affine changes, a comparison of commonly used feature detectors is listed in the table below.

Table 1.1 Comparison of commonly used feature detectors: types and characteristics

Detector         Corner detector   Region detector   Rotation invariant   Scale invariant   Affine invariant
Harris           yes               -                 yes                  -                 -
Shi-Tomasi       yes               -                 yes                  -                 -
Harris-Affine    yes               -                 yes                  yes               yes
Hessian-Affine   -                 yes               yes                  yes               yes
MSER             -                 yes               yes                  yes               yes
SIFT             -                 yes               yes                  yes               -
SURF             -                 yes               yes                  yes               -

Once the feature locations are determined, the next stage is to find a descriptor for each detected feature using its immediate neighborhood. The descriptor is then compared with others to identify the best match of a feature in another image.

Depending on its complexity, a descriptor can range from the local appearance of the neighborhood (Martin and Crowley, 1995), to first- or higher-order derivatives of the neighborhood (Schmid and Mohr, 1997; Geiger et al., 2010), to the neighborhood's gradients as used in the more complex SIFT descriptor and the gradient location-orientation histogram (GLOH) descriptor (Lowe, 1999; Mikolajczyk and Schmid, 2005). The simplest descriptor, based on local appearance, stacks the pixel values of the neighborhood into a vector and then compares the similarity of this vector against others based on a geometric distance or cross-correlation. The derivative-based method is similar, but the descriptor vector contains the first- or higher-order derivatives of the neighborhood rather than pixel values (Geiger et al., 2010). The more complex SIFT descriptor consists of a 128-dimensional vector formed by computing sub-histograms over the surrounding 4-by-4 neighborhood of cells, with each sub-histogram containing 8 bins for the gradient's angular directions (Lowe, 1999).

Once the features are located in the two images and their associated descriptors have been calculated, the next step is to find, for each feature in one image, its closest match in the other image. The most intuitive method is a brute-force comparison of a feature descriptor in one image against all the feature descriptors in the other image. The comparison is based on a similarity metric such as normalized cross correlation (NCC), the sum of squared differences (SSD), etc. (Martin and Crowley, 1995; Gonzalez and Woods, 2007). Based on the similarity scores from the comparison, one can choose the best match of a feature, i.e. the one with the highest similarity score.

Obviously, this method is time consuming if a large number of features are present, since the computational cost is quadratic in the number of features. More sophisticated data structures can be applied to improve the efficiency, such as the k-d tree for multidimensional feature descriptors (Bentley, 1975, 1990). Finding the closest match is then equivalent to finding the nearest neighbor of each feature in the k-d tree built over the descriptor space. The k-d tree has proven to be much more efficient for finding nearest neighbors, with a logarithmic search time (Friedman et al., 1977). Although the k-d tree is efficient in finding nearest neighbors, it only works well in low dimensions, and the search time grows exponentially as the number of dimensions increases (Nene and Nayar, 1997). Beis and Lowe (1997) proposed the best-bin-first (BBF) search strategy and Arya introduced the priority k-d tree (Arya and Mount, 1993), both of which are usually more efficient in dealing with high dimensional data. However, the nearest-neighbor based methods may produce ambiguities if the feature patterns repeat across images. One can use another strategy, called the nearest neighbor distance ratio, to remove these ambiguous matches (Lowe, 2004). The idea is to compute the ratio between the distances to the nearest and second nearest matches, and to accept the nearest match only if the ratio is below a predefined threshold.
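As a concrete illustration of the matching procedure described above, the sketch below combines SIFT detection and description, brute-force nearest-neighbor search with the nearest neighbor distance ratio test, and RANSAC-based outlier rejection through the fundamental matrix, using the OpenCV library. It is only an illustrative sketch: the image file names, the 0.8 ratio threshold, and the RANSAC parameters are assumed values, not settings used elsewhere in this dissertation.

```python
import cv2
import numpy as np

# Hypothetical input images of the same scene from two viewpoints.
img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# 1) Detect features and compute 128-dimensional SIFT descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2) Brute-force nearest-neighbor search; keep the two closest candidates so
#    that the nearest/second-nearest distance ratio test can be applied.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.8 * n.distance]

# 3) Reject remaining outliers with RANSAC on the epipolar constraint.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
inliers = [m for m, keep in zip(good, mask.ravel()) if keep]
print(len(good), "ratio-test matches,", len(inliers), "RANSAC inliers")
```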

1.2.3. Motion estimation and refinement from corresponding features

As mentioned before, there are three commonly used approaches for recovering motion from feature correspondences. The most favorable one is based on 3-D-to-2-D feature correspondences, since it recovers the absolute scale rather than the relative scale of the 2-D-to-2-D method. It is also more accurate than the 3-D-to-3-D method because it minimizes the feature measurement error in the image coordinates (Nister et al., 2004). A 6-DoF rigid transformation consisting of attitude and position changes was proposed by Geiger for describing the movement between adjacent stereo frames (Geiger et al., 2011), with a Gaussian noise model assumed in the camera projection model. The 6-DoF transformation can therefore be calculated by maximum likelihood estimation, which is equivalent to the least-squares method under the Gaussian noise assumption (Cheng et al., 2006; Maimone et al., 2007). If the 3-D feature positions are also regarded as unknowns to be optimized, a method called windowed or local bundle adjustment (Mouragnon et al., 2006) can be used to solve for the 6-DoF rover movement while also optimizing the triangulated feature positions in 3-D space.

As pointed out in Olson et al. (2003), the localization error in stereo VO accumulates without bound with respect to the distance the rover traverses. In order to limit this error accumulation, Li et al. (2005) proposed an offline posterior processing method similar to the aforementioned vision-based rover localization, but with the optimization performed after the rover has finished its daily traverse.

Generally, the rover moves a variable distance every sol (the Martian day) and then stops overnight. The onboard stereo VO algorithms are engaged to obtain the rover position in real time whenever possible during that sol; in other words, the initial exterior-orientation (EO) parameters are obtained. Here the EO refers to the attitude and position of the camera when the image is taken. The rover's position can also be estimated from these EO parameters. Normally, when the rover stops after a certain traverse, it takes a full or partial panorama covering the surrounding area. This stop is called a site in Li's method. Since the distance between two sites is up to 30 meters most of the time, it is usually possible to find feature correspondences appearing in the panoramas of adjacent sites (Xu, 2004). Therefore, a local BA is performed on these panoramic images between adjacent sites when the rover stops. This method optimizes all the EO parameters of the images as well as the 3-D positions of the features connecting them. The BA is performed sol by sol, expanding to as many sols as the rover traverses, as long as there are sufficient features connecting any two adjacent sols. Derived from this processing manner, the method is termed incremental BA, and it has proven to be effective for the MER 2003 Spirit rover, achieving a long-term localization accuracy of 0.5% (Li et al., 2007a).

Basically, error in the above-mentioned incremental BA also accumulates as the rover traverses. However, the accumulation rate is lower than that of the stereo VO algorithm, since the optimization minimizes the error accumulation of stereo VO over a relatively long distance.

In order to further improve the rover localization accuracy, Li et al. (2007b) proposed another vision-based method integrating both the ground-based rover images and images from an orbiting Mars satellite. The satellite images used are from the High Resolution Imaging Science Experiment (HiRISE) instrument onboard the Mars Reconnaissance Orbiter (MRO), which have a resolution of 0.3 meters. Although the HiRISE imagery has lower resolution than the ground imagery, it has much more reliable position and attitude control covering the entire rover landing site area. That is to say, the satellite imagery can be used as an approximate ground reference source after georeferenced rectification (Hwangbo, 2010). Due to the lack of an absolute ground reference, only the consistency between the MER Spirit localization results obtained from the incremental BA and from the satellite-image-based method could be compared. This inconsistency was found to be 1.5% after the rover traversed a total of 7 km (Li et al., 2011). It is also indicated in that paper that this inconsistency could be removed by a combined BA of all the available datasets.

1.3. Organization of the Dissertation

The organization and contents of the rest of the dissertation are summarized below. Chapter 2 introduces the background of camera models, including both the linear camera model and the nonlinear camera model that accounts for lens distortions. Then the camera calibration method is described, which removes the lens distortion and obtains the interior-orientation (IO) parameters of the camera in order to rectify its captured images.

Next, the single camera model is extended to stereo cameras by introducing the fundamental matrix for the two-view, or epipolar, geometry. The calibration and rectification of stereo vision are introduced in order to facilitate feature matching within a stereo image pair. Finally, feature detection and matching techniques are discussed in more detail, as well as the principles of stereo visual odometry.

Chapter 3 focuses on the theoretical error analysis of the PanCam mapping and localization model. First, the mapping accuracy of the stereo WAC (Wide Angle Camera) system is derived based on the latest specifications. Then the attainable level of localization accuracy is analyzed based on a simplified rover stereo VO model involving only two stops. This localization accuracy analysis for the stereo WAC is extended to the three-camera PanCam system by incorporating the additional stereo pair formed by the HRC (High Resolution Camera) and the left WAC. After that, a BA-based ideal rover localization model is established for accurate long-range rover localization with a relative error of 1%. Again, this ideal model is first considered with the stereo WAC only, and then the third camera, the HRC, is added into the model to estimate how much improvement it can achieve. Finally, the simulation results are presented for the relative localization error with respect to the distance the rover traverses, and the factors affecting the rover localization accuracy over long-range traverses are discussed.

Chapter 4 introduces the prototype PanCam system in detail, including the specifications of each camera in the prototype and the pan/tilt platform controlling the movement of the system. It first discusses the camera calibration performed for each camera to remove lens distortion, as well as the stereo calibration used to obtain the rectified stereo pair. Automatic feature extraction on a specific calibration pattern is also studied in this chapter in order to speed up the camera calibration process. Then the stereo matching between the stereo WAC images, as well as between the high-resolution HRC and low-resolution WAC images, is studied with several constraints that improve matching accuracy and efficiency. Finally, the field experiment, consisting of traverses of around 900 m in total length using the PanCam prototype, is presented. The trajectories from the stereo WAC VO processing, the stereo WAC plus HRC VO processing, and the posterior BA processing are compared with the ground truth reference from GPS measurements for verification.

Chapter 5 concludes with a summary of the localization accuracy analysis performed in this dissertation and future work on further improving rover positioning accuracy.

2. CAMERA MODELS AND VISION-BASED STEREO VISUAL ODOMETRY

Figure 2.1 Pinhole camera illustration

The camera model is the geometric description of an optical system, i.e. the relationship between an object and its projection on the image. The most commonly used and simplest camera model is the pinhole model, in which the object, its projection on the image, and the pinhole lie on the same line (Fig. 2.1). This type of projection is called central projection. In this chapter, the linear geometric model is first introduced for the pinhole camera, describing the collinearity condition of the central projection. Then, the nonlinear lens distortion effect is incorporated to improve the projection model. The camera rotation matrix based on three Euler angles is also studied in this section. Next, the single camera model is extended to stereo cameras by introducing the fundamental matrix for the two-view, or epipolar, geometry.

Furthermore, the calibration and rectification of stereo vision are introduced in order to facilitate feature matching within the stereo image pair. The last section of this chapter discusses feature detection and matching in more detail, as well as the principles of stereo visual odometry. The local bundle adjustment used for calculating the rover's position from stereo pairs, while refining the 3-D locations of the matched features in object space, is also introduced here.

2.1. Camera Models

The basic pinhole camera model and central projection are described in this section. The linear camera model is introduced to describe the geometric collinearity relationship among the object, the pinhole, and the projection on the image. Then the nonlinear model is discussed in order to model the lens distortion effect, including both radial and tangential distortions. The camera rotation matrix is briefly studied at the end of this section.

2.1.1. Linear camera model

The pinhole camera model is a geometric projection model that maps objects in the real world onto the image plane. The geometric relationships within the pinhole camera model are illustrated in Fig. 2.2(a), in which C is the optical center, P is the object point in 3-D space, O is the origin of the image plane, or principal point, the axis Z_C directed from C to O is the optical or principal axis, and p is the projection of P on the image plane. The optical axis is perpendicular to the image plane, and the distance from the optical center to the image plane is equal to the focal length of the lens.

Note that the projection point p on the image plane can be determined from the object point P and the camera center C, i.e. there is a one-to-one relation from point P to point p. On the other hand, the point P cannot be determined from the projection point p and the camera center C alone, i.e. there is a one-to-multiple relationship from point p to the object point P. In other words, there is an unknown scale factor from projection point p to object point P in the single-view case, which is why at least two images are needed to determine an object point in the 3-D world.

Figure 2.2 Pinhole camera geometric model

Assuming the 3-D coordinates of point P are (X_C, Y_C, Z_C) in the camera-centered coordinate system C-X_C Y_C Z_C, the 2-D coordinates of the corresponding projection p(x_R, y_R) on the image can be calculated easily according to the simplified projection model in Fig. 2.2(b):

\[
x_R = f\,\frac{X_C}{Z_C}, \qquad y_R = f\,\frac{Y_C}{Z_C} \tag{2.1}
\]

Equation (2.1) can be rewritten in homogeneous coordinates as equation (2.2):

\[
Z_C \begin{bmatrix} x_R \\ y_R \\ 1 \end{bmatrix} =
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} \tag{2.2}
\]

Usually the origin of the image plane is not coincident with the principal point, so there is an offset between the image plane's origin and the principal point. Let o(u_0, v_0) denote this offset; it is also the principal point's 2-D coordinates in the image plane (Fig. 2.3).

Figure 2.3 Image plane and principal point

Adding this principal-point offset to equation (2.2), the following equation is obtained, describing the projection model from object point P to point p on the image plane within the camera-centered coordinate system.

\[
Z_C \begin{bmatrix} x_R \\ y_R \\ 1 \end{bmatrix} =
\begin{bmatrix} f & 0 & u_0 & 0 \\ 0 & f & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} \tag{2.3}
\]

The object space has its own coordinate system, known as the world coordinate system O_w-X_w Y_w Z_w (Fig. 2.2(a)). Therefore, a transformation is needed to convert the object point from the world coordinate system to the camera-centered coordinate system. A 6-DoF rigid transformation containing a rotation R and a translation t can be used to denote this transformation (Hartley and Zisserman, 2004), in which R is a 3 by 3 matrix representing the orientation change and t is a 3 by 1 vector describing the translation. The transformation R and t together are usually referred to as the exterior-orientation (EO) parameters. Assuming the object point's coordinates are (X, Y, Z) in the world coordinate system, this 6-DoF rigid transformation can be expressed in homogeneous coordinates as equation (2.4):

\[
\begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} =
\begin{bmatrix} R_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.4}
\]

Combining equations (2.3) and (2.4) gives the transformation from the object point in the world reference frame to its projection point in the image plane:

\[
Z_C \begin{bmatrix} x_R \\ y_R \\ 1 \end{bmatrix} =
\begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}
[R_{3\times3} \mid t_{3\times1}]
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
= K[R_{3\times3} \mid t_{3\times1}]
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.5}
\]

It is often convenient to use the camera calibration matrix K to replace the 3 x 3 matrix defined in equation (2.5); it is related to the camera and lens parameters, which are called the interior-orientation (IO) parameters and include the focal length, principal point, etc. The projection matrix for the pinhole camera model is therefore defined as:

\[
P = K[R_{3\times3} \mid t_{3\times1}] \tag{2.6}
\]

The projection matrix is a 3 x 4 matrix consisting of both the IO and EO parameters. So far it has been assumed that the camera and lens are ideal, which means the focal length is the same in the x and y directions and the x and y axes are exactly perpendicular to each other on the image plane. For the general case, f_x and f_y can be used to denote the focal lengths along the x and y directions, and a skew parameter s can represent the imperfect right angle between the x and y axes. Obviously, for the ideal camera the focal lengths in the x and y directions are equal and the skew parameter s is zero. The projection model for a general pinhole camera has the following form:

\[
Z_C \begin{bmatrix} x_R \\ y_R \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & s & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}
[R_{3\times3} \mid t_{3\times1}]
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.7}
\]
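As a small numerical illustration of equations (2.5) through (2.7), the sketch below projects a world point into pixel coordinates with an ideal camera matrix K and exterior orientation [R | t]. All parameter values are assumed for illustration and are unrelated to the PanCam specifications.

```python
import numpy as np

# Interior orientation: focal length and principal point in pixels (assumed values).
f, u0, v0 = 1000.0, 640.0, 512.0
K = np.array([[f, 0.0, u0],
              [0.0, f, v0],
              [0.0, 0.0, 1.0]])

# Exterior orientation: identity attitude and an assumed translation t (in meters).
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])

def project(X_world):
    """Apply x ~ K [R | t] X, then divide by the homogeneous scale Z_C."""
    X_cam = R @ X_world + t        # world -> camera-centered coordinates
    x_hom = K @ X_cam              # homogeneous image coordinates, scaled by Z_C
    return x_hom[:2] / x_hom[2]    # (x_R, y_R) in pixels

print(project(np.array([0.5, -0.2, 10.0])))   # pixel coordinates of one object point
```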

2.1.2. Nonlinear camera model

However, the aforementioned linear camera model has been found to be insufficiently accurate for precise photogrammetric measurement applications due to the lens distortion effect (Fraser, 2001). Fig. 2.4 shows several examples of the lens distortion effect in the radial direction. Experience shows that the distortion increases away from the projection center, or principal point, and that this relationship is always nonlinear. Therefore, a great deal of research has been done since the 1970s to develop adequate projection models that correct the lens distortion (Fraser, 2001). Radial and tangential distortion are taken into account in this dissertation, which is sufficient for correcting the cameras used in our prototype experiment.

Figure 2.4 Radial distortion effect: (a) no distortion; (b) pincushion / negative radial distortion; (c) barrel / positive radial distortion; (d) fisheye distortion

The radial distortion can be expressed as follows in the x and y directions, respectively:

\[
\begin{cases}
\Delta x_r = (x - u_0)(k_1 r^2 + k_2 r^4) \\
\Delta y_r = (y - v_0)(k_1 r^2 + k_2 r^4)
\end{cases} \tag{2.8}
\]

in which r^2 = (x - u_0)^2 + (y - v_0)^2; k_1 and k_2 are the coefficients of radial distortion; and (u_0, v_0) is the principal point, or projection center.

The radial distortion is caused by the imperfect lens shape, and it can be deduced from equation (2.8) that the distortion effect increases as the projection point (x, y) moves away from the principal point (u_0, v_0). The tangential distortion model is as follows:

\[
\begin{cases}
\Delta x_t = 2p_1 (x - u_0)(y - v_0) + p_2\left(r^2 + 2(x - u_0)^2\right) \\
\Delta y_t = 2p_2 (x - u_0)(y - v_0) + p_1\left(r^2 + 2(y - v_0)^2\right)
\end{cases} \tag{2.9}
\]

The coefficients p_1 and p_2 are multipliers along the tangential direction, and r^2 is the same as defined in equation (2.8). The tangential distortion is caused by the physical elements of the lens not being perfectly aligned, and it is also known as decentering distortion (Zhang, 2000). Adding the radial and tangential distortions together, the nonlinear lens distortion correction for the projection point is obtained as follows:

\[
\begin{cases}
\Delta x = \Delta x_r + \Delta x_t = (x - u_0)(k_1 r^2 + k_2 r^4) + 2p_1 (x - u_0)(y - v_0) + p_2\left(r^2 + 2(x - u_0)^2\right) \\
\Delta y = \Delta y_r + \Delta y_t = (y - v_0)(k_1 r^2 + k_2 r^4) + 2p_2 (x - u_0)(y - v_0) + p_1\left(r^2 + 2(y - v_0)^2\right)
\end{cases} \tag{2.10}
\]

It should be pointed out that a 4 x 1 coefficient vector [k_1, k_2, p_1, p_2] is used to model the lens distortion effect, which has proven to be sufficient for all the cameras in our prototype experiment.
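The corrections in equations (2.8) through (2.10) can be evaluated directly, as in the following sketch; the principal point and distortion coefficients used here are assumed values for illustration only, not calibration results from the prototype cameras.

```python
import numpy as np

def distortion(x, y, u0, v0, k1, k2, p1, p2):
    """Radial + tangential corrections (dx, dy) at image point (x, y), eqs. (2.8)-(2.10)."""
    xb, yb = x - u0, y - v0                  # offsets from the principal point
    r2 = xb**2 + yb**2
    radial = k1 * r2 + k2 * r2**2            # k1*r^2 + k2*r^4
    dx = xb * radial + 2.0 * p1 * xb * yb + p2 * (r2 + 2.0 * xb**2)
    dy = yb * radial + 2.0 * p2 * xb * yb + p1 * (r2 + 2.0 * yb**2)
    return dx, dy

# Assumed principal point and distortion coefficients, for illustration only.
print(distortion(900.0, 700.0, 640.0, 512.0,
                 k1=-2.5e-7, k2=1.0e-13, p1=1.0e-6, p2=-2.0e-6))
```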

The complete nonlinear pinhole camera model in equation (2.11), representing the projection of an object point in world coordinates onto the image plane, is obtained by combining the linear camera projection model and the nonlinear lens distortion correction model. First, the object point (X, Y, Z)^T in the world coordinate system is projected onto the distortion-free image plane, with the IO parameters represented by the camera matrix K. The radial and tangential distortion corrections, evaluated at the observed image coordinates (u, v), are then added to this projection to obtain the measured projections:

\[
Z_C \begin{bmatrix} x_R \\ y_R \\ 1 \end{bmatrix} = K[R_{3\times3} \mid t_{3\times1}] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\]
where
\[
\begin{cases}
\Delta x_R = \bar{x}\,(k_1 r^2 + k_2 r^4) + 2p_1 \bar{x}\bar{y} + p_2 (r^2 + 2\bar{x}^2) \\
\Delta y_R = \bar{y}\,(k_1 r^2 + k_2 r^4) + 2p_2 \bar{x}\bar{y} + p_1 (r^2 + 2\bar{y}^2)
\end{cases};
\qquad
\begin{cases}
\bar{x} = u - u_0 \\
\bar{y} = v - v_0
\end{cases};
\qquad
r^2 = \bar{x}^2 + \bar{y}^2 \tag{2.11}
\]

We will now examine how to estimate the four distortion coefficients (k_1, k_2, p_1, p_2) for all three cameras in the PanCam prototype system calibration section.
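In practice these coefficients are estimated together with the IO parameters from images of a known target. The sketch below is a hedged illustration using OpenCV's planar checkerboard calibration rather than the 3-D calibration facility employed for the prototype in Chapter 4; the checkerboard geometry and file names are assumptions. Note that OpenCV orders its distortion vector as (k1, k2, p1, p2, k3), i.e. the four coefficients adopted here plus an additional radial term.

```python
import glob
import cv2
import numpy as np

# Checkerboard with 9 x 6 inner corners and 20 mm squares (assumed target geometry).
pattern, square = (9, 6), 0.02
grid = np.zeros((pattern[0] * pattern[1], 3), np.float32)
grid[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, img_pts, size = [], [], None
for path in glob.glob("calib_*.png"):              # hypothetical calibration images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(grid)
        img_pts.append(corners)

# Least-squares solution for K, the distortion coefficients, and per-image EO.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
k1, k2, p1, p2 = dist.ravel()[:4]
print("RMS reprojection error:", rms)
print("k1, k2, p1, p2 =", k1, k2, p1, p2)
```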

2.1.3. Camera rotation matrix

In this section, the 3-D rotation matrix is defined in terms of three Euler angles ω, ϕ, and κ: ω defines the rotation about the X axis, ϕ the rotation about the Y axis, and κ the rotation about the Z axis.

Figure 2.5 Rotation of angle ω about the X axis

Considering the ω rotation about the X axis in Fig. 2.5, the coordinates of any point A(X_1, Y_1, Z_1)^T in the once-rotated X_1Y_1Z_1 system can be computed as in equation (2.12) from its coordinates (X, Y, Z)^T in the original system XYZ:

\[
\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} = R_\omega \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\omega & \sin\omega \\ 0 & -\sin\omega & \cos\omega \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{2.12}
\]

Figure 2.6 Rotation of angle ϕ about the Y axis

Similarly, the X_1Y_1Z_1 system is then rotated by angle ϕ about the Y axis as shown in Fig. 2.6, forming the twice-rotated system X_2Y_2Z_2. The coordinates of point A in X_2Y_2Z_2 can be expressed as in equation (2.13):

\[
\begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} = R_\varphi \begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} =
\begin{bmatrix} \cos\varphi & 0 & -\sin\varphi \\ 0 & 1 & 0 \\ \sin\varphi & 0 & \cos\varphi \end{bmatrix}
\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} \tag{2.13}
\]

Figure 2.7 Rotation of angle κ about the Z axis

Finally, the X_2Y_2Z_2 system is rotated by angle κ about the Z axis as shown in Fig. 2.7, forming the three-times-rotated system X_3Y_3Z_3. The coordinates of point A in X_3Y_3Z_3 are represented in equation (2.14):

\[
\begin{bmatrix} X_3 \\ Y_3 \\ Z_3 \end{bmatrix} = R_\kappa \begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} =
\begin{bmatrix} \cos\kappa & \sin\kappa & 0 \\ -\sin\kappa & \cos\kappa & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} \tag{2.14}
\]

Combining the three rotations above, the final camera rotation matrix in terms of ω, ϕ, and κ is defined in equation (2.15):

\[
R = R_\kappa R_\varphi R_\omega =
\begin{bmatrix}
\cos\varphi\cos\kappa & \sin\omega\sin\varphi\cos\kappa + \cos\omega\sin\kappa & -\cos\omega\sin\varphi\cos\kappa + \sin\omega\sin\kappa \\
-\cos\varphi\sin\kappa & -\sin\omega\sin\varphi\sin\kappa + \cos\omega\cos\kappa & \cos\omega\sin\varphi\sin\kappa + \sin\omega\cos\kappa \\
\sin\varphi & -\sin\omega\cos\varphi & \cos\omega\cos\varphi
\end{bmatrix}
=
\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \tag{2.15}
\]

Once the rotation matrix R is obtained, it is often necessary to compute its derivatives with respect to the rotation angles ω, ϕ, and κ. Let r_ij (i, j = 1, 2, 3) denote the entries of the rotation matrix R; the partial derivatives of R with respect to ω, ϕ, and κ are then given in equations (2.16):

\[
\frac{\partial R}{\partial \omega} =
\begin{bmatrix}
0 & \cos\omega\sin\varphi\cos\kappa - \sin\omega\sin\kappa & \sin\omega\sin\varphi\cos\kappa + \cos\omega\sin\kappa \\
0 & -\cos\omega\sin\varphi\sin\kappa - \sin\omega\cos\kappa & -\sin\omega\sin\varphi\sin\kappa + \cos\omega\cos\kappa \\
0 & -\cos\omega\cos\varphi & -\sin\omega\cos\varphi
\end{bmatrix}
=
\begin{bmatrix} 0 & -r_{13} & r_{12} \\ 0 & -r_{23} & r_{22} \\ 0 & -r_{33} & r_{32} \end{bmatrix}
\]

\[
\frac{\partial R}{\partial \varphi} =
\begin{bmatrix}
-\sin\varphi\cos\kappa & \sin\omega\cos\varphi\cos\kappa & -\cos\omega\cos\varphi\cos\kappa \\
\sin\varphi\sin\kappa & -\sin\omega\cos\varphi\sin\kappa & \cos\omega\cos\varphi\sin\kappa \\
\cos\varphi & \sin\omega\sin\varphi & -\cos\omega\sin\varphi
\end{bmatrix} \tag{2.16}
\]

\[
\frac{\partial R}{\partial \kappa} =
\begin{bmatrix}
-\cos\varphi\sin\kappa & -\sin\omega\sin\varphi\sin\kappa + \cos\omega\cos\kappa & \cos\omega\sin\varphi\sin\kappa + \sin\omega\cos\kappa \\
-\cos\varphi\cos\kappa & -\sin\omega\sin\varphi\cos\kappa - \cos\omega\sin\kappa & \cos\omega\sin\varphi\cos\kappa - \sin\omega\sin\kappa \\
0 & 0 & 0
\end{bmatrix}
=
\begin{bmatrix} r_{21} & r_{22} & r_{23} \\ -r_{11} & -r_{12} & -r_{13} \\ 0 & 0 & 0 \end{bmatrix}
\]

2.2. Two-view Geometry

As mentioned before, in order to recover the 3-D scene structure, there must be at least two images viewing the same scene from different viewpoints. The intrinsic projective geometry between these two views is called the two-view, or epipolar, geometry; it is determined only by the cameras' IO parameters and the relative pose between the two views, and it is independent of the 3-D scene structure. In this section, the fundamental matrix representing the epipolar geometry is introduced first, followed by a discussion of the method for calculating the relative pose between the two views as well as the rectification of the stereo pair. Finally, the algorithm for obtaining the positions of the 3-D scene structure in the world coordinate system from the rectified stereo vision is briefly covered.

2.2.1. Epipolar geometry and the fundamental matrix

The epipolar geometry originated from the need to facilitate the correspondence search between stereo images, i.e. stereo matching. The fundamental matrix derived from these studies is thus used to represent the geometric relationships between correspondences in the stereo image pair.

Note that all the images here are assumed to be linear, i.e. the lens distortion effects have been removed. In Fig. 2.8(a), assume X is the object point in the 3-D scene; C and C′ are the optical centers of the left and right cameras, respectively; and x and x′ are the projections of X onto the left and right images of the stereo pair. Obviously, the back-projected ray from the left camera center C through x and the ray from the right camera center C′ through x′ should intersect at X. In other words, the camera centers C and C′, the image correspondences x and x′, and the object point X should lie on the same plane; this is the coplanarity condition. The variable π is used to represent this plane; the intersection line of π with the left image plane is the epipolar line l, and similarly the epipolar line l′ is the intersection between π and the right image. From the aforementioned geometric relationship in the stereo pair, it can be seen that the correspondence of the left image point x must always lie on the epipolar line l′ in the right image. This constraint significantly shortens the stereo matching time by reducing the search space from the entire 2-D image plane to a 1-D line on the image. By connecting the two camera centers C and C′, the intersections of the line CC′ with the image planes define the epipoles e and e′ on the left and right images, respectively (Fig. 2.8(b)).

Figure 2.8 Epipolar geometry. (a) coplanarity condition for the camera centers, the object point, and its projections on the left and right image planes; (b) corresponding points constrained by the epipolar line

The fundamental matrix F is an algebraic expression of the epipolar geometry constraints. It is a 3 x 3 matrix that satisfies the following equation for all correspondences x and x′:

\[
x'^{T} F x = 0 \tag{2.17}
\]

where x and x′ are in homogeneous coordinates, i.e. x = (x, y, 1)^T and x′ = (x′, y′, 1)^T. The detailed derivation of the fundamental matrix from both geometric and algebraic standpoints can be found in Hartley and Zisserman's book (2004). Besides satisfying the correspondence constraint in equation (2.17), the fundamental matrix F has the following properties:

- F is a rank-2 homogeneous matrix with 7 degrees of freedom;
- The epipolar line in the right image corresponding to the left image point x is l′ = Fx;
- The epipolar line in the left image corresponding to the right image point x′ is l = F^T x′; and
- The left and right epipoles satisfy Fe = 0 and F^T e′ = 0, respectively.

Since the fundamental matrix reduces the stereo matching search space from a 2-D plane to a 1-D line, it can also be used to remove outliers during the matching process. One of the most popular algorithms for detecting outliers is random sample consensus (RANSAC). The idea is to compute an initial estimate of the fundamental matrix from a randomly selected minimum set of matches. Then, for each of the remaining matches, the distance of every point to its corresponding epipolar line is examined based on the initial fundamental matrix estimate. If this distance is within a predefined threshold, the point is marked as an inlier match. The entire process is iterated multiple times, and the best fundamental matrix estimate is selected from the iteration having the largest number of inliers. In the end, the final fundamental matrix can be refined from all the inliers produced by the RANSAC algorithm.

There are several different ways of solving for F (Hartley and Zisserman, 2004), most of which are based on the least-squares method. Equation (2.17) can be rewritten as equation (2.18), in which f_ij (i, j are the row and column indexes of matrix F) are the entries of F. The minimum number of matches needed to solve equation (2.18) is 7, and if there are more than 7 matches the least-squares technique can be utilized.

\[
x'x f_{11} + x'y f_{12} + x' f_{13} + y'x f_{21} + y'y f_{22} + y' f_{23} + x f_{31} + y f_{32} + f_{33} = 0
\]

which, for a set of n points, gives

\[
\begin{bmatrix}
x'_1 x_1 & x'_1 y_1 & x'_1 & y'_1 x_1 & y'_1 y_1 & y'_1 & x_1 & y_1 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x'_n x_n & x'_n y_n & x'_n & y'_n x_n & y'_n y_n & y'_n & x_n & y_n & 1
\end{bmatrix}
\begin{bmatrix}
f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \\ f_{33}
\end{bmatrix}
= A f = 0 \tag{2.18}
\]

Note that the fundamental matrix F can only be determined up to a scale; that is, f_33 can be assumed to be 1 when solving for F using equation (2.18).

2.2.2. Stereo calibration and rectification

The fundamental matrix speeds up stereo matching by allowing the search for correspondences to be performed along a line instead of within the entire 2-D image plane. This can be taken one step further by aligning the corresponding epipolar lines. That is, by rotating the left and right camera-centered coordinate systems by some small angles, all matches can be brought to the same vertical coordinate for horizontal stereo vision (or the same horizontal coordinate for vertical stereo vision). This process is called stereo rectification, and it further facilitates stereo matching by limiting the search space to a known horizontal or vertical line on the image. The critical step in stereo rectification is to find the orientation and position of the left image with respect to the right image, which is called stereo calibration in computer vision or relative orientation in photogrammetry.

In this section, how to determine the relative orientation and obtain the rectified stereo images is investigated. In the rest of this section, the algorithm for relative orientation is derived based on the coplanarity condition introduced in the previous section for the fundamental matrix. It has to be pointed out that, before relative orientation, any lens distortion has to be removed from the stereo images, and a unified focal length should also be applied to the left and right images. Based on Fig. 2.8, the rectified camera-centered coordinate systems C1-X1Y1Z1 and C2-X2Y2Z2 are introduced for the left and right images, respectively, in Fig. 2.9. They are obtained by rotating the original camera-centered coordinate systems of the left and right cameras by some small angles. Obviously, for horizontal stereo vision, C2-X2Y2Z2 is the same as C1-X1Y1Z1 except for a translation B along the X axis, where B is the baseline between the two cameras. The C1Y1 axis in C1-X1Y1Z1 is purposely chosen to be parallel to the left image plane, and C1Z1 is then determined by the right-hand rule. In this way C1Y1 is also perpendicular to the optical axis of the original left camera-centered coordinate system. Therefore, only two rotation angles (ϕ1, κ1), about C1Y1 and C1Z1, are needed instead of three, while the right camera still needs three rotation angles (ω2, ϕ2, κ2) to align with the new camera-centered coordinate system C2-X2Y2Z2. In total there are 5 unknown rotation angles (ϕ1, κ1, ω2, ϕ2, κ2) to be solved for stereo vision rectification.

Figure 2.9 Stereo vision rectification based on relative orientation

For each matching pair (p1, p2) in the newly rotated left and right image planes, the three rays (i.e. the baseline B, C1p1, and C2p2) lie on the same plane. Assuming the matches (p1, p2) have coordinates (X_1, Y_1, Z_1) and (X_2, Y_2, Z_2) in their own newly rotated camera-centered coordinate systems, the following coplanarity condition holds:

\[
G = \begin{vmatrix} B & 0 & 0 \\ X_1 & Y_1 & Z_1 \\ X_2 & Y_2 & Z_2 \end{vmatrix} = 0 \tag{2.19}
\]

As noted, (X_1, Y_1, Z_1) and (X_2, Y_2, Z_2) are obtained by rotating the original left and right camera-centered coordinates by some small angles. If (x_1, y_1, f) and (x_2, y_2, f) are used to represent the coordinates of the matching pair (p1, p2) in their original camera-centered coordinate systems, we have the following relationships:

\[
\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} = R_1 \begin{bmatrix} x_1 \\ y_1 \\ f \end{bmatrix},
\qquad
\begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} = R_2 \begin{bmatrix} x_2 \\ y_2 \\ f \end{bmatrix} \tag{2.20}
\]

R_1 is the rotation matrix for the left camera, which combines the rotations about the C1Y1 and C1Z1 axes, and R_2 is the rotation matrix for the right camera, which combines the rotations about all three axes, as expressed in equation (2.21):

\[
R_1 = R_{\varphi_1} R_{\kappa_1} =
\begin{bmatrix} \cos\varphi_1 & 0 & -\sin\varphi_1 \\ 0 & 1 & 0 \\ \sin\varphi_1 & 0 & \cos\varphi_1 \end{bmatrix}
\begin{bmatrix} \cos\kappa_1 & \sin\kappa_1 & 0 \\ -\sin\kappa_1 & \cos\kappa_1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\approx
\begin{bmatrix} 1 & \kappa_1 & -\varphi_1 \\ -\kappa_1 & 1 & 0 \\ \varphi_1 & 0 & 1 \end{bmatrix}
\]
\[
R_2 = R_{\omega_2} R_{\varphi_2} R_{\kappa_2} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\omega_2 & \sin\omega_2 \\ 0 & -\sin\omega_2 & \cos\omega_2 \end{bmatrix}
\begin{bmatrix} \cos\varphi_2 & 0 & -\sin\varphi_2 \\ 0 & 1 & 0 \\ \sin\varphi_2 & 0 & \cos\varphi_2 \end{bmatrix}
\begin{bmatrix} \cos\kappa_2 & \sin\kappa_2 & 0 \\ -\sin\kappa_2 & \cos\kappa_2 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\approx
\begin{bmatrix} 1 & \kappa_2 & -\varphi_2 \\ -\kappa_2 & 1 & \omega_2 \\ \varphi_2 & -\omega_2 & 1 \end{bmatrix} \tag{2.21}
\]

Expanding equation (2.19) in a first-order Taylor series and neglecting the higher-order terms results in equation (2.22):

\[
G = G^0 + \frac{\partial G}{\partial \varphi_1} d\varphi_1 + \frac{\partial G}{\partial \kappa_1} d\kappa_1 + \frac{\partial G}{\partial \omega_2} d\omega_2 + \frac{\partial G}{\partial \varphi_2} d\varphi_2 + \frac{\partial G}{\partial \kappa_2} d\kappa_2,
\qquad
G^0 = B(Y_1 Z_2 - Y_2 Z_1) \tag{2.22}
\]

The partial derivatives with respect to the five small unknown angles (ϕ1, κ1, ω2, ϕ2, κ2) are calculated in equation (2.23), where first-order approximations are used for these small terms.

The five partial derivatives are obtained by differentiating the determinant of equation (2.19) row by row; using the first-order approximations X_1 \approx x_1, Z_1 \approx f (and similarly for the right image), they become

\frac{\partial G}{\partial \varphi_1} = \begin{vmatrix} B & 0 & 0 \\ -f & 0 & x_1 \\ X_2 & Y_2 & Z_2 \end{vmatrix} = -B X_1 Y_2, \qquad
\frac{\partial G}{\partial \kappa_1} = \begin{vmatrix} B & 0 & 0 \\ -y_1 & x_1 & 0 \\ X_2 & Y_2 & Z_2 \end{vmatrix} = B X_1 Z_2,

\frac{\partial G}{\partial \omega_2} = \begin{vmatrix} B & 0 & 0 \\ X_1 & Y_1 & Z_1 \\ 0 & -f & y_2 \end{vmatrix} = B (Y_1 Y_2 + Z_1 f), \qquad
\frac{\partial G}{\partial \varphi_2} = \begin{vmatrix} B & 0 & 0 \\ X_1 & Y_1 & Z_1 \\ -f & 0 & x_2 \end{vmatrix} = B Y_1 X_2,

\frac{\partial G}{\partial \kappa_2} = \begin{vmatrix} B & 0 & 0 \\ X_1 & Y_1 & Z_1 \\ -y_2 & x_2 & 0 \end{vmatrix} = -B X_2 Z_1 \quad (2.23)

Substituting all the partial derivatives of equation (2.23) into equation (2.22) and multiplying the entire equation by f/(B Z_1 Z_2) gives, after simplification,

\frac{f G_0}{B Z_1 Z_2} - \frac{X_1 Y_2}{Z_1}\, d\varphi_1 + X_1\, d\kappa_1 + \left(\frac{Y_1 Y_2}{Z_1} + Z_1\right) d\omega_2 + \frac{Y_1 X_2}{Z_1}\, d\varphi_2 - X_2\, d\kappa_2 = 0 \quad (2.24)

In the above equation, letting Q = f G_0 / (B Z_1 Z_2), we have equation (2.25):

Q = \frac{f G_0}{B Z_1 Z_2} = \frac{f B (Y_1 Z_2 - Y_2 Z_1)}{B Z_1 Z_2} = f\frac{Y_1}{Z_1} - f\frac{Y_2}{Z_2} \quad (2.25)

The geometric meaning of the residual Q in equation (2.25) is the disparity of each matching pair along the vertical direction, i.e., the y-coordinate difference on the image plane.

This disparity should be zero if the stereo images are perfectly aligned with each other. Therefore, the criterion for verifying the completion of the relative orientation process is to check whether the residual Q is close to zero. There are five unknowns in equation (2.24), so at least five stereo matches are required to solve it. For a set of n point matches, the least-squares technique can be used to solve the set of linear equations of the form shown in equation (2.26):

A X = \begin{bmatrix}
\frac{X_1 Y_2}{Z_1}\big|_{i=1} & -X_1\big|_{i=1} & -\big(\frac{Y_1 Y_2}{Z_1} + Z_1\big)\big|_{i=1} & -\frac{Y_1 X_2}{Z_1}\big|_{i=1} & X_2\big|_{i=1} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
\frac{X_1 Y_2}{Z_1}\big|_{i=n} & -X_1\big|_{i=n} & -\big(\frac{Y_1 Y_2}{Z_1} + Z_1\big)\big|_{i=n} & -\frac{Y_1 X_2}{Z_1}\big|_{i=n} & X_2\big|_{i=n}
\end{bmatrix}
\begin{bmatrix} d\varphi_1 \\ d\kappa_1 \\ d\omega_2 \\ d\varphi_2 \\ d\kappa_2 \end{bmatrix}
= \begin{bmatrix} f\frac{Y_1}{Z_1} - f\frac{Y_2}{Z_2}\big|_{i=1} \\ \vdots \\ f\frac{Y_1}{Z_1} - f\frac{Y_2}{Z_2}\big|_{i=n} \end{bmatrix}
= Q \quad (2.26)

The least-squares solution of the above system is given by equation (2.27):

X = \begin{bmatrix} d\varphi_1 \\ d\kappa_1 \\ d\omega_2 \\ d\varphi_2 \\ d\kappa_2 \end{bmatrix} = (A^T A)^{-1} A^T Q \quad (2.27)
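A compact sketch of how equations (2.20)-(2.27) can be iterated in practice is shown below. It assumes distortion-free image coordinates and a unified focal length, as required above; the helper names, iteration count and convergence threshold are illustrative, and the approximate coefficient matrix of (2.26) is re-evaluated at the current angle estimates in each iteration.

```python
import numpy as np

def rot(omega, phi, kappa):
    """R = R_omega @ R_phi @ R_kappa, the same factor order as equation (2.21)."""
    co, so = np.cos(omega), np.sin(omega)
    cp, sp = np.cos(phi),   np.sin(phi)
    ck, sk = np.cos(kappa), np.sin(kappa)
    Rw = np.array([[1, 0, 0], [0, co, -so], [0, so, co]])
    Rp = np.array([[cp, 0, -sp], [0, 1, 0], [sp, 0, cp]])
    Rk = np.array([[ck, -sk, 0], [sk, ck, 0], [0, 0, 1]])
    return Rw @ Rp @ Rk

def relative_orientation(p1, p2, f, iters=10):
    """Solve (phi1, kappa1, omega2, phi2, kappa2) from n >= 5 matches.
    p1, p2: (n, 2) distortion-free image coordinates of the matches."""
    phi1 = kappa1 = omega2 = phi2 = kappa2 = 0.0
    n = p1.shape[0]
    for _ in range(iters):
        P1 = (rot(0.0, phi1, kappa1) @ np.column_stack([p1, np.full(n, f)]).T).T
        P2 = (rot(omega2, phi2, kappa2) @ np.column_stack([p2, np.full(n, f)]).T).T
        X1, Y1, Z1 = P1[:, 0], P1[:, 1], P1[:, 2]
        X2, Y2, Z2 = P2[:, 0], P2[:, 1], P2[:, 2]
        Q = f * Y1 / Z1 - f * Y2 / Z2            # residual y-parallax, eq. (2.25)
        A = np.column_stack([X1 * Y2 / Z1,       # coefficient columns of eq. (2.26)
                             -X1,
                             -(Y1 * Y2 / Z1 + Z1),
                             -Y1 * X2 / Z1,
                             X2])
        dx, *_ = np.linalg.lstsq(A, Q, rcond=None)
        phi1, kappa1, omega2, phi2, kappa2 = (
            np.array([phi1, kappa1, omega2, phi2, kappa2]) + dx)
        if np.linalg.norm(dx) < 1e-10:           # Q is (nearly) zero for all matches
            break
    return phi1, kappa1, omega2, phi2, kappa2
```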

3-D localization from stereo vision

Once the stereo image pair is aligned after stereo rectification, the y coordinates of corresponding points are the same for horizontal stereo vision. The epipolar geometry of the rectified image pair becomes that of Fig. 2.10.

Figure 2.10 Epipolar geometry for rectified stereo vision

The object point's coordinates in the camera-centered coordinate system can then easily be derived by simple triangulation, as expressed in equation (2.28):

X_p = \frac{B(x - u_0)}{x - x'} = \frac{B(x - u_0)}{d}, \qquad
Y_p = \frac{B(y - v_0)}{d}, \qquad
Z_p = \frac{B f}{d} \quad (2.28)

Here x and x' are the image coordinates of the matching pair along the x direction in the left and right images, respectively. The difference between x and x' is called disparity in computer vision or parallax in photogrammetry; d is used to represent this disparity.
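The triangulation of equation (2.28) is simple enough to state directly in code; the sketch below assumes a rectified pair with the focal length and image coordinates expressed in consistent units, and the function name is illustrative.

```python
import numpy as np

def triangulate_rectified(xl, yl, xr, f, B, u0, v0):
    """Recover camera-frame coordinates of a point from a rectified stereo
    match using equation (2.28)."""
    d = xl - xr                       # disparity (parallax)
    Xp = B * (xl - u0) / d
    Yp = B * (yl - v0) / d
    Zp = B * f / d
    return np.array([Xp, Yp, Zp])
```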

56 2.3. Feature Matching and Visual Odometry As we have seen so far, the corresponding features are critical in epipolar geometry, relative orientation, and 3-D scene recovery. The accuracy of the recovered 3-D scene is highly influenced by the feature extraction accuracy. Most of the classical feature detectors developed in the last century are based on finding the local extrema of the image gradient, i.e. the local maximum of the first-order derivatives or the zero crossing of the secondorder derivatives. Since the introduction of SIFT in 1999 (Lowe, 1999), scale invariant feature detectors have been attracting increasing attention for dealing with scale changes. This section will focus on feature extraction, description and matching in terms of accuracy and reliability. The VO algorithm based on stereo vison is also introduced in this section Feature extraction and description We will first introduce the classic Harris corner detection algorithm, then discuss the invariant scale blob region detector, SIFT, followed by a brief review of the non-maximum suppression technique in finding the local extrema for feature extraction. Finally several popular feature descriptors for the detected feature, such as local appearance, histogram of gradient, and derivatives, are studied. 40

Harris corner detector

The Harris corner detector proposed by Harris and Stephens (1988) first computes the autocorrelation matrix A of equation (2.29). Then, judging from the eigenvalues of A, the current pixel can be classified into one of three types: a corner at the intersection of two edges if both eigenvalues are large positive values, a point on an edge if one eigenvalue is large and the other is close to 0, or a point in a homogeneous region if both eigenvalues are close to 0.

A = G(\sigma_i) * \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \quad (2.29)

In Fig. 2.11, three image patches corresponding to these three types of center point are shown. The eigenvalues for the center pixel of each image patch are: (a) corner feature, eigenvalues 97 and 99; (b) edge feature, eigenvalues 196 and 0; and (c) homogeneous region, eigenvalues 0 and 0.

Figure 2.11 Three image patches representing different types of regions
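The eigenvalue-based classification above can be prototyped in a few lines; the sketch below computes the Gaussian-weighted structure tensor of equation (2.29) with SciPy and returns its two eigenvalues per pixel. The smoothing scale and the choice of Sobel derivatives are illustrative assumptions, and the thresholds that decide what counts as a "large" eigenvalue are left to the application.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def structure_tensor_eigenvalues(img, sigma=1.5):
    """Eigenvalues of the autocorrelation matrix of equation (2.29) at every
    pixel: large/large -> corner, large/small -> edge, small/small -> flat."""
    img = img.astype(float)
    Ix = sobel(img, axis=1)
    Iy = sobel(img, axis=0)
    # Gaussian-weighted sums of the gradient products, G(sigma_i) * [...]
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    # closed-form eigenvalues of the symmetric 2x2 matrix [[Sxx, Sxy], [Sxy, Syy]]
    tr = Sxx + Syy
    root = np.sqrt(((Sxx - Syy) / 2.0) ** 2 + Sxy ** 2)
    return tr / 2.0 + root, tr / 2.0 - root
```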

However, in practice the Harris corner metric of equation (2.30), built from the trace and determinant of A, is used for corner identification rather than computing the eigenvalues directly, since the latter is much more time consuming:

R = \det(A) - \kappa \,\mathrm{trace}^2(A) \quad (2.30)

The multiplier κ in the above equation is a constant; corner pixels with two large positive eigenvalues correspond to large Harris metric responses. Corner points can then be marked as local maxima of the Harris measure via non-maximum suppression. Obviously, the Harris corner detector cannot deal with scale changes, although it is invariant to rotation: it finds corner points having a local maximum of the image gradient at a fixed scale.

SIFT corner detector

The SIFT detector inherits the DoG approach proposed by Crowley and Parker (1984) for dealing with scale changes. It detects a group of local extrema from a series of DoG responses obtained by convolving the image with a series of DoG functions at different scales, as shown in equation (2.31):

DoG(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y) \quad (2.31)

Here k is a constant multiplicative factor producing a series of local scales k^{n-1}\sigma (n = 1, 2, \ldots), and G is the 2-D Gaussian distribution with standard deviation σ. Next, all the local extrema

(minima or maxima) points are identified across the scale space. These candidate points are then filtered by removing those with low contrast or strong edge responses, resulting in keypoints that are stable for matching purposes. Finally, based on the local image properties of each keypoint, its position is refined to sub-pixel accuracy and an orientation is assigned to it. In Fig. 2.12, an example is shown of the procedure for generating a series of DoG images in SIFT feature detection. The image is downsampled three times by a factor of 2, generating an image pyramid; a scale factor k of 2 is used to produce the series of Gaussian smoothing functions; and DoG responses are shown for the image pyramid at different scales. The images from top to bottom are: (a) the original image; (b) the original image convolved with a series of Gaussian functions having different sigmas; (c) DoG responses of adjacent Gaussian responses; (d) the Gaussian responses in (b) downsampled by a factor of 2; (e) DoG responses of the downsampled adjacent Gaussian images; and (f)-(i) the downsampling and DoG procedures repeated two more times.
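A minimal sketch of the octave/DoG construction just described is shown below, using SciPy's Gaussian filtering and downsampling; the default number of octaves, levels per octave and the scale factor are illustrative rather than the values used by any particular SIFT implementation.

```python
from scipy.ndimage import gaussian_filter, zoom

def dog_pyramid(img, sigma0=1.6, k=2.0 ** 0.5, levels=4, octaves=3):
    """Stack of DoG responses per equation (2.31): within each octave, smooth
    with sigma0 * k**n and difference adjacent Gaussian images, then
    downsample by a factor of 2 for the next octave (Fig. 2.12 uses k = 2)."""
    pyramid = []
    current = img.astype(float)
    for _ in range(octaves):
        gaussians = [gaussian_filter(current, sigma0 * k ** n)
                     for n in range(levels + 1)]
        dogs = [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]
        pyramid.append(dogs)
        current = zoom(current, 0.5)    # downsample by a factor of 2
    return pyramid
```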

60 Figure 2.12 A series of DOG images for SIFT feature detection Non-maximum suppression Non-maximum suppression is the procedure for finding all the local maxima in an image. It first appeared in the edge detection field to make thick edges into thin ones (Rosenfeld and Kak, 1976). Then Kitchen and Rosenfeld (1982) borrowed this method to locate the 44

feature points in an image. Since then, various feature point detectors have utilized this approach for interest point selection. Research on implementing non-maximum suppression more efficiently has also attracted much attention (Herk, 1992; Neubeck and Van Gool, 2006). In this section, only the principles behind non-maximum suppression are introduced using a straightforward approach, without considering efficiency. As shown in Fig. 2.13a, the simplest method is based on raster scan order, i.e., pixels are visited from left to right and then top to bottom. For a given center point c, its (2n+1) × (2n+1) neighborhood window is visited and each neighbor is compared with c in raster scan order. If a neighbor is found to be equal to or larger than the central point c, then c is regarded as a non-maximum, and the algorithm skips to the next pixel in the scan line. The algorithm repeats until all the pixels in the image have been visited. Sometimes the local minima also need to be found, for example when the pixel values denote the responses of derivative-based filters; the procedure is similar to non-maximum suppression except that the comparison searches for local minimum values.
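The raster-scan procedure described above translates directly into the following sketch; it uses the "equal or larger neighbor suppresses the center" rule from the text and makes no attempt at the efficiency improvements cited above.

```python
import numpy as np

def non_maximum_suppression(response, n=1):
    """Mark pixels that are strict local maxima of `response` within their
    (2n+1) x (2n+1) neighborhood, using the straightforward raster-scan test."""
    rows, cols = response.shape
    maxima = np.zeros_like(response, dtype=bool)
    for r in range(n, rows - n):
        for c in range(n, cols - n):
            center = response[r, c]
            window = response[r - n:r + n + 1, c - n:c + n + 1]
            # only the center itself may be >= center; otherwise c is suppressed
            if np.count_nonzero(window >= center) == 1:
                maxima[r, c] = True
    return maxima
```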

62 Figure 2.13 Neighbors scan order in non-maximum suppression: (a) raster scan order; (b) spiral scan order According to the experiment in Pham (2010), the average total comparison times are O(n) for a (2n+1) (2n+1) rectangular neighbor in the raster scan order approach. The comparison times can be significantly reduced by using a local spiral order (Fig. 2.13b) rather than the raster scan order (Forstner and Gulch, 1987). The spiral order method compares with the closest neighbor first and guarantees the central pixel is the local maximum for the 3 3 neighbor before it is tested against a larger neighbor. Local appearance based feature descriptor In the feature detection stage, a set of candidate points are identified. Thus for each detected point, its location in the image is known. Next a descriptor needs to be computed for each candidate point based on its immediate neighbor, which is to be compared with descriptors in another image to determine the best match. The simplest descriptor is based on the candidate point s local appearance, such as a small window of size 3 3 surrounding the 46

feature point. All the pixels within this window are then compared against another feature's descriptor using some similarity metric. As discussed in the last chapter, the commonly used similarity metrics are normalized cross correlation (NCC), the sum of squared differences (SSD), etc. Equations (2.32) and (2.33) describe the NCC and SSD metrics, in which (i, j) denotes the interest point's location in the image f, \bar{f}_{i,j} is the mean of f under the rectangular window, t is the template or descriptor of the feature to be compared with, and \bar{t} is the mean of the template:

NCC(i, j) = \frac{\sum_{u}\sum_{v} \big[f(u - i, v - j) - \bar{f}_{i,j}\big]\big[t(u, v) - \bar{t}\big]}{\sqrt{\sum_{u}\sum_{v}\big[f(u - i, v - j) - \bar{f}_{i,j}\big]^2 \; \sum_{u}\sum_{v}\big[t(u, v) - \bar{t}\big]^2}} \quad (2.32)

SSD(i, j) = \sum_{u}\sum_{v} \big[f(u - i, v - j) - t(u, v)\big]^2 \quad (2.33)

Obviously, NCC, SSD and other similarity metrics based on local appearance cannot deal with changes in scale, orientation or viewpoint; the matching process usually fails in such cases.
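For two already-extracted patches of equal size, the two metrics reduce to a few NumPy lines. The sketch below operates on pre-cut windows rather than scanning f at offsets (i, j), which is sufficient for comparing one candidate match at a time; the function names are illustrative.

```python
import numpy as np

def ssd(patch, template):
    """Sum of squared differences, equation (2.33)."""
    return float(np.sum((patch.astype(float) - template.astype(float)) ** 2))

def ncc(patch, template):
    """Normalized cross correlation, equation (2.32); returns a value in
    [-1, 1], with 1 meaning a perfect (linearly related) match."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    denom = np.sqrt(np.sum(p ** 2) * np.sum(t ** 2))
    return float(np.sum(p * t) / denom) if denom > 0 else 0.0
```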

SIFT feature descriptor

The SIFT descriptor for each keypoint computes the sub-histograms of the 4 × 4 sub-regions surrounding the keypoint, each sub-histogram consisting of 8 bins. A vector of 128 elements is thus obtained, serving as the SIFT descriptor that represents the local gradient structure around the keypoint. The SIFT descriptor has proven to be robust against changes in illumination, rotation, and scale, along with viewpoint changes of up to 60 degrees (Fraundorfer and Scaramuzza, 2012). The SIFT descriptor can also be used for corner features; however, it is not as distinctive as it is for blob features. The detailed procedure for computing the SIFT descriptor is listed below.

a) Partition the keypoint's neighborhood into a 4 × 4 grid, with each cell covering 4 × 4 pixels;
b) Rotate the keypoint's neighborhood grid by the orientation assigned in the SIFT detection stage;
c) Compute the gradient magnitude, orientation and associated weights around the keypoint for all the scales;
d) Construct orientation bins of [-π, -3π/4, ..., 3π/4] for each sub-region of the 4 × 4 grid, compute the orientation for each pixel, and assign each pixel to the corresponding orientation bin; and
e) Concatenate all 16 sub-histograms into a 128-dimensional vector, and normalize this vector to obtain the descriptor of the keypoint.

Derivatives based feature descriptor

Feature descriptors represented by a set of local derivatives are found to be invariant to similarity transformations and additionally quasi-invariant to 3-D projection (Florack et al., 1994). The most popular derivative-based descriptor, the local greyvalue invariants proposed by Schmid and Mohr (1997), is one in which

the complete set of derivatives is obtained by convolving the image with Gaussian derivatives computed up to the third order. The complete set of differential invariants can be found in Schmid and Mohr (1997). Given the scale factor σ, the local derivative L_{i_1 \ldots i_k \ldots i_N} of order N at a feature point x = (u, v) is computed by convolving the image I with the N-th order Gaussian derivatives G_{i_1 \ldots i_N}(x, \sigma), as shown in equation (2.34), where i_k \in \{u, v\} (Witkin, 1983):

L_{i_1 \ldots i_k \ldots i_N}(x, \sigma) = \int I(x')\, G_{i_1 \ldots i_N}(x - x', \sigma)\, dx' \quad (2.34)

Another type of descriptor, based on the Sobel operator, was developed by Geiger et al. (2010). Named efficient large-scale stereo (ELAS), it allows for dense matching with small aggregation windows by reducing ambiguities in the correspondences. The descriptor vector stacking the Sobel operator responses is obtained by concatenating the horizontal and vertical 3 × 3 Sobel filter responses within a 9 × 9 window surrounding the feature point. A mutual consistency check (i.e., correspondences are matched from left to right and from right to left) is also imposed to ensure robustness.

Stereo visual odometry based on local bundle adjustment

In this section the mathematical model for recovering changes in the camera's EO parameters between two adjacent stereo pairs from the feature correspondences in all four

66 images is discussed. As shown in Fig. 2.14, the current stereo frame CC-XY is transformed from the previous stereo frame CP-XYZ based on a 6-DoF rigid transformation, i.e. 3-DoF rotation (ω, ϕ, κ) and 3-DoF translation (Δx, Δy, Δz). The feature points (xpl, ypl) and (xpr, ypr) are triangulated in the previous stereo frame s coordinates, thus the 3-D coordinates (XP, YP, ZP) of all the matched feature points can be obtained from this procedure. Applying the 6-DoF rigid transformation, this set of 3-D coordinates can be transformed to the current frame s coordinate system (XC, YC, ZC). Then the new 3-D coordinates are backprojected to the current stereo frame as (x cl, y cl) and (x cr, y cr), therefore these projections on the stereo image pairs are compared with their corresponding feature points image coordinates (xcl, ycl) and (xcr, ycr) from stereo matching. Based on the camera projection model or collinearity condition, the 6-DoF rigid transformation (ω, ϕ, κ, Δx, Δy, Δz) and feature points 3-D coordinates (XP, YP, ZP) can be solved and refined based on the bundle adjustment technique. Since the bundle adjustment is performed only involving the adjacent stereo pairs, this is called window or local bundle adjustment (Mouragnon et al., 2006). Assuming the feature point P can be viewed in the previous and current stereo frame, the 3-D coordinates of P (XP, YP, ZP) from triangulation in the previous stereo frame can be expressed as equation (2.35). 50

X_P = \frac{B(x_{pl} - u_{0l})}{x_{pl} - x_{pr}}, \qquad
Y_P = \frac{B(y_{pl} - v_{0l})}{x_{pl} - x_{pr}}, \qquad
Z_P = \frac{B f}{x_{pl} - x_{pr}} \quad (2.35)

Figure 2.14 Stereo VO based on 6-DoF rigid transformation

Assume a 6-DoF (ω, ϕ, κ, Δx, Δy, Δz) rigid transformation between the previous stereo frame and the current stereo frame; since these 6-DoF parameters are close to 0, it can be assumed that their initial values are all 0. Object point P's

3-D coordinates (X_P, Y_P, Z_P) can then be transformed to the current stereo frame coordinates (X_C, Y_C, Z_C) with the following rigid transformation:

\begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix} = R_{\omega,\varphi,\kappa} \begin{bmatrix} X_P \\ Y_P \\ Z_P \end{bmatrix} + \begin{bmatrix} \Delta X \\ \Delta Y \\ \Delta Z \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} X_P \\ Y_P \\ Z_P \end{bmatrix} + \begin{bmatrix} \Delta X \\ \Delta Y \\ \Delta Z \end{bmatrix} \quad (2.36)

where R_{\omega,\varphi,\kappa} is the rotation matrix calculated from equation (2.15) with rotation angles (ω, ϕ, κ). Next, (X_C, Y_C, Z_C) is back-projected into the current stereo frame; the projection coordinates on the left and right images are (x'_{cl}, y'_{cl}) and (x'_{cr}, y'_{cr}). Since, after stereo rectification, the right camera frame differs from the left camera frame only by a translation of the baseline B along the X axis, the back-projection coordinates can be expressed as in equation (2.37) based on the camera projection model:

x'_{cl} = f \frac{X_C}{Z_C} + u_{0l} = f \frac{r_{11} X_P + r_{12} Y_P + r_{13} Z_P + \Delta X}{r_{31} X_P + r_{32} Y_P + r_{33} Z_P + \Delta Z} + u_{0l}

y'_{cl} = f \frac{Y_C}{Z_C} + v_{0l} = f \frac{r_{21} X_P + r_{22} Y_P + r_{23} Z_P + \Delta Y}{r_{31} X_P + r_{32} Y_P + r_{33} Z_P + \Delta Z} + v_{0l}

x'_{cr} = f \frac{X_C - B}{Z_C} + u_{0r} = f \frac{r_{11} X_P + r_{12} Y_P + r_{13} Z_P + \Delta X - B}{r_{31} X_P + r_{32} Y_P + r_{33} Z_P + \Delta Z} + u_{0r}

y'_{cr} = f \frac{Y_C}{Z_C} + v_{0r} = f \frac{r_{21} X_P + r_{22} Y_P + r_{23} Z_P + \Delta Y}{r_{31} X_P + r_{32} Y_P + r_{33} Z_P + \Delta Z} + v_{0r} \quad (2.37)

in which f is the unified focal length of the left and right cameras, and (u_{0l}, v_{0l}) and (u_{0r}, v_{0r}) are the principal points of the left and right cameras.

The observation equations are obtained by comparing the back-projected coordinates with the measured image coordinates from feature matching; theoretically, the difference between back-projection and feature matching should be 0:

x'_{cl} - x_{cl} = 0, \qquad y'_{cl} - y_{cl} = 0, \qquad x'_{cr} - x_{cr} = 0, \qquad y'_{cr} - y_{cr} = 0 \quad (2.38)

Given a total of N feature correspondences, 4N equations (observations) can be formed from the collinearity equations (2.37). The unknowns are the 6-DoF rigid transformation plus the 3N feature point 3-D coordinates in the camera frame. Therefore at least 6 feature correspondences are needed to recover the stereo VO system; if the number of features is more than 6, the least-squares method can be used to solve the system. The nonlinear projection equations (2.37) can be linearized by a first-order Taylor series expansion into the following matrix form:

A t + B X - \varepsilon = 0 \quad (2.39)

where

B = \begin{bmatrix}
\partial x'_{cl}/\partial X_P & \partial x'_{cl}/\partial Y_P & \partial x'_{cl}/\partial Z_P \\
\partial y'_{cl}/\partial X_P & \partial y'_{cl}/\partial Y_P & \partial y'_{cl}/\partial Z_P \\
\partial x'_{cr}/\partial X_P & \partial x'_{cr}/\partial Y_P & \partial x'_{cr}/\partial Z_P \\
\partial y'_{cr}/\partial X_P & \partial y'_{cr}/\partial Y_P & \partial y'_{cr}/\partial Z_P
\end{bmatrix}, \qquad
X = \begin{bmatrix} \Delta X_P & \Delta Y_P & \Delta Z_P \end{bmatrix}^T

A = \begin{bmatrix}
\partial x'_{cl}/\partial \omega & \partial x'_{cl}/\partial \varphi & \partial x'_{cl}/\partial \kappa & \partial x'_{cl}/\partial \Delta X & \partial x'_{cl}/\partial \Delta Y & \partial x'_{cl}/\partial \Delta Z \\
\partial y'_{cl}/\partial \omega & \partial y'_{cl}/\partial \varphi & \partial y'_{cl}/\partial \kappa & \partial y'_{cl}/\partial \Delta X & \partial y'_{cl}/\partial \Delta Y & \partial y'_{cl}/\partial \Delta Z \\
\partial x'_{cr}/\partial \omega & \partial x'_{cr}/\partial \varphi & \partial x'_{cr}/\partial \kappa & \partial x'_{cr}/\partial \Delta X & \partial x'_{cr}/\partial \Delta Y & \partial x'_{cr}/\partial \Delta Z \\
\partial y'_{cr}/\partial \omega & \partial y'_{cr}/\partial \varphi & \partial y'_{cr}/\partial \kappa & \partial y'_{cr}/\partial \Delta X & \partial y'_{cr}/\partial \Delta Y & \partial y'_{cr}/\partial \Delta Z
\end{bmatrix}
= \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & a_{36} \\
a_{41} & a_{42} & a_{43} & a_{44} & a_{45} & a_{46}
\end{bmatrix}

\varepsilon = \begin{bmatrix} x_{cl} - x'_{cl} \\ y_{cl} - y'_{cl} \\ x_{cr} - x'_{cr} \\ y_{cr} - y'_{cr} \end{bmatrix}, \qquad
t = \begin{bmatrix} \Delta\omega & \Delta\varphi & \Delta\kappa & \Delta X & \Delta Y & \Delta Z \end{bmatrix}^T

The coefficients a_{ij} are the partial derivatives of the back-projection equations (2.37) with respect to the six transformation parameters, evaluated at the current estimates. Writing X_C, Y_C, Z_C for the transformed coordinates of equation (2.36), each entry has the general form

a_{1j} = \frac{f}{Z_C^2}\Big(Z_C \frac{\partial X_C}{\partial p_j} - X_C \frac{\partial Z_C}{\partial p_j}\Big), \quad
a_{2j} = \frac{f}{Z_C^2}\Big(Z_C \frac{\partial Y_C}{\partial p_j} - Y_C \frac{\partial Z_C}{\partial p_j}\Big), \quad
a_{3j} = \frac{f}{Z_C^2}\Big(Z_C \frac{\partial X_C}{\partial p_j} - (X_C - B) \frac{\partial Z_C}{\partial p_j}\Big),

with p_j \in \{\omega, \varphi, \kappa, \Delta X, \Delta Y, \Delta Z\}. The translation-related entries reduce to simple expressions,

a_{14} = \frac{f}{Z_C}, \quad a_{15} = 0, \quad a_{16} = -\frac{f X_C}{Z_C^2}, \qquad
a_{24} = 0, \quad a_{25} = \frac{f}{Z_C}, \quad a_{26} = -\frac{f Y_C}{Z_C^2}, \qquad
a_{34} = a_{14}, \quad a_{35} = 0, \quad a_{36} = \frac{f (B - X_C)}{Z_C^2},

while the rotation-related entries (a_{11}, a_{12}, a_{13}, a_{21}, a_{22}, a_{23}, a_{31}, a_{32}) follow from the derivatives of the rotation matrix R_{\omega,\varphi,\kappa} with respect to ω, ϕ and κ. In particular a_{33} = a_{13}, and the fourth row repeats the second, a_{4i} = a_{2i} (i = 1, ..., 6), because y'_{cr} and y'_{cl} share the same expression.

72 In the above equation (2.39), A is the derivatives to the unknowns 6-DoF rigid transformation; B is the derivatives to the feature point s 3-D coordinates in the previous stereo frame; t is the adjustment to the 6-DoF unknown rigid transformation; X is the adjustment to the feature point s 3-D coordinates; and ε is the residue between backprojection and feature measurements from matching. Then, the linearized equations can be solved using the least-squares method in an iterative way until the update to the unknown parameters t is below a predefined threshold. When the 6-DoF rigid transformation parameters are obtained, the 3-D coordinates of the feature points are also refined from the local bundle adjustment. 56
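The window bundle adjustment just described can be prototyped with an off-the-shelf nonlinear least-squares solver. The sketch below uses SciPy's least_squares with numerical Jacobians instead of the analytic coefficients of equation (2.39), and the 'xyz' Euler convention stands in for R_{ω,φ,κ}; the function and parameter names are illustrative rather than those of the actual PanCam pipeline.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def backproject(points_c, f, B, u0l, v0l, u0r, v0r):
    """Project camera-frame points into the rectified left/right images,
    following equation (2.37)."""
    X, Y, Z = points_c[:, 0], points_c[:, 1], points_c[:, 2]
    return np.column_stack([f * X / Z + u0l,
                            f * Y / Z + v0l,
                            f * (X - B) / Z + u0r,
                            f * Y / Z + v0r])

def local_ba(obs_cur, pts_prev, f, B, u0l, v0l, u0r, v0r):
    """Refine the 6-DoF motion and the 3-D points by minimizing the
    reprojection residuals of equation (2.38).
    obs_cur: (N, 4) measured (xcl, ycl, xcr, ycr) in the current frame;
    pts_prev: (N, 3) triangulated points in the previous frame."""
    N = pts_prev.shape[0]

    def residuals(params):
        angles, t = params[:3], params[3:6]
        pts = params[6:].reshape(N, 3)
        R = Rotation.from_euler('xyz', angles).as_matrix()  # stand-in convention
        proj = backproject(pts @ R.T + t, f, B, u0l, v0l, u0r, v0r)
        return (proj - obs_cur).ravel()

    x0 = np.concatenate([np.zeros(6), pts_prev.ravel()])    # motion starts at zero
    sol = least_squares(residuals, x0)
    return sol.x[:6], sol.x[6:].reshape(N, 3)
```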

73 3. GEOMETRIC MODELING OF EXOMARS PANCAM The 2018 ESA ExoMars Rover missions, planned for six months of Martian surface operations, will focus on providing contextual information to detect, locate and measure targets of potential scientific interest, and to localize the landing site along with other geological research (ESA, 2014a). The purpose of this chapter is to develop an error propagation model that quantitatively estimates the potential capabilities of the ExoMars PanCam system for mapping and localization prior to launch. As the ExoMars rover will be expected to travel up to 100 meters per sol with a total of several kilometers during the entire mission on the Martian surface, high-precision rover localization and topographic mapping will be important for traverse path planning, safe planetary surface operations and accurate embedding of scientific observations into a global spatial context (Paar et al., 2008). For such purposes, the ExoMars rover PanCam system will acquire an imagery network providing vision information for photogrammetric algorithms to localize the rover and to generate 3-D mapping products essential to mission planning and scientific analysis. Since the design of the PanCam will influence the localization and mapping accuracy, quantitative error analysis of the PanCam design will improve scientists awareness of the achievable accuracy and enable the PanCam team to optimize the design for achieving higher localization and mapping accuracy. 57

3.1. ESA ExoMars PanCam System Components

Table 3.1 ExoMars and MER vision system camera specifications comparison

                         ExoMars PanCam WAC   ExoMars PanCam HRC   MER Pancam          MER Navcam
  Focal Length           22 mm                ~180 mm              43 mm               14.67 mm
  FOV                    34° horizontal       4.8° horizontal      16° horizontal      45° horizontal
                         (48° diagonal)       (6.8° diagonal)      (22.5° diagonal)    (67° diagonal)
  Stereo Baseline        50 cm                N/A                  30 cm               20 cm
  Image Size             1024 × 1024          1024 × 1024          1024 × 1024         1024 × 1024
  Angular Resolution     580 μrad/pixel       82 μrad/pixel        280 μrad/pixel      820 μrad/pixel
  Mass                   200 g (each)         300 g                270 g (each)        220 g (each)
  Power                  1.5 W                0.9 W                2.15 W              2.15 W
  Height above Surface   1.5 m                1.5 m                1.54 m              1.54 m
  Pan/Tilt Angle         360° azimuth;        360° azimuth;        360° azimuth;       360° azimuth;
                         ±90° elevation       ±90° elevation       ±90° elevation      ±90° elevation

The ESA ExoMars Rover PanCam system consists of a pair of identical Wide-Angle Cameras (WAC) and one High-Resolution Camera (HRC). The WAC has a larger field of view (FOV) of 34°, whereas the HRC only has a FOV of 4.8° (DLR, 2014). This can be compared with the NASA MER 2003 mission's Navcam 45° FOV and Pancam 16° FOV (Griffiths et al., 2006). The ExoMars PanCam's three-camera configuration is quite different from the NASA MER twin rovers' stereo Navcam and stereo Pancam vision systems in terms of both the individual camera specifications and the stereo

75 configuration. A detailed comparison between the ExoMars and MER vision components are listed in Table 3.1 above. Figure 3.1 ExoMars PanCam system configuration From the comparisons in the above table, we notice that the ExoMars PanCam only has a stereo WAC system as the main source of localization and mapping while the MER boasts two stereo systems Navcam and Pancam. However the ExoMars PanCam stereo WAC pair has a much longer baseline than either of the MER Navcam and Pancam stereo pairs, which in general means more accurate target localization measurements due to the longer baseline. Although the ExoMars PanCam doesn t have a secondary stereo vision system, the third HRC camera could possibly be used with one of the WACs to form another stereo pair (Fig. 3.1). Moreover, the HRC s much longer focal length than any of the cameras, empowers it to see much farther than the WAC. Provided with these important characteristics, how incorporating HRC data impacts the attainable accuracy levels for mapping and localization was studied. The challenge for finding correspondences between 59

76 HRC and WAC is the large scale and viewpoint differences between these two cameras. This problem is explored in the next chapter. The attainable level of accuracy is analyzed theoretically based on a simplified rover stereo VO model involving only two stops typically no more than 10 m apart. In practical experiments or a rover mission, the rover doesn t have to actually stop during its movement. The term stop is used here merely to emphasize that the rover takes stereo images at this position for VO processing purposes. In the two-stop rover stereo VO model, the location of the rover at the starting point is assumed to be fixed. The location of the rover at the adjacent stop is determined by identification of correspondences between these two stops. It was assumed that the only possible errors are those in the measurements of the image coordinates of the features. Based on stereo triangulation, feature matching errors are propagating from the image coordinates to the measurement of the features in the 3-D spatial coordinates. In this manner, measurements of corresponding features from adjacent stops, respectively, as well as any measurement error associated with each features was obtained. Assume a rigid transform between these two sets of corresponding features obtained from adjacent stops, the translation and rotation from the first stop to the next one could be solved based on the least-squares. As a result, feature-measurement errors were propagated to the location of the second stop. Finally, applying the error propagation law, this two-stop VO process can be extended to multiple stops. 60

3.2. PanCam Geometric Modeling

Mapping error analysis of PanCam stereo WAC

Features are critical for VO, so feature measurement errors are the main error source for rover motion recovery in stereo vision. The feature matching error propagates into the feature measurement error through stereo triangulation, and the latter can be described by a covariance matrix in the 3-D spatial coordinate system. A brief discussion of the feature mapping error based on simplified stereo triangulation can be found in Matthies and Shafer (1987), which leads to equation (2.35) described in the previous chapter. This equation expresses the feature point P's 3-D coordinates as nonlinear functions of its 2-D projections on the stereo images and of fixed parameters such as the focal length and baseline. In order to derive P's covariance matrix, this nonlinear equation, restated below as equation (3.1), needs to be linearized so that the error propagation law can be applied; its partial derivatives are derived next.

X_P = \frac{B(x_l - u_0)}{x_l - x_r}, \qquad
Y_P = \frac{B(y_l - v_0)}{x_l - x_r}, \qquad
Z_P = \frac{B f}{x_l - x_r} \quad (3.1)

Differentiating equation (3.1) with respect to the image observations gives

\frac{\partial X_P}{\partial x_l} = \frac{B[(x_l - x_r) - (x_l - u_0)]}{(x_l - x_r)^2} = -\frac{(X_P - B) Z_P}{B f}, \qquad
\frac{\partial X_P}{\partial y_l} = 0, \qquad
\frac{\partial X_P}{\partial x_r} = \frac{B(x_l - u_0)}{(x_l - x_r)^2} = \frac{X_P Z_P}{B f},

\frac{\partial Y_P}{\partial x_l} = -\frac{B(y_l - v_0)}{(x_l - x_r)^2} = -\frac{Y_P Z_P}{B f}, \qquad
\frac{\partial Y_P}{\partial y_l} = \frac{B}{x_l - x_r} = \frac{Z_P}{f}, \qquad
\frac{\partial Y_P}{\partial x_r} = \frac{Y_P Z_P}{B f},

\frac{\partial Z_P}{\partial x_l} = -\frac{B f}{(x_l - x_r)^2} = -\frac{Z_P^2}{B f}, \qquad
\frac{\partial Z_P}{\partial y_l} = 0, \qquad
\frac{\partial Z_P}{\partial x_r} = \frac{Z_P^2}{B f}.

In the above equations, (X_P, Y_P, Z_P) are the triangulated 3-D coordinates of P in the stereo frame, and (x_l, y_l) and (x_r, y_l) are P's 2-D projections on the left and right images, respectively. Since the two WACs form a rectified stereo pair, the y coordinates are the same for each pair of feature correspondences. The baseline of the stereo camera is B, and the focal length of both cameras is f. From the partial derivatives above, the feature mapping error grows approximately quadratically with the range from the object to the camera. After obtaining the partial derivatives of P's 3-D position with respect to its 2-D image coordinates, the Jacobian matrix can be formed as in the following equations.

J = \begin{bmatrix}
\partial X_P/\partial x_l & \partial X_P/\partial y_l & \partial X_P/\partial x_r \\
\partial Y_P/\partial x_l & \partial Y_P/\partial y_l & \partial Y_P/\partial x_r \\
\partial Z_P/\partial x_l & \partial Z_P/\partial y_l & \partial Z_P/\partial x_r
\end{bmatrix}
= \begin{bmatrix}
-\dfrac{(X - B) Z}{B f} & 0 & \dfrac{X Z}{B f} \\
-\dfrac{Y Z}{B f} & \dfrac{Z}{f} & \dfrac{Y Z}{B f} \\
-\dfrac{Z^2}{B f} & 0 & \dfrac{Z^2}{B f}
\end{bmatrix} \quad (3.2)

Let σ_xl, σ_yl and σ_xr denote the errors associated with x_l, y_l and x_r resulting from the feature matching process. Typically these errors are up to half a pixel for a good and reliable feature matching algorithm. According to the error propagation law, P's covariance matrix Σ_P can be expressed with J and (σ_xl, σ_yl, σ_xr) as follows:

\Sigma_P = J \begin{bmatrix} \sigma_{x_l}^2 & 0 & 0 \\ 0 & \sigma_{y_l}^2 & 0 \\ 0 & 0 & \sigma_{x_r}^2 \end{bmatrix} J^T \quad (3.3)

Once the error covariance is obtained, P's confidence region can be visualized as an error ellipse indicating the region into which the majority (95%) of P's possible positions fall. In this theoretical analysis, the errors along the ranging (Z-axis) and azimuthal (X-axis) directions are of more interest than those along the elevation (Y-axis) direction; i.e., only (X_P, Z_P), representing P's planimetric location, is taken into account and Y_P is assumed constant.

Error analysis of stereo VO based on a simplified two-stop model

80 The methodology of utilizing feature points (landmarks) in rover localization via stereo VO has been successfully used in the NASA MER 2003 mission (Cheng et al., 2006; Li et al., 2005). The general stereo VO configuration is shown in Fig. 3.2, where the rover takes a stereo pair at each stop during its traverse and feature matching is performed between adjacent stops. Normally, in order to guarantee successful VO performance, the distance between adjacent stops is kept relatively small (up to 1~2 m) and the viewpoint angle difference is less than 18 (Xu, 2004). As the rover moves on for several stops, it would stay at a position (called a site) until it receives commands to move to another site in the next sol s mission. In the following section, how the feature point mapping error propagates into the rover motion recovery between the two adjacent stops is investigated. Figure 3.2 Stereo VO configuration in rover traverses 64

As already discussed in Section 2.3.2, assuming the rover measures the same feature points from two adjacent stops, a 6-DoF rigid transformation (i.e., 3-DoF rotation and 3-DoF translation) relates the coordinates of these two sets of observations, as expressed in equation (2.36). Introducing Gaussian noise into the feature point measurements, equation (2.36) can be rewritten as

\begin{bmatrix} X_C + \upsilon_{X_C} \\ Y_C + \upsilon_{Y_C} \\ Z_C + \upsilon_{Z_C} \end{bmatrix}
= \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}
\begin{bmatrix} X_P + \upsilon_{X_P} \\ Y_P + \upsilon_{Y_P} \\ Z_P + \upsilon_{Z_P} \end{bmatrix}
+ \begin{bmatrix} \Delta X \\ \Delta Y \\ \Delta Z \end{bmatrix} \quad (3.4)

Since only the azimuthal and range directions, represented by X and Z respectively, are taken into account, the above 6-DoF transformation can be simplified into a 4-DoF rigid transformation by ignoring the elevation direction Y:

\begin{bmatrix} X_C + \upsilon_{X_C} \\ Z_C + \upsilon_{Z_C} \end{bmatrix}
= \underbrace{\begin{bmatrix} r_{11} & r_{13} \\ r_{31} & r_{33} \end{bmatrix}}_{\text{2-D rotation}}
\begin{bmatrix} X_P + \upsilon_{X_P} \\ Z_P + \upsilon_{Z_P} \end{bmatrix}
+ \underbrace{\begin{bmatrix} \Delta X \\ \Delta Z \end{bmatrix}}_{\text{translation}}
= \begin{bmatrix} a & -b \\ b & a \end{bmatrix}
\begin{bmatrix} X_P + \upsilon_{X_P} \\ Z_P + \upsilon_{Z_P} \end{bmatrix}
+ \begin{bmatrix} c \\ d \end{bmatrix} \quad (3.5)

In the above equation, the rotation matrix \begin{bmatrix} r_{11} & r_{13} \\ r_{31} & r_{33} \end{bmatrix} actually denotes a 2-D rotation, so it can be represented by the 2-D rotation matrix \begin{bmatrix} a & -b \\ b & a \end{bmatrix} containing only two unknown parameters. Based on the least-squares method, the optimized rigid transformation parameters (a, b, c, d) can be derived by minimizing the differences between the feature measurements and the rigid transformation results. The least-squares method is an iterative process in which adjustments are calculated and then added to the unknown parameters

after each iteration. When the norm of the adjustment falls below a predefined threshold, the least-squares process is finished and the optimized unknown parameters are obtained. The adjustments (Δa, Δb, Δc, Δd) are added to the unknowns (a, b, c, d), respectively, and the following linearized system is obtained:

\begin{bmatrix} X_C + \upsilon_{X_C} \\ Z_C + \upsilon_{Z_C} \end{bmatrix}
= \begin{bmatrix} a + \Delta a & -(b + \Delta b) \\ b + \Delta b & a + \Delta a \end{bmatrix}
\begin{bmatrix} X_P + \upsilon_{X_P} \\ Z_P + \upsilon_{Z_P} \end{bmatrix}
+ \begin{bmatrix} c + \Delta c \\ d + \Delta d \end{bmatrix}

\begin{bmatrix} a & -b & -1 & 0 \\ b & a & 0 & -1 \end{bmatrix}
\begin{bmatrix} \upsilon_{X_P} \\ \upsilon_{Z_P} \\ \upsilon_{X_C} \\ \upsilon_{Z_C} \end{bmatrix}
+ \begin{bmatrix} X_P & -Z_P & 1 & 0 \\ Z_P & X_P & 0 & 1 \end{bmatrix}
\begin{bmatrix} \Delta a \\ \Delta b \\ \Delta c \\ \Delta d \end{bmatrix}
= \begin{bmatrix} X_C - (a X_P - b Z_P + c) \\ Z_C - (b X_P + a Z_P + d) \end{bmatrix}

A V + B \Delta = L \quad (3.6)

Since all of the coordinate observations of the feature points contain errors, (\upsilon_{X_P}, \upsilon_{Z_P}) and (\upsilon_{X_C}, \upsilon_{Z_C}) describe the observation errors associated with feature point P in the previous and current measurements, respectively. In the linearized equation (3.6), V is the vector of coordinate corrections of the observations and Δ is the vector of corrections to the unknown parameters. According to the error propagation principle for this least-squares adjustment, the covariance matrix of the unknown rigid transformation parameters is estimated as

\Sigma_{\Delta\Delta} = \left[ B^T (A \Sigma A^T)^{-1} B \right]^{-1} \quad (3.7)

where Σ is the covariance matrix of the feature point observations from the adjacent stops, which can be expressed as

\Sigma = \begin{bmatrix} \Sigma_P & 0 \\ 0 & \Sigma_C \end{bmatrix} \quad (3.8)
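The two covariance computations above (the point covariance of equation (3.3), which feeds Σ_P and Σ_C, and the parameter covariance of equation (3.7)) are small enough to sketch directly. The helper names are illustrative, and the stacked A, B and observation covariance for all features are assumed to have been assembled row-block by row-block following equation (3.6).

```python
import numpy as np

def point_covariance(X, Y, Z, B, f, sxl, syl, sxr):
    """Equations (3.2)-(3.3): covariance of a triangulated point given the
    matching noise (sxl, syl, sxr) of its image coordinates."""
    J = np.array([[-(X - B) * Z / (B * f), 0.0,   X * Z / (B * f)],
                  [-Y * Z / (B * f),       Z / f, Y * Z / (B * f)],
                  [-Z ** 2 / (B * f),      0.0,   Z ** 2 / (B * f)]])
    return J @ np.diag([sxl ** 2, syl ** 2, sxr ** 2]) @ J.T

def motion_covariance(A, B_mat, sigma_obs):
    """Equation (3.7): covariance of the 2-D rigid transform parameters
    (a, b, c, d), with A, B_mat and sigma_obs stacked over all features."""
    W = np.linalg.inv(A @ sigma_obs @ A.T)
    return np.linalg.inv(B_mat.T @ W @ B_mat)
```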

Finally, the same optimized rigid transformation can be applied to the initial position (X_{S2}, Z_{S2}) of the second stop to obtain its adjusted coordinates (\tilde{X}_{S2}, \tilde{Z}_{S2}):

\begin{bmatrix} \tilde{X}_{S2} \\ \tilde{Z}_{S2} \end{bmatrix}
= \begin{bmatrix} X_{S2} & -Z_{S2} & 1 & 0 \\ Z_{S2} & X_{S2} & 0 & 1 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
= C \begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix} \quad (3.9)

Consequently, according to the error propagation law, the covariance matrix of the rover's new position at the second stop is

\Sigma_{S2} = C \Sigma_{\Delta\Delta} C^T \quad (3.10)

Improvement of mapping and localization by incorporating HRC

Since the HRC has a much longer focal length than the WAC, it can observe distant objects in more detail than the WAC. From Table 3.1, the angular resolution of the HRC can be seen to be about 7 times as high as that of the WAC. This means that, when observing the same object at a given distance, the HRC has roughly 7 times the spatial resolution of the WAC in both the horizontal and vertical directions. According to the ExoMars specifications, the HRC is placed between the two stereo WACs, with an offset of 154 mm from the center of the stereo WAC (Fig. 3.1). Thus the HRC and

84 the left WAC were used to form a stereo pair resulting in a baseline of 404 mm (rather than HRC and the right WAC which has a shorter baseline of only 96 mm). Figure 3.3 Stereo configuration for HRC and WAC As the HRC and WAC have different focal lengths, their image planes were no longer at the same plane compared with a standard stereo pair such as the stereo WAC discussed before. In order to leverage the stereo analysis done for the stereo WAC, the HRC and WAC were brought to the same image plane by converting the WAC s focal length to that of the HRC. As shown in Fig. 3.3, this can be done by upsampling the original WAC, i.e. 68

interpolating the original image to generate a new one with a higher resolution. The interpolation factor depends on the ratio between the new focal length and the original one; in our case, the spatial resolution of the interpolated WAC is about 9 times its original value. After interpolation, the same stereo analysis method used for the stereo WAC can be applied, and the following covariance matrix for the feature point mapping accuracy is obtained:

\Sigma_P = J \begin{bmatrix} \sigma_{x_{WAC}}^2 & 0 \\ 0 & \sigma_{x_{HRC}}^2 \end{bmatrix} J^T
= \begin{bmatrix} -\dfrac{(X - B) Z}{B f} & \dfrac{X Z}{B f} \\ -\dfrac{Z^2}{B f} & \dfrac{Z^2}{B f} \end{bmatrix}
\begin{bmatrix} \sigma_{x_{WAC}}^2 & 0 \\ 0 & \sigma_{x_{HRC}}^2 \end{bmatrix}
\begin{bmatrix} -\dfrac{(X - B) Z}{B f} & -\dfrac{Z^2}{B f} \\ \dfrac{X Z}{B f} & \dfrac{Z^2}{B f} \end{bmatrix} \quad (3.11)

In equation (3.11), the feature matching accuracies in the interpolated WAC image and the HRC image are represented by σ_xWAC and σ_xHRC, respectively. Note that the standard deviation for the interpolated WAC should be approximately 9 times as large as that in the original image because of the interpolation process.

However, compared with the wide 34° FOV of the WAC, the HRC has a much narrower FOV of only 4.8°. For a triplet of images taken by the stereo WAC and the HRC at the same time, only a small portion of the features fall within the HRC's FOV and can be used to improve the feature mapping accuracy by incorporating the HRC data. How much

improvement the HRC can bring to the stereo VO is quantified in the next section, which reports simulated results of the PanCam geometric modeling.

Bundle adjustment of PanCam imagery network

We have derived the error covariance associated with the recovered rover motion in the two-stop VO processing. This error propagates stop by stop as the rover moves multiple stops forward. The following equation expresses the general rigid transformation between adjacent rover stops in VO processing in homogeneous coordinates:

\begin{bmatrix} S_k \\ 1 \end{bmatrix}
= \begin{bmatrix} R_{k,k-1} & T_{k,k-1} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} S_{k-1} \\ 1 \end{bmatrix}, \qquad k = 1, 2, \ldots \quad (3.12)

Here S_k is the 3-D position of the rover at stop k, and R_{k,k-1} and T_{k,k-1} are the orientation and translation changes from stop k-1 to stop k. Thus, the rover's position at stop n can be expressed with respect to its starting position S_0 as

\begin{bmatrix} S_n \\ 1 \end{bmatrix}
= \begin{bmatrix} R_{n,n-1} & T_{n,n-1} \\ 0 & 1 \end{bmatrix}
\cdots
\begin{bmatrix} R_{i,i-1} & T_{i,i-1} \\ 0 & 1 \end{bmatrix}
\cdots
\begin{bmatrix} R_{1,0} & T_{1,0} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} S_0 \\ 1 \end{bmatrix}, \qquad i = 1, 2, \ldots, n \quad (3.13)

From the above equation it is evident that the error covariance of the rigid transformation parameters (R_{i,i-1} and T_{i,i-1}) between adjacent stops will propagate to the next

87 stop. For simplicity, only the planar movement of the rover (range and azimuthal direction) was taken into account in this theoretical analysis. The 6-DoF 3-D rigid transformation is degenerated to 4-DoF 2-D rigid transformation, and the covariance matrix Σ associated with the 4-DoF parameters in equation (3.7) was derived. Therefore the error covariance matrix Σ n of the rigid transformation at the end of the n-stop VO becomes the following: Σ n = (Σ ΔΔ ) n 1 (3.14) In MER operations, the VO can only maintain a relative accuracy of 1% over a short distance of 10 m (Xu, 2004). Here relative accuracy means the ratio between the accumulated error and the distance traveled. If the rover travels more than 10 m, the relative accuracy would exceed 1% using the VO processing method. In order to achieve higher accuracy rover localization, a least-squares based bundle adjustment (BA) could be applied to refine the rover s position as well as the feature point measurement. The concept of BA, in which a network of images taken at different positions and viewpoints are linked together by the landmarks / tie points (feature points), originated from photogrammetry. The BA is used to optimize the EO parameters of all the images and positions of all the tie points based on the collinearity condition equations. As discussed previously, the rover in the MER missions would travel an average distance of 30 m every day and stop moving overnight until receiving the commands to move the following day. Typically, the rover would take a full or partial panorama at each site 71

88 covering the distance it has traversed during the day before it goes to sleep (Fig. 3.4). BA processing is normally performed on these panoramas between adjacent sites. Figure 3.4 Panoramas at adjacent traverse sites From the perspective of the simplified theoretical model, the workflow for BA processing is very similar to the VO procedures. However the following differences between BA and VO for rover localization should be pointed out. 1. The feature points / tie points can be distributed anywhere in between the adjacent sites as long as they can be identified at both sites in BA-based processing, whereas in VO processing method the feature points only in the FOV of the camera in the stereo pair; 72

89 2. The VO process can be fully automated because the feature points are in the forward-looking directions of adjacent stereo pairs, whereas the BA typically involves manual selections of tie points which are viewed from the backwarddirection at the second site and can be difficult to identify using an automated matching algorithm; and 3. The BA processing includes more images with larger viewpoint and position changes than the VO, the latter of which only involves two stereo pairs taken at nearby positions. Thus the refined results from BA processing are more reliable since they cover a much larger area than VO. In the next section, the localization results from both VO and BA are compared Simulated Results of PanCam Mapping and Localization The error covariance model of feature correspondences for stereo triangulation has been established in equation (3.3). Also, the error propagation model from two-stop VO in equation (3.10) to multiple-stop VO in equation (3.14) has been established. Moreover, the improvement to stereo VO from adding the HRC to form an additional stereo pair was investigated. Finally, the potential improvement to rover localization accuracy by applying the photogrammetric BA process to panoramas taken at the adjacent traverse sites was 73

90 briefly discussed. In this section, the simulated results from the previous theoretical analysis are presented Mapping accuracy in stereo triangulation First, the feature points mapping accuracy for the ExoMars stereo WAC, MER stereo Navcam, and MER stereo Pancam based on covariance error analysis are shown. As the standard deviation of the feature point measurement is quadratic to the distance between the feature point and the stereo camera, it can easily be estimated that this standard deviation goes beyond 0.5 m when the feature point is more than 20 m away using equations (3.3) and (3.4). Therefore it was assumed that the maximum range of the feature points is around 20 m with respect to the camera center, and they are evenly distributed within the stereo camera s FOV. The following, Fig. 3.5 to Fig. 3.7, show the distribution of the feature points within the FOV and their associated error covariance calculated from the stereo camera specifications for ExoMars WAC, MER Navcam, and MER Pancam, respectively. Note the error ellipses of the distant feature points are so big that they are overlap each other for the MER stereo Navcam. From the above error ellipses results, it can be seen that the ExoMars stereo WAC bears a better mapping accuracy than MER Navcam due to its longer baseline and relative large focal length, also it boasts a larger FOV than MER 74

91 Pancam. These advantages in turn would benefit the rover motion recovery in the stereo VO procedures. Figure 3.5 Feature points distribution and associated error ellipse for ExoMars WAC Figure 3.6 Feature points distribution and associated error ellipses for MER Navcam 75

92 Figure 3.7 Feature points distribution and associated error ellipses for MER Pancam Error propagation from feature points mapping to motion recovery in VO Next, it shows how the feature mapping error propagates to the rover motion recovery in the stereo VO and the error covariance matrix associated with the second rover position in the two-stop stereo VO processing. 76

93 Figure 3.8 Standard deviation of rover localization vs. feature points number First it is necessary to analyze the influence of the total number of feature points on the localization accuracy as denoted by standard deviation. Assuming the rover moves a fix distance, for example 2 m along the range direction, where the standard deviation of the rover s position after the movement is calculated by finding the rigid transform of the same set of feature points from the two stops, respectively. If the maximum range of feature points is assumed to be increasing from 10 m to 20 m by 1 m per iteration, the number of feature points increases accordingly as indicated in Fig. 3.5(a). The following relationship 77

94 between the standard deviation of rover location and number of feature points in the motion recovery calculation can thus be obtained. Obviously, the more feature points that are observed, the more accurate the rover localization results obtained from the stereo VO processing can be. Figure 3.9 Standard deviation of rover localization vs. distance rover has travelled using only stereo WAC Then the relationship between the rover localization accuracy with respect to the distance the rover has travelled was investigated. This can be derived from Fig. 3.5(a), as the rover moves a longer distance, fewer corresponding feature points viewing from both stops can 78

95 be used for VO. Fig. 3.9 shows the relationship between the standard deviation of rover localization results and the distance the rover has travelled in the two-stop VO scenario Accuracy improvement by adding HRC to stereo WAC We have shown in equation (3.11) that the accuracy of feature points can be improved by adding the HRC to form an additional stereo pair with the left WAC. The following, Fig. 3.10, shows the improvement to feature points mapping accuracy within the HRC s narrow FOV. 79

96 Figure 3.10 Feature points mapping accuracy improvement by adding HRC The two lines in cyan color represent the HRC s narrow FOV with respect to the WAC s wide FOV. Only the feature points within the HRC s FOV have improvement in mapping accuracy. The covariance matrices of these points are replaced by the calculation from the HRC and left WAC. Therefore, the similar results of the standard deviation of rover localization vs. the distance the rover has travelled becomes the green dash-dot line below in Fig if the HRC is used in the calculation. The blue dash line represents the results using only the stereo WAC. 80

97 Figure 3.11 Comparisons of rover localization results between using stereo WAC only and stereo WAC plus HRC It can be seen that the improvement in the rover localization is not obvious if the rover only moves a relatively small distance. This is because the portion of the feature points that can be improved by HRC is trivial. As the rover moves farther along the range direction, this ratio becomes larger, thus the improvement to rover localization is more obvious by incorporating HRC. 81

98 BA processing for PanCam image network As has been discussed previously, the stereo VO processing is good for short-range rover movement and a relatively small change in view angle. However, the localization error would soon accumulate beyond the tolerance level as the rover travels longer distances. For long-range accurate rover localization, the BA-based rover localization method is preferable in order to maintain a relative error of 1%. In this section, a theoretical analysis of the maximum distance the rover can traverse between adjacent sites while maintaining a 1% relative localization error is presented. Since the two-site BA model is based on the partial or full panoramas taken at each site, in theory the FOV of the BA processing can be expanded to a full circle (i.e. 360 ) corresponding to the full panorama. The optimized FOV angle or convergence angle where the localization error is minimized for a fixed distance (such as 20 m) between the adjacent sites has to be found. Again, it was assumed that the feature points are evenly distributed within the forward- and backward-looking FOVs of site 1 and site 2, respectively (Fig. 3.12). 82

99 Figure 3.12 Feature points / landmarks distribution between adjacent sites Then, for a fixed-distance traverse segment between site 1 and site 2, the relative localization error with respect to various FOVs was investigated and the optimal FOV when the relative error was smallest was found. The different number of landmarks used in finding the optimal FOV angle was also studied. The results are shown in Fig

100 Figure 3.13 Relative localization error vs. various FOV in BA The optimal FOV angle is around 105 when there are 9 or 12 landmarks evenly distributed between the two sites, while this optimal FOV becomes 70 for 16 landmarks. Finally, the optimal FOV angle for each scenario of landmark distribution was fixed, and the traverse segment length of adjacent sites (e.g. from 10 m to 100m) changed to find the relationship between relative localization error and traverse segment length. The results for 84

101 9-landmarks scenario are shown in the following, Fig. 3.14, the relative error comparisons using only stereo WAC and stereo WAC plus HRC are also indicated. Figure 3.14 Relative localization error with respect to traverse segment lengths From the above figure it can be seen that the relative localization error is approximately linear to the traverse segment length and that the maximum traverse segment length when the rover maintains the 1% relative error standard is 58 m if using only the stereo WAC or 65 m if the HRC is incorporated into the localization. The following table shows the 85

optimal FOV and the maximum traverse segment length, while maintaining a 1% relative error, for the ExoMars PanCam and the MER Navcam and Pancam under the condition of 9 evenly distributed landmarks.

Table 3.2 ExoMars and MER vision system BA-based localization comparison

                           ExoMars stereo WAC   ExoMars WAC + HRC   MER stereo Navcam   MER stereo Pancam
  No. of Landmarks         9                    9                   9                   9
  Optimal FOV              ~105°                ~105°               ~88°                ~105°
  Maximum Segment Length   57.9 m               65.2 m              22.2 m              92.3 m

Based on the results listed in this table and the above analysis, the following conclusions for BA-based rover localization at two sites can be reached. Rover localization error varies with the traverse length, the tie point (landmark) number and distribution, and the camera system (stereo base, focal length, etc.). In general, the relative localization error is approximately proportional to the traverse length. Moreover, the localization error decreases as the baseline length or focal length increases.

103 With 6 to 16 well-distributed tie points in the middle of two sites, the rover localization error is within 1% at a traverse distance of approximate 58 m for the ExoMars PanCam stereo WAC, and this number can be increased by 10% when incorporating the HRC. This can be compared to 22 m for the MER Navcam and a traverse segment of 92 m for the MER Pancam. These results are from theoretically optimized analysis, and should be regarded as merely a reference in the practical mission. The corresponding results from the practical mission should be affected by other adverse factors in terms of unevenly distributed feature points, and their distribution in conformation with optimal FOV, imaging qualities, etc. 87

104 4. PANCAM PROTOTYPE DESIGN AND CONTROL In order to systematically test and evaluate the ExoMars PanCam localization and mapping capabilities, a PanCam prototype was assembled consisting of two identical WAC and one High-Resolution Camera as close as possible to the specifications of the ExoMars PanCam system. Introduced in this chapter are: the specifications for individual components of this PanCam prototype; system calibration to remove lens distortion and to obtain stereo rectified images; image matching aided by stereo constraints and rotary platform angle information; and rover localization results from the PanCam prototype. The workflow of the rover localization is shown in Fig. 4.1 below. Figure 4.1 Workflow of rover localization using PanCam prototype 88

105 4.1. Selection of Vision Sensors and Rotary Platform The stereo WAC used were two identical Pantera TF 1M30 cameras with Zeiss Distagon T* 2.8/21 ZE lenses. These two cameras have a focal length of 21 mm, pixel size of 12 µm, frame rate of 30 fps, and resolution of 1024 x 1024 pixels. The Camera Link interfaces of the stereo WAC were connected to the host computer through an image acquisition device, the EPIX PIXCI ECB2 frame grabber. The grabber s 191 megabytes per second data transfer rate guaranteed full capture of the image data from the stereo WAC with no information loss. The HRC was simulated by a Lumenera Lw575C camera having a Pentax C5028-M lens, focal length of 50 mm, pixel size of 2.2 µm, frame rate of 7 fps, and resolution of pixels. In order to simulate the resolution according to the HRC s specification, the technique of binning an array of 2 2 pixels into one large pixel was applied. This binning improved the signal-to-noise ratio and decreased the image resolution to , meanwhile the FOV remained the same at approximate 5. These three cameras were placed at their specified location on a metal bar arm, which was mounted on a pan/tilt system, the FLIR PTU D This FLIR system has a pan range of +/-180 and a tilt range of +31 /-80 with a position resolution of The host computer could send ASCII-based commands to control this system via an RS232 serial port. The weight of the entire prototype was around 10 lbs. It was powered by 12 V DC current with an estimated 50 W power consumption. The entire PanCam prototype and 89

106 associate accessories, as well as the host PC, was placed on a steel platform with four heavy-duty caster wheels (Fig. 4.2), allowing the platform to travel on unpaved testing fields. Figure 4.2 PanCam prototype and mounting platform 4.2. PanCam Prototype System Calibration In order to obtain accurate photogrammetric measurements using the PanCam prototype, the IO parameters had to be obtained and the lens distortion effect had to be removed for each camera in the system. In addition, accurate boresight parameters were needed (i.e. 6- DoF rigid transformation) between any two cameras of the three camera system. The 90

107 reason for boresight calibration is that the stereo WAC images were not perfectly aligned with each other during the installation, and thus accurate displacement parameters were required to cancel this effect and to obtain stereo rectified images. Furthermore, the boresight between HRC and WAC served as a useful constraint in the image matching between them. This part of stereo matching with constraints is explored in the next section. The calibration of the individual camera and boresight parameters between any two cameras were completed in a 3-D calibration facility, containing around 170 welldistributed control points (Fig. 4.3). The calibration and boresight processes were carried out using these control points with known coordinates in the object space and corresponding 2-D coordinates in the image space. The collinearity equation was then linearized, and the interior-orientations and lens-distortion parameters of each camera and the relative offsets and orientations between any two cameras (i.e. left and right WAC, HRC and left WAC) were iteratively solved using the least-squares method. 91

108 Figure D calibration field for camera geometric calibration Feature point extraction and recognition for single camera calibration The nonlinear pinhole camera model was discussed in equation (2.11) where the relationship of the 3-D points, their 2-D projections on the image plane, the camera s EO and IO parameters, and lens distortion parameters based on the collinearity condition are described. This equation can be linearized based on the first-order Taylor series expansion as follows. 92

\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\ a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} \end{bmatrix}
\begin{bmatrix} d\omega \\ d\varphi \\ d\kappa \\ dt_x \\ dt_y \\ dt_z \end{bmatrix}
+ \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix}
\begin{bmatrix} df \\ du_0 \\ dv_0 \end{bmatrix}
+ \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \end{bmatrix}
\begin{bmatrix} dk_1 \\ dk_2 \\ dp_1 \\ dp_2 \end{bmatrix}
= \begin{bmatrix} x_R - u \\ y_R - v \end{bmatrix},
\qquad\text{i.e.}\qquad
A\, d\mathrm{EO} + B\, d\mathrm{IO} + C\, d\mathrm{Dist} = \begin{bmatrix} x_R - u \\ y_R - v \end{bmatrix} \quad (4.1)

The coefficient matrices A, B and C are the partial derivatives of the nonlinear camera projection model with respect to the EO, IO and lens-distortion parameters, respectively. The coordinates (x_R, y_R) are the projections of the 3-D points predicted from the nonlinear camera projection model using the current estimates of the EO, IO and lens-distortion parameters, while (u, v) are the measured (observed) image coordinates of the corresponding points. Usually, a set of images is taken with the same camera in order to form enough observations of the known control points. If a total of M images is taken, the total number of unknowns includes 6M EO parameters, 3 IO parameters, and 4 distortion parameters. Therefore, the total number of control-point observations N over all M images should be at least 6M+7 to solve for all the unknowns. If N > 6M+7, then the least-squares method can be used to iteratively solve for all the unknowns.
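In practice, the same collinearity-based calibration can also be carried out with an off-the-shelf solver. The sketch below uses OpenCV's calibrateCamera, which estimates the intrinsics, the four distortion parameters (k1, k2, p1, p2, with k3 fixed to zero) and the per-image EO in one bundle; it is shown only as a practical counterpart to equation (4.1), not as the implementation used for the PanCam prototype, and the argument names are illustrative.

```python
import cv2

def calibrate_single_camera(object_points, image_points, image_size):
    """object_points: list of (N_i, 3) float32 arrays of 3-D control points;
    image_points:  list of (N_i, 1, 2) float32 arrays of measured centers;
    image_size:    (width, height) in pixels."""
    flags = cv2.CALIB_FIX_K3           # restrict distortion to k1, k2, p1, p2
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, image_size, None, None, flags=flags)
    return rms, K, dist, rvecs, tvecs
```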

110 In order to start the least-squares solution process, it is necessary to obtain good initial estimates of the unknowns. It is always a good assumption to locate the principal point at the center of the image plane. For the focal length and EO parameters of each image, the direct linear transformation (DLT) was used to calculate good initial estimates. Depending on whether all the control points in a given image lie on the same plane, either 3-D DLT (Cheng et al., 1994) or 2-D DLT (Zhang, 2000) should be used to get the correct estimates. The detailed derivation of these two DLT methods can be found in the above listed references. As discussed earlier, it is critical to automatically find the accurate control point correspondences on the image plane for applications such as camera calibration and boresight calibration. Normally the control point is represented with a certain pattern to facilitate the detection of correspondences in the image. There are several commonly used target patterns, as shown in Fig. 4.4, including circles, circular checkerboard, and checkerboard. In our 3-D calibration facility, a circular checkerboard pattern was chosen to represent the control points for the sake of simplicity and repeatability. The process for extracting this pattern in the image is briefly discussed. 94

Figure 4.4 Three types of target pattern, from left to right: circle, circular checkerboard, and checkerboard

The center point of the circular checkerboard pattern is clearly a corner-type feature. Typically there are no more than 30 control points in one image, so the Harris corner detection algorithm is first applied to the image to find the 100 strongest responses; this initial corner detection covers all the potential control points in the image. Next, the process iterates through all the candidate corners and extracts a small window of empirical size around each corner. The window has to be big enough to cover the entire target pattern, but not so large that it introduces unnecessary noise. Each neighborhood is then examined to determine whether it contains the desired circular checkerboard pattern: the window is converted to a binary image using Otsu's adaptive threshold method (Otsu, 1979), and the black and white pixels of the binary image are counted. If the ratio of black to white pixels falls within the empirical range of 0.6 to 1.4, the corner is treated as a circular checkerboard target. The following figure shows the detected target centers as red dots.
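A sketch of this target-detection step using OpenCV, under the assumptions stated in the text (about 100 Harris candidates, an empirically sized window, and a black-to-white ratio between 0.6 and 1.4); the function and parameter names are illustrative, not the dissertation's code.

```python
# Sketch: detect circular-checkerboard target centers with Harris corners and
# verify each candidate window via Otsu thresholding and a black/white ratio test.
# `gray` is assumed to be an 8-bit single-channel image.
import cv2
import numpy as np


def detect_targets(gray, n_candidates=100, half_win=15, ratio_range=(0.6, 1.4)):
    corners = cv2.goodFeaturesToTrack(
        gray, maxCorners=n_candidates, qualityLevel=0.01, minDistance=10,
        useHarrisDetector=True, k=0.04)          # strongest Harris responses first
    if corners is None:
        return []
    targets = []
    for x, y in corners.reshape(-1, 2):
        x, y = int(round(x)), int(round(y))
        win = gray[max(y - half_win, 0):y + half_win + 1,
                   max(x - half_win, 0):x + half_win + 1]
        if win.size == 0:
            continue
        # Otsu's adaptive threshold converts the window to a binary patch.
        _, binary = cv2.threshold(win, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        n_white = int(np.count_nonzero(binary))
        n_black = binary.size - n_white
        if n_white > 0 and ratio_range[0] <= n_black / n_white <= ratio_range[1]:
            targets.append((x, y))
    return targets
```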

Figure 4.5 Detected targets in the 3-D calibration facility

Boresight calibration for stereo rectification

In order to obtain the stereo-rectified pair, the relative orientation and position parameters between the two cameras are needed. Estimating this 6-DoF relative displacement between the left and right cameras is called boresight (or stereo) calibration. Assume the EO parameters of the stereo cameras are represented by (R1, t1) and (R2, t2) for the left and right cameras, respectively. The left and right EO parameters are related through the 6-DoF boresight parameters (Rb, tb) as shown in equation (4.2).

\[
\begin{cases} R_2 = R_b R_1 \\ t_2 = R_b t_1 + t_b \end{cases} \qquad (4.2)
\]

The estimation of the parameters (Rb, tb) is also based on the collinearity condition. Suppose multiple stereo pairs have been taken in the 3-D calibration field, similar to the single-camera calibration. For each stereo pair, the corresponding feature points that appear in both the left and right images are extracted. Based on the collinearity condition, the relationship between the coordinates of the feature points in the left image and their correspondences in 3-D object space is established by introducing the unknown EO parameters (R1, t1) of that image. For the right image, the same collinearity equation could be established by solving for its own unknown EO parameters (R2, t2); however, the right camera's EO can instead be represented through equation (4.2). That is to say, for the feature points in the right image, the collinearity condition is written in terms of the EO parameters of the left image (R1, t1) and the boresight parameters (Rb, tb) rather than the right camera's own EO parameters. Therefore, if there are N stereo image pairs and a total of M points observed in all the images, the number of unknowns is 6N+6 and the number of collinearity equations is 4M; the least-squares method can be used as long as the latter is larger than the former. The structure of the Jacobian matrix of the linearized collinearity equations is illustrated below.
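The relationship in equation (4.2), and its inverse for recovering a boresight from a pair of known EOs, can be written in a few lines; the sketch below assumes rotation matrices R and translation vectors t stored as numpy arrays.

```python
# Sketch of equation (4.2): composing the right camera's EO from the left EO and
# the boresight, and recovering the boresight from two known EOs.
import numpy as np


def right_eo_from_boresight(R1, t1, Rb, tb):
    """R2 = Rb R1,  t2 = Rb t1 + tb."""
    return Rb @ R1, Rb @ t1 + tb


def boresight_from_eos(R1, t1, R2, t2):
    """Invert equation (4.2): Rb = R2 R1^T,  tb = t2 - Rb t1."""
    Rb = R2 @ R1.T
    return Rb, t2 - Rb @ t1
```

In the calibration itself, such closed-form values would only serve as initial guesses; the refined boresight comes from the joint least-squares adjustment described in the text.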

Figure 4.6 Jacobian matrix for solving boresight parameters

The black stripes are the partial derivatives with respect to the corresponding unknown parameters indicated in the top row. The initial EOs of both the left and right images were obtained with the aforementioned 2-D or 3-D DLT approach, from which initial boresight parameters could also be estimated; the final boresight parameters were then obtained by iterative refinement. The completion criterion checks whether the changes in the unknowns between two successive iterations fall below a predefined threshold.
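A skeleton of this iterative refinement and its stopping rule, assuming a hypothetical `residuals_and_jacobian` callback that evaluates the stacked collinearity residuals and the (sparse) Jacobian of Fig. 4.6 at the current parameter vector.

```python
# Skeleton of the Gauss-Newton refinement of the boresight (and left-image EO)
# parameters, with the convergence test on the change of the unknowns.
import numpy as np


def refine(params0, residuals_and_jacobian, tol=1e-6, max_iter=50):
    """params0: initial unknowns (6N EO values + 6 boresight values).
    residuals_and_jacobian(params) -> (r, J) is assumed to evaluate the
    linearized collinearity equations at the current estimate."""
    params = np.asarray(params0, dtype=float)
    for _ in range(max_iter):
        r, J = residuals_and_jacobian(params)
        # Normal equations of the linearized system: (J^T J) delta = -J^T r
        delta = np.linalg.solve(J.T @ J, -J.T @ r)
        params = params + delta
        # Completion criterion: changes in the unknowns fall below a threshold.
        if np.linalg.norm(delta) < tol:
            break
    return params
```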

Once the boresight parameters are refined, the stereo pair is aligned by rotating the left and right camera-centered coordinate systems, respectively, as discussed in section . The resulting rectified stereo images share the same camera-centered coordinate system up to a 1-D translation along the baseline B, which is the norm of the boresight translation parameter tb. The rectified stereo pair facilitates feature matching within the stereo image, and the coordinates of the matches can easily be calculated with equation (2.28).

Stereo Matching with Constraints

The general approach for feature extraction and matching was outlined in section 2.3 by reviewing the commonly used feature detectors and associated feature descriptors. When matching a stereo image pair, especially a rectified one, the correspondences generally satisfy the following additional constraints (Tang et al., 2002).

Epipolar: the correspondence of a feature in the left image should lie on a line in the right image, and vice versa. For a rectified stereo pair, this line degenerates to the horizontal line on the same row (scan line) as the feature in the left image.

Similarity: the local appearance of the correspondences in the left and right images is similar, which means simple and efficient descriptors can be used for fast matching.

Smoothness: the disparity changes smoothly within a small region except for a few abrupt values, which helps remove outliers by means of a Delaunay triangulation network built on reliable matches.

Ordering: the relative position of any two features in the left image should be the same for their correspondences in the right image.

Besides the above constraints, the correspondences should also pass a mutual consistency check: a candidate match in the right image for a feature in the left image is accepted only if searching from that candidate back to the left image returns the original feature. Moreover, mismatches can be detected and removed using a Delaunay triangulation mesh of the disparity model. On the premise that the disparity map of a stereo pair can be approximated by a piecewise smooth surface network, a mesh such as a Delaunay triangulation reconstructed from the feature correspondences is an appropriate approximation of the disparity model for the stereo pair. The Delaunay triangulation partitions the stereo images into multiple triangular patches, and the disparity within each triangle should form a smooth surface. If the disparity change between any two nodes of a triangular patch is larger than a predefined threshold (such as 5 pixels), the edge connecting these two nodes is marked as an abnormal edge, and the matches connected by abnormal edges are removed from the set of correct matches. A sketch of this filter is given below.
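The sketch assumes matched left-image keypoints `pts_left` of shape (N, 2) and their disparities `disp` of shape (N,); the rejection policy (dropping any match incident to an abnormal edge) is one simple reading of the rule above, not necessarily the dissertation's exact rule.

```python
# Sketch: flag mismatches using a Delaunay triangulation of the left-image
# feature points and the smoothness of the disparity along triangle edges.
import numpy as np
from scipy.spatial import Delaunay


def delaunay_inlier_mask(pts_left, disp, threshold=5.0):
    """Return a boolean mask of matches to keep (True = keep)."""
    tri = Delaunay(pts_left)
    bad = np.zeros(len(pts_left), dtype=bool)
    # Each simplex contributes three edges; check the disparity jump on each.
    for a, b, c in tri.simplices:
        for i, j in ((a, b), (b, c), (c, a)):
            if abs(disp[i] - disp[j]) > threshold:
                # Simple policy: reject both endpoints of an abnormal edge.
                bad[i] = bad[j] = True
    return ~bad
```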

In the following section, a reliable stereo matching method aided by the angle output of the rotary platform is introduced first, and then the matching technique between the HRC and WAC, which differ greatly in scale, is discussed.

Stereo matching aided by angle information from rotary platform

The rotary platform controls the movement in the pan and tilt directions and reports angles with a resolution of 0.013°, so initial EO parameters can be approximated from these output angles. Although this EO estimate is not accurate, it is sufficient to guide the search for reliable matches between adjacent stereo pairs related only by pan and tilt movements. This is useful when the PanCam takes a panorama at a stationary site, where a series of stereo pairs is acquired to form the panorama. The relative pan and tilt angles between adjacent stereo pairs are known, and the change of viewing angle between them is usually around 20°. If the matching were performed without any prior information, it would often be time consuming and unreliable, leading to incorrect matches.

Figure 4.7 Stereo matching aided by pan/tilt rotary platform

As shown in Fig. 4.7, the matches within the first stereo pair can first be triangulated to calculate the 3-D location of a point P. With prior knowledge of the pan/tilt angles (ϕ, ω) between the adjacent stereo pairs, P's 3-D coordinates can be converted to the camera-centered coordinates of the second stereo pair by applying a rigid transformation based on these relative pan/tilt angles. The 3-D point can then be back-projected onto the image frames of the second stereo pair. If the back projections fall outside the image boundary, these points can be disregarded, because no match can be found for them in the second stereo pair. On the other hand, if the back projections fall within the image boundary, the correct matches should lie in the neighborhood of the back projections. The search for correspondences can therefore be constrained to this neighborhood, which is more reliable than the brute-force approach of comparing against all features.

Next, how to obtain the relative orientation and position between two stereo frames from the pan/tilt angles output by the rotary platform is briefly introduced. As illustrated in Fig. 3.7, the pan and tilt angles between the two stereo pairs are represented by ϕ and ω, respectively. Rotation in the pan direction is equivalent to a rotation about the Y axis of the camera-centered coordinate system, and rotation in the tilt direction is equivalent to a rotation about the X axis. Therefore, the relative orientation between the two stereo pairs consists of a rotation ϕ about the Y axis followed by a rotation ω about the X axis; the combined rotation matrix is given in equation (4.3). In addition, the displacement between the rotation axis of the rotary platform and the camera center can be estimated from the baseline and the height difference between the rotation axis and the camera center. The resulting 6-DoF relative orientation and position between the two stereo pairs is expressed by:

\[
R = R_\omega R_\varphi, \qquad
\begin{bmatrix} \Delta X \\ \Delta Y \\ \Delta Z \end{bmatrix}
= R^{T} \begin{bmatrix} B/2 \\ \Delta H \cos\omega \\ \Delta H \sin\omega \end{bmatrix}
\qquad (4.3)
\]

The triangulated 3-D point P in the camera-centered coordinates of the first stereo frame can be transformed to the second stereo frame using equation (2.36), and its back projection onto the left and right images of the second frame can be expressed as in equation (2.37). Finally, the search for correspondences in the second stereo frame is performed in the neighborhood of these back projections.
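A sketch of this prediction step: build the relative rotation from the pan and tilt angles, transform the triangulated points into the second stereo frame, and back-project them to define a search neighborhood. The lever-arm convention (baseline B and height offset ΔH) follows the reconstruction of equation (4.3) above, and the simple pinhole back-projection stands in for equations (2.36)–(2.37), which are not reproduced here; signs and conventions would need to match the actual platform geometry.

```python
# Sketch: predict where matches from the first stereo frame should reappear in
# the second frame, given the pan (phi) and tilt (omega) angles of the platform.
import numpy as np


def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])


def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])


def relative_pose(phi, omega, baseline, delta_h):
    """Rotation about Y (pan) then X (tilt), plus the lever-arm translation of eq. (4.3)."""
    R = rot_x(omega) @ rot_y(phi)
    t = R.T @ np.array([baseline / 2.0,
                        delta_h * np.cos(omega),
                        delta_h * np.sin(omega)])
    return R, t


def predict_projections(pts_3d, R, t, f, u0, v0, width, height):
    """Back-project frame-1 points into the frame-2 left image (pinhole model).
    The transform X2 = R X1 + t is one convention; the actual sign depends on eq. (2.36)."""
    pc = pts_3d @ R.T + t
    uv = np.stack([f * pc[:, 0] / pc[:, 2] + u0,
                   f * pc[:, 1] / pc[:, 2] + v0], axis=1)
    inside = ((pc[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < width)
              & (uv[:, 1] >= 0) & (uv[:, 1] < height))
    return uv, inside          # search for matches only near uv where inside is True
```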

High resolution HRC and low resolution WAC image matching

The HRC and the left WAC form a special stereo pair with a large scale difference between the two cameras. This scale difference can be approximated by the FOV ratio between them: the HRC image must effectively be zoomed out by a factor of about 7 to match the low-resolution WAC image. The key to matching high-resolution and low-resolution images is to correctly detect and represent features in scale space (Mikolajczyk, 2004). One premise is that scale changes are equivalent to smoothing the image with Gaussian kernels of various standard deviations σ (Lindeberg, 1997). In general, matching the high-resolution HRC against the low-resolution WAC involves:

Filtering the WAC image with a Gaussian kernel of σd = 1, and the HRC image with a Gaussian kernel of 7σd;

Computing the modified Harris corner metric in scale space, where another Gaussian kernel with σi = 2 is used for weighting, and detecting the scale-space Harris corners by nonmaxima suppression; and

Computing a descriptor based on local gradients for each feature and matching these features. Since the fundamental matrix for the WAC and HRC stereo pair could also be estimated, it can be used to reduce the search space to a 1-D line.

The modified autocorrelation matrix for a pixel in scale space is as follows.

\[
A(\sigma_i, \sigma_d) = \sigma_d^{2}\, G(\sigma_i) *
\begin{bmatrix} I_x^{2}(\sigma_d) & I_x I_y(\sigma_d) \\ I_x I_y(\sigma_d) & I_y^{2}(\sigma_d) \end{bmatrix},
\qquad \text{where }
\begin{cases} I_x = G_x(\sigma_d) * I \\ I_y = G_y(\sigma_d) * I \end{cases}
\qquad (4.4)
\]

Compared with the correlation matrix of the general Harris corner detector defined in equation (2.29), the local derivatives Ix and Iy are computed by convolving the image with the first-order Gaussian derivative kernel (σd) rather than with a simple first-order differential operator. By the commutativity and associativity of convolution, this is equivalent to first convolving the image gradient with the Gaussian kernel. The local derivative products are then convolved with another Gaussian kernel (σi) for smoothing and weighting. Once this scale-space autocorrelation matrix is obtained, the Harris corner metric in scale space is defined by:

\[
R_s = \det\!\left(A(\sigma_i, \sigma_d)\right) - \kappa\, \operatorname{trace}^{2}\!\left(A(\sigma_i, \sigma_d)\right)
\qquad (4.5)
\]

The local maxima of this corner metric determine the locations of the feature points. In Fig. 4.8, for example, feature points were extracted for the WAC at a fixed scale σd = 1 and for the HRC at scales varying from σd and 3σd up to 7σd and 9σd. The best matching is achieved when the HRC has a scale factor of 7 with respect to the WAC. A sketch of this scale-adapted corner metric follows.
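A sketch of the scale-adapted Harris metric of equations (4.4)–(4.5) using Gaussian derivative filtering; the exact kernel sizes and normalization are assumptions consistent with the text rather than the dissertation's code.

```python
# Sketch: scale-adapted Harris corner metric (equations 4.4 and 4.5).
import numpy as np
from scipy.ndimage import gaussian_filter


def scale_adapted_harris(img, sigma_d, sigma_i=2.0, kappa=0.04):
    """R_s = det(A) - kappa * trace(A)^2, with derivatives taken at the
    differentiation scale sigma_d and A smoothed at the integration scale sigma_i."""
    img = img.astype(float)
    # First-order Gaussian derivatives: I_x = G_x(sigma_d) * I, I_y = G_y(sigma_d) * I.
    ix = gaussian_filter(img, sigma_d, order=(0, 1))
    iy = gaussian_filter(img, sigma_d, order=(1, 0))
    # Elements of A(sigma_i, sigma_d), weighted by the integration kernel G(sigma_i).
    s = sigma_d ** 2
    ixx = s * gaussian_filter(ix * ix, sigma_i)
    iyy = s * gaussian_filter(iy * iy, sigma_i)
    ixy = s * gaussian_filter(ix * iy, sigma_i)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    # Nonmaxima suppression over this map (not shown) selects the feature points.
    return det - kappa * trace ** 2
```

Following the FOV ratio, the WAC image would be evaluated with sigma_d = 1 and the HRC image with roughly 7 times that value.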

Figure 4.8 Feature points detected at a fixed scale for the low-resolution WAC (left) compared with feature points detected at varying scales for the high-resolution HRC (right)

4.4. Long-Range Rover Localization through Visual Odometry and Bundle Adjustment

The field experiment was carried out on an unpaved trail near Denver, Colorado (Fig. 4.9), mostly covered by gravel and surrounded by grasses, trees, and other vegetation. PanCam triplet images were captured roughly every 1 to 2 meters as the platform moved along the trail, for a total traverse length of approximately 900 meters. The rover stopped roughly every 30 to 60 m to take a full panorama covering the routes it had traversed and was about to travel. Because of the relatively large elevation variations and sharp turns on the trail, adjacent sites were sometimes placed closer together at these positions to ensure overlap between adjacent panoramas. Meanwhile, ground truth was obtained from a GPS receiver with a positioning accuracy of around 2 m. The initial EO parameters of all the stereo WAC images were estimated by stereo VO processing, which also gave the rover's positions over the entire traverse based on the stereo WAC data. The HRC was then added to the stereo WAC to improve the accuracy of feature-point mapping, which in turn refined the traverses obtained from stereo VO processing. Finally, an incremental bundle adjustment was performed between adjacent sites to further improve the estimated rover positions over the entire traverse.

Figure 4.9 PanCam prototype field experiment near Denver, Colorado (top) and the highlighted traverses (bottom) (courtesy of Google Maps)

Initial rover localization through stereo WAC data

Standard stereo VO was applied to obtain the initial rover positions over the entire traverse. The PanCam platform was aligned to face west at the starting point, so the Z axis of the camera-centered coordinate system pointed roughly to the west. As the platform moved, feature matching was performed between adjacent stereo WAC image pairs, and the relative 6-DoF orientation and position changes between the two stereo frames were estimated by stereo VO processing. Given all the relative changes between adjacent stereo frames, the relative orientation and position of any stereo frame with respect to the first stereo frame at the starting point of the traverse could be obtained. Fig. 4.10 illustrates the trajectory of the PanCam platform estimated by the stereo WAC VO processing method; the ground-truth trajectory from the GPS data is also shown for comparison.

Figure 4.10 Trajectories constructed from stereo WAC compared with the GPS ground truth reference
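The trajectory itself is obtained by chaining the frame-to-frame motions; a minimal sketch follows, assuming each VO step yields a rotation R and translation t that map frame i-1 coordinates into frame i coordinates (the exact convention depends on the VO formulation).

```python
# Sketch: accumulate per-step VO motions into camera positions expressed in
# the coordinate frame of the first stereo pair.
import numpy as np


def chain_poses(rel_motions):
    """rel_motions: list of (R, t) with X_i = R X_{i-1} + t (frame i-1 -> frame i).
    Returns the camera centers of all frames in the coordinates of frame 0."""
    R_acc, t_acc = np.eye(3), np.zeros(3)        # pose of frame 0 in frame 0
    centers = [np.zeros(3)]
    for R, t in rel_motions:
        # Compose: X_i = R (R_acc X_0 + t_acc) + t = (R R_acc) X_0 + (R t_acc + t)
        R_acc, t_acc = R @ R_acc, R @ t_acc + t
        # The camera center C_i of frame i satisfies R_acc C_i + t_acc = 0 in frame 0.
        centers.append(-R_acc.T @ t_acc)
    return np.array(centers)
```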

The average error, disclosure error, and relative disclosure are calculated to evaluate the results. The average error is defined as the accumulated positional error over all rover stops divided by the total number of stops; the disclosure error is the positional error at the end of the traverse compared with the ground truth; and the relative disclosure is the ratio between the disclosure error and the total length the rover has travelled, calculated from the ground truth. Compared with the GPS trajectory, the average error, disclosure error, and relative disclosure of the trajectory recovered from the stereo WAC VO processing are m, m, and 9.22%, respectively. The comparison shows that the stereo VO method yields good results at the beginning of the traverse, where the route is nearly a straight line with smooth curvature; the orientation and position discrepancies with respect to the ground truth remain at a tolerable level along this segment. However, as the rover reached the midway point (400 m west, 120 m north) and made its first sharp turn there, two obvious error patterns appear in the recovered stereo VO trajectory: the orientation estimated by the stereo VO fails to follow the ground truth, and the stereo VO trajectory appears shortened compared to the GPS trajectory.

The orientation failure at the midway point occurs because feature matching between adjacent stereo frames breaks down under large viewpoint changes. Whenever the feature matching fails, the algorithm assigns the previous stereo frame's EO parameters to the current one, and the stereo VO processing moves on to the next stereo frame. As shown in Fig. 4.11, feature matching still works at a small turn, i.e., from frame k to k+1, but no features can be matched at the sharp turn from frame k+1 to k+2. The VO algorithm then assumes there is no movement between frames k+1 and k+2, which is why the error increases dramatically at the frames where feature matching fails. Besides sharp turns, this also happens in featureless areas where feature matching is unavailable.

Figure 4.11 Feature matching results at turnings: feature matching available at small turnings (left); feature matching fails at large turnings (right)

For the second problem, that the stereo VO trajectory is shorter than the GPS trajectory, the accumulated traverse distances were calculated for both trajectories. The relationship between the accumulated traverse distance of the stereo VO trajectory with respect to that of
