Feature Tracking Evaluation for Pose Estimation in Underwater Environments

Florian Shkurti, Ioannis Rekleitis, and Gregory Dudek
School of Computer Science, McGill University, Montreal, QC, Canada
{florian, yiannis, dudek}@cim.mcgill.ca

Abstract
In this paper we present the computer vision component of a 6DOF pose estimation algorithm to be used by an underwater robot. Our goal is to evaluate which feature trackers enable us to accurately estimate the 3D positions of features, as quickly as possible. To this end, we perform an evaluation of available detectors, descriptors, and matching schemes over different underwater datasets. We are interested in identifying combinations in this search space that are suitable for use in structure from motion algorithms, and more generally, in vision-aided localization algorithms that use a monocular camera. Our evaluation includes frame-by-frame statistics of desired attributes, as well as measures of robustness expressed as the length of tracked features. We compare the fit of each combination based on the following attributes: number of extracted keypoints per frame, length of feature tracks, average tracking time per frame, and number of false positive matches between frames. Several datasets were used, collected in different underwater locations and under different lighting and visibility conditions.

Keywords: underwater feature tracking; sensor fusion; mapping; state estimation; performance evaluation techniques.

I. INTRODUCTION

The task of localization is one of the fundamental problems in mobile robotics. It is particularly important in any scenario where robots are required to exhibit autonomous behavior, since many navigation and mapping algorithms require, or benefit from, the existence of some prior belief about the robot's pose. In the underwater domain the problem of localization is particularly challenging, as it is a GPS-denied, highly unstructured natural environment without man-made landmarks. This paper presents an overview of a six degree of freedom (6DOF) state estimation algorithm for an underwater vehicle, and in particular an analysis of the computer vision components that it relies on. The described algorithm combines measurements from an Inertial Measurement Unit (IMU) sensor and a camera in order to accurately track the pose of the vehicle. An essential part of this algorithm is feature tracking using a monocular camera, and estimation of 3D feature positions with respect to the camera frame. Our goal in this paper is to evaluate which feature detectors, descriptors, and matching strategies are most suitable for integration with the rest of the state estimation algorithm. While our focus is directed towards this particular algorithm, our evaluation is sufficiently informative for other closely related problems, such as visual odometry and structure from motion.

Figure 1. The Aqua vehicle starting a dataset collection run.

The target vehicle is a robot belonging to the Aqua family [1] of amphibious robots; see Fig. 1. It is a six-legged vehicle capable of reaching depths of 35 m in remotely-controlled or autonomous operation. It is equipped with three IEEE-1394 IIDC cameras and an IMU. One of the primary uses of this vehicle is to conduct visual mapping over coral reefs. The state estimation algorithm, first proposed by Mourikis and Roumeliotis [2], uses visual input from a downward-facing camera to correct the errors resulting from the double integration of the IMU signal.

II. RELATED WORK

Corke et al. [3] presented experimental results of an underwater robot performing stereo visual odometry using the Harris feature detector and the normalized cross-correlation similarity measure (ZNCC). They reported that, after outlier rejection, tens of features were tracked from frame to frame, and those were sufficient for the stereo reconstruction of the robot's 3D trajectory over a square survey area. Barfoot [4] described a 6DOF pose estimation technique, adapted from FastSLAM 2.0 [5], whose input was the odometry and stereo images of a terrestrial mobile robot. SIFT features were used as observations in the filter and were matched via best-bin-first search in a kd-tree. The author demonstrated through field experiments that the proposed approach kept localization errors to a few percent of the distance traveled.

Newcombe and Davison [6], as well as Klein and Murray [7], have presented online, accurate 3D reconstruction of small environments using a monocular camera. Their tracking systems use the FAST detector and perform search along epipolar lines in order to find matching pixel neighborhoods. The landscape of performance evaluations of feature detectors and descriptors is certainly rich, as there has been extensive related work in trying to experimentally assess the degree of invariance of said detectors to affine transformations. For example, Mikolajczyk and Schmid [8] compared the following detectors: Harris corners, Harris-Laplace, Hessian-Laplace, Harris-Affine and Hessian-Affine, based on their repeatability, in other words, their ability to consistently detect the same keypoints in two different images of the same scene. As for feature descriptors, their evaluation included SIFT [9], Gradient Location and Orientation Histogram (GLOH), cross-correlation of neighborhood pixel values, PCA-SIFT, moment invariants, shape context, steerable filters, spin images, and differential invariants. Some of these descriptors are low dimensional while others are high. The evaluation criterion they used in their experimental setup was precision and recall of feature matches, where a true match was defined as having small vector distance between the corresponding descriptors. Their dataset included scenes that had different lighting and geometric transformations; however, the images were not successive video frames of the same scene, and they did not capture underwater environments. The main conclusion of the above work was that the high dimensional descriptors, like GLOH and SIFT, were robust and most distinctive, regardless of the detector used.

Closely related to our work is the quantitative comparison of feature extractors for visual SLAM done by Klippenstein and Zhang [10]. Their experimental methodology included comparison of precision and recall curves of the Harris corner detector, the Kanade-Lucas-Tomasi (KLT) tracker [11], and the SIFT detector. Their conclusion was that all detectors could be tuned to perform well for visual SLAM, but in particular, the KLT tracker was better suited than SIFT for image sequences with small baseline.

III. 6DOF STATE ESTIMATION OF AN AUTONOMOUS UNDERWATER VEHICLE

The localization algorithm [2], [12] used is based on the extended Kalman filter (EKF) formulation. It integrates IMU measurements [13], thus providing direct (but noisy) estimates of the vehicle dynamics. Furthermore, by performing feature correspondence it estimates motion parameters and obtains constraints from the camera data, thereby correcting the drifting IMU integration estimates; see Fig. 2 for a schematic of the algorithm. The innovation of this algorithm is that the state contains the latest pose of the IMU and a list of m camera poses, instead of the vehicle pose and the detected feature positions. As such, this approach does not suffer from the dimensionality explosion common to traditional SLAM algorithms. More formally, the state vector that we want to estimate is the following:

x = [ q_G^I  b_g  v_I^G  b_a  p_I^G  q_G^{C_1}  p_{C_1}^G  ...  q_G^{C_m}  p_{C_m}^G ]    (1)

where q_G^I is the quaternion that expresses the rotation from a global frame of reference to the IMU frame, which is attached on the robot, v_I^G is the velocity, and p_I^G is the position of the IMU frame in coordinates of the global frame. b_g and b_a are the bias vectors of the gyroscope and the accelerometer, respectively. The pairs {q_G^{C_i}, p_{C_i}^G} represent the transformations between the global frame and the i-th camera frame. (Recall that quaternions provide a well-behaved parameterization of rotation; we are following the JPL formulation of quaternions, see [13], [14].)
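For concreteness, the rotation matrix R(q) corresponding to a JPL-convention quaternion, which appears in Eqs. (2), (3) and (9) below, can be computed as in the following sketch. This is a minimal Python/NumPy rendering of the standard formula from [13], not part of the system's own code.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x, i.e. skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def jpl_quat_to_rot(q):
    """Rotation matrix for a unit quaternion q = [qx, qy, qz, qw] in the JPL
    convention (vector part first, scalar part last):
        R(q) = (2*qw^2 - 1) I - 2*qw*[qv]x + 2*qv*qv^T
    which maps coordinates of a vector from the global frame to the local one."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)  # re-normalize to guard against numerical drift
    qv, qw = q[:3], q[3]
    return ((2.0 * qw ** 2 - 1.0) * np.eye(3)
            - 2.0 * qw * skew(qv)
            + 2.0 * np.outer(qv, qv))
```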
The number m of such transformations embedded in the state vector is bounded above for practical purposes, and because features are expected to be tracked in a relatively small number of images. The main idea behind using information from these two sources is that, provided we are able to track features in the world and correctly estimate their 3D position with respect to the camera, we can sufficiently correct the drifting IMU pose estimates, which are obtained by integration. More specifically, the above vector will be modified in three phases: (i) the propagation phase, where IMU measurements are integrated so as to predict the IMU pose; (ii) the augmentation phase, which occurs when an image is recorded, and in which the transformation from the global frame to the camera frame is appended to the right of the state vector; and (iii) the update phase, which occurs whenever a feature track terminates. We can then estimate the feature's 3D position and correct the state vector, specifically the IMU integration drift, based on the constraints set by the motion of the feature.

In this paper we focus our attention on (iii), and in particular on the vision component of the system. Our goal is to evaluate which detectors, descriptors, and matching strategies enable us to estimate the 3D positions of features as quickly and accurately as possible, using a monocular camera. We aim to use this estimator in underwater environments, which are generally more challenging than most of their indoor counterparts for the following reasons: (a) They are prone to rapid changes in lighting conditions. The canonical example that illustrates this is the presence of caustic patterns, which are due to refraction, and to the non-uniform reflection and penetration of light when it transitions from air to the rippling surface of the water. (b) They often have limited visibility. This is due to many factors, some of which include light scattering from suspended plankton and other matter, which causes blurring and "snow" effects. Another factor is the incident angle at which light rays hit the surface of the water; smaller angles lead to less visibility. (c) They impose loss of contrast and colour information with depth. As light travels to increasing depths, different parts of its spectrum are absorbed. Red is the first color that is seen as black, and eventually orange, yellow, green and blue follow. This absorption sequence refers to clear water, and is not necessarily true for other types.

Figure 2. An overview of the 6DOF state estimation algorithm (IMU measurements drive the propagation phase; images drive feature tracking; terminated feature trails trigger 3D feature position estimation and the update phase; each new image triggers augmentation of the state vector). The state vector appears in Eq. (1). The green-colored components are examined in this paper.

A. Feature Detectors And Descriptors

Our evaluation compares the following feature detectors: SURF, Shi-Tomasi, FAST, and CenSurE, as well as the SURF descriptor. The SURF [15] detector is designed to be scale and (partially) rotation invariant. It detects keypoints at local extrema of the determinant of the image's approximate Hessian. This filter response is evaluated at different scales so as to achieve scale invariance. The scale space is divided into overlapping octaves, each of which consists of filter responses at increasing scales. Rotation invariance is due to the assignment of a dominant orientation of Haar wavelet responses around the detected keypoint. Each keypoint is characterized by a SURF descriptor, which is a vector that consists of sums of Haar wavelet responses around its oriented neighborhood. Neither the detector nor the descriptor use any color information.

The Shi-Tomasi detector [16] is invariant under affine transformations, but not under scaling. Its keypoints are image locations where the 2nd moment matrix has two large eigenvalues, indicating image intensity change in two directions, i.e. a corner. The FAST (Features from Accelerated Segment Test) [17] detector, on the other hand, focuses on speed of keypoint detection rather than robustness to noise and invariance properties. It uses a decision tree to classify a pixel as a keypoint if a circle of pixels around it has arcs that are much brighter or darker than the center. The authors mention that while this classifier is not very robust to noise, scaling, and illumination variations, it has high repeatability and rotation invariance. Finally, the CenSurE (Center Surround Extrema) [18] detector aims to identify keypoints at any scale, without resorting to image subsampling like SIFT, or filter upsampling like SURF. It does this by searching for extrema of the image's approximate Laplacian, using polygonal (e.g. octagonal) center-surround Haar wavelet filters. By comparison, SURF uses box-shaped filters for its descriptor, resulting in less symmetry and partial rotation invariance. Among the detected extrema, only the ones with high Harris response are kept, so as to eliminate keypoints detected along edges, which might be unstable for tracking.

B. Feature Matching

The feature matching component of our evaluation is based on Muja and Lowe's work [19]. More specifically, we match feature pairs based on Approximate Nearest Neighbor search among the respective SURF descriptors. We accept a match if the distance ratio of the nearest neighbor over the second nearest neighbor is below a certain threshold, and if the feature correspondence is bi-directional. For the Approximate Nearest Neighbor search we experimented with both hierarchical K-means and randomized kd-trees.
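The following is a minimal sketch of this matching scheme using OpenCV's Python bindings. The distance-ratio threshold and FLANN parameters are illustrative choices rather than the values used in our experiments, and SURF requires an OpenCV build with the non-free xfeatures2d contrib module enabled.

```python
import cv2

RATIO = 0.7  # illustrative nearest/second-nearest distance-ratio threshold

def match_surf_ann(img1, img2):
    """Match SURF descriptors between two frames using approximate nearest
    neighbour search (randomized kd-trees via FLANN), a distance-ratio test,
    and a bi-directional consistency check."""
    # SURF lives in the contrib xfeatures2d module and needs a build with
    # the non-free algorithms enabled.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(img1, None)
    kp2, des2 = surf.detectAndCompute(img2, None)

    # algorithm=1 selects randomized kd-trees; algorithm=2 would select the
    # hierarchical k-means index, the other structure mentioned above.
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=4), dict(checks=32))

    def ratio_matches(da, db):
        good = {}
        for pair in flann.knnMatch(da, db, k=2):
            if len(pair) == 2 and pair[0].distance < RATIO * pair[1].distance:
                good[pair[0].queryIdx] = pair[0].trainIdx
        return good

    fwd = ratio_matches(des1, des2)
    bwd = ratio_matches(des2, des1)
    # Keep only mutual correspondences (the bi-directional check).
    return [(kp1[i].pt, kp2[j].pt) for i, j in fwd.items() if bwd.get(j) == i]
```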
For combinations that use FAST, CenSurE, or Shi-Tomasi keypoints, we use normalized cross-correlation as a measure of similarity between image patches around said keypoints.

C. Feature Position In Camera Coordinates

Consider a single feature, f, that has been tracked in n consecutive camera frames, C_1, C_2, ..., C_n. In the state vector mentioned previously, each camera frame C_i is represented by {q_G^{C_i}, p_{C_i}^G}, where G is a global reference frame, q_G^{C_i} is the unit quaternion that expresses the rotation from the global frame G to the camera frame C_i, and p_{C_i}^G expresses the position of the origin of camera frame C_i in global frame coordinates. Now, let us denote p_f^{C_i} = [X^{C_i}  Y^{C_i}  Z^{C_i}]^T to be the 3D position of feature f expressed in camera frame C_i coordinates. This is what we are interested in estimating as accurately as possible. Also, let R(q) denote the rotation matrix that represents the same rotation expressed by the quaternion q. Then, we can write the following:

R(q_{C_1}^{C_i}) = R(q_G^{C_i}) R(q_{C_1}^G)    (2)

p_{C_1}^{C_i} = R(q_G^{C_i}) (p_{C_1}^G - p_{C_i}^G)    (3)

about the transformation between the first frame in which the feature was seen and the rest. So, we have:

p_f^{C_i} = R(q_{C_1}^{C_i}) p_f^{C_1} + p_{C_1}^{C_i}    (4)

         = Z^{C_1} ( R(q_{C_1}^{C_i}) [X^{C_1}/Z^{C_1}  Y^{C_1}/Z^{C_1}  1]^T + (1/Z^{C_1}) p_{C_1}^{C_i} )    (5)

If we let α = X^{C_1}/Z^{C_1}, β = Y^{C_1}/Z^{C_1}, and γ = 1/Z^{C_1}, then

p_f^{C_i} = Z^{C_1} ( R(q_{C_1}^{C_i}) [α  β  1]^T + γ p_{C_1}^{C_i} ) = Z^{C_1} [ h_{i1}(α, β, γ)  h_{i2}(α, β, γ)  h_{i3}(α, β, γ) ]^T    (6)

Now, assuming a simple pinhole camera model, and disregarding the effects of the camera's calibration matrix, we can model the projection of feature f on the projection plane Z^{C_i} = 1 as follows:

z^{C_i} = (1 / h_{i3}(α, β, γ)) [ h_{i1}(α, β, γ)  h_{i2}(α, β, γ) ]^T + n^{C_i}    (7)

where z^{C_i} is the measurement, and n^{C_i} is noise associated with the process, due to miscalibration of the camera, motion blur, and other factors. If we stack the measurements from all cameras into a single 2n x 1 vector z, and similarly the projection functions h_i into a 2n x 1 vector h, then we will have expressed the problem of estimating the 3D position of a feature as a nonlinear least-squares problem with 3 unknowns:

argmin_{α, β, γ} || z - h(α, β, γ) ||    (8)

Provided we have at least 3 measurements of feature f, i.e. provided we track it in at least 3 frames, we use the Levenberg-Marquardt nonlinear optimization algorithm in order to get an estimate p̂_f^{C_1} of the true solution. One problem with this approach is that nonlinear optimization algorithms do not guarantee a global minimum, only a local one, provided they converge. Another problem is that if feature f is recorded around the same pixel location in all the frames in which it appears, then the measurements we get are going to be linearly dependent, thus providing no information about the depth of f. In other words, feature tracks that have small baseline have to be considered outliers, unless we exploit other local information around the feature to infer its depth. It is worth considering whether we would get better 3D feature position estimates if we treated γ as the only unknown in the system, and fixed [α  β]^T = z^{C_1}. Our analysis of empirical data suggests that this did not contribute to a significant improvement, and the reason is that Eq. (8) is the optimal solution given the noise characteristics in Eq. (7). Finally, another potential pitfall is that the transformations in Eqs. (2)-(3) might themselves be noisy, which will also affect the solution of the least-squares problem.

D. Outlier Detection

In order to identify false positive matches between two consecutive frames we use epipolar constraints imposed by the fundamental matrix between said frames. We rely on two different methods of estimating the fundamental matrix. One is the RANSAC method [20], which works well as long as there is a significant number of inliers among the matches. The other one is via direct computation [21], which is made possible due to the presence of the IMU:

F = K^{-T} [ p_{C_{i+1}}^{C_i} ]_x R(q_{C_{i+1}}^{C_i}) K^{-1}    (9)

where R(q_{C_{i+1}}^{C_i}) = R(q_G^{C_i}) R(q_{C_{i+1}}^G), p_{C_{i+1}}^{C_i} = R(q_G^{C_i}) (p_{C_{i+1}}^G - p_{C_i}^G), [ ]_x is the cross-product matrix, and K is the calibration matrix of the camera. The latter estimation method is useful as a complement to the former when the presence of outlier matches is significant. We classify a pair of matched keypoints, x and x', as an outlier match if, for either of the two estimates of the fundamental matrix,

|x'^T F x| > τ    (10)

where τ is a fixed threshold, measured in pixels. We classify a feature track as an outlier if any of its constituent matching pairs are outliers.
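To make the two preceding subsections concrete, the sketch below solves Eq. (8) with Levenberg-Marquardt and applies the epipolar test of Eqs. (9)-(10). It is a minimal NumPy/SciPy illustration rather than the actual implementation; the initial inverse-depth guess and the threshold tau are assumed values.

```python
import numpy as np
from scipy.optimize import least_squares

def skew(v):
    """Cross-product matrix [v]x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def triangulate(z, R_1i, t_1i):
    """Solve Eq. (8): estimate p_f^{C1} from normalized image measurements
    z[i] (2-vectors) in frames C_1..C_n, given the relative transforms of
    Eqs. (2)-(3): R_1i[i] = R(q_{C1}^{Ci}) and t_1i[i] = p_{C1}^{Ci}.
    The unknowns are (alpha, beta, gamma) = (X/Z, Y/Z, 1/Z) in frame C_1."""
    z = np.asarray(z, dtype=float)

    def residuals(x):
        a, b, g = x
        r = []
        for zi, R, t in zip(z, R_1i, t_1i):
            h = R @ np.array([a, b, 1.0]) + g * t  # Eq. (6), up to the Z^{C1} scale
            r.extend(zi - h[:2] / h[2])            # Eq. (7) without the noise term
        return np.asarray(r)

    # Initial guess: the bearing from the first view, plus a guessed inverse depth.
    x0 = np.array([z[0, 0], z[0, 1], 0.5])
    a, b, g = least_squares(residuals, x0, method="lm").x
    return np.array([a / g, b / g, 1.0 / g])       # p_f^{C1} = (X, Y, Z)

def is_outlier_pair(x_i, x_j, R_ji, t_ji, K, tau=1.0):
    """Eqs. (9)-(10): epipolar test for matched pixels x_i (frame C_i) and
    x_j (frame C_{i+1}), where a point p in C_i maps to R_ji @ p + t_ji in
    C_{i+1}, both obtained from the IMU; tau is an illustrative threshold."""
    Kinv = np.linalg.inv(K)
    F = Kinv.T @ skew(t_ji) @ R_ji @ Kinv          # Eq. (9)
    xi = np.append(np.asarray(x_i, float), 1.0)
    xj = np.append(np.asarray(x_j, float), 1.0)
    return abs(xj @ F @ xi) > tau                  # Eq. (10)
```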
Figure 3. The Aqua vehicle over the marked area.

IV. EXPERIMENTAL RESULTS

Our experimental comparison included the following combinations of detectors, descriptors, and matching strategies: SURF keypoints annotated with the SURF descriptor and matched using Approximate Nearest Neighbours; SURF keypoints matched using ZNCC; FAST keypoints matched using ZNCC; CenSurE keypoints matched using ZNCC; and Shi-Tomasi keypoints matched using ZNCC. For each of the experiments conducted we attempted to tune the parameters of the individual combinations in such a way that each of them performs as well as possible. The outlier detection parameters were fixed for all experiments, so that all combinations are treated in the same way.

A. Underwater Camera And IMU Datasets

We rely on 3 different datasets of synchronized camera and IMU input in order to conduct our evaluation. The datasets represent distinct underwater environments that are relatively feature-rich, and have been recorded under varying illumination conditions. All datasets were recorded at approximately 30 feet depth. The frame rate of the downward-looking camera, a PointGrey Dragonfly Hi-Color, was set at 15 Hz, and each image was undistorted and cropped before processing. The IMU sensor we used was a MicroStrain 3DM-GX, and its sampling rate was set at 50 Hz. Both sensors are on board one of the Aqua family of amphibious robots [1]. More specifically:

Figure 4. Sample images from each dataset.

Dataset 1 features a straight 30 meter-long trajectory, where the robot moves forward at approximately 0.2 meters/sec while preserving its depth. The sea bottom is mostly flat, and the robot swims a couple of meters above it. A white 30 meter-long tape has been placed on the bottom, both to provide ground truth for the distance travelled, and to facilitate the straight-line trajectory. Its greyscale camera frames were collected in approximately 3 minutes. Dataset 2 is an L-shaped trajectory, where the robot goes straight for about 7 meters, turns left, and then continues straight for about 3 meters, while preserving depth and moving at 0.3 meters/sec. Again, the sea bottom is mostly flat, and the robot swims a couple of meters above it. The difference compared to Dataset 1 is that in this case we have placed ARToolKit [22] tags on the sea bottom, in order to be able to get estimates of the relative rotations and translations between the robot and the tags, which will be useful as ground truth when we try to estimate the 3D positions of the features tracked. Dataset 3 features a remotely controlled trajectory over a coral reef. It is the most feature-rich and blur-free among the datasets we have used, and it differs from the previous two datasets in that the robot changes depth.

B. Feature Track Lengths

The distribution of feature track lengths is shown in Fig. 5, and it is a decaying curve for all combinations. The SURF detector, using Approximate Nearest Neighbors as the matching method, tracks most features for an average of 5 camera frames, and is virtually unable to do so for more than 15 frames. From the datasets we observed that, given the almost constant forward velocity of the robot, each feature is visible in the camera's field of view for about 25 frames. This means that the combination mentioned above can normally track approximately 1/5th of the real trajectory of a feature. Fig. 6 shows that this tracking is very accurate.

Figure 5. Distribution of feature track lengths for two representative combinations. The figure on the left shows the length distribution for the SURF detector with Approximate Nearest Neighbor matching; the one on the right shows it for the FAST detector, matched by the normalized cross-correlation score.
C. False Positives

Figure 6 demonstrates the difference in accuracy between the combinations involving SURF and all the rest. Matching FAST keypoints with the normalized cross-correlation similarity score results in a far higher ratio of false matches than SURF matching. This difference is pronounced when the image has motion blur, since the normalized cross-correlation similarity measure does not take into account image intensity gradients of any sort, so one should apply very aggressive outlier detection schemes when using it.
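For reference, a minimal sketch of the ZNCC score and the brute-force matching it entails follows; the acceptance threshold is an assumed value. The score compares only zero-mean, contrast-normalized intensities, which is why it carries no gradient information.

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-9):
    """Zero-mean normalized cross-correlation between two equally sized image
    patches; returns a similarity score in [-1, 1]."""
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def match_by_zncc(patches1, patches2, min_score=0.8):
    """Brute-force O(n*m) ZNCC matching between two patch lists, which is the
    cost discussed in the running-time comparison below; min_score is an
    illustrative acceptance threshold."""
    matches = []
    for i, pa in enumerate(patches1):
        scores = [zncc(pa, pb) for pb in patches2]
        j = int(np.argmax(scores))
        if scores[j] > min_score:
            matches.append((i, j))
    return matches
```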

Figure 6. Average ratio of false positive matches over all matches, per dataset, for FAST/ZNCC, Shi-Tomasi/ZNCC, CenSurE/ZNCC, SURF/ZNCC, and SURF/ANN. A high value implies matching that is prone to outliers. SURF feature matching, and Shi-Tomasi features matched with ZNCC, appear to be overall the most robust. All detectors did well on the datasets that were not blurry, including Dataset 3. FAST and CenSurE were more prone to outliers on the dataset with the highest motion blur.

D. 3D Feature Position Estimation

Offline processing of the ARToolKit tags that we placed underwater on the bottom of the experiment area showed that the robot was swimming approximately 2.5 meters over the bottom, mostly parallel to it, so we will treat that as ground truth for the depth of each feature. After removing outlier tracks using the epipolar constraints implied by Eq. (10), we rejected 3D feature position estimates that were outside the field of view of the robot, either less than 0.5 meters away from the robot or implausibly far from it. The distribution of depths from the camera frame for the remaining inliers is shown in Fig. 7, which suggests that the majority of inliers had depths that fell in the [2, 4] meter range.

Figure 7. Distribution of estimated feature depths from the first camera frame in which they were seen.

E. Running Time

We compared the average time spent on keypoint extraction and matching, normalized by the total number of keypoints processed. The results were obtained on an Intel Core Duo machine, using the OpenCV C++ library implementations of each detector, descriptor, and matching scheme. The results in Fig. 8(a) clearly show that SURF and the Shi-Tomasi detector spend approximately 1 millisecond on each feature, while CenSurE spends about 3 milliseconds. The latter is most likely due to the fact that CenSurE computes keypoints at all scales. Fig. 8(b) depicts a very significant difference between the running times of Approximate Nearest Neighbor matching, through either randomized kd-trees or K-means, and normalized cross-correlation matching. This is not surprising, as the latter has complexity O(nm), where n, m are the numbers of features extracted from the previous and the current image, respectively. On the other hand, kd-trees are constructed in O(n log n) and queried in O(log m). Therefore, if real-time processing of images is a requirement, limiting the number of features extracted by the detectors becomes necessary.

Figure 8. (a) Average feature extraction times per feature, for FAST, Shi-Tomasi, CenSurE, and SURF. CenSurE is the slowest among the detectors that we evaluated. (b) Average feature matching times per feature, for kd-tree, K-means, and ZNCC matching. The normalized cross-correlation similarity score (ZNCC) is computed by a brute-force comparison of all possible feature pairs, hence it requires more computation time than Approximate Nearest Neighbors. The time spent on computing the fundamental matrix is not included in these matching times.

F. State Estimation

Figure 9 presents two illustrative examples of the state estimation algorithm applied to Dataset 3: in one case using the SURF detector, adjusted so that it detects a few hundred features per image, and in the other case using the Shi-Tomasi detector, tuned so that it detects a comparable number of features. Less than half of those features were matched from frame to frame, yet the robot's trajectory was reproduced at rates of a few Hz for both combinations. The estimated trajectory and 3D structure of the coral are shown in Fig. 9.

Figure 9. (a) Estimated trajectory for Dataset 3, using SURF and Approximate Nearest Neighbor matching. (b) 3D coral structure, estimated via Eq. (8), along with the trajectory. (c) Estimated trajectory for Dataset 3, using Shi-Tomasi features and ZNCC matching. (d) 3D coral structure for the latter case.

V. CONCLUSION

We performed an evaluation of combinations of different feature detectors, descriptors, and matching strategies for the purpose of evaluating their suitability for our state estimation algorithm, which we want to use in underwater environments, in full 6DOF motion. We tested these combinations based on four criteria of interest: length of feature tracks, running time, false positive matches, and overall ability to facilitate the estimation of the 3D positions of features from the camera frame. We used three underwater datasets that: (a) are representative of the environments in which we want to run this algorithm, (b) have sufficient lighting and motion blur variations, and (c) provided us with some notion of ground truth. The main conclusion of our evaluation is that the most suitable combination for real-time processing would involve either SURF keypoints matched via Approximate Nearest Neighbor search, or Shi-Tomasi features matched via ZNCC. On the other hand, the FAST and CenSurE detectors were shown to be inaccurate in image sequences with high levels of motion blur.

ACKNOWLEDGMENTS

The authors would like to thank Philippe Giguère for assistance with robot navigation; Yogesh Girdhar and Chris Prahacs for their support as divers; and Marinos Rekleitis and Rosemary Ferrie for their invaluable help with the experiments. The authors would also like to thank NSERC for its generous support.

REFERENCES

[1] J. Sattar, G. Dudek, O. Chiu, I. Rekleitis, P. Giguère, A. Mills, N. Plamondon, C. Prahacs, Y. Girdhar, M. Nahon, and J.-P. Lobos, "Enabling autonomous capabilities in underwater robotics," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nice, France, September 2008.
[2] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Rome, Italy, April 2007.
[3] P. Corke, C. Detweiler, M. Dunbabin, M. Hamilton, D. Rus, and I. Vasilescu, "Experiments with underwater robot localization and tracking," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2007.
[4] T. Barfoot, "Online visual motion estimation using FastSLAM with SIFT features," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2005.
[5] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003.
[6] R. A. Newcombe and A. J. Davison, "Live dense reconstruction with a single moving camera," in CVPR, 2010.
[7] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proceedings of the Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.
[8] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.
[9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[10] J. Klippenstein and H. Zhang, "Quantitative evaluation of feature extractors for visual SLAM," in Proceedings of the Fourth Canadian Conference on Computer and Robot Vision (CRV'07), May 2007.
[11] S. Baker and I. Matthews, "Lucas-Kanade 20 years on: A unifying framework: Part 1," Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-02-16, July 2002.
[12] A. I. Mourikis and S. I. Roumeliotis, "A dual-layer estimator architecture for long-term localization," in Proceedings of the Workshop on Visual Localization for Mobile Platforms, Anchorage, AK, June 2008.
[13] N. Trawny and S. I. Roumeliotis, "Indirect Kalman filter for 3D attitude estimation," University of Minnesota, Dept. of Comp. Sci. and Eng., Tech. Rep. 2005-002, March 2005.
[14] W. G. Breckenridge, "Quaternions: proposed standard conventions," JPL, Tech. Rep. Interoffice Memorandum IOM 343-79-1199, 1999.
[15] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in ECCV, 2006.
[16] J. Shi and C. Tomasi, "Good features to track," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994.
[17] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, vol. 1, May 2006.
[18] M. Agrawal, K. Konolige, and M. R. Blas, "CenSurE: Center surround extremas for realtime feature detection and matching," in ECCV, 2008.
[19] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in International Conference on Computer Vision Theory and Applications (VISAPP'09), INSTICC Press, 2009.
[20] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, pp. 381-395, 1981.
[21] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
[22] ARToolKit. [Online].


More information

Towards the completion of assignment 1

Towards the completion of assignment 1 Towards the completion of assignment 1 What to do for calibration What to do for point matching What to do for tracking What to do for GUI COMPSCI 773 Feature Point Detection Why study feature point detection?

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribes: Jeremy Pollock and Neil Alldrin LECTURE 14 Robust Feature Matching 14.1. Introduction Last lecture we learned how to find interest points

More information

Local features: detection and description May 12 th, 2015

Local features: detection and description May 12 th, 2015 Local features: detection and description May 12 th, 2015 Yong Jae Lee UC Davis Announcements PS1 grades up on SmartSite PS1 stats: Mean: 83.26 Standard Dev: 28.51 PS2 deadline extended to Saturday, 11:59

More information

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 Image Features: Local Descriptors Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 [Source: K. Grauman] Sanja Fidler CSC420: Intro to Image Understanding 2/ 58 Local Features Detection: Identify

More information

Autonomous Navigation for Flying Robots

Autonomous Navigation for Flying Robots Computer Vision Group Prof. Daniel Cremers Autonomous Navigation for Flying Robots Lecture 7.1: 2D Motion Estimation in Images Jürgen Sturm Technische Universität München 3D to 2D Perspective Projections

More information

Direct Methods in Visual Odometry

Direct Methods in Visual Odometry Direct Methods in Visual Odometry July 24, 2017 Direct Methods in Visual Odometry July 24, 2017 1 / 47 Motivation for using Visual Odometry Wheel odometry is affected by wheel slip More accurate compared

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

A Summary of Projective Geometry

A Summary of Projective Geometry A Summary of Projective Geometry Copyright 22 Acuity Technologies Inc. In the last years a unified approach to creating D models from multiple images has been developed by Beardsley[],Hartley[4,5,9],Torr[,6]

More information

Collecting outdoor datasets for benchmarking vision based robot localization

Collecting outdoor datasets for benchmarking vision based robot localization Collecting outdoor datasets for benchmarking vision based robot localization Emanuele Frontoni*, Andrea Ascani, Adriano Mancini, Primo Zingaretti Department of Ingegneria Infromatica, Gestionale e dell

More information

Structure Guided Salient Region Detector

Structure Guided Salient Region Detector Structure Guided Salient Region Detector Shufei Fan, Frank Ferrie Center for Intelligent Machines McGill University Montréal H3A2A7, Canada Abstract This paper presents a novel method for detection of

More information

Reflection and Refraction

Reflection and Refraction Relection and Reraction Object To determine ocal lengths o lenses and mirrors and to determine the index o reraction o glass. Apparatus Lenses, optical bench, mirrors, light source, screen, plastic or

More information

Building a Panorama. Matching features. Matching with Features. How do we build a panorama? Computational Photography, 6.882

Building a Panorama. Matching features. Matching with Features. How do we build a panorama? Computational Photography, 6.882 Matching features Building a Panorama Computational Photography, 6.88 Prof. Bill Freeman April 11, 006 Image and shape descriptors: Harris corner detectors and SIFT features. Suggested readings: Mikolajczyk

More information

Visual Recognition and Search April 18, 2008 Joo Hyun Kim

Visual Recognition and Search April 18, 2008 Joo Hyun Kim Visual Recognition and Search April 18, 2008 Joo Hyun Kim Introduction Suppose a stranger in downtown with a tour guide book?? Austin, TX 2 Introduction Look at guide What s this? Found Name of place Where

More information

A Comparison of SIFT and SURF

A Comparison of SIFT and SURF A Comparison of SIFT and SURF P M Panchal 1, S R Panchal 2, S K Shah 3 PG Student, Department of Electronics & Communication Engineering, SVIT, Vasad-388306, India 1 Research Scholar, Department of Electronics

More information

Estimating Camera Position And Posture by Using Feature Landmark Database

Estimating Camera Position And Posture by Using Feature Landmark Database Estimating Camera Position And Posture by Using Feature Landmark Database Motoko Oe 1, Tomokazu Sato 2 and Naokazu Yokoya 2 1 IBM Japan 2 Nara Institute of Science and Technology, Japan Abstract. Estimating

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

Comparison of Feature Detection and Matching Approaches: SIFT and SURF

Comparison of Feature Detection and Matching Approaches: SIFT and SURF GRD Journals- Global Research and Development Journal for Engineering Volume 2 Issue 4 March 2017 ISSN: 2455-5703 Comparison of Detection and Matching Approaches: SIFT and SURF Darshana Mistry PhD student

More information

Improved Simultaneous Localization and Mapping using a Dual Representation of the Environment

Improved Simultaneous Localization and Mapping using a Dual Representation of the Environment Improved Simultaneous Localization and Mapping using a Dual Representation o the Environment Kai M. Wurm Cyrill Stachniss Giorgio Grisetti Wolram Burgard University o Freiburg, Dept. o Computer Science,

More information