Accurate Quadrifocal Tracking for Robust 3D Visual Odometry


2007 IEEE International Conference on Robotics and Automation, Roma, Italy, 10-14 April 2007

Accurate Quadrifocal Tracking for Robust 3D Visual Odometry

A.I. Comport, E. Malis and P. Rives. The authors are with INRIA, Sophia-Antipolis, France. name.surname@sophia.inria.fr. This work was supported by the MOBIVIP PREDIT project.

Abstract — This paper describes a new image-based approach to tracking the 6dof trajectory of a stereo camera pair using corresponding reference image pairs instead of explicit 3D feature reconstruction of the scene. A dense minimisation approach is employed which directly uses all grey-scale information available within the stereo pair (or stereo region), leading to very robust and precise results. Metric 3D structure constraints are imposed by consistently warping corresponding stereo images to generate novel viewpoints at each stereo acquisition. An iterative non-linear trajectory estimation approach is formulated based on a quadrifocal relationship between the image intensities within adjacent views of the stereo pair. A robust M-estimation technique is used to reject outliers corresponding to moving objects within the scene or other outliers such as occlusions and illumination changes. The technique is applied to recovering the trajectory of a moving vehicle in long and difficult sequences of images.

I. INTRODUCTION

This study is part of the MOBIVIP project aimed at autonomous vehicle navigation in urban environments. Here the core issue of 3D visual odometry is considered in the context of rapidly moving vehicles, real sequences, large-scale distances, traffic, and other types of occluding information. Indeed, tracking in urban canyons is a non-trivial problem [17], [12]. It is clear that pose estimation and visual tracking are also important in many applications including robotics, augmented reality and medical imaging.

Model-based techniques have shown that 3D CAD models are essential for robust, accurate and efficient 3D motion estimation [6]; however, they have the major drawback of requiring an a-priori model which is not always available, or is extremely difficult to obtain, as in the case of shapeless objects or large-scale urban environments. Alternative techniques propose to perform 3D structure and motion estimation online. Among this class, visual simultaneous localisation and mapping approaches [7], [4] are based on an implementation of the Extended Kalman Filter and have limited computational efficiency (manipulation and inversion of large feature co-variance matrices) and limited inter-frame movements (due to approximate non-iterative estimation). In [13] stereo and monocular visual odometry approaches are proposed based on a combination of feature extraction, matching, tracking, triangulation, RANSAC pose estimation and iterative refinement. In [12], a similar monocular technique is proposed, but drift is minimised using a local bundle adjustment technique.

Feature-based methods (e.g. [7], [4], [13], [12]) all rely on an intermediary estimation process based on detection thresholds. This feature extraction process is often badly conditioned, noisy and not robust, therefore relying on higher-level robust estimation techniques. Since the global estimation loop is never closed on the image measurements (intensities), these multi-step techniques systematically propagate feature extraction errors and accumulate drift. To eliminate drift these approaches resort to techniques such as local bundle adjustment or SLAM.
Appearance and optical flow based techniques, on the other hand, are image-based and minimise an error directly based on the image measurements. Unfortunately, they are often only monocular and make strong assumptions about the nature of the structure within the scene or the camera model. For example, in [8] an affine camera model is assumed, and in [2] and [3] planar homography models are assumed. In this way the perspective effects or the effects of non-planar 3D objects are not considered, and tracking fails easily under large movements. Of course, many papers avoid the problems of monocular algorithms (i.e. scale factor, initialisation, observability, etc.) by using multi-view constraints, and a multitude of work exists on multi-view geometry (see [9] and references therein). However, to our knowledge no work has been done on deriving an efficient region-based tracker as in [3] using stereo warping and novel view synthesis as in [1].

Another very important issue is the registration problem. Purely geometric, or numerical and iterative, approaches may be considered. Linear approaches use a least-squares method to estimate the pose and are considered to be suitable for initialisation procedures. Full-scale non-linear optimisation techniques (e.g. [8], [2], [6]) consist of minimising an objective function using numerical iterative algorithms such as Newton-Raphson or Levenberg-Marquardt. The main advantage of these approaches is their computational efficiency and accuracy; however, they may be subject to local minima and, worse, divergence. In this paper an efficient second-order approach [11], [3] is employed which improves efficiency and helps to avoid local minima.

The technique proposed in this paper provides a generic 3D visual odometry method which is able to handle large-scale scenes accurately and efficiently whilst avoiding error-prone feature extraction. This is achieved by defining a quadrifocal warping function which closes a non-linear iterative estimation loop directly with the image. A set of key reference image-pairs is used to initialise tracking locally around the reference positions. These reference pairs provide a calibrated set of highly redundant dense correspondences to perform tracking and pose estimation. As will be shown, this leads to very impressive results in real scenes with occlusions, large inter-frame displacements, and very little drift over very long sequences of images.

II. TRAJECTORY ESTIMATION

A framework is described for estimating the trajectory of a stereo-camera rig along a sequence from a designated region within the image. The tracking problem will essentially be considered as a pose estimation problem which is related directly to the grey-level brightness measurements within the stereo pair via a non-linear model which accounts for the 3D geometric configuration of the scene. Since our final objective is to control a robot within Euclidean space, a calibrated camera pair is considered.

Consider a stereo camera pair with two brightness functions $I(\mathbf{p},t)$ and $I'(\mathbf{p}',t)$ for the left and right cameras respectively, where $\mathbf{p} = (u,v)$ and $\mathbf{p}' = (u',v')$ are pixel locations within the two images acquired at time $t$. It is convenient to consider the set of image measurements in vector form such that $\mathcal{I} = (\mathbf{I}, \mathbf{I}') \in \mathbb{R}^n$ is a vector of intensities of the left image stacked on top of the right. Similarly, $\mathcal{P} = \{\mathbf{p}, \mathbf{p}'\}$ are stereo image correspondences from the reference template pair. $\mathcal{I}$ will be called the current view pair and $\mathcal{I}^*$ the reference view pair. A superscript $*$ will be used throughout to designate the reference view variables.

Any set of corresponding pixels from the reference image-pair is considered as a reference template, denoted by $\mathcal{R}^* = \{\{\mathbf{p},\mathbf{p}'\}^*_1, \{\mathbf{p},\mathbf{p}'\}^*_2, \dots, \{\mathbf{p},\mathbf{p}'\}^*_n\}$, where $n$ is the number of corresponding point pairs in the stereo images.

The motion of the camera pair or of objects within the scene induces a deformation of the reference template. The 3D geometric deformation of a stereo rig can be fully defined by a motion model $w(\mathbf{p}, T', K, K'; T(t))$. The motion model $w$ considered in this paper is the quadrifocal warping function, which will be detailed further in Section III. $K$ and $K'$ contain the intrinsic calibration parameters for the left and right cameras respectively. $T' = (R', \mathbf{t}') \in SE(3)$ is the homogeneous matrix of the extrinsic camera pose of the right camera w.r.t. the left, and $T = (R, \mathbf{t}) \in SE(3)$ is the current pose of the stereo rig relative to the reference position. Throughout, $R$ is a rotation matrix and $\mathbf{t}$ the translation vector. Since both the intrinsic and extrinsic calibration parameters do not vary with time, they will be assumed implicit. It follows that the reference image is obtained by warping the current image as:

$$\mathcal{I}^*(\mathcal{P}^*) = \mathcal{I}\big(w(\mathcal{P}^*; \overline{T}), t\big), \quad \forall \mathcal{P}^* \in \mathcal{R}^*, \qquad (1)$$

where $\overline{T}$ is the true pose.

Suppose that at the current image an estimate $\widehat{T}$ fully represents the pose of the stereo pair with respect to a pair of reference images. The tracking problem then becomes one of estimating the incremental pose $T(\mathbf{x})$, where it is supposed that $\exists\, \tilde{\mathbf{x}} : T(\tilde{\mathbf{x}})\,\widehat{T} = \overline{T}$. The estimate is updated by a homogeneous transformation $\widehat{T} \leftarrow T(\mathbf{x})\,\widehat{T}$. The unknown parameters $\mathbf{x} \in \mathbb{R}^6$ are defined as:

$$\mathbf{x} = \int_0^1 (\boldsymbol{\omega}, \boldsymbol{\upsilon})\, dt \in \mathfrak{se}(3), \qquad (2)$$

which is the integral of a constant-velocity twist which produces a pose $T$.

Fig. 1. The quadrifocal geometry of a stereo pair at two subsequent time instants. Two points $\mathbf{p}$ and $\mathbf{p}'$ are initialised only once, at the beginning of the tracking process, to be in correspondence. The central pose $T$ is estimated via a non-linear warping function which warps all points in the reference stereo pair to the current image points. The quadrifocal warping function is defined by choosing two lines $\mathbf{l}$ and $\mathbf{l}'$ passing through corresponding points in the first image pair. The extrinsic parameters $T'$ are assumed known a-priori.

The pose and the twist are related via
the exponential map as $T = e^{[\mathbf{x}]_\wedge}$, with the operator $[\cdot]_\wedge$ defined as:

$$[\mathbf{x}]_\wedge = \begin{bmatrix} [\boldsymbol{\omega}]_\times & \boldsymbol{\upsilon} \\ 0 & 0 \end{bmatrix},$$

where $[\cdot]_\times$ represents the skew-symmetric matrix operator. Thus the pose and the trajectory of the camera pair can be estimated by minimising a non-linear least-squares cost function:

$$C(\mathbf{x}) = \sum_{\mathcal{P}^* \in \mathcal{R}^*} \Big( \mathcal{I}\big(w(\mathcal{P}^*; T(\mathbf{x})\widehat{T})\big) - \mathcal{I}^*(\mathcal{P}^*) \Big)^2. \qquad (3)$$

This function is minimised using the robust, efficient and precise second-order minimisation procedure detailed in Section IV.
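To make the parameterisation concrete, the following minimal sketch (Python/NumPy, hypothetical helper names; SciPy's general matrix exponential stands in for a closed-form SE(3) exponential) shows how a twist $\mathbf{x} = (\boldsymbol{\omega}, \boldsymbol{\upsilon})$ from (2) could be mapped to a homogeneous transformation and how the estimate is composed incrementally as $\widehat{T} \leftarrow T(\mathbf{x})\widehat{T}$.

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    """Skew-symmetric matrix [w]_x of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def twist_to_pose(x):
    """Map a twist x = (omega, upsilon) in se(3) to T = exp([x]^) in SE(3)."""
    omega, upsilon = x[:3], x[3:]
    X = np.zeros((4, 4))
    X[:3, :3] = skew(omega)      # rotational part [omega]_x
    X[:3, 3] = upsilon           # translational part
    return expm(X)               # matrix exponential of the 4x4 twist matrix

def update_pose(x, T_hat):
    """Incremental update T_hat <- T(x) T_hat used at each minimisation step."""
    return twist_to_pose(x) @ T_hat
```

In practice a closed-form Rodrigues-style exponential would be preferred for speed, but the composition rule is the same.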

III. NOVEL VIEW SYNTHESIS AND WARPING

The geometric configuration of the stereo pair is based on the paradigm that four views of a scene satisfy quadrifocal constraints. Thus, given a reference stereo view in correspondence and the quadrifocal tensor, a third and a fourth view can be generated by means of a warping function. This warping function subsequently provides the required relationship between two views of the scene and an adjacent view-pair in a sequence of images.

A. Quadrifocal Geometry

A point $P \in \mathbb{R}^3$ in 3D Euclidean space projects onto the camera image plane by a $3 \times 4$ projection matrix $M = K[R\ \mathbf{t}]$, where the image point is given by $\mathbf{p} = MP$ so that $\mathbf{p} = (u, v, 1)$ is the homogeneous pixel vector (see Figure 1). In [16] it is shown that the quadrifocal tensor can be decomposed into two trifocal tensors. In this way the geometry between two stereo pairs is defined in a manner that is simple for subsequent developments, using the canonical coordinates of two triplets of images.

First of all, consider the triplet consisting of the left reference camera, the right reference camera and the left current camera. The left reference camera matrix is chosen as the origin, so that $M = K[I\ \mathbf{0}]$. The reference projection matrix for the right camera (the extrinsic camera pose) and the current projection matrix for the left camera are then $M' = K'[R'\ \mathbf{t}']$ and $M'' = K[R''\ \mathbf{t}'']$. The second triplet is defined in a similar manner such that the right reference camera is chosen as the origin and the left reference camera and right current camera matrices are defined with respect to this origin.

In order to construct the quadrifocal relation it is necessary to combine these two triplets of images. This is done symmetrically by defining the world origin as the geodesic centre between the two reference cameras. To do this the extrinsic parameters must be separated into two distinct poses with respect to the centre as:

$$T_c = e^{\log(T')/2} \quad \text{and} \quad T'_c = T_c\, T'^{-1}, \qquad (4)$$

where $e$ and $\log$ are the matrix exponential and logarithm. The pose from the left reference camera to the current one is therefore composed from the central pose as:

$$T_l = T_c^{-1}\, T\, T_c, \qquad (5)$$

where $T$ is the unknown (central) pose to be estimated.
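As an illustration of the canonical decomposition in (4), the sketch below computes the geodesic centre pose with a matrix logarithm and exponential. It assumes the extrinsic pose $T'$ is available as a 4x4 NumPy array; the function name is hypothetical.

```python
import numpy as np
from scipy.linalg import expm, logm

def centre_poses(T_prime):
    """Split the extrinsic pose T' into the two canonical centre poses of eq. (4).

    T_prime : 4x4 homogeneous matrix of the right camera w.r.t. the left.
    Returns (T_c, T_c_prime) where T_c = exp(log(T')/2) is the geodesic
    mid-point pose and T_c' = T_c T'^-1.
    """
    T_c = np.real(expm(0.5 * logm(T_prime)))     # half-way pose on SE(3)
    T_c_prime = T_c @ np.linalg.inv(T_prime)
    return T_c, T_c_prime
```

These two centre poses only need to be computed once, since the extrinsic calibration does not vary with time.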
B. Quadrifocal warping

The quadrifocal warping function $w(\mathcal{P}^*; T)$ from (3) can now be considered to be composed of a trifocal tensor for each of the left and right images. These functions are then minimised simultaneously using (5) in Section IV so that the quadrifocal constraints hold.

The trifocal tensor is used to transfer (warp) corresponding points from two views to a third view. This tensor depends only on the relative motion of the cameras as well as the intrinsic and extrinsic camera parameters. The compact tensor notation of multi-focal geometry will be used here with a covariant-contravariant summation convention. Contravariant point vectors $p^i$ are denoted with a superscript and their covariant counterparts, representing lines $l_j$, are denoted with a subscript. A contraction or summation over two tensors occurs when there are repeated indices in both contravariant and covariant variables (i.e. $p^i l_i = \sum_j p^j l_j$). An outer product of two first-order tensors (vectors), $a_i b^j$, is a second-order tensor (matrix) $c_i^j$, which is equivalent to $C = \mathbf{b}\mathbf{a}^\top$ in matrix notation.

The trifocal tensor $\mathcal{T}$ is a third-order tensor represented by a homogeneous $3 \times 3 \times 3$ array of elements. The calibrated trifocal tensor is given as:

$$T_i^{jk} = k'^j_m t'^m\; r''^o_n\, k^{-1\,n}_i\; k''^k_o \;-\; k''^k_m t''^m\; r'^o_n\, k^{-1\,n}_i\; k'^j_o,$$

where $(r', t')$ and $(r'', t'')$ are the tensor forms of the rotation matrix and translation vector for the second and third camera matrices respectively, and $k$ and $k'$ are the intrinsic calibration components of the left and right camera matrices respectively. Note that $k'' = k$ or $k'' = k'$ depending on whether one is warping to the left or the right camera at the next time instant. Given any line $l'$ coincident with $p'$, or any line $l''$ coincident with $p''$, the trifocal tensor contracts so as to become a homography $h$ which maps points from one reference image to the current image; i.e. a line defined in one of the reference views defines a plane which can be used to warp a point between the remaining reference image and the current image.

Thus the warping from the left reference image to the left current image, via a plane in the right reference image, is given by:

$$p''^k = p^i\, l'_j\, T_i^{jk} = h_i^k\, p^i,$$

where $p^i$ is a point in the left reference image, $l'_j$ is a line defined in the right reference image and $p''^k$ is the warped point in the left current image. This equation is used similarly for warping a point in the right reference image to a point in the right current image via a plane in the left reference image. As opposed to transfer using the fundamental matrix, the tensor approach is free from singularities when the 3D point lies on the trifocal plane. The only degenerate situation occurs if a 3D point lies on the baseline joining the first and the second cameras, since the rays through $p$ and $p'$ are then co-linear.

The stereo warping operator is then given by:

$$\begin{bmatrix} p''^k \\ p'''^n \end{bmatrix} = \begin{bmatrix} p^i\, l'_j\, \mathcal{T}_i^{jk} \\ p'^l\, l_m\, \mathcal{T}'^{\,mn}_l \end{bmatrix}, \qquad (6)$$

where the indices of the two trifocal tensors indicate transfer to the left and right current cameras respectively. The lines $l$ and $l'$ are chosen, similarly to [1], to be the diagonal line $(-1, -1, u+v)$ coincident with the point $(u, v)$.

It is important for further developments to highlight that the stereo warping operator $w(\mathcal{P}^*; T)$ is a group action. Indeed, the following operations hold:

1) The identity map:
$$w(\mathcal{P}^*; I) = \mathcal{P}^*, \quad \forall \mathcal{P}^* \in \mathbb{R}^4, \qquad (7)$$

2) The composition of an action corresponds to the action of a composition, $\forall\, T_1, T_2 \in SE(3)$:
$$w\big(w(\mathcal{P}^*, T_1), T_2\big) = w(\mathcal{P}^*, T_1 T_2), \quad \forall \mathcal{P}^* \in \mathbb{R}^4. \qquad (8)$$
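The point transfer underlying (6) can be sketched as follows, assuming the two 3x3x3 trifocal tensors have already been assembled from the calibration and pose parameters; the diagonal line $(-1,-1,u+v)$ is the one described above and the helper names are hypothetical.

```python
import numpy as np

def diagonal_line(p):
    """Diagonal line (-1, -1, u+v) passing through the pixel p = (u, v, 1)."""
    return np.array([-1.0, -1.0, p[0] + p[1]])

def trifocal_transfer(T, p, l):
    """Transfer p''^k = p^i l'_j T_i^{jk}: warp a reference point to the current
    view using a line through its correspondent in the other reference image."""
    q = np.einsum('i,j,ijk->k', p, l, T)   # contraction over i and j
    return q / q[2]                        # normalise the homogeneous result

def quadrifocal_warp(T_left, T_right, p, p_prime):
    """Stereo warping operator of eq. (6): warp a reference correspondence
    (p, p') to the current left and right images."""
    q_left = trifocal_transfer(T_left, p, diagonal_line(p_prime))
    q_right = trifocal_transfer(T_right, p_prime, diagonal_line(p))
    return q_left, q_right
```

In a full implementation this warp would be applied to every correspondence of the reference template $\mathcal{R}^*$ at each iteration of the minimisation.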

IV. ROBUST SECOND-ORDER MINIMISATION

The aim now is to minimise the objective criterion defined previously in (3) in an accurate and robust manner. The robust objective function therefore becomes:

$$O(\mathbf{x}) = \sum_{\mathcal{P}^* \in \mathcal{R}^*} \rho\Big( \mathcal{I}\big(w(\mathcal{P}^*; T(\mathbf{x})\widehat{T})\big) - \mathcal{I}^*(\mathcal{P}^*) \Big), \qquad (9)$$

where $\rho(u)$ is a robust function [10] that grows sub-quadratically and is monotonically non-decreasing with increasing $|u|$ (see [5]). Since this is a non-linear function of the unknown pose parameters, an iterative minimisation procedure is employed. The robust objective function is minimised by solving:

$$\nabla_{\mathbf{x}} O(\mathbf{x})\big|_{\mathbf{x}=\tilde{\mathbf{x}}} = 0,$$

where $\nabla_{\mathbf{x}}$ is the gradient operator with respect to the unknown parameters (2), and there exists a stationary point $\mathbf{x} = \tilde{\mathbf{x}}$ which is the global minimum of the cost function. Since both the reference image and the current image are available, it is possible to use the efficient second-order approximation (ESM) proposed in [11], [3]. In this case the ESM approximation is given as:

$$\mathcal{I}(\tilde{\mathbf{x}}) \approx \mathcal{I}(0) + \tfrac{1}{2}\big(J(0) + J(\tilde{\mathbf{x}})\big)\,\tilde{\mathbf{x}},$$

where $J(0)$ is the current image Jacobian and $J(\tilde{\mathbf{x}})$ is the reference image Jacobian.

The current Jacobian $J(0)$ is quite straightforward and can be decomposed into modular parts as:

$$J(0) = J_{\mathcal{I}}\, J_K\, J_P\, J_w\, J_{T_c}(0)\, J_V,$$

where $J_{\mathcal{I}}$ is the current image-pair gradient, of dimension $n \times 4n$; $J_K$ is the partial derivative of the pixel coordinates with respect to their corresponding metric coordinates, of dimension $4n \times 4n$; $J_P$ is the partial derivative of each metric coordinate w.r.t. the un-normalised point coordinates, of dimension $4n \times 6n$; $J_w$ is the Jacobian of the un-normalised point coordinates with respect to the elements of the warping function, of dimension $6n \times 27$ (the $3\times3\times3$ tensor elements); and $J_{T_c}(0)$ is the partial derivative of the tensor elements w.r.t. the canonical pose parameters, of dimension $27 \times 6$. Note that there are two sets of unknown parameters at this stage, corresponding to the left and right trifocal tensors respectively. The last Jacobian, $J_V$, of dimension $6 \times 6$, is used to centre the two components of $J_{T_c}(0)$, corresponding to the left and right canonical coordinate systems, so that they represent the same minimal set of unknown parameters. By the definition given in (5), the unknown parameters are those related to the central pose of the stereo pair. This Jacobian therefore corresponds to a pair of twist transformation matrices. This twist transformation (the adjoint map) is given as:

$$V = \begin{bmatrix} R_c & [\mathbf{t}_c]_\times R_c \\ 0_3 & R_c \end{bmatrix}, \qquad (10)$$

where $T_c = (R_c, \mathbf{t}_c)$ is the centring pose given in (4), which maps the current left camera to the stereo centre. Similarly, an adjoint map can be obtained to transform the twist of the right current camera with respect to the right reference camera using $T'_c$.

The reference Jacobian $J(\tilde{\mathbf{x}})$ is obtained as:

$$J(\tilde{\mathbf{x}}) = J_{\mathcal{I}^*}\, J_K\, J_P\, J_w\, J_{T_c}(\tilde{\mathbf{x}})\, J_V,$$

where only $J_{\mathcal{I}^*}$ and $J_{T_c}(\tilde{\mathbf{x}})$ differ from the current Jacobian. Computing $J_{T_c}(\tilde{\mathbf{x}})$ usually requires knowing the solution $\tilde{\mathbf{x}}$ to the estimation problem. However, due to the left-invariant structure of the Lie group it can be shown that:

$$J_{T_c}(\tilde{\mathbf{x}})\, J_V\, \tilde{\mathbf{x}} = J_{T_c}(0)\, J_V\, \tilde{\mathbf{x}}. \qquad (11)$$

In this way the ESM second-order approximation of the Jacobian is given by:

$$J(\mathbf{x})\big|_{\mathbf{x}=\tilde{\mathbf{x}}} = \tfrac{1}{2}\big(J_{\mathcal{I}} + J_{\mathcal{I}^*}\big)\, J_K\, J_P\, J_w\, J_{T_c}(0)\, J_V, \qquad (12)$$

where only $J_{\mathcal{I}}$ varies with time and needs to be computed at each iteration. The objective function is minimised by iteratively solving (9), using (12) and (6), for:

$$\mathbf{x} = \lambda\, (D J)^{+}\, D\, (\mathcal{I}^* - \mathcal{I}), \qquad (13)$$

where $(DJ)^{+}$ is the pseudo-inverse, $D$ is a diagonal weighting matrix determined from a robust function (see [6]) and $\lambda$ is the gain which ensures an exponential decrease of the error.
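A minimal sketch of one iteration of the robust update (13) is given below, assuming the stacked ESM Jacobian and the current and reference intensity vectors have already been computed. Huber weights with a MAD scale estimate stand in for the M-estimator detailed in [5]; all names are hypothetical.

```python
import numpy as np

def huber_weights(residuals, k=1.345):
    """Huber M-estimator weights; scale estimated with the Median Absolute Deviation."""
    sigma = 1.4826 * np.median(np.abs(residuals - np.median(residuals)))
    r = np.abs(residuals) / max(sigma, 1e-12)
    w = np.ones_like(r)
    w[r > k] = k / r[r > k]
    return w

def robust_update(J, I_cur, I_ref, gain=1.0):
    """One iteration of x = lambda (D J)^+ D (I* - I) from eq. (13)."""
    error = I_cur - I_ref                          # intensity error I - I*
    D = huber_weights(error)                       # diagonal weights, stored as a vector
    DJ = J * D[:, None]                            # apply the weights row-wise
    x = gain * np.linalg.pinv(DJ) @ (D * -error)   # weighted pseudo-inverse solution
    return x
```

The resulting twist $\mathbf{x}$ would then be mapped through the exponential map (see the sketch at the end of Section II) to update the pose estimate $\widehat{T}$.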
A. Reference Image-pairs

As opposed to 3D model-based estimation techniques, no pose initialisation is required and the 3D odometry estimation simply begins at the origin (identity). Dense correspondences are, however, required for each reference image-pair (see Section V-A). As the camera pair moves through the scene, the reference image may no longer be visible, or the warped resolution may become so poor that many pixels must be interpolated. In both cases this leads to mis-tracking.

Therefore, in order to perform large-scale tracking it is necessary to continually update the reference image pair $\mathcal{I}^*$. An update is detected by monitoring the error norm along with a robust estimate of the scale of the error distribution (i.e. the Median Absolute Deviation). As soon as they become too large, another set of dense correspondences between the stereo pair is computed so as to reinitialise the tracking. As long as the same reference image is used, the minimisation cut-off thresholds can be tuned for speed, since the next estimation will recover any leftover error; however, if the reference image is changed, the previous estimate is minimised with smaller cut-off thresholds so as to minimise any drift that may be left over.

V. IMPLEMENTATION

A. Dense Correspondences

As mentioned, the reference image pair(s) need to be initialised with dense correspondences. The correspondence problem has been heavily studied in the computer vision literature and many different approaches are possible [15]. When the cameras are calibrated, the correspondence problem reduces to a 1D search along epipolar lines. This can either be performed off-line in a learning phase or on-line at each new acquisition, depending on computational requirements (real-time approaches are feasible [18]). In this paper the approach given in [14] was used; however, any other type of dense correspondence algorithm could be used. The chosen method is particularly suited to urban canyon environments since the notions of horizontal and vertical slant are used to approximate first-order piecewise continuity. In this way the geometric projection of slanted surfaces from N pixels on one epipolar line to M pixels on another is not necessarily one-to-one, but can be many-to-one or one-to-many. See Figure 2 for correspondence results of a typical image pair. In this case the disparity search region was fixed to a range of 0 to 60 pixels along the epipolar lines. Since the baseline is relatively small, it is possible to obtain a highly redundant number of dense correspondences.
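For illustration only, the sketch below performs the 1D epipolar search with a brute-force SAD cost over the 0-60 pixel disparity range mentioned above. It is not the slant-based matcher of [14], simply a stand-in showing where the dense correspondences come from; rectified grey-scale images and the function name are assumptions.

```python
import numpy as np

def disparity_1d_search(left, right, max_disp=60, win=5):
    """Brute-force 1D SAD search along epipolar lines of a rectified stereo pair.

    Returns an integer disparity (0..max_disp) per pixel of the left image.
    """
    h, w = left.shape
    r = win // 2
    disparity = np.zeros((h, w), dtype=np.int32)
    for v in range(r, h - r):
        for u in range(r + max_disp, w - r):
            patch = left[v - r:v + r + 1, u - r:u + r + 1].astype(np.float32)
            best, best_d = np.inf, 0
            for d in range(0, max_disp + 1):         # 1D search along the epipolar line
                cand = right[v - r:v + r + 1, u - d - r:u - d + r + 1].astype(np.float32)
                cost = np.abs(patch - cand).sum()    # sum of absolute differences
                if cost < best:
                    best, best_d = cost, d
            disparity[v, u] = best_d
    return disparity
```

A real-time system would replace this exhaustive loop with an optimised matcher such as [18] or an off-line learning phase, as discussed above.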

Fig. 2. Dense correspondence of an urban canyon, with correspondence occlusions in black: (a) left image and (b) right image warped to the same shape as the left; (c,d) the original images.

Fig. 4. Trajectory tracking along a road in Versailles. The trajectory, shown in white, has been superimposed on a satellite image. A typical stereo image is shown at the top.

Fig. 3. Robust outlier rejection: two images showing the outlier rejection weights. The darker the points, the less influence they have on the estimation process. In (a) it can be seen that a moving truck has been rejected. In (b) a moving pedestrian has been rejected. It can also be noted that other outliers are detected in the image. These points generally correspond to matching error, noise in the image or the self-occlusion of the corners of the buildings.

B. Robust Estimation

A robust M-estimation technique (as detailed in [5]) was used to reject outliers not corresponding to the definition of the objective function. The use of robust techniques is very interesting in the case of a highly redundant set of measurements, as is the case for a set of dense correspondences. The outliers generally correspond to occlusions, illumination changes, matching error, noise in the image or the self-occlusion of the corners of the buildings. In Figure 3(a) a moving truck has been rejected as an outlier whilst a stationary truck in the background was used to estimate the pose. In this way it can be seen that the proposed algorithm has exploited all the useful information in the image so as to estimate the pose. In Figure 3(b) a moving pedestrian has been rejected, and it can be seen that both the pedestrian projected from the reference image as well as the current position of the pedestrian have been rejected. This type of information could be useful in an application for determining the trajectory of moving obstacles.

C. Extended Kalman Filter

In the context of tracking the trajectory of a car, very large inter-frame movements are observed. In the sequences considered in the following results, typical inter-frame movement was 1-2 meters per image with a car travelling at between 50 and 70 km/hr. Even though tracking succeeds without predictive filtering, in order to improve computational efficiency (significantly fewer iterations in the minimisation) it was necessary to implement a predictive filter. In this paper the well-known Extended Kalman filter described in [19] was used for filtering the pose (i.e. the pose estimate was considered to be the measurement input to the filter).
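The exact filter used is the one described in [19]. Purely as an illustration of the role such a predictor plays, the sketch below implements a generic linear constant-velocity Kalman filter on a 6-dof pose vector, with the tracker's pose estimate as the measurement; all names and noise values are hypothetical and this is not the authors' filter.

```python
import numpy as np

class ConstantVelocityPoseFilter:
    """Generic constant-velocity Kalman filter on a 6-dof pose vector.

    State: [pose (6), velocity (6)]. The measurement is the pose estimated by
    the quadrifocal tracker; the prediction is used to warm-start the next
    minimisation so that fewer iterations are needed.
    """
    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(12)
        self.P = np.eye(12)
        self.Q = q * np.eye(12)                    # process noise
        self.R = r * np.eye(6)                     # measurement noise
        self.H = np.hstack([np.eye(6), np.zeros((6, 6))])

    def predict(self, dt):
        F = np.eye(12)
        F[:6, 6:] = dt * np.eye(6)                 # pose <- pose + dt * velocity
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q
        return self.x[:6]                          # predicted pose

    def update(self, measured_pose):
        y = measured_pose - self.H @ self.x        # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(12) - K @ self.H) @ self.P
```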
D. Trajectory Estimation

The algorithm was tested on real full-scale sequences, as can be seen in Figures 4 and 5. Radial distortion was removed from the images before processing. Several test sequences from different streets in Versailles, France, were used to validate the results. These video demonstrations, plus more, are available online at the authors' websites.

The sequence shown in Figure 4 is that of a relatively straight road. The estimated path length was compared to satellite views (precision of 2.9 cm/pixel) by comparing road markings in both the stereo images and the satellite views. The path length measured by Google Earth and that measured by the tracker were both about 440 m. It is difficult to comment further on the precision, since it is not possible to register the satellite image with the projection of the trajectory unless three corresponding non-collinear points are available. However, the two do have approximately the same absolute length (ignoring the tilt of the cameras and the incline of the road). Throughout the sequence several moving vehicles pass in front of the cameras and at one stage a car is overtaken.

The sequence shown in Figure 5 is particularly illustrative since a full loop of the round-about was performed. In particular, this enables the drift to be measured at the crossing point in the trajectory. In the case of this round-about, the drift at the crossing point was approximately 20 cm in the direction vertical to the road-plane. Considering that the trajectory around the round-about is approximately 200 m long (measured using Google Earth), this amounts to a drift of 0.1% on the measurable axis. In the case of large-scale scenes such as this one, it was necessary to detect and update the reference image periodically when it was no longer visible or too approximate. Due to the highly redundant amount of data, the robust estimator was able to successfully reject

pedestrians and moving cars from the estimation process. It can be noted, however, that all static information available was used to estimate the pose (including the parked cars), therefore leading to a very precise result with minimal drift over large displacements.

Fig. 5. Trajectory tracking around a round-about in Versailles: (a) the trajectory, shown in white, has been superimposed on a satellite image, and it can be seen visually that the trajectory aligns with the four corners of the round-about (4 points are required to estimate the pose). The length of the path is approximately 392 m, taken in 698 images. The maximum inter-frame displacement was 1.78 m and the maximum inter-frame rotation was 2.3 degrees. (b and c) Several occlusions which occurred during the sequence, at images 300 and 366 respectively (on the right side of the round-about).

E. Computational requirements

A prototype has been written in Matlab to prove the concept (there is no code or hardware optimisation). Even so, it computes on average at approximately 1.5 Hz when 10% of the information is used (the strongest gradients) and at around 10 s/image-pair when an image of 759 x 280 is used. Furthermore, there is only a small difference in precision between the full and reduced images. With the numbers given here there was only about 0.004% drift in translation and 0.03 deg/deg drift in rotation when measured over a 360 m long sequence. The approach is very efficient and could be implemented to run in real-time at video-rate. Two main options are possible: 1. Use an off-line training sequence to obtain a set of dense corresponding reference image-pairs for use online. 2. Perform dense correspondences online (i.e. in real-time [18]).

VI. CONCLUSIONS AND FUTURE WORKS

The quadrifocal tracking methodology described in this paper has been shown to be very efficient, accurate (very small drift) and robust over a wide range of scenarios. The approach is very interesting because trajectory estimation is integrated into a single global sensor-based process that does not depend on intermediate-level features. Tracking is initialised automatically at the origin within the visual odometry approach. Furthermore, a compact image-based stereo model of the environment may be obtained using standard dense stereo correspondence algorithms instead of explicit estimation of an a-priori 3D model. The robust efficient second-order minimisation technique also allows minimisation of a highly redundant non-linear function in a precise manner. Indeed, the algorithm rejects outliers such as pedestrians, traffic, building occlusions and matching error. Further work will be devoted to estimating optimal stereo image-based models of the environment by updating the dense correspondences in a Simultaneous Localisation and Correspondence style approach. It would be interesting to test loop-closing procedures and devise strategies to recognise previously seen places within this framework.

REFERENCES

[1] S. Avidan and A. Shashua. Threading fundamental matrices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(1):73-77, January 2001.
[2] S. Baker and I. Matthews. Equivalence and efficiency of image alignment algorithms. In Proceedings of the 2001 IEEE Conference on Computer Vision and Pattern Recognition, December 2001.
[3] S. Benhimane and E. Malis. Real-time image-based tracking of planes using efficient second-order minimization. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, 28 September - 2 October 2004.
[4] A. Chiuso, P. Favaro, H. Jin, and S.
Soatto. Structure from motion causally integrated over time. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):523-535, 2002.
[5] A.I. Comport, E. Marchand, and F. Chaumette. Statistically robust 2D visual servoing. IEEE Transactions on Robotics, 22(2):415-421, April 2006.
[6] A.I. Comport, E. Marchand, M. Pressigout, and F. Chaumette. Real-time markerless tracking for augmented reality: the virtual visual servoing framework. IEEE Transactions on Visualization and Computer Graphics, 12(4):615-628, July 2006.
[7] A. J. Davison and D. W. Murray. Simultaneous localisation and map-building using active vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, July 2002.
[8] G. Hager and P. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10):1025-1039, October 1998.
[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2001.
[10] P.-J. Huber. Robust Statistics. Wiley, New York, 1981.
[11] E. Malis. Improving vision-based control using efficient second-order minimization techniques. In IEEE International Conference on Robotics and Automation, ICRA'04, volume 2, pages 1843-1848, New Orleans, April 26 - May 1, 2004.
[12] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real-time localization and 3D reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, New York, USA, June 2006.
[13] D. Nistér, O. Naroditsky, and J. Bergen. Visual odometry. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004, volume 1, pages 652-659, July 2004.
[14] A.S. Ogale and Y. Aloimonos. Shape and the stereo correspondence problem. International Journal of Computer Vision, 65(1), October 2005.
[15] D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision, Kauai, HI, December 2001.
[16] A. Shashua and L. Wolf. On the structure and properties of the quadrifocal tensor. In European Conference on Computer Vision, pages 710-724, 2000.
[17] N. Simond and P. Rives. Trajectography of an uncalibrated stereo rig in urban environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, pages 3381-3386, Sendai, Japan, 28 September - 2 October 2004.
[18] W. van der Mark and D.M. Gavrila. Real-time dense stereo for intelligent vehicles. IEEE Transactions on Intelligent Transportation Systems, 7(1):38-50, March 2006.
[19] Z. Zhang and O. Faugeras. Three dimensional motion computation and object segmentation in a long sequence of stereo frames. International Journal of Computer Vision, 7(3):211-241, 1992.