Semi-Direct Visual Odometry for Monocular, Wide-angle, and Multi-Camera Systems

Size: px

Start display at page:

Download "Semi-Direct Visual Odometry for Monocular, Wide-angle, and Multi-Camera Systems"

Meredith Lang
6 years ago
Views:

1 1 Sem-Drect Vsual Odometry for Monocular, Wde-angle, and Mult-Camera Systems Chrstan Forster, Zchao Zhang, Mchael Gassner, Manuel Werlberger, Davde Scaramuzza Abstract Drect methods for Vsual Odometry (VO) have ganed popularty due to ther capablty to explot nformaton from all mage gradents n the mage. However, low computatonal speed as well as mssng guarantees for optmalty and consstency are lmtng factors of drect methods, where establshed feature-based methods nstead succeed at. Based on these consderatons, we propose a Sem-drect VO (SVO) that uses drect methods to track and trangulate pxels that are characterzed by hgh mage gradents but reles on proven feature-based methods for jont optmzaton of structure and moton. Together wth a robust probablstc depth estmaton algorthm, ths enables us to effcently track pxels lyng on weak corners and edges n envronments wth lttle or hgh-frequency texture. We further demonstrate that the algorthm can easly be extended to multple cameras, to track edges, to nclude moton prors, and to enable the use of very large feld of vew cameras, such as fsheye and catadoptrc ones. Expermental evaluaton on benchmark datasets shows that the algorthm s sgnfcantly faster than the state of the art whle achevng hghly compettve accuracy. SUPPLEMENTARY MATERIAL Vdeo of the experments: I. INTRODUCTION Estmatng the sx degrees-of-freedom moton of a camera merely from ts stream of mages has been an actve feld of research for several decades [1 6]. Today, state-of-the-art vsual SLAM (V-SLAM) and vsual odometry (VO) algorthms run n real-tme on smart-phone processors and approach the accuracy, robustness, and effcency that s requred to enable varous nterestng applcatons. Examples comprse the robotcs and automotve ndustry, where the ego-moton of a vehcle must be known for autonomous operaton. Other applcatons are vrtual and augmented realty, whch requres precse and low latency pose estmaton of moble devces. The central requrement for the successful adopton of vson-based methods for such challengng applcatons s to obtan hghest accuracy and robustness wth a lmted computatonal budget. The most accurate camera moton estmate s obtaned through jont optmzaton of structure (.e., landmarks) and moton (.e., camera poses). For feature-based methods, ths s an establshed problem that s commonly known as bundle adjustment [7] and many solvers exst, The authors are wth wth the Robotcs and Percepton Group, Unversty of Zurch, Swtzerland. Contact nformaton: {forster, zzhang, gassner, werlberger, sdavde}@f.uzh.ch. Ths research was partally funded by the Swss Natonal Foundaton (project number , Swarm of Flyng Cameras), the Natonal Center of Competence n Research Robotcs (NCCR), the UZH Forschungskredt, and the SNSF-ERC Startng Grant. whch address the underlyng non-lnear least-squares problem effcently [8 11]. Three aspects are key to obtan hghest accuracy when usng sparse feature correspondence and bundle adjustment: (1) long feature tracks wth mnmal feature drft, (2) a large number of unformly dstrbuted features n the mage plane, and (3) relable assocaton of new features to old landmarks (.e., loop-closures). The probablty that many pxels are tracked relably, e.g., n scenes wth lttle or hgh frequency texture (such as sand [12] or asphalt [13]), s ncreased when the algorthm s not restrcted to use local pont features (e.g., corners or blobs) but may track edges [14] or more generally, all pxels wth gradents n the mage, such as n dense [15] or sem-dense approaches [16]. Dense or sem-dense algorthms that operate drectly on pxel-level ntenstes are also denoted as drect methods [17]. Drect methods mnmze the photometrc error between correspondng pxels n contrast to feature-based methods, whch mnmze the reprojecton error. The great advantage of ths approach s that there s no pror step of data assocaton: ths s mplctly gven through the geometry of the problem. However, jont optmzaton of dense structure and moton n real-tme s stll an open research problem, as s the optmal and consstent [18, 19] fuson of drect methods wth complementary measurements (e.g., nertal). In terms of effcency, prevous drect methods are computatonally expensve as they requre a sem-dense [16] or dense [15] reconstructon of the envronment, whle the domnant cost of feature-based methods s the extracton of features and descrptors, whch ncurs a hgh constant cost per frame. In ths work, we propose a VO algorthm that combnes the advantages of drect and feature-based methods. We ntroduce the sparse mage algnment algorthm (Sec. V), an effcent drect approach to estmate frame-to-frame moton by mnmzng the photometrc error of features lyng on ntensty corners and edges. The 3D ponts correspondng to features are obtaned by means of robust recursve Bayesan depth estmaton (Sec. VI). Once feature correspondence s establshed, we use bundle adjustment for refnement of the structure and the camera poses to acheve hghest accuracy (Sec. V-B). Consequently, we name the system sem-drect vsual odometry (SVO). Our mplementaton of the proposed approach s exceptonally fast, requrng only 2.5 mllseconds to estmate the pose of a frame on a standard laptop computer, whle achevng comparable accuracy wth respect to the state of the art on benchmark datasets. The mproved effcency s due to three reasons: frstly, SVO extracts features only for selected keyframes n a parallel thread, hence, decoupled from hard

2 2 real-tme constrants. Secondly, the proposed drect trackng algorthm removes the necessty for robust data assocaton. Fnally, contrarly to prevous drect methods, SVO requres only a sparse reconstructon of the envronment. Ths paper extends our prevous work [2], whch was also released as open source software. 1 The novelty of the present work s the generalzaton to wde FoV lenses (Sec. VII), mult-camera systems (Sec. VIII), the ncluson of moton prors (Sec. IX) and the use of edgelet features. Addtonally, we present several new expermental results n Sec. XI wth comparsons aganst prevous works. II. RELATED WORK Methods that smultaneously recover camera pose and scene structure, can be dvded nto two classes: a) Feature-based: The standard approach to solve ths problem s to extract a sparse set of salent mage features (e.g. corners, blobs) n each mage; match them n successve frames usng nvarant feature descrptors; robustly recover both camera moton and structure usng eppolar geometry; and fnally, refne the pose and structure through reprojecton error mnmzaton. The majorty of VO and V-SLAM algorthms [6] follow a varant of ths procedure. A reason for the success of these methods s the avalablty of robust feature detectors and descrptors that allow matchng mages under large llumnaton and vew-pont changes. Feature descrptors can also be used to establsh feature correspondences wth old landmarks when closng loops, whch ncreases both the accuracy of the trajectory after bundle adjustment [7, 21] and the robustness of the overall system due to re-localzaton capabltes. Ths s also where we draw the lne between VO and V-SLAM: Whle VO s only about ncremental estmaton of the camera pose, V-SLAM algorthms, such as [22], detect loop-closures and subsequently refne large parts of the map. The dsadvantage of feature-based approaches s ther low speed due to feature extracton and matchng at every frame, the necessty for robust estmaton technques that deal wth erroneous correspondences (e.g., RANSAC [23], M- estmators [24]), and the fact that most feature detectors are optmzed for speed rather than precson. Furthermore, relyng only on well localzed salent features (e.g., corners), only a small subset of the nformaton n the mage s exploted. In SVO, features are extracted only for selected keyframes, whch reduces the computaton tme sgnfcantly. Once extracted, a drect method s used to track features from frame to frame, resultng n outler-free and sub-pxel precse matches. Apart from well localzed corner features, ths allows trackng and mappng any pxel wth non-zero ntensty gradent. b) Drect methods: Drect methods estmate structure and moton drectly by mnmzng an error measure that s based on the mage s pxel-level ntenstes [17]. The local ntensty gradent magntude and drecton s used n the optmzaton compared to feature-based methods that consder only the dstance to a feature-locaton. Pxel correspondence s gven drectly by the geometry of the problem, elmnatng the need for robust data assocaton technques. However, ths 1 svo makes the approach dependent on a good ntalzaton that must le n the basn of attracton of the cost functon. Usng a drect approach, the sx degree of freedom (DoF) moton of a camera can be recovered by mage-to-model algnment, whch s the process of algnng the observed mage to a vew syntheszed from the estmated 3D map. Early drect VO methods tracked and mapped few sometmes manually selected planar patches [25 29]. By estmatng the surface normals of the patches [3], they could be tracked over a wde range of vewponts. In [31], the local planarty assumpton was relaxed and drect trackng wth respect to arbtrary 3D structures computed from stereo cameras was proposed. For RGB-D cameras, where a dense depth-map for each mage s gven by the sensor, dense mage-tomodel algnment was subsequently ntroduced n [32 34]. In conjuncton wth dense depth regstraton ths has become the standard n camera trackng for RGB-D cameras [35 38]. Wth DTAM [15], a drect method was ntroduced that computes a dense depthmap from a sngle movng camera n real-tme. The camera pose s found through drect whole mage algnment usng the depthmap. However, nferrng a dense depthmaps from monocular mages s computatonally ntensve and s typcally addressed usng GPU parallelsm, such as n the open-source REMODE algorthm [39]. Early on t was realzed that only pxels wth an ntensty gradent provde nformaton for moton estmaton [4]. In ths sprt, a sem-dense approach was proposed n [41] where the depth s only estmated for pxels wth hgh ntensty gradents. In our expermental evaluaton n Sec. XI-A we show that t s possble to reduce the number of tracked pxels even more for frame-to-frame moton estmaton wthout any notceable loss n robustness or accuracy. Therefore, we propose the sparse mage-to-model algnment algorthm that uses only sparse pxels at corners and along mage ntensty gradents. A dsadvantage of drect methods s that jont optmzaton of dense structure and moton n real-tme s stll an open research problem. For ths reason, the standard approach s to estmate the latest camera pose wth respect to a prevously accumulated dense map and subsequently, gven a set of estmated camera poses, update the dense map [15, 42]. Clearly, ths separaton of trackng and mappng only results n optmal accuracy when the output of each stage yelds the optmal estmate. Other algorthms optmze a graph of poses but do not allow a deformaton of the structure once trangulated [16]. Contrarly, some algorthms gnore the camera poses and nstead allow non-rgd deformaton of the 3D structure [36, 38]. The obtaned results are accurate and vsually mpressve, however, a thorough probablstc treatment s mssng when processng measurements, separatng trackng and mappng, or fxatng and removng states. To the best of our knowledge, t s therefore currently not possble to obtan accurate covarance estmates from dense VO. Hence, the consstent fuson [18, 43] wth complementary sensors (e.g., nertal) s currently not possble. In the proposed work, we use drect methods only to establsh feature correspondence. Subsequently, bundle adjustment can be used for jont optmzaton of structure and moton where t s also possble to nclude nertal measurements as we have demonstrated n prevous work [44].

3 3 New Image Last Frame Moton Estmaton Thread Sparse Model-based Image Algnment T CB T k,k 1 T CB Feature Algnment Pose & Structure Refnement Map I c k 1 u 3 I c k u 1 u 2 u 1 u 2 u 3 Frame Queue yes Feature Extracton Mappng Thread Is Keyframe? no Update Depth-Flters ρ 1 ρ 2 Fg. 2: Changng the relatve pose T k,k 1 between the current and the prevous frame mplctly moves the poston of the reprojected ponts n the new mage u. Sparse mage algnment seeks to fnd T k,k 1 that mnmzes the photometrc dfference between mage patches correspondng to the same 3D pont (blue squares). Note, n all fgures, the parameters to optmze are drawn n red and the optmzaton cost s hghlghted n blue. ρ 3 Intalze Depth-Flters Converged? Fg. 1: Trackng and mappng ppelne III. SYSTEM OVERVIEW yes: nsert new Pont Fgure 1 provdes an overvew of the proposed approach. We use two parallel threads (as n [21]), one for estmatng the camera moton, and a second one for mappng as the envronment s beng explored. Ths separaton allows fast and constant-tme trackng n one thread, whle the second thread extends the map, decoupled from hard real-tme constrants. The moton-estmaton thread mplements the proposed sem-drect approach to moton estmaton. Our approach s dvded nto three steps: sparse mage algnment, relaxaton, and refnement (Fg. 1). Sparse mage algnment estmates frame-to-frame moton by mnmzng the ntensty dfference of features that correspond to the projected locaton of the same 3D ponts. A subsequent step relaxes the geometrc constrant to obtan sub-pxel feature correspondence. Ths step ntroduces a reprojecton error, whch we fnally refne by means of bundle adjustment. In the mappng thread, a probablstc depth-flter s ntalzed for each feature for whch the correspondng 3D pont s to be estmated. New depth-flters are ntalzed whenever a new keyframe s selected for corner pxels as well as for pxels along ntensty gradent edges. The flters are ntalzed wth a large uncertanty n depth and undergo a recursve Bayesan update wth every subsequent frame. When a depth flter s uncertanty becomes small enough, a new 3D pont s nserted n the map and s mmedately used for moton estmaton. IV. NOTATION The ntensty mage recorded from a movng camera C at tmestep k s denoted wth I C k : ΩC R 2 R, where Ω C s the mage doman. Any 3D pont ρ R 3 maps to the mage coordnates u R 2 through the camera projecton model: u = π(ρ). Gven the nverse scene depth ρ > at pxel u R C k, the poston of a 3D pont s obtaned usng the back-projecton model ρ = πρ 1 (u). Where we denote wth Ω those pxels for whch the depth s known at tme R C k k n camera C. The projecton models are known from pror calbraton [45]. The poston and orentaton of the world frame W wth respect to the k th camera frame s descrbed by the rgd body transformaton T kw SE(3) [46]. A 3D pont W ρ that s expressed n world coordnates can be transformed to the k th camera frame usng: k ρ = T kw W ρ. V. MOTION ESTIMATION In ths secton, we descrbe the proposed sem-drect approach to moton estmaton, whch assumes that the poston of some 3D ponts correspondng to features n prevous frames are known from pror depth estmaton. A. Sparse Image Algnment Image to model algnment estmates the ncremental camera moton by mnmzng the ntensty dfference (photometrc error) of pxels that observe the same 3D pont. To smplfy a later generalzaton to multple cameras, we ntroduce a body frame B that s rgdly attached to the camera frame C wth known extrnsc calbraton T CB SE(3) (see Fg. 2). Our goal s to estmate the ncremental moton of the. body frame T kk 1 = TB such that the photometrc error kbk 1 s mnmzed: T kk 1 = arg mn T kk 1 u R C k r I C u (T kk 1) 2 Σ I, (1) where the photometrc resdual r I C u s defned by the ntensty dfference of pxels n subsequent mages I C k and IC k 1 that observe the same 3D pont ρ u : r I C u (T kk 1 ). = I C k( π(tcb T kk 1 ρ u ) ) I C k 1( π(tcb ρ u ) ). (2) The 3D pont ρ u (whch s expressed n the reference frame B k 1 ) can be computed for pxels wth known depth by means of back-projecton: ρ u = T BC π 1 ρ (u), u R C k 1, (3)

4 n δu δu (a) Sparse (b) Sem-Dense (c) Dense (a) Edge algnment. (b) Corner algnment. Fg. 3: An mage from the ICL-NUIM dataset (Sec.

Dense approaches (c) use every pxel n the mage, sem-dense (b) use just the pxels wth hgh ntensty gradent, and the proposed sparse approach (a) uses selected pxels at corners or along ntensty gradent

(1) ncludes only a subset of those pxels R C k 1 RC k 1, namely those for whch the back-projected ponts are also vsble n the mage I C k : R C k 1 = { u u R C k 1 π ( T CB T kk 1 T BC π 1 ρ (u) ) Ω C}.

4 4 n δu δu (a) Sparse (b) Sem-Dense (c) Dense (a) Edge algnment. (b) Corner algnment. Fg. 3: An mage from the ICL-NUIM dataset (Sec. XI-B3) wth pxels used for mage-to-model algnment (marked n green for corners and magenta for edgelets) for sparse, sem-dense, and dense methods. Dense approaches (c) use every pxel n the mage, sem-dense (b) use just the pxels wth hgh ntensty gradent, and the proposed sparse approach (a) uses selected pxels at corners or along ntensty gradent edges. However, the optmzaton n Eq. (1) ncludes only a subset of those pxels R C k 1 RC k 1, namely those for whch the back-projected ponts are also vsble n the mage I C k : R C k 1 = { u u R C k 1 π ( T CB T kk 1 T BC π 1 ρ (u) ) Ω C}. Image to model algnment has prevously been used n the lterature to estmate camera moton. Apart from mnor varatons n the formulaton, the man dfference among the approaches s the source of the depth nformaton as well as the regon R C k 1 n mage IC k for whch the depth s known. As dscussed n Secton II, we denote methods that know and explot the depth for all pxels n the reference vew as dense methods [15]. Converseley, approaches that only perform the algnment for pxels wth hgh mage gradents are denoted sem-dense [41]. In ths paper, we propose a novel sparse mage algnment approach that assumes known depth only for corners and features lyng on ntensty edges. Fg. 3 summarzes our notaton of dense, sem-dense, and sparse approaches. To make the sparse approach more robust, we propose to aggregate the photometrc cost n a small patch centered at the feature pxel. Snce the depth for neghborng pxels s unknown, we approxmate t wth the same depth that was estmated for the feature. To summarze, sparse mage algnment solves the non-lnear least squares problem n Eq. (1) wth R C k 1 correspondng to small patches centered at corner and edgelet features wth known depth. Ths optmzaton can be solved effcently usng standard teratve non-lnear least squares algorthms such as Levenberg-Marquardt. More detals on the optmzaton, ncludng the analytc Jacobans, are provded n the Appendx. B. Relaxaton and Refnement Sparse mage algnment s an effcent method to estmate the ncremental moton between subsequent frames. However, to mnmze drft n the moton estmate, t s paramount to regster a new frame to the oldest frame possble. One approach s to use an older frame as reference for mage algnment [16]. However, the robustness of the algnment cannot be guaranteed as the dstance between the frames n the algnment ncreases (see experment n Secton XI-A). We therefore propose to relax the geometrc constrants gven by the reprojecton of 3D ponts and to perform an ndvdual 2D algnment of correspondng feature patches. The algnment Fg. 4: Dfferent algnment strateges for corners and edgelets. The algnment of an edge feature s restrcted to the normal drecton n of the edge. of each patch n the new frame s performed wth respect to a reference patch from the frame where the feature was frst extracted; hence, the oldest frame possble, whch should maxmally mnmze feature drft. However, the 2D algnment generates a reprojecton error that s the dfference between the projected 3D pont and the algned feature poston. Therefore, n a fnal step, we perform bundle adjustment to optmze both the 3D pont s poston and the camera poses such that ths reprojecton error s mnmzed. In the followng, we detal our approach to feature algnment and bundle adjustment. Thereby, we take specal care of features lyng on ntensty gradent edges. 2D feature algnment mnmzes the ntensty dfference of a small mage patch P that s centered at the projected feature poston u n the newest frame k wth respect to a reference patch from the frame r where the feature was frst observed (see Fg. 4). To mprove the accuracy of the algnment, we apply an affne warpng A to the reference patch, whch s computed from the estmated relatve pose T kr between the reference frame and the current frame [21]. For corner features, the optmzaton computes a correcton δu R 2 to the predcted feature poston u that mnmzes the photometrc cost: u = u + δu, wth u = π ( T CB T kr T BC πρ 1 (u) ) (4) δu 1 = arg mn I C k (u +δu+ u) I C δu 2 r(u + A u) 2, u P where u s the terator varable that s used to compute the sum over the patch P. Ths algnment s solved usng the nverse compostonal Lucas-Kanade algorthm [47]. For features lyng on ntensty gradent edges, 2D feature algnment s problematc because of the aperture problem features may drft along the edge. Therefore, we lmt the degrees of freedom n the algnment to the normal drecton to the edge. Ths s llustrated n Fg. 4a, where a warped reference feature patch s schematcally drawn at the predcted poston n the newest mage. For features on edges, we therefore optmze for a scalar correcton δu R n the drecton of the edge normal n to obtan the correspondng feature poston u n the newest frame: u = u + δu n, wth (5) δu 1 =arg mn I C k (u +δu n+ u) I C δu 2 r(u + A u) 2. u P Ths s smlar to prevous work on VO wth edgelets, where feature correspondence s found by samplng along the normal drecton for abrupt ntensty changes [14, 48 52]. However,

5 5 n our case, sparse mage algnment provdes a very good ntalzaton of the feature poston, whch drectly allows us to follow the ntensty gradent n an optmzaton. After feature algnment, we have establshed feature correspondence wth subpxel accuracy. However, feature algnment volated the eppolar constrants and ntroduced a reprojecton error δu, whch s typcally well below.5 pxels. Therefore, n the last step of moton estmaton, we refne the camera poses and landmark postons X = {T kw, ρ } by mnmzng the squared sum of reprojecton errors: X = arg mn X k K L C k + k K L E k 1 2 u π ( T CB T kw ρ ) 2 1 ( 2 nt u π ( )) T CB T kw ρ 2 where K s the set of all keyframes n the map, L C k the set of all landmarks correspondng to corner features, and L E k the set of all edge features that were observed n the k th camera frame. The reprojecton error of edge features s projected along the edge normal because the component along the edge cannot be determned. The optmzaton problem n Eq. (6) s a standard bundle adjustment problem that can be solved n real-tme usng SAM2 [9]. In [44] we further show how the objectve functon can be extended to nclude nertal measurements. Whle optmzaton over the whole trajectory n Eq. (6) results n the most accurate results (see Sec. XI-B), we found that for many applcatons (e.g. for state estmaton of mcro aeral vehcles [2, 53]) t suffces to only optmze the latest camera pose and the 3D ponts separately. VI. MAPPING (6) In the prevous secton, we assumed that the depth at sparse feature locatons n the mage s known. In ths secton, we descrbe how the mappng thread estmates ths depth for newly detected features. Therefore, we assume that the camera poses are known from the moton estmaton thread. The depth at a sngle pxel s estmated from multple observatons by means of a recursve Bayesan depth flter. New depth flters are ntalzed at ntensty corners and along gradent edges when the number of tracked features falls below some threshold and, therefore, a keyframe s selected. Every depth flter s assocated to a reference keyframe r, where the ntal depth uncertanty s ntalzed wth a large value. For a set of prevous keyframes 2 as well as every subsequent frame wth known relatve pose {I k, T kr }, we search for a patch along the eppolar lne that has the hghest correlaton (see Fg. 5). Therefore, we move the reference patch along the eppolar lne and compute the zero mean sum of squared dfferences. From the pxel wth maxmum correlaton, we trangulate the depth measurement ρ k, whch s used to update 2 In the prevous publcaton of SVO [2] and n the open source mplementaton we suggested to update the depth flter only wth newer frames k > r, whch works well for down-lookng cameras n mcro aeral vehcle applcatons. However, for forward motons, t s benefcal to update the depth flters also wth prevous frames k < r, whch ncreases the performance wth forward-facng cameras. I r u ρ mn ˆρ T r,k I k ρ k ρ max Fg. 5: Probablstc depth estmate ˆρ for feature n the reference frame r. The pont at the true depth projects to smlar mage regons n both mages (blue squares). Thus, the depth estmate s updated wth the trangulated depth ρ k computed from the pont u of hghest correlaton wth the reference patch. The pont of hghest correlaton les always on the eppolar lne n the new mage. the depth flter. If enough measurements were obtaned such that uncertanty n the depth s below a certan threshold, we ntalze a new 3D pont at the estmated depth, whch subsequently can be used for moton estmaton (see system overvew n Fg. 1). Ths approach for depth estmaton also works for features on gradent edges. Due to the aperture problem, we however skp measurements where the edge s parallel to the eppolar lne. Ideally, we would lke to model the depth wth a nonparametrc dstrbuton to deal wth multple depth hypotheses (top rows n Fg. 6). However, ths s computatonally too expensve. Therefore, we model the depth flter accordng to [54] wth a two dmensonal dstrbuton: the frst dmenson s the nverse depth ρ [55], whle the second dmenson γ s the nler probablty (see bottom rows n Fg. 6). Hence, a measurement ρ k s modeled wth a Gaussan + Unform mxture model dstrbuton: an nler measurement s normally dstrbuted around the true nverse depth ρ whle an outler measurement arses from a unform dstrbuton n the nterval [ρ mn, ρ max ]: p( ρ k ρ, γ ) = γ N ( ρ k ρ, τ 2 ) +(1 γ )U ( ρ k u ρ mn, ρ max ), (7) where τ 2 the varance of a good measurement that can be computed geometrcally by assumng a dsparty varance of one pxel n the mage plane [39]. Assumng ndependent observatons, the Bayesan estmaton for ρ on the bass of the measurements ρ r+1,..., ρ k s gven by the posteror p(ρ, γ ρ r+1,..., ρ k ) p(ρ, γ) k p( ρ k ρ, γ), (8) wth p(ρ, γ) beng a pror on the true nverse depth and the rato of good measurements supportng t. For ncremental computaton of the posteror, the authors of [54] show that (8) can be approxmated by the product of a Gaussan dstrbuton for the depth and a Beta dstrbuton for the nler rato: q(ρ, γ a k, b k, µ k, σ 2 k) = Beta(γ a k, b k )N (ρ µ k, σ 2 k), (9) where a k and b k are the parameters controllng the Beta dstrbuton. The choce s motvated by the fact that the Beta Gaussan s the approxmatng dstrbuton mn-

6 Reference patch Eppolar lne Correct match Outler match 1 γ.5 1 γ.5 5 1 15 2 25 3 35 4 (a) After 3 measurements wth 7% nler probablty.

The hstogram n the top rows show the measurements affected by outlers. The dstrbuton n the mddle rows show the posteror dstrbuton when modelng the depth wth a sngle varate Gaussan dstrbuton.

6 6 Reference patch Eppolar lne Correct match Outler match 1 γ.5 1 γ (a) After 3 measurements wth 7% nler probablty. Inverse Depth Inverse Depth (b) After 3 measurements wth 7% nler probablty. Fg. 6: Illustraton of posteror dstrbutons for depth estmaton. The hstogram n the top rows show the measurements affected by outlers. The dstrbuton n the mddle rows show the posteror dstrbuton when modelng the depth wth a sngle varate Gaussan dstrbuton. The bottom rows show the posteror dstrbuton of the proposed approach that s usng the model from [54]. The dstrbuton s b-varate and models the nler probablty (vertcal axs) together wth the nverse depth (horzontal axs). mzng the Kullback-Lebler dvergence from the true posteror (8). Upon the k-th observaton, the update takes the form p(ρ, γ ρ r+1,..., ρ k ) q(d, γ a k 1, b k 1, µ k 1, σ 2 k 1) p( ρ k d, γ) const, (1) and the authors of [54] approxmated the true posteror (1) wth a Beta Gaussan dstrbuton by matchng the frst and second order moments for ˆd and γ. The updates formulas for a k, b k, µ k and σk 2 are thus derved and we refer to the orgnal work n [54] for the detals on the dervaton. Fg. 6 shows a small smulaton experment that hghlghts the advantage of the model proposed n [54]. The hstogram n the top rows show the measurements that are corrupted by 3% outler measurements. The dstrbuton n the mddle rows show the posteror dstrbuton when modelng the depth wth a sngle varate Gaussan dstrbuton as used for nstance n [41]. Outler measurements have a huge nfluence on the mean of the estmate. The fgures n the bottom rows show the posteror dstrbuton of the proposed approach that s usng the model from [54] wth the nler probablty drawn n the vertcal axs. As more measurements are receved at the same depth, the nler probablty ncreases. In ths model, the mean of the estmate s less affected by outlers whle the nler probablty s nformatve about the confdence of the estmate. Fg. 7 shows qualtatvely the mportance of robust Fg. 7: Illustraton of the eppolar search to estmate the depth of the pxel n the center of the reference patch n the left mage. Gven the extrnsc and ntrnsc calbraton of the two mages, the eppolar lne that corresponds to the reference pxel s computed. Due to self-smlar texture, erroneous matches along the eppolar lne are frequent. depth estmaton n self-smlar envronments, where outler matches are frequent. In [39] we demonstrate how the same depth flter can be used for dense mappng. VII. LARGE FIELD OF VIEW CAMERAS To model large optcal dstorton, such as fsheye and catadoptrc (see Fg. 8), we use the camera model proposed n [56], whch models the projecton π( ) and unprojecton π 1 ( ) functons wth polynomals. Usng the Jacobans of the camera dstorton n the sparse mage algnment and bundle adjustment step s suffcent to enable moton estmaton for large FoV cameras. For estmatng the depth of new features (c.f., Sec. VI), we need to sample pxels along the eppolar lne. For dstorted mages, the eppolar lne s curved (see Fg. 7). Therefore, we regularly sample the great crcle, whch s the ntersecton of the eppolar plane wth the unt sphere centered at the camera pose of nterest. The angular resoluton of the samplng corresponds approxmately to one pxel n the mage plane. For each sample, we apply the camera projecton model π( ) to obtan the correspondng pxel coordnate on the curved eppolar lne. VIII. MULTI-CAMERA SYSTEMS The proposed sem-drect camera moton estmaton starts drectly wth an optmzaton of the relatve pose T kk 1. Snce n Sec. V-A we already ntroduced a body frame B that s rgdly attached to the camera, t s now straghtforward to generalze sparse mage algnment to multple rgdly attached and synchronzed cameras. Let us assume that gven a camera rg wth M cameras (see Fg. 9). The extrnsc calbraton of the ndvdual cameras c C wth respect to the body frame T CB s assumed to be known from pror extrnsc calbraton 3. To use multple cameras, we only need to add an extra summaton n the cost functon of Eq. (1) to use the nformaton from all mages for sparse mage algnment: T kk 1 = arg mn T kk 1 C C u R C k r I C u (T kk 1) 2 Σ I. (11) The same summaton s necessary n the bundle adjustment step to sum the reprojecton errors from all cameras. The remanng steps of feature algnment and mappng are ndependent of how many cameras are used. To summarze, the 3 We use the calbraton toolbox Kalbr [45], whch s avalable at https: //gthub.com/ethz-asl/kalbr

7 Tkk 1 (a) Perspectve (b) Fsheye Body Frame TBC1 TBC2 Fg. 9: Vsual odometry wth multple rgdly attached and synchronzed cameras.

8: Dfferent optcal dstorton models that are supported by SVO.

M OTION P RIORS In feature-poor envronments, durng rapd motons, or n case of dynamc obstacles t can be very helpful to employ a moton pror.

7 7 Tkk 1 (a) Perspectve (b) Fsheye Body Frame TBC1 TBC2 Fg. 9: Vsual odometry wth multple rgdly attached and synchronzed cameras. We know the relatve pose of each camera to the body frame TBCj from extrnsc calbraton and the goal s to estmate the relatve moton of the body frame Tkk 1. (c) Catadoptrc Fg. 8: Dfferent optcal dstorton models that are supported by SVO. only modfcaton to enable the use of multple cameras s to refer the optmzatons to a central body frame, whch requres us to nclude the extrnsc calbraton TCB n the Jacobans as shown n the Appendx. IX. M OTION P RIORS In feature-poor envronments, durng rapd motons, or n case of dynamc obstacles t can be very helpful to employ a moton pror. A moton pror s an addtonal term that s added to the cost functon n Eq. (11), whch penalzes motons that are not n agreement wth the pror estmate. Thereby, jumps n the moton estmate due to unconstraned degrees of freedom or outlers can be suppressed. In a car scenaro for nstance, a constant velocty moton model may be assumed as the nerta of the car prohbts sudden changes from one frame to the next. Other prors may come from addtonal sensors such as gyroscopes, whch allow us to measure the ncremental rotaton between two frames. Let us assume that we are gven a relatve translaton pror p kk 1 (e.g., from a constant velocty assumpton) and a relatve rotaton pror R kk 1 (e.g., from ntegratng a gyroscope). In ths case, we can employ a moton pror by addng addtonal terms to the cost of the sparse mage algnment step: T?kk 1 = arg mn Tkk 1 X X C C u R Ck 1 1 kr C (Tkk 1 )k2σi 2 Iu (12) 1 + kpkk 1 p kk 1 k2σp k log(r T kk 1 Rkk 1 ) kσr, 2 where the covarances Σp, ΣR are set accordng to the uncer. tanty of the moton pror and the varables (pkk 1, Rkk 1 ) = Tkk 1 are the current estmate of the relatve poston and orentaton (expressed n body coordnates B). The logarthm map maps a rotaton matrx to ts rotaton vector (see Eq. (18)). Note that the same cost functon can be added to the bundle adjustment step. For further detals on solvng Eq. (12), we refer the nterested reader to the Appendx. X. I MPLEMENTATION D ETAILS In ths secton we provde addtonal detals on varous aspects of our mplementaton. A. Intalzaton The algorthm s bootstrapped to obtan the pose of the frst two keyframes and the ntal map usng the 5-pont relatve pose algorthm from [57]. B. Sparse Image Algnment For sparse mage algnment, we use a patch sze of 4 4 pxels. In the expermental secton we demonstrate that the sparse approach wth such a small patch sze acheves comparable performance to sem-dense and dense methods n terms of robustness when the nter-frame dstance s small, whch typcally s true for frame-to-frame moton estmaton. In order to cope wth large motons, we apply the sparse mage algnment algorthm n a coarse-to-fne scheme. Therefore, the mage s halfsampled to create an mage pyramd of fve levels. The photometrc cost s then optmzed at the coarsest level untl convergence, startng from the ntal condton Tkk 1 = I4 4. Subsequently, the optmzaton s contnued at the next fner level to mprove the precson of the result. To save processng tme, we stop after convergence on the thrd level, at whch stage the estmate s accurate enough to ntalze feature algnment. To ncrease the robustness aganst dynamc obstacles, occlusons and reflectons, we addtonally employ a robust cost functon [24, 34]. C. Feature Algnment For feature algnment we use a patch-sze of 8 8 pxels. Snce the reference patch may be multple frames old, we employ an affne llumnaton model to cope wth llumnaton changes [58]. For all experments we lmt the number of matched features to 18 n order to guarantee a constant cost per frame. D. Mappng In the mappng thread, we dvde the mage n cells of fxed sze (e.g., pxels). For every keyframe a new depthflter s ntalzed at the FAST corner [59] wth hghest score n the cell unless there s already a 2D-to-3D correspondence present. In cells where no corner s found, we detect the pxel wth hghest gradent magntude and ntalze an edge feature. Ths results n evenly dstrbuted features n the mage. To speed up the depth-estmaton we only sample a short range along the eppolar lne; n our case, the range corresponds to twce the standard devaton of the current depth estmate. We use a 8 8 pxel patch sze for the eppolar search.

Dense approaches use every pxel n the mage, sem-dense use just the pxels wth hgh ntensty gradent, and the proposed sparse approach uses selected pxels at corners or along ntensty gradent edges.

8 (c) Sparse (b) Depth of the scene (d) Sem-Dense (e) Dense Fg. 1: An mage from the Urban Canyon dataset [6] (Sec. XI-A) wth pxels used for mage-to-model algnment (marked n green) for sparse, semdense, and dense methods. Dense approaches use every pxel n the mage, sem-dense use just the pxels wth hgh ntensty gradent, and the proposed sparse approach uses selected pxels at corners or along ntensty gradent edges. converged poses [%] (a) Synthetc scene converged poses [%] converged poses [%] 8 4 The Urban Canyon dataset [6] s avalable at converged poses [%] In ths secton we evaluate the robustness of the proposed sparse mage algnment algorthm (Sec. V-A) and compare ts performance to sem-dense and dense mage algnment alternatves. Addtonally, we nvestgate the nfluence of the patch-sze that s used for the sparse approach. The experment s based on a synthetc dataset wth known camera moton, depth and calbraton [6].4 The camera performs a forward moton through an urban canyon as the excerpt of the dataset n Fg. 1a shows. The dataset conssts of 25 frames wth.2 meters dstance between frames and a medan scene depth of 12.4 meters. For the experment, we select a reference mage Ir wth known depth (see Fg. 1b) and estmate the relatve pose Trk of 6 subsequent mages k {r + 1,..., r + 6} along the trajectory by means of mage to model algnment. For each mage par {Ir, Ik }, the algnment s repeated 8 tmes wth ntal perturbaton that s sampled unformly wthn a 2 m range around the true value. We perform the experment at 18 reference frames along the trajectory. The algnment s consdered converged when the estmated relatve pose s closer than.1 meters from the ground-truth. The goal of ths experment s to study the magntude of the perturbaton from whch mage to model sparse (2x2) sparse (3x3) sparse (4x4) sparse (5x5) converged poses [%] A. Image Algnment: From Sparse to Dense sparse (1x1) 8 1 converged poses [%] We mplemented the proposed VO system n C++ and tested ts performance n terms of accuracy, robustness, and computatonal effcency. We frst compare the proposed sparse mage algnment algorthm aganst sem-dense and dense mage algnment algorthms and nvestgate the nfluence of the patch sze used n the sparse approach. Fnally, n Sec. XI-B we compare the full ppelne n dfferent confguratons aganst the state of the art on nne dfferent dataset sequences. converged poses [%] XI. E XPERIMENTAL E VALUATION sem-d frames dstance 45 6 dense frames dstance 45 6 Fg. 11: Convergence probablty of the model-based mage algnment algorthm as a functon of the dstance to the reference mage and evaluated for sparse mage algnment wth patch szes rangng from 1 1 to 5 5 pxels, sem-dense, and dense mage algnment. The colored regon hghlghts the 68% confdence nterval.

9 9 algnment s capable to converge as a functon of the dstance to the reference mage. The performance n ths experment s a measure of robustness: successful pose estmaton from large ntal perturbatons shows that the algorthm s capable of dealng wth rapd camera motons. Furthermore, large dstances between the reference mage I r and test mage I k smulates the performance at low camera frame-rates. For the sparse mage algnment algorthm, we extract 1 FAST corners n the reference mage (see Fg. 1c) and ntalze the correspondng 3D ponts usng the known depthmap from the renderng process. We repeat the experment wth patch-szes rangng from 1 1 pxels to 5 5 pxels. We evaluate the sem-drect approach (as proposed n the LSD framework [41]) by usng pxels along ntensty gradents (see Fg. 1d). Fnally, we perform the experment usng all pxels n the reference mage as proposed n DTAM [15]. The results of the experment are shown n Fg. 11. Each plot shows a varant of the mage algnment algorthm wth the vertcal axs ndcatng the percentage of converged trals and the horzontal axs the frame ndex counted from the reference frame. We can observe that the dfference between sem-dense mage algnment and dense mage algnment s margnal. Ths s because pxels that exhbt no ntensty gradent are not nformatve for the optmzaton as ther Jacobans are zero [4]. We suspect that usng all pxels becomes only useful when consderng moton blur and mage defocus, whch s out of the scope of ths evaluaton. In terms of sparse mage algnment, we observe a gradual mprovement when ncreasng the patch sze to 4 4 pxels. A further ncrease of the patch sze does not show mproved convergence and wll eventually suffer from the approxmatons adopted by not warpng the patches accordng to the surface orentaton. Compared to the sem-dense approach, the sparse approaches do not reach the same convergence radus, partcularly n terms of dstance to the reference mage. For ths reason SVO uses sparse mage algnment only to algn wth respect to the prevous mage (.e., k = r + 1), n contrast to LSD [41] whch algns wth respect to the last keyframe. In terms of computatonal effcency, we note that the complexty scales lnearly wth the number of pxels used n the optmzaton. The plots show that we can trade-off usng a hgh frame rate camera and a sparse approach wth a lower frame-rate camera and a sem-dense approach. The evaluaton of ths trade-off would deally ncorporate the power consumpton of both the camera and processors, whch s out of the scope of ths evaluaton. B. Real and Synthetc Experments In ths secton, we compare the proposed algorthm aganst the state of the art on real and synthetc datasets. Therefore, we present results of the proposed ppelne on the EUROC benchmark [61], the TUM RGB-D benchmark dataset [62], the synthetc ICL-NUIM dataset [37], and our own dataset that compares dfferent feld of vew cameras. A selecton of these experments, among others (e.g., from the KITTI benchmark), can also be vewed n the vdeo attachment of ths paper. 1) Euroc Datasets: The EUROC dataset [61] conssts of stereo mages and nertal data that was recorded usng a VI- Sensor [63] that was mounted on a mcro aeral vehcle and flown nsde a machne hall. Extracts from the dataset are shown n Fg. 12a and 12b. The dataset provdes a precse ground-truth trajectory that was obtaned usng a Leca MS5 laser trackng system. In Table I we present results of varous monocular and stereo confguratons of the proposed algorthm on the frst two trajectores of the dataset. The trajectores are 65 and 58 meters long respectvely. We compare our algorthm aganst the open-source versons of ORB-SLAM [22] and LSD-SLAM [16] on these datasets wth ther default parameter settngs. In contrast to SVO, ORB- SLAM and LSD-SLAM detect loop-closures and subsequently perform a global optmzaton. Hence, durng loop-closure refnements, the latest camera pose may undergo sgnfcant jumps. For ths reason, we base the evaluaton on the fnal poses of all keyframes n the trajectory. We observed that the ntal poses of LSD-SLAM are not accurate due to the ntalzaton procedure and therefore dscard the frst 35 frames n the evaluaton for LSD-SLAM. To understand the nfluence of the proposed extensons of SVO, we run the algorthm n varous confguratons. SVO Mono (only corners) uses only the mages that were recorded wth the left camera of the sensor. SVO Mono + Pror ndcates that we use measurements from the gyroscope as prors n the mage algnment step as we dscussed n Sec. IX. In the next settng we addtonally use edgelet features n combnaton wth corner features. In these frst three settngs, we only optmze the latest pose; conversely, the keyword Bundle Adjustment ndcates that results were obtaned by optmzng the whole hstory of keyframes by means of the ncremental smoothng algorthm SAM2 [9]. Therefore, we nsert and optmze every new keyframe n the SAM2 graph when a new keyframe s selected. In ths settng, we do nether use moton prors nor edgelets. Snce SVO s a vsual odometry, t does not not detect loop-closures and only mantans a small local map of the last keyframes. To provde a far comparson wth ORB-SLAM and LSD-SLAM, we deactvated ther capablty to detect large loop closures va mage retreval. Addtonally, we provde results usng both mage streams of the stereo camera. Therefore, we apply the approach ntroduced n Sec. VIII to estmate the moton of a mult-camera system. To obtan a measure of accuracy of the dfferent approaches, we algn the fnal trajectory of keyframes wth the ground-truth trajectory usng the least-squares approach proposed n [64]. Snce scale cannot be recovered usng a sngle camera, we also rescale the estmated trajectory to best ft wth the ground-truth trajectory. Subsequently, we compute the Eucldean dstance between the estmated and ground-truth keyframe poses and compute the mean, medan, and Root Mean Square Error (RMSE) n meters. We chose the absolute trajectory error measure nstead of relatve drft metrcs [62] because the fnal trajectory n ORB-SLAM conssts only of a sparse set of keyframes, whch makes drft measures on relatvely short trajectores less expressve. The reported results are averaged over three runs.

1 (a) (b) (c) Machne Hall 1 (d) Machne Hall 2 Fg. 12: Fgures (a) and (b) show excerpts of the EuRoC dataset [61] wth tracked corners marked n green and edgelets marked n magenta.

CPU@2 fps SVO Mono.224.198.269.531.356.652 2.53.42 55 ±1 % SVO Mono + Pror.199.131.27.345.314.48 2.32.4 7 ± 8 % SVO Mono + Pror + Edgelet.171.149.21.368.318.425 2.51.

10 1 (a) (b) (c) Machne Hall 1 (d) Machne Hall 2 Fg. 12: Fgures (a) and (b) show excerpts of the EuRoC dataset [61] wth tracked corners marked n green and edgelets marked n magenta. Fgures (c) and (d) show the reconstructed trajectory and pontcloud on the frst two trajectores of the dataset. Machne Hall 1 Machne Hall 2 Tmng Mean Medan RMSE [m] Mean Medan RMSE [m] Mean [ms] St.D. CPU@2 fps SVO Mono ±1 % SVO Mono + Pror ± 8 % SVO Mono + Pror + Edgelet ± 7 % SVO Mono + Bundle Adjustment ±13 % SVO Stereo ± 6 % SVO Stereo + Pror ± 7 % SVO Stereo + Pror + Edgelet ± 7 % SVO Stereo + Bundle Adjustment ±13 % ORB Mono SLAM (No loop closure) ±32 % LSD Mono SLAM (No loop closure) ±37 % TABLE I: Absolute translaton errors n meters after 6 DoF algnment wth the ground-truth trajectory and tmng measurements on laptop computer wth Intel Core 7-481MQ CPU (2.8 GHz) averaged over three runs of the EUROC Machne Hall 1 dataset. Loop closure detecton and optmzaton was deactved for ORB and LSD SLAM to allow a far comparson wth SVO. The frst and second column report mean and standard devtaton of the processng tme. Snce all algorthms use mult-threadng, the thrd column reports the average CPU load when provdng new mages at a constant rate of 2 Hz. Thread Intel 7 [ms] Jetson TX1 [ms] Sparse mage algnment Feature algnment Optmze pose & landmarks Extract features Update depth flters TABLE II: Mean tme consumpton n mllseconds by ndvdual components of SVO Mono on the EUROC Machne Hall 1 dataset. We report tmng results on a laptop wth Intel Core 7 (2.8 GHz) processor and on the NVIDIA Jetson TX1 ARM processor. The results show that usng a stereo camera n general results n much hgher accuracy. Apart from the addtonal vsual measurements, the man reason for the mproved results s that the stereo system does not drft n scale and nter camera trangulatons allow to quckly ntalze new 3D landmarks n case of on-spot rotatons. Notce that SVO wth bundle adjustment s twce as accurate as ORB and LSD SLAM. However, the power of the less accurate confguratons of SVO becomes obvous when analyzng the tmng and processor usage of the approaches, whch are reported n the rght columns of Table I. In the table, we report the mean tme to process a sngle frame n mllseconds and the standard devaton over all measurements. Snce all algorthms make use of mult-threadng and the tme to process a sngle frame may therefore be msleadng, we addtonally report the CPU usage (contnuously sampled durng executon) when provdng new mages at a constant rate of 2 Hz to the algorthm. All measurements are averaged over 3 runs of the frst EUROC dataset and computed on the same laptop computer (Intel Core 7-276QM CPU). In Table II, we further report the average tme consumpton of ndvdual components of SVO on the laptop computer and an NVIDIA TX1 ARM processor, whch s popular n moble robotcs applcatons. The results show that the SVO approach s up to ten tmes faster than ORB- SLAM and LSD-SLAM and requres only a fourth of the CPU usage. The reason for ths sgnfcant dfference s that SVO does not extract features and descrptors n every frame, as n ORB-SLAM, but does so only for keyframes n the concurrent mappng thread. Addtonally, ORB-SLAM beng a SLAM approach spends most of the processng tme n fndng matches to the map (see Table I n [65]), whch n theory results n a pose-estmate wthout drft n an already mapped area. Contrarly, n the frst three confguratons of SVO, we estmate only the pose of the latest camera frame wth respect to the local map. Compared to LSD-SLAM, SVO s faster because t operates on sgnfcantly less numbers of pxels, hence, also does not result n a sem-dense reconstructon of the envronment. Ths, however, could be acheved n a parallel process as we have shown n [39, 53, 66]. We further remark that the bundle adjustment verson has a sgnfcantly hgher standard devaton n the tmng as we update SAM2 at every keyframe, whch takes approxmately 1 mllseconds longer than processng a regular frame. Usng a moton pror further helps to mprove the effcency as the sparse-magealgnment optmzaton can be ntalzed closer to the soluton, and therefore needs less teratons to converge. An edgelet provdes only a one-dmensonal constrant n the mage doman, whle a corner provdes a two-dmensonal

11 (a) fr2 desk (b) fr2 xyz (a) Fg. 13: Impressons from the TUM RGB-D benchmark dataset [62] wth tracked corners n green and edgelets n magenta. (a) fr2 desk (sde) (b) fr2 desk (top) (b) Fg.

14: Estmated trajectory and pontcloud of the TUM fr2 desk dataset.

9.7 6.7 1.1.8 4.5.9 / 13.5 1.5.3.2 / 24.3 3.8 1.8 9.5 1.2 2.6 TABLE III: Results on the TUM RGB-D benchmark dataset [62].

Algorthms marked wth F use a depth-sensor, and ndcates loop-closure detecton. The symbol ndcates that trackng the whole trajectory dd not succeed. constrant.

11 11 (a) fr2 desk (b) fr2 xyz (a) Fg. 13: Impressons from the TUM RGB-D benchmark dataset [62] wth tracked corners n green and edgelets n magenta. (a) fr2 desk (sde) (b) fr2 desk (top) (b) Fg. 15: Impressons from the synthetc ICL-NUIM dataset [37] wth tracked corners marked n green and edgelets n magenta. (a) Lvng Room (b) Lvng Room 1 (c) Lvng Room 2 (d) Lvng Room 3 Fg. 14: Estmated trajectory and pontcloud of the TUM fr2 desk dataset. SVO Mono (wth edgelets) SVO Mono + Bundle Adjustment LSD-SLAM [16] ORB-SLAM [22] PTAM [21] Sem-Dense VO [41] Drect RGB-D VO [34] F Feature-based RGB-D SLAM [67] F fr2 desk RMSE [cm] fr2 xyz RMSE [cm] / / TABLE III: Results on the TUM RGB-D benchmark dataset [62]. Results for [16, 34, 41, 67] were obtaned from [16] and for PTAM we report two results that were publshed n [22] and [41] respectvely. Algorthms marked wth F use a depth-sensor, and ndcates loop-closure detecton. The symbol ndcates that trackng the whole trajectory dd not succeed. constrant. Therefore, whenever suffcent corners can be detected, the SVO algorthm prortzes the corners. Snce the envronment n the EUROC dataset s well textured and provdes many corners, the use of edgelets does not mprove the accuracy. However, the edgelets brng a beneft n terms of robustness when the texture s such that no corners are present. 2) TUM Datasets: A common dataset to evaluate vsual odometry algorthms s the TUM Munch RGB-D benchmark [62]. The dataset was recorded wth a Mcrosoft Knect RGBD camera, whch provdes mages of worse qualty (e.g. rollng shutter, moton blur) than the VI-Sensor EUROC dataset. Fg. 13 shows excerpts from the fr2_desk and fr2_xyz datasets whch have a trajectory length of 18.8 m and 7 m respectvely. Groundtruth s provded by a moton capture system. Table III shows the results of the proposed algorthm (averaged over three runs) and comparsons aganst related works. The resultng trajectory and the recovered landmarks are shown n Fg. 14. The results from related works Fg. 16: Results on the ICL-NUIM [37] nosy synthetc lvng room dataset. SVO Mono (wth edgelets) LSD SLAM ORB SLAM [22] DVO [34] F FOVIS [68] F ICP [69] F ICP+RGB-D [7] F lr kt RMSE lr kt1 RMSE lr kt2 RMSE lr kt3 RMSE TABLE IV: Results on the ICL-NUIM Dataset [37]. Algorthms marked wth F use a depth-sensor, and ndcates loop-closure detecton. The symbol ndcates that trackng the whole trajectory dd not succeed. were obtaned from the evaluaton n [22] and [16]. We argue that the better performance of ORB-SLAM and LSD-SLAM s due to the capablty to detect loop-closures. 3) ICL-NUIM Datasets: The ICL-NUIM dataset [37] s a synthetc dataset that ams to benchmark RGB-D, vsual odometry and SLAM algorthms. The dataset conssts of four trajectores of length 6.4 m, 1.9 m, 7.3 m, and 11.1 m. The syntheszed mages are corrupted by nose to smulate real

12 camera mages. Ground-truth and calbraton are provded by the dataset. Most reported results on ths dataset use the synthetszed measurements from the depth sensor together wth the rendered mages.

Table IV reports the results of the proposed algorthm (averaged over three runs) and the results from other algorthms on the lvng room sequence.

12 12 camera mages. Ground-truth and calbraton are provded by the dataset. Most reported results on ths dataset use the synthetszed measurements from the depth sensor together wth the rendered mages. Indeed, the datsets are very challengng for purely vson-based odometry due to dffcult texture and frequent on-spot rotatons as can be seen n the excerpts from the dataset n Fg. 15. Table IV reports the results of the proposed algorthm (averaged over three runs) and the results from other algorthms on the lvng room sequence. Smlar to the prevous datasets, we report the root mean square error after rotaton, translaton and scale algnment wth the ground-truth trajectory. Fg. 16 shows the reconstructed maps and recovered trajectores. The maps are very nosy due to the fne graned texture of the scene. We also run ORB-SLAM and LSD-SLAM on the dataset. ORB-SLAM fals to ntalze on the second sequence and we dd not manage to obtan any results from LSD- SLAM on all datasets. The reason for the falures s lack of texture for ntalzaton and frequent on-spot rotatons. For SVO, ths also requred us to set a partcularly low FAST corner detecton threshold on ths dataset (to 5 nstead of 2). A lower threshold results n detecton of many low-qualty features. However, features are only used n SVO once ther correspondng scene depth s sucessfully estmated by means of the robust depth flter descrbed n Sec. VI. Hence, the process of depth estmaton helps to dentfy the stable features wth low score that can be relably used for moton estmaton. In ths dataset, we were not able to refne the results of SVO wth bundle adjustment. The reason s that the SAM2 backend s based on Gauss Newton whch s very senstve to underconstraned varables that render the lnearzed problem ndetermnant. The frequent on-spot rotatons and very low parallax angle trangulatons result n many underconstraned varables. Usng an optmzer that s based on Levenberg Marquardt or addng addtonal nertal measurements [44] would help n such cases. For comparson, Table IV also shows the results reported n [37] of algorthms that also use the depth measurements. 4) Crcle Dataset: In the last experment, we want to demonstrate the usefulness of wde feld of vew lenses for VO. We recorded the dataset wth a mcro aeral vehcle that we flew n a moton capture room and commanded t to fly a perfect crcle wth downfacng camera. Subsequently, we flew the exact same trajectory agan wth a wde fsheye camera. Excerpts from the dataset are shown n Fg. 17. We run SVO (wthout bundle adjustment) on both datasets and show the resultng trajectores n Fg. 18. To run SVO on the fsheye mages, we use the modfcatons descrbed n Sec. VII. Whle the recovered trajectory from the perspectve camera slowly drfts over tme, the result on the fsheye camera perfectly overlaps wth the groundtruth trajectory. We also run ORB- SLAM and LSD-SLAM on the trajectory wth the perspectve mages. The result of ORB-SLAM s as close to the groundtruth trajectory as the SVO fsheye result. However, f we deactvate loop-closure detecton (shown result) the trajectory drfts more than SVO. We were not able to run LSD-SLAM and ORB-SLAM on the fsheye mages as the open source mplementatons do not support very large FoV cameras. Due x [m] (a) Perspectve. (b) Fsheye. Fg. 17: SVO trackng wth (a) perspectve and (b) fsheye camera lens ORB SLAM (no loop-closure) SVO Fsheye SVO Perspectve Groundtruth y [m] Fg. 18: Comparson of perspectve and fsheye lense on the same crcular trajectory that was recorded wth a mcro aeral vehcle n a moton capture room. The ORB-SLAM result was obtaned wth the perspectve camera mages and loop-closure was deactvated for a far comparson wth SVO. ORB-SLAM wth a perspectve camera and wth loop-closure actvated performs as good as SVO wth a fsheye camera. to the dffcult hgh-frequency texture of the floor, we were not able to ntalze LSD-SLAM on ths dataset. A more ndepth evaluaton of the beneft of large FoV cameras for SVO s provded n [6]. XII. DISCUSSION In ths secton we dscuss the proposed SVO algorthm n terms of effcency, accuracy, and robustness. A. Effcency Feature-based algorthms ncur a constant cost of feature and descrptor extracton per frame. For example, ORB-SLAM requres 11 mllseconds per frame for ORB feature extracton only [22]. Ths constant cost per frame s a bottleneck for feature-based VO algorthms. On the contrary, SVO does not have ths constant cost per frame and benefts greatly from the use of hgh frame-rate cameras. SVO extracts features only for selected keyframes n a parallel thread, thus, decoupled from hard real-tme constrants. The proposed trackng algorthm, on the other hand, benefts from hgh frame-rate cameras: the sparse mage algnment step s automatcally ntalzed closer to the soluton and, thus, converges faster. Therefore,

13 C. Robustness Fg. 19: Successful trackng n scenes of hgh-frequency texture. ncreasng the camera frame-rate actually reduces the computatonal cost per frame n SVO.

13 13 C. Robustness Fg. 19: Successful trackng n scenes of hgh-frequency texture. ncreasng the camera frame-rate actually reduces the computatonal cost per frame n SVO. The same prncple apples to LSD-SLAM. However, LSD-SLAM tracks sgnfcantly more pxels than SVO and s, therefore, up to an order of magntude slower. To summarze, on a laptop computer wth an Intel GHz CPU processor ORB-SLAM and LSD-SLAM requre approxmately 3 and 23 mllseconds respectvely per frame whle SVO requres only 2.5 mllseconds (see Table I). B. Accuracy SVO computes feature correspondence wth sub-pxel accuracy usng drect feature algnment. Subsequently, we optmze both structure and moton to mnmze the reprojecton errors (see Sec. V-B). We use SVO n two settngs: f hghest accuracy s not necessary, such as for moton estmaton of mcro aeral vehcles [53], we only perform the refnement step (Sec. V-B) for the latest camera pose, whch results n the hghest frame-rates (.e., 2.5 ms). If hghest accuracy s requred, we use SAM2 [9] to jontly optmze structure and moton of the whole trajectory. SAM2 s an ncremental smoothng algorthm, whch leverages the expressveness of factor graphs [8] to mantan sparsty and to dentfy and update only the typcally small subset of varables affected by a new measurement. In an odometry settng, ths allows SAM2 to acheve the same accuracy as batch estmaton of the whole trajectory, whle preservng real-tme capablty. Bundle adjustment wth SAM2 s consstent [44], whch means that the estmated covarance of the estmate matches the estmaton errors (e.g., are not over-confdent). Consstency s a prerequste for optmal fuson wth addtonal sensors [18]. In [44], we therefore show how SVO can be fused wth nertal measurements to acheve a drft that s approxmately.1% of the traveled dstance. LSD-SLAM, on the other hand, only optmzes a graph of poses and leaves the structure fxed once computed (up to a scale). The optmzaton does not capture correlatons between the sem-dense depth estmates and the camera pose estmates. Ths separaton of depth estmaton and pose optmzaton s only optmal f each step yelds the optmal soluton. SVO s most robust when a hgh frame-rate camera s used (e.g., between 4 and 8 frames per second). Ths ncreases the reslence to fast motons as t s demonstrated n the vdeo attachment. A fast camera, together wth the proposed robust depth estmaton, allows us to track the camera n envronments wth repettve and hgh frequency texture (e.g., grass or asphalt as shown n Fg. 19). The advantage of the proposed probablstc depth estmaton method over the standard approach of trangulatng ponts from two vews only s that we observe far fewer outlers as every depth flter undergoes many measurements untl convergence. Furthermore, erroneous measurements are explctly modeled, whch allows the depth to converge n hghly self-smlar envronments. A further advantage of SVO s that the algorthm starts drectly wth an optmzaton. Data assocaton n sparse mage algnment s drectly gven by geometry of the problem and therefore, no RANSAC [23] s requred as t s typcal n feature-based approaches. Startng drectly wth an optmzaton also smplfes greatly the use of mult-camera systems, whch greatly mproves reslence to on-spot rotatons as the feld of vew of the system s enlarged and depth can be trangulated from nter-camera-rg measurements. Fnally, the use of gradent edge features (.e., edgelets) ncreases the robustness n areas where only few corner features are found. Our smulaton experments have shown that the proposed sparse mage algnment approach acheves comparable performance as sem-dense and dense algnment n terms of robustness of frame-to-frame moton estmaton. XIII. C ONCLUSION In ths paper, we proposed the sem-drect VO ppelne SVO that s sgnfcantly faster than the current state-of-theart VO algorthms whle achevng hghly compettve accuracy. The gan n speed s due to the fact that features are only extracted for selected keyframes n a parallel thread and feature matches are establshed very fast and robustly wth the novel sparse mage algnment algorthm. Sparse mage algnment tracks a set of features jontly under eppolar constrans and can be used nstead of KLT-trackng [71] when the scene depth at the feature postons s known. We further propose to estmate the scene depth usng a robust flter that explctly models outler measurements. Robust depth estmaton and drect trackng allows us to track very weak corner features and edgelets. A further beneft of SVO s that t drectly starts wth an optmzaton, whch allows us to easly ntegrate measurements from multple cameras as well as moton prors. The formulaton further allows usng large FoV cameras wth fsheye and catadoptrc lenses. The SVO algorthm has further proven successful n real-world applcatons such as vsonbased flght of quadrotors [53] or 3D scannng applcatons wth smartphones. Acknowledgments The authors gratefully acknowledge Henr Rebecq for creatng the Urban Canyon datasets that can be accessed here:

SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems

SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems 1 : Sem-Drect Vsual Odometry for Monocular and Mult-Camera Systems Chrstan Forster, Zchao Zhang, Mchael Gassner, Manuel Werlberger, Davde Scaramuzza Abstract Drect methods for Vsual Odometry (VO) have