Delayed Features Initialization for Inverse Depth Monocular SLAM

Rodrigo Munguia and Antoni Grau
Department of Automatic Control, Technical University of Catalonia, UPC
c/ Pau Gargallo, 5, E-08028 Barcelona, Spain
{rodrigo.munguia;antoni.grau}@upc.edu

Abstract: Recently, the unified inverse depth parametrization has been shown to be a good option for the challenging monocular SLAM problem, in an EKF scheme for estimating the stochastic map and the camera pose. In the original approach, features are initialized in the first frame in which they are observed (undelayed initialization); this has advantages but also some problems. In this paper a delayed feature initialization is proposed for adding new features to the stochastic map. The results show that delayed initialization can improve some aspects without losing the performance and unified aspect of the original method, when initial reference points are used to fix a metric scale in the map.

Index Terms: Features, Initialization, Monocular, SLAM.

I. INTRODUCTION

In recent works, Montiel [1] and Eade [5] have shown that an inverse depth parametrization for monocular SLAM can improve the linearity of the measurement equation, even for small changes in the camera position that yield small changes in the parallax angle. This allows a Gaussian distribution to cover an uncertainty in depth that spans a range from nearby to infinity. In the unified inverse depth method presented by Montiel [1], the transition from partially to fully initialized features does not need to be tackled explicitly, which makes the method suitable for direct use in an EKF framework for sparse mapping. In this approach the features are initialized in the first frame in which they are observed (undelayed initialization) with a fixed initial depth and uncertainty, determined heuristically to cover ranges from nearby to infinity, so that distant points can be coded. Due to its clarity and scalability, this approach is a good option for a monocular SLAM implementation.
In particular, this work is motivated by the problems of vision-based robot map building and localization; if monocular SLAM is to be applied in this context, retrieving the metric scale of the world is very important. Experiments with the unified inverse depth method show that, when initial reference points are used to establish a metric scale in the map, the initial feature depths have to be tuned; otherwise it is likely that new features added to the map never converge with respect to the metric reference. On the other hand, initializing features far from the camera optical center can increase the possibility that feature depths become negative after a Kalman update step.

Initializing features in the first observed frame (undelayed initialization) avoids the use of pre-initialized features in the state and allows the use of all the information available from the feature since it is detected. On the other hand, when features are detected in the image with a saliency operator in order to be added automatically to the map, image features that are weak in the long term are usually added as well, and these are difficult to match in subsequent frames. When a minimum number of active image features has to be maintained, unnecessary initializations may therefore be performed. Every new feature initialization introduces biases to the system [8].

The aforementioned issues suggest that, for new features, the initial inverse depth and its associated initial uncertainty could be estimated before the feature is added to the system state, instead of using a fixed initial depth and uncertainty. At the same time, features can be tested prior to being added to the map in order to prune weak long-term features.

II. RELATED WORK

In [2] a multi-hypothesis method based on a particle filter is proposed to represent the initial depth of a feature. This work gives good results. However, its application in large environments is not straightforward, as it would require a huge number of particles. In [4] a delayed multi-hypothesis method based on a sum of Gaussians for depth estimation is proposed, but it uses odometry as an additional sensor.
The work in [5] is based on the FastSLAM algorithm, where the pose of the robot is represented by particles and a set of Kalman filters refines the estimation of the features; this approach is unable to code distant points.

In the work presented in this paper, a delayed feature initialization is proposed for adding new features to the stochastic map in the context of monocular SLAM using the inverse depth parametrization. The experimental results show that delayed initialization can improve some aspects without losing the performance and unified aspect of the original (undelayed) method presented by Montiel [1], when initial reference points are used to fix a metric scale in the map.

III. INVERSE DEPTH MONOCULAR SLAM

A. Camera motion model

A free camera moving in any direction in $\mathbb{R}^3 \times SO(3)$ is considered. The camera state $x_v$ is defined by:

$$x_v = \begin{bmatrix} r^{WC} & q^{WC} & v & \omega \end{bmatrix}^T \tag{1}$$

where $r^{WC} = [x, y, z]$ represents the camera optical center position, $q^{WC} = [q_0, q_1, q_2, q_3]$ represents the camera orientation by a quaternion, and $v = [v_x, v_y, v_z]$ and $\omega = [\omega_x, \omega_y, \omega_z]$ denote the linear and angular velocities, respectively. At every step, unknown linear and angular accelerations $a$ and $\alpha$ are assumed, modeled as zero-mean Gaussian processes with known covariance. They produce an impulse of linear and angular velocity:

$$n = \begin{bmatrix} V \\ \Omega \end{bmatrix} = \begin{bmatrix} a\,\Delta t \\ \alpha\,\Delta t \end{bmatrix} \tag{2}$$

The camera motion prediction model is:

$$f_v = \begin{bmatrix} r^{WC}_{k+1} \\ q^{WC}_{k+1} \\ v_{k+1} \\ \omega_{k+1} \end{bmatrix} = \begin{bmatrix} r^{WC}_k + (v_k + V)\,\Delta t \\ q^{WC}_k \times q\big((\omega_k + \Omega)\,\Delta t\big) \\ v_k + V \\ \omega_k + \Omega \end{bmatrix} \tag{3}$$

where $q((\omega_k + \Omega)\,\Delta t)$ is the quaternion defined by the rotation vector $(\omega_k + \Omega)\,\Delta t$. An Extended Kalman Filter propagates the camera pose and velocity estimates, as well as the feature estimates.

B. Features definition and measurement

The complete state, which includes the features $y_i$, is:

$$x = \begin{bmatrix} x_v^T & y_1^T & y_2^T & \dots & y_n^T \end{bmatrix}^T \tag{4}$$

where a feature $y_i$ represents a 3D scene point defined by the 6-dimensional state vector:

$$y_i = \begin{bmatrix} x_i & y_i & z_i & \theta_i & \phi_i & \rho_i \end{bmatrix}^T \tag{5}$$

which models the 3D point located at:

$$\begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} + \frac{1}{\rho_i}\, m(\theta_i, \phi_i) \tag{6}$$

where $x_i, y_i, z_i$ are the camera optical center coordinates when the feature was first observed, and $\theta_i, \phi_i$ represent the azimuth and elevation (with respect to the world reference $W$) of the directional vector $m(\theta_i, \phi_i)$. The point depth $d_i$ along the ray is coded by its inverse, $\rho_i = 1/d_i$.

The different locations of the camera, along with the locations of the already mapped features, are used to predict the feature position $h_i$. The observation of a point $y_i$ from a camera location defines a ray expressed in the camera frame as $h^C = [h_x, h_y, h_z]$:

$$h^C = R^{CW}\left( \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} + \frac{1}{\rho_i}\, m(\theta_i, \phi_i) - r^{WC} \right) \tag{7}$$

$h^C$ is observed by the camera through its projection in the image. The projection is modeled using a full perspective wide-angle camera.
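As an illustration of the inverse depth coding in equations (5) to (7), the following minimal Python sketch converts a feature vector into a world-frame point and then into a ray in the camera frame. It rests on two assumptions that this section does not state: the unit vector is taken as m(θ, φ) = (cos φ sin θ, −sin φ, cos φ cos θ), the convention of [1], and the quaternion follows the Hamilton convention with the scalar part first. The function names are hypothetical.

```python
import math

def direction(theta, phi):
    """Unit vector m(theta, phi); convention assumed from [1]."""
    return (math.cos(phi) * math.sin(theta),
            -math.sin(phi),
            math.cos(phi) * math.cos(theta))

def feature_to_point(y):
    """Equation (6): world point coded by y = (x, y, z, theta, phi, rho)."""
    x, yy, z, theta, phi, rho = y
    m = direction(theta, phi)
    return (x + m[0] / rho, yy + m[1] / rho, z + m[2] / rho)

def quat_to_rot(q):
    """Rotation matrix R^{WC} of a unit quaternion (q0, q1, q2, q3)."""
    q0, q1, q2, q3 = q
    return [
        [1 - 2 * (q2 * q2 + q3 * q3), 2 * (q1 * q2 - q0 * q3), 2 * (q1 * q3 + q0 * q2)],
        [2 * (q1 * q2 + q0 * q3), 1 - 2 * (q1 * q1 + q3 * q3), 2 * (q2 * q3 - q0 * q1)],
        [2 * (q1 * q3 - q0 * q2), 2 * (q2 * q3 + q0 * q1), 1 - 2 * (q1 * q1 + q2 * q2)],
    ]

def ray_in_camera(y, r_wc, q_wc):
    """Equation (7): h^C = R^{CW} (point - r^{WC}), with R^{CW} = (R^{WC})^T."""
    p = feature_to_point(y)
    d = [p[i] - r_wc[i] for i in range(3)]
    R = quat_to_rot(q_wc)
    # Multiply by the transpose: component j uses column j of R^{WC}.
    return tuple(sum(R[i][j] * d[i] for i in range(3)) for j in range(3))
```

For example, a feature anchored at the origin with θ = φ = 0 and ρ = 0.5 codes the point two units along the world z axis; seen from a camera at the origin with identity orientation, the ray is that same vector.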
First the projection is modeled in the normalized retina:

$$\begin{bmatrix} \upsilon \\ \nu \end{bmatrix} = \begin{bmatrix} h_x / h_z \\ h_y / h_z \end{bmatrix} \tag{8}$$

The camera calibration model is applied to produce the pixel coordinates of the predicted point:

$$h = \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} u_0 - \dfrac{f}{d_x}\,\upsilon \\ v_0 - \dfrac{f}{d_y}\,\nu \end{bmatrix} \tag{9}$$

where $u_0, v_0$ is the camera center in pixels, $f$ is the focal length, and $d_x$ and $d_y$ are the pixel sizes. Finally, a radial distortion model is applied [7]:

$$h_d = \begin{bmatrix} u_d \\ v_d \end{bmatrix} = \begin{bmatrix} u_0 + \dfrac{u - u_0}{\sqrt{1 + 2 K r^2}} \\ v_0 + \dfrac{v - v_0}{\sqrt{1 + 2 K r^2}} \end{bmatrix} \tag{10}$$

where $r = \sqrt{(u - u_0)^2 + (v - v_0)^2}$ and $K$ is the distortion coefficient.

The search for features is constrained to elliptical regions around the predicted $h_i$. The elliptical regions are defined by the innovation covariance matrix $S_i = H_i P_{k+1} H_i^T + R$, where $H_i$ is the Jacobian of the sensor model with respect to the state, $P_{k+1}$ is the prior state covariance, and the measurements $z$ are assumed to be corrupted by zero-mean Gaussian noise with covariance $R$.

IV. DELAYED FEATURE INITIALIZATION

A. Candidate points

In our work we require a minimum number of features $y_i$ predicted to appear in the image; otherwise new features have to be added to the map. In the latter case, new points are detected in the image with a saliency operator. Specifically, we use the Harris corner detector, although more robust detectors can be used. If the data association problem is to be addressed in a more robust way, feature descriptors could be used; we treated this problem in previous work [9, 10]. Only areas of the image free of previously detected points or of already mapped features are considered for detecting new points. We call these image points, which have not yet been added to the map, candidate points $\lambda_i$.

When a point is first detected by the saliency operator in a frame $k$, the candidate point is formed by:

$$\lambda_i = (x_1, y_1, z_1, \sigma_{x_1}, \sigma_{y_1}, \sigma_{z_1}, q_{1,0}, q_{1,1}, q_{1,2}, q_{1,3}, \sigma_{q_{1,0}}, \sigma_{q_{1,1}}, \sigma_{q_{1,2}}, \sigma_{q_{1,3}}, u_1, v_1) \tag{11}$$

where the subscript 1 marks values stored at the first observation. The values $x_1, y_1, z_1$ represent the camera optical center position, and $\sigma_{x_1}, \sigma_{y_1}, \sigma_{z_1}$ are their associated variances taken from the state covariance matrix $P_k$.
$q_{1,0}, q_{1,1}, q_{1,2}, q_{1,3}$ is the quaternion representing the current camera orientation, $\sigma_{q_{1,0}}, \sigma_{q_{1,1}}, \sigma_{q_{1,2}}, \sigma_{q_{1,3}}$ are its associated variances taken from the state covariance matrix $P_k$, and $u_1, v_1$ are the current pixel coordinates of the point $\lambda_i$.

In subsequent frames $\lambda_i$ is tracked, but in practice some $\lambda_i$ points cannot be tracked. This process is used to prune the weakest image features. Any method can be used for tracking. Every candidate point $\lambda_i$ is tracked until it is pruned or initialized in the system. In practice, in every frame some new candidate points $\lambda_i$ may be detected, others may be pruned, and others may be considered for addition to the map. In our experiments an average of 5 to 5 points $\lambda_i$ is maintained at every step.

B. Adding features to the state

As the camera moves freely through its environment, the translation produces parallax in the features. Parallax is the key that allows feature depth to be estimated. In indoor sequences, a translation of centimeters is enough to produce parallax; on the other hand, the more distant the feature, the more the camera has to be translated to produce parallax.

Figure 1. Simulated estimate of the feature depth uncertainty $\sigma_\rho$ for a parallax angle $\alpha$ from 0.º to 0º. An increment in the uncertainty $\sigma_\theta$ of the measurement angle $\theta$ is considered as the parallax grows. Note that a few degrees of parallax are enough to reduce the uncertainty in the estimation.

Figure 2. Feature parametrization and initialization.

In our approach we want to estimate dynamically an initial depth and its associated uncertainty for the features added to the map. For near features, a small translation is enough to produce some parallax. We use a minimum parallax threshold $\alpha_{min}$ for considering a candidate point $\lambda_i$ to be added to the map as a feature $y_i$. On the other hand, distant features will not produce parallax but are useful for estimating the camera orientation, and therefore it is advantageous to include some distant features in the map with a large depth uncertainty. For this reason, a minimum base-line camera translation $b_{min}$ is also considered for adding a candidate point $\lambda_i$ to the map.

Figure 1 shows a simulation of how the uncertainty in the feature depth estimate decreases as the parallax angle increases. It can be observed that a few degrees of parallax are enough to reduce the depth uncertainty significantly.
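The trend described for Figure 1 can be reproduced qualitatively with a first-order sketch. It is not the simulation used for the figure: it assumes the sine-law triangulation d = b sin β / sin α with β held fixed, and a constant angular noise σ_α (a hypothetical 0.5º here), whereas the figure considers a growing measurement uncertainty. Under those assumptions |∂d/∂α| = d / tan α, so the relative depth uncertainty is approximately σ_α / tan α.

```python
import math

def relative_depth_sigma(alpha_deg, sigma_alpha_deg=0.5):
    """First-order relative depth uncertainty sigma_d / d for parallax alpha.

    With d = b * sin(beta) / sin(alpha) and beta held fixed,
    |dd/dalpha| = d / tan(alpha), hence sigma_d / d ~= sigma_alpha / tan(alpha).
    sigma_alpha_deg is a hypothetical noise level, not a value from the paper.
    """
    alpha = math.radians(alpha_deg)
    return math.radians(sigma_alpha_deg) / math.tan(alpha)

# Tabulate the ratio for a few parallax angles (in degrees).
for a in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(f"parallax {a:5.1f} deg -> sigma_d/d ~ {relative_depth_sigma(a):.3f}")
```

With this noise level the ratio falls from about 5 at 0.1º of parallax to about 0.17 at 3º, which is consistent with the observation that a few degrees are enough.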
In the experiments $\alpha_{min}$ = 3º is used. The minimum base-line $b_{min}$ was set heuristically to the base-line necessary to produce a parallax $\alpha$ of 6º at the initial reference points. For example, if the initial camera position is on average one meter away from the initial reference points, then $b_{min}$ = 8 cm.

So far, the uncertainty of the measurements is not considered, and the parallax $\alpha$ is estimated using i) the base-line $b$, ii) $\lambda_i$, using its associated data $(x_1, y_1, z_1, q_{1,0}, q_{1,1}, q_{1,2}, q_{1,3}, u_1, v_1)$, and iii) the current state $(x_k, y_k, z_k, q_{k,0}, q_{k,1}, q_{k,2}, q_{k,3}, u_k, v_k)$. The parallax angle for a $\lambda_i$ can be estimated as (Fig. 2):

$$\alpha = \pi - (\beta + \gamma) \tag{12}$$

The angle $\beta$ is determined by the directional projection ray vector $h_1$ and the vector $b_1$ defining the base-line $b$ in the direction of the camera trajectory:

$$\beta = \cos^{-1}\left( \frac{h_1 \cdot b_1}{\|h_1\|\,\|b_1\|} \right) \tag{13}$$

where the directional projection ray vector $h_1$, expressed in the absolute frame $W$, is computed from the camera position and the coordinates of the observed point when it was first observed, using the data stored in $\lambda_i$:

$$h_1 = R^{WC}(q^{WC}_1)\, h^C(u_1, v_1) \tag{14}$$

with $R^{WC}(q^{WC}_1)$ being the rotation matrix that depends on the stored camera orientation quaternion $q^{WC}_1 = (q_{1,0}, q_{1,1}, q_{1,2}, q_{1,3})$, and $h^C(u_1, v_1)$ the directional vector in the camera frame (see equation (7)) for the stored pixel coordinates $(u_1, v_1)$. $b_1$ is the vector representing the camera base-line $b$ between the camera optical center position $x_1, y_1, z_1$ where the point was first observed and the current optical center $x_k, y_k, z_k$:

$$b_1 = [(x_k - x_1), (y_k - y_1), (z_k - z_1)] \tag{15}$$

The angle $\gamma$ is determined in a similar way to $\beta$, but using the directional projection ray vector $h_2$ and the vector $b_2$ defining the base-line in the direction opposite to the camera trajectory:

$$\gamma = \cos^{-1}\left( \frac{h_2 \cdot b_2}{\|h_2\|\,\|b_2\|} \right) \tag{16}$$

The directional projection ray vector $h_2$, expressed in the absolute frame $W$, is computed in a similar way to (14) but using the current camera position $x_v$ and the current point coordinates $u_k, v_k$. $b_2$ is equal to $b_1$ but points in the opposite direction:

$$h_2 = R^{WC}(q^{WC}_k)\, h^C(u_k, v_k) \tag{17}$$

$$b_2 = [(x_1 - x_k), (y_1 - y_k), (z_1 - z_k)] \tag{18}$$

The base-line $b$ is the modulus of $b_1$ or $b_2$:

$$b = \|b_1\| \tag{19}$$

If $\alpha > \alpha_{min}$ or $b > b_{min}$, then $\lambda_i$ is initialized as a new map feature:

$$y_{new} = \begin{bmatrix} x_i & y_i & z_i & \theta_i & \phi_i & \rho_i \end{bmatrix}^T \tag{20}$$

where the first three elements are obtained directly from the current camera optical center position:

$$\begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} = \begin{bmatrix} x_k \\ y_k \\ z_k \end{bmatrix} \tag{21}$$

The angles can be derived as:

$$\begin{bmatrix} \theta_i \\ \phi_i \end{bmatrix} = \begin{bmatrix} \arctan\left(h_x, h_z\right) \\ \arctan\left(-h_y, \sqrt{h_x^2 + h_z^2}\right) \end{bmatrix} \tag{22}$$

where $h = [h_x, h_y, h_z]$ is obtained from equation (17). Finally, the inverse depth $\rho_i$ is derived from the law of sines:

$$\rho_i = \frac{\sin \alpha}{b\, \sin \beta} \tag{23}$$

C. Updating the covariance matrix

The covariance for $x_i, y_i, z_i, \theta_i, \phi_i$ and $\rho_i$ is derived from the diagonal measurement error covariance matrix $R_j$ and the state covariance matrix estimate $P_k$:

$$R_j = \mathrm{diag}(\sigma_{u_1}, \sigma_{v_1}, \sigma_{u_k}, \sigma_{v_k}, \sigma_{x_1}, \sigma_{y_1}, \sigma_{z_1}, \sigma_{q_{1,0}}, \sigma_{q_{1,1}}, \sigma_{q_{1,2}}, \sigma_{q_{1,3}}) \tag{24}$$

$R_j$ is formed from the image measurement error variances $\sigma_{u_1}, \sigma_{v_1}, \sigma_{u_k}, \sigma_{v_k}$ and the variances stored in $\lambda_i$: $\sigma_{x_1}, \sigma_{y_1}, \sigma_{z_1}, \sigma_{q_{1,0}}, \sigma_{q_{1,1}}, \sigma_{q_{1,2}}, \sigma_{q_{1,3}}$. The state covariance matrix after initialization is:

$$P_k^{new} = J \begin{bmatrix} P_k & 0 \\ 0 & R_j \end{bmatrix} J^T \tag{25}$$

$$J = \begin{bmatrix} I & 0 \\ \dfrac{\partial y}{\partial x_v}, 0, \dots, 0 & \dfrac{\partial y}{\partial h} \end{bmatrix} \tag{26}$$

where $I$ is the identity matrix with the same dimension as $P_k$, $\partial y / \partial x_v$ are the derivatives of $y$ with respect to the camera state $x_v$, and $\partial y / \partial h$ are the derivatives of $y$ with respect to the measurement equations depending on $R_j$. The Jacobian calculation is complicated but a tractable matter of differentiation; we do not present the results here.

V. EXPERIMENTAL RESULTS

Real image sequences of 320 × 240 pixels, acquired with a monochrome IEEE 1394 web-cam at 30 fps, were used to test the performance of the method. The experiments were developed in MATLAB. Part of the code was based on the code provided by the author of [1]. The initial reference consists of three spatial points forming a triangle of known dimensions (see Figures 3 and 4).
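Returning to the initialization test of Section IV-B, the geometric part of equations (12) to (23) can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: it takes the two world-frame rays h1 and h2 as given (equations (14) and (17)), ignores the covariance update of Section IV-C, uses the sign convention of equation (22) as assumed from [1], and adds a guard for a motion along the ray that the paper does not discuss. All function and parameter names are hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def _angle(a, b):
    c = _dot(a, b) / (_norm(a) * _norm(b))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against round-off

def try_initialize(c1, h1, ck, h2, alpha_min, b_min):
    """Return a new feature (x, y, z, theta, phi, rho) or None.

    c1, ck: first and current optical centers in the world frame.
    h1, h2: first and current projection rays in the world frame.
    alpha_min is in radians; b_min is in the units of c1 and ck.
    """
    b1 = [ck[i] - c1[i] for i in range(3)]        # eq. (15)
    b2 = [-v for v in b1]                         # eq. (18)
    b = _norm(b1)                                 # eq. (19)
    if b == 0.0:
        return None                               # no translation yet
    beta = _angle(h1, b1)                         # eq. (13)
    gamma = _angle(h2, b2)                        # eq. (16)
    alpha = math.pi - (beta + gamma)              # eq. (12)
    if not (alpha > alpha_min or b > b_min):
        return None                               # keep tracking the candidate
    if abs(math.sin(beta)) < 1e-9:
        return None                               # guard added for this sketch
    theta = math.atan2(h2[0], h2[2])              # eq. (22), azimuth
    phi = math.atan2(-h2[1], math.hypot(h2[0], h2[2]))  # eq. (22), elevation
    rho = math.sin(alpha) / (b * math.sin(beta))  # eq. (23)
    return (ck[0], ck[1], ck[2], theta, phi, rho)
```

As a usage example, for a point at (0, 0, 2) first seen from the origin and then from (0.2, 0, 0), the parallax is about 5.7º and the recovered depth 1/ρ equals the distance from the current camera center to the point.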
Prior to starting the first Kalman step, these three points are selected in the image; their 3D positions with respect to the camera are then calculated using an optimization technique and finally included in the system state with zero uncertainty. Several image sequences were recorded, moving the camera through different trajectories following a predefined path. The undelayed and delayed initialization methods were compared. The trajectories were designed so that if a feature is left behind by the movement of the camera, it will not appear in the image again in subsequent frames.

The original method has a drawback when an initial metric reference is used: if the features are initialized at a distance that is close to the optical center relative to the distance to the reference points, the features never converge with respect to the reference, and the Kalman filter does not converge even to an unscaled trajectory. Figure 3 illustrates the initialization of the first features after the three reference points are introduced in the system, for the undelayed and delayed methods. The graphics in the center show the undelayed method for an initial feature depth of 50 cm. In the upper central graphic it is possible to observe that the reference points are located approximately 80 cm from the initial camera position and that the first observed points are initialized immediately. However, at frame 30 (lower central graphic) the mapped features have not converged with respect to the metric reference, and the camera trajectory has not converged either; note the 4 points corresponding to the printer located beside the initial three-point reference. On the other hand, when we use an initial depth equal to 60 cm (upper and lower right graphics), the map and the camera trajectory converge reasonably. In the delayed approach (left graphics) the first feature is not added to the map until frame 5, in this case with a huge initial uncertainty (upper left graphic). However, at frame 30 (lower left graphic) the map and the trajectory converge. Note that the first added feature was initialized very near to its final position, and its uncertainty was minimized.
In both methods, new points are detected with the Harris corner detector when the number of active features in the image falls below 30; in this case the detector is applied over the image regions that are free of features. Figure 4 shows the results for three different sequences. The real final camera position and trajectory were added manually to the graphics (in black) to ease the comparison; the initial and final frames are shown in the center for each sequence.

Figure 3. Delayed and undelayed methods, using a three-point reference to establish the metric scale. The feature positions are represented by solid green circles and their uncertainty by red ellipses. The camera position is represented by a solid blue circle and its orientation by a blue line emerging from the camera position. The camera trajectory is indicated by the blue path from the initial (x=0, z=0) to the final camera position. For simplicity all the maps are viewed in the x-z axes.

Figure 4. Camera trajectory and map for three sequences: undelayed method (upper graphics) and delayed method (lower graphics). The first sequence corresponds to 760 frames of a house living room and is the same sequence used in the previous experiment. The second sequence corresponds to 480 frames taken in a laboratory; note that a PC monitor was used as the initial metric reference. The third, 360-frame sequence was taken following a simple linear path, but in a more occluded terrace building environment, with very near and very distant features.

Table 1. Results at the end of the three sequences. (σ x,y,z): summed standard deviation of the x, y, z position of the camera. (Nf): total number of features added to the system. (%c): percentage of features that converge. (Nfc): average number of frames needed for the convergence of the features. (E): metric error distance in cm between the real and the final estimated trajectory. (Nf<0): number of negative inverse depths estimated at the end of the trajectory.

Sequence  Method     σ x,y,z, Nf, %c, Nfc, E, Nf<0
1         Undelayed  4 47 4 4 5.7 0
1         Delayed    4. 35 7 0.44 0
2         Undelayed  . 46 76 36. 0
2         Delayed    .4 8 8 45 9. 0
3         Undelayed  .4 34 44 45 7
3         Delayed    .5 7 55 58 9

Table 1 shows the results for each sequence. In our experiments we consider that a feature has converged when its depth uncertainty σ represents less than 5% of its depth; in this way the convergence measure is proportional to the distance, since the depth of a near feature should be estimated more accurately than that of a distant feature.

VI. CONCLUSIONS

In this work a method for delayed feature initialization for the inverse depth parametrization in monocular SLAM is presented. The experimental results show that this method can be a good choice when using monocular SLAM. The method seems to be more robust than the undelayed method when initial metric reference points are used for scaling the map. In our experiments the camera trajectory estimated with the delayed method was similar to the one estimated with the undelayed method. Regarding feature depth convergence, the results were similar for both methods. Since the delayed method is more restrictive about adding new features, a reduced percentage of new features is added to the map (0-40%) with respect to the undelayed method, without losing quality in the map. This is desirable, because bigger environments can be mapped with the same number of features.
On the other hand, it is clear that the delayed method adds a computational cost, since the candidate points have to be tested before being added to the map. The Jacobian used to estimate the new covariance matrix is also more complex than the one used in the undelayed method. However, it is known that the computational cost of the Kalman filter scales poorly with the size of the state, and the saving obtained by using 0-40% of the total number of features can be higher than the computational cost added by the delayed method.

REFERENCES

[1] J.M.M. Montiel, Javier Civera and A. J. Davison. Unified Inverse Depth Parametrization for Monocular SLAM. Robotics: Science and Systems Conference, 2006.
[2] A. J. Davison. Real-time simultaneous localization and mapping with a single camera. In Proc. International Conference on Computer Vision, 2003.
[3] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[4] Thomas Lemaire, Simon Lacroix and Joan Sola. A practical 3D Bearing-Only SLAM algorithm. In Proc. International Conference on Intelligent Robots and Systems, 2005.
[5] E. Eade and T. Drummond. Scalable monocular SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[6] J.M.M. Montiel and A. J. Davison. A visual compass based on SLAM. In Proc. Intl. Conf. on Robotics and Automation, 2006.
[7] Andrew J. Davison, Yolanda Gonzalez Cid and Nobuyuki Kita. Real-Time 3D SLAM with Wide-Angle Vision. In Proceedings of Symposium on Intelligent Autonomous Vehicles, 2004.
[8] Andrew J. Davison and Nobuyuki Kita. Sequential localization and map-building for real-time computer vision and robotics. Robotics and Autonomous Systems, 2001.
[9] Rodrigo Munguia and Antoni Grau. Learning Variability of Image Feature Appearance Using Statistical Methods. Lecture Notes in Computer Science, 4225, 2006.
[10] Rodrigo Munguia, Antoni Grau and Alberto Sanfeliu. Matching Images Features in a Wide Base Line with ICA Descriptors. In Proceedings of the IEEE International Congress in Pattern Recognition, ICPR, 2006.