Scale Recovery for Monocular Visual Odometry Using Depth Estimated with Deep Convolutional Neural Fields


Xiaochuan Yin, Xiangwei Wang, Xiaoguo Du, Qijun Chen
Tongji University
yinxiaochuan@hotmail.

Authors contributed equally.

Abstract

Scale recovery is one of the central problems for monocular visual odometry. Normally, the road plane and the camera height are specified as references to recover the scale. The performance of these methods depends on the plane recognition and on the height measurement of the camera. In this work, we propose a novel method to recover the scale by incorporating the depths estimated from images using deep convolutional neural fields. Our method considers the whole environmental structure as reference rather than a specified plane. The accuracy of depth estimation contributes to the scale recovery. We improve the performance of depth estimation by taking two consecutive frames and the egomotion of the camera into our networks. The depth refinement and the scale recovery are obtained iteratively. In this way, our method can eliminate the scale drift and improve the depth estimation simultaneously. The effectiveness of our method is verified on the KITTI dataset for both monocular visual odometry and depth estimation tasks.

1. Introduction

Visual odometry is the process of estimating the egomotion of the robot from input images. It is one of the core modules of a simultaneous localization and mapping (SLAM) system. Visual odometry using a stereo camera has achieved great success in recent years. Stereo visual odometry is widely applied because it can estimate a reliable depth map for the transformation matrix calculation. However, a stereo camera degenerates into a monocular one when the distance between camera and scene is much larger than the baseline. Moreover, self-calibration is required for a stereo visual odometry system after long-term operation in order to reduce the effect of the mechanical vibration encountered in application [5, 23]. Unlike stereo visual odometry, visual odometry using a monocular camera does not suffer from these problems. Therefore, monocular visual odometry has attracted a lot of attention in recent years.

Figure 1. 3D reconstruction of a street scene with egomotion and depth estimated by our method.

Monocular visual odometry algorithms cannot obtain the environment map and the robot motion in real scale, because the distance between features and camera cannot be measured by triangulation directly. Therefore, monocular visual odometry faces the problem of scale drift, and the absolute scale needs to be recovered to eliminate it.

To recover the scale, geometric constraints between the camera and the surroundings are normally used. Previous methods apply the camera height and the ground surface to obtain the scale [2, 24, 15]. The camera height is assumed to be known in advance. In [2], a texture analysis classifier is adopted to extract the ground region, and the planar homography matrix is then calculated to recover the scale factor. In [24], a combination of several cues within a predefined region of interest (ROI) is used to recover the scale. To be more robust in different scenes, [15] estimates the ground geometry with self-learned ground appearance information. These methods are constrained by limited environmental information. More structural information about the environment is required; with it, there would be no need to set the camera height and ground plane as fixed references.

In recent years, deep learning has achieved great success in the computer vision field thanks to its powerful feature learning ability. Deep learning may promote the progress of visual odometry [1]. In [4], an estimator based on deep convolutional neural networks is trained for egomotion estimation from optical flow. Depth estimation of the surroundings would be another plausible direction for this problem. Depth estimation from a single image is a challenging task. Depth estimation and scene structure inference with a Markov Random Field (MRF) is proposed in [22]. In [6, 7], pixel-wise depth and surface normal regression neural networks are proposed. Depth can also be refined via hierarchical conditional random fields (CRFs) after regression from deep convolutional networks [17]. In [18], deep convolutional neural networks combined with conditional random fields (convolutional neural fields) are applied to estimate depth from pictures taken in indoor and outdoor environments. For consecutive frames, optical flow is applied to extract depth information [12, 21].

In this paper, we introduce a novel scale recovery method using depth estimated from images with deep neural networks. We calculate the scale of translation from the estimated depth map. The depth prediction is improved by taking consecutive images and motion constraints into our method. Therefore, convolutional neural fields, which concatenate convolutional neural networks and conditional random fields, are used for depth estimation in our system. The convolutional neural networks regress a coarse depth map from the input images. The conditional random fields then refine the coarse depth by involving the constraints. The scale of egomotion is obtained based on the refined depth. In this way, the depth and scale can be calculated iteratively. The contributions of our paper are as follows.

1. We present a novel scale recovery method for monocular visual odometry.
2. We present novel neural networks for depth estimation from images. They take consecutive frames and motion constraints into consideration to improve the result.
3. Our method can alleviate the scale drift problem caused by the accumulated error.
4. A novel monocular visual odometry system is proposed using deep learning.

The structure of this paper is organized as follows. In section II, we introduce the background knowledge of the relationship between photo intensity and camera motion. The neural networks for depth estimation with consecutive pictures and egomotion are proposed in section III. In section IV, we introduce our scale calculation method and the framework of our system. Experiments are conducted and analyzed in section V. We conclude our paper in the last section.

2. Background

In this section, we introduce the background knowledge of our system. In this paper, bold capital letters denote a matrix, bold lower-case letters a column vector, and others a scalar. The intensity of pixel $\mathbf{u} = (u, v)^T \in \Omega$ in the image is $I : \Omega \subset \mathbb{R}^2 \to \mathbb{R}$. The 3D point in the world coordinate $\mathbf{p} = (x, y, z)^T \in \mathbb{R}^3$ of the pixel $\mathbf{u}$ can be obtained from the inverse projection function $\pi^{-1} : \mathbb{R}^2 \to \mathbb{R}^3$:

$$\mathbf{p} = \pi^{-1}(\mathbf{u}, z_u) = z_u \left( \frac{u - c_x}{f_x}, \frac{v - c_y}{f_y}, 1 \right)^T \tag{1}$$

where $z_u$ is the depth of pixel $\mathbf{u}$, and $f_x, f_y$ and $c_x, c_y$ denote the focal length and optical center in the standard pinhole camera model, respectively. The rotation matrix $\mathbf{R} \in SO(3)$ and the translation vector $\mathbf{t} \in \mathbb{R}^3$ transform the point $\mathbf{p}$ to the new position:

$$\mathcal{T}(\mathbf{T}, \mathbf{p}) = \mathbf{R}\mathbf{p} + \mathbf{t} \tag{2}$$

The residual of the i-th pixel is defined as the difference in intensity between the first and second image. In this work, the intensity of pixels after reprojection is assumed to be the same.

$$r_i := I_2\left(\pi\left(\mathcal{T}\left(\mathbf{T}, \pi^{-1}(\mathbf{u}_{1,i}, z_{1,u_i})\right)\right)\right) - I_1(\mathbf{u}_{1,i}) \tag{3}$$

$$\mathcal{R} = \left\{ \mathbf{u} \mid \mathbf{u} \in \mathcal{R}_k \wedge \pi\left(\mathcal{T}\left(\mathbf{T}, \pi^{-1}(\mathbf{u}, z_u)\right)\right) \in \Omega_k \right\}$$

where $\pi$ is the projection function, which is defined as:

$$\mathbf{u} = \pi(\mathbf{p}) = \left( \frac{f_x x}{z} + c_x, \frac{f_y y}{z} + c_y \right)^T \tag{4}$$

3. Depth Estimation with Convolutional Neural Fields

In this section, we introduce the structure of our deep neural networks for depth estimation. In order to increase the accuracy of depth prediction and remove the outliers, two consecutive images and the egomotion are taken as inputs to our networks. We construct our convolutional neural fields by concatenating the convolutional neural networks and conditional random fields.
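The pinhole relations of Section 2 (inverse projection, rigid motion and projection) can be exercised numerically. The following is a minimal sketch in plain Python, not the authors' implementation; the function names and the intrinsics `fx, fy, cx, cy` are hypothetical values chosen only for illustration.

```python
def back_project(u, v, z, fx, fy, cx, cy):
    """Inverse projection: lift pixel (u, v) with depth z to a 3D point (x, y, z)."""
    return (z * (u - cx) / fx, z * (v - cy) / fy, z)


def transform(R, t, p):
    """Rigid motion R p + t, with R a 3x3 nested list and t, p 3-tuples."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p, fx, fy, cx, cy):
    """Projection: map a 3D point (x, y, z) back to pixel coordinates (u, v)."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics and pixel, for illustration only.
fx, fy, cx, cy = 700.0, 700.0, 600.0, 180.0
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

p = back_project(650.0, 200.0, 10.0, fx, fy, cx, cy)

# With no motion, reprojection returns the original pixel.
u_static = project(transform(identity, (0.0, 0.0, 0.0), p), fx, fy, cx, cy)

# Bringing the point 1 m closer along the optical axis moves its image
# away from the principal point (cx, cy).
u_closer = project(transform(identity, (0.0, 0.0, -1.0), p), fx, fy, cx, cy)
```

The photometric residual of Section 2 compares the image intensity at the reprojected pixel in the second frame with the intensity at the original pixel in the first frame; the sketch stops at the geometry because the intensity lookup depends on the image container used.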
We apply deep residual neural networks (ResNet) [10] and fully convolutional networks [19] to construct the convolutional part of our networks. ResNet is an elegant structure, and it has achieved excellent performance [10]. It takes standard convolutional neural networks and adds skip connections that bypass a few convolutional layers to construct the residual block. ResNet has a powerful ability for feature representation. However, the convolutional and pooling layers generate downsampled features.

For the scale recovery problem, we need the depth values of the whole image. We apply the fully convolutional networks to upsample the regressed depth map to the size of the input. In order to refine the depth map for later use, we apply the conditional random fields to smooth the regressed result. The structure of our neural networks is shown in Figure 2. For the details of this part, we refer to the work in [10] and [19]. Next, we present our conditional random fields part for depth refinement.

3.1. Refining the regression results via CRFs

Depth can be predicted using the convolutional part mentioned above. However, the output of the above networks has outliers and may introduce additional errors into the scale recovery process. Therefore, an additional layer is required to refine the generated depth. Conditional random fields are commonly used to involve constraints to improve the results. In this work, we define an energy function as follows.

$$E = \sum_{p \in \Omega_1} U(y_{1,p}, x_{1,p}) + \omega_1 \sum_{(p,q) \in \Omega_1} V(y_{1,p}, y_{1,q}, x_{1,p}, x_{1,q}) + \omega_2 \sum_{(p,q) \in \mathcal{R}} W(x_{1,p}, y_{1,p}, x_{2,q}, \mathbf{T}) \tag{5}$$

where $x_{i,p}$ denotes the pixel $p$ in image $\Omega_i$, $y_{i,p}$ is the depth of pixel $p$ in image $\Omega_i$, and $\mathbf{T}$ is the transformation matrix of the robot. $\omega_1$ and $\omega_2$ are the coefficients of the last two terms. They reflect the influence of each part, and they are obtained in our training process. Our energy function consists of three potential functions: the unary potential, the pairwise potential in the same image and the pairwise potential of two consecutive images.

3.1.1. Unary potential

The unary potential function measures the least square loss between the regressed depth $z$ and the ground truth depth $y$. The regressed depth $z$ is the output of the convolutional part, and it is taken as input to the conditional random fields for refinement. The unary potential part is defined as:

$$U(y_{1,p}, x_{1,p}; \theta) = \frac{1}{2} \left( y_{1,p} - z_{1,p}(x_{1,p}; \theta) \right)^2, \quad \forall p \in \Omega_1 \tag{6}$$

where $\theta$ are the parameters of the convolutional part of our networks.

3.1.2. Pairwise potential in the same image

The pairwise potential term enforces the coherence of the depths of neighboring pixels. We assume that neighboring pixels with similar color are close in depth.

$$V(y_p, y_q, x_p, x_q; \gamma_1, \gamma_2) = \frac{1}{2} R_{pq} (y_p - y_q)^2, \quad \forall p, q \in \Omega_1 \tag{7}$$

where $R_{pq}$ is the appearance kernel, which measures the difference in position and color between pixels:

$$R_{pq} = \exp\left( -\gamma_1 \| I_p - I_q \|^2 - \gamma_2 \| \mathbf{u}_p - \mathbf{u}_q \|^2 \right) \tag{8}$$

where $I_p$ and $\mathbf{u}_p$ are the color and position of the pixel $p$; $\gamma_1$ and $\gamma_2$ denote the importance of each kernel.

3.1.3. Pairwise potential of two consecutive images

In order to consider the constraints between two consecutive images, the third potential function measures the accuracy of depth estimation by projecting the current frame to the previous one with the known transformation matrix. Because the intensity error $r$ does not follow a Gaussian distribution, the Huber loss function is applied in our energy function to eliminate the effects of outliers.

$$W = \frac{1}{2} \left\| r_i(\mathbf{T}, \mathbf{u}, z) + J_i (y - z) \right\|_{\text{Huber}} \tag{9}$$

where $J_i$ is the Jacobian of $r_i$ at pixel $\mathbf{u}$ with depth $z$ and transformation matrix $\mathbf{T}$.

3.2. Training the neural networks

The probability distribution function of depth is

$$\Pr(\mathbf{y} \mid \mathbf{x}) = \frac{\exp(-E(\mathbf{y}, \mathbf{x}))}{Z(\mathbf{x})} \tag{10}$$

The partition function $Z(\mathbf{x})$ is defined as:

$$Z(\mathbf{x}_1, \mathbf{x}_2, \mathbf{T}) = \int_{\mathbf{y}} \exp\{ -E(\mathbf{y}, \mathbf{x}_1, \mathbf{x}_2, \mathbf{T}) \} \, d\mathbf{y} \tag{11}$$

Because the Huber loss function is applied in our energy function, we cannot obtain the partition function for maximum a posteriori estimation by calculating its integral analytically. Therefore, the neural networks are trained through score matching [11], which does not require normalization. The discrete version of score matching is adopted to optimize the objective function

$$J(\delta) = \frac{1}{N} \sum_{i=1}^{N} \left( \partial_i \phi_i(\delta) + \frac{1}{2} \phi_i(\delta)^2 \right) + \text{const} \tag{12}$$

where $\delta = [z(\theta)^T, \omega_1, \omega_2]^T$ are the parameters to optimize in the conditional random fields. The score function is defined as:

$$\phi(\mathbf{y}; \delta) = \nabla_{\mathbf{y}} \log \Pr(\mathbf{y} \mid \delta) \tag{13}$$
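A short worked step may help show why score matching sidesteps the partition function. It uses only the probability distribution of depth and the definition of the score function given above, together with the fact that $Z$ is an integral over $\mathbf{y}$ and therefore does not depend on $\mathbf{y}$:

```latex
\begin{aligned}
\log \Pr(\mathbf{y} \mid \mathbf{x}) &= -E(\mathbf{y}, \mathbf{x}) - \log Z(\mathbf{x}), \\
\phi(\mathbf{y}; \delta) = \nabla_{\mathbf{y}} \log \Pr(\mathbf{y} \mid \mathbf{x})
  &= -\nabla_{\mathbf{y}} E(\mathbf{y}, \mathbf{x}) - \underbrace{\nabla_{\mathbf{y}} \log Z(\mathbf{x})}_{=\,0}
   = -\nabla_{\mathbf{y}} E(\mathbf{y}, \mathbf{x}).
\end{aligned}
```

The score, and hence the score matching objective, can therefore be evaluated from derivatives of the energy alone, without ever computing the integral that defines $Z$.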


Figure 2. Structure of our neural networks. The inputs of the neural networks are two consecutive frames (Frame 1, Frame 2) and the transformation matrix. The output is the refined depth map. The deep convolutional part (ResNet+FCN, depth regression) is composed of deep residual networks and fully convolutional networks. The output of the convolutional part is refined by the following conditional random fields layer (CRFs, depth refining).

The partial derivative of the i-th element of the model score function with respect to the i-th variable is

$$\partial_i \phi_i(\mathbf{y}; \delta) = \frac{\partial \phi_i(\mathbf{y}; \delta)}{\partial y_i} \tag{14}$$

The variable $\delta$ can be optimized with gradient descent

$$\delta \leftarrow \delta - \eta \frac{\partial J}{\partial \delta} \tag{15}$$

The parameters of our convolutional neural fields $\{\theta, \omega_1, \omega_2\}$ can be obtained by backpropagation.

3.3. Depth estimation

The refined depth can be calculated by solving the MAP inference

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \log \Pr(\mathbf{y} \mid \mathbf{x}_1, \mathbf{x}_2, \mathbf{T}) \tag{16}$$

In the depth estimation process, the Huber loss function is replaced with the mean squared loss if the intensity error of pixels in two consecutive frames is smaller than a given threshold; otherwise the depth value is removed. The negative log-likelihood can be written as:

$$-\log \Pr(\mathbf{y} \mid \mathbf{x}_1, \mathbf{x}_2, \mathbf{T}) = \| \mathbf{y} - \mathbf{z} \|^2_{\hat{N}} + \omega_1 \mathbf{y}^T \mathbf{M} \mathbf{y} + \omega_2 \| \mathbf{r} + \mathbf{J}(\mathbf{y} - \mathbf{z}) \|^2 \tag{17}$$

where $\hat{N}$ is the number of inliers; $\mathbf{M} = \mathbf{D} - \mathbf{R}$ is the graph Laplacian matrix; $\mathbf{R}$ is the affinity matrix composed of $R_{pq}$; $\mathbf{D}$ is the diagonal matrix with $D_{pp} = \sum_q R_{pq}$; and $\mathbf{J}$ is the Jacobian of the image with respect to the depth, arranged as the diagonal matrix with $J_{pp} = J_p$. Setting the gradient of Eq. (17) with respect to $\mathbf{y}$ to zero, the refined depth can be calculated as:

$$\mathbf{y} = \left( \mathbf{I} + \omega_1 \mathbf{M} + \omega_2 \mathbf{J}^2 \right)^{-1} \left( \mathbf{z} + \omega_2 \mathbf{J}^2 \mathbf{z} - \omega_2 \mathbf{J} \mathbf{r} \right) \tag{18}$$

The training process of our neural networks is shown in Algorithm 1. We first train the neural networks with the mean squared loss as the criterion. Then the CRFs layer is added and we train the whole networks. The errors and gradients are set to zero for pixels without ground truth depth.

Algorithm 1: Training process of convolutional neural fields
  Input: the input image sequence $\{X_i\}_{i=1}^N$, $X_i \in \mathbb{R}^{3 \times m \times n}$, the corresponding output depth maps $\{Y_i\}_{i=1}^N$, $Y_i \in \mathbb{R}^{m \times n}$, and the transformation matrices $\{T_i\}_{i=1}^N$
  Initialize: ResNet is obtained from the model zoo.
  Output: parameters of the networks $\theta$, $\omega_1$ and $\omega_2$
  // Pre-train the ResNet+FCN with the mean squared loss
  // Train the networks including the CRFs layer
  for t = 1 to maxIteration do
    Randomly select inputs $X_i$, $X_{i+1}$ and $T$
    // Forward propagation
    Obtain the refined depth map $\hat{y}_i$
    Obtain the gradient from Eq. (15)
    // Backpropagation
    Update $\theta$, $\omega_1$ and $\omega_2$
  end
  return $\theta$, $\omega_1$ and $\omega_2$

4. Scale Recovery with Estimated Depth

The initial egomotion $\mathbf{T}$ can be obtained by minimizing the reprojection error from matched ORB features [20].

However, the obtained egomotion is not in real scale. This causes scale drift in the monocular visual odometry system. In this section, we introduce our scale recovery method using the depth map predicted as described in section 3. The observed point position relative to the current camera coordinates is denoted by $\mathbf{p}_i = (x_i^c, y_i^c, z_i^c)^T \in \mathbb{R}^3$. Below, $\hat{z}_i^c$ marks the depth read from the predicted depth map, to distinguish it from the up-to-scale depth $z_i^c$ of the map point. The scale parameter $\alpha$ can be obtained by solving

$$\hat{z}_i^c \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} = \alpha \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_i^c \\ y_i^c \\ z_i^c \end{pmatrix}, \quad i = 1, 2, \ldots, n \tag{19}$$

where $\hat{z}_i^c$ is the estimated depth value at pixel $\mathbf{u}_i = (u_i, v_i)$. We denote $Z(u_i, v_i) = \hat{z}_i^c$. The estimated depth value for each point can be expressed as:

$$\hat{z}_i^c = Z\left( \frac{x_i^c}{z_i^c} f_x + c_x, \; \frac{y_i^c}{z_i^c} f_y + c_y \right) \tag{20}$$

and the scale parameter is defined as:

$$\alpha = \frac{\hat{z}_i^c}{z_i^c} \tag{21}$$

Therefore, the residual of the i-th observed map point can be written as:

$$r_i = \hat{z}_i^c - \alpha z_i^c \tag{22}$$

Ideally, the residual would be zero. In practice, the estimated depth introduces noise, which we assume to be independent and identically distributed (IID); we denote the distribution of $r_i$ by $p(r_i)$. Because we use the matched feature points to calculate the scale, the number of points left is small. Taking the small sample size and the existence of outliers into consideration, the non-standardized Student's t-distribution, which is capable of modeling data with outliers, is adopted to model this distribution. The scale recovery problem can then be represented as a maximum likelihood estimation:

$$\alpha^* = \arg\max_{\alpha} \prod_{i=1}^{n} p(r_i) \tag{23}$$

where

$$p(r_i) \propto \frac{1}{\sigma} \left( 1 + \frac{1}{\upsilon} \left( \frac{r_i}{\sigma} \right)^2 \right)^{-\frac{\upsilon + 1}{2}} \tag{24}$$

where $\sigma$ and $\upsilon$ are the scaling parameter and the degrees of freedom of the Student's t-distribution, and $\alpha$ is the scale parameter to recover. The maximum likelihood estimation can be solved by the expectation maximization (EM) algorithm, calculating $\alpha$ and $\sigma$ iteratively [13, 14]. Then we can obtain the transformation matrix with the scale $\alpha$ as

$$\mathbf{T} = \begin{pmatrix} \mathbf{R} & \alpha \mathbf{t} \\ \mathbf{0} & 1 \end{pmatrix} \tag{25}$$

Once the egomotion with absolute scale is calculated, the depth image $Y$ can be refined by Eq. (18). Because feature points are normally not located near one another, we do not consider the appearance kernel term in the scale recovery process, so $\omega_1$ is set to zero. This also speeds up the calculation.
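The maximum likelihood scale estimate under Student's t noise can be sketched as an EM loop that alternates between reweighting the points and solving a weighted least-squares problem. This is a minimal sketch under stated assumptions, not the authors' code: the function name `estimate_scale`, the fixed degrees of freedom `nu = 5`, the iteration count and the sample depths are all hypothetical, and the degrees of freedom are held fixed rather than estimated. Here `z_vo` holds the up-to-scale depths of the map points from visual odometry and `z_net` the depths read from the predicted depth map at the same pixels.

```python
def estimate_scale(z_vo, z_net, nu=5.0, iters=50):
    """Estimate alpha in z_net ~ alpha * z_vo under Student's t residual noise.

    nu (degrees of freedom) is kept fixed; this is an assumption of the sketch.
    Returns (alpha, sigma).
    """
    n = len(z_vo)
    # Start from ordinary least squares, where every point has the same weight.
    alpha = sum(zv * zn for zv, zn in zip(z_vo, z_net)) / sum(zv * zv for zv in z_vo)
    res = [zn - alpha * zv for zv, zn in zip(z_vo, z_net)]
    sigma2 = max(sum(r * r for r in res) / n, 1e-12)
    for _ in range(iters):
        # E-step: points with large residuals receive small weights.
        w = [(nu + 1.0) / (nu + r * r / sigma2) for r in res]
        # M-step: weighted least squares for alpha, then update sigma^2.
        alpha = (sum(wi * zv * zn for wi, zv, zn in zip(w, z_vo, z_net))
                 / sum(wi * zv * zv for wi, zv in zip(w, z_vo)))
        res = [zn - alpha * zv for zv, zn in zip(z_vo, z_net)]
        sigma2 = max(sum(wi * r * r for wi, r in zip(w, res)) / n, 1e-12)
    return alpha, sigma2 ** 0.5


# Hypothetical data: true scale near 2.5, last point is a gross depth outlier.
z_vo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z_net = [2.6, 4.9, 7.5, 10.1, 12.4, 15.0, 17.6, 60.0]

alpha_ls = sum(a * b for a, b in zip(z_vo, z_net)) / sum(a * a for a in z_vo)
alpha_t, sigma_t = estimate_scale(z_vo, z_net)
```

On this toy data the plain least-squares estimate `alpha_ls` is pulled far above 2.5 by the single outlier, while the t-based estimate `alpha_t` stays close to the scale of the inliers, which is the behavior the heavy-tailed model is chosen for.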
The egomotion can also be refined with the more accurate depth map. As a result, the egomotion and the depth map become more and more accurate simultaneously. In our experiment, we iterate this process only once. Algorithm 2 describes the process of our visual odometry system.

Figure 3. Structure of our system. There are two main threads, the depth estimation thread and the visual odometry thread. The depth estimation thread generates the depth map of the current frame. The visual odometry thread predicts the egomotion of the robot. Different depth maps are refined considering the time consumption: a sparse depth map is refined for visual odometry, and dense depth maps are refined for environment reconstruction. Diagram labels: Image, Depth Generation, Visual Odometry, Feature Extraction, Pose Estimation, Regressed Depth, Initialized Pose, CRF Refined Depth, Scale Recovery, Depth Map, Transformation Matrix.

Algorithm 2: Scale recovery using depth predicted from image
  Input: the image sequence $\{X_t\}_{t=1}^n$, the corresponding regressed depth images $\{Z_t\}_{t=1}^n$ and the camera parameters.
  Initialize: $\alpha = 1$; $T_1$ is the camera pose relative to the robot coordinate.
  Output: the camera poses $\{T_t\}_{t=1}^n$ and the optimized sparse depth images $\{Y_t\}_{t=1}^n$
  for t = 2 to n do
    // Calculate the transformation matrix
    Extract and match feature points
    Minimize the reprojection error to calculate the transformation matrix $T = \begin{pmatrix} R & \mathbf{t} \\ 0 & 1 \end{pmatrix}$
    // Scale parameter calculation
    Obtain k map points $\{\mathbf{p}_i\}_{i=1}^k$, $\mathbf{p}_i \in \mathbb{R}^3$
    while i < maxIteration do
      Calculate the corresponding depths $\{z_i^c\}_{i=1}^k$, $z_i^c \in \mathbb{R}$
      Get the scale parameter $\alpha$ using maximum likelihood estimation from Eq. (23)
      Update $T = \begin{pmatrix} R & \alpha \mathbf{t} \\ 0 & 1 \end{pmatrix}$
      Optimize the depth image $\hat{Y}_t$ from Eq. (18)
    end
    Set $Y_t = \hat{Y}_t$
    Get the current pose $T_t = T_{t-1} T$
  end
  return $\{T_t\}_{t=1}^n$, $\{Y_t\}_{t=1}^n$

5. Experiments

In this section, we conduct experiments to verify the effectiveness of our method on the KITTI dataset [8] for depth prediction and monocular visual odometry tasks. The KITTI dataset consists of image sequences from a driving vehicle, with depth captured by a LiDAR and ground truth position provided by a GPS localization system. Figure 1 shows a partial 3D reconstruction result using the egomotion and depth estimated by our method.

Our neural networks are trained and deployed on an NVIDIA TITAN X GPU with 12 GB memory. The deep residual part of our networks is a ResNet downloaded from the model zoo. We use the first images of two of the sequences as the training set. We first pre-train the neural networks with the ground truth depth, without the CRFs layer. After the pre-training stage, we add the CRFs layer and fine-tune the whole networks. Our networks are trained in an end-to-end fashion. The depth is preprocessed by normalizing with

$$\frac{\log(x) - \min(\log(x))}{\max(\log(x)) - \min(\log(x))}$$

[3, 18]. By doing this, the distribution of depth becomes similar to a Gaussian distribution. Unlike the depth prediction methods proposed in [17, 18], we do not apply superpixels in our framework.

ORB features are applied in our method. Because ORB features are sparse in images, the potential function in the same image is not involved in our depth refining process in the visual odometry thread. This accelerates the calculation of our program. Figure 3 shows that our system contains two threads: the depth generation thread and the visual odometry thread. The depth generation thread provides regressed depth values to the visual odometry thread. The egomotion obtained from the visual odometry thread is passed to the depth generation thread to get the refined depth. In order to accelerate the depth refining process in the visual odometry subsystem, only the depths of sparse feature points are refined. After obtaining the robot motion, the dense depth map can be obtained later or in a different thread.

Table 1. Evaluation on the KITTI dataset for depth estimation. Columns: Method, rel, log10, rms. Rows: Depth Transfer [12] (from [21]); Ranftl et al. (from [21]); Regressed depth with our method; Refined depth with our method.

5.1. Depth estimation

The example images for qualitative evaluations are shown in Figure 4.
The test images, ground truth depth map, regressed depth map and refined depth map are shown in Figure 4 (a), (b), (c) and (d), respectively. Only a small percentage of the image pixels have ground truth depth values, so we apply the inpainting method [16] for visualization. The refined results show that our algorithm can also remove the outliers to improve the accuracy of depth prediction. The accuracy of depth estimation has a great impact on the performance of visual odometry. When the regression results are applied directly without depth refining, the egomotion obtained by visual odometry drifts considerably.

The quantitative evaluation of depth estimation can be obtained with the following error metrics.

average relative error (rel): $\frac{1}{N} \sum_{i=1}^{N} \frac{|d_i^g - d_i|}{d_i^g}$
average log10 error (log10): $\frac{1}{N} \sum_{i=1}^{N} \left| \log_{10} d_i^g - \log_{10} d_i \right|$
root mean squared error (rms): $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (d_i^g - d_i)^2}$

where $d_i^g$ and $d_i$ are the ground truth and estimated depths indexed by $i$, and $N$ is the total number of depths to evaluate.

We compare our method with other depth estimation approaches that consider consecutive frames. In [12, 21], optical flow is applied to estimate depths from videos. The test images of those methods are not provided in their papers. We evaluate our method on depth maps generated from more than a thousand images taken from several sequences, including 8, 9 and 10. Following their experiment settings, we also evaluate the generated depth only up to the same maximum distance. Outliers are not involved in the evaluation of the refined results. The results are shown in Table 1. They indicate that our method can predict an accurate depth map for inliers.

Figure 4. Examples of depth prediction results on three frames from the KITTI dataset. For each frame, we show (a) the input color image (test image), (b) the ground truth depth (inpainted for visualization [16]), (c) the result produced by depth regression, and (d) the depth refined by our approach (red is far, and blue is close).

5.2. Monocular visual odometry

Our visual odometry method is evaluated on the KITTI dataset [8] with the metrics provided by [8]. We compare our result with other relevant visual odometry methods. Table 2 shows the translation and rotation errors in sequences 8, 9 and 10, because the results of these sequences are reported and available for comparison [4, 24]. Our method obtains a much better result than P-CNN VO [4], a recent method that regresses the egomotion from optical flow using deep learning. Compared with Song's and VISO2-M+GP's results in translation [24], our method achieves better performance in sequences 8 and 10. Our method does not perform well in sequence 9 because the scene is unstructured and contains many trees and bushes around the vehicle. This experiment illustrates that the accuracy of depth estimation is important for our scale recovery system. The regressed depths are not accurate enough if the environment lacks man-made references such as houses and vehicles. If we train the networks with more data or apply different neural networks for different scenes, our method has the potential to become more accurate and robust.

Figures 5 and 7 show the reconstructed trajectories and the errors in translation and rotation of VISO-M (monocular), VISO-S (stereo) [9] and our method. Compared with VISO-S, we obtain comparable performance in translation on sequences 8 and 10. Figure 5(a) shows that our method is better than VISO-S for long distance driving. Our method is also clearly better than the VISO-M method. As for rotation estimation, our result is much better than the other methods; our rotation calculation is based on that of the ORB-SLAM method.

We also compare the performance of our method with ORB-M (monocular) SLAM, as shown in Figure 6. The scales are obtained by dividing the moving distances of consecutive frames by the ground truth. The expected value would be 1. Zero means the track is lost, that is, the egomotion cannot be calculated. From the results, we find that our method keeps the scale from drifting over long distance motion (about 3 km in sequence 8). ORB-M performs very well in calculating rotation. On the other hand, it faces a large scale drift problem. Loop closure detection in ORB-SLAM can eliminate the scale drift, but if the loop closure detection fails, the scale is not corrected well. Unlike the ORB-SLAM method, we do not include loop closure detection and loop correction in our algorithm. Our method can also complement other monocular odometry methods that face the scale drift problem.

6. Conclusion

We present a novel scale recovery method for monocular visual odometry. The scale of translation is obtained using depth predicted from images, and the depth is predicted with convolutional neural fields. The performance of depth prediction is improved by incorporating consecutive frames and egomotion into our networks. The advantage of our method is that it can recover the scale and eliminate the scale drift using the structural information of the whole environment rather than a fixed reference plane. Experiments are conducted on the KITTI dataset to verify the effectiveness of our method. The experimental results show that our algorithm can improve the accuracy on both visual odometry and depth estimation tasks.

Acknowledgements. We thank the anonymous reviewers for their valuable comments. This work is supported in part by the National Natural Science Foundation of China (67326, 66733), the Fundamental Research Funds for the Central Universities, and the Basic Research Project of Shanghai Science and Technology Commission (6JC42, 6DZ293).

Figure 5. Reconstructed trajectories of sequences 8, 9, and 10 from the odometry benchmark of the KITTI dataset. We compare our method with VISO2-M (monocular) and VISO2-S (stereo). Panels: (a) Sequence 8, (b) Sequence 9, (c) Sequence 10. Axes: x [m], z [m]. Curves: Ground Truth, VISO2-S, VISO2-M, Our Method, Sequence Start.

Table 2. Comparison of translation and rotation errors for our method versus some monocular and stereo visual odometry methods on the KITTI benchmark. Columns: Seq, then Trans (%) and Rot (deg/m) for each of VISO2-M (from [24]), VISO2-M+GP (from [24]), VISO2-Stereo (from [24]), P-CNN VO (from [4]), Song et al.'s (from [24]) and our method. Rows: one per sequence, plus Avg.

Figure 6. Scale recovery results in sequences 8, 9, and 10 from the odometry benchmark of the KITTI dataset. We compare our method with ORB-M (monocular). ORB-M loses tracking partway through sequence 9. Panels: (a) Sequence 8, (b) Sequence 9, (c) Sequence 10. Vertical axis: Scale. Curves: ORB-M, Our Method.

Figure 7. Visual odometry results on the KITTI benchmark, for rotation and translation errors over various distances. Panels: (a), (c), (e) translation error [%] across distances; (b), (d), (f) rotation error across distances.

References

[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. D. Reid, and J. J. Leonard. Past, present, and future of simultaneous localization and mapping: Towards the robust-perception age. 2016.
[2] S. Choi, J. H. Joung, W. Yu, and J. I. Cho. What does ground tell us? Monocular visual odometry under planar motion constraint. In International Conference on Control, Automation and Systems, pages 1480-1485, Oct 2011.
[3] J. Civera, A. J. Davison, and J. M. Montiel. Inverse depth parametrization for monocular SLAM. IEEE Transactions on Robotics, 24(5):932-945, 2008.
[4] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robotics and Automation Letters, 1(1):18-25, 2016.
[5] T. Dang, C. Hoffmann, and C. Stiller. Continuous stereo self-calibration by camera parameter tracking. IEEE Transactions on Image Processing, 18(7):1536-1550, 2009.
[6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of Advances in Neural Information Processing Systems, 2014.
[8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition, 2012.
[9] A. Geiger, J. Ziegler, and C. Stiller. StereoScan: Dense 3D reconstruction in real-time. In Intelligent Vehicles Symposium, 2011.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
[11] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695-709, 2005.
[12] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2144-2158, 2014.
[13] C. Kerl, J. Sturm, and D. Cremers. Robust odometry estimation for RGB-D cameras. In IEEE International Conference on Robotics and Automation. IEEE, 2013.
[14] K. L. Lange, R. J. Little, and J. M. Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881-896, 1989.
[15] B. Lee, K. Daniilidis, and D. D. Lee. Online self-supervised monocular visual odometry for ground vehicles. In 2015 IEEE International Conference on Robotics and Automation. IEEE, 2015.
[16] A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. In ACM Transactions on Graphics, volume 23. ACM, 2004.
[17] B. Li, C. Shen, Y. Dai, A. V. D. Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. pages 1119-1127, 2015.
[18] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10), Oct 2016.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[20] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147-1163, Oct 2015.
[21] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. 2016.
[22] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824-840, 2009.
[23] D. Scaramuzza and F. Fraundorfer. Visual odometry. IEEE Robotics & Automation Magazine, 18(4):80-92, 2011.
[24] S. Song, M. Chandraker, and C. C. Guest. High accuracy monocular SFM and scale correction for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):730-743, April 2016.
