3D Scene Mesh From CNN Depth Predictions And Sparse Monocular SLAM


Tomoyuki Mukasa, Jiu Xu, Bjorn Stenger
Rakuten Institute of Technology, Tokyo, Japan

Abstract

In this paper, we propose a novel framework for integrating geometrical measurements of monocular visual simultaneous localization and mapping (SLAM) and depth prediction using a convolutional neural network (CNN). In our framework, SLAM-measured sparse features and CNN-predicted dense depth maps are fused to obtain a more accurate dense 3D reconstruction including scale. We continuously update an initial 3D mesh by integrating accurately tracked sparse feature points. Compared to prior work on integrating SLAM and CNN estimates [26], there are two main differences: Using a 3D mesh representation allows as-rigid-as-possible update transformations. We further propose a system architecture suitable for mobile devices, where feature tracking and CNN-based depth prediction modules are separated, and only the former is run on the device. We evaluate the framework by comparing the 3D reconstruction result with 3D measurements obtained using an RGBD sensor, showing a reduction in the mean residual error of 38% compared to CNN-based depth map prediction alone.

1. Introduction

Computer vision has long been successfully employed to track camera motion and reconstruct 3D structure from image sequences. These methods have been applied, e.g., in the visual effects industry [6], the robot vision community [4], and various types of 3D reconstruction, from a large scale [1] to a small scale on a mobile device [14]. In the past decade, mobile applications for augmented reality (AR) and mixed reality (MR) have become ubiquitous. Since most current mobile devices come with a single (back-facing) camera, these applications rely on monocular visual SLAM to recover camera pose [5, 20].

Visual SLAM estimates depth from small-baseline stereo matching over pairs of nearby frames. This assumes that the camera translates in space over time, so that pairs of consecutive frames are equivalent to the pairs of frames captured using a stereo rig. Traditionally, visual SLAM has relied on 2D feature matching due to its efficiency and robustness in scenes with sufficient texture [5, 20]. Recent methods that employ contour information or the whole image have been shown to add robustness, but come with higher computational cost [9, 11, 22].

Figure 1. 3D Scene Mesh. Result of our method, reconstructed from a CNN-predicted mesh, deformed using 3D points obtained by SLAM, which are indicated as light green dots.

A known limitation of monocular SLAM is that it cannot estimate the scale of the scene. This can be estimated by including additional information, such as data from an inertial measurement unit (IMU) [19], or prior knowledge about object sizes [29]. Another shortcoming is that monocular visual SLAM is ill-conditioned for certain camera motions like rotation without translation.

Recently, neural networks have been shown to provide good predictions of geometry, i.e., depth and normals, from a given input image [7, 8, 15]. An end-to-end trained CNN is able to predict geometry densely, even for less textured areas. Unlike the geometry from monocular SLAM, the depth map includes an absolute scale, as learned from the training examples. One drawback of current methods is that occluding boundary regions tend to be overly smooth and shape details are lost. Recent work on combining 3D SLAM measurements and depth predictions from a CNN has shown that both sources can complement each other [26], see Table 1.

In this paper we propose a framework to fuse monocular SLAM with CNN-based depth predictions in a new way: Feature point-based ORB-SLAM is run on the mobile device, yielding sparse but accurate 3D points. Depth maps are asynchronously predicted on a server and converted to a mesh representation. The mesh is then deformed, in an as-rigid-as-possible manner, using the sparse but accurate feature points, and the updated mesh is sent to the device. This approach corrects coarse global geometric errors and reintroduces some shape details, see Figure 1. We evaluate the method on challenging office scenes, comparing the result with depth-sensor ground truth.

Figure 2. System Overview. Monocular SLAM is run on the mobile device. Images and 3D points are sent to the server side, where CNN depth prediction and surface mesh deformation are carried out, and the result is sent back to the client for visualization. (Diagram labels. Client side: image capturing and visualization thread, 2D tracking thread and 3D mapping thread of monocular visual SLAM, operating on key-frames at times t, t+1, ..., t+4. Server side: depth prediction thread running the CNN, and depth fusion thread performing surface mesh deformation.)

2. Related work

We review three areas of related work: monocular visual SLAM, CNN-based depth prediction, and surface mesh deformation.

Monocular visual SLAM can be classified into two categories [30], feature-based [13, 20] and direct approaches [9, 22]. One of the state-of-the-art methods in the feature-based category is ORB-SLAM [20]. The method extracts sparse ORB features and reconstructs them in 3D using bundle adjustment and pose graph optimization. In contrast, direct methods carry out pose estimation using all image pixels.
The first real-time method in this category was Dense Tracking and Mapping (DTAM) [22]. Since processing every pixel is computationally more expensive, DTAM achieved real-time performance using a GPU. Engel et al. [9] proposed Large-Scale Direct Monocular SLAM (LSD-SLAM), which runs in real-time on a CPU. The method estimates depth at pixels near image boundaries and recovers a semi-dense map. Apart from higher computational complexity, direct matching tends to work better for short baselines even with motion blur, while the invariance property of feature-based approaches allows large viewpoint changes. Engel et al. [10] proposed Semi-dense Visual Odometry (SVO), a hybrid between feature-based and direct SLAM methods, using a combination of direct methods to establish feature correspondences and feature-based methods to refine the camera pose estimates.

Depth prediction from single images has been a long-standing research problem, and deep learning methods have been shown to exceed methods using hand-crafted features in terms of accuracy [7, 8, 15, 17, 18, 27]. Recent methods combine CNN-based depth predictions with visual SLAM. Laina et al. [15] proposed a fully convolutional architecture and residual learning to predict depth maps from images. In their evaluation, the predicted depth maps were input to Keller's Point-Based Fusion RGB-D SLAM algorithm [12]. The estimated 3D geometry lacks some shape detail because of blurred regions in the predicted depth maps. Recently, Tateno et al. [26] proposed CNN-SLAM, in which predicted depth and normal maps are fused with direct monocular SLAM inspired by LSD-SLAM.

Depth fusion is an important process for reconstructing accurate and complete 3D shape from depth maps. Curless and Levoy proposed averaging truncated signed distance functions (TSDF) for depth fusion [3], which is simple yet effective and is used in a large number of reconstruction pipelines, including KinectFusion [21].

Mesh deformation techniques are widely used in graphics and vision. In particular, linear variational mesh deformation techniques were developed for editing detailed high-resolution meshes, like those produced by scanning real-world objects [2]. For local detail preservation, mesh deformations that are locally as-rigid-as-possible (ARAP) have been proposed. The ARAP method by Sorkine and Alexa [25] optimizes rigid transformations in 1-ring neighborhoods ("cells"), maintaining consistency between adjacent pairs of rigid transformations by single overlapping edges. Levi and Gotsman [16] introduced the SR-ARAP energy formulation, in which the rotations of local neighborhoods on the mesh are constrained to be similar to those of their neighbors, which enhances the smoothness of the ARAP method. In this work we convert depth maps to surface meshes and employ ARAP deformations using geometric constraints.

Table 1. Properties of individual reconstruction methods and of their combination, which retains desirable properties of each.

| Method | 3D reconstruction | Computational complexity | Accuracy | Scale |
|---|---|---|---|---|
| Monocular visual SLAM (feature based) | Sparse (scene complexity dependent) | Low (runs on mobile device) | High | None |
| CNN-based depth prediction | Dense (estimated for each pixel) | High (a few seconds for each frame) | Medium (training-data dependent) | Available |
| Proposed framework | Dense (estimated for each pixel) | High (but only visual SLAM runs on mobile device) | High | Available |

3. Depth fusion by geometric constraints

Our framework consists of three parts: monocular visual SLAM, CNN-based depth prediction, and surface mesh deformation, which fuses depth maps under geometric constraints generated by the SLAM process. Figure 2 shows the pipeline of our framework.
For the implementation we use a client-server design, where feature-based monocular SLAM runs on the mobile device, and distinctive key-frames together with camera poses and 2D and 3D feature coordinates are sent to the server. On the server side, a depth map is predicted for each key-frame and converted to a surface mesh. Finally, the surface meshes are deformed using 3D features as geometric constraints, and fused into a 3D reconstruction. Updates of the 3D reconstruction are returned to the client, where it is visualized for the current camera position. In the following subsections, we detail each stage of the framework: the SLAM process in Section 3.1, depth prediction in Section 3.2, and surface mesh deformation in Section 3.3.

3.1. Monocular visual SLAM

Although our framework is compatible with any type of feature-based monocular visual SLAM method, we employ ORB-SLAM [20] because of its robustness and accuracy. ORB-SLAM incorporates three parallel threads: tracking, mapping and loop closing. The tracking thread is in charge of localizing the camera in every frame and deciding when to insert a new key-frame. The mapping thread processes new key-frames and performs local bundle adjustment for reconstruction. The loop closing thread searches for loops with every new key-frame. Each key-frame $k_t$ is associated with the camera pose $T_{k_t}$ at time $t$, the locations of ORB features $p^{2D}(t)$ and the corresponding 3D map points $p^{3D}(t)$. Note that $T_{k_t}$ and $p^{3D}(t)$ are defined in the map coordinates, which lack absolute scale.

3.2. CNN-based depth prediction

For depth prediction, we use the state-of-the-art architecture proposed in [7]. When a new key-frame is created and sent to the server side, a depth map is predicted by the CNN. The CNN of [7] is a three-step multi-scale network that predicts the structure of the scene, taking context into account by including pooling and convolution layers with different stride and kernel sizes. The network is trained using an element-wise L2 loss that explicitly accounts for depth relations between pixel locations, in addition to the point-wise error.
The loss is defined as:

$$L_{depth}(D, D^*) = \frac{1}{n}\sum_i d_i^2 - \frac{1}{2n^2}\Big(\sum_i d_i\Big)^2 + \frac{1}{n}\sum_i \big[(\nabla_x d_i)^2 + (\nabla_y d_i)^2\big], \tag{1}$$

where $D$ and $D^*$ are the predicted depth and the ground truth depth, respectively, $d = D - D^*$, and the sums run over the $n$ pixels $i$.

After computing the depth map $D_t$ of key-frame $k_t$, we convert it to a point cloud in which points correspond to pixels in the map, and to a surface mesh $S^{cam}_t$ defined in camera coordinates at time $t$. Surface mesh $S^{cam}_t$ is fused with other meshes to form a unique 3D reconstruction in the next step. In this deformation process, the mesh needs to be watertight to avoid mesh corruption caused by local disparities of the deformation force. We simply define edges between vertices based on the pixel connectivity in the map.

3.3. Mesh deformation for depth fusion

We fuse surface mesh $S^{cam}_t$ into a unique 3D reconstruction in world coordinates based on map points $p^{3D}(t)$ defined in map coordinates. We first convert $S^{cam}_t$ to $S^{map}_t$, defined in map coordinates, by reprojecting each vertex using the associated camera pose $T_{k_t}$ recorded in map coordinates. Secondly, we scale $S^{cam}_t$ by minimizing the distance between map points $p^{3D}(t)$ and the corresponding vertices $v_i$ in $S^{cam}_t$ as follows:

$$s^{map}_t = \arg\min_s \sum_i \big\| s\, v_i - p^{3D}_i(t) \big\|^2, \tag{2}$$

where $s^{map}_t$ is the scale factor for $S^{cam}_t$ and the sum runs over the vertex/map-point correspondences. The correspondences $f: v_i \mapsto p^{3D}(t)$ can be easily found by projecting a ray from the camera center to each map point and finding the vertex nearest to its intersection with the mesh (see Figure 3).

Figure 3. Correspondences between 3D map points and mesh vertices.

Our mesh deformation is inspired by the as-rigid-as-possible (ARAP) transformations proposed in [25]. We use the set of map points $p^{3D}(t)$ as the geometric constraint of the deformation and define the one-ring neighborhood of each vertex. Ideally the deformation seeks to keep the transformation for the surface in each local neighborhood as rigid as possible. Overlap of local neighborhoods is necessary to avoid surface stretching or shearing at the boundary of the local neighborhoods. Using the local neighborhood concept, we can define the following energy function for the local neighborhood $C_i$, corresponding to vertex $v_i$, and its deformed version $C'_i$:

$$E(C_i) = \sum_{v_j \in C_i} w_{ij} \big\| (v'_i - v'_j) - R_i (v_i - v_j) \big\|^2, \tag{3}$$

where $R_i$ is a $3 \times 3$ rotation matrix and $w_{ij}$ denotes the cotangent weight:

$$w_{ij} = \frac{1}{2}\big(\cot \alpha_{ij} + \cot \beta_{ij}\big), \tag{4}$$

where $\alpha_{ij}$, $\beta_{ij}$ are the angles opposite the mesh edge $(i, j)$. We define the energy function of the whole mesh by summing the deviations from rigidity per local neighborhood as follows:

$$E(S_t) = \sum_{v_i \in S_t} w_i\, E(C_i), \tag{5}$$

where $w_i$ is a weight for the local neighborhood $C_i$.
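The conversion of a depth map into a mesh by pixel connectivity, and the scale fit of Equation 2, can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes a pinhole camera whose intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical parameters, and that corresponding vertices and map points are already paired row by row and expressed relative to the camera center. The function names are likewise illustrative.

```python
import numpy as np


def depth_to_mesh(depth, fx, fy, cx, cy):
    """Back-project a depth map and triangulate it by pixel connectivity.

    Assumes a pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    Returns vertices of shape (H*W, 3) and faces of shape (2*(H-1)*(W-1), 3).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Each 2x2 block of neighboring pixels yields two triangles.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([tl, bl, tr], axis=1),
                            np.stack([tr, bl, br], axis=1)])
    return vertices, faces


def estimate_scale(vertices, map_points):
    """Closed-form minimizer of sum_i ||s * v_i - p_i||^2 over s.

    Setting the derivative with respect to s to zero gives
    s = sum_i (v_i . p_i) / sum_i (v_i . v_i).
    Row i of `vertices` is assumed to correspond to row i of `map_points`.
    """
    return float(np.sum(vertices * map_points) / np.sum(vertices * vertices))
```

Because the objective is quadratic in the single unknown $s$, no iterative solver is needed. In practice only the vertices matched to SLAM map points would be passed to `estimate_scale`.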
We extend the ARAP method by defining $w_i$ based on the normal vector corresponding to the local neighborhood of each vertex. As we define $S_t$ as a watertight surface mesh, and its corresponding depth map is generated from a single camera viewpoint, there are areas in which the normals are nearly perpendicular to the ray from the camera center, i.e., the observability from the camera is low. These areas tend to be boundary areas between objects in the scene, and do not correspond to objects in the actual scene. To maintain shape details of the objects in the scene, we selectively deform these areas as much as possible by defining the weight $w_i$ by a sigmoid function as follows:

$$w_i = \frac{1}{1 + e^{-a(x + b\pi)}}, \tag{6}$$

where $x$ denotes the angle between the normal of the local neighborhood and the ray from the camera center to the vertex, and $a$ and $b$ are empirically defined parameters. Figure 4 shows the distribution of $w_i$ on a surface mesh.

We further introduce a bending factor $B_{ij}$, as suggested in [16], as follows:

$$B_{ij} = \alpha A \,\| R_i - R_j \|, \tag{7}$$

where $\alpha$ is a weighting coefficient and $A$ is the surface area for scaling invariance, and finally update Equation 5 as follows:

$$E(S_t) = \sum_{v_i \in S_t} w_i\, E(C_i) + B_{ij}. \tag{8}$$

After this deformation, we scale the deformed mesh by the absolute scale $s^{world}_t$ estimated for time $t$ as follows:

$$s^{world}_t = \ldots\, s^{cam}_t. \tag{9}$$

The scaled 3D mesh is sent to the client, and rendered from the current camera position or any other viewpoint specified by the user.
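The per-vertex weight of Equation 6 can be sketched as below. This is an illustration under stated assumptions, not the authors' code. It assumes the conventional sigmoid with a negative exponent, and normals oriented toward the camera on visible surfaces, so that $x$ is close to $\pi$ for surfaces facing the camera and close to $\pi/2$ at grazing angles. The values of `a` and `b` are placeholders, since the paper only states that they are set empirically.

```python
import numpy as np


def rigidity_weights(vertices, normals, camera_center, a=10.0, b=-0.75):
    """Per-vertex weight w_i = 1 / (1 + exp(-a * (x_i + b * pi))).

    x_i is the angle between the normal at vertex i and the ray from the
    camera center to that vertex. a and b are placeholder values, not the
    paper's. Normals are assumed to point toward the camera on visible
    surfaces.
    """
    rays = vertices - camera_center
    rays = rays / np.linalg.norm(rays, axis=1, keepdims=True)
    unit_normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    # Clip guards arccos against rounding slightly outside [-1, 1].
    cos_x = np.clip(np.sum(rays * unit_normals, axis=1), -1.0, 1.0)
    x = np.arccos(cos_x)
    return 1.0 / (1.0 + np.exp(-a * (x + b * np.pi)))
```

With these placeholder parameters, a vertex whose normal faces the camera gets a weight near 1 (its neighborhood is kept rigid), while a vertex seen at a grazing angle gets a weight near 0 and is therefore free to deform.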


Figure 4. (Top) Distribution of weights $w_i$ for the deformation and (bottom) the corresponding textured mesh. Larger intensity values in the top figure indicate higher weights.

4. Experiments

We evaluate our framework quantitatively by comparing the 3D reconstruction result with ground truth data. For the acquisition of the ground truth data and the input images, we used a tablet equipped with an RGB-D Structure Sensor [23]. The sensor captures RGB images at VGA resolution and depth images at QVGA resolution, and is able to capture within a range of 40 to 350 cm distance to the camera (see Figure 5). Our CNN architecture is trained on the NYU Depth Dataset v2 [24], and thus performs well on typical indoor scenes. Our framework is designed for the case in which the CNN-predicted depth map is inaccurate. We therefore captured a new, challenging office dataset with many poorly textured surfaces.

We fed the image sequences into the proposed framework. ORB-SLAM runs in real-time (5 to 10 fps) on the client, specifically an iPhone 6 with an A8 processor and an 8 mega-pixel camera. The other components, CNN depth estimation and mesh adaptation, are carried out on the server, a PC with an Intel Xeon dual core CPU at 2.4 GHz, 96 GB of RAM and an Nvidia GeForce GTX Titan X GPU with 12 GB of VRAM. To adjust to the original implementation of the CNN and increase speed, both input RGB images and estimated depth images are resized. The average processing time of the depth prediction for each key-frame is 2.6 seconds. This is longer than the duration between key-frames detected by ORB-SLAM, because these are selected based on visual changes. We therefore filter the key-frames using a spatio-temporal distance criterion, similar to other feature-based approaches such as PTAM, and send the selected ones to the server. The key-frames are processed on the server, and the depth image for each frame is estimated by the CNN architecture.
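A spatio-temporal key-frame filter of the kind described above could look like the following sketch. The exact criterion is not given in the text, so everything here is an assumption: the `KeyFrame` record, the rule that both a time gap and a camera displacement must be exceeded, and the threshold values. The default time gap is set to the reported 2.6 s average prediction time only for illustration, and the distance threshold is in unscaled map units.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class KeyFrame:
    """Hypothetical key-frame record; field names are illustrative."""
    timestamp: float       # capture time in seconds
    position: np.ndarray   # camera center in map coordinates, shape (3,)


def select_keyframes(keyframes, min_dt=2.6, min_dist=0.1):
    """Keep a key-frame only if it is far enough in time AND space from
    the last key-frame that was kept. Thresholds are assumed values."""
    kept = []
    for kf in keyframes:
        if not kept:
            kept.append(kf)  # always keep the first key-frame
            continue
        last = kept[-1]
        dt = kf.timestamp - last.timestamp
        dist = float(np.linalg.norm(kf.position - last.position))
        if dt >= min_dt and dist >= min_dist:
            kept.append(kf)
    return kept
```

Comparing against the last kept key-frame, rather than the previous candidate, keeps the server from receiving bursts of near-identical frames while it is still busy with an earlier depth prediction.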
In the fusion process, we convert the depth images to a refined mesh sequence, as shown at the bottom of Figure 5. Likewise, we build a corresponding ground truth mesh sequence from the raw depth maps captured by the depth sensor. We compute residual errors between the refined mesh and the ground truth, as shown in Table 2 and Figure 6. We can observe that our framework efficiently reduces the residual errors for all sequences. Both the average and the median of the residual errors fall to between about two thirds and one half of the values for the CNN-predicted mesh. We also evaluate the absolute scale estimated from depth prediction, as shown in the rightmost column of Table 2. The average error of the estimated scales for our six office scenes is 20% of the ground truth scale.

5. Conclusion

In this paper, we proposed a framework fusing the results of geometric measurement, i.e., feature-based monocular visual SLAM, and CNN-based depth prediction. We have shown its efficiency and potential for applications that run on standard mobile devices equipped with only a single camera. Thanks to the capability of CNNs for depth prediction, some of the main limitations of feature-based monocular visual SLAM, such as the lack of absolute scale and the sparse 3D reconstruction, were overcome. The 3D map acquired by monocular visual SLAM in turn compensates for the limitations of CNN-based depth maps, refining them by surface mesh deformation to maintain shape details.

There are several possible directions for future work. The first is global mesh refinement and integration based on the photometric and geometric consistency between meshes, to obtain a unified reconstruction result similar to [28]. As the current deformation is constrained only by sparse 3D features, refined meshes are not fully registered with each other, as shown at the bottom of Figure 5. The second direction is IMU-based scale estimation. As the absolute scale in the current framework is predicted by the CNN and highly depends on the training data, we expect that its accuracy can be enhanced by fusing it with IMU measurements. Another direction is fuller use of CNN-based prediction, e.g., semantic labeling. By utilizing semantic labeling, we can selectively manipulate the 3D reconstruction result. For instance, we can recognize real furniture in the reconstruction and replace it with a virtual one.

Figure 5. Input data for our depth fusion and the reconstructed scenes. From top to bottom row: color images, feature tracking result of SLAM, corresponding ground truth depth images, depth images estimated by DNN, and 3D reconstruction results on six office scenes (Sofa area 1, Sofa area 2, Sofa area 3, Desk area 1, Desk area 2, Meeting room), respectively.

Table 2. Accuracy improvement results. Comparison of residual errors [cm] from the ground truth obtained using a depth sensor, and the absolute scale estimated based on depth prediction.
Columns: Scene | Mesh from CNN depth map (Mean, Median, Std dev) | Refined mesh by our method (Mean, Median, Std dev) | Scale
Rows: Sofa area 1, Sofa area 2, Sofa area 3, Desk area 1, Desk area 2, Meeting room

Figure 6. Surface mesh deformation result. (Top left) ground truth surface mesh obtained from depth-sensor measurements, (top center) surface mesh converted from the CNN-predicted depth map, (top right) deformation result of the center mesh using 3D SLAM features. (Bottom left) 3D SLAM features together with the ground truth mesh, (bottom center and right) residual errors of the CNN-predicted mesh and the deformed mesh. Warm colors indicate higher residual error. Black regions indicate areas beyond the maximum depth sensor range.

References

[1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building Rome in a day. Commun. ACM, 54(10), Oct.
[2] M. Botsch and O. Sorkine. On linear variational surface deformation methods. IEEE Transactions on Visualization and Computer Graphics, 14(1), Jan.
[3] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 96, New York, NY, USA. ACM.
[4] A. J. Davison and D. W. Murray. Mobile robot localisation using active vision. Springer Berlin Heidelberg, Berlin, Heidelberg.
[5] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. 29(6).
[6] T. Dobbert. Matchmoving: The Invisible Art of Camera Tracking. John Wiley and Sons.
[7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. CoRR.
[8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. CoRR, abs/1406.2283, 2014.
[9] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, September.
[10] J. Engel, J. Sturm, and D. Cremers. Semi-dense visual odometry for a monocular camera. In ICCV, Sydney, Australia, December.
[11] M. Hofer, M. Maurer, and H. Bischof. Efficient 3D scene abstraction using line segments. CVIU, 2016.
[12] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In 3DV, pages 1-8, Washington, DC, USA. IEEE Computer Society.
[13] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In ISMAR, pages 1-10, Washington, DC, USA, 2007. IEEE Computer Society.
[14] K. Kolev, P. Tanskanen, P. Speciale, and M. Pollefeys. Turning mobile phones into 3D scanners. In CVPR.
[15] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. CoRR.
[16] Z. Levi and C. Gotsman. Smooth rotation enhanced as-rigid-as-possible mesh animation. IEEE Transactions on Visualization and Computer Graphics, 21.
[17] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.
[18] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. CoRR.
[19] S. Lynen, M. Achtelik, S. Weiss, M. Chli, and R. Siegwart. A robust and modular multi-sensor fusion approach applied to MAV navigation. In Proc. of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS).
[20] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5), Oct.
[21] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality, Oct.
[22] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, Nov.
[23] Occipital. v0.6.2.
[24] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV.
[25] O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Proc. Fifth Eurographics Symp. Geometry Processing, SGP 07.
[26] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. CoRR.
[27] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille. Towards unified depth and semantic prediction from a single image. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.
[28] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-time dense SLAM and light source estimation. Int. J. Robotics Research, 35(14).
[29] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR.
[30] G. Younes, D. C. Asmar, and E. A. Shammas. A survey on non-filter-based monocular visual SLAM systems. CoRR.
