Deep Incremental Scene Understanding. Federico Tombari & Christian Rupprecht Technical University of Munich, Germany

Scene Understanding and SLAM
Scene understanding with deep learning is typically performed frame-wise [Couprie14].
SLAM from RGB-D data allows real-time scene reconstruction [Izadi11].
Can we fuse the two while remaining real-time?

Beyond SLAM: fusing reconstruction with scene understanding
Fusing multiple viewpoints over time improves semantic perception and object pose estimation.
Examples: SLAM++ [Salas-Moreno13] and Incremental Scene Understanding on Dense SLAM [Li16].

Incremental 3D Segmentation
Real-time segmentation of the SLAM reconstruction [Tateno15], with constant complexity with respect to the size of the reconstruction.

Real-time also on Google Tango.

What if a depth sensor is not available? Is semantic mapping/incremental scene understanding still possible from a single RGB camera?

Monocular SLAM state of the art
Feature-based: ORB-SLAM [Mur-Artal15]
Direct: LSD-SLAM [Engel14]
Main limitations: not dense on texture-less regions, no pure rotational motions, no absolute scale.
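The "no absolute scale" limitation can be illustrated with a pinhole camera: scaling the scene and the camera translation by the same factor leaves every pixel unchanged, so images alone cannot reveal the scale. The following is a minimal sketch; the intrinsics `K`, the motion `R`, `t`, the point `X`, and the helper `project` are hypothetical values and names chosen for illustration, not taken from the slides.

```python
import numpy as np

def project(point_world, rotation, translation, intrinsics):
    """Pinhole projection of a 3D point to pixel coordinates."""
    point_cam = rotation @ point_world + translation
    homogeneous = intrinsics @ point_cam
    return homogeneous[:2] / homogeneous[2]

# Hypothetical intrinsics, camera motion, and scene point (illustration only).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])
X = np.array([0.5, -0.2, 2.0])

# Scale the scene and the translation by the same factor.
scale = 3.0
pixel_original = project(X, R, t, K)
pixel_scaled = project(scale * X, R, scale * t, K)

# Both configurations produce the same pixel, so the scale is unobservable.
print(pixel_original, pixel_scaled)
```

The common factor cancels in the perspective division, which is why a monocular system needs an external cue (here, learned depth) to fix the metric scale.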

Depth prediction with CNNs
Goal: use a CNN to predict a dense depth map from a single RGB image.
[Figure: RGB image, depth ground truth (Kinect), depth prediction.]
An alternative to monocular SLAM?

FC ResNet with UpProjections [Laina16]
[Figure: CNN architecture. ResNet-50 residual blocks with average pooling (avg pool) and a fully connected (FC) layer.]
Memory limitations. Restriction of full connections: high-dimensional outputs can produce billions of parameters.
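To make the "billions of parameters" remark concrete, here is a small back-of-the-envelope count. The feature-map and output sizes below are assumptions chosen for illustration (the slides do not state them), and the convolution used for comparison is likewise hypothetical.

```python
# Illustrative sizes (assumptions, not taken from the slides).
feature_channels, feature_h, feature_w = 2048, 8, 10  # final feature maps
output_h, output_w = 128, 160                         # dense depth map

# A fully connected layer links every input value to every output pixel.
fc_inputs = feature_channels * feature_h * feature_w
fc_outputs = output_h * output_w
fc_weights = fc_inputs * fc_outputs

# Hypothetical 5x5 convolution mapping 2048 channels to 1024 channels.
conv_weights = 5 * 5 * feature_channels * 1024

print(f"fully connected weights: {fc_weights:,}")
print(f"5x5 convolution weights: {conv_weights:,}")
```

Under these assumptions the fully connected layer alone needs about 3.4 billion weights, while the convolution needs about 52 million, which is the motivation for keeping the network fully convolutional.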

Issues with the average pooling and FC layers [Laina16]: difficult convergence, blurry predictions, and a need for bigger datasets.
[Figure: ground truth vs. prediction.]

Fully convolutional ResNet with progressive up-sampling [Laina16].
[Figure: CNN architecture with residual blocks.]


Multi-task FC ResNet
[Figure: RGB input, depth ground truth (Kinect), depth prediction, 4-class semantic segmentation, 40-class semantic segmentation (RGB only), 40-class semantic segmentation (RGB + depth prediction).]

Monocular SLAM and CNN depth prediction are complementary
Monocular SLAM: accurate on depth borders but sparse.
CNN depth prediction: dense but imprecise along depth borders.
CNN-SLAM [Tateno17] takes the best of both worlds by fusing monocular SLAM with depth prediction in real time.
Benefits of adding depth prediction:
1. can learn the absolute scale
2. dense maps
3. can deal with pure rotational motion

CNN-SLAM framework
[Figure: CNN-SLAM pipeline with blocks Input RGB Image, CNN Depth Prediction, CNN Semantic Segmentation, Key-frame Initialization, Pose Graph Optimization, Camera Pose Estimation, Frame-wise Depth Refinement, and Global Map and Semantic Label Fusion; stages are marked as running at every key-frame or at every input frame.]
The camera pose is estimated via a direct method at each new frame.
The map is a set of key-frames, each associated with a depth map.
Each key-frame depth map $D_{k_i}$ is
1. initialized via the Fully Convolutional ResNet [Laina16]
2. refined with depth values $D_t$ estimated via short-baseline stereo matching [Engel14], weighted by the associated uncertainties $U_{k_i}$ and $U_t$:
$$D_{k_i}(u) = \frac{U_t(u)\, D_{k_i}(u) + U_{k_i}(u)\, D_t(u)}{U_{k_i}(u) + U_t(u)}$$
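The uncertainty-weighted refinement can be sketched per pixel with NumPy. This is a minimal illustration of the weighted average stated on the slide, not the authors' implementation; the function name `fuse_depth` and the sample values are hypothetical.

```python
import numpy as np

def fuse_depth(depth_key, unc_key, depth_frame, unc_frame):
    """Per-pixel weighted average: each depth is weighted by the OTHER
    source's uncertainty, so the more certain estimate dominates."""
    return (unc_frame * depth_key + unc_key * depth_frame) / (unc_key + unc_frame)

# Hypothetical 1x2 depth maps in meters, with their uncertainties.
depth_key = np.array([[2.0, 2.0]])    # key-frame depth (e.g. from the CNN)
unc_key = np.array([[1.0, 1.0]])
depth_frame = np.array([[4.0, 4.0]])  # frame-wise stereo estimate
unc_frame = np.array([[3.0, 1.0]])

fused = fuse_depth(depth_key, unc_key, depth_frame, unc_frame)
print(fused)
```

In the first pixel the stereo estimate is three times as uncertain as the key-frame value, so the result (2.5) stays close to the key-frame depth; in the second pixel both are equally uncertain and the result is the plain average (3.0).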

Key-frame depth refinement
Key-frame depth refinement allows estimating fine structures on previously blurred surfaces.
Gradual fusion of CNN-predicted depth with monocular SLAM:
elements near intensity gradients will be more and more refined by the frame-wise depth estimates;
elements within low-textured regions will gradually hold the predicted depth value from the CNN.
[Figure: refined depth in key-frame, RGB image in key-frame, RGB image in current frame.]
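The gradual behaviour described above can be sketched by repeating the uncertainty-weighted fusion over several frames. This is a toy illustration under stated assumptions: the uncertainty update (product over sum, as in standard inverse-variance fusion) is an assumption not given on the slides, and all numeric values and the helper `fuse` are hypothetical.

```python
import numpy as np

def fuse(depth_key, unc_key, depth_frame, unc_frame):
    """One refinement step: weighted depth average plus an ASSUMED
    product-over-sum uncertainty update (inverse-variance fusion)."""
    fused_depth = (unc_frame * depth_key + unc_key * depth_frame) / (unc_key + unc_frame)
    fused_unc = (unc_key * unc_frame) / (unc_key + unc_frame)
    return fused_depth, fused_unc

# Pixel 0 lies near an intensity gradient, pixel 1 in a low-textured region.
depth_key = np.array([2.0, 2.0])      # CNN-predicted depth for both pixels
unc_key = np.array([0.5, 0.5])
depth_frame = np.array([2.5, 3.0])    # frame-wise stereo estimates
unc_frame = np.array([0.01, 100.0])   # reliable near gradients, unreliable elsewhere

for _ in range(5):  # fuse five consecutive frames
    depth_key, unc_key = fuse(depth_key, unc_key, depth_frame, unc_frame)

print(depth_key)  # pixel 0 moves to the stereo value, pixel 1 stays near the CNN value
```

With these numbers the textured pixel ends up within a few millimeters of the stereo estimate, while the low-textured pixel barely moves away from the CNN prediction.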

Qualitative results: SLAM on pure rotational motion

Qualitative results: absolute scale estimation

First demonstration of fully monocular real-time semantic mapping

Many prediction tasks are ambiguous
Many prediction tasks contain uncertainty. In some cases, uncertainty is inherent in the task itself [Rupprecht17].
Examples: What will the other driver do? What is the label for this image?

Simple example: next frame prediction (single prediction)
A square is bouncing around the frame and randomly switches color between black and white.
The CNN predicts the next frame in the sequence.
The mean of black and white is gray, which is also the background color, so the predicted frame is constant gray.

Approximations with the mean
[Figure: plot of a density p(x) over x.]
Learning the mean can lead to very unlikely solutions.
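A small numeric check of this point, assuming a squared-error loss and a target that is black (0) or white (1) with equal frequency: the best single prediction is the mean, 0.5, a value that never occurs in the data. The setup below is a hypothetical toy and not taken from the slides.

```python
import numpy as np

# Balanced toy targets: 0.0 = black, 1.0 = white.
targets = np.array([0.0, 1.0] * 500)

# Evaluate the mean squared error of every constant prediction on a grid.
candidates = np.linspace(0.0, 1.0, 101)
mse = np.array([np.mean((targets - c) ** 2) for c in candidates])

best = candidates[np.argmin(mse)]
print(best, mse.min())  # the minimizer is the mean of the targets (gray)

# Distance from the best prediction to the closest value that actually occurs.
distance_to_nearest_outcome = np.min(np.abs(targets - best))
print(distance_to_nearest_outcome)
```

The minimizer sits exactly between the two modes, so the "optimal" prediction is as far as possible from every outcome that can actually happen.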

Approximate with multiple hypotheses
A simple meta-loss transforms any model into a multiple hypothesis predictor (MHP).

Simple example: next frame prediction (prediction 1 and prediction 2)
The same network is now transformed into a multiple hypothesis model with two predictions.
It is able to separate black and white blocks for the future frame.
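The meta-loss idea and the two-prediction toy example can be sketched as follows. The slides do not spell out the loss, so this assumes a relaxed winner-takes-all form: for each sample, the hypothesis with the lowest error receives weight 1 - eps and the others share eps. The "model" is reduced to two scalar hypotheses trained by gradient descent; `mhp_weights`, `eps`, the learning rate, and the initial values are all hypothetical.

```python
import numpy as np

def mhp_weights(errors, eps=0.05):
    """Relaxed winner-takes-all weights (assumed form): 1 - eps for the best
    hypothesis of each sample, eps / (M - 1) for the remaining ones."""
    n, m = errors.shape
    weights = np.full((n, m), eps / (m - 1))
    weights[np.arange(n), errors.argmin(axis=1)] = 1.0 - eps
    return weights

# Balanced toy targets: 0.0 = black square, 1.0 = white square.
targets = np.array([0.0, 1.0] * 50)

# Two scalar hypotheses standing in for the two network outputs.
hypotheses = np.array([0.4, 0.6])

for _ in range(200):
    diff = hypotheses[None, :] - targets[:, None]       # shape (N, M)
    weights = mhp_weights(diff ** 2)                     # pick the winner per sample
    gradient = np.mean(2.0 * weights * diff, axis=0)     # gradient of weighted squared error
    hypotheses = hypotheses - 0.5 * gradient             # plain gradient descent step

print(hypotheses)  # one hypothesis settles near black, the other near white
```

Instead of collapsing to gray, the two hypotheses specialize: each one is pulled mainly by the samples it already explains best, with the small eps term keeping both trainable.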

Image Classification

Human Pose Estimation
The variance of the predictions can help detect ambiguities.
The predictions for the location of the hands vary much more than those for the shoulders.

Future Frame Prediction
With more predictions, future frames become sharper.
The model does not need to blend together all possible outcomes.

Multiple Hypothesis for Depth Prediction
[Figure: input, ground truth, hypotheses, mean, variance.]

Multiple Hypothesis Prediction for CNN-SLAM
The variance can be used to estimate confidences.
Confidences will be used as initialization for the refinement of the key-frame.
With MHP depth prediction the overall accuracy increases.
Correct pixels: 10.6% (original CNN-SLAM) vs. 36.0% (CNN-SLAM with MHP).
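A minimal sketch of how per-pixel mean and variance can be derived from a stack of depth hypotheses. The mapping from variance to confidence is an assumption made here for illustration (the slides only say the variance can be used to estimate confidences), and the hypothesis values are hypothetical.

```python
import numpy as np

# Hypothetical stack of M = 2 depth hypotheses for a 1x2 image (meters).
hypotheses = np.array([[[1.0, 2.0]],
                       [[1.0, 4.0]]])   # shape (M, H, W)

# Per-pixel statistics across the hypothesis axis.
mean_depth = hypotheses.mean(axis=0)
variance = hypotheses.var(axis=0)

# ASSUMED mapping: confidence decreases monotonically with the variance.
confidence = 1.0 / (1.0 + variance)

print(mean_depth)   # per-pixel mean of the hypotheses
print(variance)     # zero where the hypotheses agree, larger where they differ
print(confidence)
```

Pixels where the hypotheses agree get full confidence, while pixels where they disagree start with a lower confidence and can be corrected more readily by later refinement.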

Conclusion
We presented a framework for real-time scene understanding that fuses semantic segmentation and SLAM reconstruction.
Depth prediction complements monocular SLAM in low-texture regions and for global scale.
Multiple hypotheses allow for improved 3D reconstruction.
Combine deep learning with 3D computer vision to leverage the best of both worlds.

Credits (alphabetical)
Dr. Max Baust, Dr. Vasilis Belagiannis, Robert DiPietro, Prof. Greg Hager, Iro Laina, Prof. Nassir Navab, Keisuke Tateno
We gratefully acknowledge the donation from Nvidia of two GPUs that helped the development of the presented research activities.

References
[Couprie14] C. Couprie, C. Farabet, L. Najman, Y. LeCun: "Toward Real-time Indoor Semantic Segmentation Using Depth Information", JMLR, 2014
[Engel14] J. Engel et al.: "LSD-SLAM: Large-Scale Direct Monocular SLAM", ECCV, 2014
[Izadi11] S. Izadi et al.: "KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera", UIST, 2011
[Laina16] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab: "Deeper Depth Prediction using fully Convolutional Residual Networks", 3DV, 2016
[Li16] C. Li et al.: "Incremental scene understanding on dense SLAM", IROS, 2016
[Mur-Artal15] R. Mur-Artal et al.: "ORB-SLAM: A Versatile and Accurate Monocular SLAM System", IEEE Trans. Robotics, 2015
[Rupprecht17] C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, G. D. Hager: "Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses", arXiv:1612.00197, 2017
[Salas-Moreno13] R. Salas-Moreno et al.: "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects", CVPR, 2013
[Tateno15] K. Tateno, F. Tombari, N. Navab: "Real-Time and Scalable Incremental Segmentation on Dense SLAM", IROS, 2015
[Tateno17] K. Tateno, F. Tombari, I. Laina, N. Navab: "CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction", CVPR, 2017