From 3D descriptors to monocular 6D pose: what have we learned?

Transcription:

ECCV Workshop on Recovering 6D Object Pose. From 3D descriptors to monocular 6D pose: what have we learned? Federico Tombari, CAMP - TUM

Dynamic occlusion. Low latency. High accuracy, low jitter. No expensive hardware.

Features vs. templates vs. learned: feature-based matching, template matching, learned features.

10 years ago: the PR2 robot, the SHOT descriptor.

Today. Input data: mono, consumer depth, lidar / sparse. Applications: AR for smartphones, HMDs, personal robotics, industrial robotics, autonomous driving.

A new generation of cheaper, smaller, denser LIDARs? Innoviz, Luminar Technologies.

Deep learning: learned 3D descriptors. Input data: mono, consumer depth, lidar / sparse. Applications: smartphones, HMDs, personal robotics, industrial robotics, autonomous driving.

Hand-crafted 3D descriptors. Local: Spin Images [Johnson99], Shape Distributions [Osada02], 3D Shape Context [Frome04], Fast Point Feature Histogram [Rusu09], SHOT [Tombari10], Point Pair Features (PPF) [Drost10]. Global: Viewpoint Feature Histogram [Rusu10], OUR-CVFH [Aldoma12].
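Of these, the Point Pair Feature of [Drost10] is compact enough to sketch: for two oriented points it stores one distance and three angles. A minimal plain-Python illustration (function names are my own):

```python
import math

def angle(u, v):
    """Angle in radians between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def point_pair_feature(p1, n1, p2, n2):
    """4D Point Pair Feature F(p1, p2) as in [Drost10]:
    (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1."""
    d = [b - a for a, b in zip(p1, p2)]
    return (math.sqrt(sum(x * x for x in d)),
            angle(n1, d), angle(n2, d), angle(n1, n2))
```

In [Drost10] these features are quantized and hashed over the model to vote for object pose at runtime.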

3D representations and deep learning. Point clouds: unorganized, no topology. 3D mesh: unorganized, with topology. Voxel map: organized, no topology. Range (depth) map: organized, no topology. Unorganized 3D representations such as point clouds and meshes are not naturally suited to convolutions.
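The voxel map is what makes point data "organized": a regular grid on which standard 3D convolutions apply. A sketch of the conversion, with illustrative parameter names (origin, voxel size, grid dimensions):

```python
def voxelize(points, origin, voxel_size, dims):
    """Occupancy grid: unorganized point cloud -> organized voxel map.
    points: iterable of (x, y, z); dims: (nx, ny, nz)."""
    nx, ny, nz = dims
    grid = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for x, y, z in points:
        # Map each point to its voxel index; drop points outside the grid.
        i = int((x - origin[0]) / voxel_size)
        j = int((y - origin[1]) / voxel_size)
        k = int((z - origin[2]) / voxel_size)
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k] = 1
    return grid
```

The price of this organization is the resolution/memory trade-off that motivates point-based networks such as PointNet.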

Learned 3D descriptors: state of the art (method; input data; type; rotation handling; loss function):
3DMatch [Zeng17]: voxels; local; rotation handled via training; contrastive loss.
Compact Geometric Features [Khoury17]: point clouds (via histograms); local; hand-crafted LRF; triplet loss.
PPFNet [Deng18]: point clouds; local; hand-crafted LRF; N-tuple loss.
PointNet [Qi17]: point clouds; global; T-net; classification/segmentation loss.
PointNet++ [Qi17]: point clouds; global; T-net; classification/segmentation loss.
Dynamic Graph CNN [Wang18]: point clouds; global; T-net; classification/segmentation loss.
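The contrastive, triplet and N-tuple losses in this table all shape descriptor space by comparing distances between matching and non-matching pairs. A plain-Python sketch of the triplet variant (the margin value and names are illustrative):

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull matching descriptors together, push
    non-matching ones at least `margin` further away than the match."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

The contrastive loss works on pairs instead of triplets, and the N-tuple loss of PPFNet generalizes the same idea to all pairs within a batch.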

Fully Convolutional Point Network. Hybrid: unorganized input, organized internal representation and output. End-to-end, general-purpose, hierarchical learning on unordered 3D data. Processes large-scale point clouds in a single pass. D. Rethage, J. Wald, J. Sturm, N. Navab, F. Tombari, Fully Convolutional Point Networks for Large-Scale Point Clouds, ECCV 2018.

Can we learn 6D pose without a 3D sensor?

2D vs. 3D object detection and pose estimation.

Monocular 6D object pose estimation: state of the art (method; backbone network; output; pose computation/refinement):
BB-8 [Rad2017]: VGG-16; 8 corners of the projected 3D bounding box; PnP / VGG-based refinement.
[Tekin2018]: YOLO v2; 8 corners of the projected 3D bounding box + projection of the 3D centroid; PnP.
PoseCNN [Xiang2018]: VGG-16; semantic labeling + regression of the 6D pose.
Deep-6DPose [Do2018]: Mask R-CNN; object instance segmentation + regression of the 6D pose.
SSD-6D [Kehl2017]: SSD-300; viewpoint and in-plane rotation classification; contour-based refinement.

[Rad2017] Rad and Lepetit, BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth, ICCV 2017.
[Kehl2017] Kehl et al., SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again, ICCV 2017.
[Tekin2018] Tekin et al., Real-Time Seamless Single Shot 6D Object Pose Prediction, CVPR 2018.
[Xiang2018] Xiang et al., PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, RSS 2018.
[Do2018] Do et al., Deep-6DPose: Recovering 6D Object Pose from a Single RGB Image, arXiv 2018.

SSD-6D: monocular object detection and 6-DoF pose estimation [Kehl17]. Bounding box location (4 values); C: probability for each class. Single Shot Detector (SSD) network from [Liu16].

SSD-6D: monocular object detection and 6-DoF pose estimation [Kehl17]. Bounding box location (4 values); C: probability for each class; V: probability for each viewpoint; R: probability for each in-plane rotation. W. Kehl, F. Manhardt, F. Tombari, S. Ilic, N. Navab, SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again, ICCV 2017.

From pose regression to pose classification.
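The scores on this slide are classification outputs: instead of regressing a continuous rotation, SSD-6D scores discrete viewpoint and in-plane rotation bins and decodes the winners into a pose hypothesis. A minimal decoding sketch, assuming for illustration 18 evenly spaced in-plane bins (the actual bin layout comes from sampling the view sphere):

```python
def decode_pose_bins(view_scores, inplane_scores, n_inplane_bins=18):
    """Turn classification scores over discrete viewpoint / in-plane
    rotation bins into a pose hypothesis (SSD-6D-style decoding; the
    even 20-degree bin spacing here is an illustrative assumption)."""
    v = max(range(len(view_scores)), key=view_scores.__getitem__)
    r = max(range(len(inplane_scores)), key=inplane_scores.__getitem__)
    inplane_deg = r * (360.0 / n_inplane_bins)  # bin centre in degrees
    return v, inplane_deg
```

In the full pipeline, the selected viewpoint index looks up a precomputed rotation, and the 2D box size fixes the translation along the viewing ray.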

Training

2D detections; 6D pose estimation (not refined); 6D pose estimation (after refinement).

Deep monocular 6D pose refinement. A deep-learned 6D pose refinement method that: uses RGB data only; is trained purely on synthetic data; is agnostic to geometrical symmetries and visual ambiguities. Given a 3D CAD model, an input scene image and a 6D pose hypothesis, we: render the model in a patch; cut out a scene patch around the pose hypothesis; feed both to a pre-trained feature extractor; regress a rotational and translational pose update. F. Manhardt, W. Kehl, N. Navab, F. Tombari, Deep Model-Based 6D Pose Refinement in RGB, ECCV 2018.
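These steps form a render-and-compare loop that can be iterated until the hypothesis converges. A schematic sketch with caller-supplied stand-ins for the renderer and the network (the additive update over a flat parameter tuple is a simplification of the regressed rotation/translation update):

```python
def refine_pose(pose, scene_patch_fn, render_fn, predict_update_fn, n_iters=3):
    """Iterative render-and-compare pose refinement (sketch of the loop
    in [Manhardt18]; all callables are hypothetical stand-ins)."""
    for _ in range(n_iters):
        rendered = render_fn(pose)       # render CAD model at the hypothesis
        observed = scene_patch_fn(pose)  # cut scene patch around the hypothesis
        delta = predict_update_fn(rendered, observed)  # network regresses update
        pose = tuple(p + d for p, d in zip(pose, delta))
    return pose
```

With a toy "network" that regresses half of the remaining residual, three iterations shrink the initial error by a factor of eight, which mirrors how the learned update contracts toward the true pose.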

Proxy loss with distance transform. Synthetic input image, 6D pose hypothesis; pose estimate at the initial state and after convergence. The loss sums over all sampled points projected onto the distance transform of the target. The loss is extended to both directions, since the sampled contour points do not necessarily originate from target contours.
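The bidirectional point-to-contour term amounts to a symmetric chamfer distance over sampled contour points. The method evaluates it efficiently via a precomputed distance transform; a brute-force sketch still shows the quantity being minimized:

```python
import math

def chamfer_2d(pred_pts, target_pts):
    """Symmetric chamfer loss between two sets of sampled 2D contour
    points: average nearest-neighbour distance in both directions
    (brute force for clarity; a distance transform makes the lookup
    of each point's nearest contour pixel O(1))."""
    def one_way(a, b):
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    return one_way(pred_pts, target_pts) + one_way(target_pts, pred_pts)
```
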

Results: deep monocular 6D pose refinement. Pose errors on LineMOD with poses initialized from SSD-6D. Tracking of unseen class instances. Robotic grasping application.

Combining learners and optimizers: RGB-D tracking.
Learning-based method (RGB image, points on the edge): use a forest to learn the relation between input and output from samples and labels; given the input, predict the output with the forest. More robust, less accurate.
Energy-based optimization (depth image, points on the surface): less robust, more accurate.
D.J. Tan, N. Navab, F. Tombari, Looking Beyond the Simple Scenarios: Combining Learners and Optimizers in 3D Temporal Tracking, ISMAR 2017 (Best Demo Award).

Comparison: monocular vs. RGB-D 6D pose tracking. Monocular pose refinement [Manhardt18] vs. RGB-D pose refinement [Tan17].

Pose estimation for autonomous driving. State-of-the-art techniques mostly rely on LIDAR (or LIDAR + RGB); state-of-the-art accuracy is around 50%-70%. Current contenders (multimodal, lidar-only):
VoxelNet: Zhou and Tuzel, 2017.
AVOD-FPN: J. Ku, M. Mozifian, J. Lee, A. Harakeh and S. Waslander, Joint 3D Proposal Generation and Object Detection from View Aggregation, IROS 2018.
F-PointNet: C. Qi, W. Liu, C. Wu, H. Su and L. Guibas, Frustum PointNets for 3D Object Detection from RGB-D Data, arXiv 2017.
MV3D: X. Chen, H. Ma, J. Wan, B. Li and T. Xia, Multi-View 3D Object Detection Network for Autonomous Driving, CVPR 2017.
VoxelNet detections (courtesy of Zhou and Tuzel).

Monocular 6D pose also for AD? Extend 2D detection to predict 3D bounding boxes / 6D pose for AD classes (e.g. vehicles). Still a very open problem (between 3% and 6% accuracy for KITTI 3D detection with IoU = 0.7). Related work (mono, stereo):
Mono3D: X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun, Monocular 3D Object Detection for Autonomous Driving, CVPR 2016.
3DOP: X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, 3D Object Proposals for Accurate Object Class Detection, NIPS 2015.
Mono3D detections (courtesy of Chen et al.).
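The IoU = 0.7 criterion generalizes 2D box overlap to 3D volumes. A simplified sketch for axis-aligned boxes (the actual KITTI metric additionally accounts for the box yaw angle):

```python
def iou_3d_axis_aligned(a, b):
    """3D intersection-over-union for axis-aligned boxes given as
    (x, y, z, dx, dy, dz): min corner plus size along each axis.
    Simplified illustration; KITTI evaluates rotated boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i] + a[i + 3], b[i] + b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)
```

The strictness of the threshold explains the low monocular numbers: at IoU = 0.7 even a small depth error along the viewing ray pushes the predicted volume below the overlap requirement.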

Qualitative results (lidar for visualization only). Green boxes: ground truth; red boxes: our predictions.

RGB vs. RGB + lidar. Left: ours (fully monocular); green boxes: ground truth, red boxes: predictions. Right: MV3D [Chen17] (RGB + lidar).

CNN-SLAM: monocular dense SLAM. Monocular SLAM: accurate on depth borders but sparse. CNN depth prediction: dense but imprecise along depth borders. CNN-SLAM [Tateno17] takes the best of both worlds by fusing monocular SLAM with depth prediction in real time: 1. it can learn the absolute scale; 2. it produces dense maps; 3. it can deal with purely rotational motion. K. Tateno, F. Tombari, I. Laina, N. Navab, CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction, CVPR 2017.
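One way to see how depth prediction supplies the absolute scale that monocular SLAM lacks: fit a least-squares scale factor between the up-to-scale SLAM depths and the CNN-predicted depths at the same pixels. This only illustrates the scale-recovery idea, not the full CNN-SLAM fusion, which also refines per-pixel depth around high-gradient regions:

```python
def align_scale(slam_depths, cnn_depths):
    """Closed-form least-squares scale s minimising
    sum_i (s * d_slam_i - d_cnn_i)^2 over corresponding pixels."""
    num = sum(s * c for s, c in zip(slam_depths, cnn_depths))
    den = sum(s * s for s in slam_depths)
    return num / den
```

Multiplying the SLAM map by the fitted scale yields metric depth, which is what enables the dense, metrically scaled reconstructions shown on the next slide.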

CNN-SLAM for AD. Color image and predicted depth; semantic classes: road, trees, building, sky, person, sidewalk. KITTI dataset, Cityscapes dataset.

What have we learned? 6D pose estimation can use deep learning to overcome the limitations of the sensing modality. Learned 3D descriptors generally report better matching performance than hand-crafted ones. Monocular pose estimation can be carried out via deep learning (although not yet as accurately as with a depth sensor). Open issues: generalizability; geometric invariance; runtime/hardware limitations; fusion of SLAM/real-time reconstruction with detection and pose estimation. New sensing technologies could be the next game changer.

Main credits (alphabetical): Wadim Kehl, Ted Krubasik, Fabian Manhardt, Prof. Nassir Navab, Dario Rethage, Jürgen Sturm, Dr. David J. Tan, Keisuke Tateno, Johanna Wald. Thanks to Google, Toyota and Pointu3D for supporting these research activities.