Perceiving the 3D World from Images and Videos. Yu Xiang Postdoctoral Researcher University of Washington

Similar documents
3D Object Recognition and Scene Understanding from RGB-D Videos. Yu Xiang Postdoctoral Researcher University of Washington

3D Object Recognition and Scene Understanding. Yu Xiang University of Washington

Perceiving the 3D World from Images and Videos. Yu Xiang Postdoctoral Researcher University of Washington & NVIDIA Research

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Learning from 3D Data

Object Detection by 3D Aspectlets and Occlusion Reasoning

RGB-D Multi-View Object Detection with Object Proposals and Shape Context

Deep Incremental Scene Understanding. Federico Tombari & Christian Rupprecht Technical University of Munich, Germany

Car Bed Chair Sofa Table

BOP: Benchmark for 6D Object Pose Estimation

Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image. Supplementary Material

Deep Models for 3D Reconstruction

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients

arxiv: v2 [cs.cv] 28 Sep 2016

3D Attention-Driven Depth Acquisition for Object Identification

From 3D descriptors to monocular 6D pose: what have we learned?

Lecture 7: Semantic Segmentation

Learning to generate 3D shapes

ECCV Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016

Surface Oriented Traverse for Robust Instance Detection in RGB-D

3D Deep Learning on Geometric Forms. Hao Su

Learning Semantic Environment Perception for Cognitive Robots

Fully Convolutional Networks for Semantic Segmentation

arxiv: v1 [cs.cv] 31 Mar 2016

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

Detection and Fine 3D Pose Estimation of Texture-less Objects in RGB-D Images

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Real-Time Vision-Based State Estimation and (Dense) Mapping

What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today?

3D Scene Understanding from RGB-D Images. Thomas Funkhouser

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Human Pose Estimation with Deep Learning. Wei Yang

CVPR 2014 Visual SLAM Tutorial Kintinuous

Deconvolutions in Convolutional Neural Networks

Learning-based Localization

Photo OCR ( )

Volumetric and Multi-View CNNs for Object Classification on 3D Data Supplementary Material

Structured Prediction using Convolutional Neural Networks

Colored Point Cloud Registration Revisited Supplementary Material

DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding

Semantic RGB-D Perception for Cognitive Robots

DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding

Label Propagation in RGB-D Video

3D model classification using convolutional neural network

Object Detection Based on Deep Learning

Depth from Stereo. Dominic Cheng February 7, 2018

Planning, Execution and Learning Application: Examples of Planning in Perception

Reconstruction, Motion Estimation and SLAM from Events

Presentation Outline. Semantic Segmentation. Overview. Presentation Outline CNN. Learning Deconvolution Network for Semantic Segmentation 6/6/16

LEARNING BOUNDARIES WITH COLOR AND DEPTH. Zhaoyin Jia, Andrew Gallagher, Tsuhan Chen

SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks

Learning 6D Object Pose Estimation and Tracking

arxiv: v1 [cs.cv] 28 Mar 2018

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus

arxiv: v2 [cs.cv] 18 Aug 2015

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA)

arxiv: v1 [cs.cv] 29 Sep 2016

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

Spatial Localization and Detection. Lecture 8-1

Reconstructive Sparse Code Transfer for Contour Detection and Semantic Labeling

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Detecting Object Instances Without Discriminative Features

Convolutional-Recursive Deep Learning for 3D Object Classification

Seeing the unseen. Data-driven 3D Understanding from Single Images. Hao Su

RGB-D Object Discovery via Multi-Scene Analysis

Deep Learning for Virtual Shopping. Dr. Jürgen Sturm Group Leader RGB-D

arxiv: v1 [cs.cv] 20 Dec 2016

Depth-aware CNN for RGB-D Segmentation

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Supplementary Material

Flow-Based Video Recognition

DEPTH SENSOR BASED OBJECT DETECTION USING SURFACE CURVATURE. A Thesis presented to. the Faculty of the Graduate School

A MULTI-RESOLUTION FUSION MODEL INCORPORATING COLOR AND ELEVATION FOR SEMANTIC SEGMENTATION

Depth Estimation from Single Image Using CNN-Residual Network

HIERARCHICAL JOINT-GUIDED NETWORKS FOR SEMANTIC IMAGE SEGMENTATION

Contexts and 3D Scenes

LEARNING RIGIDITY IN DYNAMIC SCENES FOR SCENE FLOW ESTIMATION

Martian lava field, NASA, Wikipedia

arxiv: v1 [cs.cv] 30 Sep 2018

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Cross-domain Deep Encoding for 3D Voxels and 2D Images

arxiv: v1 [cs.cv] 1 Apr 2018

Multiframe Scene Flow with Piecewise Rigid Motion. Vladislav Golyanik,, Kihwan Kim, Robert Maier, Mathias Nießner, Didier Stricker and Jan Kautz

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Near Real-time Object Detection in RGBD Data

MOTION ESTIMATION USING CONVOLUTIONAL NEURAL NETWORKS. Mustafa Ozan Tezcan

Feature-Fused SSD: Fast Detection for Small Objects

Joint Object Detection and Viewpoint Estimation using CNN features

Towards Spatio-Temporally Consistent Semantic Mapping

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing

Geometric Reconstruction Dense reconstruction of scene geometry

Unsupervised Deep Learning. James Hays slides from Carl Doersch and Richard Zhang

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

arxiv: v2 [cs.cv] 24 Apr 2017

Transcription:

Perceiving the 3D World from Images and Videos Yu Xiang Postdoctoral Researcher University of Washington 1

2

Act in the 3D World Sensing & Understanding Acting Intelligent System 3D World 3

Understand the 3D World Navigation Manipulation Geometry Free space Surface Semantics Objects Affordances 4

Recognize Objects in 3D Semantics 3D Location 3D Orientation 5

Lula Robotics 6

6D Object Pose Estimation camera coordinate object coordinate 7

Related Work: Feature-based methods Texture-less objects Symmetry objects Occlusion Lowe, ICCV 99 Rothganger et al., IJCV 06 Savarese & Fei-Fei, ICCV 07 Collet et al., IJRR 11 8

Related Work: Template-based methods Texture-less objects Symmetry objects Occlusion Gu & Ren, ECCV 10 Hinterstoisser et al., ACCV 12 Xiang & Savarese, CVPR 12 Cao et al., ICRA 16 9

PoseCNN An input image Texture-less objects Symmetry objects Occlusion 6D poses Y. Xiang, T. Schmidt, V. Narayanan and D. Fox. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In arxiv:1711.00199. 10

Decouple 3D Translation and 3D Rotation 3D Translation 2D center camera coordinate Distance object coordinate 3D Rotation 11

PoseCNN: Semantic Labeling Feature Extraction Embedding Classification 64 128 256 512 512 64 64 64 64 #classes Labels Convolution + ReLU Max Pooling Deconvolution Addition 12

PoseCNN: Object Center Voting 13

PoseCNN: 3D Translation Estimation Feature Extraction Embedding Classification / Regression 64 64 64 64 #classes Labels 64 128 256 512 512 128 Center 128 128 128 3 direction X #classes Center direction Y Center distance Convolution + ReLU Max Pooling Deconvolution Addition Hough Voting 14

PoseCNN: Center Estimation with RANSAC Center hypothesis 15

PoseCNN: 3D Rotation Regression Feature Extraction Embedding Classification / Regression 64 64 64 64 #class Labels Convolution + ReLU Max Pooling Deconvolution Addition 64 128 256 512 512 Hough Voting RoI Pooling Fully Connected 128 128 128 128 3 #class Center direction X 512 512 RoIs Center direction Y 512 Center distance For each RoI 4096 4096 4 #class 6D Poses 16

The YCB-Video Dataset 21 YCB Objects 92 Videos, 133,827 frames17

6D Pose Evaluation Metric Ground truth pose Average distance (non-symmetry) Average distance (symmetry) 3D model points Predicted pose 18

Results on the YCB-Video Dataset 19

Input Image Labeling Centers 6D Pose 20

Network Network + ICP Network + ICP + Multi-view 21

22

23

24

25 Y. Xiang, W. Choi, Y. Lin and S. Savarese. Subcategory-aware Convolutional Neural Networks for Object Proposals and Detection. In WACV, 2017. Y. Xiang, W. Choi, Y. Lin and S. Savarese. Data-Driven 3D Voxel Patterns for Object Category Recognition. In CVPR, 2015 (Oral).

Online Multi-Object Tracking Y. Xiang, A. Alahi and S. Savarese. Learning to Track: Online Multi-Object Tracking by Decision Making. In ICCV, 2015 (Oral). Code available online 26

Semantic Mapping (Semantic SLAM) Geometry Semantics Camera poses 27

28

Semantic Mapping with Data Associated Recurrent Neural Networks (DA-RNNs) DA-RNN Code available online 29 Y. Xiang and D. Fox. DA-RNN: Semantic Mapping with Data Associated Recurrent Neural Networks. In RSS, 2017.

Related Work: 3D Scene Reconstruction KinectFusion Geometry Data Association Semantics Newcombe et al., ISMAR 11 Henry et al., IJRR 12, 3DV 13 Whelan et al., RSS Workshop 12, RSS 15 Keller et al., 3DV 13 30

Related Work: Semantic Labeling Geometry Data Association Semantics Long et al., CVPR 12 Zheng et al., ICCV 15 Chen et al., ICLR 15 Badrinarayanan et al., CVPR 15 31

Related Work: Semantic Mapping SemanticFusion Geometry Data Association Semantics Salas-Moreno et al., CVPR 13 McCormac et al., ICRA 17 Bowman et al. ICRA 17 32

Our Contribution: DA-RNN Recurrent Neural Network RGB Images Data Association Semantic Labels Depth Images KinectFusion 3D Semantic Scene 33

Single Frame Labeling with FCNs Feature Extraction Embedding Classification Convolution + ReLU RGB Image Depth Image 64 64 128 128 256 512 512 256 512 512 1024 1024 64 64 64 64 #classes Labels Max Pooling Concatenation Deconvolution Addition 34

Results on RGB-D Scene Dataset [1] [1] K. Lai, L. Bo and D. Fox. Unsupervised feature learning for 3D scene labeling. In ICRA 14 35

Video Semantic Labeling with DA-RNNs RGB Image Time t Depth Image RGB Image Time t+1 Labels data association Labels Convolution + ReLU Max Pooling Concatenation Deconvolution Addition Recurrent Units Depth Image 36

Data Association from KinectFusion Depth is available Assumption: static scenes Iterative Closest Point (ICP) 37

Data Associated Recurrent Units (DA-RUs) Data Association Hidden state h t i DA-RU i h t+1 Classification Time t Time t+1 Recurrent layer Input i x t+1 Weighted Moving Averaging with learnable parameters 38

Results on RGB-D Scene Dataset [1] FCN DA-RNN [1] K. Lai, L. Bo and D. Fox. Unsupervised feature learning for 3D scene labeling. In ICRA 14. 39

Experiments: Datasets RGB-D Scene Dataset [1] 14 RGB-D videos of indoor scenes 9 object classes ShapeNet Scene Dataset [2] 100 RGB-D videos of virtual table-top scenes 7 object classes [1] K. Lai, L. Bo and D. Fox. Unsupervised feature learning for 3D scene labeling. In ICRA 14. [2] Chang et al., ShapeNet: an information-rich 3D model repository. arxiv preprint arxiv:1512.03012, 2015. 40

Experiments: Comparison on Network Architectures RGB-D Scenes Methods FCN [1] Background 94.3 Bowl 78.6 Cap 61.2 Cereal Box 80.4 Coffee Mug 62.7 Coffee Table 93.6 Office Chair 67.3 Soda Can 73.5 Sofa 90.8 Table 84.2 MEAN 78.7 Metric: segmentation intersection over union (IoU) 41 [1] J. Long, E. Shelhamer and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR 15.

Experiments: Comparison on Network Architectures RGB-D Scenes Methods FCN [1] Our FCN Background 94.3 96.1 Bowl 78.6 87.0 Cap 61.2 79.0 Cereal Box 80.4 87.5 Coffee Mug 62.7 75.7 Coffee Table 93.6 95.2 Office Chair 67.3 71.6 Soda Can 73.5 82.9 Sofa 90.8 92.9 Table 84.2 89.8 MEAN 78.7 85.8 Metric: segmentation intersection over union (IoU) 42 [1] J. Long, E. Shelhamer and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR 15.

Experiments: Comparison on Network Architectures RGB-D Scenes Methods FCN [1] Our FCN Our GRU-RNN Background 94.3 96.1 96.8 Bowl 78.6 87.0 86.4 Cap 61.2 79.0 82.0 Cereal Box 80.4 87.5 87.5 Coffee Mug 62.7 75.7 76.1 Coffee Table 93.6 95.2 96.0 Office Chair 67.3 71.6 72.7 Soda Can 73.5 82.9 81.9 Sofa 90.8 92.9 93.5 Table 84.2 89.8 90.8 MEAN 78.7 85.8 86.4 Metric: segmentation intersection over union (IoU) 43 [1] J. Long, E. Shelhamer and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR 15.

Experiments: Comparison on Network Architectures RGB-D Scenes Methods FCN [1] Our FCN Our GRU-RNN Our DA-RNN Background 94.3 96.1 96.8 97.6 Bowl 78.6 87.0 86.4 92.7 Cap 61.2 79.0 82.0 84.4 Cereal Box 80.4 87.5 87.5 88.3 Coffee Mug 62.7 75.7 76.1 86.3 Coffee Table 93.6 95.2 96.0 97.3 Office Chair 67.3 71.6 72.7 77.0 Soda Can 73.5 82.9 81.9 88.7 Sofa 90.8 92.9 93.5 95.6 Table 84.2 89.8 90.8 92.8 MEAN 78.7 85.8 86.4 90.1 Metric: segmentation intersection over union (IoU) 44 [1] J. Long, E. Shelhamer and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR 15.

Experiments: Comparison on Network Architectures RGB-D Scenes Methods FCN [1] Our FCN Our GRU-RNN Our DA-RNN No Data Association Background 94.3 96.1 96.8 97.6 69.1 Bowl 78.6 87.0 86.4 92.7 3.6 Cap 61.2 79.0 82.0 84.4 9.9 Cereal Box 80.4 87.5 87.5 88.3 14.0 Coffee Mug 62.7 75.7 76.1 86.3 4.5 Coffee Table 93.6 95.2 96.0 97.3 68.0 Office Chair 67.3 71.6 72.7 77.0 13.6 Soda Can 73.5 82.9 81.9 88.7 5.9 Sofa 90.8 92.9 93.5 95.6 35.6 Table 84.2 89.8 90.8 92.8 20.1 MEAN 78.7 85.8 86.4 90.1 24.4 Metric: segmentation intersection over union (IoU) [1] J. Long, E. Shelhamer and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR 15. 45

RGB Image Our FCN Our DA-RNN 46

Experiments: Analysis on Network Inputs 47

RGB Images Depth Images Semantic Mapping 48

Conclusion 6D Object Pose Estimation from images and videos Semantic Mapping from RGB-D videos Deep neural networks with geometric representations 49

Future Work: Perception for Robotics 3D object recognition for robot manipulation 3D scene understanding for robot navigation Perception Combing perception, planning and control Planning Control Learning Reinforcement learning with deep neural networks 50

Acknowledgements Thank you! 51