Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction


Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joint work with Nikolay Savinov, Christian Häne, Ľubor Ladický

Comparison to Volumetric Fusion

Higher-order ray potentials to model visibility. Volumetric formulation (Savinov et al., CVPR15): the energy combines ray potentials with a pairwise regularizer, where the cost of each ray is based on the first occupied voxel along the ray (free space up to the depth label).

Higher-order ray potentials to model visibility. Discrete formulation using the QPBO relaxation (Savinov et al., CVPR15). Given ray variables x_0, x_1, ..., x_n, the goal is to find a construction such that the potential is: 1) a pairwise function; 2) the number of edges grows linearly with the length of a ray; 3) symmetric, to inherit the QPBO properties.

Two-label problem (Savinov et al., CVPR15). The construction proceeds in these steps: 1) polynomial representation of the ray potential; 2) transformation into a submodular function over x and its negation x̄; 3) pairwise construction using auxiliary variables z; 4) merging variables [Ramalingam12] for linear complexity; 5) symmetrization of the graph.

Symmetric graph construction for higher-order ray potential (Savinov et al, CVPR15)

Multi-label problem (Savinov et al., CVPR15). Standard alpha-expansion: in each expansion move, the multi-label ray potential projects into a 2-label ray potential. Variables left unlabelled by QPBO are labelled using ICM.
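As a rough sketch of the expansion-move idea (not the authors' implementation: a toy 1D Potts model, with the binary subproblem solved exhaustively where the talk uses QPBO plus an ICM fallback for unlabelled variables):

```python
from itertools import product

def energy(labels, unary, lam):
    """Total cost: per-voxel data terms plus a Potts smoothness term."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += lam * sum(a != b for a, b in zip(labels, labels[1:]))
    return e

def alpha_expansion(unary, lam, num_labels, sweeps=3):
    """Alpha-expansion: each move lets every variable either keep its
    current label or switch to alpha, which is a binary subproblem."""
    labels = [min(range(num_labels), key=lambda l: unary[i][l])
              for i in range(len(unary))]
    for _ in range(sweeps):
        for alpha in range(num_labels):
            # Binary move solved by brute force on this toy problem.
            labels = min(
                ([alpha if m else l for m, l in zip(move, labels)]
                 for move in product([0, 1], repeat=len(labels))),
                key=lambda cand: energy(cand, unary, lam))
    return labels
```

On real ray potentials the binary move is non-submodular, which is why QPBO (rather than a plain graph cut) is needed.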

Implementation details (Savinov et al., CVPR15). Semantic cost: semantic classifier [Ladický ICCV09]. Depth cost: multi-view stereo depth matches using zero-mean NCC, keeping the top n matches. Graph cut solved with [GoldbergAlgo11].
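Zero-mean NCC between two patches, the matching score behind the stereo depth cost, can be sketched as (a hypothetical helper, not code from the paper):

```python
import numpy as np

def zero_mean_ncc(a, b, eps=1e-9):
    """Zero-mean normalized cross-correlation between two equal-size
    patches; returns a score in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Identical patches score ~1, contrast-inverted ones ~-1; the zero-mean step makes the score invariant to additive brightness changes.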

Results: inference vs. generative (Savinov et al., CVPR15)

Results: input, depth, semantics, 3D model (Savinov et al., CVPR15)

Results: input, depth, semantics, 3D model (Savinov et al., CVPR15)

Results (Savinov et al, CVPR15)

Joint depth-semantic cost?

Single-View Depth using a Joint Depth-Semantic Classifier Marc Pollefeys Joint work with Ľubor Ladický and Jianbo Shi (UPenn)

Single-View Depth Estimation

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08]

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08] Requires strong prior knowledge Ignores small objects

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08] 2. 3D-Detection based [Hoiem et al. CVPR06]

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08] 2. 3D-Detection based [Hoiem et al. CVPR06] Works only for foreground objects (things)

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08] 2. 3D-Detection based [Hoiem et al. CVPR06] 3. Depth from semantic labels [Liu et al. CVPR10]

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08] 2. 3D-Detection based [Hoiem et al. CVPR06] 3. Depth from semantic labels [Liu et al. CVPR10] Requires strong priors for semantic classes

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08] 2. 3D-Detection based [Hoiem et al. CVPR06] 3. Depth from semantic labels [Liu et al. CVPR10] 4. Data driven [Saxena et al. NIPS05]

Single-View Depth Estimation Standard approaches : 1. Model fitting [Barinova et al. ECCV08] 2. 3D-Detection based [Hoiem et al. CVPR06] 3. Depth from semantic labels [Liu et al. CVPR10] 4. Data driven [Saxena et al. NIPS05] Requires lots of data (depth does not generalize across classes); balancing the data is a problem

Data-driven Depth Estimation Impossible?

Data-driven Depth Estimation No common structure of the scene Ground plane not always visible Large variation of viewpoints and of objects in the scene Both things and stuff in the scene

Data-driven Depth Estimation Desired properties :

Data-driven Depth Estimation Desired properties : 1. Pixel-wise classifier Super-pixels not necessarily planar

Data-driven Depth Estimation Desired properties : 1. Pixel-wise classifier 2. Translation invariant: the classifier response for a point x at depth d depends on a w × h window around x in image I

Data-driven Depth Estimation Desired properties : 1. Pixel-wise classifier 2. Translation invariant 3. Depth transforms with inverse scaling

Data-driven Depth Estimation Desired properties : 1. Pixel-wise classifier 2. Translation invariant 3. Depth transforms with inverse scaling: sufficient to train a binary classifier predicting a single canonical depth d_C

Data-driven Depth Estimation Desired properties : 1. Pixel-wise classifier 2. Translation invariant 3. Depth transforms with inverse scaling: sufficient to train a binary classifier predicting a single canonical depth d_C; for other depths d, the classifier window is rescaled accordingly

Data-driven Depth Estimation Desired properties : 1. Pixel-wise classifier 2. Translation invariant 3. Depth transforms with inverse scaling

Data-driven Depth Estimation Desired properties : 1. Pixel-wise classifier 2. Translation invariant 3. Depth transforms with inverse scaling. Generalized to multiple semantic classes: one classifier output per semantic label
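The inverse-scaling property can be sketched with a hypothetical helper (illustrating the idea, not the talk's code): a classifier trained at a canonical depth d_C is evaluated for another depth hypothesis d by rescaling its window, since apparent size shrinks as 1/d.

```python
def window_at_depth(w_c, h_c, d_c, d):
    """Window for testing depth hypothesis d with a classifier trained at
    the canonical depth d_c: apparent size scales inversely with depth,
    so the canonical w_c x h_c window is rescaled by d_c / d."""
    s = d_c / d
    return round(w_c * s), round(h_c * s)
```

Doubling the depth hypothesis halves the window, so one binary classifier trained at d_C covers the whole depth range via an image pyramid.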

1. Image pyramid is built Training the classifier

Training the classifier 1. Image pyramid is built 2. Training data randomly sampled

Training the classifier 1. Image pyramid is built 2. Training data randomly sampled 3. Samples of each class at d_C used as positives

Training the classifier 1. Image pyramid is built 2. Training data randomly sampled 3. Samples of each class at d_C used as positives 4. Samples of other classes, or at depth d ≠ d_C, used as negatives

Training the classifier 1. Image pyramid is built 2. Training data randomly sampled 3. Samples of each class at d_C used as positives 4. Samples of other classes, or at depth d ≠ d_C, used as negatives 5. Multi-class classifier trained

Classifying the patch Dense features: SIFT, LBP, Self Similarity, Texton

Classifying the patch Dense features: SIFT, LBP, Self Similarity, Texton Representation: soft BOW representations over a set of random rectangles

Classifying the patch Dense features: SIFT, LBP, Self Similarity, Texton Representation: soft BOW representations over a set of random rectangles Classifier: AdaBoost
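A minimal sketch of the soft BOW pooling over one rectangle (the per-pixel soft assignment of dense features to visual words, and the random rectangle sampling, are assumed inputs here, not code from the talk):

```python
import numpy as np

def soft_bow(assignments, rect):
    """Soft bag-of-words descriptor: sum the per-pixel soft visual-word
    assignments over a rectangle and L1-normalize.

    assignments: (H, W, K) array of soft assignments over K visual words.
    rect: (y0, x0, y1, x1) half-open pooling rectangle.
    """
    y0, x0, y1, x1 = rect
    hist = assignments[y0:y1, x0:x1].sum(axis=(0, 1))
    return hist / max(hist.sum(), 1e-9)
```

Concatenating such histograms over many random rectangles gives the feature vector that AdaBoost selects from.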

Experiments. KITTI dataset: 30 training & 30 test images (1382 x 512), 12 semantic labels, depth 2-50 m (except sky), ratio of neighbouring depths d_{i+1} / d_i = 1.25. NYU2 dataset: 725 training & 724 test images (640 x 480), 40 semantic labels, depth in the range 1-10 m, ratio of neighbouring depths d_{i+1} / d_i = 1.25.
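The geometric depth discretization with ratio 1.25 used in both experiments can be reproduced as:

```python
import math

def depth_bins(d_min, d_max, ratio=1.25):
    """Geometric depth discretization with d_{i+1} / d_i = ratio,
    covering the range [d_min, d_max]."""
    n = math.ceil(math.log(d_max / d_min, ratio)) + 1
    return [d_min * ratio ** i for i in range(n)]
```

For the KITTI range of 2-50 m this yields on the order of 16 depth bins, so each pyramid level corresponds to one discrete depth hypothesis.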

KITTI results

NYU2 results

NYU2 results

Quantitative results (semantic): on the KITTI dataset (recall) and on the NYU2 dataset (frequency-weighted intersection-over-union)

Quantitative results (depth) The ratio of pixels below the relative error

Quantitative results (depth) The distribution of the relative errors of the estimated depth in log_1.25 space
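A hedged sketch of the usual thresholded relative-error metric in this log_1.25 space (a standard single-view depth metric; the talk's exact evaluation protocol may differ):

```python
def within_relative_error(d_pred, d_true, k=1, ratio=1.25):
    """True if the predicted depth lies within ratio**k of the ground
    truth, i.e. |log_ratio(d_pred / d_true)| < k."""
    delta = max(d_pred / d_true, d_true / d_pred)
    return delta < ratio ** k
```

The "ratio of pixels below the relative error" curve is then just the fraction of pixels passing this test as k grows.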

Single-View Depth: Conclusions Things and stuff have an intrinsic visual scale Depth can be recovered from a single image Classification improves with a scale-normalized classifier Data does not need to be balanced over scale

Discriminatively Trained Dense Surface Normal Estimation Marc Pollefeys Joint work with Ľubor Ladický and Bernhard Zeisl

Surface Normal Estimation Not explored much in the literature, so how to approach it?

Surface Normal Estimation Not explored much in the literature, so how to approach it? Pixels or super-pixels?

Pixel-based Classifiers Input image Feature representation Context-based (context pixels or rectangles) feature representations [Shotton06, Shotton08]

Pixel-based Classifiers Input image Feature representation Context-based (context pixels or rectangles) feature representations [Shotton06, Shotton08] Classifier typically noisy and does not follow object boundaries

Segment-based Classifiers Input image Feature representation Based on feature statistics in segments

Segment-based Classifiers Input image Feature representation Based on feature statistics in segments Segments expected to be label-consistent

Segment-based Classifiers Input image Feature representation Based on feature statistics in segments Segments expected to be label-consistent One particular segmentation has to be chosen

Joint Regularization Input image Independent classifiers Existing optimization methods (Ladicky09) designed for discrete labels

Joint Regularization Input image Independent classifiers Existing optimization methods (Ladicky09) designed for discrete labels Not obvious how to generalize for continuous problems

Joint Regularization Input image Independent classifiers Existing optimization methods (Ladicky09) designed for discrete labels Not obvious how to generalize for continuous problems Maybe we can directly learn joint classifier

Joint Learning Input image Segment representation How to convert segment representation into pixel representation?

Joint Learning Input image Segment representation How to convert segment representation into pixel representation? Representation of a pixel the same as of the segment it belongs to

Joint Learning Input image Segment representation How to convert segment representation into pixel representation? Representation of a pixel the same as of the segment it belongs to Equivalent to weighted segment based approach

Joint Learning How to convert segment representation into pixel representation? Representation of a pixel the same as of the segment it belongs to Equivalent to weighted segment based approach Concatenation to combine pixel and multiple segment representations

Joint Learning To simplify the regression problem: normals clustered using K-means clustering, each normal represented as a weighted sum of cluster centres using local coding

Joint Learning To simplify the regression problem: normals clustered using K-means clustering, each normal represented as a weighted sum of cluster centres using local coding. Learning formulated as a regression onto the local coding coordinates
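A minimal sketch of the local coding step, assuming inverse-distance weights over the k nearest centres (a simple stand-in; the talk's exact coding scheme may differ):

```python
import numpy as np

def local_coding(normal, centres, k=3):
    """Encode a unit normal as a sparse convex combination of its k
    nearest K-means cluster centres (inverse-distance weights)."""
    d = np.linalg.norm(centres - normal, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)
    code = np.zeros(len(centres))
    code[idx] = w / w.sum()
    return code
```

A normal that coincides with a cluster centre gets essentially all of its weight on that centre, so the regression targets stay sparse and bounded.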

Pipeline of our Method

AdaBoost Regression Response for each cluster centre l

AdaBoost Regression Response for each cluster centre l Learning optimizes weighted expected loss

AdaBoost Regression Response for each cluster centre l Learning optimizes weighted expected loss Empirical risk minimization in each iteration

AdaBoost Regression Introducing two sets of weights, the problem transforms into a recursive problem:

AdaBoost Regression Introducing two sets of weights, the problem transforms into a recursive problem: Closed-form solution for the parameters of the weak classifier (see the paper)

Test-time Evaluation The most probable triangle found by maximizing:

Test-time Evaluation The most probable triangle found by maximizing: The local coding coefficients found as an expected value of probabilistic interpretation:

Test-time Evaluation The most probable triangle found by maximizing: The local coding coefficients found as an expected value of probabilistic interpretation: Normal recovered by projecting weighted sum to the unit sphere
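The final decoding step above (weighted sum of cluster centres projected back to the unit sphere) is then simply:

```python
import numpy as np

def decode_normal(code, centres):
    """Recover a surface normal from local-coding coefficients: take the
    weighted sum of the cluster centres and project it back onto the
    unit sphere."""
    v = code @ centres
    return v / np.linalg.norm(v)
```

The projection is needed because a convex combination of unit vectors generally lies inside the sphere, not on it.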

Results Input image err = 40.366 Input image err = 32.446 Input image err = 33.636 Input image err = 37.066 Input image err = 38.043 Input image err = 35.109 Input image err = 35.849 Input image err = 28.379 Input image err = 35.429

Results Input image err = 37.688 Input image err = 40.784 Input image err = 51.897 Input image err = 28.216 Input image err = 32.034 Input image err = 68.038 Input image err = 33.174 Input image err = 41.131 Input image err = 38.873

Normal segmentation: conclusions Normal estimation might not be as hard as it seems Proposed joint pixel & segment learning useful for other recognition tasks Results improvable by incorporating regularization (potentially joint with depth)

Conclusion Volumetric multi-view approach which performs joint reconstruction, recognition and segmentation Strong coupling between geometry and appearance via a class-dependent anisotropic smoothness term Clean energy formulation (with tight convex relaxation) Significant qualitative improvement with respect to the state of the art (regularized solutions make more sense) Semantic reconstruction is more useful

Challenges and future research. Scale to many classes: leverage sparsity of semantic interactions, class hierarchies. Scale to large volumes: adaptive space discretization and basis representations. Dynamic scenes: spatio-temporal interactions, extensions to 4D volumes and Wulff shapes. Exploration and robotics: enable real-time navigation and exploration, predict information gain of perception actions.

Thank you for your attention! Questions?