Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction
Marc Pollefeys
Joint work with Nikolay Savinov, Christian Häne, Ľubor Ladický
Comparison to Volumetric Fusion
Higher-order ray potentials to model visibility (Savinov et al., CVPR15)
Volumetric formulation: ray potentials + pairwise regularizer
The ray potential cost is based on the first occupied voxel along the ray: free space before it, then its depth and semantic label
Higher-order ray potentials to model visibility (Savinov et al., CVPR15)
Discrete formulation using a QPBO relaxation
Our goal is to find a graph construction such that the resulting energy is:
1) A pairwise function
2) With a number of edges that grows linearly with the length of a ray
3) Symmetric, to inherit QPBO properties
Two-label problem (Savinov et al., CVPR15)
To obtain this construction we take these steps:
1) Polynomial representation of the ray potential
2) Transformation into a submodular function over the variables x and their complements
3) Pairwise construction using auxiliary variables z
4) Merging variables [Ramalingam12] for linear complexity
5) Symmetrization of the graph
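Step 1 above, the polynomial representation, can be sketched as follows: the "cost of the first occupied voxel" potential is exactly a pseudo-Boolean polynomial in the occupancy variables. A minimal sketch; the function names and the explicit free-space cost argument are illustrative, not the paper's notation.

```python
def ray_potential(x, costs, freespace_cost=0.0):
    """Cost of a ray given binary occupancies x along it: pays costs[i]
    for the first occupied voxel i, or freespace_cost if the ray is empty."""
    for i, xi in enumerate(x):
        if xi == 1:
            return costs[i]
    return freespace_cost

def ray_potential_polynomial(x, costs, freespace_cost=0.0):
    """Same potential as a pseudo-Boolean polynomial:
    sum_i costs[i] * x_i * prod_{j<i} (1 - x_j)
    + freespace_cost * prod_j (1 - x_j)."""
    total = 0.0
    prefix_empty = 1.0  # running product prod_{j<i} (1 - x_j)
    for xi, ci in zip(x, costs):
        total += ci * xi * prefix_empty
        prefix_empty *= (1.0 - xi)
    return total + freespace_cost * prefix_empty
```

Both forms agree on every occupancy pattern; the polynomial form is the starting point for the submodular and pairwise constructions in steps 2 and 3.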
Symmetric graph construction for higher-order ray potential (Savinov et al, CVPR15)
Multi-label problem (Savinov et al., CVPR15)
Standard α-expansion: in each expansion move, the multi-label ray potential projects into a 2-label ray potential
Variables not labelled by QPBO are labelled using ICM
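A minimal sketch of how an expansion move restricts a multi-label ray potential to a 2-label one. The convention that label 0 means free space and all names here are assumptions for illustration, not the paper's notation.

```python
def expansion_binary_potential(current, alpha, cost_of_first):
    """Restrict a multi-label ray potential to an alpha-expansion move.

    current: current semantic labels along the ray (0 = free space, assumed).
    alpha:   the proposed label of the expansion move.
    Returns a function of the binary move vector t (t_i = 1 means voxel i
    switches to alpha). The result still depends only on the first
    non-free voxel along the ray, i.e. it is again a ray potential,
    now over binary variables.
    """
    def binary_potential(t):
        labels = [alpha if ti else li for ti, li in zip(t, current)]
        for i, l in enumerate(labels):
            if l != 0:  # first occupied voxel determines the cost
                return cost_of_first(i, l)
        return cost_of_first(None, 0)  # whole ray stays free space
    return binary_potential
```

Because the restricted potential keeps the "first occupied voxel" structure, the two-label graph construction above can be reused inside each expansion move.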
Implementation details (Savinov et al., CVPR15)
Semantic cost: semantic classifier [Ladický ICCV09]
Depth cost: multi-view stereo depth matches using zero-mean NCC, keeping the top n matches
Graph cut computed using [GoldbergAlgo11]
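Zero-mean NCC is a standard photometric matching score; a minimal sketch of how such a score is computed between two patches (the exact aggregation over the top n matches follows the paper):

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-9):
    """Zero-mean normalized cross-correlation between two image patches.

    Subtracting the means makes the score invariant to additive brightness
    changes; normalization makes it invariant to multiplicative gain.
    Returns a value in [-1, 1]; higher means a better match."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + eps
    return float((a * b).sum() / denom)
```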
Results: inference vs. generative (Savinov et al., CVPR15)
Results: input, depth, semantics, 3D model (Savinov et al., CVPR15)
Results (Savinov et al, CVPR15)
Joint depth-semantic cost?
Single-View Depth using a Joint Depth-Semantic Classifier
Marc Pollefeys
Joint work with Ľubor Ladický and Jianbo Shi (UPenn)
Single-View Depth Estimation
Single-View Depth Estimation
Standard approaches:
1. Model fitting [Barinova et al. ECCV08]: requires strong prior knowledge, ignores small objects
2. 3D-detection based [Hoiem et al. CVPR06]: works only for foreground objects (things)
3. Depth from semantic labels [Liu et al. CVPR10]: requires strong priors for semantic classes
4. Data driven [Saxena et al. NIPS05]: requires lots of data (depth does not generalize across classes), and balancing the data is a problem
Data-driven Depth Estimation Impossible?
Data-driven Depth Estimation No common structure of the scene Ground plane not always visible Large variation of viewpoints and of objects in the scene Both things and stuff in the scene
Data-driven Depth Estimation
Desired properties:
1. Pixel-wise classifier (super-pixels are not necessarily planar)
2. Translation invariant: the classifier response for a pixel x at depth d is computed over a w × h window around x
3. Depth transforms with inverse scaling: it is sufficient to train a binary classifier predicting a single canonical depth d_C; for any other depth d, the same classifier is applied to the window rescaled accordingly
Generalized to multiple semantic classes by predicting the semantic label jointly with depth
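The inverse-scaling property can be made concrete: an object at depth d appears d_C / d times larger than at the canonical depth, so evaluating the canonical classifier at depth d amounts to resizing its context window. A sketch with an assumed helper name:

```python
def window_size_at_depth(w_canonical, h_canonical, d_canonical, d):
    """Rescale the canonical context window for evaluation at depth d.

    Inverse scaling: an object at depth d appears d_canonical / d times
    larger than at the canonical depth, so the window trained at
    d_canonical is scaled by that factor. Hypothetical helper for
    illustration; the paper evaluates over an image pyramid instead of
    resizing per pixel."""
    s = d_canonical / d
    return round(w_canonical * s), round(h_canonical * s)
```

A closer object (d < d_C) gets a larger window, a farther one a smaller window, so one binary classifier covers the whole depth range.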
Training the classifier
1. Image pyramid is built
2. Training data randomly sampled
3. Samples of each class at depth d_C used as positives
4. Samples of other classes, or at depth d ≠ d_C, used as negatives
5. Multi-class classifier trained
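Steps 2-4 above can be sketched as a sampling routine: for a given class, a sample is a positive only if it has that class and sits (within a small tolerance) at the canonical depth; everything else is a negative. The data layout, tolerance, and all names are assumptions for illustration.

```python
import random

def make_training_samples(annotations, cls, d_canonical, tol=1.05, n=1000, seed=0):
    """Randomly sample annotated pixels and split them into positives
    and negatives for one class.

    annotations: list of (features, class_label, depth) tuples (assumed layout).
    A positive has class `cls` and depth within `tol` of d_canonical;
    other classes or other depths become negatives."""
    rng = random.Random(seed)
    samples = rng.sample(annotations, min(n, len(annotations)))
    positives, negatives = [], []
    for feats, c, depth in samples:
        rel = max(depth / d_canonical, d_canonical / depth)
        if c == cls and rel <= tol:
            positives.append(feats)   # this class, at the canonical depth
        else:
            negatives.append(feats)   # other class, or wrong depth
    return positives, negatives
```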
Classifying the patch
Dense features: SIFT, LBP, self-similarity, textons
Representation: soft bag-of-words (BoW) representations over a set of random rectangles
Classifier: AdaBoost
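A minimal sketch of one soft BoW descriptor: given each pixel's soft assignment of its dense feature to K visual words, the descriptor of a rectangle is the normalized sum of assignments inside it. The array layout and function name are assumptions; in practice such sums are accelerated with integral images.

```python
import numpy as np

def soft_bow_in_rectangle(assignments, rect):
    """Soft bag-of-words histogram inside one rectangle.

    assignments: H x W x K array of soft assignments of each pixel's
    dense feature (SIFT, LBP, ...) to K visual words.
    rect: (y0, x0, y1, x1) with exclusive bottom-right corner.
    Returns an L1-normalized K-bin histogram."""
    y0, x0, y1, x1 = rect
    hist = assignments[y0:y1, x0:x1].reshape(-1, assignments.shape[-1]).sum(axis=0)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

The patch descriptor is the concatenation of such histograms over the set of random rectangles, which is what AdaBoost sees.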
Experiments
KITTI dataset: 30 training & 30 test images (1382 × 512), 12 semantic labels, depth range 2-50 m (except sky), ratio of neighbouring depths d_{i+1}/d_i = 1.25
NYU2 dataset: 725 training & 724 test images (640 × 480), 40 semantic labels, depth range 1-10 m, ratio of neighbouring depths d_{i+1}/d_i = 1.25
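The fixed ratio d_{i+1}/d_i = 1.25 means the depth range is discretized geometrically; a short sketch of how such depth labels can be generated:

```python
def depth_bins(d_min, d_max, ratio=1.25):
    """Discretize a depth range geometrically so that neighbouring
    bin centres satisfy d_{i+1} / d_i = ratio, as in the experiments."""
    bins = [d_min]
    while bins[-1] * ratio <= d_max:
        bins.append(bins[-1] * ratio)
    return bins
```

For the KITTI range, depth_bins(2.0, 50.0) yields 15 depth labels; geometric spacing matches the inverse-scaling classifier, since stepping one bin corresponds to a fixed rescaling of the window.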
KITTI results
NYU2 results
Quantitative results (semantic)
Quantitative results on the KITTI dataset (recall)
Quantitative results on the NYU2 dataset (frequency-weighted intersection-over-union)
Quantitative results (depth) The ratio of pixels below the relative error
Quantitative results (depth)
The distribution of the relative errors of the estimated depth in log_1.25 space
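The "ratio of pixels below the relative error" metric can be sketched as follows; the max-of-ratios form of the relative error is a standard choice in single-view depth evaluation and is assumed here, with thresholds taken as powers of 1.25 to match the log_1.25 binning.

```python
import numpy as np

def ratio_below_relative_error(pred, gt, threshold=1.25):
    """Fraction of pixels whose relative depth error is below a threshold,
    i.e. max(pred/gt, gt/pred) < threshold. Sketch of a standard
    single-view depth metric; thresholds are typically 1.25, 1.25**2, ..."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    rel = np.maximum(pred / gt, gt / pred)
    return float((rel < threshold).mean())
```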
Single-View Depth: Conclusions
Things and stuff have an intrinsic visual scale
Depth can be recovered from a single image
Classification improves with a scale-normalized classifier
Data does not need to be balanced over scale
Discriminatively Trained Dense Surface Normal Estimation Marc Pollefeys Joint work with Ľubor Ladický and Bernhard Zeisl
Surface Normal Estimation
Not explored much in the literature, so how to approach it?
Pixels or super-pixels?
Pixel-based Classifiers
Feature representation: context-based (context pixels or rectangles) feature representations [Shotton06, Shotton08]
Classifier output is typically noisy and does not follow object boundaries
Segment-based Classifiers
Feature representation: based on feature statistics in segments
Segments expected to be label-consistent
One particular segmentation has to be chosen
Joint Regularization
Regularization over independent classifiers: existing optimization methods [Ladicky09] are designed for discrete labels
Not obvious how to generalize them to continuous problems
Maybe we can directly learn a joint classifier
Joint Learning
How to convert a segment representation into a pixel representation?
Representation of a pixel is the same as that of the segment it belongs to
Equivalent to a weighted segment-based approach
Concatenation combines the pixel representation with multiple segment representations
Joint Learning
To simplify the regression problem:
Normals are clustered using K-means clustering
Each normal is represented as a weighted sum of cluster centres using local coding
Learning is formulated as a regression onto the local-coding coordinates
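A minimal sketch of the local-coding step: a unit normal is expressed as a convex combination of its nearest cluster centres. The inverse-distance weighting and the parameter k are illustrative assumptions; the paper's exact coding scheme may differ.

```python
import numpy as np

def local_coding_weights(normal, centres, k=3):
    """Express a unit normal as a convex combination of its k nearest
    K-means cluster centres (a simple form of local coding).

    Returns a coefficient vector over all centres; only the k nearest
    entries are non-zero, weighted by inverse distance and normalized
    to sum to one."""
    d = np.linalg.norm(centres - normal, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)
    w /= w.sum()
    coeffs = np.zeros(len(centres))
    coeffs[idx] = w
    return coeffs
```

These coefficients are the regression targets for the AdaBoost learner described next.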
Pipeline of our Method
AdaBoost Regression
A response is learned for each cluster centre l
Learning optimizes a weighted expected loss
Empirical risk minimization in each iteration
Introducing two sets of weights transforms the problem into a recursive one
Closed-form solution for the parameters of the weak classifier (see the paper)
Test-time Evaluation
The most probable triangle of cluster centres is found by maximizing the classifier responses
The local-coding coefficients are found as an expected value under a probabilistic interpretation
The normal is recovered by projecting the weighted sum onto the unit sphere
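The final projection step is straightforward to sketch: the weighted sum of cluster centres generally leaves the unit sphere, so it is normalized back onto it. Names are illustrative.

```python
import numpy as np

def recover_normal(coeffs, centres):
    """Recover a surface normal from local-coding coefficients:
    take the weighted sum of the cluster centres and project it back
    onto the unit sphere by normalization."""
    v = coeffs @ centres          # (K,) @ (K, 3) -> (3,)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v  # degenerate all-zero case left as-is
```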
Results: qualitative examples with per-image errors (err) between 28.216 and 68.038
Normal estimation: conclusions
Normal estimation might not be as hard as it seems
The proposed joint pixel & segment learning is useful for other recognition tasks
Results improvable by incorporating regularization (potentially joint with depth)
Conclusion
Volumetric multi-view approach which performs joint reconstruction, recognition and segmentation
Strong coupling between geometry and appearance via a class-dependent anisotropic smoothness term
Clean energy formulation (with a tight convex relaxation)
Significant qualitative improvement with respect to the state of the art (the regularized solution makes more sense)
Semantic reconstruction is more useful
Challenges and future research
Scale to many classes: leverage sparsity of semantic interactions, class hierarchies
Scale to large volumes: adaptive space discretization and basis representations
Dynamic scenes: spatio-temporal interactions, extensions to 4D volumes and Wulff shapes
Exploration and robotics: enable real-time navigation and exploration; predict the information gain of perception actions
Thank you for your attention! Questions?