Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction
Marc Pollefeys
Joint work with Nikolay Savinov, Christian Häne, Ľubor Ladický
Comparison to Volumetric Fusion
Higher-order ray potentials to model visibility (Savinov et al., CVPR15)
Volumetric formulation: ray potentials + pairwise regularizer
The ray potential cost is based on the first occupied voxel along the ray: free space before it, then its depth and semantic label
Higher-order ray potentials to model visibility (Savinov et al., CVPR15)
Discrete formulation using a QPBO relaxation
Our goal is to find a graph construction such that the resulting energy is:
1) A pairwise function
2) With a number of edges that grows linearly with the length of a ray
3) Symmetric, to inherit QPBO properties
Two-label problem (Savinov et al., CVPR15)
To obtain this construction we take these steps:
1) Polynomial representation of the ray potential
2) Transformation into a submodular function over the variables x and their complements
3) Pairwise construction using auxiliary variables z
4) Merging variables [Ramalingam12] for linear complexity
5) Symmetrization of the graph
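Step 1 above, the polynomial representation, can be sketched as follows: the "cost of the first occupied voxel" potential is exactly a pseudo-Boolean polynomial in the occupancy variables. A minimal sketch; the function names and the explicit free-space cost argument are illustrative, not the paper's notation.

```python
def ray_potential(x, costs, freespace_cost=0.0):
    """Cost of a ray given binary occupancies x along it: pays costs[i]
    for the first occupied voxel i, or freespace_cost if the ray is empty."""
    for i, xi in enumerate(x):
        if xi == 1:
            return costs[i]
    return freespace_cost

def ray_potential_polynomial(x, costs, freespace_cost=0.0):
    """Same potential as a pseudo-Boolean polynomial:
    sum_i costs[i] * x_i * prod_{j<i} (1 - x_j)
    + freespace_cost * prod_j (1 - x_j)."""
    total = 0.0
    prefix_empty = 1.0  # running product prod_{j<i} (1 - x_j)
    for xi, ci in zip(x, costs):
        total += ci * xi * prefix_empty
        prefix_empty *= (1.0 - xi)
    return total + freespace_cost * prefix_empty
```

Both forms agree on every occupancy pattern; the polynomial form is the starting point for the submodular and pairwise constructions in steps 2 and 3.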
Symmetric graph construction for higher-order ray potential (Savinov et al, CVPR15)
Multi-label problem (Savinov et al., CVPR15)
Standard α-expansion: in each expansion move, the multi-label ray potential projects into a 2-label ray potential
Variables not labelled by QPBO are labelled using ICM
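A minimal sketch of how an expansion move restricts a multi-label ray potential to a 2-label one. The convention that label 0 means free space and all names here are assumptions for illustration, not the paper's notation.

```python
def expansion_binary_potential(current, alpha, cost_of_first):
    """Restrict a multi-label ray potential to an alpha-expansion move.

    current: current semantic labels along the ray (0 = free space, assumed).
    alpha:   the proposed label of the expansion move.
    Returns a function of the binary move vector t (t_i = 1 means voxel i
    switches to alpha). The result still depends only on the first
    non-free voxel along the ray, i.e. it is again a ray potential,
    now over binary variables.
    """
    def binary_potential(t):
        labels = [alpha if ti else li for ti, li in zip(t, current)]
        for i, l in enumerate(labels):
            if l != 0:  # first occupied voxel determines the cost
                return cost_of_first(i, l)
        return cost_of_first(None, 0)  # whole ray stays free space
    return binary_potential
```

Because the restricted potential keeps the "first occupied voxel" structure, the two-label graph construction above can be reused inside each expansion move.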
Implementation details (Savinov et al., CVPR15)
Semantic cost: semantic classifier [Ladický ICCV09]
Depth cost: multi-view stereo depth matches using zero-mean NCC, keeping the top n matches
Graph cut computed using [GoldbergAlgo11]
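Zero-mean NCC is a standard photometric matching score; a minimal sketch of how such a score is computed between two patches (the exact aggregation over the top n matches follows the paper):

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-9):
    """Zero-mean normalized cross-correlation between two image patches.

    Subtracting the means makes the score invariant to additive brightness
    changes; normalization makes it invariant to multiplicative gain.
    Returns a value in [-1, 1]; higher means a better match."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + eps
    return float((a * b).sum() / denom)
```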
Results: inference vs. generative (Savinov et al., CVPR15)
Results: input, depth, semantics, 3D model (Savinov et al., CVPR15)
Results (Savinov et al, CVPR15)
Joint depth-semantic cost?
Single-View Depth using a Joint Depth-Semantic Classifier
Marc Pollefeys
Joint work with Ľubor Ladický and Jianbo Shi (UPenn)
Single-View Depth Estimation
Single-View Depth Estimation
Standard approaches:
1. Model fitting [Barinova et al. ECCV08]: requires strong prior knowledge, ignores small objects
2. 3D-detection based [Hoiem et al. CVPR06]: works only for foreground objects (things)
3. Depth from semantic labels [Liu et al. CVPR10]: requires strong priors for semantic classes
4. Data driven [Saxena et al. NIPS05]: requires lots of data (depth does not generalize across classes), and balancing the data is a problem
Data-driven Depth Estimation Impossible?
Data-driven Depth Estimation No common structure of the scene Ground plane not always visible Large variation of viewpoints and of objects in the scene Both things and stuff in the scene
Data-driven Depth Estimation
Desired properties:
1. Pixel-wise classifier (super-pixels are not necessarily planar)
2. Translation invariant: the classifier response for a pixel x at depth d is computed over a w × h window around x
3. Depth transforms with inverse scaling: it is sufficient to train a binary classifier predicting a single canonical depth d_C; for any other depth d, the same classifier is applied to the window rescaled accordingly
Generalized to multiple semantic classes by predicting the semantic label jointly with depth
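The inverse-scaling property can be made concrete: an object at depth d appears d_C / d times larger than at the canonical depth, so evaluating the canonical classifier at depth d amounts to resizing its context window. A sketch with an assumed helper name:

```python
def window_size_at_depth(w_canonical, h_canonical, d_canonical, d):
    """Rescale the canonical context window for evaluation at depth d.

    Inverse scaling: an object at depth d appears d_canonical / d times
    larger than at the canonical depth, so the window trained at
    d_canonical is scaled by that factor. Hypothetical helper for
    illustration; the paper evaluates over an image pyramid instead of
    resizing per pixel."""
    s = d_canonical / d
    return round(w_canonical * s), round(h_canonical * s)
```

A closer object (d < d_C) gets a larger window, a farther one a smaller window, so one binary classifier covers the whole depth range.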
Training the classifier
1. Image pyramid is built
2. Training data randomly sampled
3. Samples of each class at depth d_C used as positives
4. Samples of other classes, or at depth d ≠ d_C, used as negatives
5. Multi-class classifier trained
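Steps 2-4 above can be sketched as a sampling routine: for a given class, a sample is a positive only if it has that class and sits (within a small tolerance) at the canonical depth; everything else is a negative. The data layout, tolerance, and all names are assumptions for illustration.

```python
import random

def make_training_samples(annotations, cls, d_canonical, tol=1.05, n=1000, seed=0):
    """Randomly sample annotated pixels and split them into positives
    and negatives for one class.

    annotations: list of (features, class_label, depth) tuples (assumed layout).
    A positive has class `cls` and depth within `tol` of d_canonical;
    other classes or other depths become negatives."""
    rng = random.Random(seed)
    samples = rng.sample(annotations, min(n, len(annotations)))
    positives, negatives = [], []
    for feats, c, depth in samples:
        rel = max(depth / d_canonical, d_canonical / depth)
        if c == cls and rel <= tol:
            positives.append(feats)   # this class, at the canonical depth
        else:
            negatives.append(feats)   # other class, or wrong depth
    return positives, negatives
```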
Classifying the patch
Dense features: SIFT, LBP, self-similarity, textons
Representation: soft bag-of-words (BoW) representations over a set of random rectangles
Classifier: AdaBoost
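A minimal sketch of one soft BoW descriptor: given each pixel's soft assignment of its dense feature to K visual words, the descriptor of a rectangle is the normalized sum of assignments inside it. The array layout and function name are assumptions; in practice such sums are accelerated with integral images.

```python
import numpy as np

def soft_bow_in_rectangle(assignments, rect):
    """Soft bag-of-words histogram inside one rectangle.

    assignments: H x W x K array of soft assignments of each pixel's
    dense feature (SIFT, LBP, ...) to K visual words.
    rect: (y0, x0, y1, x1) with exclusive bottom-right corner.
    Returns an L1-normalized K-bin histogram."""
    y0, x0, y1, x1 = rect
    hist = assignments[y0:y1, x0:x1].reshape(-1, assignments.shape[-1]).sum(axis=0)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

The patch descriptor is the concatenation of such histograms over the set of random rectangles, which is what AdaBoost sees.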
Experiments
KITTI dataset: 30 training & 30 test images (1382 × 512), 12 semantic labels, depth range 2-50 m (except sky), ratio of neighbouring depths d_{i+1}/d_i = 1.25
NYU2 dataset: 725 training & 724 test images (640 × 480), 40 semantic labels, depth range 1-10 m, ratio of neighbouring depths d_{i+1}/d_i = 1.25
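The fixed ratio d_{i+1}/d_i = 1.25 means the depth range is discretized geometrically; a short sketch of how such depth labels can be generated:

```python
def depth_bins(d_min, d_max, ratio=1.25):
    """Discretize a depth range geometrically so that neighbouring
    bin centres satisfy d_{i+1} / d_i = ratio, as in the experiments."""
    bins = [d_min]
    while bins[-1] * ratio <= d_max:
        bins.append(bins[-1] * ratio)
    return bins
```

For the KITTI range, depth_bins(2.0, 50.0) yields 15 depth labels; geometric spacing matches the inverse-scaling classifier, since stepping one bin corresponds to a fixed rescaling of the window.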
KITTI results
NYU2 results
Quantitative results (semantic)
Quantitative results on the KITTI dataset (recall)
Quantitative results on the NYU2 dataset (frequency-weighted intersection-over-union)
Quantitative results (depth) The ratio of pixels below the relative error
Quantitative results (depth)
The distribution of the relative errors of the estimated depth in log_1.25 space
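The "ratio of pixels below the relative error" metric can be sketched as follows; the max-of-ratios form of the relative error is a standard choice in single-view depth evaluation and is assumed here, with thresholds taken as powers of 1.25 to match the log_1.25 binning.

```python
import numpy as np

def ratio_below_relative_error(pred, gt, threshold=1.25):
    """Fraction of pixels whose relative depth error is below a threshold,
    i.e. max(pred/gt, gt/pred) < threshold. Sketch of a standard
    single-view depth metric; thresholds are typically 1.25, 1.25**2, ..."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    rel = np.maximum(pred / gt, gt / pred)
    return float((rel < threshold).mean())
```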
Single-View Depth: Conclusions
Things and stuff have an intrinsic visual scale
Depth can be recovered from a single image
Classification improves with a scale-normalized classifier
Data does not need to be balanced over scale
Discriminatively Trained Dense Surface Normal Estimation Marc Pollefeys Joint work with Ľubor Ladický and Bernhard Zeisl
Surface Normal Estimation
Not explored much in the literature, so how to approach it?
Pixels or super-pixels?
Pixel-based Classifiers
Feature representation: context-based (context pixels or rectangles) feature representations [Shotton06, Shotton08]
Classifier output is typically noisy and does not follow object boundaries
Segment-based Classifiers
Feature representation: based on feature statistics in segments
Segments expected to be label-consistent
One particular segmentation has to be chosen
Joint Regularization
Regularization over independent classifiers: existing optimization methods [Ladicky09] are designed for discrete labels
Not obvious how to generalize them to continuous problems
Maybe we can directly learn a joint classifier
Joint Learning
How to convert a segment representation into a pixel representation?
Representation of a pixel is the same as that of the segment it belongs to
Equivalent to a weighted segment-based approach
Concatenation combines the pixel representation with multiple segment representations
Joint Learning
To simplify the regression problem:
Normals are clustered using K-means clustering
Each normal is represented as a weighted sum of cluster centres using local coding
Learning is formulated as a regression onto the local-coding coordinates
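A minimal sketch of the local-coding step: a unit normal is expressed as a convex combination of its nearest cluster centres. The inverse-distance weighting and the parameter k are illustrative assumptions; the paper's exact coding scheme may differ.

```python
import numpy as np

def local_coding_weights(normal, centres, k=3):
    """Express a unit normal as a convex combination of its k nearest
    K-means cluster centres (a simple form of local coding).

    Returns a coefficient vector over all centres; only the k nearest
    entries are non-zero, weighted by inverse distance and normalized
    to sum to one."""
    d = np.linalg.norm(centres - normal, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)
    w /= w.sum()
    coeffs = np.zeros(len(centres))
    coeffs[idx] = w
    return coeffs
```

These coefficients are the regression targets for the AdaBoost learner described next.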
Pipeline of our Method
AdaBoost Regression
A response is learned for each cluster centre l
Learning optimizes a weighted expected loss
Empirical risk minimization in each iteration
Introducing two sets of weights transforms the problem into a recursive one
Closed-form solution for the parameters of the weak classifier (see the paper)
Test-time Evaluation
The most probable triangle of cluster centres is found by maximizing the classifier responses
The local-coding coefficients are found as an expected value under a probabilistic interpretation
The normal is recovered by projecting the weighted sum onto the unit sphere
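The final projection step is straightforward to sketch: the weighted sum of cluster centres generally leaves the unit sphere, so it is normalized back onto it. Names are illustrative.

```python
import numpy as np

def recover_normal(coeffs, centres):
    """Recover a surface normal from local-coding coefficients:
    take the weighted sum of the cluster centres and project it back
    onto the unit sphere by normalization."""
    v = coeffs @ centres          # (K,) @ (K, 3) -> (3,)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v  # degenerate all-zero case left as-is
```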
Results: qualitative examples with per-image errors (err) between 28.216 and 68.038
Normal estimation: conclusions
Normal estimation might not be as hard as it seems
The proposed joint pixel & segment learning is useful for other recognition tasks
Results improvable by incorporating regularization (potentially joint with depth)
Conclusion
Volumetric multi-view approach which performs joint reconstruction, recognition and segmentation
Strong coupling between geometry and appearance via a class-dependent anisotropic smoothness term
Clean energy formulation (with a tight convex relaxation)
Significant qualitative improvement with respect to the state of the art (the regularized solution makes more sense)
Semantic reconstruction is more useful
Challenges and future research
Scale to many classes: leverage sparsity of semantic interactions, class hierarchies
Scale to large volumes: adaptive space discretization and basis representations
Dynamic scenes: spatio-temporal interactions, extensions to 4D volumes and Wulff shapes
Exploration and robotics: enable real-time navigation and exploration; predict the information gain of perception actions
Thank you for your attention! Questions?