ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems (Supplementary Materials)


Yinda Zhang 1,2, Sameh Khamis 1, Christoph Rhemann 1, Julien Valentin 1, Adarsh Kowdle 1, Vladimir Tankovich 1, Michael Schoenberg 1, Shahram Izadi 1, Thomas Funkhouser 1,2, Sean Fanello 1
1 Google Inc., 2 Princeton University

In this supplementary material we provide additional information regarding our implementation details and more ablation studies on important components of the proposed framework. We conclude the supplementary material by evaluating our method on passive stereo matching, showing that the proposed approach can also be applied to RGB images.

1 Additional Implementation Details

To ensure reproducibility, in the following we give all the details needed to re-implement the proposed architecture.

1.1 Architecture

Our model extends the architecture of [3]. Although the specific architecture is out of the scope of this paper, we introduce three modifications that address specific problems of active stereo matching.

First, we adjust the Siamese tower to preserve more of the high-frequency signal in the input image. Unlike [3], which aggressively reduces resolution at the beginning of the tower, we run several residual blocks at full resolution and reduce resolution later using strided convolutions (Fig. 1 (a)).

Second, we use a two-stream refinement network (Fig. 1 (c)). Instead of stacking the image and the upsampled low-resolution disparity, we feed them into separate pathways, each consisting of one convolution layer with 16 channels followed by 3 resnet blocks that maintain the number of channels. The outputs of the two pathways are concatenated into a 32-dimensional feature map, which is further fed into 3 resnet blocks. We found that this architecture produces results with far fewer dot artifacts.

Lastly, we add an invalidation network that predicts the confidence of the estimated disparity (Fig. 1 (b)). Applications that need only very accurate depth can use this output to discard unreliable estimates. The invalidation network takes the concatenation of the left and right tower features and feeds it into 5 resnet blocks followed by a final convolution. The output is an invalidation map at the same resolution as the tower features, i.e. 1/8 of the input resolution. This low-resolution invalidation map is then upsampled to full resolution by bilinear interpolation.

Fig. 1. Detailed Network Architecture. (a) Siamese Tower, (b) Invalidation Network, (c) Disparity Refinement Network. In resnet blocks and convolutions, the numbers denote the number of channels, kernel size, stride, and dilation rate. In leaky ReLU, the number is the slope for x < 0. "c" denotes feature map concatenation.
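As a concrete illustration, the following is a minimal PyTorch sketch of the two-stream refinement network described above. The channel counts (16 per pathway, 32 after concatenation) follow the text; the kernel sizes, the leaky ReLU slope, and the final residual prediction are our assumptions, since those details live in Fig. 1.

import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block that preserves the channel count."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))


class TwoStreamRefinement(nn.Module):
    """Image and coarse disparity are processed in separate pathways,
    concatenated into a 32-dim feature map, and fused by 3 resnet blocks."""
    def __init__(self):
        super().__init__()
        self.image_path = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            ResBlock(16), ResBlock(16), ResBlock(16))
        self.disp_path = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),
            ResBlock(16), ResBlock(16), ResBlock(16))
        self.joint = nn.Sequential(
            ResBlock(32), ResBlock(32), ResBlock(32),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, image, coarse_disparity):
        fused = torch.cat([self.image_path(image),
                           self.disp_path(coarse_disparity)], dim=1)
        # Predicting a residual on top of the coarse disparity is an
        # assumption; the paper only specifies the two feature pathways.
        return coarse_disparity + self.joint(fused)

Keeping the two inputs in separate streams until the fusion stage means the disparity features are computed without direct interference from the dot pattern in the image, which is consistent with the reduced dot artifacts reported above.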

To refine the upsampled invalidation map, we concatenate it with the full-resolution refined disparity and the input image, and feed the result into 1 convolution + 4 resnet blocks + 1 convolution. The output is a one-dimensional score map in which higher values indicate invalid areas.

During training we formulate the invalidation mask learning as a classification task. Note that the problem is completely unbalanced, i.e. most of the pixels are valid, so a trivial solution for the network is to assign all pixels a positive label. To avoid this, we reweigh the output space by a factor of 10 (we observed that invalid pixels are roughly 10% of the valid ones): valid pixels are assigned the target y+ = 1, whereas invalid pixels are assigned y- = -10. The final loss is an l1 loss between the estimate and this ground truth mask.[1]

1.2 Training Resolution

Training and testing at full resolution is important for an active stereo system to produce accurate depth (Sec. 2.2). However, at the full resolution of 1280 x 720 the model does not fit in 12 GB of GPU memory during training. To still enable training at full resolution, we randomly pick a region of 1024 x 256 pixels and crop that same area from both the left and right images. This does not change the disparity, so models trained on these crops work directly at full resolution at test time.

[1] We also tried other classification losses, such as the logistic loss, and they all led to very similar results.
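Because the two views are rectified, cutting the same window from both images leaves all disparity values unchanged; only the set of visible correspondences shrinks. A minimal sketch of this crop, assuming the images are numpy arrays of shape (H, W, C) (the function name is ours):

import numpy as np

CROP_W, CROP_H = 1024, 256


def random_aligned_crop(left, right, rng=np.random):
    """Crop the same CROP_H x CROP_W window from both rectified views."""
    h, w = left.shape[:2]  # e.g. 720 x 1280
    x = rng.randint(0, w - CROP_W + 1)
    y = rng.randint(0, h - CROP_H + 1)
    return (left[y:y + CROP_H, x:x + CROP_W],
            right[y:y + CROP_H, x:x + CROP_W])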

Fig. 2. Active vs Passive Illumination. Notice the importance of both: active illumination alone exhibits a higher level of noise, while passive illumination alone struggles in textureless regions. When both illuminations are present, we predict high-quality disparity maps.

Fig. 3. Qualitative Evaluation - Resolution. Notice how full resolution is needed to produce thin structures and crisp edges.

Invalidation Network. Our invalidation network is trained fully self-supervised. The ground truth is generated on the fly by a hard left-right consistency check with a disparity threshold equal to 1. However, we found this supervision to be very unstable at the beginning of training, before the disparity network has converged. To regularize its behavior, we first disable the invalidation network and train only the disparity network for 20000 iterations; after that, we enable the invalidation network as well. In practice we found that this strategy helps the disparity network converge and prevents the invalidation network from affecting the disparity.
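The left-right check declares a pixel invalid when the disparity predicted in the left view disagrees with the disparity of its correspondence in the right view. Below is a minimal PyTorch sketch of generating these labels on the fly, assuming disparity tensors of shape (B, 1, H, W); the helper names and warping details are ours, while the threshold of 1 and the +1/-10 targets follow the text above and Sec. 1.1.

import torch
import torch.nn.functional as F


def warp_right_to_left(right_disp, left_disp):
    """Sample the right disparity map at x - d_left for every left pixel."""
    b, _, h, w = left_disp.shape
    device = left_disp.device
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    xs = xs.unsqueeze(0).float() - left_disp.squeeze(1)   # (B, H, W)
    ys = ys.unsqueeze(0).float().expand_as(xs)
    # Normalize coordinates to [-1, 1] as required by grid_sample;
    # out-of-image samples return 0 and thus typically fail the check.
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    return F.grid_sample(right_disp, grid, align_corners=True)


def invalidation_targets(left_disp, right_disp, threshold=1.0):
    """Hard LR check: consistent pixels get +1, inconsistent ones -10."""
    right_at_left = warp_right_to_left(right_disp, left_disp)
    invalid = (left_disp - right_at_left).abs() > threshold
    return torch.where(invalid,
                       torch.full_like(left_disp, -10.0),
                       torch.ones_like(left_disp))


def invalidation_loss(predicted_mask, left_disp, right_disp):
    # l1 between the prediction and the on-the-fly labels; the disparities
    # are detached so the consistency check receives no gradients.
    targets = invalidation_targets(left_disp.detach(), right_disp.detach())
    return (predicted_mask - targets).abs().mean()

In a training loop, this loss would simply be skipped for the first 20000 iterations, matching the warm-up schedule described above.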

2 Additional Evaluations

In this section we perform more ablation studies on different components of the network; we then show additional qualitative results and comparisons with other methods.

Fig. 4. More qualitative examples. Notice how, in these challenging scenes, we provide a more complete output, crisper edges, and less oversmoothed disparities.

2.1 Active + Passive Illumination

We first highlight the importance of having both active and passive illumination in our system. To show the impact of the two components, we run ASN on a typical living room scene with active illumination only (i.e. all the lights in the room turned off), passive illumination only (the sensor's laser turned off), and a combination of the two. The results are shown in Fig. 2. The result with only active illumination is noisier (note the dot spikes) and loses object edges (e.g. the coffee table) due to the strong presence of the dot pattern. The result from traditional passive stereo matching fails in texture-less regions such as the wall. In contrast, the combination of the two gives a smooth disparity with sharp edges.

2.2 Image Resolution

Here we evaluate the importance of image resolution for active stereo. We hypothesize that high-resolution input images are needed to generate high-quality disparity maps, because any downsizing loses considerable high-frequency signal from the active pattern. To test this, we compare the performance of ASN trained with full- and half-resolution inputs in Fig. 3. Although deep architectures are usually good at super-resolution tasks, in this case there is a substantial degradation of the overall results: thin structures and boundary details are missing in the ASN trained at half resolution. Note that this result drove our decision to design a compact network architecture that avoids excessive memory usage.

2.3 Performance w.r.t. Dataset Size

Here we show the performance of the algorithm with respect to the dataset size. In Fig. 6 we show how our method is able to produce reasonable depth maps on unseen images even when only a small dataset of 100 images is used.

Fig. 5. Quantitative Evaluation - Passive Stereo. Our method outperforms the state-of-the-art unsupervised method by Godard et al. [2] and generalizes better than the supervised loss on the Kitti 2012 and Kitti 2015 datasets.

Increasing the training data size leads to sharper results. We did not find any additional improvement when more than 10000 images are employed.

2.4 More Qualitative Results

Here we show additional challenging scenes and compare our method with local stereo matching algorithms as well as the sensor output. Results are depicted in Fig. 4. Notice how the proposed method not only estimates more complete disparity maps, but also suffers less from edge fattening and oversmoothing. Fig. 7 shows more results of the predicted disparity and invalidation mask on more diverse scenes.

3 Self-Supervised Passive Stereo

We conclude our study with an evaluation of the proposed self-supervised training method on RGB images. Although this is out of the scope of this paper, we show that the proposed loss also generalizes well to passive stereo matching problems. We consider the same architecture described in Sec. 1.1 and train three networks from scratch on the SceneFlow dataset [5] using three different losses: supervised, Godard et al. [2], and our method. While training on SceneFlow, we evaluate the accuracy of the current iteration on the validation set as well as on unseen datasets such as Middlebury [6], Kitti 2012, and Kitti 2015 [1, 4]. This is equivalent to testing our model on different datasets without finetuning.

Fig. 6. Analysis of performance w.r.t. dataset size. We found the model to be effective even with a small dataset: 100 images already contain millions of individual pixels that serve as training examples. Increasing the dataset size leads to more accurate (sharper) depth maps, although the overall error does not improve much. We did not observe any significant improvement beyond 10000 examples.

In Fig. 5 we show the EPE error of all the trained methods. Note that the model trained using our loss achieves the lowest error on all datasets, which indicates that our method generalizes well. In particular, on Kitti 2012 and Kitti 2015 we even outperform the supervised method trained on SceneFlow, which suggests that the supervised loss is prone to overfitting.

Finally, we conclude this work by showing qualitative results on passive RGB images. In Fig. 8 we compare the three networks on the validation set used in SceneFlow (top two rows) as well as on unseen datasets (bottom 6 rows). Notice how our method consistently outperforms Godard et al. [2] in terms of edge fattening and frequent outliers on every dataset. Moreover, on the Kitti images we do not suffer from gross errors like the supervised model, which clearly overfits to its original dataset.
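For reference, the end-point error (EPE) reported in Fig. 5 is the mean absolute difference between the predicted and ground-truth disparity; a minimal sketch follows (restricting the average to pixels with valid ground truth, as is standard practice on Kitti, is our assumption):

import numpy as np


def epe(pred_disp, gt_disp, valid=None):
    """Mean absolute disparity error, optionally over valid pixels only."""
    err = np.abs(pred_disp - gt_disp)
    if valid is not None:
        err = err[valid]
    return float(err.mean())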

Fig. 7. Results of the estimated disparity and invalidation mask on more diverse scenes.

Fig. 8. Examples of qualitative results on RGB data using our method. We trained on the SceneFlow dataset and tested on the others without any finetuning. Notice that on the Kitti datasets we show better results than the supervised method, which suffers from gross errors due to overfitting.

References

1. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
2. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)
3. Khamis, S., Fanello, S., Rhemann, C., Valentin, J., Kowdle, A., Izadi, S.: StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction. In: ECCV (2018)
4. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
5. Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016), http://lmb.informatik.uni-freiburg.de/publications/2016/mifdb16, arXiv:1512.02134
6. Scharstein, D., Hirschmuller, H., Kitajima, Y., Krathwohl, G., Nesic, N., Wang, X., Westling, P.: High-resolution stereo datasets with subpixel-accurate ground truth. In: GCPR (2014)