
Master Seminar Report for Recent Trends in 3D Computer Vision

Learning with Side Information through Modality Hallucination
Judy Hoffman, Saurabh Gupta, Trevor Darrell. CVPR 2016

Nan Yang
Supervisor: Benjamin Busam
Computer Aided Medical Procedures, Technische Universität München
15.12.2016

Table of Contents
1 Introduction
2 Related Work
3 Algorithm Explanation
4 Result
5 Conclusion
6 References

1 Introduction

Object detection is a classical and important task in computer vision. In some cases, as shown in Figure 1, using only RGB images leads to ambiguities. In this specific case, the false detection in the red box is likely caused by one leg of a desk and a bag that together resemble the shape of a chair. There are already works that use different image modalities simultaneously to achieve better performance than a single modality [2,3,4]. One might therefore suggest always using RGB-D image pairs to detect objects. However, even though depth cameras are more common than before, they are still not pervasive, so achieving good performance in object detection with only RGB images as input remains a valuable research goal. In this paper, Hoffman et al. propose a novel method that takes RGB-D image pairs at training time but only RGB images at test time for the object detection task. By doing this, the convolutional neural network learns to hallucinate mid-level convolutional features from an RGB image. A hallucination network is used at training time to transfer features that are commonly extracted from depth images to the RGB domain. This method outperforms the standard network trained on RGB-only input.

Fig. 1: RGB images can introduce ambiguities that lead to failures: the detection in the red box is the result of RGB object detection.

2 Related Work

The main idea of this paper is to use side information, namely depth, at training time and to transfer it through a new representation, the hallucination network, into the model used at test time. There are four main lines of related work:

RGB-D detection, transfer learning, learning with side information, and network transfer through distillation.

RGB-D detection. Object detection has two goals: determining where the objects are, and to which category each object belongs. Fast R-CNN [2] is a classical framework for object detection, as shown in Figure 2.

Fig. 2: The Fast R-CNN detection pipeline.

Depth information provides features that are complementary to color. Previous works exploit this fact by taking RGB-D image pairs as input to achieve a higher mean average precision than RGB-only models. Using raw depth values is not efficient enough to learn a precise detector, so many methods introduce new depth representations [5,6,7,8,3]. Recently, methods that add an additional depth convolutional neural network have been presented [2,1,4]. This work is inspired by these approaches.

Transfer learning, learning with side information, and network transfer through distillation. These three lines of work are the theoretical foundations for transferring information from one modality to another through an additional network. Transfer learning is about sharing information from one task to another. In deep learning, the information involved is usually the weights: weights learned on one task are transferred to a new task to reduce training time and improve performance. In effect, the weights learned on the old task provide a better initialization. Because the loss function lives in a very high-dimensional space, a better initialization means a starting point closer to the global minimum, or at least to a better local minimum. The idea of modality hallucination by learning an additional RGB representation is explored in [9].
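As a toy illustration of this idea, the following PyTorch sketch copies backbone weights learned on a source task into a model for a new task and fine-tunes from there. The module names, class counts, and checkpoint path are illustrative assumptions, not from the paper:

```python
import torch
import torch.nn as nn

class ConvBackbone(nn.Module):
    """Tiny stand-in feature extractor (illustrative, not the paper's)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

class Classifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = ConvBackbone()
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):
        f = self.backbone.features(x).flatten(1)
        return self.head(f)

# Source model trained on the old task (checkpoint path is hypothetical).
source = Classifier(num_classes=1000)
# source.load_state_dict(torch.load("source_task.pth"))

# Target model: same backbone, new head for the new task.
target = Classifier(num_classes=19)
target.backbone.load_state_dict(source.backbone.state_dict())

# Fine-tune the whole target model; the copied weights act as a better
# initialization than random, as described above.
optimizer = torch.optim.SGD(target.parameters(), lr=1e-3, momentum=0.9)
```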

Learning using side information is another good perspective from which to view this work. Its definition is fairly straightforward: a learning algorithm uses additional modality information at training time to achieve better performance, i.e., a stronger model. Surface normals are one such feature that can be extracted from depth images. They inform the detector which surface-normal directions are common for a specific object category; e.g., most normal vectors of a bed point upward rather than downward or sideways, which means a bed should lie on the ground instead of hanging on a wall. This approach can also be viewed through the lens of network transfer through distillation: even though the input modalities of different tasks may differ, distilling information from one task to another can still improve performance. There are several ways to implement network distillation; in this work, supervision is transferred from RGB-D image pairs by introducing a joint optimization with multiple loss functions.

3 Algorithm Explanation

This approach uses different network structures, as shown in Figure 3, at training and test time. At training time, three networks are learned: the RGB network, the hallucination network, and the depth network. The RGB network takes RGB images as input for the object detection task; mid-level RGB features are extracted through its convolutional layers. The depth network is analogous. The hallucination network is special: it takes RGB images as input but aims to extract depth features from them through its convolutional layers. To understand this, recall how convolutional neural networks work, since the hallucination network is simply a ConvNet. Weights learned through backpropagation, guided by a loss function, are convolved with the image; the results are feature maps, which span the feature space of the input image and carry the information most important for localization and classification. We therefore want the feature maps generated by the hallucination network to be as close as possible to those generated by the depth network, while the learned weights differ, since the input modalities of the two networks differ.

To train the networks, the RGB network and the depth network are first trained independently within the Fast R-CNN framework on their respective inputs. The next task is to initialize the hallucination network. The architectures of all three networks are identical, so either the RGB network or the depth network could be used for initialization. In this paper, initialization from the depth network is used, since experiments show it achieves the highest mean average precision among the three options: depth initialization, RGB initialization, and random initialization. The likely reason is that depth features are what should be extracted, so the parameters of the depth network provide a starting point closer to the global minimum, or to a better local minimum, than the other two choices.
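A minimal sketch of these two steps in PyTorch, assuming a simplified stand-in architecture; DetNet and the variable names are illustrative assumptions, not the authors' actual AlexNet/VGG code:

```python
import torch.nn as nn

class DetNet(nn.Module):
    """Stand-in for the shared Fast R-CNN-style architecture used by all
    three networks (conv layers -> pool5 -> fc layers -> cls/loc heads)."""
    def __init__(self, num_classes=19):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(),
            nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(),
            nn.AdaptiveMaxPool2d(6),           # plays the role of pool5
        )
        self.fcs = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 36, 4096), nn.ReLU())
        self.cls_head = nn.Linear(4096, num_classes + 1)  # +1 for background
        self.loc_head = nn.Linear(4096, 4 * num_classes)  # box regression

# Step 1: train the RGB and depth networks independently (Fast R-CNN).
rgb_net, depth_net = DetNet(), DetNet()
# ... training on RGB images / encoded depth images happens here ...

# Step 2: the hallucination network takes RGB input but is initialized
# from the depth network, the best-performing choice in the paper.
halluc_net = DetNet()
halluc_net.load_state_dict(depth_net.state_dict())
```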

Fig. 3: Network structures at training time (a) and at test time (b).

The next stage is to optimize the three networks jointly and transfer depth-modality features to the hallucination network. The objective guiding the training of the hallucination network is defined as:

L_hallucinate(l) = ‖σ(A_l^{dN}) − σ(A_l^{hN})‖_2^2    (1)

where l is the layer to which the hallucination loss is added, σ(x) = 1/(1 + exp(−x)) is the sigmoid function, and A_l^{dN} and A_l^{hN} are the activation outputs of layer l in the depth network and the hallucination network, respectively. Asymmetric transfer is used to train the hallucination network: while training and updating the weights of the hallucination network, the weights of the depth network must not be updated. This is achieved by simply setting the learning rates of the depth network's layers below l to 0, which results in no updates to the depth network under the mechanics of SGD. A sigmoid is applied inside the Euclidean loss. The author does not state the reason, but a plausible guess is that an unbounded loss is harmful for SGD and backpropagation; the sigmoid bounds each element of the activation difference to (−1, 1), making training more robust. The reason a ReLU is not used instead is clear: ReLU is not strictly monotonic. The whole loss function is defined as:

L = γ L_hallucinate + α [L_loc^{dN} + L_loc^{rN} + L_loc^{rdN} + L_loc^{rhN}] + β [L_cls^{dN} + L_cls^{rN} + L_cls^{hN} + L_cls^{rdN} + L_cls^{rhN}]    (2)

as shown in Figure 3(a). The weights γ, α, and β balance the loss function. The hallucination loss is a regression loss, and since the hallucination network takes RGB input but is initialized with depth weights, the hallucination loss is much larger than the classification and localization losses at the beginning of training. However, if the hallucination loss dominates the whole objective, the multi-task optimization suffers. To balance the terms, α is set to 0.5 and β to 1.0. A heuristic is used to determine γ: the contribution of the hallucination loss should be around ten times the contribution of any of the other losses.
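The following PyTorch sketch shows how Eqs. (1) and (2) might be assembled. The names and the detach-based freezing are my assumptions; the paper achieves the asymmetric transfer by zeroing the depth layers' learning rates, and detaching the depth activations is an equivalent way to keep gradients out of the depth branch:

```python
import torch

def hallucination_loss(act_depth, act_halluc):
    """Eq. (1): squared L2 distance between sigmoid-squashed activations
    of layer l (pool5 in the paper) of the depth and hallucination
    networks. Detaching the depth activations mimics the asymmetric
    transfer: gradients flow only into the hallucination branch."""
    diff = torch.sigmoid(act_depth.detach()) - torch.sigmoid(act_halluc)
    return diff.pow(2).sum()

# Sketch of the joint objective, Eq. (2). The per-branch detection losses
# (softmax classification, bounding-box regression) are assumed to be
# computed elsewhere by the Fast R-CNN heads.
alpha, beta = 0.5, 1.0
gamma = 10.0  # illustrative; chosen heuristically so the hallucination
              # term is roughly ten times any single detection loss

def total_loss(l_halluc, loc_losses, cls_losses):
    # loc_losses: [dN, rN, rdN, rhN]; cls_losses: [dN, rN, hN, rdN, rhN]
    return gamma * l_halluc + alpha * sum(loc_losses) + beta * sum(cls_losses)
```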

Other methods, such as gradient clipping, are also used to improve the robustness of the training procedure.

4 Result

As shown in Figure 4, the method proposed in this paper achieves the best performance among all compared methods on the NYUD2 dataset. The authors train the initial RGB and depth networks using the strategy proposed in [2], but with Fast R-CNN instead of the R-CNN used in [2]. The hallucination network is then initialized with the depth parameter values. Finally, the three-channel network structure is jointly optimized with a hallucination loss on the pool5 activations. For each architecture choice, this work first compares against the corresponding RGB-only Fast R-CNN model and finds that the hallucination network outperforms this baseline, with 30.5 mAP vs. 26.6 mAP for the AlexNet architecture and 34.0 mAP vs. 29.9 mAP for the VGG-1024 architecture. Note that for the joint AlexNet method, A-RGB + A-H, the APs of the joint model using each of the AlexNet RGB baseline models are averaged.

Fig. 4: Detection (AP%) on the NYUD2 test set.

To understand what the hallucination network learns and why depth features improve object detection, Figure 5 shows one detection result of the hallucination network (green box) and the RGB network (red box). The RGB network classifies the painting on the wall as a bed. The reason for this misclassification could be that beds are usually covered by colorful blankets and have a square shape, so the texture of the painting misleads the RGB network. The hallucination network, however, can extract mid-level depth features such as surface normals, so it knows that a square object hanging on a wall cannot be a bed.

5 Conclusion

This paper proposed a novel technique for fusing information from different modalities, i.e., depth and RGB, through a hallucination network that takes RGB images as input but extracts depth features. The approach outperforms the corresponding Fast R-CNN RGB detection models on the NYUD2 dataset. A future application could be fusing different sensor data in tasks such as SLAM to achieve better performance.

Fig. 5: Comparison between the results of the hallucination network (green box) and the RGB network (red box).

6 References

1. S. Gupta, P. A. Arbelaez, R. B. Girshick, and J. Malik. Aligning 3D models to RGB-D images of cluttered scenes. In Computer Vision and Pattern Recognition (CVPR), 2015.
2. S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In Computer Vision, ECCV 2014, pages 345-360. Springer, 2014.
3. C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In European Conference on Computer Vision (ECCV), 2014.
4. A. Wang, J. Lu, J. Cai, T. Cham, and G. Wang. Large-margin multi-modal deep learning for RGB-D object recognition. In IEEE Transactions on Multimedia, 2015.
5. A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A category-level 3D object dataset: Putting the Kinect to work. In Consumer Depth Cameras for Computer Vision, 2011.
6. B.-S. Kim, S. Xu, and S. Savarese. Accurate localization of 3D objects from RGB-D data using segmentation hypotheses. In CVPR, 2013.
7. S. Tang, X. Wang, X. Lv, T. X. Han, J. Keller, Z. He, M. Skubic, and S. Lao. Histogram of oriented normal vectors for object recognition with a depth sensor. In ACCV, 2012.
8. E. S. Ye. Object detection in RGB-D indoor scenes. Master's thesis, EECS Department, University of California, Berkeley, Jan 2013.
9. L. Spinello and K. O. Arras. Leveraging RGB-D data: Adaptive fusion and domain adaptation for object detection. In ICRA, 2012.