JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA


ABOUT ME 5th year PhD student in physics @ Stanford by day, deep learning computer vision scientist by night. Intern with Deep Learning Applied Research (Autonomous Vehicles) @ NVIDIA, Oct-Dec 2016.

TALK OVERVIEW (1) Problem statement and summary. (2) Dataset and preliminaries. (3) Model motivation. (4) Results and visualizations.

FROM SINGLE TO MULTITASK LEARNING Putting deep learning to work in the real world. Separate models: a detection model produces object bounding boxes, and a segmentation model produces a segmentation mask. Poor scalability + inefficient use of information! How do we use one model to perform multiple tasks faster and better? A shared model produces both the object bounding boxes and the segmentation mask (+ edge detection, + surface normals, + distance estimation). How do you relate various tasks to each other in a multi-task neural network?

WHAT WE WILL SHOW By ordering tasks based on receptive field and information density, we improve segmentation and detection accuracy by ~2% and ~8% over single networks, respectively. The joint network is robust and easy to tune compared to non-hierarchical baselines.

CITYSCAPES DATASET 2975 training images at resolution 1024 x 2048. 20 classes for semantic segmentation, including 8 object classes. Of these 8, 4 are much better represented (car, bicycle, person, rider): the easy classes. Segmentation, bounding box, and edge ground truth can all be generated. [Figure panels: raw image, semantic segmentation, edge detection, bounding boxes]
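
The slide says that bounding box and edge ground truth can be generated, but does not say how. The following is a minimal sketch of one plausible way to derive both from label masks. It assumes a binary per-instance mask for the box and a binary "label changes here" rule for the edges; the function names `bounding_box` and `edge_map` are hypothetical and not from the talk.

```python
def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) of the nonzero cells, or None.

    Assumption: `mask` is a binary mask for ONE object instance.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (rows[0], cols[0], rows[-1], cols[-1])


def edge_map(labels):
    """Mark a pixel as an edge if its right or lower neighbour has a different label.

    Assumption: a simple binary edge definition, chosen only for illustration.
    """
    height, width = len(labels), len(labels[0])
    edges = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            right = c + 1 < width and labels[r][c + 1] != labels[r][c]
            down = r + 1 < height and labels[r + 1][c] != labels[r][c]
            edges[r][c] = 1 if (right or down) else 0
    return edges


# Toy 4 x 4 label map with one 2 x 2 object of class 1.
labels = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
box = bounding_box(labels)   # (1, 1, 2, 2)
edges = edge_map(labels)
```

Note that the edge output in the talk has 3 channels (W x H x 3), so the actual edge labels are probably richer than this binary map.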

HOW TO TRAIN A SEGMENTATION NETWORK Standard FCN (Shelhamer 2015) architecture: convolutions followed by a deconvolution to recover a pixel-dense prediction mask.
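
To make the "deconvolution" step concrete, here is a minimal one-dimensional sketch of a transposed convolution, the operation commonly meant by that term in FCN-style networks. It assumes no padding, and the kernel and stride values are illustrative, not taken from the talk.

```python
def conv_transpose_1d(signal, kernel, stride):
    """Upsample `signal` by scattering a scaled copy of `kernel` every `stride` steps.

    Output length is (len(signal) - 1) * stride + len(kernel) (no padding).
    """
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


# Stride-2 upsampling of a low-resolution prediction row.
nearest = conv_transpose_1d([1.0, 2.0], [1.0, 1.0], stride=2)       # [1.0, 1.0, 2.0, 2.0]
smooth = conv_transpose_1d([1.0, 3.0], [0.5, 1.0, 0.5], stride=2)   # [0.5, 1.0, 2.0, 3.0, 1.5]
```

In a trained network the kernel weights are learned. The two fixed kernels here only show how the same operation can reproduce nearest-neighbour or linear-style upsampling.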

HOW TO TRAIN A DETECTION NETWORK The network outputs a confidence that a pixel lies near the center of an object. Points of high confidence produce bounding box coordinates. Confidences are coarser than a full segmentation but robust to occlusion.
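
The decoding step described above can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the threshold value, the `(x_min, y_min, x_max, y_max)` box layout, and the name `decode_detections` are all hypothetical, and any post-processing such as merging overlapping boxes is omitted.

```python
def decode_detections(confidence, boxes, threshold=0.5):
    """Return (score, box) pairs for grid cells whose confidence passes `threshold`.

    `confidence[r][c]` is the predicted object confidence at a grid cell and
    `boxes[r][c]` is the bounding box regressed at that same cell.
    Results are sorted from most to least confident.
    """
    detections = []
    for r, row in enumerate(confidence):
        for c, score in enumerate(row):
            if score >= threshold:
                detections.append((score, boxes[r][c]))
    detections.sort(key=lambda d: d[0], reverse=True)
    return detections


confidence = [
    [0.1, 0.9],
    [0.6, 0.2],
]
boxes = [
    [(0, 0, 1, 1), (10, 0, 20, 10)],
    [(0, 10, 8, 18), (5, 5, 6, 6)],
]
found = decode_detections(confidence, boxes)
# found == [(0.9, (10, 0, 20, 10)), (0.6, (0, 10, 8, 18))]
```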

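[Diagram: baseline multi-task model] Input (1024 x 2048) -> Shared Feature Map (from base CNN) -> Low-Res Seg Predictions (W x H x 20, followed by Deconv), Obj. Confidence Positions, Bbox Coordinate Positions. Training loss: $L = \alpha\, l_{\mathrm{seg}} + (1 - \alpha)\, l_{\mathrm{det}}$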
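
The weighted loss on this slide can be written as a small function. This is a sketch only: the name `joint_loss` and the range check on α are assumptions for illustration, and the individual loss values would come from the segmentation and detection heads.

```python
def joint_loss(seg_loss, det_loss, alpha):
    """Combine the two task losses as alpha * l_seg + (1 - alpha) * l_det."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * seg_loss + (1.0 - alpha) * det_loss


# alpha = 1 trains segmentation only, alpha = 0 trains detection only.
mixed = joint_loss(seg_loss=2.0, det_loss=4.0, alpha=0.25)   # 3.5
```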

OUR BASELINE MODEL PERFORMANCE [Plot of baseline performance against the segmentation vs. detection weight α.] α controls how much attention we pay to segmentation vs. detection during training.

A LABEL HIERARCHY ALONG TWO AXES [Chart with axes "Required Receptive Field" and "Density of Information", placing four label types: Object Bounding Boxes, Object Confidence, Semantic Segmentation, and Edge Detection (plus Semantic Segmentation).]

DEEP HIERARCHICAL NETWORK (DHM) [Architecture diagram] Input (1024 x 2048) -> Shared Feature Map (from base CNN) -> task branches in the order Edge, Segmentation, Obj. Confidence, Obj. BBox. Outputs: Low-Res Edge Predictions (W x H x 3) with Deconv; Low-Res Seg Predictions (W x H x 20) with Deconv; Obj. Confidence Positions; Bbox Coordinate Positions. Diagram annotations: decreasing information density; increasing receptive field (Dilated Convs).
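
The diagram uses dilated convolutions where a larger receptive field is needed. As a rough illustration of why dilation helps, the sketch below computes the receptive field of a stack of stride-1 convolutions. The kernel sizes and dilation rates are hypothetical, since the slides do not give the actual values.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of stacked stride-1 convs.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer widens the
    receptive field by (kernel_size - 1) * dilation.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


plain = receptive_field([(3, 1), (3, 1), (3, 1)])     # 7
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])   # 15
```

With the same three 3 x 3 layers and the same number of weights, the dilated stack covers roughly twice the extent per axis.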

RESULTS: HIGH ROBUSTNESS [Figures]

[Figure panels: raw image, edge predictions, bounding box predictions, segmentation predictions]

VISUALIZATIONS [Figures: detection and segmentation, single network vs. DHM (ours)]

VISUALIZATIONS [Figures: saliency (car) and segmentation, single network vs. DHM (ours)]

VISUALIZATIONS [Figures: saliency (bus) and segmentation, single network vs. DHM (ours)]

SUMMARY Our two hierarchies within our model allow our network to reason about inter-task relationships:
Information density: (Seg +) Edge > Seg > Object Conf > Bbox
Receptive field: (Seg +) Edge = Bbox >> Object Conf > Seg
With these relationships wired in, our network is: more accurate; robust to tuning; simultaneously better at fine detail and more instance-aware; efficient and scalable (3 tasks, 1 network!).

REFERENCES
J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.
S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In ICCV, 2015.
B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In CVPR, 2016.
E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. https://arxiv.org/pdf/1512.04412.pdf, 2015.

THANK YOU! Special thanks to: my internship mentor, Jian Yao; my managers, John Zedlewski and Andrew Tao; and all the wonderful people in DLAR/DLAV. Additional questions/comments: zchen89@stanford.edu