Instance-aware Semantic Segmentation via Multi-task Network Cascades

Instance-aware Semantic Segmentation via Multi-task Network Cascades. Jifeng Dai, Kaiming He, Jian Sun (Microsoft Research), 2016. Yotam Gil, Amit Nativ

Agenda: Introduction; Highlights; Implementation; Further improvements; Experiments & Results; Conclusions

Introduction: In semantic segmentation, each pixel has a category. Labeling image pixels with both semantic categories and instance indices is a challenging task.

Introduction: Classification; Classification + Localization; Object Detection; Instance Segmentation

Introduction: This is the output we're looking for: each object is assigned a class and an instance index.

Introduction: Existing methods require external mask proposal modules. They are slow at inference (~30 sec/image for MCG [CVPR 2014] proposals) and take no advantage of deeply learned features.

Highlights: First pure CNN-based method for instance segmentation. First place in the MS COCO segmentation challenge in 2015. Fastest CNN-based method for instance segmentation.

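Dividing the task into sub-tasks: Decomposition into three sub-tasks: (1) regressing box-level instances, (2) regressing mask-level instances, (3) categorizing instances.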

Dividing the task into sub-tasks: The tasks are dependent. This network structure is called a Multi-task Network Cascade. Training is done end-to-end (elaborated next).

Dividing the task into sub-tasks: Cascade model (figure).

Task 1: Regressing box-level instances. Region Proposal Network (RPN), based on Faster R-CNN. Input: shared features. Output: the highest-scoring boxes, passed to the next stage in the format $B_i = \{x_i, y_i, w_i, h_i, p_i\}$. Loss function: $L_1 = L_1(B(\theta))$.

Task 2: Regressing mask-level instances. Input: shared features and proposed boxes $\{B_i\}$. Output: $\{M_i\}$, a list of masks, one per proposed box, each of size $m^2$ and taking continuous values in [0, 1]. Logistic regression is performed against the ground-truth mask.

Task 2: Regressing mask-level instances. Loss function: $L_2 = L_2(M(\theta) \mid B(\theta))$. Region-of-Interest (RoI) pooling is done with a differentiable RoI warping layer to enable end-to-end training.
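To make the "logistic regression to the ground-truth mask" step concrete, here is a minimal sketch of a per-pixel logistic (binary cross-entropy) loss over the $m^2$ mask outputs. The function name `mask_logistic_loss`, the averaging over pixels, and the clipping constant `eps` are illustrative assumptions, not details taken from the slides.

```python
import math

def mask_logistic_loss(pred, target, eps=1e-12):
    """Mean binary cross-entropy between a predicted soft mask and a binary
    ground-truth mask, both given as flat lists of m*m values.

    Averaging (rather than summing) over pixels is an assumption.
    """
    total = 0.0
    for p, t in zip(pred, target):
        # Clip to avoid log(0) for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)

# A maximally uncertain prediction costs log(2) per pixel.
uncertain_loss = mask_logistic_loss([0.5, 0.5], [1, 0])
```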

Task 3: Categorizing instances. Input: shared features, boxes (stage 1) and masks (stage 2). Two pathways are concatenated to predict the object class. Box-based pathway: directly uses RoI-pooled features. Mask-based pathway: masks out background features, $F_i^{Mask}(\theta) = F_i^{RoI}(\theta) \cdot M_i(\theta)$ (element-wise product).
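The mask-based pathway amounts to an element-wise product between the RoI features and the predicted soft mask. A small sketch in plain Python follows; the helper name `mask_features` is hypothetical, and it assumes the mask has already been resized to the spatial resolution of the RoI features.

```python
def mask_features(roi_features, mask):
    """Element-wise product of RoI features and a soft mask.

    roi_features: list of channels, each an H x W list of lists.
    mask: H x W list of lists with values in [0, 1], applied to every channel.
    """
    return [
        [[f * m for f, m in zip(f_row, m_row)]
         for f_row, m_row in zip(channel, mask)]
        for channel in roi_features
    ]

features = [[[1.0, 2.0], [3.0, 4.0]]]   # one channel, 2 x 2
mask = [[1.0, 0.0], [0.5, 1.0]]         # 0 suppresses background positions
masked = mask_features(features, mask)
```

Because the mask takes continuous values, the product stays differentiable with respect to both the features and the mask, which is what lets the stage 3 loss reach stage 2.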

Task 3: Categorizing instances. Output: $C = \{C_i\}$, a list of category predictions for all instances. Loss function: $L_3 = L_3(C(\theta) \mid B(\theta), M(\theta))$.

End-to-end training: Loss function: $L = L_1 + L_2 + L_3$. Unlike in traditional multi-task learning, the loss terms are dependent: each later term depends on the outputs of the earlier stages.

End-to-end training, challenges: The chain rule must be applied to the loss function. The spatial transform of a predicted box determines RoI pooling (unlike R-CNN, for example).

End-to-end training, solution: Perform the cropping and warping operations by bilinear interpolation.

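End-to-end training: $F_i^{RoI}(\theta) = G(B_i(\theta)) F(\theta)$. $G$: cropping and warping, maps a $W \times H$ map to a $W' \times H'$ map; its dimensions are $n' \times n$. $F$: full-image feature map, an $n$-dimensional vector. $F^{RoI}$: output of RoI warping, an $n'$-dimensional vector. Gradient with respect to the box: $\frac{\partial L_2}{\partial B_i} = \frac{\partial L_2}{\partial F_i^{RoI}} \frac{\partial G}{\partial B_i} F$.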
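The cropping-and-warping operation can be sketched with bilinear interpolation on a single-channel feature map. This is an illustrative sketch, not the authors' layer: the function names, the `(x, y, w, h)` top-left-corner box convention, and the cell-centre sampling scheme are all assumptions made for the example.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at a real-valued (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp to the valid range so border samples stay defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom

def roi_warp(feature, box, out_size):
    """Crop box = (x, y, w, h) and warp it to out_size x out_size.

    (x, y) is taken to be the top-left corner in feature-map units
    (an assumed convention).
    """
    bx, by, bw, bh = box
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Sample at the centre of each output cell.
            sx = bx + (j + 0.5) * bw / out_size - 0.5
            sy = by + (i + 0.5) * bh / out_size - 0.5
            row.append(bilinear_sample(feature, sx, sy))
        out.append(row)
    return out

# A linear ramp f[y][x] = 4*y + x, which bilinear sampling reproduces exactly.
feature = [[4 * y + x for x in range(4)] for y in range(4)]
warped = roi_warp(feature, (0.0, 0.0, 4.0, 4.0), 2)
```

Every output value is a weighted sum of feature values whose weights vary continuously with the box coordinates, so a small change in the box produces a small change in the output. That is the property that makes gradients with respect to the predicted box available.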

Further improvements: Cascades with more stages. Two more stages are added to get a 5-stage cascade. Stages 2 and 3 are performed a second time, with the box proposals derived from stage 3.

Experiments on PASCAL VOC 2012

Experiments on PASCAL VOC 2012: Object detection evaluations as a by-product.

Experiments on MS COCO: Using VGG-16 and ResNet. The final result on the test-challenge set is 28.2%/51.5%.

Conclusions: Contributions: task decomposition; Multi-task Network Cascades (MNCs); solely based on CNNs, without external modules; end-to-end training; fast and accurate. To investigate in the future: the idea of exploiting network cascades in a multi-task learning framework may be useful for other recognition tasks; combining with other successful strategies.

Multi-scale Patch Aggregation (MPA) for Simultaneous Detection and Segmentation. Shu Liu, Xiaojuan Qi, Jianping Shi, Hong Zhang, Jiaya Jia (The Chinese University of Hong Kong, SenseTime Group Limited). Amit Nativ

So what are we talking about? Object recognition; object detection; semantic segmentation; instance-aware semantic segmentation.

Previous work [B. Hariharan 2014]: Region proposals; feature extraction (R-CNN); region classification; region refinement.

Patch Aggregation Method, the basic idea: Find different patches of the same object, find the mask in each one, and combine them in a smart way. Instance-aware + detection + segmentation.

Patch Aggregation Method, the basic idea: Each patch belongs to a different object. Instance-aware segmentation and detection.

Network structure: Convolution layers; multi-scale patch generator; multi-class classification branch; segmentation branch.

Convolution layers: Generate the global feature map. 13 convolution layers interleaved with ReLU and pooling, similar to the layers in the VGG-16 net. The downsampling factor is 16.

Multi-Scale Patch Generator: In the original image, 4 different patch sizes are used: 48×48, 96×96, 192×192, 384×384. Sliding windows with stride 16.

Multi-Scale Patch Generator: Different patch scales give different patch grids: patches of 48×48, 96×96, 192×192, 384×384 correspond to cropped feature grids of 3×3, 6×6, 12×12, 24×24 on the global feature map.
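The patch-to-grid correspondence follows from the downsampling factor of 16: a patch of side s covers s/16 cells of the global feature map. A small sketch of that arithmetic and of a sliding-window enumeration follows; the helper names and the "patches must fit fully inside the image" rule are assumptions for illustration.

```python
STRIDE = 16                       # downsampling factor of the conv layers
PATCH_SIZES = (48, 96, 192, 384)  # patch side lengths in image pixels

def feature_grid_size(patch_size, stride=STRIDE):
    """Side length of the feature grid covered by one patch."""
    return patch_size // stride

def patch_origins(image_w, image_h, patch_size, stride=STRIDE):
    """Top-left corners of all patches that fit fully inside the image."""
    xs = range(0, image_w - patch_size + 1, stride)
    ys = range(0, image_h - patch_size + 1, stride)
    return [(x, y) for y in ys for x in xs]

grid_sizes = [feature_grid_size(s) for s in PATCH_SIZES]
origins_48 = patch_origins(96, 96, 48)
```

Stepping the window by 16 image pixels moves it by exactly one cell on the feature map, so every patch lines up with whole feature cells.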

Multi-Scale Patch Generator: Intuitively, we could now analyze each scale separately, with one sub-network per grid size (Sub Net 3, Sub Net 6, Sub Net 12, Sub Net 24). Each takes cropped feature grids from the global feature map and outputs a mask and a label.

Multi-Scale Patch Generator: A better solution is to rescale all patches to the same size. Scale alignment (12×12): low-resolution grids are upsampled (deconv), the 12×12 grid is copied, and the high-resolution grid is downsampled (max pooling). A single Sub Net 12 then takes the aligned cropped feature grids and outputs the mask and label.

Training Sample Selection: The standard criterion is the Intersection over Union (IoU) value.
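For reference, the standard IoU criterion for two axis-aligned boxes can be computed as below. The corner format `(x1, y1, x2, y2)` and the function name `box_iou` are assumptions made for the example.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the intersection, zero if the boxes are disjoint.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```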

Training Sample Selection, condition 1: The patch center is on an object.

Training Sample Selection, condition 2: At least half of the object is inside the patch.

Training Sample Selection, condition 3: The object size is at least 20% of the patch.

Training Sample Selection: Only if all three conditions are met (1: the patch center is on an object; 2: at least half of the object is inside the patch; 3: the object size is at least 20% of the patch) is the patch positive. The object's class is then assigned to the patch, and its mask becomes the segmentation target.
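The three conditions can be checked directly on a pixel-level object mask. The sketch below is one possible reading: the object is a set of pixel coordinates, the patch is a square given by its top-left corner and side length, and condition 3 compares the whole object's area to the patch area. Those representation choices and the function name are assumptions.

```python
def is_positive_patch(patch, object_pixels):
    """Check the three selection conditions for one patch and one object.

    patch: (x, y, size), top-left corner and side length in pixels.
    object_pixels: set of (x, y) pixel coordinates of one object instance.
    """
    px, py, size = patch
    if not object_pixels:
        return False
    center = (px + size // 2, py + size // 2)
    inside = sum(1 for (x, y) in object_pixels
                 if px <= x < px + size and py <= y < py + size)
    cond1 = center in object_pixels                  # patch center on object
    cond2 = inside >= 0.5 * len(object_pixels)       # half the object inside
    cond3 = len(object_pixels) >= 0.2 * size * size  # object >= 20% of patch
    return cond1 and cond2 and cond3

# A 6x6 object occupying pixels (2..7, 2..7).
obj = {(x, y) for x in range(2, 8) for y in range(2, 8)}
centered = is_positive_patch((0, 0, 10), obj)
off_center = is_positive_patch((6, 6, 10), obj)
too_small = is_positive_patch((0, 0, 10), {(5, 5)})
```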

Distinguishing individual instances: Due to condition 1, a patch is only responsible for the object at its center. If objects overlap in a patch, only the mask of the center object is predicted.

Multi-class Classification Branch: Predicts a semantic label for each patch. 2×2 max pooling reduces complexity, followed by three fully connected layers. The output is the predicted score of patch $P_i$.

Segmentation Branch: Segments the object in the patch (one patch, one object).

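Training Loss and Strategy: The loss combines the classification and segmentation branches; the segmentation term applies only if the patch belongs to a class label (that is, $l_i \neq 0$): $L(w) = \sum_i \left[ -\log f_c^{l_i}(P_i) + \lambda \, \mathbb{1}(l_i \neq 0) \sum_{j=1}^{N} -\log f_s^{j}(P_i) \right]$, where the first term is the classification loss and the second is the segmentation loss.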
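One way to read this loss is as a classification negative log-likelihood plus a λ-weighted per-pixel segmentation term that is switched off for background patches. The sketch below follows that reading; the function name, the argument layout, and the unnormalized sum over pixels are assumptions rather than details confirmed by the slides.

```python
import math

def patch_loss(class_probs, label, pixel_probs, lam=1.0):
    """Loss for a single patch under the assumed reading of the slide.

    class_probs: predicted class distribution for the patch.
    label: ground-truth class index (0 = background).
    pixel_probs: probability the segmentation branch assigns to the
        correct label at each pixel of the patch.
    """
    loss = -math.log(class_probs[label])
    if label != 0:
        # Segmentation term only for patches assigned to an object class.
        loss += lam * sum(-math.log(p) for p in pixel_probs)
    return loss

background = patch_loss([0.5, 0.5], 0, [0.1, 0.1])
perfect_mask = patch_loss([0.5, 0.5], 1, [1.0, 1.0])
```

The total training loss would then be the sum of `patch_loss` over all sampled patches.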

Patch Aggregation Method: After network prediction, each patch has a semantic label and a mask (one patch, one semantic label). Overlapping patches give overlapping masks, and merging the masks refines the segmentation.

Patch Aggregation Method, how to merge masks: The overlap score is the IoU of neighboring masks. Row search: only the left side. Column search: only the top side. Iterate over all patches; the patch pair with the highest IoU is selected and merged. Repeat until the overlap score is less than τ.
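The merging loop can be sketched as a greedy procedure: repeatedly merge the pair of masks with the highest IoU until the best score falls below τ. This simplified sketch compares all pairs rather than only left and top neighbors, and represents each mask as a set of image-pixel coordinates; both simplifications, and the function names, are assumptions.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (x, y) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def merge_masks(masks, tau=0.5):
    """Greedily merge the highest-IoU pair until no pair reaches tau."""
    masks = [set(m) for m in masks]
    while len(masks) > 1:
        best, pair = 0.0, None
        for i in range(len(masks)):
            for j in range(i + 1, len(masks)):
                score = mask_iou(masks[i], masks[j])
                if score > best:
                    best, pair = score, (i, j)
        if pair is None or best < tau:
            break  # overlap score is below the threshold: stop
        i, j = pair
        masks[i] |= masks[j]  # the union of the two masks replaces mask i
        del masks[j]
    return masks

a = {(0, 0), (1, 0), (2, 0)}
b = {(1, 0), (2, 0), (3, 0)}   # overlaps a with IoU 2/4
c = {(10, 10)}                 # unrelated object
merged = merge_masks([a, b, c], tau=0.5)
```

Restricting the search to left and top neighbors, as the slide describes, avoids scoring each neighboring pair twice while scanning the patch grid. The all-pairs version here trades that efficiency for brevity.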

Results: Tested on different image data sets: VOC 2012 segmentation val; VOC 2012 SDS val; Microsoft COCO; VOC 2012 SDS val subset.

Results on VOC 2012 segmentation val: 10,582 images in train, 1499 images in val. Also proposal-free. (In terms of mAP^r with different IoU thresholds.)

Results on VOC 2012 SBD val: 5623 images in train, 5732 images in val. VOC 2012 SDS val subset.

Running-time Analysis: Proposal-based systems take much longer. Single-scale input takes ~2 sec; three-scale input takes ~9 sec. (Comparison groups: with region proposals vs. without region proposals.)

Error Analysis: Mis-localization has a strong effect. Error types: localization, class confusion, background detection.

Take-Home Message: No region proposals. Patches are used to detect interesting areas. For a patch to be included, it must satisfy 3 rules. Patches are selected and merged based on mask IoU.

Results on MS COCO test-std/test-dev: 120k images in trainval, 20k images in test-std, 20k images in test-dev.

QUESTIONS??