Mask R-CNN. By Kaiming He, Georgia Gkioxari, Piotr Dollar and Ross Girshick Presented By Aditya Sanghi

Similar documents
Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Mask R-CNN. Kaiming He, Georgia, Gkioxari, Piotr Dollar, Ross Girshick Presenters: Xiaokang Wang, Mengyao Shi Feb. 13, 2018

arxiv: v3 [cs.cv] 24 Jan 2018

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

Todo before next class

Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Amodal and Panoptic Segmentation. Stephanie Liu, Andrew Zhou

Mask R-CNN. Kaiming He Georgia Gkioxari Piotr Dollár Ross Girshick Facebook AI Research (FAIR)

arxiv: v1 [cs.cv] 15 Oct 2018

Lecture 5: Object Detection

Yiqi Yan. May 10, 2017

R-FCN: Object Detection with Really - Friggin Convolutional Networks

arxiv: v1 [cs.cv] 19 Mar 2018

Deep Learning for Object detection & localization

Final Report: Smart Trash Net: Waste Localization and Classification

Spatial Localization and Detection. Lecture 8-1

Zero-shot learning with Fast RCNN and Mask RCNN through semantic attribute Mapping

Object Detection on Self-Driving Cars in China. Lingyun Li

Lecture 7: Semantic Segmentation

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

In-Place Activated BatchNorm for Memory- Optimized Training of DNNs

Martian lava field, NASA, Wikipedia

An Analysis of Scale Invariance in Object Detection SNIP

POINT CLOUD DEEP LEARNING

arxiv: v1 [cs.cv] 19 Feb 2019

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

Skin Lesion Attribute Detection for ISIC Using Mask-RCNN

Team G-RMI: Google Research & Machine Intelligence

An Analysis of Scale Invariance in Object Detection SNIP

Fully Convolutional Networks for Semantic Segmentation

Object Detection Based on Deep Learning

Toward Scale-Invariance and Position-Sensitive Region Proposal Networks

Multi-View 3D Object Detection Network for Autonomous Driving

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Rich feature hierarchies for accurate object detection and semantic segmentation

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Classification of objects from Video Data (Group 30)

LEARNING TO INFER GRAPHICS PROGRAMS FROM HAND DRAWN IMAGES

arxiv: v2 [cs.cv] 18 Jul 2017

Computer Vision Lecture 16

Paper Motivation. Fixed geometric structures of CNN models. CNNs are inherently limited to model geometric transformations

Computer Vision Lecture 16

arxiv: v1 [cs.cv] 29 Nov 2018

AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

Deep Residual Learning

Automatic detection of books based on Faster R-CNN

Classifying a specific image region using convolutional nets with an ROI mask as input

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Object detection with CNNs

arxiv: v1 [cs.cv] 10 Jan 2019

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

arxiv: v2 [cs.cv] 23 Jan 2019

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets)

Modern Convolutional Object Detectors

arxiv: v2 [cs.cv] 28 Nov 2018

arxiv: v1 [cs.cv] 30 Jul 2018

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

arxiv: v1 [cs.cv] 26 Jul 2018

EE-559 Deep learning Networks for semantic segmentation

Finding Tiny Faces Supplementary Materials

arxiv: v1 [cs.cv] 23 Apr 2018

arxiv: v2 [cs.cv] 3 Sep 2018

Detection and Segmentation of Manufacturing Defects with Convolutional Neural Networks and Transfer Learning

Advanced Video Analysis & Imaging

Learning Deep Features for Visual Recognition

OBJECT DETECTION HYUNG IL KOO

Optimizing Object Detection:

Visual features detection based on deep neural network in autonomous driving tasks

Learning Globally Optimized Object Detector via Policy Gradient

An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches

Structured Prediction using Convolutional Neural Networks

An Exploration of Computer Vision Techniques for Bird Species Classification

Regionlet Object Detector with Hand-crafted and CNN Feature

SNIPER: Efficient Multi-Scale Training

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

arxiv: v1 [cs.cv] 23 Jan 2019

ECE 5470 Classification, Machine Learning, and Neural Network Review

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

Joint Object Detection and Viewpoint Estimation using CNN features

Learning to Segment Object Candidates

arxiv: v2 [cs.cv] 19 Apr 2018

Fast Edge Detection Using Structured Forests

arxiv: v2 [cs.cv] 8 Apr 2018

Accelerating Convolutional Neural Nets. Yunming Zhang

Cascade Region Regression for Robust Object Detection

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Object Detection in Sports Videos

arxiv: v1 [cs.ro] 24 Jul 2018

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

A Deep Learning Approach to Vehicle Speed Estimation

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Deepak Pathak, Philipp Krähenbühl and Trevor Darrell

Human Pose Estimation with Deep Learning. Wei Yang

Transcription:

Mask R-CNN By Kaiming He, Georgia Gkioxari, Piotr Dollar and Ross Girshick Presented By Aditya Sanghi

Types of Computer Vision Tasks http://cs231n.stanford.edu/

Semantic vs Instance Segmentation Image Source: https://arxiv.org/pdf/1405.0312.pdf

Overview of Mask R-CNN Goal: to create a framework for Instance segmentation Builds on top of Faster R-CNN by adding a parallel branch For each Region of Interest (RoI) predicts segmentation mask using a small FCN Changes RoI pooling in Faster R-CNN to a quantization-free layer called RoI Align Generate a binary mask for each class independently: decouples segmentation and classification Easy to generalize to other tasks: Human pose detection Result: performs better than state-of-art models in instance segmentation, bounding box detection and person keypoint detection

Some Results

Background - Faster R-CNN Image Source: https://arxiv.org/pdf/1506.01497.pdf Image Source: https://www.youtube.com/watch?v=ul25zsysk2a&index=1&list= PLkRkKTC6HZMxZrxnHUDYSLiPZxiUUFD2C

Background - FCN Image Source: https://arxiv.org/pdf/1411.4038.pdf

Related Work Image Source: https://www.youtube.com/watch?v=g7z4mkfrji4

Mask R-CNN Basic Architecture Procedure: RPN RoI Align Parallel prediction for the class, box and binary mask for each RoI Segmentation is different from most prior systems where classification depends on mask prediction Loss function for each sampled RoI Image Source: https://www.youtube.com/watch?v=g7z4mkfrji4

Mask R-CNN Framework

RoI Align Motivation Image Source: https://www.youtube.com/watch?v=ul25zsysk2a&inde x=1&list=plkrkktc6hzmxzrxnhudyslipzxiuuf D2C

RoI Align Removes this quantization which is causes this misalignment For each bin, you regularly sample 4 locations and do bilinear interpolation Result are not sensitive to exact sampling location or the number of samples Compare results with RoI wrapping: Which basically does bilinear interpolation on feature map only

RoI Align Image Source: https://www.youtube.com/watch?v=g7z4mkfrji4

RoI Align Results (a) RoIAlign (ResNet-50-C4) comparison (b) RoIAlign (ResNet-50-C5, stride 32) comparison

FCN Mask Head

Loss Function Loss for classification and box regression is same as Faster R-CNN To each map a per-pixel sigmoid is applied The map loss is then defined as average binary cross entropy loss Mask loss is only defined for the ground truth class Decouples class prediction and mask generation Empirically better results and model becomes easier to train

Loss Function - Results (a) Multinomial vs. Independent Masks

Mask R-CNN at Test Time https://www.youtube.com/watch?v=g7z4mkfrji4

Network Architecture Can be divided into two-parts: Backbone architecture : Used for feature extraction Network Head: comprises of object detection and segmentation parts Backbone architecture: ResNet ResNeXt: Depth 50 and 101 layers Feature Pyramid Network (FPN) Network Head: Use almost the same architecture as Faster R-CNN but add convolution mask prediction branch

Implementation Details Same hyper-parameters as Faster R-CNN Training: RoI positive if IoU is atleast 0.5; Mask loss is defined only on positive RoIs Each mini-batch has 2 images per GPU and each image has N sampled RoI N is 64 for C4 backbone and 512 for FPN Train on 8 GPUs for 160k iterations Learning rate of 0.02 which is decreased by 10 at 120k iterataions Inference: Proposal number 300 for C4 backbone and 1000 for FPN Mask branch is applied to the highest scoring 100 detection boxes; so not done parallel at test time, this speeds up inference and accuracy We also only use the kth-mask where k is the predicted class by the classification branch The m x m mask is resized to the RoI Size

Main Results

Main Results

Results: FCN vs MLP

Main Results Object Detection

Mask R-CNN for Human Pose Estimation

Mask R-CNN for Human Pose Estimation Model keypoint location as a one-hot binary mask Generate a mask for each keypoint types For each keypoint, during training, the target is a m x m binary map where only a single pixel is labelled as foreground For each visible ground-truth keypoint, we minimize the cross-entropy loss over a m 2 -way softmax output

Results for Pose Estimation (b) Multi-task learning (a) Keypoint detection AP on COCO test-dev (c) RoIAlign vs. RoIPool

Experiments on Cityscapes

Experiments on Cityscapes

Latest Results Instance Segmentation

Latest Result Pose Estimation

Future work Interesting direction would be to replace rectangular RoI Extend this to segment multiple background (sky, ground) Any other ideas?

Conclusion A framework to do state-of-art instance segmentation Generates high-quality segmentation mask Model does Object Detection, Instance Segmentation and can also be extended to human pose estimation!!!!!! All of them are done in parallel Simple to train and adds a small overhead to Faster R-CNN

Resources Official code: https://github.com/facebookresearch/detectron TensorFlow unofficial code: https://github.com/matterport/mask_rcnn ICCV17 video: https://www.youtube.com/watch?v=g7z4mkfrji4 Tutorial Videos: https://www.youtube.com/watch?v=ul25zsysk2a&list=plkrkktc6hzmxzr xnhudyslipzxiuufd2c

References https://arxiv.org/pdf/1703.06870.pdf https://arxiv.org/pdf/1405.0312.pdf https://arxiv.org/pdf/1411.4038.pdf https://arxiv.org/pdf/1506.01497.pdf http://cs231n.stanford.edu/ https://www.youtube.com/watch?v=oot3uixzzte https://www.youtube.com/watch?v=ul25zsysk2a&index=1&list=plkrkktc 6HZMxZrxnHUDYSLiPZxiUUFD2C

Thank You Any Questions?