Instance-aware Semantic Segmentation via Multi-task Network Cascades

Instance-aware Semantic Segmentation via Multi-task Network Cascades. Jifeng Dai, Kaiming He, Jian Sun (Microsoft Research, 2016). Yotam Gil, Amit Nativ

Agenda Introduction Highlights Implementation Further improvements Experiments & Results Conclusions

Introduction: In semantic segmentation, each pixel has a category. Labeling image pixels with both semantic categories and instance indices is a challenging task.

Introduction Classification Classification + Localization Object Detection Instance Segmentation

Introduction: This is the output we're looking for: each object is assigned a class and an instance index.

Introduction: Existing methods require external mask proposal modules. They are slow at inference (~30 sec/image for MCG [CVPR 2014] proposals) and take no advantage of deeply learned features.

Highlights: First pure CNN-based method for instance segmentation. First place in the MS COCO segmentation challenge in 2015. Fastest CNN-based method for instance segmentation.

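Dividing the task into sub-tasks: Decomposition into three sub-tasks: (1) regressing box-level instances, (2) regressing mask-level instances, (3) categorizing instances.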

Dividing the task into sub-tasks: The tasks are dependent. This network structure is called a Multi-task Network Cascade. Training is done end-to-end (elaborated next).

Dividing the task to sub-tasks Cascade Model -

Task 1: Regressing box-level instances. Region Proposal Network (RPN), based on Faster R-CNN. Input: shared features. Outputs the highest-scoring boxes to the next stage, in the format $B_i = \{x_i, y_i, w_i, h_i, p_i\}$. Loss function: $L_1 = L_1(B(\theta))$.

Task 2: Regressing mask-level instances. Input: shared features and proposed boxes $\{B_i\}$. Output: $\{M_i\}$, a list of masks, one per proposed box, each of size $m^2$ (an $m \times m$ grid) with continuous values in [0, 1]. Trained by logistic regression to the ground-truth mask.

Task 2: Regressing mask-level instances. Loss function: $L_2 = L_2(M(\theta) \mid B(\theta))$. Region-of-Interest (RoI) pooling uses a differentiable RoI warping layer to enable end-to-end training.
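
The logistic regression of a predicted mask to its ground-truth mask can be illustrated with a per-pixel binary cross-entropy. This is a minimal sketch, not the authors' code: the function name `mask_logistic_loss`, the averaging over pixels, and the `eps` clamp are assumptions made for illustration.

```python
import math

def mask_logistic_loss(pred_mask, gt_mask, eps=1e-12):
    """Mean per-pixel binary cross-entropy between an m x m predicted mask
    (values in [0, 1]) and a binary ground-truth mask of the same size.
    Averaging over pixels is an assumption of this sketch."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
            count += 1
    return total / count
```

For example, a mask that predicts 0.5 everywhere gives a loss of log 2 regardless of the ground truth, and the loss approaches zero as the prediction approaches the ground-truth mask.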

Task 3: Categorizing instances. Input: shared features, boxes (stage 1) and masks (stage 2). Two pathways are concatenated to predict the object class. Box-based pathway: directly uses the RoI-pooled features. Mask-based pathway: masks out background features, $F_i^{Mask}(\theta) = F_i^{RoI}(\theta) \cdot M_i(\theta)$ (element-wise product).
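
The mask-based pathway multiplies the RoI features by the predicted mask element by element. A minimal sketch follows, assuming the mask has already been resized to the spatial resolution of the RoI features and is applied to every channel; the name `mask_features` and the nested-list layout are hypothetical.

```python
def mask_features(roi_features, mask):
    """Element-wise product of every channel of roi_features
    (C x H x W nested lists) with a single H x W mask in [0, 1]."""
    return [
        [[f * m for f, m in zip(feat_row, mask_row)]
         for feat_row, mask_row in zip(channel, mask)]
        for channel in roi_features
    ]
```

Positions where the mask is close to 0 (background) are suppressed, while positions where it is close to 1 keep their feature values.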

Task 3: Categorizing instances. Output: $C = \{C_i\}$, a list of category predictions for all instances. Loss function: $L_3 = L_3(C(\theta) \mid B(\theta), M(\theta))$.

End-to-end training: Loss function: $L = L_1 + L_2 + L_3$. Unlike traditional multi-task learning, the loss terms are dependent.
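
The dependence between the loss terms can be written out by combining the per-stage losses from the three task slides, with $\theta$ denoting all network parameters:

```latex
L(\theta) = L_1\big(B(\theta)\big)
          + L_2\big(M(\theta) \mid B(\theta)\big)
          + L_3\big(C(\theta) \mid B(\theta), M(\theta)\big)
```

Because $B(\theta)$ appears inside $L_2$ and $L_3$, and $M(\theta)$ inside $L_3$, the gradients of the later stages flow back into the earlier ones.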

End-to-end training: Challenges. Applying the chain rule to the loss function requires differentiating through the spatial transform of a predicted box, which determines the RoI pooling region (unlike R-CNN, for example, where the boxes are fixed).

End-to-end training: Solution. Perform the cropping and warping operations by bilinear interpolation.

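End-to-end training: $F_i^{RoI}(\theta) = G(B_i(\theta))\, F(\theta)$. $G$: cropping and warping; maps a $W \times H$ feature map to a $W' \times H'$ one (a matrix of dimensions $n' \times n$). $F$: full-image feature map, an $n$-dimensional vector ($n = WH$). $F_i^{RoI}$: output of RoI warping, an $n'$-dimensional vector ($n' = W'H'$). Gradient with respect to the box: $\frac{\partial L_2}{\partial B_i} = \frac{\partial L_2}{\partial F_i^{RoI}} \frac{\partial G}{\partial B_i} F$.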
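
Cropping and warping by bilinear interpolation can be sketched as follows. This is an illustrative version under stated assumptions, not the layer from the paper: it works on a single-channel feature map stored as nested lists, takes the box as corner coordinates `(x_min, y_min, x_max, y_max)` in feature-map units rather than the `(x, y, w, h)` format of the slides, and omits any pooling that follows the warp. The names `bilinear_sample` and `roi_warp` are hypothetical.

```python
def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at a real-valued (x, y)."""
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the feature-map border
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_warp(feature, box, out_h, out_w):
    """Crop box = (x_min, y_min, x_max, y_max) and warp it to out_h x out_w."""
    x_min, y_min, x_max, y_max = box
    out = []
    for i in range(out_h):
        # spread output rows evenly across the box, endpoints included
        y = y_min + (y_max - y_min) * (i / (out_h - 1) if out_h > 1 else 0.5)
        row = []
        for j in range(out_w):
            x = x_min + (x_max - x_min) * (j / (out_w - 1) if out_w > 1 else 0.5)
            row.append(bilinear_sample(feature, x, y))
        out.append(row)
    return out
```

Each output value is a piecewise-linear function of the box coordinates, which is what makes a gradient with respect to the box available, in contrast to cropping on a fixed integer grid.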

Further improvements: cascades with more stages. Two more stages are added to get a 5-stage cascade: stages 2 and 3 are performed a second time, with the box proposals derived from stage 3.

Experiments on PASCAL VOC 2012

Experiments on PASCAL VOC 2012: Object detection evaluation as a by-product.

Experiments on MS COCO: Using VGG-16 and ResNet. The final result on the test-challenge set is 28.2% / 51.5% (mAP@[.5:.95] / mAP@.5).

Conclusions: Contributions: task decomposition; Multi-task Network Cascades (MNCs); solely based on CNNs, without external modules; end-to-end training; fast and accurate. To investigate in the future: the idea of exploiting network cascades in a multi-task learning framework may be useful for other recognition tasks; combining other successful strategies.

Multi-scale Patch Aggregation (MPA) for Simultaneous Detection and Segmentation Shu Liu, Xiaojuan Qi, Jianping Shi, Hong Zhang, Jiaya Jia (The Chinese University of Hong Kong, SenseTime Group Limited) Amit Nativ

So what are we talking about? Object recognition Object detection Semantic segmentation Instance aware semantic segmentation

Previous work [B. Hariharan 2014]: region proposals; feature extraction (R-CNN); region classification; region refinement.

Patch Aggregation Method: The Basic Idea. Find different patches of the same object, find the mask in each one, and combine them in a smart way. Instance-aware + detection + segmentation.

Patch Aggregation Method: The Basic Idea. Each patch belongs to a different object. Instance-aware segmentation and detection.

Network structure Convolution layers Multi-scale path generator Class classification branch Segmentation branch

Convolution layers: The convolution layers generate the global feature map. 13 convolution layers interleaved with ReLU and pooling, similar to the layers in the VGG-16 net. The downsampling factor is 16.

Multi-Scale Patch Generator: In the original image, 4 different patch sizes are used: 48×48, 96×96, 192×192, 384×384. Sliding windows with stride 16.

Multi-Scale Patch Generator: A different patch scale gives a different patch grid. Patches of 48×48, 96×96, 192×192 and 384×384 pixels correspond to 3×3, 6×6, 12×12 and 24×24 feature grids, cropped from the global feature map.
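
The mapping from patch sizes to feature grids follows from the downsampling factor of 16 and can be sketched as a sliding-window generator over the global feature map. The function name `patch_grids` is hypothetical, and sliding one feature cell at a time (16 image pixels) is an assumption of this sketch.

```python
def patch_grids(feat_h, feat_w, patch_sizes=(48, 96, 192, 384), downsample=16):
    """Yield (patch_size, grid_size, top, left) for every sliding-window
    position on a feat_h x feat_w feature map.
    Assumes the window moves one feature cell at a time."""
    for size in patch_sizes:
        grid = size // downsample  # 48 -> 3, 96 -> 6, 192 -> 12, 384 -> 24
        for top in range(feat_h - grid + 1):
            for left in range(feat_w - grid + 1):
                yield size, grid, top, left
```

On a 24×24 feature map (a 384×384 image), this yields 22×22 positions for the 3×3 grid but only a single position for the 24×24 grid.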

Multi-Scale Patch Generator: Intuitively, we could now analyze each scale separately, with one sub-network per grid size (Sub Net 3, Sub Net 6, Sub Net 12, Sub Net 24), each predicting a mask and a label from the feature grids cropped from the global feature map.

Multi-Scale Patch Generator: A better solution is to rescale all patches to the same size. Scale alignment (12×12): the low-resolution grids are upsampled by deconvolution, the 12×12 grid is copied, and the high-resolution grid is downsampled by max pooling. A single sub-network (Sub Net 12) then predicts the mask and the label from the aligned 12×12 grids.
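
The downsampling arm of the scale alignment can be illustrated with a 2×2 max pooling that turns a 24×24 grid into a 12×12 one. This sketch covers only that arm, on a single channel; the deconvolution used for upsampling is a learned layer and is not reproduced here. The name `max_pool_2x2` is hypothetical.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on a 2-D grid with even side lengths."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]
```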

Training Sample Selection Standard criterion Intersection over Union (IoU) value
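
For reference, the standard IoU criterion between two boxes can be computed as below. A minimal sketch, assuming boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`; the name `box_iou` is hypothetical.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

Two 2×2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so their IoU is 1/7.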

Training Sample Selection Condition 1: Patch center on an object

Training Sample Selection Condition 2: at least half the object is inside the patch

Training Sample Selection Condition 3: The object size is at least 20% of patch

Training Sample Selection: Only if all three conditions are met (Condition 1: the patch center is on an object; Condition 2: at least half of the object is inside the patch; Condition 3: the object size is at least 20% of the patch) is the patch positive: the object's class is assigned to the patch, and its mask is the segmentation target.
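
The three conditions can be checked together as in the sketch below. The representation is an assumption made for illustration: a square patch given as `(top, left, size)` in pixels and an object given as a set of `(row, col)` pixels. Condition 3 is read here as "the object area inside the patch is at least 20% of the patch area", which is one interpretation of the slide. The name `is_positive_patch` is hypothetical.

```python
def is_positive_patch(patch, obj_pixels):
    """Return True if the patch is a positive sample for the object.
    patch = (top, left, size); obj_pixels = set of (row, col) pixels."""
    top, left, size = patch
    center = (top + size // 2, left + size // 2)
    inside = {(r, c) for (r, c) in obj_pixels
              if top <= r < top + size and left <= c < left + size}
    cond1 = center in obj_pixels                   # patch center on the object
    cond2 = len(inside) >= 0.5 * len(obj_pixels)   # at least half the object inside
    cond3 = len(inside) >= 0.2 * size * size       # object covers >= 20% of patch
    return cond1 and cond2 and cond3
```

A patch that fails any one of the conditions is not labeled positive for that object, even if it overlaps the object substantially.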

Distinguish individual instances: Due to condition 1, a patch is only responsible for its center object. If objects overlap in a patch, only the label and mask of the object at the center are predicted.

Multi-class Classification Branch: Predicts a semantic label for each patch. 2×2 max pooling to reduce complexity. Three fully connected layers. Output: the predicted score of patch $P_i$.

Segmentation Branch Segments the object in the patch (one patch one object)

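Training Loss and Strategy: The loss of the classification and segmentation branches: $L(w) = \sum_i \left[ -\log f_c^{l_i}(P_i) + \lambda \, \frac{\mathbb{1}(l_i \neq 0)}{N} \sum_j -\log f_s^{j}(P_i) \right]$. The first term is the classification loss and the second is the segmentation loss; the indicator $\mathbb{1}(l_i \neq 0)$ enables the segmentation term only if the patch belongs to a class.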
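
One reading of this loss for a single patch is sketched below. Several points are assumptions rather than facts from the slide: both terms are taken as negative log-likelihoods, label 0 is treated as background, the segmentation term is averaged over the mask elements, and `mask_probs[j]` is the predicted foreground probability of element `j`. The name `patch_loss` and the default `lam` are hypothetical.

```python
import math

def patch_loss(class_probs, label, mask_probs, gt_mask, lam=1.0):
    """Classification plus (optional) segmentation loss for one patch.
    class_probs: class probabilities, index 0 = background (assumption).
    mask_probs / gt_mask: flat lists of foreground probabilities / 0-1 labels."""
    loss = -math.log(class_probs[label])           # classification term
    if label != 0:                                 # indicator: skip background
        seg = 0.0
        for p, g in zip(mask_probs, gt_mask):
            seg += -math.log(p if g == 1 else 1.0 - p)
        loss += lam * seg / len(mask_probs)        # averaged segmentation term
    return loss
```

Background patches contribute only the classification term, so the segmentation branch is trained on positive patches alone.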

Patch Aggregation Method: After network prediction, each patch has a semantic label and a mask (one patch, one semantic label). Overlapping patches give overlapping masks; merging these masks optimizes the segmentation.

Patch Aggregation Method: How to merge masks. Overlap score: IoU of neighboring masks. Row search: only the left side. Column search: only the top side. Iterate over all patches; the patch pair with the highest IoU is selected and merged. Repeat until the overlap score is less than τ.
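
The merging loop can be sketched as a greedy procedure over masks stored as sets of image pixels. This is a simplified illustration, not the authors' procedure: it compares all pairs instead of restricting the search to left and top neighbors, it merges a pair by taking the union of the two masks, and it assumes all masks share the same semantic label. The names `mask_iou` and `merge_masks` and the default `tau` are hypothetical.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def merge_masks(masks, tau=0.5):
    """Greedily merge the pair with the highest IoU until no pair reaches tau."""
    masks = [set(m) for m in masks]
    while len(masks) > 1:
        best, pair = -1.0, None
        for i in range(len(masks)):
            for j in range(i + 1, len(masks)):
                score = mask_iou(masks[i], masks[j])
                if score > best:
                    best, pair = score, (i, j)
        if best < tau:  # stop once the best overlap score drops below tau
            break
        i, j = pair
        merged = masks[i] | masks[j]
        masks = [m for k, m in enumerate(masks) if k not in (i, j)] + [merged]
    return masks
```

Masks that overlap strongly collapse into a single instance mask, while masks of distinct objects stay separate because their IoU stays below the threshold.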

Results Tested on different image data sets: VOC 2012 segmentation val VOC 2012 SDS val Microsoft COCO VOC 2012 SDS val subset

Results on VOC 2012 segmentation val: 10,582 images in train, 1,449 images in val. Also proposal-free. (In terms of $mAP^r$ with different IoU thresholds.)

Results On VOC 2012 SBD val 5623 images in train 5732 images in val. VOC 2012 SDS val subset

Running-time Analysis: Proposal-based systems take much longer. Single-scale input takes ~2 sec; three-scale input takes ~9 sec. (Comparison: methods with region proposals vs. methods without region proposals.)

Error Analysis: Mis-localization has a strong effect. Error types: localization, class confusion, background detection.

Take-Home Message: No region proposals. Patches are used to detect interesting areas. For a patch to be included, it must satisfy three rules. Patches are selected and merged based on mask IoU.

Results on MS COCO test-std/test-dev: 120k images in trainval, 20k images in test-std, 20k images in test-dev.

QUESTIONS??