Instance-aware Semantic Segmentation via Multi-task Network Cascades. Jifeng Dai, Kaiming He, Jian Sun (Microsoft Research, 2016). Yotam Gil, Amit Nativ
Agenda Introduction Highlights Implementation Further improvements Experiments & Results Conclusions
Introduction: Semantic segmentation: each pixel has a category. Labeling image pixels with both semantic categories and instance indices is a challenging task.
Introduction Classification Classification + Localization Object Detection Instance Segmentation
Introduction: This is the output we're looking for: each object is classified by class and instance index.
Introduction: Existing methods require external mask proposal modules, are slow at inference (~30 s per image for MCG [CVPR 2014] proposals), and take no advantage of deeply learned features.
Highlights First pure CNN-based method for instance segmentation First place in MS COCO segmentation challenge in 2015 Fastest CNN-based method for instance segmentation
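Dividing the task into sub-tasks: Decomposition into three sub-tasks: regressing box-level instances, regressing mask-level instances, and categorizing instances.
Dividing the task into sub-tasks: The tasks are dependent: each stage uses the output of the previous one. This network structure is called a Multi-task Network Cascade (MNC). Training is done end-to-end (elaborated next).
Dividing the task into sub-tasks: Cascade model.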
Task 1: Regressing box-level instances. Region Proposal Network (RPN), based on Faster R-CNN. Input: shared features. Outputs the highest-scoring boxes to the next stage, in the format $B_i = \{x_i, y_i, w_i, h_i, p_i\}$. Loss function: $L_1 = L_1(B(\theta))$.
Task 2: Regressing mask-level instances. Input: shared features and proposed boxes $\{B_i\}$. Output: $\{M_i\}$, a list of masks, one per proposed box, each of size $m^2$ (an $m \times m$ resolution) and taking continuous values in $[0, 1]$. Logistic regression is performed against the ground-truth mask.
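The logistic regression to the ground-truth mask can be sketched as a pixel-wise binary cross-entropy. This is a minimal illustration, not the authors' code: the function name `mask_logistic_loss`, the averaging over pixels, and the clamping constant `eps` are assumptions made here.

```python
import math

def mask_logistic_loss(pred, target, eps=1e-12):
    """Mean binary cross-entropy between a predicted m x m mask and its target.

    pred:   m x m nested list of probabilities in [0, 1]
    target: m x m nested list of 0/1 ground-truth values
    Averaging over pixels is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

# Toy 2 x 2 mask (m = 2): confident, mostly correct predictions.
pred = [[0.9, 0.1], [0.8, 0.2]]
target = [[1, 0], [1, 0]]
loss = mask_logistic_loss(pred, target)
```

The loss shrinks toward 0 as each predicted value approaches its 0/1 target.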
Task 2: Regressing mask-level instances. Loss function: $L_2 = L_2(M(\theta) \mid B(\theta))$. Region-of-Interest (RoI) pooling with a differentiable RoI warping layer enables end-to-end training.
Task 3: Categorizing instances. Input: shared features, boxes (stage 1), and masks (stage 2). Two pathways are concatenated to predict the object class. Box-based pathway: directly uses the RoI-pooled features. Mask-based pathway: masks out background features, $F_i^{Mask}(\theta) = F_i^{RoI}(\theta) \cdot M_i(\theta)$ (element-wise product).
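The mask-based pathway amounts to an element-wise product between the RoI features and the predicted mask. A minimal sketch follows, assuming the mask has already been brought to the same spatial resolution as the features; the helper name `mask_features` and the channels x height x width list layout are hypothetical choices for illustration.

```python
def mask_features(roi_features, mask):
    """Multiply every feature channel element-wise by the mask.

    roi_features: C x H x W nested list (RoI-pooled features)
    mask:         H x W nested list with values in [0, 1]
    Assumes mask and features share the same H x W resolution.
    """
    return [
        [
            [value * weight for value, weight in zip(feature_row, mask_row)]
            for feature_row, mask_row in zip(channel, mask)
        ]
        for channel in roi_features
    ]

# Two channels of 2 x 2 features and one soft mask.
roi_features = [[[1.0, 2.0], [3.0, 4.0]],
                [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1.0, 0.0],
        [0.5, 1.0]]
masked = mask_features(roi_features, mask)
```

Positions where the mask is 0 are zeroed in every channel, so background features do not reach the classifier through this pathway.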
Task 3: Categorizing instances. Output: $C = \{C_i\}$, a list of category predictions for all instances. Loss function: $L_3 = L_3(C(\theta) \mid B(\theta), M(\theta))$.
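End-to-end training: Loss function: $L = L_1 + L_2 + L_3$. Unlike traditional multi-task learning, the loss terms are dependent: each later term is conditioned on the outputs of the earlier stages.
End-to-end training: Challenges. Applying the chain rule to the loss function requires differentiating through the spatial transform of a predicted box that determines RoI pooling (unlike R-CNN, for example, where the boxes are fixed inputs).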
End-to-end training Solution Perform cropping and warping operations by bilinear interpolation
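End-to-end training: $F_i^{RoI}(\theta) = G(B_i(\theta)) \, F(\theta)$. $G$: cropping and warping; maps a $W \times H$ map to a $W' \times H'$ map; dimensions $n' \times n$. $F$: full-image feature map, an $n$-dimensional vector ($n = WH$). $F_i^{RoI}$: output of RoI warping, an $n'$-dimensional vector ($n' = W'H'$). Gradient with respect to the box: $\frac{\partial L_2}{\partial B_i} = \frac{\partial L_2}{\partial F_i^{RoI}} \frac{\partial G}{\partial B_i} F$.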
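Cropping and warping by bilinear interpolation can be sketched as below. This is an illustrative version only: the box convention (center x, center y, width, height in feature-map coordinates), the sampling positions, and the names `bilinear_sample` and `roi_warp` are assumptions, and a single-channel feature map is used to keep the example short.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at a real-valued (x, y)."""
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)   # clamp to the map borders
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0             # interpolation weights
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom

def roi_warp(feature, box, out_w, out_h):
    """Crop `box` from `feature` and warp it to out_w x out_h.

    box = (cx, cy, w, h): assumed center/size convention for this sketch.
    """
    cx, cy, box_w, box_h = box
    left, top = cx - box_w / 2.0, cy - box_h / 2.0
    warped = []
    for j in range(out_h):
        sy = top + (j + 0.5) * box_h / out_h
        warped.append([
            bilinear_sample(feature, left + (i + 0.5) * box_w / out_w, sy)
            for i in range(out_w)
        ])
    return warped

# A 4 x 4 map with value x + 10 * y at column x, row y.
feature = [[x + 10 * y for x in range(4)] for y in range(4)]
crop = roi_warp(feature, (1.5, 1.5, 2.0, 2.0), 2, 2)
shrunk = roi_warp(feature, (1.5, 1.5, 4.0, 4.0), 2, 2)
```

Each output value is a weighted sum of four neighboring feature values whose weights vary continuously with the box coordinates, which is what allows a gradient with respect to the box to be computed.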
Further improvements: Cascades with more stages. Two more stages are added to obtain a 5-stage cascade: stages 2 and 3 are performed a second time, with the box proposals derived from stage 3.
Experiments on PASCAL VOC 2012
Experiments on PASCAL VOC 2012: Object detection evaluations as a by-product.
Experiments on MS COCO: Using VGG-16 and ResNet. The final result on the test-challenge set is 28.2% / 51.5% (mAP@[.5:.95] / mAP@.5).
Conclusions: Contributions: task decomposition; Multi-task Network Cascades (MNCs), based solely on CNNs without external modules; end-to-end training; fast and accurate. To investigate in the future: the idea of exploiting network cascades in a multi-task learning framework may be useful for other recognition tasks; combining other successful strategies.
Multi-scale Patch Aggregation (MPA) for Simultaneous Detection and Segmentation Shu Liu, Xiaojuan Qi, Jianping Shi, Hong Zhang, Jiaya Jia (The Chinese University of Hong Kong, SenseTime Group Limited) Amit Nativ
So what are we talking about? Object recognition Object detection Semantic segmentation Instance aware semantic segmentation
Previous work [B. Hariharan 2014] Region Proposals Feature extraction: (R-CNN) Region Classification Region Refinement
Patch Aggregation Method: The basic idea: find different patches of the same object, find the mask in each one, and combine them in a smart way. Instance-aware + detection + segmentation.
Patch Aggregation Method: The basic idea: each patch belongs to a different object. Instance-aware segmentation and detection.
Network structure: convolution layers; multi-scale patch generator; multi-class classification branch; segmentation branch.
Convolution layers: The convolution layers generate the global feature map. 13 convolution layers interleaved with ReLU and pooling, similar to the layers in the VGG-16 net. The downsampling factor is 16.
Multi-Scale Patch Generator: In the original image, four patch sizes are used: 48×48, 96×96, 192×192, and 384×384. Sliding windows with stride 16.
Multi-Scale Patch Generator: Different patch scales give different patch grids on the global feature map: 48×48 → 3×3, 96×96 → 6×6, 192×192 → 12×12, 384×384 → 24×24 (cropped feature grids).
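The mapping from patch size to feature-grid size is just a division by the downsampling factor of 16. The sketch below works through that arithmetic and counts sliding-window positions, assuming the window advances one feature cell (16 image pixels) at a time; `grid_size`, `count_windows`, and the 24 x 32 feature map (a 384 x 512 image) are hypothetical names and values chosen for illustration.

```python
STRIDE = 16                       # downsampling factor of the conv layers
PATCH_SIZES = [48, 96, 192, 384]  # patch side lengths in image pixels

def grid_size(patch_size, stride=STRIDE):
    """Side length of the patch on the feature map, in cells."""
    if patch_size % stride != 0:
        raise ValueError("patch size must be a multiple of the stride")
    return patch_size // stride

def count_windows(feat_h, feat_w, grid):
    """Number of grid x grid windows on a feat_h x feat_w feature map,
    assuming the window moves one feature cell at a time."""
    rows = max(feat_h - grid + 1, 0)
    cols = max(feat_w - grid + 1, 0)
    return rows * cols

grid_sizes = [grid_size(size) for size in PATCH_SIZES]
# Example: a 384 x 512 image gives a 24 x 32 feature map.
windows = {grid: count_windows(24, 32, grid) for grid in grid_sizes}
```

Small patches produce many more windows than large ones, which is why the four scales are handled as separate grids.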
Multi-Scale Patch Generator: Intuitively, we could now analyze each scale separately: the cropped feature grids from the global feature map feed four sub-nets (Sub-Net 3, Sub-Net 6, Sub-Net 12, Sub-Net 24), each predicting a mask and a label.
Multi-Scale Patch Generator: A better solution is to rescale all patches to the same size (scale alignment to 12×12). Low-resolution grids (3×3, 6×6) are upsampled by deconvolution, the 12×12 grid is copied, and the high-resolution grid (24×24) is downsampled by max pooling. A single sub-net (Sub-Net 12) then predicts the mask and label from the aligned 12×12 grids.
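The scale alignment step can be illustrated with a shape-only sketch. Note the substitution: the slide uses learned deconvolution for upsampling, whereas the code below uses nearest-neighbor replication purely to show how each grid reaches 12×12. The function names are hypothetical, and grids are assumed square.

```python
TARGET = 12  # common grid size after scale alignment

def upsample_nearest(grid, factor):
    """Stand-in for deconvolution: repeat each cell factor x factor times."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def max_pool(grid, factor):
    """Non-overlapping factor x factor max pooling on a square grid."""
    size = len(grid) // factor
    return [
        [
            max(grid[r * factor + dr][c * factor + dc]
                for dr in range(factor) for dc in range(factor))
            for c in range(size)
        ]
        for r in range(size)
    ]

def align_to_target(grid, target=TARGET):
    """Bring a 3x3, 6x6, 12x12, or 24x24 grid to target x target."""
    n = len(grid)
    if n < target:
        return upsample_nearest(grid, target // n)  # 3x3 and 6x6
    if n > target:
        return max_pool(grid, n // target)          # 24x24
    return [list(row) for row in grid]              # 12x12 is copied

def make_grid(n):
    return [[r * n + c for c in range(n)] for r in range(n)]

aligned = {n: align_to_target(make_grid(n)) for n in (3, 6, 12, 24)}
```

After alignment every scale has the same shape, so one sub-network can process patches from all four scales.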
Training Sample Selection Standard criterion Intersection over Union (IoU) value
Training Sample Selection Condition 1: Patch center on an object
Training Sample Selection Condition 2: at least half the object is inside the patch
Training Sample Selection Condition 3: The object size is at least 20% of patch
Training Sample Selection: A patch is positive only if all three conditions are met. Condition 1: the patch center is on an object. Condition 2: at least half of the object is inside the patch. Condition 3: the object size is at least 20% of the patch. A positive patch is assigned the object's class and the object's mask to segment.
Distinguish individual instances: Due to condition 1, a patch is responsible only for the object at its center. If objects overlap in a patch, only the mask and label of the center object are predicted.
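The three conditions can be checked with a few area counts. The sketch below assumes a binary object mask and a square patch given as (top, left, size); it also assumes condition 3 compares the object area inside the patch with the patch area. The name `is_positive_patch` and these conventions are illustrative, not taken from the original system.

```python
def is_positive_patch(object_mask, patch):
    """Return True if the patch satisfies all three selection conditions.

    object_mask: H x W nested list of 0/1 values for one object
    patch:       (top, left, size) of a square patch, in pixels
    """
    top, left, size = patch
    height, width = len(object_mask), len(object_mask[0])

    # Condition 1: the patch center lies on the object.
    cy, cx = top + size // 2, left + size // 2
    centered = 0 <= cy < height and 0 <= cx < width and object_mask[cy][cx] == 1

    # Object area overall and inside the patch.
    object_area = sum(sum(row) for row in object_mask)
    inside = sum(
        object_mask[r][c]
        for r in range(max(top, 0), min(top + size, height))
        for c in range(max(left, 0), min(left + size, width))
    )

    # Condition 2: at least half of the object is inside the patch.
    half_inside = object_area > 0 and inside >= 0.5 * object_area
    # Condition 3: the object covers at least 20% of the patch (assumed reading).
    large_enough = inside >= 0.2 * size * size

    return centered and half_inside and large_enough

def square_mask(n, start, stop):
    """n x n mask with ones on rows and columns start..stop-1."""
    return [[1 if start <= r < stop and start <= c < stop else 0
             for c in range(n)] for r in range(n)]

big_object = square_mask(8, 2, 6)    # 16 pixels
small_object = square_mask(8, 3, 5)  # 4 pixels
results = [
    is_positive_patch(big_object, (0, 0, 8)),    # all conditions hold
    is_positive_patch(big_object, (0, 0, 4)),    # less than half inside
    is_positive_patch(small_object, (0, 0, 8)),  # object too small
]
```

Only the first patch is positive: the second contains a quarter of the object and the third is dominated by background.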
Multi-class Classification Branch: Predicts a semantic label for each patch. 2×2 max pooling reduces complexity, followed by three fully connected layers. Output: the predicted score of patch $P_i$.
Segmentation Branch Segments the object in the patch (one patch one object)
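Training Loss and Strategy: The loss of the classification and segmentation branches: $L(w) = \sum_i \left[ -\log f_c^{l_i}(P_i) + \lambda \, \mathbb{1}(l_i \neq 0) \left( -\frac{1}{N} \sum_j \log f_s^j(P_i) \right) \right]$. The first term is the classification loss; the second is the segmentation loss, which is active only if the patch belongs to a class label ($l_i \neq 0$).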
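A per-patch version of this loss can be sketched as a classification term plus a segmentation term that is switched on only for patches carrying a class label. Everything specific here is an assumption for illustration: label 0 is treated as background, both terms are written as negative log-likelihoods, the segmentation term is averaged over the N mask pixels, and `patch_loss` and `lam` are hypothetical names.

```python
import math

def patch_loss(class_probs, label, mask_probs, mask_target, lam=1.0):
    """Classification plus (optional) segmentation loss for one patch.

    class_probs: list of class probabilities (index 0 assumed background)
    label:       ground-truth class index of the patch
    mask_probs:  flat list of per-pixel foreground probabilities
    mask_target: flat list of 0/1 ground-truth mask values
    lam:         weight of the segmentation term
    """
    classification = -math.log(class_probs[label])
    if label == 0:
        # Indicator is 0: background patches have no mask to segment.
        return classification
    segmentation = 0.0
    for prob, target in zip(mask_probs, mask_target):
        correct = prob if target == 1 else 1.0 - prob
        segmentation += -math.log(correct)
    return classification + lam * segmentation / len(mask_probs)

background = patch_loss([0.5, 0.5], 0, [0.5, 0.5], [1, 0])
foreground = patch_loss([0.5, 0.5], 1, [0.5, 0.5], [1, 0])
```

With these toy inputs the foreground patch pays for both terms, while the background patch pays only the classification term.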
Patch Aggregation Method: After network prediction, each patch has a semantic label and a mask (one patch, one semantic label). Overlapped patches produce overlapped masks; merging these masks optimizes the segmentation.
Patch Aggregation Method: How to merge masks. Overlap score: IoU of neighboring masks. Row search: only the left side; column search: only the top side. Iterate over all patches; the patch pair with the highest IoU is selected. Repeat until the overlap score is less than τ.
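The merging loop can be sketched as a greedy procedure over mask IoU. This simplified version departs from the slide in two labeled ways: it compares all mask pairs instead of only left and top neighbors, and it merges a selected pair by taking the union of their pixels, which is an assumption. Masks are represented as sets of (row, column) pixels in image coordinates; `mask_iou` and `merge_masks` are hypothetical names.

```python
def mask_iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def merge_masks(masks, tau):
    """Greedily merge the highest-IoU pair until the best score is below tau."""
    masks = [set(mask) for mask in masks]
    while len(masks) > 1:
        best_score, best_pair = -1.0, None
        for i in range(len(masks)):
            for j in range(i + 1, len(masks)):
                score = mask_iou(masks[i], masks[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < tau:
            break  # no remaining pair overlaps enough
        i, j = best_pair
        merged = masks[i] | masks[j]  # union as the merged mask (assumption)
        masks = [m for k, m in enumerate(masks) if k not in (i, j)] + [merged]
    return masks

# Two overlapping masks of one object and one distant mask.
mask_a = {(0, 0), (0, 1), (0, 2)}
mask_b = {(0, 1), (0, 2), (0, 3)}
mask_c = {(5, 5)}
merged = merge_masks([mask_a, mask_b, mask_c], tau=0.4)
untouched = merge_masks([mask_a, mask_b, mask_c], tau=0.6)
```

With τ = 0.4 the two overlapping masks (IoU 0.5) collapse into one instance, while with τ = 0.6 all three masks stay separate.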
Results Tested on different image data sets: VOC 2012 segmentation val VOC 2012 SDS val Microsoft COCO VOC 2012 SDS val subset
Results on VOC 2012 segmentation val: 10,582 images in train, 1,449 images in val. Also proposal-free. (In terms of $mAP^r$ with different IoU thresholds.)
Results on VOC 2012 SBD val: 5,623 images in train, 5,732 images in val. VOC 2012 SDS val subset.
Running-time Analysis: Proposal-based systems take much longer. With no region proposals, single-scale input takes ~2 s and three-scale input takes ~9 s.
Error Analysis Mis-localization has a strong effect Localization Class confusion Background detection
Take-Home Message: No region proposals. Patches are used to detect interesting areas. For a patch to be included, it must satisfy three rules. Patches are selected and merged based on mask IoU.
Results on MS COCO test-std/test-dev: 120k images in trainval, 20k images in test-std, 20k images in test-dev.
QUESTIONS??