Object Boundary Guided Semantic Segmentation

Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo
University of Southern California

arXiv:1603.09742v1 [cs.CV] 31 Mar 2016

Abstract. Semantic segmentation has been a major topic in computer vision and plays an important role in understanding object classes as well as object localization. Recent developments in deep learning, especially fully convolutional networks (FCN), have enabled pixel-level labeling with more accurate results. However, most previous works, including FCN, do not take the object boundary into consideration. In fact, since the original ground-truth labels do not provide a clean object boundary, labeled contours and background objects have both been ignored as the background class. In this work, we propose an object boundary guided FCN (OBG-FCN), which uses prior knowledge of object boundaries from training to achieve better class accuracy and segmentation details. To this end, we first relabel the object contours and train an FCN branch specifically to locate objects and their contours. We then transform the output of this branch to 21 classes and use it as a mask to refine the detailed shapes of the objects. End-to-end learning is then applied to finetune the transform parameters, which reweigh the combination of object, background and boundary in the final segmentation decision. We apply the proposed method to the PASCAL VOC segmentation benchmark and achieve 87.4% mean IU (a 15% relative improvement over FCN and around a 10% improvement over CRF-RNN), and our edge model is shown to be stable and accurate even at the detail level of FCN-2s.

1 Introduction

Recently, semantic segmentation has played an important role in understanding object classes as well as object localization [1]. The introduction of fully convolutional networks [2] brought a large improvement in semantic image segmentation. A number of recent approaches, including DeepLab [3] and CRF-RNN [4], have also achieved good segmentation performance. However, while they use feature representations to make pixel-wise classifications, it is important to take another factor into consideration, namely the object boundary, to make the labeling more accurate and natural [5].

Fig. 1. Examples of segmentation results. The first column shows the input images and the last column the segmentation ground truth. The 2nd to 4th columns are the results from FCN-8s, CRF-RNN and our proposed OBG-FCN.

Adapting ConvNets to the pixel-wise classification task is a significant challenge. Firstly, the convolution filters and max-pooling operations of traditional CNNs make the object boundary prediction quite coarse. Furthermore, although recent algorithms such as FCN [2] use the intermediate convolutional layers to refine the pixel-wise prediction, they cannot predict object boundaries well [6]. CRF-RNN [4] formulates mean-field approximate inference, which partially improves the prediction on object boundaries, but a number of problems remain, for example mixing nearby objects together and misclassifying pixels on the boundaries. Without boundary constraints, these methods cannot give a good boundary prediction in most cases.

To deal with this problem, we introduce an object boundary guided FCN network (OBG-FCN), which uses pre-trained prior knowledge of object boundaries to enhance the performance of semantic segmentation. In this work, we first relabel the object contours based on the PASCAL annotation. Although the ground truth provided by the PASCAL annotation already offers contour information, it also includes some background objects as boundaries, which would mislead our training. Therefore, we relabel the object boundary by shifting the positions of the object regions and derive an accurate ground truth with object proposals and boundaries. Then, we follow the FCN network structure and conduct step-by-step learning from FCN-32s to FCN-2s to train our 3-class OB-FCN segmenter (Object-Boundary-Background). Next, we use the output of the OB-FCN branch as a mask layer, which is transformed from a 3-class output to a 21-class output, and multiply it element-wise with the output of the original FCN-8s network. End-to-end training then follows on the object boundary guided FCN (OBG-FCN) to finetune the network.

The results show a large improvement over the previous state of the art in mean IU on the PASCAL VOC benchmark, with higher class accuracy and finer object details. The following sections explain our implementation details and introduce our architecture, which combines the information of two distinct network branches to make pixel-wise predictions. In the experiment section, we demonstrate state-of-the-art results on PASCAL VOC 2011-2012.

2 Object Boundary FCN (OB-FCN) with Re-labeled Boundaries

One of our main ideas is to use the boundary information as a guide during the training stage of semantic segmentation. Previous works, such as [7,8,2,4], all treat the boundary as background. Therefore, their segmentation results show little relation to the boundary in the ground truth. A subjective comparison also shows that their segmentation results cross the ground-truth boundaries in many regions when the boundaries are overlaid on them, which is a good indication that the boundary can be a crucial part of semantic segmentation. The first stage of our research is to make the boundary prediction as precise as possible. Preprocessing the ground truth is one of the key parts of our work.
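
The relabeling idea, shifting the object regions horizontally and vertically and marking the newly covered pixels as boundary, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the 3-class IDs (0 = background, 1 = object, 2 = boundary), and the treatment of the PASCAL void label 255 as non-object are all hypothetical choices made for illustration. The `width` parameter is likewise illustrative.

```python
import numpy as np

# Hypothetical 3-class IDs; the paper does not specify the numbering.
BACKGROUND, OBJECT, BOUNDARY = 0, 1, 2
VOID = 255  # PASCAL VOC "void" label, assumed here to be non-object


def shift(mask, dy, dx):
    """Shift a 2-D boolean mask by (dy, dx); vacated pixels become False."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = mask[src_y, src_x]
    return out


def relabel_three_class(label, width=4):
    """Build an object/boundary/background map from a VOC-style label map.

    The object area is extended by shifting it up to `width` pixels
    horizontally and vertically; the extension becomes the boundary,
    and the original object region is kept as it is.
    """
    label = np.asarray(label)
    obj = (label != 0) & (label != VOID)
    grown = obj.copy()
    for d in range(1, width + 1):
        for dy, dx in ((d, 0), (-d, 0), (0, d), (0, -d)):
            grown |= shift(obj, dy, dx)
    out = np.full(label.shape, BACKGROUND, dtype=np.uint8)
    out[grown & ~obj] = BOUNDARY  # newly covered pixels
    out[obj] = OBJECT             # original object region is unchanged
    return out
```

With this construction, a boundary between two different objects appears only where the original annotation leaves a gap (for example a void band) between them; touching objects would need a per-class variant.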

By dividing the ground truth into three classes (objects, boundaries, and background), we create our own ground truth, and follow the network structure of FCN to learn the corresponding features and finetune the network.

The ground truth of the class labels for semantic segmentation does contain a labeling of the object boundary; however, it is sometimes confused with background objects. Therefore, we relabel the object boundaries by moving the objects horizontally and vertically so that we extend the object area, and then keep the central object region as it is. In this way, we get a clear edge between objects and background, and between different objects.

Fig. 2. Examples of the re-labeled object boundary for an image in PASCAL VOC. Panels: sample image, original ground truth, relabeled object boundary.

An example is shown in Fig. 2, where the original ground truth assigns some background objects to the same class as the object boundaries. In contrast, our relabeled ground truth keeps the exact object information and accurate boundary information. Since we are working at FCN-4s in the current stage of the OB-FCN network, we set the maximum boundary width to 4 pixels, which matches the pixel-wise accuracy interval at that level. We are currently working on combining the result with OB-FCN-2s, and we expect even better results from it.

Fig. 3. Flow chart of the OB-FCN network structure.

In the FCN work [2], it was observed that accuracy improves only up to the fusion of pool3, while further fusing pool2 or pool1 confuses the segmenter. However, by making the object boundary FCN (OB-FCN) branch learn only 3 classes (object, boundary, background), we are able to reach the detail level of FCN-4s and FCN-2s without confusing the network with small-scale information. The flow chart of the OB-FCN network is shown in Fig. 3, where our final model incorporates information from all pooling layers.

Fig. 4. Step-by-step boundary learning results. Columns: input image, OB-FCN-32s, OB-FCN-16s, OB-FCN-8s, OB-FCN-4s, and labeled boundary.

A step-by-step boundary learning result is shown in Fig. 4, where each successive step shows finer details of the object and its boundaries. Serving as important prior knowledge of object proposals, the boundary branch makes a much more precise semantic segmentation achievable, even with only an FCN-4s-level OB-FCN branch. We will further evaluate the overlap between the object area and the ground truth in future experiments.

3 Object Boundary Guided FCN (OBG-FCN) for semantic segmentation

Now that we have a precise model of the object information, it is important to combine it with the class information derived from the original FCN-8s. As described in [9], a masking method can be adopted by applying the output of one branch to the other branch. We follow this method and combine the object information and class information by using the output of OB-FCN as a mask. We first tried sharing layers from Conv-5 and even down to Conv-3; however, the results were not satisfying. This is mostly because, when looking for boundary information, the shallow layers of FCN and OB-FCN are likely to differ from each other. Therefore, we decided to keep the two pre-trained branches completely separate. The network flow of the system is shown in Fig. 5, where we feed the data to two different branches and design a masking layer to combine their outputs.

Fig. 5. End-to-end two-branch network of OBG-FCN.

Here, we use an element-wise product to apply the mask. However, this operation requires two bottom layers with exactly the same dimensions, so we need to transform the 3-class output of OB-FCN to 21 classes. Therefore, we insert a convolution layer between the output of OB-FCN and the element-wise product, which takes a 3-channel input and outputs a 21-class masking map.

Fig. 6. Demonstration of the convolutional transform layer that maps the 3-class output of OB-FCN to 21 classes.

One crucial issue here is the initialization of the transform layer. Since we want most of the object area to be highlighted and combined with the original FCN, and do not want the background and detected boundaries to be confused with the object, we do not initialize the layer randomly but set its parameters as

\[
\omega(k, m, 1, 1) =
\begin{cases}
1, & \text{if } k = C(\text{background}),\ m \neq C(\text{object}), \\
1, & \text{if } k = C(\text{object}),\ m = C(\text{object}), \\
0, & \text{otherwise},
\end{cases}
\tag{1}
\]

where ω denotes the parameters of the transform convolution layer, k is the first index, corresponding to the output depth, m is the corresponding input channel of the OB-FCN result, and C denotes the class ID of background (0) and objects (1-20).

As a result, we apply the pre-trained object area directly onto the original FCN-8s, and derive a primitive masking layer as shown in the first two columns of Fig. 7. The corresponding combined layer output is shown in the third column, with the segmentation result in the last column. As the result shows, our masking layer does a good job of highlighting the object area, and its segmentation boundary is already more accurate than that of FCN-8s itself.

Fig. 7. Results of initialization of the convolutional transform layer. Columns: 3-class output, 21-class initialization, combined initialization, and initialized result.

4 End-to-End Object Boundary Guided FCN (OBG-FCN) Training

Based on the proposed model, we then conduct end-to-end training to refine the network. Our current results show that enabling back-propagation into both networks significantly disturbs the pre-trained features. In fact, the limitation of the original FCN-8s still exists here: even if we fix the OB-FCN branch, the back-propagated gradients lead to scattered segmentation results, which shows that the FCN looks for overly detailed patterns. Therefore, we currently keep the two branches fixed and finetune the masking layer with a large step size.

Fig. 8. Results of end-to-end training of OBG-FCN, compared with FCN, on the test image of Fig. 4. Columns: FCN-8s layer output, FCN-8s result, OBG-FCN layer output, and OBG result.
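
The transform-layer initialization and the element-wise masking described in Section 3 can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: it assumes a 1×1 transform kernel (so the weights reduce to a 21×3 matrix), an input channel order of (background, object, boundary) for the OB-FCN output, and that the background output channel is fed by every non-object input channel. The function names are hypothetical.

```python
import numpy as np

NUM_CLASSES = 21        # PASCAL VOC: background (0) plus 20 object classes
BG, OBJ, BND = 0, 1, 2  # assumed channel order of the 3-class OB-FCN output


def init_transform_weights():
    """Non-random initial weights of the 3-to-21 transform layer.

    Row k is an output class, column m an OB-FCN input channel.
    Assumes a 1x1 kernel, so the spatial kernel dimensions are dropped.
    """
    w = np.zeros((NUM_CLASSES, 3), dtype=np.float32)
    w[0, BG] = 1.0    # background output <- background input
    w[0, BND] = 1.0   # background output <- boundary input (non-object)
    w[1:, OBJ] = 1.0  # every object class <- object input
    return w


def apply_mask(ob_scores, fcn_scores, w):
    """Map OB-FCN scores (3, H, W) to a 21-channel mask and multiply it
    element-wise with the FCN-8s scores (21, H, W)."""
    mask = np.tensordot(w, ob_scores, axes=([1], [0]))  # shape (21, H, W)
    return mask * fcn_scores
```

With this initialization, a pixel that OB-FCN marks as object keeps the FCN-8s scores of the 20 object classes and suppresses the background score, while background and boundary pixels keep only the background score. Finetuning then adjusts these weights per class.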

The end-to-end training results are shown in Fig. 8, where the masking layer for each class now has a specific weighting that combines the background, object and boundary information, and the segmentation results look even finer. We are currently working on enabling global finetuning of the feature layers for better results.

5 Experiment Results

In this section, we evaluate the proposed OBG-FCN method on the PASCAL VOC dataset, and compare it with the previous state-of-the-art FCN [2] and CRF-RNN [4], using their newest available models. We currently use only the 1112 training images from the PASCAL VOC 2011 segmentation dataset to train our OB-FCN branch and to finetune the OBG-FCN network.

We first evaluate on the PASCAL VOC 2011 dataset. Since the models trained in [2] and [4] both use the training images in PASCAL VOC 2011 and the extra data in [10], there is some overlap between the validation set and the extra data. However, we first present this result as an indication of our improved performance. We will later derive a more solid evaluation on a non-overlapping validation dataset, as well as submit our results to the PASCAL challenge server.

As shown in Table 1, we report four different evaluation metrics on the validation set. The initialized network, without further finetuning, already reaches state-of-the-art performance, and the final result of our proposed OBG-FCN network outperforms the other methods significantly.

Table 1. Comparison of semantic segmentation on the complete PASCAL VOC 2011 dataset.

Method                      pixel accuracy   mean accuracy   mean IU   f.w. IU
FCN-8s                      90.0             81.8            71.6      84.8
CRF-as-RNN                  94.8             88.7            81.6      91.3
OBG-FCN (initialization)    94.9             88.0            81.3      92.4
OBG-FCN                     97.5             90.7            87.4      95.4

We then follow the steps of [4] and derive a reduced subset of the VOC 2012 validation data with 346 images by removing images that overlap with the training set. The results are shown in Table 2: the initialized OBG-FCN already outperforms FCN-8s and CRF-RNN, while the final result further improves accuracy and mean IU.

Table 2. Comparison of semantic segmentation on the reduced PASCAL VOC 2012 validation set.

Method                      pixel accuracy   mean accuracy   mean IU   f.w. IU
FCN-8s                      88.2             76.5            66.5      82.4
CRF-as-RNN                  91.8             81.3            73.6      87.3
OBG-FCN (initialization)    93.8             84.4            77.1      90.9
OBG-FCN                     97.0             88.1            84.2      94.5

In Fig. 1, we present several sets of segmentation results. The first six sets of examples are referred to as general or failure cases in [4], and the rest are examples of typical good-quality results of the previous methods. As the results show, our method achieves finer details of object boundaries even where CRF-RNN already does a good job. As for confusing classes and the occlusion problem, the proposed OBG-FCN significantly improves the class accuracy and object completeness.

References

1. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. (2014)
2. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3431-3440
3. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062 (2014)
4. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1529-1537
5. Dai, J., He, K., Sun, J.: BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. arXiv preprint arXiv:1503.01640 (2015)
6. Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. arXiv preprint arXiv:1511.02674 (2015)
7. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 580-587
8. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1440-1448
9. Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. arXiv preprint arXiv:1512.04412 (2015)
10. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: Computer Vision - ECCV 2014. Springer (2014) 297-312