
Object Boundary Guided Semantic Segmentation

Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo

University of Southern California

arXiv:1603.09742v1 [cs.CV] 31 Mar 2016

Abstract. Semantic segmentation has been a major topic in computer vision and plays an important role in understanding object classes as well as object localization. Recent developments in deep learning, especially fully convolutional networks (FCN), have enabled pixel-level labeling with more accurate results. However, most previous works, including FCN, do not take the object boundary into consideration. In fact, since the original ground truth labeling does not provide a clean object boundary, the labeled contours and the background objects have both been ignored as the background class. In this work, we propose an elegant object boundary guided FCN (OBG-FCN), which uses prior knowledge of the object boundary from training to achieve better class accuracy and segmentation details. To this end, we first relabel the object contours and use an FCN to learn specifically where the objects and contours are. We then transform the output of this branch to 21 classes and use it as a mask to refine the detailed shapes of the objects. End-to-end learning is then applied to finetune the transform parameters, which reconsider the combination of object, background and boundary in the final segmentation decision. We apply the proposed method to the PASCAL VOC segmentation benchmark and achieve 87.4% mean IU (15% relative improvement compared to FCN and around 10% improvement compared to CRF-RNN), and our edge model is shown to be stable and accurate even at the accuracy level of FCN-2s.

1 Introduction

Recently, semantic segmentation has played an important role in understanding object classes as well as object localization[1]. The introduction of fully convolutional networks[2] has brought a large improvement in image semantic segmentation. A number of recent approaches, including DeepLab[3] and CRF-RNN[4], have also achieved good segmentation performance. However, while they use feature representations to make pixel-wise classifications, it is of great importance to take another factor into consideration, namely the object boundary, to make the labeling more accurate and natural[5].

Fig. 1. Examples of segmentation results. Columns, left to right: input image, FCN-8s, CRF-RNN, the proposed OBG-FCN, and ground truth.

It is a significant challenge to adapt ConvNets to the pixel-wise classification task. Firstly, the convolution filters and the max-pooling operations of traditional CNNs make the object boundary prediction quite coarse. Furthermore, although recent algorithms such as FCN[2] make use of the intermediate convolutional layers to finetune the pixel-wise prediction, they cannot make a good prediction on the object boundaries[6]. CRF-RNN[4] formulates mean-field approximate inference, which partially improves the prediction on object boundaries, but a number of problems remain, for example, mixing nearby objects together and misclassifying the objects on the boundaries. In most such cases, the lack of boundary constraints prevents a good prediction on the boundaries.

To deal with this problem, we introduce an object boundary guided FCN network (OBG-FCN), which uses pre-trained prior knowledge of object boundaries to enhance the performance of semantic segmentation. In this work, we first relabel the object contours based on the PASCAL annotation. Although the ground truth provided by the PASCAL annotation already offers contour information, it also includes some background objects as boundaries, which would mislead our training. Therefore, we relabel the object boundary by shifting the positions of the object regions and derive an accurate ground truth with object proposals and boundaries. Then, we follow the FCN network structure and conduct step-by-step learning from FCN-32s to FCN-2s to train our 3-class OB-FCN segmenter (Object-Boundary-Background). We then use the output of the OB-FCN branch as a mask layer, which is transformed from a 3-class output to a 21-class output, and conduct an element-wise multiplication with the original FCN-8s network. End-to-end training is then performed on the object boundary guided FCN (OBG-FCN) to finetune the network.
The results show a great improvement over the previous state of the art in the mean IU on the PASCAL VOC benchmark, with higher class accuracy and finer object details. The following sections explain our implementation details and introduce our architecture, which combines the information of two distinct network branches to make pixel-wise predictions. In the experiment section, we demonstrate state-of-the-art results on PASCAL VOC 2011-2012.

2 Object Boundary FCN (OB-FCN) with Re-labeled Boundaries

One of the major ideas is to utilize the boundary information as a guideline in the training stage of semantic segmentation. Previous works, such as [7,8,2,4], all treat the boundary as background. Therefore, their segmentation results show little relation to the boundary in the ground truth. Subjective comparison of their segmentation results also demonstrates that there are many crossing regions when the boundaries are overlaid on them, which is a good indication that the boundary can be a crucial part of semantic segmentation. The first stage of our research is to make the boundary prediction as precise as possible. Preprocessing the ground truth is one of the key parts of our work.
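The shift-based relabeling can be illustrated with a minimal NumPy sketch. It is not the authors' code: the 3-class IDs, the function names `shift` and `relabel`, and the decision to treat every pixel not reached by a shifted object (including the PASCAL void label 255) as background are all assumptions made for illustration. The default width of 4 pixels mirrors the maximum boundary width the paper uses at the FCN-4s stage.

```python
import numpy as np

# Hypothetical 3-class IDs; the paper does not specify the numbering.
BACKGROUND, OBJECT, BOUNDARY = 0, 1, 2


def shift(mask, dy, dx):
    """Shift a boolean mask by (dy, dx) pixels, filling vacated cells with False."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = mask[src_y, src_x]
    return out


def relabel(labels, width=4):
    """Turn a PASCAL-style label map into object / boundary / background.

    Pixels labeled 1..20 are objects. The object mask is shifted horizontally
    and vertically by up to `width` pixels; newly covered pixels become the
    boundary, and the original object region is kept unchanged.
    """
    obj = (labels >= 1) & (labels <= 20)
    grown = obj.copy()
    for d in range(1, width + 1):
        for dy, dx in ((d, 0), (-d, 0), (0, d), (0, -d)):
            grown |= shift(obj, dy, dx)
    out = np.full(labels.shape, BACKGROUND, dtype=np.uint8)
    out[grown] = BOUNDARY   # extended object area
    out[obj] = OBJECT       # central object region stays as it is
    return out
```

Because only axis-aligned shifts are used, the boundary band in this sketch is cross-shaped around isolated pixels; a diagonal shift or a morphological dilation would give a rounder band, and the text does not say which variant the authors used.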

By dividing the ground truth into three classes (objects, boundaries, and background), we create our own ground truth and follow the network structure of FCN to learn the corresponding features and finetune the network. The class-label ground truth for semantic segmentation does include a labeling of the object boundary; however, it is sometimes confused with the background objects. Therefore, we relabel the object boundaries by moving the objects horizontally and vertically so that the object area is extended, while keeping the central object region unchanged. In this way, we obtain a clear edge between objects and background and between different objects.

Fig. 2. Examples of re-labeled object boundary for an image in PASCAL VOC. Columns, left to right: sample image, original ground truth, relabeled object boundary.

An example is shown in Fig. 2, where the original ground truth has some background objects included in the same class as the object boundaries. In contrast, our relabeled ground truth keeps the exact object information and accurate boundary information. Since we are working at FCN-4s in the current stage of the OB-FCN network, we set the maximum boundary width to 4 pixels, which is the pixel-wise accuracy interval. Currently we are working on combining the result with OB-FCN-2s, and we expect even better results from it.

Fig. 3. Flow chart of OB-FCN network structure.

Previously, in the work on FCN, it was observed that the accuracy level can only be improved up to the combination with pool3, while further combining with pool2 or pool1 confuses the segmenter. However, by making the object boundary FCN (OB-FCN) branch learn only 3 classes (object, boundary, background), we are able to achieve the detail level of FCN-4s and FCN-2s without confusing the network with small-scale information. The flow chart of the OB-FCN network is shown in Fig. 3, where our final model incorporates all pooling information.

Fig. 4. Step-by-step boundary learning results. Columns, left to right: input image, OB-FCN-32s, OB-FCN-16s, OB-FCN-8s, OB-FCN-4s, and labeled boundary.

A step-by-step boundary learning result is shown in Fig. 4, where the evolution across steps shows finer details of the object and its boundaries. Serving as important prior knowledge of object proposals, our work shows that a much more precise semantic segmentation can be achieved even with the help of an FCN-4s OB-FCN branch. We will further evaluate the object matching area against the ground truth in future experiments.

3 Object Boundary Guided FCN (OBG-FCN) for Semantic Segmentation

Now that we have a precise model of the object information, it is important to combine it with the class information derived from the original FCN-8s. As mentioned in [9], a masking method can be adopted by applying the output of one branch to the other branch. We followed this method and tried to combine the object information and the class information by using the output of OB-FCN as a mask. We first tried sharing layers from Conv-5 and even down to Conv-3; however, the results were not satisfactory. This is most likely because, when looking for boundary information, the shallow layers of FCN and OB-FCN tend to differ from each other. Therefore, we decided to keep the two pre-trained branches completely separate. The system network flow is shown in Fig. 5, where we feed the data to two different branches and design a masking layer to combine the outputs.

Fig. 5. End-to-end two-branch network of OBG-FCN.

Here, we use an element-wise product to apply the masking. However, this operation requires two bottom layers with exactly the same dimensions, so we need to transform the 3-class output of OB-FCN to 21 classes. Therefore, we apply a convolution layer between the element-wise product and the output of OB-FCN, which takes a 3-class input and outputs a 21-class masking map.

Fig. 6. Demonstration of the convolutional transform layer to map 3-class output of OB-FCN to 21 classes.

One crucial issue here is the initialization of the transform layer. Since we want most of the object area to be highlighted and combined with the original FCN, and do not want the background and the detected boundaries to be confused with the object, we do not initialize the layer randomly but set the parameters as

\[
\omega(k, m, 1, 1) =
\begin{cases}
1, & \text{if } k = C(\text{background}),\ m \neq C(\text{object}),\\
1, & \text{if } k = C(\text{object}),\ m = C(\text{object}),\\
0, & \text{otherwise},
\end{cases}
\tag{1}
\]

where ω denotes the parameters of the transform convolution layer, k is the first index, corresponding to the output depth, m is the corresponding input channel of the OB-FCN result, and C denotes the class ID of the background (0) and the objects (1-20).

As a result, we apply the pre-trained object area directly onto the original FCN-8s and derive a primitive masking layer, as shown in the first two columns of Fig. 7. The corresponding combined layer output is shown in the third column, with the segmentation result in the last column. As shown in the result, our masking layer does a good job of highlighting the object area, whose segmentation boundary is already more accurate than that of FCN-8s itself.

Fig. 7. Results of initialization of convolutional transform layer. Columns, left to right: 3-class output, 21-class initialization, combined initialization, initialized result.

4 End-to-End Object Boundary Guided FCN (OBG-FCN) Training

Based on the proposed model, we then conduct end-to-end training to refine the network. Our current results show that enabling back-propagation into both networks significantly influences the pretrained features. In fact, the constraint of the original FCN-8s still exists here: even if we fix the OB-FCN branch, the back-propagated gradients result in scattered segmentation results, which shows that the FCN looks for too many detailed patterns. Therefore, we currently fix the learning rates of the two branches and finetune the masking layer with a large step size.

Fig. 8. Results of end-to-end training of OBG-FCN, compared with FCN, on the test image of Fig. 4. Columns, left to right: FCN-8s layer output, FCN-8s result, OBG-FCN layer output, OBG result.
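The masking path of Sections 3 and 4 can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' implementation: the channel order of the OB-FCN output, the names `init_transform` and `apply_mask`, and the interpretation of the first case of Eq. (1) as "input channel is not the object channel" (so that background and boundary both feed the background output) are assumptions. The 1x1 convolution is written as a matrix product over channels, so the trailing (1, 1) kernel dimensions of ω are dropped.

```python
import numpy as np

NUM_CLASSES = 21  # background (0) plus 20 object classes
# Assumed channel order of the 3-class OB-FCN output.
OB_BACKGROUND, OB_OBJECT, OB_BOUNDARY = 0, 1, 2


def init_transform():
    """Return w[k, m], the 1x1 transform weights initialized as in Eq. (1)."""
    w = np.zeros((NUM_CLASSES, 3), dtype=np.float32)
    # Output channel 0 (background) collects the non-object inputs.
    w[0, OB_BACKGROUND] = 1.0
    w[0, OB_BOUNDARY] = 1.0
    # Output channels 1..20 (objects) copy the object input.
    w[1:, OB_OBJECT] = 1.0
    return w


def apply_mask(fcn_scores, ob_scores, w):
    """Mask the 21-class FCN output with the transformed OB-FCN output.

    fcn_scores: array of shape (21, H, W) from the FCN-8s branch.
    ob_scores:  array of shape (3, H, W) from the OB-FCN branch.
    """
    # A 1x1 convolution is a per-pixel linear map over channels.
    mask = np.einsum("km,mhw->khw", w, ob_scores)
    # Element-wise product needs both inputs to be (21, H, W).
    return fcn_scores * mask
```

In the training setup described in Section 4, only `w` would be updated while both branches stay fixed; the sketch covers the forward pass only and leaves out the gradient step.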

The end-to-end training results are shown in Fig. 8, where the masking layer for each class now has a specific weighting that combines the background, object and boundary information, and the segmentation results now look even finer. Currently we are working on enabling global finetuning of the feature layers for better results.

5 Experiment Results

In this section, we evaluate the proposed OBG-FCN method on the PASCAL VOC dataset and compare it with the previous state-of-the-art FCN[2] and CRF-RNN[4], using their newest available models. We currently use only the 1112 training images from the PASCAL VOC 2011 segmentation dataset to train our OB-FCN branch and to finetune the OBG-FCN network.

We first evaluate on the PASCAL VOC 2011 dataset. Since the models trained in [2] and [4] both use the training images in PASCAL VOC 2011 and the extra data in [10], there is some overlap between the validation set and the extra data. However, we first present this result as an indication of our improved performance. We will later derive a more solid evaluation on a non-overlapping validation dataset, as well as submit the results to the PASCAL challenge server.

In Table 1, we present four different evaluation metrics on the validation set. The initialized network without further finetuning already reaches state-of-the-art performance, and the final result of our proposed OBG-FCN network outperforms the other methods significantly.

Table 1. Comparison of semantic segmentation on complete PASCAL VOC 2011 dataset.

Method                     pixel accuracy   mean accuracy   mean IU   f.w. IU
FCN-8s                     90.0             81.8            71.6      84.8
CRF-as-RNN                 94.8             88.7            81.6      91.3
OBG-FCN (initialization)   94.9             88.0            81.3      92.4
OBG-FCN                    97.5             90.7            87.4      95.4

We then follow the steps of [4] and derive a reduced subset of the VOC 2012 validation data with 346 images by removing the images that overlap with the training set. The results are shown in Table 2: the initialized OBG-FCN already outperforms FCN-8s and CRF-RNN, while the final result further improves the accuracy and the mean IU.

Table 2. Comparison of semantic segmentation on reduced PASCAL VOC 2012 validation set.

Method                     pixel accuracy   mean accuracy   mean IU   f.w. IU
FCN-8s                     88.2             76.5            66.5      82.4
CRF-as-RNN                 91.8             81.3            73.6      87.3
OBG-FCN (initialization)   93.8             84.4            77.1      90.9
OBG-FCN                    97.0             88.1            84.2      94.5

In Fig. 1, we present several sets of segmentation results. The first six sets of examples are referred to as general or failure cases according to [4], and the rest are examples of typical good-quality results of the previous methods. As shown in the results, our method manages to achieve finer details of object boundaries even where CRF-RNN already does a good job. As for confusing classes, as well as the occlusion problem, the proposed OBG-FCN can significantly improve the class accuracy and object completeness.

References

1. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. (2014)
2. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3431-3440
3. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062 (2014)
4. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1529-1537
5. Dai, J., He, K., Sun, J.: BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. arXiv preprint arXiv:1503.01640 (2015)
6. Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. arXiv preprint arXiv:1511.02674 (2015)
7. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 580-587
8. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1440-1448
9. Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. arXiv preprint arXiv:1512.04412 (2015)
10. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: Computer Vision - ECCV 2014. Springer (2014) 297-312
