RSRN: Rich Side-output Residual Network for Medial Axis Detection

Chang Liu, Wei Ke, Jianbin Jiao, and Qixiang Ye
University of Chinese Academy of Sciences, Beijing, China
{liuchang615, kewei11}@mails.ucas.ac.cn, {jiaojb, qxye}@ucas.ac.cn
Corresponding author: Qixiang Ye

Abstract

In this paper, we propose a Rich Side-output Residual Network (RSRN) for medial axis detection, targeting the ICCV 2017 workshop challenge on detecting symmetry in the wild. RSRN exploits the rich features of a fully convolutional network by hierarchically fusing side-outputs in a deep-to-shallow manner to decrease the residual between the detection result and the ground-truth, which refines the detection result hierarchically. Experimental results show that the proposed RSRN improves performance over the baselines on both the SKLARGE and BMAX500 datasets.

1. Introduction

Medial axes are pervasive in visual objects, both natural creatures and man-made objects. Symmetric parts and the connections between them constitute a powerful part-based decomposition of object shapes, providing valuable cues for tasks such as segmentation, foreground extraction, object proposal generation, and text-line detection.

The problem of medial axis detection has received increasing attention in recent years. Liu et al. held symmetry detection competitions at CVPR 2011 [8] and CVPR 2013 [6], promoting symmetry detection in natural images. At ICCV 2017, Liu et al. held a new competition in which medial axis detection is one of the tasks [1].

In this paper, we build a fully convolutional network, named the Rich Side-output Residual Network (RSRN), for medial axis detection. RSRN is motivated by the success of image-to-mask deep learning approaches, i.e., the Side-output Residual Network (SRN) [5] for object symmetry detection and Richer Convolutional Features (RCF) [7] for edge detection. SRN applies a coarse-to-fine strategy to the stage side-outputs of VGG [10], while RCF uses the rich side-outputs of VGG, to refine the classification map hierarchically. Experimental results show that the proposed RSRN achieves 6.08% and 0.67% improvement over the baseline RCF and SRN, respectively, on the SKLARGE validation dataset. With multi-scale combination, we obtain a further 2.35% performance gain.

2. Methodology

To make the paper self-contained, SRN [5] and RCF [7] are first reviewed. The architecture of the Rich Side-output Residual Network (RSRN) is then described.

2.1. Review of SRN and RCF

Figure 1. Illustration of (a) SRN and (b) RCF.

Side-outputs of convolutional stages are deeply supervised in both SRN [5] and RCF [7]. The difference between SRN and RCF lies in how the side-outputs are constructed.

SRN: It takes a coarse-to-fine strategy with Residual Units (RUs), Fig. 1(a). For each convolutional stage, a side convolutional layer is added on the last layer of the stage. Hierarchical fusing is then used to reduce the residual between the input of each RU and the ground-truth. The final output is the output of the last stacked RU.

RCF: It uses all the convolutional layers in each stage, since the receptive fields of the layers within a stage differ only trivially. Instead of adding a side convolutional layer directly, RCF first performs an element-wise summation of the convolutional layers in one stage and then computes the side-outputs. The final output is the weighted average of the side-outputs, as shown in Fig. 1(b).
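
To make the contrast concrete, the following is a minimal PyTorch sketch of the two side-output constructions on a single VGG stage (our illustration, not the authors' code; the tensor shapes are placeholders, and the 21-channel reduction follows the design described in the RCF paper):

```python
import torch
import torch.nn as nn

# One VGG stage with three conv feature maps of 512 channels each
# (the shapes are illustrative; early VGG stages have fewer channels).
c = [torch.randn(1, 512, 28, 28) for _ in range(3)]

# SRN-style stage side-output: a 1x1 side conv on the LAST layer of the stage.
stage_classifier = nn.Conv2d(512, 1, kernel_size=1)
s_stage = stage_classifier(c[-1])

# RCF-style rich side-output: reduce every conv layer of the stage, sum the
# reduced maps element-wise, then classify the fused map with a 1x1 conv.
reducers = nn.ModuleList(nn.Conv2d(512, 21, kernel_size=1) for _ in range(3))
fuse_classifier = nn.Conv2d(21, 1, kernel_size=1)
s_rich = fuse_classifier(sum(r(ci) for r, ci in zip(reducers, c)))
```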

2.2. Rich Side-output Residual Network

To take advantage of both SRN and RCF, there are two ways to integrate them. The first integrates the Residual Unit into the side-outputs of RCF. The second uses the rich side-outputs of every convolutional layer to construct the Rich Side-output Residual Network (RSRN). Lastly, a multi-scale combination strategy is applied.

2.2.1 Base architecture

In the base architecture, Residual Units are stacked on the side-outputs of RCF, Fig. 2. We denote the side-output as $s_i$ and the output of the RU as $r_i$. $r_i$ is computed as

$r_i = w_i^r (r_{i-1} +_c s_i)$,   (1)

where $w_i^r$ is a $1 \times 1$ convolutional kernel implementing the weighted-sum function and $+_c$ denotes the concatenation operation. The deeply supervised output $d_i$ (dsnout) is computed as

$d_i = \mathrm{Sigmoid}(r_i)$.   (2)

In this case, pixel-wise feature combination is performed before the side-output. An element-wise sum layer combines the convolutional layers of a stage, as

$s_i = w_i \sum_j s_{i,j}$,   (3)

where $w_i$ is a $1 \times 1$ convolutional kernel used as a pixel classifier, and $s_{i,j}$ is the pixel-wise feature map of the $j$-th convolutional layer of the $i$-th stage in VGG [10].

Figure 2. The base architecture integrating RCF and SRN.

2.2.2 Architecture of RSRN

A rich side-output contains more information than a stage side-output, as the receptive field size differs for each convolutional layer. This motivates us to fuse the rich side-outputs with RUs to build the Rich Side-output Residual Network (RSRN), Fig. 3. In RSRN, we use the convolutional response maps of VGG as pixel-based hyper-column features. Each layer has an output, as

$s_{i,j} = w_{i,j} c_{i,j}$,   (4)

where $w_{i,j}$ is a $1 \times 1$ convolutional kernel used as a pixel classifier and $c_{i,j}$ is the response map of the $j$-th convolutional layer of the $i$-th stage in VGG. All 12 RUs are stacked in RSRN and fused together to produce the final output.

Figure 3. The RSRN architecture.
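
A minimal PyTorch sketch of Eqs. (1), (2), and (4) is given below. The module names are ours, and the resolution alignment between stages (upsampling and cropping) that a real implementation needs is omitted, so this illustrates only the fusion logic:

```python
import torch
import torch.nn as nn

class RichSideOutput(nn.Module):
    """Eq. (4): a 1x1 conv w_ij acts as a pixel classifier on the response
    map c_ij of the j-th conv layer in the i-th VGG stage."""
    def __init__(self, in_channels):
        super().__init__()
        self.w = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, c_ij):
        return self.w(c_ij)  # s_ij

class ResidualUnit(nn.Module):
    """Eq. (1): r_i = w_i^r (r_{i-1} +_c s_i); the 1x1 conv w_i^r implements
    the weighted sum over the two concatenated maps."""
    def __init__(self):
        super().__init__()
        self.w_r = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, r_prev, s_i):
        r_i = self.w_r(torch.cat([r_prev, s_i], dim=1))
        d_i = torch.sigmoid(r_i)  # Eq. (2): deeply supervised output
        return r_i, d_i

def fuse_deep_to_shallow(side_outputs):
    """Chain RUs over the rich side-outputs, deepest first (12 RUs for the
    13 side-outputs of VGG-16); all maps are assumed to share one resolution."""
    r = side_outputs[0]
    for s_i in side_outputs[1:]:
        r, _ = ResidualUnit()(r, s_i)
    return torch.sigmoid(r)
```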

2.2.3 Multi-scale combination

Multi-scale combination has been used successfully in segmentation [3] and edge detection [7], which motivates us to utilize it for medial axis detection, Fig. 4. Specifically, we resize the input image to three scales to obtain images at different resolutions. For each scale, RSRN is used to detect the medial axis, and all the outputs are averaged after rescaling. As illustrated in Fig. 4, the medial axis of the body is well extracted at high resolution, while the medial axes of the arms are well extracted at low resolution.

Figure 4. Multi-scale combination.
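
A minimal sketch of this combination (ours; the paper specifies three scales but not the scale factors, so 0.5, 1.0, and 1.5 below are assumptions):

```python
import torch
import torch.nn.functional as F

def multi_scale_detect(model, image, scales=(0.5, 1.0, 1.5)):
    """image: a (1, 3, H, W) tensor; model returns a (1, 1, h, w) axis map."""
    h, w = image.shape[2:]
    outputs = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode='bilinear',
                                align_corners=False)
        out = model(resized)                           # fixed-scale detection
        out = F.interpolate(out, size=(h, w), mode='bilinear',
                            align_corners=False)       # rescaling
        outputs.append(out)
    return torch.stack(outputs).mean(dim=0)            # combination
```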
3. Experimental results

In this section, we discuss the implementation of RSRN in detail and compare RSRN with other approaches for medial axis detection. All experiments are performed on an NVIDIA Tesla K80.

3.1. Implementation

Dataset: SKLARGE [9] and BMAX500 [11] are used to evaluate the proposed RSRN. SKLARGE contains 746 training images, 245 validation images, and 500 testing images. It focuses on the medial axes of foreground objects; the medial axis is the skeleton of the instance segmentation mask of an object. The dataset is challenging due to its variety of object categories and large scale variance. BMAX500 is derived from the BSDS500 dataset [2], with 200 images for training, 100 for validation, and 200 for testing. Each image in BSDS500 is annotated by several volunteers (usually 5-7 per image). Similar to SKLARGE, the medial axis ground-truth is obtained by extracting segment skeletons from all the annotated segmentation masks of an image. In BMAX500, medial axes are extracted for both foreground and background.

Protocol: The precision-recall metric with maximum F-measure is used to evaluate medial axis detection performance [12], as

$F = \frac{2PR}{P + R}$.   (5)

Given a threshold, the prediction is first converted to a binary map, from which the F-measure is computed. By adjusting the threshold, the maximum F-measure is obtained.

Data augmentation: In [5], Ke et al. discuss data augmentation for medial axis training in deep learning approaches. We follow the augmentation manner of [5], rotating each image in steps of 90 degrees and flipping it along different axes. Multi-scale augmentation is not used, as the one-pixel-wide ground-truth suffers from resizing.

Model parameters: Following the settings of SRN [5], we train RSRN by fine-tuning the pre-trained 16-layer VGG net [10]. The hyper-parameters of RSRN are: mini-batch size (1), learning rate (1e-6 for SKLARGE and 1e-8 for BMAX500), loss weight for each RU output (1.0), momentum (0.9), initialization of the nested filters (0), weight decay (0.002), and maximum number of training iterations (20,000, about 2 epochs for SKLARGE and 6 epochs for BMAX500). In the testing phase, a standard non-maximal suppression algorithm [4] is applied to the output map to obtain thinned edges and symmetry axes.
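
For concreteness, the maximum-F-measure protocol of Eq. (5) can be sketched as follows (our simplification; standard benchmark implementations additionally match predicted and ground-truth pixels within a small distance tolerance, which is omitted here):

```python
import numpy as np

def max_f_measure(pred, gt, num_thresholds=99):
    """pred: float map in [0, 1]; gt: binary medial-axis ground-truth map."""
    best = 0.0
    for t in np.linspace(0.01, 0.99, num_thresholds):
        binary = pred >= t                 # threshold to a binary map
        tp = np.logical_and(binary, gt).sum()
        precision = tp / max(binary.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```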

3.2. Performance on SKLARGE

We compare medial axis detection performance on the SKLARGE validation dataset with HED [13], RCF [7], and SRN [5]. The results are obtained by running the source code of each method on a GPU platform. HED is the foundation of all the other methods, which are constructed by adding layers to HED. The F-measures and runtimes are shown in Table 1, and the precision-recall scatter diagram is shown in Fig. 5.

methods           F-measure (%)   Runtime (s)
HED [13]          58.55           0.050
RCF [7]           62.44           0.055
SRN [5]           67.85           0.052
RCFSRN (ours)     65.09           0.059
RSRN (ours)       68.52           0.067
HED_M             63.32           0.202
RCF_M             66.45           0.233
SRN_M             70.01           0.213
RCFSRN_M (ours)   67.62           0.283
RSRN_M (ours)     70.87           0.297

Table 1. F-measure on the SKLARGE validation dataset.

Figure 5. The scatter diagram of precision-recall with the best F-measure: RSRN_M [F=.709], SRN_M [F=.700], RSRN [F=.685], SRN [F=.678], RCFSRN_M [F=.676], RCF_M [F=.664], RCFSRN [F=.651], HED_M [F=.633], RCF [F=.624], HED [F=.585].

HED achieves an F-measure of 58.55%. The base architecture (RCFSRN) is slightly better than RCF but not as good as SRN. The initialization of RCF, i.e., the deepest side-output, is a rough combination of the three convolutional layers with different receptive fields in stage 5 of VGG. In RSRN, hierarchical fusing of stage 5 is used to produce the initial result. Our RSRN takes the structure of SRN with rich side-outputs and achieves the best performance of 68.52%, improving over the baseline RCF and SRN by 6.08% and 0.67%, respectively. This demonstrates that all the convolutional layers contain helpful information, not only the last one in each stage. HED runs fastest at 0.050s per image, and all the other methods are slightly slower.

From Table 1, pairwise comparison shows that multi-scale combination is an effective way to obtain better medial axis detection results, at the cost of computational efficiency. RSRN_M achieves a 2.35% performance gain over single-scale RSRN, and its runtime is almost four times that of the single-scale network. Medial axis detection results are shown in Fig. 6 for qualitative comparison, which shows that, without RUs, HED and RCF produce significant noise.

Figure 6. Illustration of medial axis detection results on the SKLARGE validation dataset. First to last rows are the image, the ground-truth, and the results of HED [13], RCF [7], SRN [5], the base architecture, RSRN, and RSRN_M, respectively.

3.3. Performance on BMAX500

The medial axis detection results on the BMAX500 validation dataset are shown in Table 2. We show the results of RSRN and multi-scale RSRN, as RSRN achieves the best performance on SKLARGE. From Table 2, RSRN achieves an F-measure of 57.48% and multi-scale RSRN an F-measure of 49.66%. On BMAX500, multi-scale combination is not competitive, because BMAX500 considers the medial axes of both foreground and background in complicated images. Fig. 7 shows some detection results.

methods   F-measure (%)   Runtime (s)
SRN       51.95           0.100
SRN_M     48.99           0.397
RSRN      57.48           0.113
RSRN_M    49.66           0.370

Table 2. F-measure on the BMAX500 validation dataset.

Figure 7. Illustration of medial axis detection results on the BMAX500 validation dataset. First to last rows are the image, the ground-truth, and the results of SRN [5] and the proposed RSRN, respectively.

4. Conclusion

In this paper, we propose the RSRN approach for medial axis detection. On one hand, we use the rich side-outputs of VGG rather than stage side-outputs, as they are more informative. On the other hand, we take advantage of the output residual, reducing the residual between the rich side-outputs and the ground-truth, which refines the detection result hierarchically. Experimental results show that RSRN significantly outperforms the baselines on both the SKLARGE and BMAX500 datasets.

Acknowledgement

This work is partially supported by NSFC under Grant 61671427 and Grant 61771447.

References

[1] Detecting symmetry in the wild. https://sites.google.com/view/symcomp17/challenges/2dsymmetry.
[2] P. Arbelaez, M. Maire, C. C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[3] P. A. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014.
[4] P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8):1558–1570, 2015.
[5] W. Ke, J. Chen, J. Jiao, G. Zhao, and Q. Ye. SRN: Side-output residual network for object symmetry detection in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1068–1076, 2017.
[6] J. Liu, G. Slota, G. Zheng, Z. Wu, M. Park, S. Lee, I. Rauschert, and Y. Liu. Symmetry detection from real-world images competition 2013: Summary and results. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 200–205, 2013.
[7] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai. Richer convolutional features for edge detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3000–3009, 2017.
[8] I. Rauschert, K. Brocklehurst, S. Kashyap, J. Liu, and Y. Liu. First symmetry detection competition: Summary and results. The Pennsylvania State University, PA, Tech. Rep. CSE11-012, 2011.
[9] W. Shen, K. Zhao, Y. Jiang, Y. Wang, X. Bai, and A. L. Yuille. DeepSkeleton: Learning multi-task scale-associated deep side outputs for object skeleton extraction in natural images. arXiv:1609.03659, 2016.
[10] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[11] S. Tsogkas and S. J. Dickinson. AMAT: Medial axis transform for natural images. arXiv:1703.08628, 2017.
[12] S. Tsogkas and I. Kokkinos. Learning-based symmetry detection in natural images. In European Conference on Computer Vision, pages 41–54, 2012.
[13] S. Xie and Z. Tu. Holistically-nested edge detection. In International Conference on Computer Vision, pages 1395–1403, 2015.