Exploring Context with Deep Structured models for Semantic Segmentation

Size: px

Start display at page:

Download "Exploring Context with Deep Structured models for Semantic Segmentation"

Emery Phillips
5 years ago
Views:

1 1 Exploring Context with Deep Structure moels for Semantic Segmentation Guosheng Lin, Chunhua Shen, Anton van en Hengel, Ian Rei between an image patch an a large backgroun image region. Explicitly moeling the patch-patch contextual relations has not been well stuie in recent CNN-base segmentation methos. In this work, we propose to explicitly moel the contextual relations using conitional ranom fiels (CRFs). We formulate CNN-base pairwise potential functions to capture semantic correlations between neighboring patches. Some recent methos combine CNNs an CRFs for semantic segmentation, e.g., the ense CRFs applie in [3], [5], [6], [7]. The purpose of applying the ense CRFs in these methos is to refine the upsample lowresolution preiction to sharpen object/region bounaries. These methos consier Potts-moel-base pairwise potentials for enforcing local smoothness. There the pairwise potentials are conventional log-linear functions. In contrast, here we learn more general pairwise potentials using CNNs to moel the semantic compatibility between image regions. Our CNN pairwise potentials aim to improve the coarse-level preiction rather than merely encouraging local smoothness, an thus have a ifferent purpose compare to Potts-moel-base pairwise potentials. Given that these two types of potentials make ifferent effects, they can be combine to improve segmentation results. Fig. 1 illustrates the preiction process of our metho. In contrast to patch-patch context, patch-backgroun context is wiely explore in the literature. For CNN-base methos, backgroun information can be effectively capture by combining features from a multi-scale image network input, an has shown goo performance in some recent segmentation methos [1], [8]. A special case of capturing patch-backgroun context is consiering the whole image as the backgroun region an incorporating the imagelevel label information into learning. In our approach, to encoe rich backgroun information, we construct multiarxiv: v2 [cs.cv] 26 Mar 2016 Abstract State-of-the-art semantic image segmentation methos are mostly base on training eep convolutional neural networks (CNNs). In this work, we proffer to improve semantic segmentation with the use of contextual information. In particular, we explore patch-patch context an patch-backgroun context in eep CNNs. We formulate eep structure moels by combining CNNs an Conitional Ranom Fiels (CRFs) for learning the patch-patch context between image regions. Specifically, we formulate CNN-base pairwise potential functions to capture semantic correlations between neighboring patches. Efficient piecewise training of the propose eep structure moel is then applie in orer to avoi repeate expensive CRF inference uring the course of back propagation. For capturing the patch-backgroun context, we show that a network esign with traitional multi-scale image inputs an sliing pyrami pooling is very effective for improving performance. We perform comprehensive evaluation of the propose metho. We achieve new state-of-the-art performance on a number of challenging semantic segmentation atasets incluing NYUDv2, PASCAL-VOC2012, Cityscapes, PASCAL-Context, SUN-RGBD, SIFT-flow, an KITTI atasets. Particularly, we report an intersection-over-union score of 77.8 on the PASCAL-VOC2012 ataset. Inex Terms Semantic Segmentation, Convolutional Neural Networks, Conitional Ranom Fiels, Contextual Moels 1 INTRODUCTION Semantic image segmentation aims to preict a category label for every image pixel, which is an important yet challenging task for image unerstaning. Recent approaches have applie convolutional neural network (CNNs) [1], [2], [3] to this pixel-level labeling task an achieve remarkable success. Among these CNN-base methos, fully convolutional neural networks (FCNNs) [2], [3] have become a popular choice, because of their computational efficiency for ense preiction an en-to-en style learning. Contextual relationships are ubiquitous an provie important cues for scene unerstaning tasks. Spatial context can be formulate in terms of semantic compatibility relations between one object an its neighboring objects or image patches (stuff), in which a compatibility relation is an inication of the co-occurrence of visual patterns. For example, a car is likely to appear over a roa, an a glass is likely to appear over a table. Context can also encoe incompatibility relations. For example, a boat is unlikely to appear on a roa. These relations also exist at finer scales, for example, in object part-to-part relations, an part-toobject relations. In some cases, contextual information is the most important cue, particularly when a single object shows significant visual ambiguities. A more etaile iscussion of the value of spatial context can be foun in [4]. We explore two types of spatial context to improve the segmentation performance: patch-patch context an patchbackgroun context. The patch-patch context is the semantic relation between the visual patterns of two image patches. Likewise, patch-backgroun context the semantic relation Authors are with School of Computer Science, The University of Aelaie, Australia; an ARC Centre of Excellence for Robotic Vision. {guosheng.lin, chunhua.shen, anton.vanenhengel, ian.rei}@aelaie.eu.au

2 2 Unary potential net: Multi-scale CNN... Pairwise potential net Multi-scale CNN... Coarse-level preiction stage: Preiction refinement stage: CRF Inference Up-sample & Refine Deep structure moel: Contextual eep CRF Low-resolution preiction Final preiction Fig. 1 An illustration of the preiction process of our metho. Both our unary an pairwise potentials are formulate as multi-scale CNNs for capturing semantic relations between image regions. Our metho outputs low-resolution preiction after performing CRF inference, then the preiction is up-sample an refine in a stanar post-processing stage to output the final preiction. scale networks an apply sliing pyrami pooling on feature maps. The traitional pyrami pooling (in a sliing manner) on the feature map is able to capture information from backgroun regions of ifferent sizes. Incorporating general pairwise potentials usually involves computationally expensive inference, which brings challenges for CRF learning. To facilitate efficient learning we apply piecewise training of the CRF [9] to avoi repeate inference uring back propagation training of the eep moel. Thus our main contributions are as follows: 1. We formulate CNN-base general pairwise potential functions in CRFs to explicitly moel patch-patch semantic relations. 2. Deep CNN-base general pairwise potentials are challenging for efficient CNN-CRF joint learning. We perform approximate training, using piecewise training of CRFs [9], to avoi the repeate inference at every stochastic graient escent iteration an thus achieve efficient learning. 3. We explore backgroun context by applying a network architecture with traitional multi-scale image input [1] an sliing pyrami pooling [10]. We empirically emonstrate the effectiveness of this network architecture for semantic segmentation. 4. We set new state-of-the-art performance on a number of challenging semantic segmentation atasets, incluing NYUDv2, PASCAL VOC 2012, PASCAL-Context, SIFT-flow, SUN-RGBD, Cityscapes ataset an so on. In particular, we achieve an intersection-over-union score of 77.8 on the PASCAL VOC 2012 ataset. 1.1 Relate work Preliminary results of our work appeare in [11]. Exploiting contextual information has been wiely stuie in the literature (e.g., [4], [12], [13]). For example, early work of TAS [4] moels ifferent types of spatial context between Things an Stuff using a generative graphical moel. The most successful recent methos for semantic image segmentation are base on CNNs. A number of these CNNbase methos are region-proposal-base methos [14], [15], which first generate region proposals an then assign category labels to each of them. Very recently, FCNNs [2], [3], [7] have become a popular choice for their efficient feature generation an en-to-en training. FCNNs have also been applie to a range of other ense-preiction tasks recently, such as image restoration [16], image superresolution [17] an epth estimation [18], [19], [20]. The metho that we propose here is also built upon fully convolution-style networks. FCNN methos make use of the Image-Net traine CNN moels (e.g., the VGG-16 moel [21]) which takes avantages of the large Image-Net ataset for learning eep moels. For convolution an pooling layers, the resolution of the output feature map is own-sample if the convolution/pooling strie is greater than 1. Usually a few such layers use a strie setting of 2, hence the irect preictions of FCNNs are typically in low resolution. To increase the preiction resolution, the naive metho of irectly reucing the stries for all layers is not able to aress this ownsample preiction for a eep network. Small stries result in prohibitively expensive computation for a eep network, an also reuce the view-of-fiel (the image region that a filter is able to see ) of the network layers. Network layers with insufficient view-of-fiel may not be able to capture high-level semantic patterns an thus egrae the performance. To aress this low-resolution preiction issue, a variety of FCNN base methos are propose very recently which focus on refining the low-resolution preiction to obtain high resolution preiction. DeepLab-CRF [3] performs bilinear upsampling of the preiction score map to the input image size an applies the ense CRF metho [22] to refine the object bounary by levering low-level (color contrast) information. They consier Potts-moel base pairwise potential functions which enforce local smoothness. CRF-RNN [6] extens this approach by implementing the mean fiel CRF inference as recurrent layers for en-to-en learning of the ense CRF an FCNN network. The work in [23] learns econvolution layers to upsample the low-resolution preictions. The epth estimation metho [24] explores super-pixel pooling for builing the gap between the lowresolution feature map an high-resolution final preiction. Eigen et al. [20] perform coarse-to-fine learning of multiple networks with ifferent resolution outputs for refining the coarse preiction. The methos in [2], [25] explore mi-layer features (skip connections) for high-resolution preiction. Unlike these methos, our metho focuses on improving the coarse (low-resolution) preiction by learning general CNN pairwise potentials to capture semantic relations between patches. These methos are complementary to our metho. Jointly learning CNNs an CRFs has also been explore in other applications apart from segmentation. Recent work in [19], [24] proposes to jointly learn continuous CRFs an CNNs for epth estimation from single monocular images. They focus on continuously-value variable preiction,

3 3 FeatMap-Net Feature map (low resolution) Construct CRF graph (accoring to the feature map resolution) 1. Noe construction: each noe correspons to each spatial position in the feature map. CRF graph 2. Pairwise connections: each noe connects to all noes that lie within a pre-efine range box Fig. 2 An illustration of generating a feature map with FeatMap-Net an constructing the CRF graph. Surrouning Above/Below Fig. 3 An illustration of constructing pairwise connections in a CRF graph. A noe is connecte to all other noes which lie insie the range box (ashe box in the figure). Two types of spatial relations are escribe in the figure, which correspon to two types of pairwise potential functions. while our metho is for iscrete categorical label preiction. The work in [26] combines CRFs an CNNs for human pose estimation. The authors of [27] explore joint training of Markov ranom fiels an eep neural networks for the tasks of preicting wors from noisy images an multi-class classification. They require marginal inference for every graient calculation which is computationally expensive for training eep moels. 2 MODELING SEMANTIC PAIRWISE RELATIONS We first escribe how to buil the CRF graph for moeling semantic pairwise relations. Given an image, we first apply a convolutional network to generate a feature map. We refer to this network as FeatMap-Net, etails of which are presente in Sec. 4 (Fig. 6 shows the overall architecture). With this feature map, we construct one noe in the CRF graph corresponing to one spatial position of the feature map. Fig. 2 illustrates how we construct noes an pairwise connections in a CRF graph. Pairwise connections are constructe by connecting one noe to all other noes which lie within a spatial range box (the ashe box in Fig. 3). We consier ifferent spatial relations by efining ifferent types range boxes, an each type of spatial relation is moele by a specific pairwise potential function. As shown in Fig. 3, our metho moels the surrouning an above/below spatial relations. For the surrouning relation, the range box is centere at the noe. For the above/below relation, the bottom ege of the range box is centere at the noe. In our experiments, the size of the range box (ash box in the figure) size is 0.4a 0.4a, where a is the length of the short ege of the feature map. It woul be straightforwar to construct more pairwise potentials, by varying either the sizes or positions of the connection range boxes, an our approach is not limite to connections within boxes. 3 CONTEXTUAL DEEP CRFS Here we present the etails of our eep CRF moel. We enote by x X one input image an y Y the labeling mask which escribes the label configuration of each noe in the CRF graph. The energy function is enote by E(y, x; θ) which moels the compatibility of the input-output pair, with a small output value inicating high confience in the preiction y. All network parameters are enote by θ which we nee to learn. The conitional likelihoo for one image is formulate as follows: P (y x) = 1 exp [ E(y, x)]. (1) Z(x) Here Z is the partition function, efine as: Z(x) = y exp [ E(y, x)]. The energy function is typically formulate by a set of unary an pairwise potentials: E(y, x) = U(y p, x p ) U U p N U + V (y p, y q, x pq ). (2) V V (p,q) S V Here U is a unary potential function. To make the exposition more general, we consier multiple types of unary potentials with U the set of all such unary potentials. N U is a set of noes for the potential U. Likewise, V is a pairwise potential function with V the set of all types of pairwise potential. S V is the set of eges for the potential V. x p an x pq inicates the corresponing image regions which associate to the specifie noe an ege. The potential function is constructe by a eep network for generating feature map (FeatMap-Net) an a shallow network (Unary-Net or Pairwise-Net) to generate the output of the potential function. Details are escribe in the following sections. An overview of our contextual eep structure moel for preiction an training is shown in Fig Unary potential functions We formulate the unary potential function by stacking the FeatMap-Net for generating feature maps an a shallower fully connecte network (referre to as Unary-Net) to generate the final output of the unary potential function. The unary potential function is written as follows: U(y p, x p ; θ U ) = z p,yp (x; θ U ). (3) Here z p,yp is the output value of Unary-Net, which correspons to the p-th noe an the y p -th class. Fig. 4 shows an illustration of the Unary-Net an how it corporates with FeatMap-Net. Fig. 5 emonstrates the process for generating the feature vector for one noe. The input of the Unary-Net is the noe feature vector extracte

4 4 FeatMap-Net Feature map (low resolution) Network parameter learning: Minimise negative log-likelihoo Coarse level preiction: CRF marginal inference Low resolution preiction Construct CRF graph (accoring to the feature map) Generate features Learning stage Preiction stage One noe Noe feature vector Unary-Net One unary potential output: CRF graph Distribution of the preiction: One pairwise connection (ege) Ege feature vector Pairwise-Net One pairiwise potential output: in which, Other noes/eges Fig. 4 An overview of the propose contextual eep structure moel. Unary-Net an Pairwise-Net are shown here for generating potential function outputs. One noe in the CRF graph spatial corresponence in the feature map Feature map Extract the corresponing feature vector for one noe Feature vector for one noe One pairwise connection (ege) in the CRF graph spatial corresponence in the feature map Feature map Extract the feature vectors of two noes Two noe feature vectors Concatenate Feature vector for pairwise connection 2 Fig. 5 An illustration of generating feature vectors for CRF noes an pairwise connections from the feature map output by FeatMap-Net. The symbol enotes the feature imension. We concatenate the corresponing features of two connecte noes in the feature map to obtain the CRF ege features. from the feature map which is generate by FeatMap-Net. The feature vector for one CRF noe is simply the corresponing feature vector in the feature map. The imension of the Unary-Net output vector for one noe is K, which is the same as the number of classes. 3.2 Pairwise potential functions We formulate the unary potential function, analogous to the unary potentials, by stacking the FeatMap-Net for generating feature maps an a shallower fully connecte network (referre to as Pairwise-Net) to generate the final output of the pairwise potential function. The pairwise potential function is written as follows: V (y p, y q, x pq ; θ V ) = z p,q,yp,y q (x; θ V ). (4) Here z p,q,yp,y q is the output value of Pairwise-Net. It is the confience value for the noe pair (p, q) when they are labele with the class value (y p, y q ), which measures the compatibility of the label pair (y p, y q ) given the input image x. θ V is the corresponing set of CNN parameters for the potential V, which we nee to learn. The role of Pairwise- Net in our structure moel is illustrate in Fig. 4. Fig. 5 escribes the process for generating the feature vector for one pairwise connection. The input of Pairwise-Net is the ege feature vector which is generate from the feature map for two connecte noes. Following the work of [28], we concatenate the corresponing feature vectors of two connecte noes to obtain the CRF ege feature vector. The Pairwise-Net has K 2 output units to match the number of possible label combinations for a pair of noes. Our formulation of pairwise potentials is ifferent from the Potts-moel-base smoothness potentials in the existing methos of [3], [6]. The Potts-moel-base pairwise potentials are a log-linear functions an employ a special formulation for enforcing neighborhoo smoothness base on color contrast, an thus to sharpen object/region bounaries. In contrast, our pairwise potentials moel the semantic compatibility relations between two noes with the output for every possible value of the label pair (y p, y q ) iniviually parameterize by CNNs. Clearly, these two types of pairwise potential formulations have ifferent purposes an effects. Most recent segmentation methos, e.g., the work in [3], [5], [6], [7], have applie the ense CRF metho [22] in the preiction refinement stage for refining (sharpen object bounaries) the coarse (low-resolution) preiction. The

5 5 FeatMap-Net: Image pyrami scale level 1 Conv Block 1-5 (share across scales) Conv Block 6 (for 1 scale only) Conv Block 6 (for 1 scale only) Multi-scale feature maps Sliing pyrami pooling Sliing pyrami pooling 3 3 up-sample concatenate Final feature map 9 scale level 2 Conv Block 6 (for 1 scale only) Sliing pyrami pooling 3 scale level 3 Fig. 6 The etails of our FeatMap-Net. An input image is first resize into 3 scales, then each resize image goes through 6 convolution blocks to output one feature map. Top 5 convolution blocks are share for all scales. Every scale has a specific convolution block (Conv Block 6). We perform 2-level sliing pyrami pooling an concatenate the poole feature map to the original feature map. The symbol enotes the feature imension. ense CRF metho is a Potts-moel-base fully-connecte CRF with pairwise potentials base on color contrast for local smoothness. It is important to clarify that, this smoothness CRFs an our contextual eep CRFs are working in ifferent preiction stages. Our contextual CNN pairwise potentials are applie in the coarse preiction stage to improve the lower-resolution preiction, rather than applying in the bounary refinement stage. In our framework, after obtaining the coarse level preiction, we still nee to perform a refinement step to obtain the final high-resolution preiction (as shown in Fig. 1). Hence we also apply the ense CRF metho [22], as in many other recent methos, in the preiction refinement step. Therefore, our metho takes avantage of both contextual CNN potentials an the traitional smoothness potentials to improve the final result. More etails for preiction can be foun in Sec Asymmetric pairwise potentials As in [29], [30], moeling asymmetric relations requires learning asymmetric potential functions, the output of which shoul epen on the input orer of a pair of noes. In other wors, the potential function is require to be capable of moeling ifferent input orers. Typically we have the following case for asymmetric relations: V (y p, y q, x pq ) V (y q, y p, x qp ). (5) Ieally, the potential V is learne from the training ata. Here we iscuss the asymmetric relation above/below as an example. We take avantage of the input pair orer to inicate the spatial configuration of two noes, thus the input (y p, y q, x pq ) inicates the configuration that the noe p is spatially lies above the noe q. Clearly, the potential function is require to moel ifferent input orers. The asymmetric property is reaily achieve with our general formulation of pairwise potentials. The ege features for the noe pair (p, q) are generate from a concatenation of the corresponing features of noes p an q (as in [28]), in that orer. The potential output for every possible pairwise label combination for (p, q) is iniviually Feature map Sliing pooling winow size: 5x5 Sliing pooling winow size: 9x9 Poole feature map Poole feature map 3 Concatenate feature map Fig. 7 Details for sliing pyrami pooling. We perform 2-level sliing pyrami pooling on the feature map for capturing patchbackgroun context, which encoe rich backgroun information an increase the fiel-of-view for the feature map. parameterize by the pairwise CNNs. These factors ensure that the ege response is orer epenent, easily satisfying the asymmetric requirement. 4 EXPLORING BACKGROUND CONTEXT WITH FEATMAP-NET We evelop multi-scale CNNs an sliing pyrami pooling in our FeatMap-Net to encoe rich backgroun information for capturing patch-backgroun context. Fig. 6 shows the architecture of FeatMap-Net. Details are presente shortly in the squeal. Applying CNNs on multi-scale images has shown improve performance in some recent segmentation methos, e.g., [1], [8]. In our multi-scale network, an input image is first resize into 3 scales, then each resize image goes through 6 convolution blocks to output one feature map. In our experiment, the 3 scales for the input image are set to 1.2, 0.8 an 0.4. All scales share the same top 5 convolution blocks. In aition, each scale has an exclusive convolution block ( Conv Block 6 in the figure) which captures scaleepenent information. The resulting 3 feature maps (corresponing to 3 scales) are of ifferent resolutions, therefore

6 6 we upscale the two smaller ones to the size of the largest feature map using bilinear interpolation. These feature maps are then concatenate to form one feature map. We perform spatial pyrami pooling [10] (a moifie version using sliing winows) on the feature map to capture information from backgroun regions in multiple sizes. From another point of view, this increases the fiel-of-view for the feature map, for which feature vectors are able to encoe information from a larger image region. Increasing the fiel-of-view generally helps to improve performance, which is also iscusse in [3]. The etails of spatial pyrami pooling are illustrate in Fig. 7. In our experiment, we perform 2-level pooling for each image scale. We efine 5 5 an 9 9 sliing pooling winows with max-pooling to generate 2 sets of poole feature maps. These poole feature maps are then concatenate to the original feature map to construct the final feature map, an thus the resulting feature imension is for one image scale. 5 NETWORK CONFIGURATIONS We show the etaile network layer configuration for all networks in Fig. 8. For FeatMap-Net, the configuration of the convolution blocks is similar to the VGG-16 moel [21]. The top 5 convolution blocks share the same configuration as the VGG-16 network. The first fully-connecte layer in VGG-16 is converte into a convolution layer ( see FCN in [2] for etails) an merge into the 5-th convolution block. We only transfer the first fully-connecte (FC) layer into our network rather than 2 FC layers. Note that transferring 2 FC layers is commonly applie in almost all recent FCN base methos [2], [3], [6]. The FC layer in the VGG-16 moel contains a large number of filters (4096), thus our network which transfers only one FC layer is more efficient. In FeatMap-Net, we a a new convolution block ( Conv Block 6 in the figure) which contains 2 convolution layers. This extra convolution block is not existe in the VGG-16 network. With this new convolution block, we are able to capture scale-epenent information an increase the abstraction level. We also have the consieration of increasing the fiel-of-view for the final feature map by aing this extra block. As iscusse in Sec. 1.1, The strie setting of the convolution an pooling layers will result in a feature map which has a smaller resolution than the input image. For the convolution an pooling layers, the resolution of the output feature map is own-sample if the strie is greater than 1. Note that there are a number of convolution/pooling layers in VGG-16 moel which use the strie setting of 2. Therefore, for the original VGG-16 moel, the resolution of the output feature map is 32 times smaller than the size of the input image (see FCN [2] for etails). To increase the resolution of the feature map, almost all recent VGG- 16 base methos [2], [3], [6] reuce the strie of the last two pooling layers to 1, which reuces the own-sampling factor from 32 to 8. In our setting, we reuce the strie of the last max pooling layer (only one layer) in the VGG-16 network to 1, instea of reucing for two pooling layers in many other methos [3], [6]. The resolution of the resulting feature map Conv block 1: 3 x 3 conv 64 3 x 3 conv 64 2 x 2 pooling Conv block 4: 2 x 2 pooling Unary-Net 2 fully-connecte layers: Fully-con 512 Fully-con K FeatMap-Net Conv block 2: 3 x 3 conv x 3 conv x 2 pooling Conv block 5: 2 x 2 pooling 7 x 7 conv 4096 Conv block 3: 3 x 3 conv x 3 conv x 3 conv x 2 pooling Conv block 6: Pairwise-Net 2 fully-connecte layers: Fully-con 512 Fully-con K 2 Fig. 8 The etaile configuration of the networks: FeatMap-Net, Unary-Net an Pairwise-Net. K is the number of classes. The filter size for convolution an the number of filters are shown for all layers. For FeatMap-Net, the top 5 convolution blocks share the same configuration as the convolution blocks in the VGG-16 network. The strie of the last max pooling layer is 1, an for the other max pooling layers we use the same strie setting as the VGG-16 network. is 16 times smaller than the size of the input image. For a input image, the resolution of the resulting feature map is aroun Directly changing the strie inevitably egraes the performance of the learne filters since the fiel-of-view of the input feature map for some filters is change. To preserve the fiel-of-view, recent work has propose a number of approaches. For example, a straightforwar approach is to increase the receptive fiel size of the filter (e.g., ouble the filter size). Large filter sizes will significantly increase the computation cost for convolution operations. This approach also brings the problem of how to upsample the filter weights. Probably a better approach is to apply the hole algorithm as in [3], which performs a skipping (sample) ot-prouct calculation for filter convolution. Therefore, a large convolution winow size can be applie without increasing the computation cost. Different from existing approaches, here we apply a simple yet effective approach. We a extra two 3 3 convolution layers ( Conv Block 6 ) instea of increasing the filter size. These extra layers are able to enlarge the fiel-of-view an compensate the sie-effect of reucing the strie in pooling layers. 6 PREDICTION At the preiction stage, our eep structure moel generates low-resolution preiction (as shown in Fig. 1), which is 1/16 of the input image size. As iscusse in Sec. 5, this is ue to the strie setting of pooling layers. Therefore, we apply two preiction stages for obtaining the final high-resolution preiction: the coarse-level preiction stage an the preiction refinement stage. We first perform CRF inference on our contextual structure moel to generate a score map for coarse-level preiction, then we bilinearly

7 1. Coarse-level preiction stage: Score maps in low-resolution eep contextual moel CRF inference 2. Preiction refinement stage: Final preiction Sharpen bounary Score maps bilinearly upsample Fig.

We first perform CRF inference on our contextual moel to generate a score map for coarse-level preiction, then we bilinearly unsmaple the score map an apply a bounary refinement metho [22] to obtain

7 7 1. Coarse-level preiction stage: Score maps in low-resolution eep contextual moel CRF inference 2. Preiction refinement stage: Final preiction Sharpen bounary Score maps bilinearly upsample Fig. 9 An illustration of our two-stage preiction process. The preiction process consists of two stages: the coarse-level preiction stage an the preiction refinement stage. We first perform CRF inference on our contextual moel to generate a score map for coarse-level preiction, then we bilinearly unsmaple the score map an apply a bounary refinement metho [22] to obtain the final preiction which has the same resolution as the input image. unsmaple the score map an apply a bounary refinement metho [22] to obtain the final preiction which has the same resolution as the input image. This two-stage preiction process is illustrate in Fig Coarse-level preiction stage We perform CRF inference on our contextual structure moel to obtain the coarse preiction of a test image. For example, we can solve the maximum a posteriori (MAP) problem: y = argmax y P (y x). Alternatively, we also can consier the marginal inference over noes for preiction: p N : P (y p x) = y\y p P (y x). (6) We obtain the marginal istribution for each noe after performing this marginal inference. This marginal istribution can be further applie in the next preiction stage for bounary refinement. Details are shown in the next section. Our CRF graph oes not form a tree structure, nor are the potentials submoular, hence we nee to an apply approximate inference. To aress this we apply an efficient message passing algorithm which is base on the mean fiel approximation [31]. The mean fiel algorithm constructs a simpler istribution Q(y), e.g., a prouct of inepenent marginals: Q(y) = p N Q p(y p ), which minimizes the KLivergence between the istribution Q(y) an P (y). In our experiments, we perform 3 mean fiel iterations. 6.2 Preiction refinement stage We generate the score map for the coarse preiction from the marginal istribution which we obtain from the mean-fiel inference. We first bilinearly up-sample the score map of the coarse preiction to the size of the input image. Then we apply a common post-processing metho [22] (ense CRF) to sharpen the object bounary for generating the final highresolution preiction. This post-processing metho leverages low-level pixel intensity information (color contrast) for bounary refinement. Note that most recent work on image segmentation prouce low-resolution preiction an have a upsampling an refinement process/moel for the final preiction, e.g., [3], [6], [7]. In summary, we simply perform bilinear upsampling of the coarse score map an apply the bounary refinement post-processing. We argue that this stage can be further improve by applying more sophisticate refinement methos, e.g., training econvolution networks [23] training multiple coarse to fine learning networks [20], an exploring mile layer features for high-resolution preiction [2], [25]. It is expecte that applying better refinement approaches will gain further performance improvement. In the experiment part, we show an example of exploring the feature maps from mile layers to refine the coarse preiction. We apply this improve refinement approach on the ataset PASCAL VOC Refer to Sec. 9.2 for etails. 7 CRF TRAINING A common approach to CRF learning is to maximize the likelihoo, or equivalently minimize the negative loglikelihoo, which can be written for one image as: log P (y x; θ) = E(y, x; θ) + log Z(x; θ). (7) Aing regularization to the CNN parameter θ, the optimization problem for CRF learning is: min θ λ N 2 θ 2 2 log P (y (i) x (i) ; θ). (8) i=1 Here x (i), y (i) enote the i-th training image an its segmentation mask; N is the number of training images; λ is the weight ecay parameter. Substituting (7) into (8) yiels: min θ λ N 2 θ i=1 [ ] E(y (i), x (i) ; θ) + log Z(x (i) ; θ). (9) We can apply stochastic graient (SGD) base methos to optimize the above problem for learning θ. The energy function E(y, x; θ) is constructe from CNNs, an its graient θ E(y, x; θ) easily compute by applying the chain rule as in conventional CNNs. However, the partition function Z brings ifficulties for optimization. Its graient is written: θ log Z(x; θ) = θ log exp [ E(y, x)] y = exp [ E(y, x; θ)] y y exp [ E(y, x; θ)] θ[ E(y, x; θ)] = E y P (y x;θ) θe(y, x; θ) (10) Generally the size of the output space Y is exponential in the number of noes, which prohibits the irect calculation of Z an its graient. The CRF graph we consiere for segmentation here is a loopy graph (not tree-structure), in which a large number of noes (more than 1000) an pairwise connections (more than ) are involve for one image. For loopy graph with large number of noes an eges, typically approximation is require for inference, an even this is generally computationally expensive.

8 8 More importantly, usually a large number of SGD iterations are require for training CNNs. Typically the number of iterations is in tens or hunres of thousans. Thus performing inference at each SGD iteration is very computationally expensive. 7.1 Piecewise training of CRFs Instea of irectly solving the optimization in (9), we propose to apply an approximate CRF learning metho. In the literature, there are two popular types of learning methos which approximate the CRF objective: pseuo-likelihoo learning [32] an piecewise learning [9]. The main avantage of these methos in term of training eep CRF is that they o not involve marginal inference for graient calculation, which significantly improves the efficiency of training. Decision tree fiels [33] an regression tree fiels [34] are base on pseuo-likelihoo learning, while piecewise learning has been applie in the work [9], [28]. Here we evelop this iea for the case of training the CRF with the CNN potentials. In piecewise training, the conitional likelihoo is formulate as a number of inepenent likelihoos efine on potentials, written as: P (y x) = P U (y p x) P V (y p, y q x). U U p N U V V (p,q) S V The likelihoo P U (y p x) is constructe from the unary potential U. Likewise, P V (y p, y q x) is constructe from the pairwise potential V. P U an P V are written as: P U (y p x) = exp [ U(y p, x p )] y p exp [ U(y p, x p )], (11) P V (y p, y q x) = exp [ V (y p, y q, x pq )] y p,y q exp [ V (y p, y q, x pq )]. (12) The log-likelihoo for piecewise training is then: log P (y x) = log P U (y p x) U U p N U + log P V (y p, y q x). (13) V V (p,q) S V The optimization problem for piecewise training is to minimize the negative log likelihoo with regularization: min θ λ N 2 θ V V i=1 [ U U (p,q) S (i) V p N (i) U log P U (y p x (i) ; θ U ) ] log P V (y p, y q x (i) ; θ V ). (14) Compare to the objective in (9) for irect maximum likelihoo learning, the above objective oes not involve the global partition function Z(x; θ). To calculate the graient of the above objective, we only nee to calculate the graient θu log P U an θv log P V. With the efinition in (11), P U is a conventional softmax normalization function over only K (the number of classes) elements. Similar analysis can also be applie to P V. Hence, we can easily calculate the graient without involving expensive inference. Moreover, we are able to perform parallele training of potential functions, since the above objective is formulate by a summation of inepenent log-likelihoos. As previously iscusse, CNN training usually involves a large number of graient upate iteration which prohibit the repeate expensive inference. Our piecewise approach here provies a practical solution for learning CRFs with CNN potentials on large-scale ata. 8 IMPLEMENTATION DETAILS For the FeatMap-Net, the first 5 convolution blocks an the first convolution layer in the 6th convolution block are initialize from the VGG-16 network [21]. All remaining layers are ranomly initialize. Note that VGG-16 network is wiely applie in recent segmentation methos. All layers are traine using back-propagation/stochastic graient escen (SGD). We apply simple ata augmentation in the training stage. Specifically, we perform ranom scaling (from 0.7 to 1.2) an flipping of the images for training. As illustrate in Fig. 3, we use 2 types of pairwise potential functions. In total, we have 1 type of unary potential function an 2 types of pairwise potential functions. We formulate one specific FeatMap-Net an potential network (Unary-Net or Pairwise-Net) for one type of potential function. In other wors, one type of potential function is constructe by one FeatMap-Net an a shallow potential network. More etails of FeatMap-Net an Unary/Pairwise- Net can be foun in Fig. 4. There are two main benefits of moeling specific FeatMap-Net for each potential instea of sharing one FeatMap-Net across potentials. Different types of potentials have ifferent focus an thus probably require separate feature maps. Using separate FeatMap-Net allows generating specific high-level features for the corresponing potential function. Moreover, with separate FeatMap-Net, we are able to parallel the training of ifferent types of potentials, an thus ease the implementation an spee up the training. 8.1 Efficient learning As previously iscusse, one noe in the CRF graph is connecte to all noes which lie within a preefine range box. Uner this setting, the number of pairwise connections for one noe can be a few hunre. For example, for an input image with a resolution of pixels an 1.2x scaling, the resolution of the feature map after going through FeatMap-Net is aroun In this case, the number of noes are aroun 1200, an the number of connections is aroun 200 for one noe. Hence for this image, we nee to process pairwise relations for generating the ege features, passing forwar the Pairwise- Net an back-propagating the graients (in the training stage). These operations are consierably computationally expensive for such a large number of pairwise connections. If using high feature imension for the feature map, these operations can even run out of the GPU memory. In our solution, to spee up the training an testing of the pairwise potentials, we perform sampling of pairwise connections for each noe in the CRF graph. Since the feature map encoes reunant information in local regions, performing sampling can still preserve sufficient pairwise

magnitue fewer connections than the original setting.

2 Asynchronous graient upate The number of pairwise connections is still large even with sampling, which brings the problem of keeping the ege features in the GPU memory.

This is similar to the case that using an extremely large batch size for graient calculation in the training of a conventional classification network.

Moreover, as iscusse in [36], using small batch size may perform noise injection in the graient calculation as is a form of regularization, which may lea to better parameter solutions.

Therefore, we probably shoul not consier all pairwise connections in one graient iteration for upating the Pairwise-Net.

With asynchronous graient upate, the graient calculations for ifferent parts of the network are not require in the same iteration, which breaks the epenency between ifferent parts (or layers) of the

Specifically, in one stochastic graient iteration, we perform multiple sub-iterations of graient upate for the Pairwise-Net an collect the graients for the FeatMap-Net.

Clearly, in each subiteration we only process a small number of connections for upating the network parameters of Pairwise-Net, thus GPU consumption is low an the batch size for learning Pairwise-Net

After going through all pairwise connections, we collect the graients for FeatMap-Net, an perform a conventional back-propagation graient upate to FeatMap-Net.

types of scene images, (a) Testing (b) Groun Truth (c) Preiction Fig. 10 Preiction examples of our metho on NYUDv2 ataset. TABLE 1 Segmentation results on NYUDv2 ataset (40 classes).

[38] FCN-32s [2] FCN-HHA [2] ours-unary ours ours+ training ata RGB-D RGB RGB-D RGB RGB RGB pixel accuracy 60.3 60.0 65.4 65.4 67.6 69.9 mean accuracy 42.2 46.1 47.2 49.6 51.6 IoU 28.6 29.2 34.0 34.

9 9 relations while removing reunancies. Specifically, we sample 24 neighboring noes base on a regular 5 5 gri spanning the range box (excluing self-connection), an thus we have 24 pairwise connections for each noe, which is an orer of magnitue fewer connections than the original setting. We observe that this sampling setting, which reuces the number of pairwise connections significantly, spees up the training without egraing the performance. 8.2 Asynchronous graient upate The number of pairwise connections is still large even with sampling, which brings the problem of keeping the ege features in the GPU memory. Moreover, consiering a large number of pairwise connections (more than ) in one iteration for upating the parameters of Pairwise-Net might result in egrae graients. This is similar to the case that using an extremely large batch size for graient calculation in the training of a conventional classification network. An extremely large batch size for graient upate can significantly slow own the convergence an may ecrease the performance [35]. Moreover, as iscusse in [36], using small batch size may perform noise injection in the graient calculation as is a form of regularization, which may lea to better parameter solutions. Overall, from both empirical observations an theoretical analysis, a appropriate setting of batch size is key to the network training. Therefore, we probably shoul not consier all pairwise connections in one graient iteration for upating the Pairwise-Net. To reuce the GPU memory consumption an improve the batch upate for the Pairwise-Net, we perform asynchronous graient upate for training the FeatMap-Net an Pairwise-Net. With asynchronous graient upate, the graient calculations for ifferent parts of the network are not require in the same iteration, which breaks the epenency between ifferent parts (or layers) of the networks. Asynchronous graient upate is wiely applie in largescale istribute network learning. For etails one may refer to [37]. Specifically, in one stochastic graient iteration, we perform multiple sub-iterations of graient upate for the Pairwise-Net an collect the graients for the FeatMap-Net. In each sub-iteration, a subset of pairwise connections is selecte (e.g., 2000) for graient calculation an the parameters of Pairwise-Net are upate. Clearly, in each subiteration we only process a small number of connections for upating the network parameters of Pairwise-Net, thus GPU consumption is low an the batch size for learning Pairwise-Net is reuce. This asynchronous approach aresses the problems of GPU memory an large batch size for training Pairwise-Net. After going through all pairwise connections, we collect the graients for FeatMap-Net, an perform a conventional back-propagation graient upate to FeatMap-Net. 9 E XPERIMENTS We evaluate our metho on 8 challenging semantic segmentation atasets: PASCAL VOC 2012, NYUDv2, PASCAL-Context, SIFT-flow, SUN-RGBD, KITTY, COCO an Cityscapes, which covers various types of scene images, (a) Testing (b) Groun Truth (c) Preiction Fig. 10 Preiction examples of our metho on NYUDv2 ataset. TABLE 1 Segmentation results on NYUDv2 ataset (40 classes). We compare to a number of recent methos. Our metho significantly outperforms the existing methos. metho Gupta et al. [38] FCN-32s [2] FCN-HHA [2] ours-unary ours ours+ training ata RGB-D RGB RGB-D RGB RGB RGB pixel accuracy mean accuracy IoU incluing inoor/outoor scene, street scene, etc. Our comprehensive experiments show that the propose metho achieves new state-or-the-art performance on these atasets. Our system is built on MatConvNet [40]. The segmentation performance is measure by the intersection-over-union (IoU) score [41], the pixel accuracy an the mean accuracy on categories [2]. We enote cij as an element in the confusion matrix, which is the number of pixels with the i-th category as the groun truth an the j th category as the preiction; ti is the total number of pixels for the i-th category in the groun truth; K is the number of categories. Pixel accuracy measures the portion of correctly P i cii P. Mean accuracy measures the perpreicte pixels: i ti 1 P cii category pixel accuracy: K i ti. IoU score calculates the portion of the intersection between the groun truth an the 1 P P cii preiction: K. i ti + cij cii j 9.1 Results on the NYUDv2 ataset We first evaluate our metho on the NYUDv2 [42] ataset which has 1449 RGB-D inoor scene images. We use the segmentation labels provie in [43] for which the labels are processe into 40 classes. We use the stanar training set which contains 795 images an the test set which contains 654 images. We train our moels only on RGB images without using the epth information.

10 (a) Testing (b) Groun Truth (c) Preiction ()

The table shows the value ae by the ifferent

sliing pyrami pooling + multi-scales + bounary

Some preiction examples are shown in Fig.

competing methos even though it makes no use of

pyrami pooling in our metho which captures rich

Applying our full contextual moel with CNN

further improvement over the unary-only moel,

which clearly verifies its effectiveness.

we pretrain our moel on the Microsoft COCO

10 10 (a) Testing (b) Groun Truth (c) Preiction () Testing (e) Groun Truth (f) Preiction (g) Testing (h) Groun Truth (i) Preict Fig. 11 Some preiction examples of our metho on PASCAL VOC 2012 ataset. TABLE 2 Ablation Experiments. The table shows the value ae by the ifferent system components of our metho on the NYUDv2 ataset (40 classes). metho FCN-32s [2] FullyConvNet Baseline + sliing pyrami pooling + multi-scales + bounary refinement + CNN contextual pairwise pixel accuracy mean accuracy IoU Results are shown in Table 1. Some preiction examples are shown in Fig. 10 Unless otherwise specifie, our moels are initialize using the VGG-16 network. VGG-16 is also use in most competing methos such as FCN [2]. Our unary-only moel alreay outperforms the competing methos even though it makes no use of the extra epth information. This verifies the effectiveness of the multiscale network architecture an sliing pyrami pooling in our metho which captures rich backgroun contextual information. Applying our full contextual moel with CNN pairwise potentials, we attain significant further improvement over the unary-only moel, which sets new state-ofthe-art performance on the NYUDv2 ataset. Our contextual CNN pairwise potentials bring significant improvement (2.5 percentage point) over the unary-only moel, which clearly verifies its effectiveness. To further explore the potential of our metho, we pretrain our moel on the Microsoft COCO ataset [44] with labels in 80 classes, an we initialize from this moel an train on the NYUD images. This result is enote as ours+ in the result table. Our moel gets further improve with the COCO image pre-training. On this ataset, we have improve the segmentation accuracy by a large margin even without using the epth information Component evaluation We evaluate the performance contribution of ifferent components of the FeatMap-Net for capture patch-backgroun context on the NYUDv2 ataset. We present the results of aing ifferent components in FeatMap-Net, which are shown in Table 2. We start from a baseline setting of our FeatMap-Net ( FullyConvNet Baseline in the result table), for which multi-scale an sliing pooling is remove. This baseline setting is the conventional fully convolution network for segmentation, which can be consiere as our implementation of the FCN metho in [2]. The result shows that our CNN baseline implementation ( FullyConvNet ) achieves very similar performance (slightly better) than the FCN metho. Clearly, our multi-scale network esign an sliing pyrami pooling significantly improve the performance, which shows the effectiveness of the propose approach for encoing rich backgroun context. Applying the ense CRF metho [22] for bounary refinement gains further improvement. Finally, aing our contextual CNN pairwise potentials, we achieve very significant further improvement. 9.2 Results on the PASCAL VOC 2012 ataset PASCAL VOC 2012 [41] is a well-known segmentation evaluation ataset which consists of 20 object categories an one backgroun category. This ataset is split into a training set, a valiation set an a test set, which respectively contain 1464, 1449 an 1456 images. Following a conventional setting in [3], [15], the training set is augmente by extra annotate VOC images provie in [45], which results in training images. We verify our performance on the PASCAL VOC 2012 test set. We compare with a number of recent methos with competitive performance. Since the groun truth labels are not available for the test set, we evaluate our metho through the VOC evaluation server.

11 TABLE 3 Iniviual category results on the PASCAL VOC 2012 test set (IoU scores). Our metho performs the best 11 aero bike bir boat bottle bus car cat chair cow metho mean Only using VOC training ata FCN-8s [2] Zoom-out [8] DeepLab [3] CRF-RNN [6] DeconvNet [23] DPN [39] ours Using VOC+COCO training ata DeepLab [3] CRF-RNN [6] BoxSup [7] DPN [39] ours table og horse mbike person potte sheep sofa train tv The IoU scores are shown in the last column of Table 3. Preiction examples of our metho are shown in Fig. 11. We first train our moel only using the VOC images. We achieve an IoU score of 75.3, which is the best result amongst methos that only use the VOC training ata. 1 To improve the performance, following the setting in recent work [3], [7], we train our moel with the extra images from the COCO ataset [44]. With these extra training images, we achieve an IoU score of As escribe in Sec. 6, our eep structure moel generates low-resolution coarse preiction, which is 1/16 of the input image size. To obtain the final high-resolution preiction we apply a simple yet effective approach: we first perform bilinear upsampling of the coarse score map an then apply the bounary refinement post-processing [22]. To improving this simple approach, we exploit the feature maps from mile layers to refine the coarse preiction an prouce high-resolution preiction, which is similar to the methos in [2], [3], [25]. With this improve refinement approach, we finally achieve an IoU score of 77.8, which is best reporte result on this challenging ataset. 2 The feature maps from the mile layers encoe lower level visual information (from ege patterns to texture/object part patterns) an have higher resolution than the final output, thus it is expecte that learning extra layers on these feature maps helps to preict etails in the object bounaries. Specifically, we a refinement layers on top of the feature maps from the first 5 max-pooling layers an the score map of the coarse preiction (output by our eep structure moel). Details are shown in Fig. 12. These refinement layers play a role of refining the coarse preiction by exploring mile layer features, which increase the resolution of the preiction from 1/16 (coarse preiction) to 1/2 of the input image. With this improve preiction, we perform bounary refinement using [22] to generate the final preiction. The results for each category are shown in Table 3. We outperform comparing methos in most categories. For only using the VOC training set, our metho outperforms the secon best metho, DPN [39], on 18 categories out of 20. For using VOC+COCO training set, our metho outperforms DPN [39] on 15 categories out of The result link at the VOC evaluation server: ox.ac.uk:8080/anonymous/keffm4.html 2. The result link at the VOC evaluation server: ox.ac.uk:8080/anonymous/mvtntx.html Input Image 1/1 size Input score map (coarse preiction) 1/16 size Feature maps from top 5 max-pooling layers 1/2 size 1/4 size Apply 2 Conv layers for each feature map upsample an concatenate 1/2 size 3 Conv layers Refine score map 1/2 size Fig. 12 The illustration of exploiting the feature maps from mile layers to refine the low-resolution (1/16 of the input image) coarse preiction. The refine preiction has a resolution of 1/2 of the input image. TABLE 4 Segmentation results on the val set of Cityscapes ataset (19 classes). Our metho significantly outperforms the fully convolution network baseline ( FullyConvNet ). metho pixel accuracy mean accuracy IoU FullyConvNet FullyConvNet + refine ours Results on the Cityscapes ataset The large scale outoor image ataset Cityscapes 3 [46] contains street scene images from 50 ifferent cities. This ataset provies pixel-level semantic segmentation labels of 5000 images for 25 classes incluing roa, car, peestrian, bicycle, sky etc. The provie trainval set has 3475 image. We use the training set (2975 images) for training. Since the groun truth of the test set is not available, we evaluate on the val set which contains 500 images. Following the ataset evaluation protocol, 19 classes are vali for evaluation, the rest 6 classes are treate as voi which are not consiere in evaluation. Results are shown in Table 4. Preiction examples are shown in Fig. 13. Since this is a newly release ataset, there is no publishe results for comparison. Therefore, we present two baseline results for comparison. The first baseline ( FullyConvNet in the table) is our implementation of the conventional fully convolution network for segmentation. More etails of this baseline metho can be foun in Sec The secon baseline comparison metho ( FullyConvNet + refine in the table) applies the ense CRF [22] on top of FullyConvNet 3. Avaiable at:

12 12 (a) Testing (b) Groun Truth () Testing (c) Preiction (e) Groun Truth (f) Preiction Fig. 13 Preiction examples of our metho on Cityscapes ataset. TABLE 5 Segmentation results on PASCAL-Context ataset (60 classes). Our metho performs the best. metho O2P [47] CFM [48] FCN-8s [2] BoxSup [7] ours pixel accuracy mean accuracy IoU for bounary refinement. Clearly, our metho significantly outperforms these baseline methos. 9.4 metho Liu et al. [51] Ren et al. [52] Kenall et al. [53] ours training ata RGB-D RGB-D RGB RGB pixel accuracy mean accuracy IoU Results are shown in Table 6. Our metho outperform the existing methos by a large margin, even though we oes not make use of the epth information for training. Results on the PASCAL-Context ataset The PASCAL-Context [49] ataset provies the segmentation labels of the whole scene (incluing the stuff labels) for the PASCAL VOC images. We use the segmentation labels which contain 60 classes (59 classes plus the backgroun class ) for evaluation. We use the provie training/test splits. The training set contains 4998 images an the test set contains 5105 images. Results are shown in Table 5. Preiction examples are shown in Fig. 14 Our metho significantly outperform the competing methos. To our knowlege, ours is the best reporte result on this ataset. 9.5 TABLE 6 Segmentation results on SUN-RGBD ataset (37 classes). We compare to a number of recent methos. Our metho significantly outperforms the existing methos. Results on the SUN-RGBD ataset SUN-RGBD [50] is a segmentation ataset contains aroun 10, 000 inoor images an provies pixel labeling masks of 37 classes, which is an extension of the NYUD ataset [42]. 9.6 Results on the COCO ataset The COCO ataset [44] contains more than 1 million images an provie segmentation labels for 80 classes. Since the test set is not available, we generate 2599 images for testing an the remaining images are for training. We select the test images on a class balance basis which ensures every category at lease appears in 50 images. Labeling regions which are smaller than 200 pixels are treate as voi which are not consiere in training an evaluation. Results are shown in Table 7. We compare to two baseline methos which are base on conventional fully convolution networks. The etails of these baseline methos are the same as that for the Cityscapes ataset (see Sec. 9.3). The results shows that our metho significantly outperforms the baselines.

TABLE 8 Segmentation results on SIFT-flow ataset (33 classes). Our metho performs the best. metho pixel accuracy mean accuracy IoU Liu et al. [51] 76.7 - - Tighe et al. [54] 75.

5 ours 88.1 53.4 44.9 TABLE 9 Segmentation results on KITTI ataset (10 classes). We compare to a number of recent methos. Our metho significantly outperforms the existing methos.

3 13 (a) Testing (b) Groun Truth (c) Preiction Fig. 14 Preiction examples on PASCAL-Context ataset. TABLE 7 Segmentation results on COCO ataset (80 classes).

0 41.3 ours 88.3 58.7 46.8 9.7 Results on the SIFT-flow ataset We further evaluate our metho on the SIFT-flow ataset.

The training set has 2488 images an the test set has 200 images. Since the images are in small sizes, we upscale the image by a factor of 2 for training.

8 Results on the KITTI ataset We perform further evaluation on the KITTI ataset [56] for roa image segmentation. Zhang at el.

We follow the provie training an testing splits for evaluation an report the results in Table 9. Clearly, our metho performs the best.

10 CONCLUSIONS We have propose a metho which combines CNNs an CRFs to exploit complex contextual information for semantic image segmentation.

We have performe comprehensive experiments on 8 challenging segmentation atasets an we achieve best performance on all evaluate ataset incluing the PASCAL VOC 2012 ataset.

ACKNOWLEDGMENTS This research was supporte by the Data to Decisions Cooperative Research Centre an by the Australian Research Council through the ARC Centre for Robotic Vision

Rei s participation was in part supporte by an ARC Laureate Fellowship (FL130100102). C. Shen is the corresponing author. REFERENCES [1] C. Farabet, C. Couprie, L. Najman, an Y.

Darrell, Fully convolutional networks for semantic segmentation, in Proc. IEEE Conf. Comp. Vis. Pattern Recogn., 2015. [3] L.

13 TABLE 8 Segmentation results on SIFT-flow ataset (33 classes). Our metho performs the best. metho pixel accuracy mean accuracy IoU Liu et al. [51] Tighe et al. [54] Tighe et al. (MRF) [54] Farabet et al. (balance) [1] Farabet et al. [1] Pinheiro et al. [55] FCN-16s [2] ours TABLE 9 Segmentation results on KITTI ataset (10 classes). We compare to a number of recent methos. Our metho significantly outperforms the existing methos. metho pixel accuracy mean accuracy IoU Caena et al. [58] Zhang et al. [57] ours ours (a) Testing (b) Groun Truth (c) Preiction Fig. 14 Preiction examples on PASCAL-Context ataset. TABLE 7 Segmentation results on COCO ataset (80 classes). Our metho significantly outperforms the fully convolution network ( FullyConvNet ). metho pixel accuracy mean accuracy IoU FullyConvNet FullyConvNet + refine ours Results on the SIFT-flow ataset We further evaluate our metho on the SIFT-flow ataset. This ataset contains 2688 images an provies the segmentation labels for 33 classes. We use the stanar split for training an evaluation. The training set has 2488 images an the test set has 200 images. Since the images are in small sizes, we upscale the image by a factor of 2 for training. Results are shown in Table 8. We achieve the best performance on this ataset. 9.8 Results on the KITTI ataset We perform further evaluation on the KITTI ataset [56] for roa image segmentation. Zhang at el. [57] provie semantic segmentation labels of 10 classes for 252 images, in which 140 images are for training an the remaining 112 are for testing. We follow the provie training an testing splits for evaluation an report the results in Table 9. Clearly, our metho performs the best. To further improve the performance, we perform pre-training on COCO images, for which the result is enote by ours+ in the result table. 10 CONCLUSIONS We have propose a metho which combines CNNs an CRFs to exploit complex contextual information for semantic image segmentation. Basically, we formulate CNN base pairwise potentials for moeling semantic relations between image regions. We have performe comprehensive experiments on 8 challenging segmentation atasets an we achieve best performance on all evaluate ataset incluing the PASCAL VOC 2012 ataset. The propose metho is potentially wiely applicable to other tasks. ACKNOWLEDGMENTS This research was supporte by the Data to Decisions Cooperative Research Centre an by the Australian Research Council through the ARC Centre for Robotic Vision (CE ). C. Shen s participation was in part supporte by an ARC Future Fellowship (FT ). I. Rei s participation was in part supporte by an ARC Laureate Fellowship (FL ). C. Shen is the corresponing author. REFERENCES [1] C. Farabet, C. Couprie, L. Najman, an Y. LeCun, Learning hierarchical features for scene labeling, IEEE T. Pattern Analysis & Machine Intelligence, [2] J. Long, E. Shelhamer, an T. Darrell, Fully convolutional networks for semantic segmentation, in Proc. IEEE Conf. Comp. Vis. Pattern Recogn., [3] L. Chen, G. Papanreou, I. Kokkinos, K. Murphy, an A. L. Yuille, Semantic image segmentation with eep convolutional nets an fully connecte CRFs, in Proc. Int. Conf. Learning Representations, [4] G. Heitz an D. Koller, Learning spatial context: Using stuff to fin things, in Proc. European Conf. Computer Vision, [5] A. G. Schwing an R. Urtasun, Fully connecte eep structure networks, 2015, [6] S. Zheng, S. Jayasumana, B. Romera-Parees, V. Vineet, Z. Su, D. Du, C. Huang, an P. Torr, Conitional ranom fiels as recurrent neural networks, in Proc. Int. Conf. Comp. Vis., [7] J. Dai, K. He, an J. Sun, BoxSup: Exploiting bouning boxes to supervise convolutional networks for semantic segmentation, in Proc. Int. Conf. Comp. Vis., [8] M. Mostajabi, P. Yaollahpour, an G. Shakhnarovich, Feeforwar semantic segmentation with zoom-out features, in Proc. IEEE Conf. Comp. Vis. Pattern Recogn., [9] C. A. Sutton an A. McCallum, Piecewise training for unirecte moels, in Proc. Conf. Uncertainty Artificial Intelli, 2005.

Exploring Context with Deep Structured models for Semantic Segmentation

APPEARING IN IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, APRIL 2017. 1 Exploring Context with Deep Structure moels for Semantic Segmentation Guosheng Lin, Chunhua Shen, Anton van en