EE-559 Deep learning Networks for semantic segmentation

EE-559 Deep learning 7.4. Networks for semantic segmentation François Fleuret Mon Feb 18 2019 ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

The historical approach to image segmentation was to define a measure of similarity between pixels, and to cluster groups of similar pixels. Such approaches account poorly for semantic content. The deep-learning approach re-casts semantic segmentation as pixel classification, and re-uses networks trained for image classification by making them fully convolutional.
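Treating segmentation as pixel classification means that every pixel gets its own vector of class scores, its own softmax cross-entropy term during training, and its own argmax at prediction time. The following is a minimal sketch of that idea in plain Python, assuming score maps stored as nested lists indexed `[class][row][column]`; the function names and the toy values are illustrative assumptions, not part of the lecture.

```python
import math

def pixelwise_cross_entropy(scores, target):
    """Mean softmax cross-entropy over all pixels.

    scores: nested list [C][H][W] of class scores (assumed layout)
    target: nested list [H][W] of class indices in range(C)
    """
    n_classes, height, width = len(scores), len(target), len(target[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            logits = [scores[c][y][x] for c in range(n_classes)]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in logits))
            total += log_z - logits[target[y][x]]
    return total / (height * width)

def predict_labels(scores):
    """Per-pixel argmax over the class dimension (ties go to the lowest index)."""
    n_classes, height, width = len(scores), len(scores[0]), len(scores[0][0])
    return [[max(range(n_classes), key=lambda c: scores[c][y][x])
             for x in range(width)] for y in range(height)]

# Toy example (hypothetical values): 2 classes, 2x2 image.
scores = [[[2.0, 0.0], [0.0, 0.0]],   # scores for class 0
          [[0.0, 3.0], [1.0, 0.0]]]   # scores for class 1
target = [[0, 1], [1, 0]]
labels = predict_labels(scores)
loss = pixelwise_cross_entropy(scores, target)
```

In a deep-learning framework the same computation is usually a single call to a cross-entropy loss applied to a tensor with a class dimension and two spatial dimensions.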

Shelhamer et al. (2016) use a pre-trained classification network (e.g. VGG 16 layers) from which the final fully connected layer is removed, and the other ones are converted to convolutional filters. They add a final convolutional layer with 21 output channels (VOC 20 classes + "background"). Since VGG16 has 5 max-pooling layers with 2 × 2 kernels, with proper padding, the output is 1/2^5 = 1/32 the size of the input. This map is then up-scaled with a de-convolution layer, whose kernel and stride are set to get a final map of the same size as the input image. Training is achieved with full images and pixel-wise cross-entropy, starting with a pre-trained VGG16. All layers are fine-tuned, although fixing the up-scaling de-convolution to bilinear does as well.

[Figure: FCN-32s architecture. 3d input; 64d at 1/2; 128d at 1/4; 256d at 1/8; 512d at 1/16; 512d at 1/32; two fc-conv/relu layers giving 4096d at 1/32; 21d score map at 1/32; up-scaling ×32 to the 21d output.]
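The resolution arithmetic above can be checked with a few lines of Python. This sketch assumes a 224 × 224 input, and a de-convolution with kernel 64, stride 32 and padding 16; these three values are an assumption chosen to restore the input size and are not stated in the text. It also builds the bilinear interpolation weights that a fixed up-scaling de-convolution could use, following a common construction.

```python
def pooled_size(size, n_pools=5):
    """Spatial size after n_pools 2x2 max-poolings with stride 2."""
    for _ in range(n_pools):
        size //= 2
    return size

def transposed_conv_size(size, kernel, stride, padding):
    """Output size of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def bilinear_kernel_1d(kernel):
    """1D bilinear interpolation weights for a de-convolution kernel."""
    factor = (kernel + 1) // 2
    center = factor - 1 if kernel % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(kernel)]

def bilinear_kernel_2d(kernel):
    """2D bilinear kernel as the outer product of the 1D weights."""
    w = bilinear_kernel_1d(kernel)
    return [[a * b for b in w] for a in w]

input_size = 224                          # hypothetical input size
coarse = pooled_size(input_size)          # 7, i.e. 224 / 32
# Assumed de-convolution hyper-parameters (not given in the text).
restored = transposed_conv_size(coarse, kernel=64, stride=32, padding=16)  # 224
weights = bilinear_kernel_1d(4)           # weights for a x2 up-scaling
```

Initializing (or freezing) the de-convolution filters with `bilinear_kernel_2d` makes the layer perform plain bilinear interpolation, which is the fixed alternative mentioned above.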

Although this Fully Convolutional Network (FCN) achieved almost state-of-the-art results when published, its main weakness is the coarseness of the signal from which the final output is produced (1/32 of the original resolution). Shelhamer et al. proposed an additional element, that consists of using the same prediction/up-scaling from intermediate layers of the VGG network.

[Figure: FCN-8s architecture. Same VGG16 trunk as above; the 21d score map at 1/32 is up-scaled ×2 and added to a 21d score map predicted at 1/16; the sum is up-scaled ×2 and added to a 21d score map predicted at 1/8; the result is up-scaled ×8 to the 21d output.]
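The fusion of predictions made at several resolutions can be sketched as follows. This is only an illustration under simplifying assumptions: a single score channel instead of 21, nearest-neighbour up-scaling as a stand-in for the learned de-convolutions, and hypothetical function names and toy values.

```python
def upsample(score_map, factor):
    """Nearest-neighbour up-scaling of a 2D map (stand-in for a de-convolution)."""
    height, width = len(score_map), len(score_map[0])
    return [[score_map[y // factor][x // factor]
             for x in range(width * factor)]
            for y in range(height * factor)]

def add(a, b):
    """Element-wise sum of two 2D maps of the same size."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def fuse_multiscale(s32, s16, s8):
    """Combine score maps predicted at 1/32, 1/16 and 1/8 of the input size."""
    fused16 = add(upsample(s32, 2), s16)   # 1/32 -> 1/16, then add
    fused8 = add(upsample(fused16, 2), s8) # 1/16 -> 1/8, then add
    return upsample(fused8, 8)             # 1/8 -> full resolution

# Toy score maps for a hypothetical 32x32 input.
s32 = [[1.0]]
s16 = [[0.0, 1.0], [2.0, 3.0]]
s8 = [[float(4 * y + x) for x in range(4)] for y in range(4)]
out = fuse_multiscale(s32, s16, s8)
```

The coarse map contributes the same value to every output pixel it covers, while the finer maps add spatial detail, which is the motivation for predicting from intermediate layers.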

[Figure (Fig. 6 of Shelhamer et al., 2016); columns: FCN-8s, SDS [4], Ground Truth, Image.] Fully convolutional networks improve performance on PASCAL. The left column shows the output of our most accurate net, FCN-8s. The second shows the output of the previous best method by Hariharan et al. [4]. Notice the fine structures recovered (first row), the ability to separate closely interacting objects (second row), and the robustness to occluders (third row). The fifth and sixth rows show failure cases: the net sees lifejackets in a boat as people and confuses human hair with a dog.

Left column is the best network from Shelhamer et al. (2016).

TABLE 8. The role of foreground, background, and shape cues. [...] mean intersection over union metric excluding background. [...] architecture and optimization are fixed to those of the FCN [...] and only input masking differs. Columns: train (FG, BG), test (FG, BG). Rows: Reference, Reference-FG, Reference-BG, FG-only, BG-only, Shape.

6 ANALYSIS

We examine the learning and inference of fully convolutional networks. Masking experiments investigate the role of context and shape by reducing the input to only foreground, only background, or shape alone. Defining a "null" background model checks the necessity of learning a background classifier for semantic segmentation. We detail an approximation between momentum and batch size to further tune whole image learning. Finally, we measure bounds on task accuracy for given output resolutions to show there is still much to improve.

6.1 Cues

Given the large receptive field size of an FCN, it is natural to wonder about the relative importance of foreground and background pixels in the prediction. Is foreground appearance sufficient for inference, or does the context influence the output? Conversely, can a network learn to recognize a class by its shape and context alone?

Masking. To explore these issues we experiment with masked versions of the standard PASCAL VOC segmentation challenge. We both mask input to networks trained on normal PASCAL, and learn new networks on the masked PASCAL. See Table 8 for masked results.

Masking the foreground at inference time [...]. However, masking the foreground during learning [...] a network capable of recognizing object segments [...] observing a single pixel of the labeled class. [...] background has little effect overall but does [...] confusion in certain cases. When the background [...] during both learning and inference, the network [...] achieves nearly perfect background accuracy [...] certain classes are more confused. All in all [...] that FCNs do incorporate context even though [...] driven by foreground pixels.

To separate the contribution of shape, we [...] restricted to the simple input of foreground [...]. The accuracy in this shape-only condition [...] than when only the foreground is masked, [...] net is capable of learning context to boost [...]. Nonetheless, it is surprisingly accurate. See Fig. 7.

Background modeling. It is standard in semantic segmentation to have a background [...] model usually takes the same form as the models [...] classes of interest, but is supervised by negative [...]. In our experiments we have followed the same [...] learning parameters to score all classes including background. Is this actually necessary, or do class models [...]? To investigate, we define a net with a "null" [...] model that gives a constant score of zero. [...] with the softmax loss, which induces competition [...] normalizing across classes, we train with [...] entropy loss, which independently normalizes [...]. For inference each pixel is assigned the highest [...]. In all other respects the experiment is identical [...] 32s on PASCAL VOC. The null background [...] point lower than the reference FCN-32s and [...] 32s trained on all classes including background [...] sigmoid cross-entropy loss. To put this drop [...], we note that discarding the background model [...] reduces the total number of parameters by [...]. Nonetheless, this result suggests that learning [...] background model for semantic segmentation [...].

6.2 Momentum and batch size

In comparing optimization schemes for FCNs [...] heavy online learning with high momentum [...] accurate models in less wall clock time [...]. Here we detail a relationship between momentum [...] size that motivates heavy learning. By writing the updates computed by gradient accumulation as a non-recursive sum, we will see that momentum [...]. In practice, we find that online learning [...] and yields better FCN models in less wall clock [...].

6.3 Upper bounds on IU

FCNs achieve good performance on [...] segmentation metric even with spatially coarse [...] prediction. To better understand this metric and [...] this approach with respect to it, we compute [...] upper bounds on performance with prediction [...] resolutions. We do this by downsampling [...] images and then upsampling to simulate [...] results obtainable with a particular downsampling [...]. The following table gives the mean IU on [...] PASCAL 2011 val for various downsampling [...] (columns: factor, mean IU). Pixel-perfect prediction is clearly not [...] achieve mean IU well above state-of-the-art [...] conversely, mean IU is not a good measure of [...] accuracy. The gaps between oracle and state-of-the-art [...] at every stride suggest that recognition and [...] bottleneck for this metric.

[Figure (Fig. 7 of Shelhamer et al., 2016); columns: Image, Ground Truth, Output, Input.] FCNs learn to recognize by shape when deprived of other input detail. From left to right: regular image (not seen by the network), ground truth, output, mask input.

Results with a network trained from mask only (Shelhamer et al., 2016).
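The mean IU metric and the idea of an upper bound at a given output resolution can be illustrated with a small sketch. It assumes label maps stored as nested lists, skips classes absent from both maps, and simulates a coarse "oracle" by keeping one ground-truth pixel per block and replicating it; the exact procedure used in the paper is not recoverable from the text above, so this is only one plausible reading.

```python
def mean_iu(pred, truth, n_classes):
    """Mean intersection over union, averaged over classes present in pred or truth."""
    ious = []
    for c in range(n_classes):
        inter = union = 0
        for row_p, row_t in zip(pred, truth):
            for p, t in zip(row_p, row_t):
                inter += (p == c and t == c)
                union += (p == c or t == c)
        if union > 0:  # assumption: ignore classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)

def down_up(truth, factor):
    """Keep the top-left label of each factor x factor block, then replicate it."""
    height, width = len(truth), len(truth[0])
    return [[truth[(y // factor) * factor][(x // factor) * factor]
             for x in range(width)] for y in range(height)]

# Toy ground truth (hypothetical): 8x8 map, class 1 on columns 3..7.
truth = [[1 if x >= 3 else 0 for x in range(8)] for _ in range(8)]
perfect = mean_iu(truth, truth, 2)               # 1.0
bound_2 = mean_iu(down_up(truth, 2), truth, 2)   # oracle at half resolution
```

Even with perfect recognition, the coarse oracle loses accuracy only along object boundaries, which is why mean IU can remain high at low output resolutions.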

It is noteworthy that for detection and semantic segmentation, there is a heavy re-use of large networks trained for classification. The models themselves, as much as the source code of the algorithm that produced them, or the training data, are generic and re-usable assets.

References

E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/ , 2016.
