Paper Motivation. Fixed geometric structures of CNN models. CNNs are inherently limited in modeling geometric transformations

Evelyn Burke
5 years ago

2 Paper Motivation: Fixed geometric structures of CNN models. CNNs are inherently limited in modeling geometric transformations. Higher-level features combine lower-level features at fixed positions as a weighted sum. Pooling selects the dominant features or averages features at fixed positions.

3 Invariance to Geometric Transformations: Invariance is either learned from data augmentation or obtained by using transformation-invariant features and algorithms. Unknown or complex geometric transformations are neither learned nor modeled.

4 Standard Convolution and RoI Pooling: Convolution samples the feature map at fixed locations. RoI pooling reduces the spatial resolution at a fixed ratio. The higher the layer, the less desirable this behaviour.

5 Deformable Convolution: Adds a 2D offset to each sampling location of the regular grid. Allows free-form deformation of the sampling grid.

6 Deformable Convolution: Offsets are learned from the preceding feature maps via additional convolutional layers.
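
The two sampling rules can be written side by side. This is a sketch in notation assumed here rather than taken from the slides: R is the regular grid (for a 3x3 kernel, the nine positions from (-1,-1) to (1,1)), w the filter weights, x the input feature map, p_0 an output location, and \Delta p_n the learned offset for grid position p_n.

```latex
% Standard convolution: sampling on the fixed grid R
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)

% Deformable convolution: each grid position is shifted by a learned offset
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n)
```

Setting every \Delta p_n to zero recovers the standard convolution.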

7 Deformable RoI Pooling: Adds a 2D offset to each bin position in the regular bin partition. Enables adaptive part localization for objects with different shapes.

8 Deformable RoI Pooling: Offsets are learned from the preceding feature maps via an additional RoI pooling and a fully connected layer.
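
The same idea applies to (average) RoI pooling. A sketch, with symbols assumed here for illustration: p_0 is the top-left corner of the RoI, bin(i,j) the set of pixel positions in bin (i,j), n_{ij} the number of pixels in that bin, and \Delta p_{ij} the offset shared by all positions of the bin.

```latex
% Standard RoI pooling: average over the fixed bin
y(i,j) = \sum_{p \in \mathrm{bin}(i,j)} \frac{x(p_0 + p)}{n_{ij}}

% Deformable RoI pooling: the whole bin is shifted by one offset
y(i,j) = \sum_{p \in \mathrm{bin}(i,j)} \frac{x(p_0 + p + \Delta p_{ij})}{n_{ij}}
```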

9 Deformable Position-Sensitive RoI Pooling: Differs by having a different set of feature maps for each bin position.

10 Deformable Convolution and RoI Pooling Summary: At inference, offsets depend on the input features. During learning, offsets are learned from data. The filters are differentiable.

11 Method Details: Offsets are fractional, so bilinear interpolation is used. For (PS) RoI pooling, normalized offsets must be used. The number of additional parameters is given separately for convolution and RoI pooling, and for PS RoI pooling. The learning rate for offsets can differ from that of the other layers.
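
A minimal pure-Python sketch of these details follows. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, coordinates are in (y, x) order, positions outside the feature map are assumed to contribute zero, and the RoI offset scaling uses the scalar gamma = 0.1 mentioned in the formal objections.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional location (y, x).

    Assumption: positions outside the map contribute zero.
    """
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                # Bilinear kernel: product of two 1D triangle weights.
                weight = max(0.0, 1 - abs(y - qy)) * max(0.0, 1 - abs(x - qx))
                value += weight * feature[qy][qx]
    return value

# Regular 3x3 sampling grid in (dy, dx) order, row-major.
GRID_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deformable_conv_at(feature, weights, offsets, center):
    """Deformable 3x3 convolution response at one output location.

    weights: 9 filter weights, offsets: 9 (dy, dx) fractional offsets,
    both in the same row-major order as GRID_3X3.
    """
    cy, cx = center
    total = 0.0
    for w, (dy, dx), (oy, ox) in zip(weights, GRID_3X3, offsets):
        total += w * bilinear_sample(feature, cy + dy + oy, cx + dx + ox)
    return total

def scale_roi_offsets(normalized_offsets, roi_height, roi_width, gamma=0.1):
    """Turn normalized (dy, dx) offsets into feature-map pixels for one RoI."""
    return [(gamma * ny * roi_height, gamma * nx * roi_width)
            for ny, nx in normalized_offsets]
```

With all offsets set to zero, `deformable_conv_at` reduces to an ordinary 3x3 convolution at that location. Normalizing the RoI offsets by the RoI size makes the learned values independent of how large the RoI is.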

12 PS RoI Offsets Examples: One 3x3 deformable PS RoI pooling layer. Input: a bounding box with a label.

13 PS RoI Offsets Examples

14 Conv Offsets Examples: Three consecutive 3x3 deformable convolutional layers give 9^3 = 729 sampling points.

15 Conv Example, Man and a Goat: Blue dots are standard convolution sample locations; red dots are deformable convolution sample locations. Shown for 1, 2 and 3 consecutive layers.

16 Conv Example, Man and a Goat: Center of the convolution placed on the man, the sky and the grass. Shown for 3 consecutive layers.

17 Conv Example, Man and a Goat: The magnitude of offsets. Shown for the 3 consecutive layers res5a, res5b and res5c.

18 Conv Example, Man and a Goat: The anisotropic scale as an HSV visualization (red: horizontal, green: vertical). Shown for 3 consecutive layers.

19 Conv Example, Man and a Goat: HSV visualization of the offsets. Shown for 3 layers.

20 Conv Example, Cars: The magnitude of offsets. Shown for 3 consecutive layers. The foreground-background separation can be seen.

21 Affine Transformation Approximation: The unknown and complex transformation was approximated by an affine transformation. Values are given as MEAN (STD); where two values are listed, the first is the vertical axis. The unit is pixels in the feature map.

                     Man and a Goat         Cars
Mean squared error   3.1 (1.5)              2.7 (1.4)
Scale                3.4, 3.7 (0.8, 1.1)    2.9, 3.6 (1.0, 1.1)
Translation          0.8, 0.0 (1.3, 0.2)    0.3, 0.0 (1.2, 0.1)
Rotation             -0.1 (0.0)             -0.1 (0.0)
Shear                0.0 (0.0)              0.0 (0.0)

Other tested images had similar results.
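
The slides do not say how the affine approximation was computed. One plausible way, assumed here purely for illustration, is an ordinary least-squares fit that maps the regular grid positions to the deformed sampling positions and reports the remaining mean squared error. The function names below are hypothetical, points are in (y, x) order, and the decomposition into scale, rotation and shear is not attempted.

```python
def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - tail) / m[r][r]
    return x

def fit_affine(source, target):
    """Least-squares affine fit: target is approximated by A * source + t.

    source, target: equally long lists of 2D points (y, x).
    Returns (rows, mse), where rows[i] = [a_i0, a_i1, t_i] and mse is the
    mean over points of the squared residual length.
    """
    design = [[y, x, 1.0] for y, x in source]
    # Normal equations: (D^T D) params = D^T target_coordinate
    gram = [[sum(d[i] * d[j] for d in design) for j in range(3)]
            for i in range(3)]
    rows = []
    for axis in range(2):
        rhs = [sum(d[i] * p[axis] for d, p in zip(design, target))
               for i in range(3)]
        rows.append(solve_3x3(gram, rhs))
    squared_error = 0.0
    for d, p in zip(design, target):
        for axis in range(2):
            predicted = sum(c * v for c, v in zip(rows[axis], d))
            squared_error += (predicted - p[axis]) ** 2
    return rows, squared_error / len(source)
```

Here `source` would be the nine positions of the regular 3x3 grid and `target` the same positions with the learned offsets added. A large residual would indicate that the learned deformation is poorly described by an affine map.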

22 Statistics of Learned Scale: Effective dilation is the mean of the distances between all adjacent pairs of sampling locations in the deformable convolution filter.
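
The definition above can be sketched in a few lines. The sketch assumes that "adjacent" means horizontal and vertical neighbours in the kernel layout (12 pairs for a 3x3 kernel); the slides do not state whether diagonal pairs are included. The function name is hypothetical.

```python
import math

def effective_dilation(sample_points, kernel_size=3):
    """Mean distance between adjacent sampling locations of one filter.

    sample_points: row-major list of kernel_size**2 deformed (y, x) locations.
    Assumption: adjacency is horizontal and vertical only.
    """
    k = kernel_size
    distances = []
    for row in range(k):
        for col in range(k):
            y, x = sample_points[row * k + col]
            if col + 1 < k:  # right neighbour
                ny, nx = sample_points[row * k + col + 1]
                distances.append(math.hypot(ny - y, nx - x))
            if row + 1 < k:  # neighbour below
                ny, nx = sample_points[(row + 1) * k + col]
                distances.append(math.hypot(ny - y, nx - x))
    return sum(distances) / len(distances)
```

For an undeformed grid this measure equals the dilation of the filter, which is why it can be read as a learned, per-location dilation.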

23 Remarks: The shift is a function of the feature maps and is not constrained by any (e.g. affine) transformation. Surprisingly, there is no need for shift regularization.

24 Relation to Deformable Part Models: DPMs maximize the similarity of parts while minimizing the inter-part connection cost. Inference can be converted to a CNN, but learning is not end-to-end. Deformable convolutions: no spatial relations between parts, unlimited in modeling deformations.

25 Relation to Spatial Transformer Networks: 1. Localization net (input: feature map; output: affine transformation). 2. Grid generator (generates a sampling grid according to the transformation). 3. Sampler.

26 Relation to Spatial Transformer Networks: A spatial transformer can be inserted between any two layers. Deformable convolutions: no global parametric transformation, easier training.

27 Relation to Atrous / Dilated Convolutions: Dilated convolutions give an exponential expansion of the receptive field. Deformable convolutions act as an input-dependent and learnable dilated convolution. Both can replace filters with a larger receptive field while constraining their connectivity.
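
One way to see the relation: a dilated convolution is what a deformable convolution produces when its offsets are fixed and identical at every location. A small sketch with hypothetical helper names, in (dy, dx) order:

```python
def regular_grid(kernel_size=3):
    """Row-major (dy, dx) positions of an undilated square kernel."""
    r = kernel_size // 2
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def dilation_offsets(dilation, kernel_size=3):
    """Fixed offsets that turn the regular grid into a dilated one.

    grid position + offset = dilation * grid position
    """
    return [((dilation - 1) * dy, (dilation - 1) * dx)
            for dy, dx in regular_grid(kernel_size)]
```

A deformable convolution replaces these fixed values with offsets predicted from the input, so the spacing can vary per location and per grid position.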

28 Relation to Active Convolution: Active convolution learns the shape of the convolution during training. Deformable convolutions: input-dependent offsets.

29 Relation to Dynamic Filter Network: Weights for the convolution are generated from the input feature map. Deformable convolutions: the same, but for offsets.

30 Their Task: Semantic segmentation and object detection.

31 Their Setup: State-of-the-art object detection and semantic segmentation CNNs consist of two parts. 1. A deep network generates feature maps: replace its last 3 conv layers with deformable ones. 2. A shallow task-specific network generates results: replace (PS) RoI pooling with its deformable version. Convolutions and offsets are learned simultaneously.

32 Results: Object detection: VOC 07: 82.3 vs , COCO: 56.8 vs . Semantic segmentation: Cityscapes: 75.2 vs mIoU, VOC 12: 75.9 vs mIoU. Other results: COCO (with Soft-NMS):

33 Paper Evaluation - Formal Objections: Page 2, formula (2): notation is misleading since depends on. Page 3, paragraph 3: a scalar gamma further scales the normalized offsets and is empirically set to 0.1. Page 5, figure 4: the figure is misleading, since the output feature map has depth (C+1).

34 Paper Evaluation - Subjective Objections: Page 3, paragraphs 1 and 2: the notation is ambiguous. An application to max pooling is missing.

35 References
Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems.
Jeon, Yunho, and Junmo Kim. "Active Convolution: Learning the Shape of Convolution for Image Classification." arXiv preprint arXiv: (2017).
Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv: (2015).
Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.9 (2010):
De Brabandere, Bert, et al. "Dynamic filter networks." Neural Information Processing Systems (NIPS).
