Overall Description. Goal: to improve spatial invariance to the input data. Translation, Rotation, Scale, Clutter, Elastic


Sharlene Sharp
5 years ago

1 Philippe Giguère


2 Overall Description
Goal: improve spatial invariance to the input data
- Translation, rotation, scale, clutter, elastic deformation
How: add a learnable module that explicitly manipulates the data spatially
- Fully differentiable
- Can be inserted into an existing architecture
- No knowledge of the ground-truth transformation is given
- Obtains state-of-the-art results
- Prediction of the transform

3 Benefits to multifarious tasks (many and of various types)
- Image classification (with significant distortions)
- Spatial attention
  - Focus on smaller, lower-resolution inputs (increases computational efficiency) (PG: maybe less overfitting?)
- Co-localisation (when multiple instances of the same object are present)

4 Spatial transformer (ST)
- Can be dropped into any architecture
- Manipulates the feature maps (data), not the filters
  - Warping is applied to all the channels
- Can be used at multiple depths or in parallel
- Applies a spatial transformation in a single forward pass (compared to [34], which makes multiple passes through the network)
- Fully differentiable: end-to-end training with backprop

[34] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS,

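5 Spatial transformer
- Localization network: CNN or fully-connected network (FCN), with a final regression layer
- It outputs the spatial transformation parameters θ, e.g. θ is 6-dimensional for an affine transform
- Thus, the transformation θ is conditional on the input
- θ can be forwarded to the rest of the network (if wanted, as it contains pose information)
[Figure: input feature map U (H × W × C) and output feature map (H' × W' × C)]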

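6 Architecture detail (PyTorch)

    import torch.nn as nn

    # These lines belong in the model's __init__.
    # The 10 * 3 * 3 flatten size corresponds to a 1 x 28 x 28 input
    # (28 -> 22 -> 11 -> 7 -> 3 through the conv and pooling layers).

    # Spatial transformer localization network
    self.localization = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=7),
        nn.MaxPool2d(2, stride=2),
        nn.ReLU(True),
        nn.Conv2d(8, 10, kernel_size=5),
        nn.MaxPool2d(2, stride=2),
        nn.ReLU(True),
    )

    # Regressor for the 3 * 2 affine matrix
    self.fc_loc = nn.Sequential(
        nn.Linear(10 * 3 * 3, 32),
        nn.ReLU(True),
        nn.Linear(32, 3 * 2),
    )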
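To show how the localization network and the regressor might fit together, here is a minimal sketch of a complete module. It follows the common PyTorch pattern of `F.affine_grid` followed by `F.grid_sample`. The class name `STN`, the identity initialization of the last layer and the `align_corners=False` setting are assumptions made for illustration and are not taken from the slide; the input is assumed to be 1 x 28 x 28.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class STN(nn.Module):
    """Hypothetical wrapper around the two blocks shown on the slide."""

    def __init__(self):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32),
            nn.ReLU(True),
            nn.Linear(32, 3 * 2),
        )
        # Assumption: start from the identity transform so that training
        # begins with an unwarped input.
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float)
        )

    def forward(self, x):
        features = self.localization(x)            # (N, 10, 3, 3) for a 28 x 28 input
        theta = self.fc_loc(features.view(-1, 10 * 3 * 3))
        theta = theta.view(-1, 2, 3)               # one 2 x 3 affine matrix per sample
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        warped = F.grid_sample(x, grid, align_corners=False)
        return warped, theta


stn = STN()
x = torch.rand(4, 1, 28, 28)
warped, theta = stn(x)
```

With the identity initialization, the warped output should match the input up to floating-point error. Returning `theta` alongside the warped map is one way to forward the pose information to the rest of the network.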

7 Spatial transformer
- Affine: the 6 parameters θ_ij allow cropping, translation, rotation, scale and skew
  - Examples: identity, rotation
- Or a pure attention model (cropping via scaling s + translation)
- Similar to texture mapping
- In the end, any transformation can be used, as long as it is differentiable
[Figure: input feature map U (H × W × C) and output feature map (H' × W' × C)]
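The affine case can be written out explicitly. The sketch below assumes the usual convention in which each target (output) grid point $(x_i^t, y_i^t)$ is mapped to a source (input) sampling location $(x_i^s, y_i^s)$, with coordinates normalized to $[-1, 1]$. The rotation angle $\alpha$ and the translation symbols $t_x, t_y$ are names introduced here for illustration.

```latex
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
=
\begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}

% Special cases of the 2 x 3 matrix:
\text{identity: }
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\qquad
\text{rotation by } \alpha\text{: }
\begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \end{bmatrix}
\qquad
\text{attention: }
\begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}
```

In the attention form only three numbers are free: the isotropic scale $s$ and the translation, which is why it amounts to cropping.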

8 Closer view of sampling
- Sampling kernel (regular, fixed grid)
- Bilinear kernel
- Fully (sub-)differentiable with respect to U, x_i^s and y_i^s
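As a rough illustration of the bilinear kernel, the following pure-Python sketch samples a single-channel feature map at a real-valued location. Each pixel at column m and row n contributes with weight max(0, 1 - |x - m|) * max(0, 1 - |y - n|). The function name `bilinear_sample`, the pixel-unit coordinates and the zero padding outside the map are assumptions made for this example.

```python
import math


def bilinear_sample(U, x, y):
    """Sample a 2-D feature map U (a list of rows) at the real-valued pixel
    location (x, y), treating everything outside the map as zero."""
    height, width = len(U), len(U[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    # Only the four pixels around (x, y) can have a non-zero bilinear weight.
    for n in (y0, y0 + 1):
        for m in (x0, x0 + 1):
            if 0 <= n < height and 0 <= m < width:
                weight = max(0.0, 1 - abs(x - m)) * max(0.0, 1 - abs(y - n))
                value += U[n][m] * weight
    return value


U = [[0.0, 10.0],
     [20.0, 30.0]]
center = bilinear_sample(U, 0.5, 0.5)   # midpoint of the four pixels
corner = bilinear_sample(U, 1.0, 0.0)   # exactly on the top-right pixel
```

Because the weights are piecewise linear in x and y, the output has (sub-)gradients with respect to both the sampled values and the sampling coordinates, which is what lets the localization network be trained by backprop.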

9 Spatial transform sidenotes
- Can under- or oversample the features
  - Watch out for aliasing
- Can have several spatial transforms in a CNN, at different depths
  - Gives increasingly arbitrary transforms
- Or n in parallel
  - To focus on exactly n objects/parts in a picture

10 Example

11 Experiments
- Distorted MNIST
- Street View House Numbers
- Bird classification dataset (CUB)
  - 2 parallel ST
  - 4 parallel ST

12 MNIST
Distortions:
- Rotation (R)
- Rotation, scale and translation (RTS)
- Elastic warping (E) (which cannot always be inverted)
Networks:
- Baseline: CNN, fully-connected (FCN)
- New: ST-CNN, ST-FCN
(Standard training: backprop, SGD, scheduled learning rate, multinomial cross-entropy)

13 MNIST
- Two layers of max-pooling (for some spatial invariance)
- The thin plate spline (TPS) transform is best! (does not seem to overfit on R)
[Table: error rate (%) by type of ST (spatial transform). Figure columns: input, predicted transform, ST output]

14 MNIST: 60x60 images
- Large translation + rotation + clutter
[Table: error (%) for FCN, CNN, ST-FCN and ST-CNN]

15 Street View House Numbers
- SVHN: 200k images, 1 to 5 digits to recognize
- Large variability in scale/spatial arrangement
- Baseline: 11-layer CNN with 5 softmaxes (1 per digit, with NULL)
- Single ST: θ, ST, then the 11-layer CNN; localization network (LN): 4-layer CNN
- Multi ST: ST, conv, ST, conv, ST, conv, ST, conv; LN: 2-layer FCN with 32 hidden units
- All trained with SGD + dropout

16 SVHN
- Model averaging + Monte Carlo averaging (baseline)
- Single pass
- ST-CNN is only 6% slower than CNN

17 Bird data set: CUB
- Fine-grained classification: 200 species (6k training images, 5.8k testing images)
- Multiple STs in parallel (more details later)
- Only the image class label is used for training
- Baseline: Inception + batch normalization, pre-trained on ImageNet, fine-tuned on CUB
  - Achieves state-of-the-art (82.3%)

18 Bird data set: CUB
- Two networks:
  - 2 parallel spatial transformers
  - 4 parallel spatial transformers
- The transform uses the attention subset of parameters, with s fixed (0.5)
  - i.e. search for square bounding boxes of ½ size
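To see why a fixed scale of 0.5 corresponds to square boxes of half size, the sketch below builds the attention-only affine matrix and maps the corners of the output grid back to the input. The helper names `attention_theta` and `crop_box` are hypothetical, and coordinates are assumed to be normalized to [-1, 1], so the full image has side length 2.

```python
def attention_theta(tx, ty, s=0.5):
    """2 x 3 affine matrix for pure attention: isotropic scale s plus translation."""
    return [[s, 0.0, tx],
            [0.0, s, ty]]


def crop_box(theta):
    """Map the target-grid corners (-1, -1) and (1, 1) to source coordinates,
    giving (x_min, y_min, x_max, y_max) of the attended window."""
    def apply(x, y):
        return (theta[0][0] * x + theta[0][1] * y + theta[0][2],
                theta[1][0] * x + theta[1][1] * y + theta[1][2])

    x_min, y_min = apply(-1.0, -1.0)
    x_max, y_max = apply(1.0, 1.0)
    return (x_min, y_min, x_max, y_max)


box = crop_box(attention_theta(tx=0.25, ty=-0.5))
side = box[2] - box[0]   # side length of the window in normalized units
```

With s fixed, the localization network only has to regress the two translation values per transformer, and every window has side length 1 in normalized units, i.e. half the image side.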

19 Architecture for 2x parallel
- Shared for all transforms θ_i
- Scale is fixed to 50%
- (Details in the arXiv version of the paper)
[Figure: beheaded Inception, 1x1, fc, fc, Softmax]

20 Bird data set: CUB-200
- TP2 (baseline)
- Specialization for free!

21 Conclusion
- New self-contained module for neural networks
- Can be dropped into a network
- Performs explicit spatial transformations of features
- Learned end-to-end, no change in loss function
- Provides extra information (θ)
- Early experiments show it works well for recurrent networks
- Over 600 citations
