Philippe Giguère
Overall description
Goal: improve spatial invariance to the input data
- Translation, rotation, scale, clutter, elastic deformation
How: add a learnable module which explicitly manipulates the data spatially
- Fully differentiable
- Can be inserted into existing architectures
- No knowledge of the ground-truth transformation is given
- Obtains state-of-the-art results
- Predicts the transform
Benefits to multifarious tasks
- Image classification (with significant distortions)
- Spatial attention
  - Many, and of various types
  - Focus on smaller, lower-resolution inputs (increases computational efficiency) (PG: maybe less overfitting?)
- Co-localisation (when multiple instances of the same object are present)
Spatial transformer (ST)
- Can be dropped into any architecture
- Manipulates the feature maps (data), not the filters
- Warping applied to all the channels
- Can be used at multiple depths, or in parallel
- Applies a spatial transformation in a single forward pass (compared to [34], which does multiple passes through the net)
- Fully differentiable: end-to-end training with backprop

[34] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.
Spatial transformer
- Localization network: CNN or fully-connected (FCN), with a final regression layer
  - Thus, the transformation θ is conditional on the input
- Outputs the spatial transformation parameters, e.g. θ is 6-dim for an affine transform
- θ can also be forwarded to the rest of the network (if wanted, as it contains pose information)

[Figure: the ST maps an input feature map U (H x W x C) to an output feature map V (H' x W' x C)]
Architecture detail (PyTorch)

# Spatial transformer localization network
self.localization = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=7),
    nn.MaxPool2d(2, stride=2),
    nn.ReLU(True),
    nn.Conv2d(8, 10, kernel_size=5),
    nn.MaxPool2d(2, stride=2),
    nn.ReLU(True)
)

# Regressor for the 3 * 2 affine matrix
self.fc_loc = nn.Sequential(
    nn.Linear(10 * 3 * 3, 32),
    nn.ReLU(True),
    nn.Linear(32, 3 * 2)
)

http://pytorch.org/tutorials/intermediate/spatial_transformer_tutorial.html
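The same tutorial then applies the regressed θ with PyTorch's grid generator and sampler; a condensed version of those lines (they live inside the same Net class), including the identity initialization of the final regression layer, which makes the ST start as a no-op:

import torch
import torch.nn.functional as F

# Start the ST at the identity transform: zero weights, identity bias
# in the final regression layer (as in the tutorial above)
self.fc_loc[2].weight.data.zero_()
self.fc_loc[2].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

def stn(self, x):
    xs = self.localization(x)
    xs = xs.view(-1, 10 * 3 * 3)            # assumes 28x28 MNIST inputs
    theta = self.fc_loc(xs)                 # regress the 6 affine parameters
    theta = theta.view(-1, 2, 3)
    grid = F.affine_grid(theta, x.size())   # grid generator
    return F.grid_sample(x, grid)           # (bilinear) sampler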
Spatial transformer
- 6 parameters θ_ij allow cropping, translation, rotation, scale and skew
- Special cases: identity, pure rotation, or a pure attention model (cropping via a scaling s + translation)
- Similar to texture mapping
- In the end, any transformation works, as long as it is differentiable
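Concretely, the paper writes the pointwise affine case as a map from output (target) grid coordinates to input (source) coordinates:

\[
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix},
\qquad
A_\theta^{\text{attention}} = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}
\]

where the attention special case varies only the scale s and the translation (t_x, t_y).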
Closer view of sampling
- Sampling kernel, applied over a regular, fixed grid
- Bilinear kernel
- Fully (sub-)differentiable w.r.t. U, x_i^s and y_i^s
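With the bilinear kernel, each output pixel is a weighted sum of the four nearest input pixels (the paper's sampling equation):

\[
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,\max(0,\, 1 - |x_i^s - m|)\,\max(0,\, 1 - |y_i^s - n|)
\]

The partial derivatives w.r.t. U_{nm}^c and w.r.t. x_i^s, y_i^s are piecewise defined (hence "sub-differentiable"), which is what allows backprop through the sampler.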
Spatial transformer side notes
- Can do under/oversampling of features (watch out for aliasing)
- Can have several spatial transformers in a CNN, at different depths (deeper ones can model increasingly arbitrary transforms), as sketched below
- Or n in parallel, to focus on exactly n objects/parts in a picture
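A minimal sketch of both placements; the STN module below is my own simplified version (channel sizes, layer choices and the AdaptiveAvgPool2d trick are illustrative, not from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal affine spatial transformer (illustrative sizes)."""
    def __init__(self, in_ch):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(3))             # 8 x 3 x 3 features, any input size
        self.fc = nn.Linear(8 * 3 * 3, 6)
        self.fc.weight.data.zero_()              # start at the identity transform
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size())
        return F.grid_sample(x, grid)

# 1) STs at different depths: each one warps increasingly abstract features
deep = nn.Sequential(
    STN(1), nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(True),
    STN(16), nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(True))

# 2) n STs in parallel: each can lock onto a different object/part
class ParallelSTs(nn.Module):
    def __init__(self, in_ch, n):
        super().__init__()
        self.sts = nn.ModuleList([STN(in_ch) for _ in range(n)])
    def forward(self, x):
        return torch.cat([st(x) for st in self.sts], dim=1)  # stack crops channel-wise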
Example
[Figure omitted]
Experiments
- Distorted MNIST
- Street View House Numbers
- Bird classification dataset CUB-200-2011 (with 2 and 4 parallel STs)
Distorted MNIST
Distortions:
- Rotation (R)
- Rotation, scale and translation (RTS)
- Elastic warping (E) (which cannot always be inverted)
Networks:
- Baselines: CNN, Fully-Connected (FCN)
- New: ST-CNN, ST-FCN
(standard training: backprop, SGD, scheduled learning rate, multinomial cross-entropy; see the sketch below)
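A hedged sketch of that recipe in PyTorch (the two-layer model is a stand-in for the ST-CNN/ST-FCN, and the synthetic batch stands in for a distorted-MNIST loader):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(60 * 60, 10))  # stand-in for ST-CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()          # multinomial cross-entropy

for epoch in range(3):                     # illustrative; real training runs longer
    x = torch.randn(32, 1, 60, 60)         # placeholder for a distorted-MNIST batch
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()                        # backprop flows through ST modules unchanged
    optimizer.step()
    scheduler.step()                       # scheduled learning rate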
MNIST
- Baseline CNN has two layers of maxpool (for some spatial invariance)
- Several types of ST (spatial transform) compared; the thin plate spline (TPS) transform is best! (does not seem to overfit on R)

[Table: error rate (%) per type of ST; figure: input, predicted transform, ST output]
60x60 MNIST images: large translation + rotation + clutter

Network   Error (%)
FCN       13.2
CNN        3.5
ST-FCN     2.0
ST-CNN     1.7
Street View House Numbers
- SVHN: 200k images, 1 to 5 digits to recognize
- Large variability in scale / spatial arrangement
- Localization network (LN): 4-layer CNN, regressing θ for the ST; 11-layer CNN for classification
- Output: 5 softmaxes (one per digit, with a NULL class); see the sketch after this list
- Baseline LN: 2-layer FCN with 32 hidden units
- Single ST, or Multi ST: STs interleaved with conv layers (ST-conv-ST-conv-ST-conv-ST-conv)
- All trained with SGD + dropout
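A hedged sketch of the 5-softmax output head (feature size and names are placeholders; 11 classes = 10 digits + NULL):

import torch
import torch.nn as nn

class MultiDigitHead(nn.Module):
    """One classifier per digit position, each with a NULL class for absent digits."""
    def __init__(self, feat_dim=256, n_positions=5, n_classes=11):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_classes) for _ in range(n_positions)])

    def forward(self, feats):
        # feats: shared features from the CNN trunk; one set of logits per position
        return [head(feats) for head in self.heads]

At training time one would typically sum the per-position cross-entropy losses, labelling positions past the sequence length as NULL.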
SVHN
- Baseline uses model averaging + Monte Carlo averaging; the ST-CNN needs only a single pass
- ST-CNN is only 6% slower than the CNN
Bird dataset: CUB
- Fine-grained classification: 200 species (6k training images, 5.8k testing)
- Multiple STs in parallel (more details later)
- Only the image class label is used for training
- Baseline: Inception + batch normalization, pre-trained on ImageNet, fine-tuned on CUB
  - Achieves state-of-the-art (82.3%)
Bird dataset: CUB
- Two networks: 2 parallel spatial transformers, and 4 parallel spatial transformers
- Transform is the attention model: a subset of the affine parameters, with scale s fixed (0.5), i.e. search for square bounding boxes of half the image size
Architecture for 2x parallel STs
- Localization network is shared for all transforms θ_i
- Scale is fixed to 50%
[Diagram: beheaded Inception -> 1x1 conv -> fc -> fc -> softmax]
(Details in the arXiv version of the paper)
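A hedged sketch of this attention-only parameterization (the features feeding `feats` would come from the shared localization network; sizes and the tanh bounding are my own choices, not from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedScaleAttention(nn.Module):
    """Regress only (tx, ty); the scale is frozen at 0.5, i.e. half-size square crops."""
    def __init__(self, feat_dim=128, scale=0.5):
        super().__init__()
        self.scale = scale
        self.fc = nn.Linear(feat_dim, 2)     # translation only
        self.fc.weight.data.zero_()          # start with a centered crop
        self.fc.bias.data.zero_()

    def forward(self, x, feats):
        t = torch.tanh(self.fc(feats))       # bound the translation
        theta = x.new_zeros(x.size(0), 2, 3)
        theta[:, 0, 0] = self.scale          # [[s, 0, tx],
        theta[:, 1, 1] = self.scale          #  [0, s, ty]]
        theta[:, :, 2] = t
        grid = F.affine_grid(theta, x.size())
        return F.grid_sample(x, grid)

For the 2x-parallel case, two such heads would share the same localization features, each producing its own θ_i crop for the downstream classification pipeline.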
Bird dataset: CUB-200
- TP2 (baseline)
- Specialization for free! (the parallel STs learn to attend to different parts, without any part supervision)
[Figure omitted]
Conclusion
- New self-contained module for neural networks
- Can be dropped into a network
- Performs explicit spatial transformations of features
- Learned end-to-end, with no change in loss function
- Provides extra information (θ)
- Early experiments show it works well for recurrent networks
- Over 600 citations