Transcription:

Philippe Giguère

Overall Description
- Goal: improve spatial invariance to the input data (translation, rotation, scale, clutter, elastic deformation).
- How: add a learnable module that explicitly manipulates the data spatially.
  - Fully differentiable.
  - Can be inserted into an existing architecture.
  - No knowledge of the ground-truth transformation is given.
- Obtains state-of-the-art results.
- Prediction of the transform.

Benefits to multifarious tasks (many and of various types)
- Image classification (with significant distortions).
- Spatial attention: focus on smaller, lower-resolution inputs (increases computational efficiency). (PG: maybe less overfitting?)
- Co-localisation (when multiple instances of the same object are present).

Spatial transformer (ST)
- Can be dropped into any architecture.
- Manipulates the feature maps (data), not the filters.
- Warping is applied to all the channels.
- Can be used at multiple depths or in parallel.
- Applies a spatial transformation in a single forward pass (compared to [34], which makes multiple passes through the net).
- Fully differentiable: end-to-end training with backprop.

[34] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.

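Spatial transformer
- Localization network: a CNN or fully-connected network (FCN), with a final regression layer.
- Thus, the transformation θ is conditional on the input.
- θ holds the spatial transformation parameters, e.g. θ is 6-dimensional for an affine transform.
- θ can be forwarded to the rest of the network (if wanted, as it contains pose information).
- Figure labels: the input feature map U (H × W × C) is warped to an output of size H′ × W′ × C.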

Architecture detail (PyTorch)
Source: http://pytorch.org/tutorials/intermediate/spatial_transformer_tutorial.html

    import torch.nn as nn

    # Inside the model's __init__:
    # Spatial transformer localization network
    self.localization = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=7),
        nn.MaxPool2d(2, stride=2),
        nn.ReLU(True),
        nn.Conv2d(8, 10, kernel_size=5),
        nn.MaxPool2d(2, stride=2),
        nn.ReLU(True),
    )

    # Regressor for the 3 * 2 affine matrix
    self.fc_loc = nn.Sequential(
        nn.Linear(10 * 3 * 3, 32),
        nn.ReLU(True),
        nn.Linear(32, 3 * 2),
    )
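The `10 * 3 * 3` input size of the regressor follows from the layer shapes of the localization network above. The sketch below traces one spatial dimension through those layers, assuming a 28x28 single-channel input (the usual MNIST size; the slide itself does not state the input size) and no padding. The helper names `conv_out` and `localization_feature_size` are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def localization_feature_size(input_size=28):
    """Trace one spatial dimension through the localization network."""
    s = conv_out(input_size, kernel=7)     # nn.Conv2d(1, 8, kernel_size=7)
    s = conv_out(s, kernel=2, stride=2)    # nn.MaxPool2d(2, stride=2)
    s = conv_out(s, kernel=5)              # nn.Conv2d(8, 10, kernel_size=5)
    s = conv_out(s, kernel=2, stride=2)    # nn.MaxPool2d(2, stride=2)
    return s


side = localization_feature_size(28)  # assumption: 28x28 input
flat = 10 * side * side               # 10 channels after the second conv
print(side, flat)
```

With a different input resolution the flattened size changes, so the first `nn.Linear` of the regressor would need to be adjusted accordingly.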

Spatial transformer
- The 6 affine parameters θ_ij allow cropping, translation, rotation, scale and skew.
- Figure panels: identity; rotation.
- Or a pure attention model (cropping via scaling s + translation).
- Similar to texture mapping.
- In the end, any transformation can be used, as long as it is differentiable.
- Figure labels: the input feature map U (H × W × C) is warped to an output of size H′ × W′ × C.
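As a rough illustration of the affine case, the sketch below maps a regular output grid to sampling locations in the input using a 2x3 parameter matrix. It assumes normalized coordinates in [-1, 1] for both grids, and the function names (`affine_grid`, `rotation`, `attention`) are hypothetical, chosen for this example only. It uses plain Python lists rather than tensors to stay self-contained.

```python
import math


def affine_grid(theta, out_h, out_w):
    """For each output pixel with normalized target coordinates (x_t, y_t),
    compute the source location (x_s, y_s) = theta @ (x_t, y_t, 1)."""
    (a, b, tx), (c, d, ty) = theta
    grid = []
    for i in range(out_h):
        y_t = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x_t = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            row.append((a * x_t + b * y_t + tx, c * x_t + d * y_t + ty))
        grid.append(row)
    return grid


# Identity: the sampling grid coincides with the regular grid.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]


def rotation(angle):
    """Pure rotation by `angle` radians about the image centre."""
    return [[math.cos(angle), -math.sin(angle), 0.0],
            [math.sin(angle), math.cos(angle), 0.0]]


def attention(s, tx, ty):
    """Isotropic scale s plus translation: crops a window of the input."""
    return [[s, 0.0, tx], [0.0, s, ty]]


g_id = affine_grid(identity, 3, 3)
g_rot = affine_grid(rotation(math.pi / 2), 3, 3)
g_att = affine_grid(attention(0.5, 0.25, 0.0), 3, 3)
```

The attention matrix has only three free values (s, tx, ty), which is why it can crop and shift but not rotate or skew.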

Closer view of sampling
- Sampling kernel (regular, fixed grid).
- Bilinear kernel.
- Fully (sub-)differentiable with respect to U, x_i^s and y_i^s.
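A minimal sketch of the bilinear kernel, assuming a single-channel feature map stored as a list of rows and source coordinates already expressed in pixel units (an assumption made for this example; a real implementation would first convert from normalized coordinates). Each output value is a weighted sum of input pixels with weight max(0, 1 - |x - m|) * max(0, 1 - |y - n|), so only the four nearest pixels contribute. The function name `bilinear_sample` is hypothetical.

```python
import math


def bilinear_sample(U, x, y):
    """Sample the single-channel map U at real-valued pixel coordinates
    (x, y). Neighbours that fall outside U contribute zero."""
    h, w = len(U), len(U[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    # Only the 2x2 neighbourhood around (x, y) has non-zero weight.
    for n in (y0, y0 + 1):
        for m in (x0, x0 + 1):
            if 0 <= n < h and 0 <= m < w:
                weight = max(0.0, 1.0 - abs(x - m)) * max(0.0, 1.0 - abs(y - n))
                value += U[n][m] * weight
    return value


U = [[0.0, 1.0],
     [2.0, 3.0]]
center = bilinear_sample(U, 0.5, 0.5)  # equal mix of all four pixels
corner = bilinear_sample(U, 1.0, 0.0)  # lands exactly on pixel U[0][1]
```

Because the weights are piecewise linear in x and y, the output has well-defined (sub-)gradients with respect to both the sampled values and the sampling coordinates, which is what lets the loss gradient flow back to the transformation parameters.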

Spatial transform sidenotes
- Can do under- or oversampling of features; watch out for aliasing.
- Can have several spatial transforms in a CNN, at different depths, to get increasingly arbitrary transforms.
- Or n in parallel, to focus on exactly n objects/parts in a picture.

Example

Experiments
- Distorted MNIST
- Street View House Numbers
- Bird classification dataset CUB-200-2011: 2 parallel ST, 4 parallel ST

MNIST
- Distortions: rotation (R); rotation, scale and translation (RTS); elastic warping (E), which cannot always be inverted.
- Networks:
  - Baseline: CNN, fully-connected (FCN).
  - New: ST-CNN, ST-FCN.
- Standard training: backprop, SGD, scheduled learning rate, multinomial cross-entropy.

MNIST
- Two layers of maxpool (for some spatial invariance).
- [Table: error rate (%) by type of ST (spatial transform). Figure columns: input, predicted transform, ST output.]
- Thin plate spline (TPS) transform is best! (It does not seem to overfit on R.)

MNIST
- 60x60 images.
- Large translation + rotation + clutter.

Model       FCN    CNN    ST-FCN   ST-CNN
Error (%)   13.2   3.5    2.0      1.7

Street View House Numbers
- SVHN: 200k images, 1 to 5 digits to recognize.
- Large variability in scale and spatial arrangement.
- Baseline: 11-layer CNN with 5 softmaxes (1 per digit, with NULL).
- Single ST: one ST at the input; the localization network (LN) is a 4-layer CNN that predicts θ.
- Multi ST: ST -> conv -> ST -> conv -> ST -> conv -> ST -> conv; each LN is a 2-layer FCN with 32 hidden units.
- All trained with SGD + dropout.

SVHN
- Baseline results use model averaging + Monte Carlo averaging.
- ST-CNN results use a single pass.
- ST-CNN is only 6% slower than the CNN.

Bird data set: CUB
- Fine-grained classification: 200 species (6k training images, 5.8k testing images).
- Multiple STs in parallel (more details later).
- Only the image class label is used for training.
- Baseline: Inception + batch normalization, pre-trained on ImageNet and fine-tuned on CUB. Achieves state-of-the-art accuracy (82.3%).

Bird data set: CUB
- Have 2 networks: one with 2 parallel spatial transforms, one with 4 parallel spatial transforms.
- The transform is an attention transform: a subset of the parameters, with the scale s fixed (0.5), i.e. search for square bounding boxes of ½ size.
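To make the fixed-scale attention transform concrete, the sketch below computes the pixel-space box that a transform with scale s and translation (tx, ty) covers. It assumes normalized coordinates in [-1, 1] that map linearly onto pixel indices 0 to size - 1; that convention and the helper name `attention_box` are assumptions for illustration, not taken from the slides.

```python
def attention_box(tx, ty, width, height, s=0.5):
    """Pixel-space box (left, top, right, bottom) covered by the attention
    transform [[s, 0, tx], [0, s, ty]] in normalized [-1, 1] coordinates."""
    def to_pixel(v, size):
        # Assumed convention: -1 maps to pixel 0, +1 maps to pixel size - 1.
        return (v + 1.0) / 2.0 * (size - 1)

    left, right = to_pixel(tx - s, width), to_pixel(tx + s, width)
    top, bottom = to_pixel(ty - s, height), to_pixel(ty + s, height)
    return left, top, right, bottom


# Centred box on a 101x101 image: half the image side in each direction.
box = attention_box(0.0, 0.0, width=101, height=101)
print(box)
```

With s fixed, each parallel transformer only has to predict the two translation values, so learning reduces to choosing where to look.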

Architecture for 2x parallel
- Shared for all transforms θ_i.
- Scale is fixed to 50%.
- Diagram labels: softmax, fc, fc, 1x1, beheaded Inception.
- (Details are in the arXiv version of the paper.)

Bird data set: CUB-200
- TP2 (baseline)
- Specialization for free!

Conclusion
- New self-contained module for neural networks.
- Can be dropped into a network.
- Performs explicit spatial transformations of features.
- Learned end-to-end, with no change to the loss function.
- Provides extra information (θ).
- Early experiments show it works well for recurrent networks.
- Over 600 citations.