Deep Learning for Vision: Tricks of the Trade


Deep Learning for Vision: Tricks of the Trade Marc'Aurelio Ranzato Facebook, AI Group www.cs.toronto.edu/~ranzato BAVM Friday, 4 October 2013

Ideal Features. Ideal Feature Extractor: window, right; chair, left; monitor, top of shelf; carpet, bottom; drums, corner; pillows on couch. Q.: What objects are in the image? Where is the lamp? What is on the couch?... 2 Ranzato

The Manifold of Natural Images 3

Ideal Feature Extraction. [Diagram: an ideal feature extractor maps the pixel space (Pixel 1, Pixel 2, ..., Pixel n) onto axes with semantic meaning, e.g. Pose and Expression.] 4 Ranzato

Learning Non-Linear Features. Given a dictionary of simple non-linear functions g_1, ..., g_n. Proposal #1: linear combination, f(x) ≈ Σ_j α_j g_j(x). Proposal #2: composition, f(x) = g_1(g_2(... g_n(x) ...)). 5 Ranzato

Learning Non-Linear Features. Given a dictionary of simple non-linear functions g_1, ..., g_n. Proposal #1: linear combination, f(x) ≈ Σ_j α_j g_j(x): kernel learning, boosting, ... (SHALLOW). Proposal #2: composition, f(x) = g_1(g_2(... g_n(x) ...)): deep learning, scattering networks (wavelet cascade), S.C. Zhu & D. Mumford grammar (DEEP). 6 Ranzato
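
To make the two proposals concrete, here is a minimal Python sketch (not from the slides; the dictionary and weights are invented) of a shallow linear combination versus a deep composition of the same simple non-linear functions:

```python
# Toy contrast between the two proposals, with a made-up dictionary g_1..g_3.
import numpy as np

g = [np.tanh, np.abs, lambda x: np.maximum(x, 0.0)]   # simple non-linear functions
alpha = [0.5, -1.0, 2.0]                               # mixing weights (arbitrary)

def f_shallow(x):
    # Proposal #1: linear combination  f(x) = sum_j alpha_j * g_j(x)
    return sum(a * gj(x) for a, gj in zip(alpha, g))

def f_deep(x):
    # Proposal #2: composition  f(x) = g_1(g_2(g_3(x)))
    for gj in reversed(g):
        x = gj(x)
    return x

x = np.linspace(-2, 2, 5)
print(f_shallow(x))
print(f_deep(x))
```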

Linear Combination. [Diagram: an input image matched against a bank of template matchers whose responses are summed into the prediction of a class.] BAD: it may require an exponential number of templates!!! 7 Ranzato

Composition. [Diagram: input image -> low-level parts -> mid-level parts -> high-level parts -> prediction of class; intermediate parts are reused (distributed representations).] GOOD: (exponentially) more efficient. 8 Lee et al. Convolutional DBN's... ICML 2009; Zeiler & Fergus. Ranzato

A Potential Problem with Deep Learning. Optimization is difficult: non-convex, non-linear system. [Diagram: a stack of layers 1-4.] 9 Ranzato

A Potential Problem with Deep Learning. Optimization is difficult: non-convex, non-linear system. Solution #1: freeze the first N-1 layers (engineer the features), e.g. SIFT -> k-means -> pooling -> classifier. It makes it shallow! - How to design features of features? - How to design features for new imagery? 10 Ranzato

A Potential Problem with Deep Learning. Optimization is difficult: non-convex, non-linear system. [Diagram: a stack of layers 1-4.] Solution #2: live with it! It will converge to a local minimum. It is much more powerful!! Given lots of data, engineer less and learn more!! Just need to know a few tricks of the trade... 11 Ranzato

Deep Learning in Practice Optimization is easy, need to know a few tricks of the trade. 1 2 3 4 Q: What's the feature extractor? And what's the classifier? A: No distinction, end-to-end learning! 12 Ranzato

Deep Learning in Practice It works very well in practice: 13 Ranzato

KEY IDEAS: WHY DEEP LEARNING. We need a non-linear system. We need to learn it from data. Build feature hierarchies. Distributed representations. Compositionality. End-to-end learning. 14 Ranzato

What Is Deep Learning? 15 Ranzato

Buzz Words. It's a Convolutional Net. It's Contrastive Divergence. It's Feature Learning. It's Unsupervised Learning. It's just old Neural Nets. It's a Deep Belief Net. 16 Ranzato

(My) Definition. A Deep Learning method is: a method which makes predictions by using a sequence of non-linear processing stages. The resulting intermediate representations can be interpreted as feature hierarchies, and the whole system is jointly learned from data. Some deep learning methods are probabilistic, others are loss-based; some are supervised, others unsupervised... It's a large family! 17 Ranzato

1957 Rosenblatt Perceptron THE SPACE OF MACHINE LEARNING METHODS 18

[Map of the space of machine learning methods] 80s: back-propagation & compute power. Methods: Perceptron, Neural Net, Autoencoder Neural Net. 19

[Map grows] 90s: LeCun's CNNs. Adds: Convolutional Neural Net, Recurrent Neural Net, Sparse Coding, GMM. 20

[Map grows] 00s: SVMs. Adds: Boosting, SVM, Restricted BM. 21

[Map grows] 2006: Hinton's DBN. Adds: Deep Belief Net, BayesNP. 22

[Map grows] 2009: ASR (data + GPU). Adds: ΣΠ networks, Deep (sparse/denoising) Autoencoder. 23

[Map grows] 2012: CNNs (data + GPU). 24

[Full map: Perceptron, Neural Net, Recurrent Neural Net, Convolutional Neural Net, Boosting, SVM, ΣΠ, Autoencoder Neural Net, Sparse Coding, Restricted BM, GMM, BayesNP, Deep (sparse/denoising) Autoencoder, Deep Belief Net.] 25

TIME. [Timeline: Convolutional Neural Net 1988 -> 1998 -> 2012.] Q.: Did we make any progress since then? A.: The main reason for the breakthrough is data and GPUs, but we have also made networks deeper and more non-linear. 26

ConvNets: History - Fukushima 1980: designed network with same basic structure but did not train by backpropagation. - LeCun from late 80s: figured out backpropagation for CNN, popularized and deployed CNN for OCR applications and others. - Poggio from 1999: same basic structure but learning is restricted to top layer (k-means at second stage) - LeCun from 2006: unsupervised feature learning - DiCarlo from 2008: large scale experiments, normalization layer - LeCun from 2009: harsher non-linearities, normalization layer, learning unsupervised and supervised. - Mallat from 2011: provides a theory behind the architecture - Hinton 2012: use bigger nets, GPUs, more data LeCun et al. Gradient-based learning applied to document recognition IEEE 1998 27

ConvNets: till 2012. [Plot: loss vs. parameter, with many local minima.] Common wisdom: training does not work because we get stuck in local minima. 28

ConvNets: today. [Plot: loss vs. parameter.] Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries: the input/output mapping is invariant to permutations of hidden units, ties between parameters (w_1 vs. w_2) have to be broken, and units (W^T x) can saturate. 29

Like walking on a ridge between valleys 30

ConvNets: today. [Plot: loss vs. parameter.] Local minima are all similar, there are long plateaus, it can take long to break symmetries. Optimization is not the real problem when: the dataset is large; units do not saturate too much; there is a normalization layer. 31

ConvNets: today. [Plot: loss vs. parameter.] Today's belief is that the challenge is about generalization (how many training samples to fit 1B parameters? how many parameters/samples to model spaces with 1M dimensions?) and scalability. 32

[Map of methods laid out along a SHALLOW-to-DEEP axis: Perceptron, Boosting, SVM, GMM, BayesNP, Sparse Coding, Restricted BM, Autoencoder Neural Net, Neural Net, Recurrent Neural Net, Convolutional Neural Net, ΣΠ, Deep (sparse/denoising) Autoencoder, Deep Belief Net.] 33

[Same map, now laid out along both a SHALLOW-to-DEEP axis and a SUPERVISED-to-UNSUPERVISED axis.] 34

Deep Learning is a very rich family! I am going to focus on a few methods... 35 Ranzato

[Same map; CNN and DBN are singled out among the deep methods.] 36

Deep Gated MRF. Layer 1: E(x, h^c, h^m) = 1/2 x^T Σ^{-1} x; p(x, h^c, h^m) ∝ exp(-E(x, h^c, h^m)). [Diagram: pair-wise MRF over pixels x_p, x_q.] Ranzato et al. Modeling natural images with gated MRFs PAMI 2013 37

Deep Gated MRF. Layer 1: E(x, h^c, h^m) = 1/2 x^T C C^T x. [Diagram: pair-wise MRF; F filters C connect pixels x_p, x_q.] Ranzato et al. Modeling natural images with gated MRFs PAMI 2013 38

Deep Gated MRF. Layer 1: E(x, h^c, h^m) = 1/2 x^T C diag(h^c) C^T x. [Diagram: gated MRF; covariance hidden units h^c_k gate the F filters C connecting pixels x_p, x_q.] Ranzato et al. Modeling natural images with gated MRFs PAMI 2013 39

Deep Gated MRF. Layer 1: E(x, h^c, h^m) = 1/2 x^T C diag(h^c) C^T x + 1/2 x^T x - x^T W h^m; p(x, h^c, h^m) ∝ exp(-E(x, h^c, h^m)). [Diagram: gated MRF with M covariance hidden units h^c_k (filters C, F of them) and N mean hidden units h^m_j (weights W).] Ranzato et al. Modeling natural images with gated MRFs PAMI 2013 40
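
A minimal numpy sketch of the layer-1 energy above, with assumed (made-up) dimensions, random values, and binary hidden units; it only spells out the three terms and is not the authors' code:

```python
# Sketch of E(x, h^c, h^m) = 1/2 x^T C diag(h^c) C^T x + 1/2 x^T x - x^T W h^m
import numpy as np

D, F, N = 16, 32, 24           # pixel dim, nr. of covariance filters, nr. of mean units (assumed)
rng = np.random.default_rng(0)

x   = rng.normal(size=D)                         # vectorized image patch
C   = rng.normal(size=(D, F))                    # covariance filters
W   = rng.normal(size=(D, N))                    # mean filters
h_c = rng.integers(0, 2, size=F).astype(float)   # binary covariance hiddens (gates)
h_m = rng.integers(0, 2, size=N).astype(float)   # binary mean hiddens

def energy(x, h_c, h_m):
    cov_term  = 0.5 * (C.T @ x) @ (h_c * (C.T @ x))   # 1/2 x^T C diag(h^c) C^T x
    norm_term = 0.5 * x @ x                           # 1/2 x^T x
    mean_term = x @ (W @ h_m)                         # x^T W h^m
    return cov_term + norm_term - mean_term

print(energy(x, h_c, h_m))   # unnormalized: p(x, h^c, h^m) ∝ exp(-E)
```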

Deep Gated MRF. Layer 1: E(x, h^c, h^m) = 1/2 x^T C diag(h^c) C^T x + 1/2 x^T x - x^T W h^m; p(x, h^c, h^m) ∝ exp(-E(x, h^c, h^m)). Inference of latent variables: just a forward pass. Training: requires approximations (here we used MCMC methods). Ranzato et al. Modeling natural images with gated MRFs PAMI 2013 41

Deep Gated MRF. [Diagram: layer 1 on the input, layer 2 on top producing h^2.] Ranzato et al. Modeling natural images with gated MRFs PAMI 2013 42

Deep Gated MRF. [Diagram: layer 1 on the input, layer 2 (h^2) and layer 3 (h^3) stacked on top.] Ranzato et al. Modeling natural images with gated MRFs PAMI 2013 43

Sampling High-Resolution Images. [Figures, slides 44-47: samples from a Gaussian model, a marginal wavelet model (from Simoncelli 2005), a pair-wise MRF, an FoE (from Schmidt, Gao, Roth CVPR 2010), and a 1-layer gMRF (Ranzato et al. PAMI 2013).]

Sampling High-Resolution Images. [Figures, slides 48-51: the same comparison, now with a 3-layer gMRF (Ranzato et al. PAMI 2013).]

Sampling After Training on Face Images. [Figure: unconstrained samples; original input; samples from 1st-, 2nd-, 3rd- and 4th-layer models; 10 conditional (on the left part of the face) samples.] Ranzato et al. PAMI 2013 52

Expression Recognition Under Occlusion Ranzato et al. PAMI 2013 53

Pros: feature extraction is fast; unprecedented generation quality; advances models of natural images; trains without labeled data. Cons: training is inefficient (slow, tricky); sampling scales badly with dimensionality; what's the use case of generative models? Conclusion: if generation is not required, other feature learning methods are more efficient (e.g., sparse auto-encoders). 54

[Same map of methods; CNN and SPARSE CODING highlighted.] 55

CONV NETS: TYPICAL ARCHITECTURE. One stage (zoom): Convol. -> LCN -> Pooling. Whole system: Input Image -> 1st stage -> 2nd stage -> 3rd stage -> Fully Conn. Layers -> Class Labels. 56 Ranzato
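
As an illustration only, a sketch of this architecture in modern PyTorch (which did not exist at the time of the talk); layer sizes are hypothetical and LCN is approximated with a local response normalization:

```python
# Minimal sketch of one stage and the whole system (assumed sizes, 10 classes).
import torch
import torch.nn as nn

def stage(in_ch, out_ch):
    # One stage: Convolution -> normalization (LCN approximated by LRN) -> Pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.Tanh(),
        nn.LocalResponseNorm(size=5),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

# Whole system: input image -> 3 stages -> fully connected layers -> class labels
model = nn.Sequential(
    stage(3, 32), stage(32, 64), stage(64, 128),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(1, 3, 32, 32)   # a dummy 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 10])
```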

CONV NETS: TYPICAL ARCHITECTURE. One stage (zoom): Convol. -> LCN -> Pooling. Conceptually similar to: SIFT -> K-Means -> Pyramid Pooling -> SVM (Lazebnik et al., ...Spatial Pyramid Matching..., CVPR 2006) and SIFT -> Fisher Vect. -> Pooling -> SVM (Sanchez et al., Image classification with F.V.: Theory and practice, IJCV 2012). 57 Ranzato

CONV NETS: EXAMPLES - OCR / House number & Traffic sign classification Ciresan et al. MCDNN for image classification CVPR 2012 Wan et al. Regularization of neural networks using dropconnect ICML 2013 58

CONV NETS: EXAMPLES - Texture classification Sifre et al. Rotation, scaling and deformation invariant scattering... CVPR 2013 59

CONV NETS: EXAMPLES - Pedestrian detection Sermanet et al. Pedestrian detection with unsupervised multi-stage.. CVPR 2013 60

CONV NETS: EXAMPLES - Scene Parsing Farabet et al. Learning hierarchical features for scene labeling PAMI 2013 Ranzato 61

CONV NETS: EXAMPLES - Segmentation 3D volumetric images Ciresan et al. DNN segment neuronal membranes... NIPS 2012 Turaga et al. Maximin learning of image segmentation NIPS 2009 62 Ranzato

CONV NETS: EXAMPLES - Action recognition from videos Taylor et al. Convolutional learning of spatio-temporal features ECCV 2010 63

CONV NETS: EXAMPLES - Robotics Sermanet et al. Mapping and planning...with long range perception IROS 2008 64

CONV NETS: EXAMPLES - Denoising. [Figure: original, noisy, and denoised images.] Burger et al. Can plain NNs compete with BM3D? CVPR 2012 65 Ranzato

CONV NETS: EXAMPLES - Dimensionality reduction / learning embeddings Hadsell et al. Dimensionality reduction by learning an invariant mapping CVPR 2006 66

CONV NETS: EXAMPLES - Image classification. [Figure: an image fed to the Object Recognizer; predicted label: railcar.] Krizhevsky et al. ImageNet Classification with deep CNNs NIPS 2012 67 Ranzato

CONV NETS: EXAMPLES - Deployed in commercial systems (Google & Baidu, spring 2013) 68 Ranzato

How To Use ConvNets...(properly) 69

CHOOSING THE ARCHITECTURE. Task dependent. Cross-validation. [Convolution -> LCN -> pooling]* + fully connected layer. The more data: the more layers and the more kernels. Look at the number of parameters at each layer. Look at the number of flops at each layer. Computational cost. Be creative :) 70 Ranzato
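
A small sketch of the suggested bookkeeping, counting parameters and approximate multiply-add flops per convolutional layer; the layer list here is invented for illustration:

```python
# Per-layer parameter and flop counts for a hypothetical 3-stage conv net.
layers = [
    # (name, in_channels, out_channels, kernel, output_h, output_w)
    ("conv1", 3,   32,  5, 32, 32),
    ("conv2", 32,  64,  5, 16, 16),
    ("conv3", 64, 128,  5,  8,  8),
]

for name, c_in, c_out, k, h, w in layers:
    params = c_out * (c_in * k * k + 1)   # weights + biases
    flops  = params * h * w               # roughly one multiply-add per weight per output pixel
    print(f"{name}: {params:,} params, {flops:,} flops")
```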

HOW TO OPTIMIZE. SGD (with momentum) usually works very well. Pick the learning rate by running on a subset of the data (Bottou, Stochastic Gradient Tricks, Neural Networks 2012): start with a large learning rate and divide by 2 until the loss does not diverge, then decay the learning rate by a factor of ~100 or more by the end of training. Use non-linearity. Initialize parameters so that each feature across layers has similar variance; avoid units in saturation. 71 Ranzato
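
A minimal PyTorch sketch of this recipe (model, data, and schedule values are all made up): SGD with momentum, variance-preserving initialization, and a learning-rate schedule that decays by roughly 100x over training:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.Tanh(), nn.Linear(256, 10))

# Initialize so activations keep similar variance across layers (avoid saturation).
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Step decay that shrinks the learning rate by roughly 100x over 100 epochs.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=25, gamma=0.3)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(100):
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))   # stand-in mini-batch
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    sched.step()   # one decay step per epoch, just for illustration
```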

HOW TO IMPROVE GENERALIZATION. Weight sharing (greatly reduces the number of parameters). Data augmentation (e.g., jittering, noise injection, etc.). Dropout (Hinton et al., Improving NNs by preventing co-adaptation of feature detectors, arXiv 2012). Weight decay (L2, L1). Sparsity in the hidden units. Multi-task (unsupervised learning). 72 Ranzato
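
A short sketch (hypothetical sizes, PyTorch used only for illustration) of three of these regularizers in a single training step: dropout, L2 weight decay, and data augmentation by jittering and noise injection:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),                  # dropout (Hinton et al., 2012)
    nn.Linear(256, 10),
)
# L2 weight decay is folded into the optimizer.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

def augment(x):
    # Jittering (small random shift of the flattened image) plus noise injection.
    shift = torch.randint(-2, 3, (1,)).item()
    return torch.roll(x, shifts=shift, dims=1) + 0.05 * torch.randn_like(x)

x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss = nn.CrossEntropyLoss()(model(augment(x)), y)
loss.backward()
opt.step()
```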

OTHER THINGS GOOD TO KNOW. Check gradients numerically by finite differences. Visualize features: feature maps need to be uncorrelated and have high variance. [Figure: matrix of activations, samples vs. hidden units.] Good training: hidden units are sparse across samples and across features. 73 Ranzato
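
A minimal numpy sketch of gradient checking by central finite differences, on a made-up scalar loss; the analytic gradient is compared against the numerical one:

```python
import numpy as np

def loss(w):
    return np.sum(np.tanh(w) ** 2)

def analytic_grad(w):
    # d/dw tanh(w)^2 = 2 tanh(w) (1 - tanh(w)^2)
    return 2.0 * np.tanh(w) * (1.0 - np.tanh(w) ** 2)

w = np.random.randn(5)
eps = 1e-5
num_grad = np.zeros_like(w)
for i in range(w.size):
    e = np.zeros_like(w); e[i] = eps
    num_grad[i] = (loss(w + e) - loss(w - e)) / (2 * eps)

# Relative error should be tiny (~1e-8) if the analytic gradient is correct.
rel_err = np.abs(num_grad - analytic_grad(w)) / (np.abs(num_grad) + np.abs(analytic_grad(w)) + 1e-12)
print(rel_err.max())
```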

OTHER THINGS GOOD TO KNOW. Check gradients numerically by finite differences. Visualize features: feature maps need to be uncorrelated and have high variance. [Figure: matrix of activations, samples vs. hidden units.] Bad training: many hidden units ignore the input and/or exhibit strong correlations. 74 Ranzato

OTHER THINGS GOOD TO KNOW. Check gradients numerically by finite differences. Visualize features: feature maps need to be uncorrelated and have high variance. Visualize parameters. [Figure: filter banks labeled GOOD and BAD (too noisy, too correlated, lacking structure).] Good training: learned filters exhibit structure and are uncorrelated. 75 Ranzato

OTHER THINGS GOOD TO KNOW. Check gradients numerically by finite differences. Visualize features: feature maps need to be uncorrelated and have high variance. Visualize parameters. Measure error on both the training and the validation set. Test on a small subset of the data and check that the error goes to 0. 76 Ranzato
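
A small sketch (hypothetical model and data) of that last sanity check: train on a tiny fixed subset and verify that the training error can be driven close to zero:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))   # a tiny fixed subset
for step in range(500):
    opt.zero_grad()
    out = model(x)
    loss_fn(out, y).backward()
    opt.step()

err = (out.argmax(dim=1) != y).float().mean().item()
print(f"training error on the small subset: {err:.3f}")    # should approach 0
```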

WHAT IF IT DOES NOT WORK? Training diverges: the learning rate may be too large, so decrease it; BPROP may be buggy, so do numerical gradient checking. Parameters collapse / loss is minimized but accuracy is low: check the loss function: is it appropriate for the task you want to solve? does it have degenerate solutions? Network is underperforming: compute flops and nr. of params., and if too small, make the net larger; visualize hidden units/params and fix optimization. Network is too slow: compute flops and nr. of params.; use a GPU or a distributed framework, or make the net smaller. 77 Ranzato

SUMMARY. Deep Learning = learning hierarchical representations. Leverage compositionality to gain efficiency. Unsupervised learning: active research topic. Supervised learning: most successful set-up today. Optimization: don't we get stuck in local minima? No, they are all the same! In large-scale applications, local minima are even less of an issue. Scaling: GPUs; distributed frameworks (Google); better optimization techniques. Generalization on small datasets (curse of dimensionality): input distortions, weight decay, dropout. 78 Ranzato

THANK YOU! NOTE: IJCV Special Issue on Deep Learning. Deadline: 9 Feb. 2014. 79 Ranzato

SOFTWARE. Torch7, a learning library that supports neural net training: http://www.torch.ch and http://code.cogbits.com/wiki/doku.php (tutorial with demos by C. Farabet). Theano, a Python-based learning library (U. Montreal) that does automatic differentiation: http://deeplearning.net/software/theano/ C++ code for ConvNets (Sermanet): http://eblearn.sourceforge.net/ Efficient CUDA kernels for ConvNets (Krizhevsky): code.google.com/p/cuda-convnet More references at the CVPR 2013 tutorial on deep learning: http://www.cs.toronto.edu/~ranzato/publications/ranzato_cvpr13.pdf 80 Ranzato