Efficient Algorithms may not be those we think

Yann LeCun, Computational and Biological Learning Lab, The Courant Institute of Mathematical Sciences, New York University. http://yann.lecun.com http://www.cs.nyu.edu/~yann

The Importance of Architecture. The internal structure of the factors influences the efficiency of inference.

Feed-Forward (bottom-up) Deep Belief Net. Predicting Y from X is trivial; predicting X from Y is very hard. [Figure: a bottom-up chain of complicated functions and distance modules mapping X through latent variables Z1 and Z2 to Y.]

Feedback (top-down) Deep Belief Net. Predicting X from Y is trivial; predicting Y from X is very hard. [Figure: a top-down chain of complicated functions and distance modules mapping Y through latent variables Z2 and Z1 back to X.]

Feed-Forward vs. Feedback? How far can we go with feed-forward architectures? Animals have a vested interest in picking out certain objects very quickly (e.g. predators, prey, obstacles...). Feedback (top-down) systems are slower, but smarter. Couldn't we use a feedback system to drive/train a feed-forward one? 1. Run the top-down inference system (slow, but good). 2. Train a feed-forward system to predict the same answer. ==> You get a fast bottom-up/top-down system that has the best of both worlds.
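A minimal sketch of that two-step recipe, with a toy stand-in for the slow top-down system; all names, shapes, and the MSE matching loss are assumptions made for illustration, not details from the talk:

```python
# Hypothetical sketch: distill a slow top-down inference procedure into a
# fast feed-forward predictor.
import torch
import torch.nn as nn

def slow_topdown_inference(x):
    # Stand-in for the slow but accurate top-down system (in practice, an
    # iterative inference over latent variables). Here: a fixed placeholder.
    return torch.tanh(x @ torch.ones(x.shape[1], 10))

fast_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(fast_net.parameters(), lr=0.01)

for step in range(1000):
    x = torch.randn(32, 64)                        # a batch of inputs
    with torch.no_grad():
        y_slow = slow_topdown_inference(x)         # 1. run the slow top-down system
    y_fast = fast_net(x)                           # 2. feed-forward prediction
    loss = nn.functional.mse_loss(y_fast, y_slow)  # train it to match the slow answer
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the feed-forward net alone gives the fast bottom-up answer.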

Inference & Learning

Natural image patches (Berkeley dataset): 200 decoder filters (reshaped columns of the matrix Wd).

Handwritten digits (MNIST): 60,000 28x28 images, 196 units in the code (learning rate and L1/L2 regularizer hyperparameters: 0.01, 1, 0.001, 0.005). Encoder direct filters.

Handwritten digits (MNIST): forward propagation through the encoder and decoder. After training, there is no need to minimize in code space.
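A rough illustration of that training-time vs. test-time difference, as a hedged sketch: the energy terms, sizes, and coefficients below are assumed for the example (loosely in the spirit of energy-based sparse coding), not the exact model from the slides.

```python
# Sketch: at training time the code z is found by minimizing an energy
# (reconstruction + encoder prediction + sparsity); after training, a single
# forward pass through the encoder is a good enough code.
import torch

D_in, D_code = 28 * 28, 196
enc = torch.nn.Linear(D_in, D_code)   # encoder ("direct filters")
dec = torch.nn.Linear(D_code, D_in)   # decoder filters

def energy(x, z, l1=0.001):
    recon = ((dec(z) - x) ** 2).sum()          # decoder reconstruction error
    pred = ((enc(x) - z) ** 2).sum()           # encoder prediction error
    return recon + pred + l1 * z.abs().sum()   # sparsity penalty on the code

x = torch.rand(1, D_in)

# Training-time inference: minimize the energy over the code z.
z = enc(x).detach().clone().requires_grad_(True)
opt_z = torch.optim.SGD([z], lr=0.1)
for _ in range(50):
    opt_z.zero_grad()
    energy(x, z).backward()    # gradient w.r.t. z (parameter updates not shown)
    opt_z.step()

# After training: the encoder's forward pass alone replaces the minimization.
z_fast = enc(x)
```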

Top-Down + Bottom-Up. [Figure: a combined architecture with a bottom-up chain and a top-down chain of complicated functions and distance modules linking X, Z1, Z2, and Y.]

Low-dimensional representations for matching? A solution to the many-class problem. Example: face recognition. We do not have pictures of every person, so we must be able to learn something without seeing all the classes. Possible solution: learn a similarity metric to make matching efficient. Map images to a low-dimensional space in which two images of the same person are mapped to nearby points, and two images of different persons are mapped to distant points.

Comparing Objects: Learning an Invariant Dissimilarity Metric [Chopra, Hadsell, LeCun CVPR 2005]. Training a parameterized, invariant dissimilarity metric may be a solution to the many-category problem. Find a mapping Gw(X) such that the Euclidean distance ||Gw(X1) - Gw(X2)|| reflects the semantic distance between X1 and X2. Once trained, a trainable dissimilarity metric can be used to classify new categories using a very small number of training samples (used as prototypes). This is an example where probabilistic models are too constraining, because we would have to limit ourselves to models that can be normalized over the space of input pairs. With EBMs, we can put what we want in the box (e.g. a convolutional net). Siamese architecture; application: face verification/recognition. [Figure: two copies of Gw with shared weights applied to X1 and X2, combined by a distance module to give E(W,X1,X2) = ||Gw(X1) - Gw(X2)||.]
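A minimal Siamese-metric sketch in this spirit: the shared mapping Gw, the Euclidean distance, and a generic contrastive-style loss. The loss form, layer sizes, and margin are illustrative assumptions, not the exact energy of the CVPR 2005 paper.

```python
# Sketch: a Siamese metric. Gw is shared between the two inputs; the distance
# ||Gw(X1) - Gw(X2)|| is trained to be small for genuine pairs, large for impostors.
import torch
import torch.nn as nn

Gw = nn.Sequential(nn.Linear(32 * 32, 256), nn.ReLU(), nn.Linear(256, 32))

def pair_distance(x1, x2):
    return torch.norm(Gw(x1) - Gw(x2), dim=1)        # E(W, X1, X2)

def contrastive_loss(d, same, margin=1.0):
    # same == 1: pull the pair together; same == 0: push apart up to the margin.
    return (same * d ** 2 + (1 - same) * torch.clamp(margin - d, min=0) ** 2).mean()

opt = torch.optim.SGD(Gw.parameters(), lr=0.01)
x1, x2 = torch.rand(8, 32 * 32), torch.rand(8, 32 * 32)  # a batch of image pairs
same = torch.randint(0, 2, (8,)).float()                 # genuine (1) vs impostor (0)
loss = contrastive_loss(pair_distance(x1, x2), same)
opt.zero_grad()
loss.backward()
opt.step()
```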

Internal state for genuine and impostor pairs

Sparse vs. Dense Features? Low-dimensional, dense feature representations: efficient in terms of information content per variable, efficient for matching, but difficult to use for classification. High-dimensional, sparse feature representations: diluted information (inefficient for memory), may be inefficient for matching, but easy to use for classification. Any family of functions can be parameterized linearly using a sufficiently high-dimensional and sparse representation of the input variable, e.g. using a kernel representation. But that representation may be too large!
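To make the "linear in a big feature space" point concrete, here is a small illustrative sketch (the RBF feature map, the number of centers, and the toy target are assumptions made for the example): a nonlinear function of a scalar input becomes a linear fit once the input is expanded into many localized, and therefore approximately sparse, kernel features.

```python
# Sketch: a nonlinear target fit by a linear model in a high-dimensional,
# approximately sparse kernel representation of the input.
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(-3, 3, size=(500, 1))     # many prototype centers

def phi(x, gamma=2.0):
    # 500-dimensional RBF representation; only centers near x respond appreciably.
    return np.exp(-gamma * (x - centers.T) ** 2)

x_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(3 * x_train).ravel()           # a nonlinear target function

features = phi(x_train)                         # shape (200, 500)
w, *_ = np.linalg.lstsq(features, y_train, rcond=None)   # linear fit in phi-space
print("train MSE:", np.mean((features @ w - y_train) ** 2))
```

The price is the one the slide names: 500 features for a one-dimensional input, and the blow-up gets worse as the input dimension grows.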

Initializing a Convolutional Net with SPoE. Architecture: LeNet-6, 1->50->50->200->10. Baseline (random initialization): 0.70% error on the test set. First layer initialized with SPoE: 0.60% error on the test set. Training with elastically distorted samples: 0.38% error on the test set.

Initializing a Convolutional Net with SPoE. Architecture: LeNet-6, 1->50->50->200->10, with 9x9 kernels instead of 5x5. [Figure: first-layer filters, random initialization (baseline) vs. first layer initialized with SPoE.]
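A hedged sketch of the initialization step in PyTorch. The layer sizes below only loosely follow the 1->50->50->200->10 description with 9x9 first-layer kernels, and `learned_filters` is just a stand-in for filters produced by the unsupervised SPoE stage (random numbers here):

```python
# Sketch: initialize the first convolutional layer with unsupervised filters
# instead of random weights, then train the whole net with backprop as usual.
import torch
import torch.nn as nn

first_conv = nn.Conv2d(1, 50, kernel_size=9)            # 50 feature maps, 9x9 kernels
rest_of_net = nn.Sequential(
    nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(50, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.LazyLinear(200), nn.ReLU(), nn.Linear(200, 10),
)

learned_filters = torch.randn(50, 1, 9, 9)              # stand-in for SPoE filters
with torch.no_grad():
    first_conv.weight.copy_(learned_filters)            # replace the random init
    first_conv.bias.zero_()

net = nn.Sequential(first_conv, rest_of_net)
out = net(torch.rand(4, 1, 28, 28))                     # MNIST-sized input, 10 outputs
```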

Best Results on MNIST (from raw images: no preprocessing)

CLASSIFIER / DEFORMATION / ERROR (%) / REFERENCE

Knowledge-free methods:
2-layer NN, 800 HU, CE / none / 1.60 / Simard et al., ICDAR 2003
3-layer NN, 500+300 HU, CE, reg / none / 1.53 / Hinton, in press, 2005
SVM, Gaussian kernel / none / 1.40 / Cortes 92 + many others
Unsupervised stacked RBM + backprop / none / 0.95 / Hinton, in press, 2005

Convolutional nets:
Convolutional net LeNet-5 / none / 0.80 / LeCun 2005, unpublished
Convolutional net LeNet-6 / none / 0.70 / LeCun 2006, unpublished
Conv. net LeNet-6 + unsup. learning / none / 0.60 / LeCun 2006, unpublished

Training set augmented with affine distortions:
2-layer NN, 800 HU, CE / affine / 1.10 / Simard et al., ICDAR 2003
Virtual SVM, deg-9 poly / affine / 0.80 / Scholkopf
Convolutional net, CE / affine / 0.60 / Simard et al., ICDAR 2003

Training set augmented with elastic distortions:
2-layer NN, 800 HU, CE / elastic / 0.70 / Simard et al., ICDAR 2003
Convolutional net, CE / elastic / 0.40 / Simard et al., ICDAR 2003
Conv. net LeNet-6 + unsup. learning / elastic / 0.38 / LeCun 2006, unpublished

Convexity is Overrated. Using a suitable architecture (even if it leads to non-convex loss functions) is more important than insisting on convexity (particularly if convexity restricts us to unsuitable architectures). E.g.: shallow (convex) classifiers versus deep (non-convex) classifiers. Even for a shallow/convex architecture, such as an SVM, using non-convex loss functions actually improves accuracy and speed. See "Trading Convexity for Scalability" by Collobert, Sinz, Weston, and Bottou, ICML 2006 (best paper award).

Normalized Uniform Set: Error Rates. Linear classifier on raw stereo images: 30.2% error. K-nearest neighbors on raw stereo images: 18.4% error. K-nearest neighbors on PCA-95: 16.6% error. Pairwise SVM on 96x96 stereo images: 11.6% error. Pairwise SVM on 95 principal components: 13.3% error. Convolutional net on 96x96 stereo images: 5.8% error. [Figure: training instances and test instances.]

Normalized Uniform Set: Learning Times. SVM: using a parallel implementation by Graf, Durdanovic, and Cosatto (NEC Labs). Chop off the last layer of the convolutional net and train an SVM on it.

Experiment 2: Jittered Cluttered Dataset. 291,600 training samples, 58,320 test samples. SVM with Gaussian kernel: 43.3% error. Convolutional net with binocular input: 7.8% error. Convolutional net + SVM on top: 5.9% error. Convolutional net with monocular input: 20.8% error. Smaller mono net (DEMO): 26.0% error. Dataset available from http://www.cs.nyu.edu/~yann

Jittered Cluttered Dataset. OUCH! The convex loss, VC bounds, and representer theorems don't seem to help. Chop off the last layer and train an SVM on it: it works!

Optimization algorithms for learning. Neural nets: conjugate gradient, BFGS, L-BFGS, ... don't work as well as stochastic gradient. SVMs: batch quadratic programming methods don't work as well as SMO, and SMO doesn't work as well as recent online methods. CRFs: iterative scaling (or whatever) doesn't work as well as stochastic gradient (Schraudolph et al., ICML 2006). The discriminative learning folks in speech and handwriting recognition have known this for a long time. Stochastic gradient has no good theoretical guarantees. That doesn't mean we shouldn't use it: the empirical evidence that it works better is overwhelming.
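For concreteness, a tiny illustrative sketch of plain stochastic gradient descent on a linear least-squares problem, updating on one example at a time; the toy data, step size, and number of passes are assumptions of the example:

```python
# Sketch: stochastic gradient descent on a linear model, one example per update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)   # noisy linear targets

w = np.zeros(20)
lr = 0.01
for epoch in range(2):                    # two passes over the data
    for i in rng.permutation(len(X)):     # visit examples in random order
        err = X[i] @ w - y[i]             # prediction error on one example
        w -= lr * err * X[i]              # stochastic gradient step
print("distance to w_true:", np.linalg.norm(w - w_true))
```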

Theoretical Guarantees are Overrated. When empirical evidence suggests a fact for which we don't have theoretical guarantees, it just means the theory is inadequate. When empirical evidence and theory disagree, the theory is wrong. Let's not be afraid of methods for which we have no theoretical guarantee, particularly if they have been shown to work well. But let's aggressively look for those theoretical guarantees. We should use our theoretical understanding to expand our creativity, not to restrict it.