Learning Convolutional Feature Hierarchies for Visual Recognition


Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, Yann LeCun
Computer Science Department, Courant Institute of Mathematical Sciences, New York University

Overview
- Feature extractors
- Unsupervised feature learning
- Sparse coding
- Convolutional sparse coding
- Efficient predictors for recognition
- Hierarchical object recognition

Object Recognition
- Feature extraction: Gabor, SIFT, HoG, color, combinations...
- Classification: PMK-SVM, linear, ...
(Grauman 05, Lazebnik 06, Serre 05, Mutch 06, ...)

Object Recognition
Pipeline: Feature Extractor -> Classifier
It would be better to learn everything, so that the system adapts to different domains: learn the feature extractor and the classifier together.

Feature Extraction
Pipeline: Filterbank -> Non-linearity -> Pooling
- Can be based on unsupervised learning
- Should be efficient at feature-extraction time
- Overcomplete sparse representations are easily separable
- Conventional sparse coding is slow

Sparse Coding
Represent an input vector using an overcomplete dictionary: x = \sum_j D_j z_j, i.e. x = Dz with z sparse.
- Overcomplete: # of dictionary elements > size of the input
- Sparse: # of zero elements of z >> # of non-zero elements
Each x is represented as a linear combination of a few columns of D.
How do we calculate z for a given x? How do we learn D?
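A minimal numpy sketch of this representation (all sizes and values here are illustrative assumptions): an input x is built as a linear combination of a few columns of an overcomplete dictionary D.

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 64, 256                      # input size, # dictionary elements (K > n: overcomplete)
    D = rng.standard_normal((n, K))
    D /= np.linalg.norm(D, axis=0)      # unit-norm columns, as assumed throughout

    z = np.zeros(K)
    z[[3, 87, 201]] = [1.5, -0.7, 0.4]  # sparse code: only 3 of 256 entries non-zero
    x = D @ z                           # x is a linear combination of 3 columns of D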

Sparse Coding
1) Find the sparsest solution that satisfies a given reconstruction error:
   min ||z||_0  s.t.  ||x - Dz||_2^2 <= \epsilon
2) Find the best k-sparse representation that minimizes the reconstruction error:
   min ||x - Dz||_2^2  s.t.  ||z||_0 = k
L0 minimization requires combinatorial search: not tractable.

Sparse Coding: Matching Pursuit
Matching pursuit algorithms offer a greedy solution [Mallat and Zhang 93]: repeatedly pick the dictionary element that most reduces the residual. Very fast, but unstable.

    import numpy as np

    def mp(y, D, n):                        # D: dictionary with unit-norm columns
        r, z = y.copy(), np.zeros(D.shape[1])
        for _ in range(n):
            i = np.argmax(np.abs(D.T @ r))  # element most correlated with the residual
            c = D[:, i] @ r
            z[i] += c
            r -= c * D[:, i]                # remove its contribution from the residual
        return z

Sparse Coding
min_z (1/2) ||x - Dz||_2^2 + \lambda \sum_i |z_i|
(reconstruction + sparsity; x: input, z: code, D: dictionary, \lambda: sparsity penalty)
D is given; search for the optimal z.
This defines a mapping f: x -> z, but for every input x an optimization must be solved to get z.
(Chen 98, Beck 09, Li 09)
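The citations above refer to iterative solvers for this L1 problem; a minimal ISTA-style sketch of one such solver (step size and iteration count are illustrative assumptions):

    import numpy as np

    def ista(x, D, lam, n_iter=100):
        # iterative shrinkage-thresholding for min_z 0.5||x - Dz||^2 + lam*||z||_1
        L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
        z = np.zeros(D.shape[1])
        for _ in range(n_iter):
            grad = D.T @ (D @ z - x)            # gradient of the reconstruction term
            u = z - grad / L
            z = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft threshold
        return z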

Sparse Modeling
min (1/2) ||x - Dz||_2^2 + \lambda \sum_i |z_i|, now learning D from data
- D has to be bounded to avoid trivial solutions
- Online or batch algorithms for updating the dictionary
- Learn the mapping f_D: x -> z
(Olshausen and Field 97, Aharon 06, Lee 07, Ranzato 07, Kavukcuoglu 08, Zeiler 10, ...)

Sparse Modeling
Per-sample energy: E(x, z, D) = min_z (1/2) ||x - Dz||_2^2 + \lambda \sum_i |z_i|
Loss: L(X, D) = (1/|X|) \sum_{x \in X} E(x, z, D)
For each sample:
1. Inference: minimize E(x, z, D) w.r.t. z (sparse coding)
2. Update parameters: D <- D - \eta \partial E / \partial D
3. Constrain the elements of D to unit norm
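A toy version of this three-step loop, assuming the ista() solver from the sketch above and unit-norm columns; the learning rate is an illustrative choice:

    import numpy as np

    def dictionary_update(X, D, lam, eta=0.01):
        for x in X:                              # X: iterable of training vectors
            z = ista(x, D, lam)                  # 1. inference: sparse-code x
            D -= eta * np.outer(D @ z - x, z)    # 2. gradient step on dE/dD for fixed z
            D /= np.linalg.norm(D, axis=0)       # 3. project columns back to unit norm
        return D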

Sparse Modeling: Two Problems
1. Inference takes a long time -> train a predictor function
2. Patch-based modeling produces redundant features -> use convolutional sparse modeling

Predictive Sparse Decomposition
min (1/2) ||x - Dz||_2^2 + \lambda \sum_i |z_i| + ||z - C(x; K)||_2^2
Encoder: C(x; K)_j = g_j tanh(k_j x)
Learning, for each sample from the data:
1. Fix K and D, minimize to get the optimal z
2. Using the optimal value of z, update D and K
3. Scale the elements of D to unit norm
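A simplified sketch of the encoder and of the gradient step on the prediction term in step 2 (shapes, the learning rate, and the exact gradient bookkeeping are assumptions; D is updated as in the dictionary-learning sketch above):

    import numpy as np

    def psd_encoder(x, K_filt, g):
        # C(x; K)_j = g_j * tanh(k_j . x): a fast feed-forward approximation of the code
        return g * np.tanh(K_filt @ x)

    def encoder_step(x, z_opt, K_filt, g, eta=0.01):
        # gradient step on ||z* - C(x; K)||^2 using the optimal code z* from step 1
        t = np.tanh(K_filt @ x)
        e = g * t - z_opt                            # prediction error
        grad_g = e * t
        grad_K = np.outer(e * g * (1 - t ** 2), x)   # chain rule through tanh
        return K_filt - eta * grad_K, g - eta * grad_g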

Predictive Sparse Decomposition
[Filter visualizations: encoder (K) and decoder (D) filters learned on 12x12 image patches, 256 dictionary elements]

Predictive Sparse Decomposition
[Filter visualizations: encoder (K) and decoder (D) filters learned on 28x28 MNIST digit images, 200 dictionary elements; the filters look like strokes for digit parts]

Recognition Architecture
C(x; K): filterbank (64 filters) + non-linearity + pooling, followed by a linear classifier (Pinto 08)

Recognition on Caltech 101
Optimal sparse coding (Feature Sign, Lee 07) vs. PSD features:
- PSD features perform slightly better
- There is a naturally optimal point of sparsity
- Beyond 64 features there is not much gain
- PSD features are hundreds of times faster to compute

Redundancy in Feature Extraction
Filters -> convolve -> feature maps. Patch-based learning has to model the same structure at every location, so it produces highly redundant features.

Convolutional PSD
(1/2) ||mask(x) - \sum_i D_i * z_i||_2^2 + ||z||_1 + \sum_i ||z_i - C(k_i * x)||_2^2   (* denotes convolution)
with x \in R^{w \times h}, D_i \in R^{s \times s} for K filters, and z \in R^{K \times (w-s+1) \times (h-s+1)}
Compared to patch-based training, convolutional training yields a more diverse set of features.
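A sketch of the reconstruction and sparsity terms with scipy (the mask and the encoder term are omitted; shapes follow the slide: x is w x h, D holds K filters of size s x s, z holds K feature maps of size (w-s+1) x (h-s+1), and 'full' convolution maps each feature map back to image size):

    import numpy as np
    from scipy.signal import convolve2d

    def conv_recon_energy(x, D, z, lam):
        # 0.5*||x - sum_i D_i * z_i||^2 + lam*||z||_1, with '*' meaning 2-D convolution
        recon = sum(convolve2d(z_i, D_i, mode='full') for D_i, z_i in zip(D, z))
        return 0.5 * np.sum((x - recon) ** 2) + lam * np.abs(z).sum()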

Convolutional PSD: Measuring Redundancy in the Dictionary
[Plot: cumulative histogram of the angle acos(|d_i^T d_j|) between every pair of dictionary elements, patch-based vs. convolutional training; x-axis: angle in degrees, y-axis: number of pairs (log scale)]
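The redundancy measure in the plot can be computed directly; a short numpy sketch (assuming D holds unit-norm dictionary elements as columns):

    import numpy as np

    def pairwise_angles_deg(D):
        # angle acos(|d_i^T d_j|) between every pair of unit-norm dictionary elements
        G = np.abs(D.T @ D)                        # |cosine| of every pair
        iu = np.triu_indices(G.shape[0], k=1)      # each pair once, skipping the diagonal
        return np.degrees(np.arccos(np.clip(G[iu], 0.0, 1.0)))

    # A cumulative histogram of these angles reveals redundancy: many small angles
    # mean many near-parallel filters, as patch-based training tends to produce.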

Convolutional PSD: Encoder Training
- 2nd-order information is important for fast convergence
- Better sparse representations can be obtained by using a shrinkage operator
- Smooth shrinkage is important for preserving derivatives, and its parameters are learned:
  shrink_{\beta,b}(s) = sign(s) [ (1/\beta) log(exp(\beta b) + exp(\beta |s|) - 1) - b ]
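A sketch of this smooth shrinkage non-linearity (the sign/|s| handling follows the formula as reconstructed above; beta and b are the learned parameters):

    import numpy as np

    def smooth_shrink(s, beta, b):
        # ~0 near the origin (enforces sparsity), ~|s| - b for large |s|;
        # smooth everywhere, so gradients flow to the learned parameters beta and b.
        return np.sign(s) * (np.log(np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0) / beta - b)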

Convolutional PSD: Recognition Performance on Caltech 101
Low-level convolutional feature learning improves accuracy significantly.

             Patch-Based SC    Convolutional SC
  Unsup         52.2%              57.1%
  Unsup+        54.2%              57.6%

(Unsup+: unsupervised feature learning followed by supervised fine-tuning)

Multi-Stage Object Recognition
Stage 1 (unsupervised pre-training): x -> [Filter Bank -> Non-Linearity -> Pooling] -> z_1
Stage 2 (unsupervised pre-training): z_1 -> [Filter Bank -> Non-Linearity -> Pooling] -> z_2
followed by supervised refinement of the whole system.
Building block of a multi-stage architecture: filterbank C(x; K), non-linearities, pooling (see the sketch below).
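A modern PyTorch sketch of the two-stage stack (layer sizes are illustrative assumptions; the paper's non-linearity and pooling include abs rectification and local normalization, simplified here to tanh + average pooling):

    import torch.nn as nn

    stage1 = nn.Sequential(                 # x -> z_1
        nn.Conv2d(1, 64, kernel_size=9),    # filter bank C(x; K), e.g. 64 filters
        nn.Tanh(),                          # non-linearity
        nn.AvgPool2d(2),                    # pooling
    )
    stage2 = nn.Sequential(                 # z_1 -> z_2
        nn.Conv2d(64, 256, kernel_size=9),
        nn.Tanh(),
        nn.AvgPool2d(2),
    )
    # Both stages are pre-trained unsupervised (Convolutional PSD), then the whole
    # stack plus a linear classifier is refined with supervised training.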

Recognition Accuracy on Caltech 101 (%)

                              Patch-Based Training    Convolutional Training
  1 Stage, Unsupervised              52.2                    57.1
  1 Stage, Unsup + Sup               54.2                    57.6
  2 Stages, Unsupervised             63.7                    65.3
  2 Stages, Unsup + Sup              65.5                    66.3

Unsupervised pre-training with Convolutional PSD yields better accuracy than patch-based PSD.

Pedestrian Detection on INRIA
[Plot: miss rate vs. false positives per image. Legend: Shapelet-orig (90.5%), PoseInvSvm (68.6%), VJ-OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm-V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm-V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%)]
Purely supervised training (R+R+): 14.8% miss rate.
Unsupervised pre-training with Convolutional PSD + supervised refinement (U+U+): 11.5%.
Close to the state of the art, and improving quickly...

Questions?