Learning Convolutional Feature Hierarchies for Visual Recognition

Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, Yann LeCun
Computer Science Department, Courant Institute of Mathematical Sciences, New York University

Overview
- Feature Extractors
- Unsupervised Feature Learning
- Sparse Coding
- Convolutional Sparse Coding
- Efficient Predictors for Recognition
- Hierarchical Object Recognition

Object Recognition
- Feature Extraction: Gabor, SIFT, HoG, Color, combinations...
- Classification: PMK-SVM, Linear,...
- Grauman 05, Lazebnik 06, Serre 05, Mutch 06,...

Object Recognition
- Pipeline: Feature Extractor -> Classifier
- It would be better to learn everything, so that the system adapts to different domains
- Learn the feature extractor and the classifier together

Feature Extraction
- Pipeline: Filterbank -> Non-linearity -> Pooling
- Can be based on unsupervised learning
- Features should be efficient to extract
- Overcomplete sparse representations are easily separable
- Conventional sparse coding is slow

Sparse Coding
- Represent an input vector using an overcomplete dictionary: $x \approx Dz$, with input $x$, dictionary $D$, and sparse representation $z$
- # of dictionary elements > size of input
- # of zero elements in z >> # of non-zero elements
- Each x is represented using a linear combination of columns of D
- How do we calculate z for a given x?
- How do we learn D?

Sparse Coding
1) Find the sparsest solution that satisfies a given reconstruction error $\epsilon$:
   $\min_z \|z\|_0 \quad \text{s.t.} \quad \|x - Dz\|_2^2 \le \epsilon$
2) Find the best k-sparse representation that minimizes reconstruction error:
   $\min_z \|x - Dz\|_2^2 \quad \text{s.t.} \quad \|z\|_0 = k$
- L0 minimization requires a combinatorial search, which is not tractable

Sparse Coding
- Matching Pursuit algorithms offer a greedy solution [Mallat and Zhang 93]
- Greedily pick the dictionary element that reduces the residual most
- Very fast, but unstable

Function MP(Y, D, n)
    R = Y, z = 0
    for k = 1..n
        i = argmax_j |D_j^T R|
        a = D_i^T R
        z_i = z_i + a
        R = R - a D_i
    end
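The greedy loop on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: it assumes the columns of `D` have unit L2 norm, and the function name `matching_pursuit` and the toy data are hypothetical.

```python
import numpy as np

def matching_pursuit(y, D, n_iter):
    """Greedy matching pursuit sketch; assumes unit-norm columns in D."""
    residual = np.array(y, dtype=float)      # R = Y
    z = np.zeros(D.shape[1])                 # z = 0
    for _ in range(n_iter):
        corr = D.T @ residual                # correlation of each atom with the residual
        i = int(np.argmax(np.abs(corr)))     # atom that reduces the residual most
        z[i] += corr[i]                      # accumulate, in case the atom is picked again
        residual -= corr[i] * D[:, i]        # remove that atom's contribution
    return z, residual

# Toy usage: a signal built from two atoms of a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)
y = 2.0 * D[:, 3] - 1.5 * D[:, 40]
z, r = matching_pursuit(y, D, n_iter=10)
```

By construction `y` equals `D @ z + r` at every iteration, so the residual norm directly measures how much of the input is still unexplained.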

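Sparse Coding
$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
- x: input, z: code, D: dictionary, $\lambda$: sparsity weight
- Objective = reconstruction + sparsity
- D is given; search for the optimal z
- This defines a mapping f : x -> z
- For every input x, an optimization is required to get z
- Chen 98, Beck 09, Li 09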
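One simple way to carry out the per-input optimization is iterative shrinkage-thresholding (ISTA). The sketch below is an assumption for illustration, not the solver used in the talk; the slide cites several solvers, and ISTA is only the plainest member of that family. The names `ista`, `soft_threshold`, and `energy` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def energy(x, z, D, lam):
    """0.5 * ||x - D z||_2^2 + lam * sum_i |z_i|."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, D, lam, n_iter=200):
    """Minimize the energy over z for a fixed dictionary D."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the reconstruction gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)         # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam / L)
    return z
```

Every call runs a full iterative loop, which is the cost that motivates replacing inference with a trained predictor later in the talk.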

Sparse Modeling
$\min_{z, D} \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
- Learn D from data
- D has to be bounded to avoid trivial solutions
- Online or batch algorithms for updating the dictionary
- Learn the mapping $f_D : x \to z$
- Olshausen and Field 97, Aharon 06, Lee 07, Ranzato 07, Kavukcuoglu 08, Zeiler 10,...

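Sparse Modeling
- Per-sample energy: $E(x, z, D) = \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
- Loss: $L(X, D) = \frac{1}{|X|} \sum_{x \in X} \min_z E(x, z, D)$
For each sample:
1. Do inference: minimize E(x, z, D) with respect to z (sparse coding)
2. Update parameters: $D \leftarrow D - \eta \, \partial E / \partial D$
3. Constrain each dictionary element (column of D) to unit norm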
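The three per-sample steps can be sketched as an online learning loop. This is a minimal sketch under assumptions: inference uses a small ISTA loop (an assumed solver), the unit-norm constraint is enforced by renormalizing columns after each update, and `learn_dictionary`, `infer_code`, and all hyperparameter values are hypothetical.

```python
import numpy as np

def infer_code(x, D, lam, n_iter=100):
    """Step 1: minimize E(x, z, D) over z with ISTA (assumed solver)."""
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        v = z - D.T @ (D @ z - x) / L
        z = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return z

def learn_dictionary(X, n_atoms, lam=0.1, eta=0.05, n_epochs=5, seed=0):
    """Online dictionary learning; X holds one sample per row."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_epochs):
        for x in X:
            z = infer_code(x, D, lam)
            # Step 2: dE/dD = -(x - D z) z^T, so the descent step adds the outer product.
            D += eta * np.outer(x - D @ z, z)
            # Step 3: rescale every column (dictionary element) to unit norm.
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D
```

Without step 3 the energy could be lowered trivially by growing `D` while shrinking `z`, which is why the dictionary has to be bounded.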

Sparse Modeling: two problems
1. Inference takes a long time -> train a predictor function
2. Patch-based modeling produces redundant features -> use convolutional sparse modeling

Predictive Sparse Decomposition
$\min_{z, D, K} \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i| + \|z - C(x; K)\|_2^2$
- Encoder prediction: $z_j = g_j \tanh(k_j \cdot x)$
Learning: for each sample from the data, do:
1. Fix K and D, minimize to get the optimal z
2. Using the optimal value of z, update D and K
3. Scale the elements of D to unit norm
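A single learning step of this procedure might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the encoder parameters K are represented as a filter matrix `W` (rows k_j) and a gain vector `g`, inference uses ISTA started from the encoder prediction, and the names `encode`, `infer_code`, `psd_step` and the step sizes are hypothetical.

```python
import numpy as np

def encode(x, W, g):
    """Predictor C(x; K) with K = (W, g): c_j = g_j * tanh(k_j . x)."""
    return g * np.tanh(W @ x)

def psd_energy(x, z, D, W, g, lam):
    """Reconstruction + sparsity + prediction error."""
    return (0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))
            + np.sum((z - encode(x, W, g)) ** 2))

def infer_code(x, D, W, g, lam, n_iter=50):
    """Step 1: fix K and D, minimize the energy over z (ISTA, assumed solver)."""
    c = encode(x, W, g)
    L = np.linalg.norm(D, 2) ** 2 + 2.0      # Lipschitz constant of the smooth part
    z = c.copy()                             # start from the encoder prediction
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x) + 2.0 * (z - c)
        v = z - grad / L
        z = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return z

def psd_step(x, D, W, g, lam=0.1, eta=0.01):
    """One sample: infer z, then update D and K, then renormalize D."""
    z = infer_code(x, D, W, g, lam)
    h = np.tanh(W @ x)
    err = g * h - z                                        # prediction minus optimal code
    # Step 2: gradient steps on decoder and encoder with z held fixed.
    D_new = D + eta * np.outer(x - D @ z, z)
    g_new = g - eta * 2.0 * err * h
    W_new = W - eta * np.outer(2.0 * err * g * (1.0 - h ** 2), x)
    # Step 3: scale each dictionary element to unit norm.
    D_new /= np.maximum(np.linalg.norm(D_new, axis=0), 1e-12)
    return D_new, W_new, g_new, z
```

After training, `encode(x, W, g)` alone yields an approximate sparse code with one matrix product and no iterative optimization.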

Predictive Sparse Decomposition
- Figure: Encoder (K) and Decoder (D)
- 12x12 image patches, 256 dictionary elements

Predictive Sparse Decomposition
- Figure: Encoder (K) and Decoder (D)
- 28x28 MNIST digit images, 200 dictionary elements
- Strokes for digit parts

Recognition Architecture
- C(x; K): Filterbank + Non-linearity + Pooling, followed by a linear classifier
- 64 filters
- Pinto 08

Recognition - C101
- Optimal (Feature Sign, Lee 07) vs PSD features
- PSD features perform slightly better
- There is a naturally optimal point of sparsity
- Beyond 64 features there is not much gain
- PSD features are hundreds of times faster

Redundancy in Feature Extraction
- Pipeline: Filters -> Convolve -> Feature maps
- Patch-based learning has to model the same structure at every location
- Filters learned this way produce highly redundant features

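Convolutional PSD
$\frac{1}{2}\left\|\mathrm{mask}(x) - \sum_i D_i * z_i\right\|_2^2 + \|z\|_1 + \sum_i \|z_i - C(k_i * x)\|_2^2$
$x \in \mathbb{R}^{w \times h}, \quad D \in \mathbb{R}^{K \times s \times s}, \quad z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
- $*$ denotes 2D convolution
- Figure: patch-based vs. convolutional dictionaries
- Convolutional training yields a more diverse set of features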
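To make the shapes concrete, here is a minimal NumPy sketch of the reconstruction and sparsity terms only. It is an assumption-laden illustration: the mask and the encoder term are left out, feature maps are taken to be (w-s+1) x (h-s+1) so that a full convolution with an s x s kernel returns a w x h image, and the helper names `conv2d_full`, `reconstruct`, and `conv_sparse_energy` are hypothetical.

```python
import numpy as np

def conv2d_full(kernel, fmap):
    """Full 2D convolution: output is (h + s1 - 1) x (w + s2 - 1)."""
    s1, s2 = kernel.shape
    h, w = fmap.shape
    out = np.zeros((h + s1 - 1, w + s2 - 1))
    for u in range(s1):
        for v in range(s2):
            # Each kernel tap adds a shifted, scaled copy of the feature map.
            out[u:u + h, v:v + w] += kernel[u, v] * fmap
    return out

def reconstruct(D, z):
    """Sum over i of D_i * z_i for D of shape (K, s, s) and z of shape (K, H, W)."""
    return sum(conv2d_full(D[k], z[k]) for k in range(len(D)))

def conv_sparse_energy(x, D, z):
    """0.5 * ||x - sum_i D_i * z_i||_2^2 + ||z||_1 (mask and encoder term omitted)."""
    residual = x - reconstruct(D, z)
    return 0.5 * np.sum(residual ** 2) + np.sum(np.abs(z))
```

Because a single feature-map entry places a copy of its kernel anywhere in the image, the dictionary does not need separate shifted copies of the same structure.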

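Convolutional PSD
- Measuring the redundancy in the dictionary
- Cumulative histogram of the angle between every pair of dictionary elements: $\arccos\left(\left|\max\left(D_i \star D_j^{T}\right)\right|\right)$, where $\star$ denotes cross-correlation
- Figure: number of cross-correlations above a given angle (log scale) vs. angle in degrees (0 to 80), for patch-based training and convolutional training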
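The redundancy measure can be sketched as below. This is one reading of the slide's formula, not a confirmed implementation: it assumes square s x s filters normalized to unit norm, takes the largest absolute cross-correlation over all relative shifts, and converts it to an angle. The function names are hypothetical.

```python
import numpy as np

def max_abs_xcorr(a, b):
    """Largest absolute cross-correlation of two s x s filters over all shifts."""
    s = a.shape[0]
    bp = np.pad(b, s - 1)                    # zero-pad so every overlap is covered
    best = 0.0
    for u in range(2 * s - 1):
        for v in range(2 * s - 1):
            best = max(best, abs(np.sum(a * bp[u:u + s, v:v + s])))
    return best

def pairwise_angles_deg(filters):
    """Angle in degrees for every pair i < j of filters with shape (K, s, s)."""
    norms = np.linalg.norm(filters.reshape(len(filters), -1), axis=1)
    F = filters / norms[:, None, None]
    angles = []
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            c = min(max_abs_xcorr(F[i], F[j]), 1.0)   # guard against rounding above 1
            angles.append(np.degrees(np.arccos(c)))
    return np.array(angles)

def count_pairs_above(angles, thresholds):
    """Cumulative histogram: number of pairs whose angle exceeds each threshold."""
    return np.array([(angles > t).sum() for t in thresholds])
```

Two filters that are shifted copies of each other get an angle of 0 degrees under this measure, so a dictionary with many small angles is redundant in the sense of the slide.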

Convolutional PSD: Encoder Training
- 2nd-order information is important for fast convergence
- Better sparse representations can be obtained by using a shrinkage operator
- A smooth shrinkage function is important for preserving derivatives, and its parameters $\beta$ and $b$ are learned
$\frac{1}{\beta}\log\left(\exp(\beta b) + \exp(\beta s) - 1\right) - b$
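The smooth shrinkage function can be evaluated directly, as in the sketch below. One assumption is made loudly here: the slide's expression is applied to |s| and multiplied by sign(s) so that the function is antisymmetric, which the slide text does not show. The name `smooth_shrink` and the parameter values in the usage lines are hypothetical.

```python
import numpy as np

def smooth_shrink(s, beta, b):
    """Smooth shrinkage: (1/beta) * log(exp(beta*b) + exp(beta*|s|) - 1) - b, with the sign of s.

    Assumption: extended antisymmetrically to negative inputs via sign(s) and |s|.
    """
    a = np.abs(s)
    # expm1(beta * a) = exp(beta * a) - 1, accurate for small |s|.
    magnitude = np.log(np.exp(beta * b) + np.expm1(beta * a)) / beta - b
    return np.sign(s) * magnitude

# Small inputs are squashed toward zero; large inputs are shifted by roughly b.
s = np.linspace(-3.0, 3.0, 7)
out = smooth_shrink(s, beta=5.0, b=1.0)
```

The function is exactly zero at s = 0 and approaches |s| - b for large |s|. Unlike hard soft-thresholding its derivative is nonzero for every s other than 0, so gradients with respect to the encoder filters, beta, and b are rarely cut off.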

Convolutional PSD: Recognition Performance on C101
- Low-level convolutional feature learning improves recognition accuracy significantly

            Patch Based SC    Convolutional SC
  Unsup     52.2%             57.1%
  Unsup+    54.2%             57.6%

- Unsup+: unsupervised feature learning followed by supervised fine-tuning

Multi-Stage Object Recognition
- Building block of a multi-stage architecture: Filterbank C(x; K) -> Non-linearities -> Pooling
- Stage 1 (unsupervised pre-training): x -> Filter Bank -> Non-Linearity -> Pooling -> z_1
- Stage 2 (unsupervised pre-training): z_1 -> Filter Bank -> Non-Linearity -> Pooling -> z_2
- Supervised refinement

Recognition Accuracy on Caltech 101 (%)

                                        Patch Based Training    Convolutional Training
  1 Stage,  Unsupervised                52.2                    57.1
  1 Stage,  Unsupervised + Supervised   54.2                    57.6
  2 Stages, Unsupervised                63.7                    65.3
  2 Stages, Unsupervised + Supervised   65.5                    66.3

- Unsupervised pre-training with Convolutional PSD yields better accuracy than patch-based PSD

Pedestrian Detection on INRIA
- Figure: miss rate vs. false positives per image (log scale). Legend, with miss rates: Shapelet orig (90.5%), PoseInvSvm (68.6%), VJ OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%)
- Purely supervised training (R+R+): 14.8% miss rate
- Unsupervised pre-training with Conv PSD + supervised refinement (U+U+): 11.5% miss rate
- Close to state of the art and improving quickly...

Questions?