Learning Feature Hierarchies for Object Recognition

1 Learning Feature Hierarchies for Object Recognition
Koray Kavukcuoglu, Computer Science Department, Courant Institute of Mathematical Sciences, New York University
Marc'Aurelio Ranzato, Kevin Jarrett, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Arthur Szlam, Rob Fergus and Yann LeCun

2 Overview
Feature Extractors
Unsupervised Feature Learning
Sparse Coding
Learning Invariance
Convolutional Sparse Coding
Hierarchical Object Recognition

3 Object Recognition
Feature Extraction: Gabor, SIFT, HoG, Color, combinations...
Classification: PMK-SVM, Linear, ...
(Grauman 05, Lazebnik 06, Serre 05, Mutch 06, ...)

4 Object Recognition
Pipeline: Feature Extractor → Classifier
It would be better to learn everything, so that the system adapts to different domains.
Learn the feature extractor and the classifier together.

5 Feature Extraction
Can be based on unsupervised learning.
Features should be efficient to extract.
Overcomplete sparse representations are easily separable.

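6 Sparse Coding
$$\min_z \; \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$
x is the input, z is the code and D is the dictionary. The first term measures reconstruction, the second enforces sparsity.
The dictionary D is given; search for the optimal z.
This defines a mapping $f : x \to z$.
For every input x, inference takes too much time.
(Mallat 93, Chen 98, Beck 09, Li 09)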
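The inference problem on this slide can be sketched with ISTA (iterative shrinkage-thresholding), one of the algorithms the deck names later. This is a minimal illustration, not the authors' implementation: the names `soft_threshold` and `ista`, the fixed step size and the iteration count are all assumptions made for the example.

```python
def soft_threshold(v, t):
    """Shrinkage operator: move v toward zero by t, clamping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(x, D, lam, step, n_iter=200):
    """Approximately minimize 0.5*||x - D z||^2 + lam*sum|z_i| over z.

    D is a list of rows (n x m). The step size is an assumption of this
    sketch; it should not exceed 1 / (largest eigenvalue of D^T D).
    """
    n, m = len(D), len(D[0])
    z = [0.0] * m
    for _ in range(n_iter):
        # Residual r = D z - x
        r = [sum(D[i][j] * z[j] for j in range(m)) - x[i] for i in range(n)]
        # Gradient of the reconstruction term: D^T r
        g = [sum(D[i][j] * r[i] for i in range(n)) for j in range(m)]
        # Gradient step followed by shrinkage
        z = [soft_threshold(z[j] - step * g[j], lam * step) for j in range(m)]
    return z


# With an identity dictionary the solution is the soft-thresholded input.
code = ista([3.0, 0.5], [[1.0, 0.0], [0.0, 1.0]], lam=1.0, step=1.0)
```

Every new input needs its own run of this loop, which is the cost the slide refers to.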

7 Sparse Modeling
$$\min_{z, D} \; \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$
Learn the dictionary D from data.
D has to be bounded to avoid trivial solutions.
Online or batch algorithms for updating the dictionary.
Learn the mapping $f_D : x \to z$.
(Olshausen and Field 97, Aharon 06, Lee 07, Ranzato 07, Kavukcuoglu 08, Zeiler 10, ...)

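8 Sparse Modeling
Per-sample energy:
$$E(x, z, D) = \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$
Loss, with z the code inferred for each sample:
$$L(X, D) = \frac{1}{|X|} \sum_{x \in X} E(x, z, D)$$
For each sample:
1. Do inference: minimize E(x, z, D) with respect to z (sparse coding).
2. Update the parameters keeping z fixed: $D \leftarrow D - \eta \, \partial E / \partial D$.
3. Project the columns of D onto the unit sphere.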
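Steps 2 and 3 can be sketched as a single function, assuming the code z has already been inferred in step 1. The function name `dictionary_step` and the plain-list matrix layout are choices made for this illustration only.

```python
import math


def dictionary_step(x, z, D, eta):
    """One dictionary update for a fixed code z (steps 2 and 3).

    E = 0.5*||x - D z||^2 + lam*sum|z_i|, so dE/dD = -(x - D z) z^T.
    D is a list of rows (n x m); a new matrix is returned.
    """
    n, m = len(D), len(D[0])
    residual = [x[i] - sum(D[i][j] * z[j] for j in range(m)) for i in range(n)]
    # Gradient step: D <- D - eta * dE/dD
    D_new = [[D[i][j] + eta * residual[i] * z[j] for j in range(m)]
             for i in range(n)]
    # Project each column onto the unit sphere
    for j in range(m):
        norm = math.sqrt(sum(D_new[i][j] ** 2 for i in range(n)))
        if norm > 0.0:
            for i in range(n):
                D_new[i][j] /= norm
    return D_new


D1 = dictionary_step([1.0, 1.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], eta=0.5)
```

Only the column whose code entry is non-zero moves, and the projection keeps D bounded, which is what rules out the trivial solution of growing D while shrinking z.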

9 Sparse Modeling
$$\min_z \; \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$
[Figure: code at iteration 1 and at convergence.]
The inference process suppresses all but a few code units.

10 Sparse Modeling: Problems
1. Inference takes a long time → train a predictor function.
2. Sparse coding is unstable → complex cell model.
3. Patch-based modeling produces redundant features → convolutional sparse modeling.

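11 Predictive Sparse Decomposition
$$\min_z \; \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i| + \alpha \|z - F_e(x; K)\|_2^2$$
Choices for the encoder prediction $F_e(x; K)$:
$$z = g \tanh(k^T x)$$
$$z = \mathrm{sh}_\lambda(k^T x)$$
$$z = \mathrm{sh}_\lambda\big(k^T x + S \, \mathrm{sh}_\lambda(k^T x)\big) \quad \text{(Learned ISTA, Gregor 10)}$$
Learning. For each sample from the data, do:
1. Fix K and D; minimize to get the optimal z.
2. Using the optimal value of z, update D and K.
3. Scale the elements of D to unit norm.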
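The PSD objective can be evaluated with a few lines of code. The sketch below assumes the plain shrinkage encoder $\mathrm{sh}(k^T x)$ from the slide; the helper names `shrink`, `encode` and `psd_energy`, and the separate encoder threshold `theta`, are assumptions of this example rather than anything the slides specify.

```python
def shrink(v, t):
    """Soft shrinkage sh_t(v)."""
    return max(v - t, 0.0) - max(-v - t, 0.0)


def encode(x, K, theta):
    """Feed-forward prediction F_e(x; K): shrinkage of the filter responses.

    K is a list of filters k_j, each the same length as x.
    """
    return [shrink(sum(kj[i] * x[i] for i in range(len(x))), theta) for kj in K]


def psd_energy(x, z, D, K, lam, alpha, theta):
    """0.5*||x - D z||^2 + lam*sum|z_i| + alpha*||z - F_e(x; K)||^2."""
    n, m = len(D), len(D[0])
    recon = sum((x[i] - sum(D[i][j] * z[j] for j in range(m))) ** 2
                for i in range(n))
    sparsity = sum(abs(zj) for zj in z)
    pred = encode(x, K, theta)
    prediction = sum((z[j] - pred[j]) ** 2 for j in range(m))
    return 0.5 * recon + lam * sparsity + alpha * prediction


identity = [[1.0, 0.0], [0.0, 1.0]]
z_hat = encode([3.0, 0.5], identity, theta=1.0)
energy = psd_energy([3.0, 0.5], z_hat, identity, identity,
                    lam=1.0, alpha=1.0, theta=1.0)
```

At test time only `encode` is needed, which is where the speed advantage over iterative inference comes from.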

12 Predictive Sparse Decomposition
[Figure: learned encoder (k) and decoder (D) filters.]
12x12 image patches, 256 dictionary elements.

13 Predictive Sparse Decomposition
[Figure: learned encoder (k) and decoder (D) filters.]
28x28 MNIST digit images, 200 dictionary elements.
The filters are strokes for digit parts.

14 Good Representation?
Performance on MNIST using 28x28 filters.
Compare representations from different methods.
PSD: worse reconstruction than other models, but better recognition.
(Ranzato 07, Kavukcuoglu 08)

15 Recognition
Filterbank + Non-linearity + Pooling
[Diagram: 64 filters, non-linearity, rectification, local contrast normalization (Pinto 08), max / average pooling.]
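A one-dimensional toy version of one such stage is sketched below, assuming absolute-value rectification and non-overlapping average pooling. Local contrast normalization is left out to keep the example short, and all function names are hypothetical.

```python
def correlate_valid(signal, kernel):
    """Filter response at every position where the kernel fits entirely."""
    n, s = len(signal), len(kernel)
    return [sum(signal[i + t] * kernel[t] for t in range(s))
            for i in range(n - s + 1)]


def average_pool(values, size):
    """Non-overlapping average pooling; a trailing partial window is dropped."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, size)]


def stage(signal, filters, pool_size):
    """Filterbank, then abs rectification, then average pooling."""
    maps = []
    for kernel in filters:
        response = correlate_valid(signal, kernel)
        rectified = [abs(v) for v in response]
        maps.append(average_pool(rectified, pool_size))
    return maps


features = stage([0.0, 1.0, 0.0, -1.0, 0.0], [[1.0, -1.0]], pool_size=2)
```

Without the rectification, the positive and negative responses of this edge filter would cancel inside each pooling window.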

16 Recognition - C101
Optimal (Feature Sign, Lee 07) vs PSD features.
PSD features perform slightly better.
There is a naturally optimal point of sparsity.
Beyond 64 features there is not much gain.
PSD features are an order of magnitude faster.

17 (In)Stability of Sparse Coding
16x16 input patch, 1024 dictionary elements (4x overcomplete).
[Plot: codes of pixel-shifted input patches.]

18 Learning Invariance
$$\min_z \; \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_{i=1}^{K} \sqrt{\sum_{j \in P_i} w_j z_j^2} + \alpha \|z - F_e(x; K)\|_2^2$$
Group sparsity: idea proposed by Hyvarinen & Hoyer (2001) in the context of square ICA.
$w_j$: Gaussian weighting window.
The learning algorithm is the same as for PSD.
The feedforward regressor $F_e(x; K)$, followed by the pooling function, produces invariant representations efficiently.
Ability to learn the necessary transformations from data.
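The group sparsity term can be computed directly from its definition. In this sketch each pool is a list of code indices and carries its own list of window weights; that data layout and the name `group_sparsity` are assumptions made for illustration.

```python
import math


def group_sparsity(z, pools, weights):
    """Sum over pools P_i of sqrt(sum_{j in P_i} w_j * z_j^2).

    pools[i] lists the code indices in pool i; weights[i] holds the
    window weight for each of those indices, in the same order.
    """
    total = 0.0
    for indices, window in zip(pools, weights):
        total += math.sqrt(sum(w * z[j] ** 2 for j, w in zip(indices, window)))
    return total


z = [3.0, 4.0, 0.0, 0.0]
pooled = group_sparsity(z, [[0, 1], [2, 3]], [[1.0, 1.0], [1.0, 1.0]])
unpooled = group_sparsity(z, [[0], [1], [2], [3]], [[1.0]] * 4)
```

With pools of size one and unit weights the term reduces to the usual L1 penalty (here 7.0), while pooling the two active units together costs only 5.0, so the penalty favors activity concentrated inside a pool.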

19 Learning Invariance
[Figure, panels (a) and (b): overlapping neighborhoods $P_1, \dots, P_K$ on the map of z, with the Gaussian window $w_j$ over a pool $P_i$.]
Sparsity across pools rather than units.
Drives basis functions in a pool to be similar.
Overlapping pools ensure smooth representation manifolds.
Pool size = 1 → regular PSD.
(Kavukcuoglu 09)

20 Topographic Maps
Circular boundary conditions in both directions.
6x6 pools with stride 2 in both dimensions.

21 How invariant?
[Plots: normalized MSE versus horizontal shift, at rotation 0 degrees and rotation 25 degrees, for SIFT (non rot. inv.), SIFT, our algorithm (non inv.) and our algorithm (inv.).]
Left: normalized MSE between representations of original and transformed 16x16 patches.
Right: same after a 25 degree rotation.
IPSD is more invariant.

22 Good for Recognition?
Caltech 101 (Accuracy)
Classifier    Features                        Accuracy
Linear SVM    IPSD (24x24)                    50.9%
Linear SVM    SIFT, not rot. inv. (24x24)     51.2%
Linear SVM    SIFT, rot. inv. (24x24)         45.2%
PMK SVM       IPSD (34x34)                    59.6%
PMK SVM       IPSD (56x56)                    62.6%
PMK SVM       IPSD (120x120)                  65.6%
MNIST (Error Rate)
Classifier    Features                        Error
Linear SVM    IPSD (5x5)                      1.0%
Linear SVM    SIFT, not rot. inv. (5x5)       1.5%

23 Multi-Stage Object Recognition
Each stage contains a filterbank, a non-linearity and pooling.
[Table: Conv Net and HMAX compared by filterbank, tanh, abs, LCN and pooling. Conv Net: learned filterbank, average pooling. HMAX: Gabor filterbank, max pooling.]
(Jarrett 09)

24 Multi-Stage Object Recognition
[Diagram: x → (Filter Bank, Non-Linearity, Pooling) → z_1 → (Filter Bank, Non-Linearity, Pooling) → z_2. Each stage receives unsupervised pre-training, followed by supervised refinement.]
Building block of a multi-stage architecture: filterbank $F_e(x; K)$, non-linearities, pooling.

25 Multi-Stage Object Recognition
[Table: training protocols R, U, R+, U+, RR, UU, R+R+, U+U+ against architectures Pa, N-Pa, N-Pm, Rabs-Pa, Rabs-N-Pa, C-Rabs-N-Pa.]
Legend: U = unsupervised, R = random, + = supervised fine tuning, C = convolutional unsupervised, Rabs = absolute value rectification, N = local contrast normalization, Pa = average pooling, Pm = max pooling.
2 stage > 1 stage

26 Multi-Stage Object Recognition
[Same results table and legend as slide 25.]
Abs > No Abs

27 Multi-Stage Object Recognition
[Same results table and legend as slide 25.]
LCN > No LCN

28 Multi-Stage Object Recognition
[Same results table and legend as slide 25.]
Even Random Works!!!

29 Optimal Stimuli
[Figure: optimal stimuli for PSD filters and for random filters.]
Optimize the input to maximize the output of one unit after abs + LCN + average pooling.
Random feature extractors respond to oriented gratings too.

30 Random Filter Performance
NORB Dataset:
1. 96x96 grayscale images
2. 5 classes (human, car, truck, airplane, animal)
3. Almost 5000 training samples per class
[Plot: error rate versus number of training samples per class, for F_CSG-P_A (R+R+) and for F_CSG-R_abs-N-P_A trained as UU, R+R+, RR and U+U+. The plot also carries a Caltech 101 label.]

31 Redundancy in Feature Extraction
[Diagram: filters are convolved with the image to produce feature maps.]
Patch-based learning has to model the same structure at every location.
It therefore produces highly redundant features.

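32 Convolutional PSD
$$\tfrac{1}{2}\Big\|x - \sum_k D_k * z_k\Big\|_2^2 + \lambda \|z\|_1 + \alpha \|z - F_e(x)\|_2^2$$
$x \in \mathbb{R}^{w \times h}$, $D \in \mathbb{R}^{K \times s \times s}$, $z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
[Figure: filters from patch-based training and from convolutional training.]
Convolutional training yields a more diverse set of features.
(Kavukcuoglu 10)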

33 Convolutional PSD
Measuring the redundancy in the dictionary.
Cumulative histogram of the angle between all pairs of dictionary elements:
$$\mathrm{acos}\big(\max\big(\mathrm{abs}(d_i D_j^T)\big)\big)$$
[Plot: number of cross-correlations below a given angle (degrees), for patch-based training and convolutional training.]
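A simplified version of this measurement is sketched below. It computes the angle between every pair of dictionary elements from their normalized inner product and counts how many fall under a threshold. It deliberately omits the maximum over cross-correlation shifts that the slide's formula includes, and the function names are hypothetical.

```python
import math


def pairwise_angles_deg(D):
    """Angles in degrees between all pairs of dictionary elements.

    D is a list of non-zero elements (rows). The sign of the inner
    product is ignored, so an element and its negation are 0 degrees apart.
    """
    def unit(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    U = [unit(d) for d in D]
    angles = []
    for i in range(len(U)):
        for j in range(i + 1, len(U)):
            c = abs(sum(a * b for a, b in zip(U[i], U[j])))
            # Clamp to guard against rounding slightly above 1
            angles.append(math.degrees(math.acos(min(c, 1.0))))
    return angles


def count_below(angles, threshold_deg):
    """One point of the cumulative histogram: pairs closer than the threshold."""
    return sum(1 for a in angles if a < threshold_deg)


angles = pairwise_angles_deg([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
close_pairs = count_below(angles, 60.0)
```

A dictionary with many small pairwise angles contains near-duplicate elements, which is the redundancy the slide attributes to patch-based training.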

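34 Convolutional PSD
$$\tfrac{1}{2}\Big\|x - \sum_k D_k * z_k\Big\|_2^2 + \lambda \|z\|_1 + \alpha \|z - F_e(x)\|_2^2$$
$x \in \mathbb{R}^{w \times h}$, $D \in \mathbb{R}^{K \times s \times s}$, $z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
Convolutional sparse coding models large images rather than small image patches.
Each iteration reduces redundancy in the feature representation.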

35 Convolutional PSD
[Figure: input (x), dictionary (D), reconstruction and code (z) at iteration 1.]
Each iteration reduces redundancy in the feature representation.

36 Convolutional PSD
[Figure: input (x), dictionary (D), reconstruction and code (z) at iteration 2.]
Each iteration reduces redundancy in the feature representation.

37 Convolutional PSD
[Figure: input (x), dictionary (D), reconstruction and code (z) at convergence.]
Each iteration reduces redundancy in the feature representation.

38 Convolutional PSD - Better Encoders
To predict convolutional sparse representations, simple encoders are inadequate.
A better encoder should use a shrinkage operator with a learned suppression matrix to approximate sparse codes (Gregor 10):
$$z = \mathrm{sh}_\lambda(k^T x)$$
$$z = \mathrm{sh}_\lambda\big(k^T x + S \, \mathrm{sh}_\lambda(k^T x)\big)$$
Encoder training:
2nd order information is important for fast convergence.
Smooth shrinkage is important for preserving derivatives:
$$\frac{1}{\beta} \log\big(\exp(\beta b) + \exp(\beta s) - 1\big) - b$$
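The smooth shrinkage formula can be checked numerically. The sketch below evaluates it on the magnitude of s and restores the sign afterwards; that odd-symmetric extension to negative inputs is an assumption of this example, as is the name `smooth_shrink`.

```python
import math


def smooth_shrink(s, beta, b):
    """Smooth shrinkage (1/beta)*log(exp(beta*b) + exp(beta*|s|) - 1) - b.

    beta controls the sharpness of the kink and b is the threshold.
    The result carries the sign of s (an assumption of this sketch).
    Very large beta*|s| would overflow math.exp.
    """
    magnitude = math.log(math.exp(beta * b) + math.exp(beta * abs(s)) - 1.0) / beta - b
    return math.copysign(magnitude, s)


at_zero = smooth_shrink(0.0, beta=10.0, b=1.0)
far_out = smooth_shrink(5.0, beta=10.0, b=1.0)
```

The function passes through zero at s = 0 and approaches s - b for large s, like hard soft-thresholding, but its derivative is non-zero everywhere, so gradients still reach the encoder for inputs below the threshold.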

39 Convolutional Training
Inference and training are an order of magnitude more costly.
Efficient inference algorithms are crucial (ISTA, FISTA, CD).
64 filters = 64 times overcomplete representation.
Proper handling of border effects is important.
Test time is the same as for the patch-based model.

40 Convolutional PSD
Recognition performance on C101.
Low-level convolutional feature learning improves recognition.
Stages    Training    Patch Based     Convolutional
1         Unsup       52.2%           57.1%
1         Unsup       [illegible]%    57.6%
2         Unsup       63.7%           65.5%
2         Unsup       [illegible]%    66.3%

41 Pedestrian Detection
On INRIA.
[Plot: miss rate versus false positives per image.]
Methods and miss rates from the plot legend:
Shapelet orig (90.5%)
PoseInvSvm (68.6%)
VJ OpenCv (53.0%)
PoseInv (51.4%)
Shapelet (50.4%)
VJ (47.5%)
FtrMine (34.0%)
Pls (23.4%)
HOG (23.1%)
HikSvm (21.9%)
LatSvm V1 (17.5%)
MultiFtr (15.6%)
R+R+ (14.8%)
U+U+ (11.5%)
MultiFtr+CSS (10.9%)
LatSvm V2 (9.3%)
FPDW (9.3%)
ChnFtrs (8.7%)
Purely supervised training: 14.8% miss rate.
Unsupervised pre-training with Conv PSD + supervised refinement: 11.5% miss rate.
Close to state of the art and improving quickly...

42 Questions?
