Learning Convolutional Feature Hierarchies for Visual Recognition

1 Learning Convolutional Feature Hierarchies for Visual Recognition
Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, Yann LeCun
Computer Science Department, Courant Institute of Mathematical Sciences, New York University

2 Overview
- Feature Extractors
- Unsupervised Feature Learning
- Sparse Coding
- Convolutional Sparse Coding
- Efficient Predictors for Recognition
- Hierarchical Object Recognition

3 Object Recognition
- Feature Extraction: Gabor, SIFT, HoG, Color, combinations...
- Classification: PMK-SVM, Linear,...
Grauman 05, Lazebnik 06, Serre 05, Mutch 06,...

4 Object Recognition
Feature Extractor -> Classifier
- It would be better to learn everything, so the system adapts to different domains
- Learn the feature extractor and the classifier together

5 Feature Extraction
Filterbank -> Non-linearity -> Pooling
- Can be based on unsupervised learning
- Should be efficient to extract features
- Overcomplete sparse representations are easily separable
- Conventional sparse coding is slow

6 Sparse Coding
Represent an input vector using an overcomplete dictionary:
$$X = Dz = \sum_j D_j z_j$$
- X: input; D: dictionary; z: representation (sparse)
- # of dictionary elements > size of input
- # of zero elements >> # of non-zero elements
- Each X is represented as a linear combination of the columns of D
- How do we calculate z for a given X?
- How do we learn D?

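7 Sparse Coding
1) Find the sparsest solution that satisfies a given reconstruction error $\epsilon$:
$$\min_z \|z\|_0 \quad \text{s.t.} \quad \left\|x - \sum_i D_i z_i\right\|_2^2 \le \epsilon$$
2) Find the best k-sparse representation, the one that minimizes the reconstruction error:
$$\min_z \left\|x - \sum_i D_i z_i\right\|_2^2 \quad \text{s.t.} \quad \|z\|_0 = k$$
L0 minimization requires a combinatorial search and is not tractable.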
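The second formulation above can be made concrete with a brute-force search, which also shows why it does not scale. This is a minimal sketch, not the authors' method: the function name `best_k_sparse` and the toy dictionary are hypothetical, and each candidate support is solved with ordinary least squares.

```python
import itertools
import numpy as np

def best_k_sparse(x, D, k):
    """Exhaustively search all supports of size k (illustrative only)."""
    best_err, best_z = np.inf, None
    for support in itertools.combinations(range(D.shape[1]), k):
        cols = list(support)
        # Least-squares fit of x using only the selected dictionary columns.
        coef, *_ = np.linalg.lstsq(D[:, cols], x, rcond=None)
        err = np.sum((x - D[:, cols] @ coef) ** 2)
        if err < best_err:
            best_err = err
            best_z = np.zeros(D.shape[1])
            best_z[cols] = coef
    return best_z, best_err

# Hypothetical toy problem: x is built from exactly two dictionary columns.
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 8))
x = 2.0 * D[:, 1] - 3.0 * D[:, 6]
z, err = best_k_sparse(x, D, k=2)
```

The loop visits every size-k subset of the dictionary columns, so the cost grows combinatorially with the dictionary size. That growth is the search the slide calls intractable.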

8 Sparse Coding
Matching Pursuit algorithms offer a greedy solution [Mallat and Zhang 93]
- Greedily pick the dictionary element that reduces the residual most
- Very fast, but unstable

Function MP(Y, D, n)
    R = Y, z = 0
    for k = 1..n
        i = argmax_j |D_j^T R|
        c = D_i^T R
        z_i = z_i + c
        R = R - c D_i
    end
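The greedy procedure above can be sketched in NumPy as follows. The sketch assumes the dictionary columns have unit L2 norm, selects atoms by absolute correlation, and accumulates coefficients in case an atom is picked more than once. The function name and the toy data are hypothetical.

```python
import numpy as np

def matching_pursuit(y, D, n_iter):
    """Greedy matching pursuit; assumes the columns of D have unit L2 norm."""
    residual = np.array(y, dtype=float)    # working copy of the input
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        corr = D.T @ residual              # correlation of each atom with the residual
        i = int(np.argmax(np.abs(corr)))   # atom that reduces the residual most
        z[i] += corr[i]                    # accumulate if the atom was picked before
        residual -= corr[i] * D[:, i]      # remove that atom's contribution
    return z, residual

# Hypothetical toy problem: 8-dimensional input, 16 random unit-norm atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))
D /= np.linalg.norm(D, axis=0)
y = 2.0 * D[:, 3] - 1.5 * D[:, 10]
z, r = matching_pursuit(y, D, n_iter=5)
```

At every iteration the identity `y = D @ z + r` holds, and with unit-norm atoms the residual norm never increases.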

9 Sparse Coding
$$\min_z \; \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$
- x: input; z: code; D: dictionary; $\lambda \sum_i |z_i|$: sparsity
- Reconstruction + Sparsity
- D is given, search for the optimal z
- A mapping f : x -> z
- For every input x, an optimization is required to get z
Chen 98, Beck 09, Li 09
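One standard way to solve this L1-regularized problem for a fixed D is the iterative shrinkage-thresholding algorithm (ISTA). The slide does not say which solver was used, so the following is only an illustrative sketch; the function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage: move each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def energy(x, z, D, lam):
    """Reconstruction error plus L1 sparsity penalty."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, D, lam, n_iter=200):
    """Minimize 0.5 * ||x - D z||^2 + lam * ||z||_1 over z by ISTA."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth term's gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)         # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam / L)
    return z
```

Every input needs its own run of this loop, which is the per-sample optimization cost noted on the slide.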

10 Sparse Modeling
$$\min \; \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$
- D is learned from data
- D has to be bounded to avoid trivial solutions
- Online or batch algorithms for updating the dictionary
- Learn the mapping $f_D : x \to z$
Olshausen and Field 97, Aharon 06, Lee 07, Ranzato 07, Kavukcuoglu 08, Zeiler 10,...

11 Sparse Modeling
Per-sample energy:
$$E(x, z, D) = \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$$
Loss:
$$L(x, D) = \frac{1}{|X|} \sum_{x \in X} E(x, z, D)$$
For each sample:
1. Do inference: minimize E(x, z, D) with respect to z (sparse coding)
2. Update parameters: $D \leftarrow D - \eta \, \partial E / \partial D$
3. Constrain the elements of D to be unit norm
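The three steps above can be sketched as an online training loop. This is a minimal illustration under stated assumptions: inference uses ISTA (the slide does not specify a solver), "elements of D" is read as the columns of D, and all function names, hyperparameter values and the random initialization are hypothetical.

```python
import numpy as np

def sparse_code(x, D, lam, n_iter=100):
    """Step 1: ISTA inference of z for a fixed dictionary D (assumed solver)."""
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        v = z - D.T @ (D @ z - x) / L
        z = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return z

def normalize_columns(D):
    """Step 3: rescale every dictionary column to unit L2 norm."""
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

def train_dictionary(X, n_atoms, lam=0.1, eta=0.05, n_epochs=5, seed=0):
    """Online dictionary learning; X holds one sample per row."""
    rng = np.random.default_rng(seed)
    D = normalize_columns(rng.standard_normal((X.shape[1], n_atoms)))
    for _ in range(n_epochs):
        for x in X:
            z = sparse_code(x, D, lam)             # 1. inference
            grad_D = -np.outer(x - D @ z, z)       # dE/dD for the reconstruction term
            D = normalize_columns(D - eta * grad_D)  # 2. gradient step, 3. renormalize
    return D
```

Renormalizing after every step keeps D bounded, which prevents the trivial solution where D grows while z shrinks.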

12 Sparse Modeling
Two problems:
1. Inference takes a long time -> train a predictor function
2. Patch-based modeling produces redundant features -> use convolutional sparse modeling

13 Predictive Sparse Decomposition
$$\min \; \frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i| + \|z - C(x; K)\|_2^2$$
Predictor C(x; K), per component: $z_j = g_j \tanh(k_j x)$
Learning. For each sample from the data, do:
1. Fix K and D, minimize to get the optimal z
2. Using the optimal value of z, update D and K
3. Scale the elements of D to be unit norm
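One training step of this procedure can be sketched as follows. Several details are assumptions rather than facts from the slides: `k_j x` is read as an inner product between row j of a matrix K and the patch x, the gains g are trained along with K, inference uses ISTA on the smooth part of the energy, and the updates are plain gradient steps with a hypothetical learning rate.

```python
import numpy as np

def encode(x, K, g):
    """Predictor C(x; K) with components g_j * tanh(k_j . x) (assumed inner product)."""
    return g * np.tanh(K @ x)

def psd_energy(x, z, D, K, g, lam):
    """Reconstruction + sparsity + prediction error for one sample."""
    return (0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))
            + np.sum((z - encode(x, K, g)) ** 2))

def infer_code(x, D, c, lam, n_iter=100):
    """Step 1: minimize the energy over z with D and the prediction c fixed."""
    L = np.linalg.norm(D, 2) ** 2 + 2.0     # Lipschitz constant of the smooth part
    z = c.copy()                            # start from the encoder's prediction
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x) + 2.0 * (z - c)
        v = z - grad / L
        z = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return z

def psd_step(x, D, K, g, lam=0.1, eta=0.01):
    """One sample: infer z, update D, K and g, then renormalize D."""
    z = infer_code(x, D, encode(x, K, g), lam)
    t = np.tanh(K @ x)
    err = g * t - z                                   # prediction error of the encoder
    D = D + eta * np.outer(x - D @ z, z)              # step 2: decoder gradient step
    K = K - eta * 2.0 * np.outer(err * g * (1.0 - t ** 2), x)
    g = g - eta * 2.0 * err * t
    D = D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # step 3: unit-norm columns
    return D, K, g, z
```

After training, only `encode` is needed at test time, which replaces the iterative inference with a single matrix product and a non-linearity.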

14 Predictive Sparse Decomposition
Encoder (K), Decoder (D)
12x12 image patches, 256 dictionary elements

15 Predictive Sparse Decomposition
Encoder (K), Decoder (D)
28x28 MNIST digit images, 200 dictionary elements
Strokes for digit parts

16 Recognition Architecture
Filterbank C(x; K) + Non-linearity + Pooling -> Linear classifier
64 filters
Pinto 08
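A feature extractor of this shape can be sketched as below. The slide names only the three stages, so the concrete choices here are assumptions: a tanh filterbank with gains g, absolute-value rectification as the non-linearity, and non-overlapping average pooling. The linear classifier is left out, and all names are hypothetical.

```python
import numpy as np

def correlate_valid(x, k):
    """Valid 2-D cross-correlation of image x with kernel k."""
    s, t = k.shape
    H, W = x.shape
    out = np.zeros((H - s + 1, W - t + 1))
    for i in range(s):
        for j in range(t):
            out += k[i, j] * x[i:i + H - s + 1, j:j + W - t + 1]
    return out

def average_pool(fmap, size):
    """Non-overlapping size x size average pooling (edges are cropped)."""
    H2, W2 = fmap.shape[0] // size, fmap.shape[1] // size
    cropped = fmap[:H2 * size, :W2 * size]
    return cropped.reshape(H2, size, W2, size).mean(axis=(1, 3))

def extract_features(x, K, g, pool=2):
    """Filterbank -> non-linearity -> pooling, one output map per filter."""
    maps = [np.abs(g_j * np.tanh(correlate_valid(x, k_j))) for k_j, g_j in zip(K, g)]
    return np.stack([average_pool(m, pool) for m in maps])
```

The output of `extract_features` would be flattened and passed to the linear classifier.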

17 Recognition - C101 (Caltech-101)
Optimal (Feature Sign, Lee 07) vs PSD features
- PSD features perform slightly better
- Naturally optimal point of sparsity
- After 64 features, not much gain
- PSD features are hundreds of times faster

18 Redundancy in Feature Extraction
Filters -> Convolve -> Feature maps
- Patch-based learning has to model the same structure at every location
- It therefore produces highly redundant features

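19 Convolutional PSD
$$\frac{1}{2}\left\|\mathrm{mask}(x) - \sum_i D_i * z_i\right\|_2^2 + |z|_1 + \sum_i \left\|z_i - C(k_i * x)\right\|_2^2$$
$$x \in \mathbb{R}^{w \times h}, \quad D \in \mathbb{R}^{K \times s \times s}, \quad z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$$
[Figure: filters learned with patch-based training vs. convolutional training]
Convolutional training yields a more diverse set of features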
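The dimensions on this slide can be checked with a small sketch: a valid convolution of a w x h image with an s x s kernel gives a (w-s+1) x (h-s+1) feature map, and a full convolution of that map with an s x s dictionary element returns to w x h. The code below is illustrative only. It omits the mask and the predictor term, and the function names are hypothetical.

```python
import numpy as np

def conv2d_full(a, k):
    """Full 2-D convolution: output is (H + s - 1) x (W + t - 1)."""
    H, W = a.shape
    s, t = k.shape
    out = np.zeros((H + s - 1, W + t - 1))
    for i in range(s):
        for j in range(t):
            out[i:i + H, j:j + W] += k[i, j] * a
    return out

def conv2d_valid(a, k):
    """Valid 2-D convolution: output is (H - s + 1) x (W - t + 1)."""
    H, W = a.shape
    s, t = k.shape
    kf = k[::-1, ::-1]                       # convolution flips the kernel
    out = np.zeros((H - s + 1, W - t + 1))
    for i in range(s):
        for j in range(t):
            out += kf[i, j] * a[i:i + H - s + 1, j:j + W - t + 1]
    return out

def conv_sparse_energy(x, D, z):
    """0.5 * ||x - sum_i D_i * z_i||^2 + |z|_1 (mask and predictor omitted)."""
    recon = sum(conv2d_full(z_i, D_i) for D_i, z_i in zip(D, z))
    return 0.5 * np.sum((x - recon) ** 2) + np.sum(np.abs(z))
```

Because each dictionary element is shared across all image locations, the model does not need separate elements for shifted copies of the same structure.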

20 Convolutional PSD
Measuring the redundancy in the dictionary
Cumulative histogram of the angle between every pair of dictionary elements:
acos(abs(max(D_i * D_j^T)))
[Plot: "# of cross corr > deg" against "deg"; curves: Patch Based Training, Convolutional Training]
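One reading of this redundancy measure is: normalize each dictionary element, take the largest absolute cross-correlation between two elements over all relative shifts, and convert it to an angle. The sketch below follows that reading. It is an interpretation of the garbled slide formula, not a confirmed implementation, and the function names are hypothetical.

```python
import numpy as np

def max_abs_cross_correlation(a, b):
    """Largest |cross-correlation| between kernels a and b over all shifts."""
    H, W = a.shape
    s, t = b.shape
    padded = np.zeros((H + 2 * (s - 1), W + 2 * (t - 1)))
    padded[s - 1:s - 1 + H, t - 1:t - 1 + W] = a     # zero-pad so b can slide fully over a
    best = 0.0
    for i in range(H + s - 1):
        for j in range(W + t - 1):
            best = max(best, abs(np.sum(padded[i:i + s, j:j + t] * b)))
    return best

def pairwise_angles_deg(D):
    """Angle in degrees between every pair of unit-normalized kernels."""
    D = np.asarray(D, dtype=float)
    D = D / np.linalg.norm(D.reshape(len(D), -1), axis=1)[:, None, None]
    angles = []
    for i in range(len(D)):
        for j in range(i + 1, len(D)):
            c = min(max_abs_cross_correlation(D[i], D[j]), 1.0)  # guard against rounding
            angles.append(np.degrees(np.arccos(c)))
    return np.array(angles)
```

Two elements that are shifted copies of each other get an angle of 0 degrees under this measure, which is the kind of redundancy that patch-based training tends to produce. Sorting the returned angles gives the cumulative histogram.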

21 Convolutional PSD
Encoder Training
- 2nd order information is important for fast convergence
- Better sparse representations can be obtained by using a shrinkage operator
- Smooth shrinkage is important for conserving derivatives; its parameters are learned
$$\frac{1}{\beta}\log\left(\exp(\beta b) + \exp(\beta s) - 1\right) - b$$
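The smooth shrinkage expression on this slide can be sketched in NumPy as follows. The slide shows the formula for a single input s; extending it to negative inputs with sign(s) and |s| is an assumption made here so that the function is odd like ordinary soft thresholding. Both beta and b are assumed positive, and the function name is hypothetical.

```python
import numpy as np

def smooth_shrink(s, beta, b):
    """Smooth shrinkage: (1/beta) * log(exp(beta*b) + exp(beta*|s|) - 1) - b, times sign(s).

    beta controls the sharpness and b the threshold. Large beta * |s| can
    overflow; this sketch does not guard against that.
    """
    s = np.asarray(s, dtype=float)
    # expm1(u) = exp(u) - 1, which keeps precision when |s| is close to zero.
    mag = np.log(np.exp(beta * b) + np.expm1(beta * np.abs(s))) / beta - b
    return np.sign(s) * mag
```

For large beta the function approaches hard soft-thresholding at b, while for small beta it stays smooth around the threshold, so gradients with respect to beta and b remain informative during training.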

22 Convolutional PSD
Recognition Performance on C101
Low-level convolutional feature learning improves performance significantly

          Patch Based SC   Convolutional SC
Unsup     52.2%            57.1%
Unsup+    54.2%            57.6%

Unsup+: unsupervised feature learning followed by supervised fine-tuning

23 Multi-Stage Object Recognition
x -> [Filter Bank -> Non-Linearity -> Pooling] -> z1 -> [Filter Bank -> Non-Linearity -> Pooling] -> z2
- Unsupervised pre-training of each stage
- Supervised refinement
- Building block of a multi-stage architecture: Filterbank C(x; K), non-linearities, pooling

24 Recognition Accuracy on Caltech
[Table: patch-based vs. convolutional training; unsupervised vs. unsupervised + supervised; 1 stage vs. 2 stages. Surviving values: 57.1, 63.7, 65.3; their cell positions are not recoverable.]
Unsupervised pre-training with Convolutional PSD yields better accuracy than patch-based PSD

25 Pedestrian Detection on INRIA
[Plot: miss rate vs. false positives per image. Legend, with the miss rate of each method:]
Shapelet orig (90.5%)
PoseInvSvm (68.6%)
VJ OpenCv (53.0%)
PoseInv (51.4%)
Shapelet (50.4%)
VJ (47.5%)
FtrMine (34.0%)
Pls (23.4%)
HOG (23.1%)
HikSvm (21.9%)
LatSvm V1 (17.5%)
MultiFtr (15.6%)
R+R+ (14.8%)
U+U+ (11.5%)
MultiFtr+CSS (10.9%)
LatSvm V2 (9.3%)
FPDW (9.3%)
ChnFtrs (8.7%)
- Purely supervised training: 14.8% miss rate
- Unsupervised pre-training with Conv PSD + supervised refinement: 11.5% miss rate
- Close to state of the art and improving quickly...

26 Questions?
