Learning Convolutional Feature Hierarchies for Visual Recognition
Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, Yann LeCun
Computer Science Department, Courant Institute of Mathematical Sciences, New York University
Overview
- Feature Extractors
- Unsupervised Feature Learning
- Sparse Coding
- Convolutional Sparse Coding
- Efficient Predictors for Recognition
- Hierarchical Object Recognition
Object Recognition
- Feature extraction: Gabor, SIFT, HoG, color, combinations, ...
- Classification: PMK-SVM, linear, ...
- Grauman 05, Lazebnik 06, Serre 05, Mutch 06, ...
Object Recognition
Pipeline: feature extractor, then classifier.
- It would be better to learn everything, so that the system adapts to different domains.
- Learn the feature extractor and the classifier together.
Feature Extraction
Stages: filter bank, non-linearity, pooling.
- Can be based on unsupervised learning.
- Features should be efficient to extract.
- Overcomplete sparse representations are easily separable.
- Conventional sparse coding is slow.
Sparse Coding
Represent an input vector using an overcomplete dictionary: $x \approx Dz$, where x is the input, D the dictionary, and z the (sparse) representation.
- Number of dictionary elements > size of the input.
- Number of zero elements in z >> number of non-zero elements.
- Each x is represented as a linear combination of columns of D.
- How do we calculate z for a given x?
- How do we learn D?
Sparse Coding
1) Find the sparsest solution that satisfies a given reconstruction error $\epsilon$: $\min_{z_i} \|z_i\|_0 \ \text{s.t.}\ \|x_i - D z_i\|_2^2 \le \epsilon$
2) Find the best k-sparse representation that minimizes the reconstruction error: $\min_{z_i} \|x_i - D z_i\|_2^2 \ \text{s.t.}\ \|z_i\|_0 = k$
L0 minimization requires a search over supports, which is not tractable.
Sparse Coding
Matching pursuit algorithms offer a greedy solution [Mallat and Zhang 93]: greedily pick the dictionary element that reduces the residual most. Very fast, but unstable.
    function MP(Y, D, n)
        R = Y, z = 0
        for k = 1..n
            i = argmax_j |D_j^T R|
            c = D_i^T R
            z_i = z_i + c
            R = R - c D_i
        end
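
The pseudocode above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `matching_pursuit`, the random test dictionary, and the choice to accumulate coefficients when an element is selected more than once are assumptions, and the columns of `D` are assumed to have unit norm.

```python
import numpy as np

def matching_pursuit(y, D, n_iter):
    """Greedy matching pursuit (sketch); columns of D are assumed unit norm."""
    residual = np.array(y, dtype=float)
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        corr = D.T @ residual              # correlation of each element with the residual
        i = int(np.argmax(np.abs(corr)))   # element that reduces the residual most
        z[i] += corr[i]                    # accumulate: an element may be picked again
        residual -= corr[i] * D[:, i]      # remove its contribution from the residual
    return z, residual

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)             # unit-norm dictionary elements
z_true = np.zeros(64)
z_true[[3, 17, 42]] = [1.5, -2.0, 0.7]
y = D @ z_true
z, r = matching_pursuit(y, D, n_iter=10)
print("non-zeros:", np.count_nonzero(z), "residual norm:", np.linalg.norm(r))
```

Each iteration keeps the identity `y = D @ z + residual`, and with unit-norm elements the residual norm never increases.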
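Sparse Coding
$\min_z \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
Here x is the input, z the code, D the dictionary, and $\lambda$ the sparsity weight.
- D is given; search for the optimal z.
- The objective is reconstruction + sparsity.
- This defines a mapping $f : x \to z$.
- For every input x, an optimization is required to get z.
- Chen 98, Beck 09, Li 09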
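
To make the per-input cost concrete, here is a minimal sketch of one standard solver for this objective, the iterative shrinkage-thresholding algorithm (ISTA). It is an illustrative choice, not necessarily the solver used in the talk; the names `ista`, `soft_threshold`, and `energy`, and the iteration count, are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def energy(x, D, z, lam):
    """0.5 * ||x - D z||^2 + lam * sum_i |z_i|."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, D, lam, n_iter=200):
    """Minimize the sparse coding energy over z for a fixed dictionary D."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the reconstruction gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)           # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)
x = rng.standard_normal(16)
z = ista(x, D, lam=0.2)
print("energy:", energy(x, D, z, 0.2), "non-zeros:", np.count_nonzero(z))
```

The loop has to run for every input, which is the cost that a trained predictor is meant to avoid.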
Sparse Modeling
$\min_{D, z} \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
- D is learned from data.
- D has to be bounded to avoid trivial solutions.
- Online or batch algorithms can be used to update the dictionary.
- Learn the mapping $f_D : x \to z$.
- Olshausen and Field 97, Aharon 06, Lee 07, Ranzato 07, Kavukcuoglu 08, Zeiler 10, ...
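Sparse Modeling
Per-sample energy: $E(x, z, D) = \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$
Loss: $L(X, D) = \frac{1}{|X|} \sum_{x \in X} \min_z E(x, z, D)$
For each sample:
1. Do inference: minimize $E(x, z, D)$ with respect to z (sparse coding).
2. Update the parameters: $D \leftarrow D - \eta \, \partial E / \partial D$.
3. Constrain the elements (columns) of D to have unit norm.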
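
The three steps can be sketched as an online learning loop. This is a minimal illustration under stated assumptions: inference uses ISTA (one possible solver), and the function names, learning rate `eta`, epoch count, and random initialization are hypothetical choices, not values from the talk.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def infer_code(x, D, lam, n_iter=50):
    """Step 1: minimize E(x, z, D) over z with ISTA (an assumed solver)."""
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - D.T @ (D @ z - x) / L, lam / L)
    return z

def normalize_columns(D):
    """Step 3: rescale every dictionary element to unit norm."""
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

def learn_dictionary(X, n_atoms, lam=0.1, eta=0.05, n_epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    D = normalize_columns(rng.standard_normal((X.shape[1], n_atoms)))
    for _ in range(n_epochs):
        for x in X:
            z = infer_code(x, D, lam)            # 1. inference
            grad_D = -np.outer(x - D @ z, z)     # dE/dD for the reconstruction term
            D = D - eta * grad_D                 # 2. gradient step on D
            D = normalize_columns(D)             # 3. unit-norm constraint
    return D

# Hypothetical usage: 20 random 8-dimensional samples, 12 dictionary elements.
X = np.random.default_rng(1).standard_normal((20, 8))
D = learn_dictionary(X, n_atoms=12, n_epochs=2)
print(D.shape, np.linalg.norm(D, axis=0).round(6))
```

The sparsity term does not depend on D, so only the reconstruction term contributes to the gradient in step 2.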
Sparse Modeling
Two problems:
1. Inference takes a long time. Remedy: train a predictor function.
2. Patch-based modeling produces redundant features. Remedy: use convolutional sparse modeling.
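Predictive Sparse Decomposition
$\min \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i| + \|z - C(x; K)\|_2^2$
Predictor: $z_j \approx C_j(x; K) = g_j \tanh(k_j x)$, where $k_j$ is the j-th encoder filter and $g_j$ its gain.
Learning. For each sample from the data:
1. Fix K and D, and minimize the energy with respect to z to get the optimal z.
2. Using the optimal value of z, update D and K.
3. Rescale the elements of D to have unit norm.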
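
A minimal sketch of one such learning step for the patch-based case follows. Several things here are assumptions made for illustration: `k_j x` is taken to be the inner product of filter `k_j` (a row of `K`) with the patch, inference uses a proximal gradient loop started at the prediction, the gains `g` are updated together with `K`, and the step size `eta` is arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def predict_code(x, K, g):
    """C(x; K): z_j = g_j * tanh(k_j . x), with filters k_j as rows of K."""
    return g * np.tanh(K @ x)

def psd_energy(x, z, D, K, g, lam):
    recon = 0.5 * np.sum((x - D @ z) ** 2)
    sparsity = lam * np.sum(np.abs(z))
    prediction = np.sum((z - predict_code(x, K, g)) ** 2)
    return recon + sparsity + prediction

def infer_code(x, D, K, g, lam, n_iter=100):
    """Step 1: minimize the energy over z with K and D fixed."""
    c = predict_code(x, K, g)
    L = np.linalg.norm(D, 2) ** 2 + 2.0          # Lipschitz constant of the smooth terms
    z = c.copy()                                 # start from the prediction
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x) + 2.0 * (z - c)
        z = soft_threshold(z - grad / L, lam / L)
    return z

def psd_step(x, D, K, g, lam=0.1, eta=0.01):
    z = infer_code(x, D, K, g, lam)              # 1. optimal z
    t = np.tanh(K @ x)
    err = z - g * t                              # z - C(x; K)
    D_new = D + eta * np.outer(x - D @ z, z)     # 2. descent step on D ...
    K_new = K + eta * 2.0 * np.outer(err * g * (1.0 - t ** 2), x)  # ... on K ...
    g_new = g + eta * 2.0 * err * t              # ... and on the gains
    D_new = D_new / np.linalg.norm(D_new, axis=0)  # 3. unit-norm elements
    return z, D_new, K_new, g_new

# Hypothetical usage with random parameters.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)
K = 0.1 * rng.standard_normal((32, 16))
g = np.ones(32)
x = rng.standard_normal(16)
z, D, K, g = psd_step(x, D, K, g)
```

After training, `predict_code` alone gives an approximate code with a single matrix product and a non-linearity, with no optimization at test time.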
Predictive Sparse Decomposition
[Figure: learned encoder (K) and decoder (D) filters; 12x12 image patches, 256 dictionary elements.]
Predictive Sparse Decomposition
[Figure: learned encoder (K) and decoder (D) filters; 28x28 MNIST digit images, 200 dictionary elements. The filters are strokes for digit parts.]
Recognition Architecture
$C(x; K)$: filter bank + non-linearity + pooling, followed by a linear classifier.
- 64 filters
- Pinto 08
Recognition - C101
Optimal sparse codes (Feature Sign, Lee 07) vs. PSD features:
- PSD features perform slightly better.
- There is a naturally optimal point of sparsity.
- Beyond 64 features there is not much gain.
- PSD features are hundreds of times faster.
Redundancy in Feature Extraction
[Figure: filters are convolved with the image to produce feature maps.]
- Patch-based learning has to model the same structure at every location.
- The resulting filters produce highly redundant features.
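Convolutional PSD
$\tfrac{1}{2}\left\|\mathrm{mask}(x) - \sum_i D_i * z_i\right\|_2^2 + \|z\|_1 + \sum_i \|z_i - C(k_i * x)\|_2^2$
$x \in \mathbb{R}^{w \times h}$, $D \in \mathbb{R}^{K \times s \times s}$, $z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$
[Figure: filters learned with patch-based vs. convolutional training.]
Convolutional training yields a more diverse set of features.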
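
The dimensions above can be checked with a small NumPy sketch of the energy. It makes several simplifying assumptions: `mask(x)` is treated as the identity, the encoder non-linearity `C` defaults to `tanh`, and the helper names `conv2d_full` and `conv2d_valid` are hypothetical. The reconstruction uses a full convolution, so a feature map of size (w-s+1) x (h-s+1) maps back to a w x h image.

```python
import numpy as np

def conv2d_full(z, d):
    """Full 2-D convolution of a feature map z with an (s, s) filter d."""
    s = d.shape[0]
    out = np.zeros((z.shape[0] + s - 1, z.shape[1] + s - 1))
    for a in range(s):
        for b in range(s):
            out[a:a + z.shape[0], b:b + z.shape[1]] += d[a, b] * z
    return out

def conv2d_valid(x, k):
    """Valid 2-D convolution of an image x with an (s, s) filter k."""
    s = k.shape[0]
    ho, wo = x.shape[0] - s + 1, x.shape[1] - s + 1
    out = np.zeros((ho, wo))
    for a in range(s):
        for b in range(s):
            out += k[a, b] * x[s - 1 - a:s - 1 - a + ho, s - 1 - b:s - 1 - b + wo]
    return out

def conv_psd_energy(x, z, D, K, nonlin=np.tanh):
    """Reconstruction + L1 sparsity + prediction terms (mask taken as identity)."""
    recon = sum(conv2d_full(z_i, d_i) for z_i, d_i in zip(z, D))
    pred = np.stack([nonlin(conv2d_valid(x, k_i)) for k_i in K])
    return (0.5 * np.sum((x - recon) ** 2)
            + np.sum(np.abs(z))
            + np.sum((z - pred) ** 2))

# Hypothetical sizes: a 10x12 image, 4 filters of size 3x3.
rng = np.random.default_rng(0)
w, h, s, n_filters = 10, 12, 3, 4
x = rng.standard_normal((w, h))
D = rng.standard_normal((n_filters, s, s))
K = rng.standard_normal((n_filters, s, s))
z = rng.standard_normal((n_filters, w - s + 1, h - s + 1))
print(conv_psd_energy(x, z, D, K))
```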
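Convolutional PSD
Measuring the redundancy in the dictionary: cumulative histogram of the angle between every pair of dictionary elements.
Angle measure as printed on the slide: acos(abs(max(d_i D_j^T))), computed from the cross-correlation of elements i and j.
[Figure: number of cross-correlations with angle > deg (log scale, 10^0 to 10^4) vs. deg (0 to 80), for patch-based training and convolutional training.]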
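
A simplified version of this measurement can be sketched as follows. It is an illustration under assumptions: elements are flattened to columns of `D` and compared only at zero shift, so the maximum over cross-correlation shifts is omitted, and the function names are hypothetical.

```python
import numpy as np

def pairwise_angles_deg(D):
    """Angle in degrees between every pair of columns of D (zero shift only)."""
    Dn = D / np.linalg.norm(D, axis=0)             # normalize each element
    cos = np.abs(Dn.T @ Dn)                        # |cosine| between all pairs
    i, j = np.triu_indices(D.shape[1], k=1)        # each unordered pair once
    return np.degrees(np.arccos(np.clip(cos[i, j], 0.0, 1.0)))

def count_pairs_above(angles, thresholds):
    """Number of pairs whose angle exceeds each threshold (cumulative histogram)."""
    return np.array([int((angles > t).sum()) for t in thresholds])

# Hypothetical usage on a random dictionary of 64 elements of size 9x9.
rng = np.random.default_rng(0)
D = rng.standard_normal((81, 64))
angles = pairwise_angles_deg(D)
print(count_pairs_above(angles, thresholds=range(0, 90, 10)))
```

A dictionary with many near-duplicate elements has many pairs at small angles, so its curve drops earlier as the threshold increases.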
Convolutional PSD
Encoder training:
- Second-order information is important for fast convergence.
- Better sparse representations can be obtained by using a shrinkage operator.
- Smooth shrinkage is important for conserving derivatives; its parameters $\beta$ and $b$ are learned.
Smooth shrinkage of an input s: $\frac{1}{\beta} \log\left(\exp(\beta b) + \exp(\beta s) - 1\right) - b$
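
The smooth shrinkage formula can be evaluated directly, as in the sketch below. The extension to negative inputs through `sign(s)` and `|s|` is an assumption (the slide shows the expression for s only), and the function name `smooth_shrink` is hypothetical. The expression is written plainly, without safeguards against overflow for very large `beta * |s|`.

```python
import numpy as np

def smooth_shrink(s, beta, b):
    """(1/beta) * log(exp(beta*b) + exp(beta*|s|) - 1) - b, made odd via sign(s)."""
    s = np.asarray(s, dtype=float)
    a = np.abs(s)
    # expm1(v) = exp(v) - 1, which keeps the value accurate near s = 0
    mag = np.log(np.exp(beta * b) + np.expm1(beta * a)) / beta - b
    return np.sign(s) * mag

# Hypothetical parameter values; in the talk beta and b are learned.
s = np.linspace(-3.0, 3.0, 7)
print(smooth_shrink(s, beta=10.0, b=1.0))
```

The function is zero at s = 0, stays close to zero for |s| well below b, and approaches |s| - b for large |s|, so it behaves like a differentiable soft threshold whose sharpness is set by `beta`.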
Convolutional PSD
Recognition performance on C101: low-level convolutional feature learning improves accuracy significantly.
            Patch-Based SC    Convolutional SC
Unsup       52.2%             57.1%
Unsup+      54.2%             57.6%
Unsup+: unsupervised feature learning followed by supervised fine-tuning.
Multi-Stage Object Recognition
Building block of a multi-stage architecture: filter bank $C(x; K)$, non-linearities, pooling.
[Figure: x -> (filter bank, non-linearity, pooling) -> z1 -> (filter bank, non-linearity, pooling) -> z2. Each stage has unsupervised pre-training, followed by supervised refinement.]
Recognition Accuracy on Caltech 101 (%)
Stages    Training                     Patch-Based Training    Convolutional Training
1         Unsupervised                 52.2                    57.1
1         Unsupervised + Supervised    54.2                    57.6
2         Unsupervised                 63.7                    65.3
2         Unsupervised + Supervised    65.5                    66.3
Unsupervised pre-training with Convolutional PSD yields better accuracy than patch-based PSD.
Pedestrian Detection on INRIA
[Figure: miss rate vs. false positives per image (log scale, 10^-2 to 10^1). Legend: Shapelet orig (90.5%), PoseInvSvm (68.6%), VJ OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%).]
- Purely supervised training: 14.8% miss rate.
- Unsupervised pre-training with Conv PSD + supervised refinement: 11.5% miss rate.
- Close to state of the art and improving quickly...
Questions?