Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

Nobuyuki Morioka
The University of New South Wales & NICTA, Sydney, Australia

Shin'ichi Satoh
National Institute of Informatics, Tokyo, Japan

Abstract

Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, $\ell_1$ regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding to every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents Generalized Lasso based Approximation of Sparse coding (GLAS). By representing the distribution of sparse coefficients with the slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases, which is essential in an overcomplete setting. The experiments show that GLAS obtains performance comparable to $\ell_1$ regularized sparse coding, yet achieves a significant speed up, demonstrating its effectiveness for large-scale visual recognition problems.

1 Introduction

Recently, sparse coding [3, 18] has attracted much attention in computer vision research. Its applications range from image denoising [23] to image segmentation [17] and image classification [10, 24], achieving state-of-the-art results. Sparse coding interprets an input signal $x \in \mathbb{R}^{D \times 1}$ with a sparse vector $u \in \mathbb{R}^{K \times 1}$ whose linear combination with an overcomplete set of $K$ bases (i.e., $D \ll K$), also known as a dictionary $B \in \mathbb{R}^{D \times K}$, reconstructs the input as precisely as possible. To enforce sparseness on $u$, the $\ell_1$ norm is a popular choice due to its computational convenience and its interesting connection with the NP-hard $\ell_0$ norm in compressed sensing [2]. Several efficient $\ell_1$ regularized sparse coding algorithms have been proposed [4, 14] and are adopted in visual recognition [10, 24]. In particular, Yang et al. [24] compute the sparse codes of many local feature descriptors with sparse coding. However, because the $\ell_1$ norm is non-smooth convex, the sparse coding algorithm needs to optimize iteratively until convergence. Therefore, the local feature descriptor coding step becomes a major bottleneck for large-scale problems like visual recognition.
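To make the coding step concrete, here is a minimal sketch of the per-descriptor $\ell_1$-regularized coding problem, using scikit-learn's Lasso as a stand-in solver; the paper itself relies on feature-sign search [14], and the random dictionary, the penalty value `lam` and the `alpha` rescaling below are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(x, B, lam):
    """Solve min_u ||x - B u||^2 + lam * ||u||_1 for one descriptor x.

    B is a D x K dictionary. sklearn's Lasso minimizes
    (1/(2*D)) * ||x - B u||^2 + alpha * ||u||_1, hence the rescaling.
    """
    D = B.shape[0]
    solver = Lasso(alpha=lam / (2.0 * D), fit_intercept=False, max_iter=10000)
    solver.fit(B, x)
    return solver.coef_  # sparse code u of length K

# Toy usage: random dictionary with unit-norm bases (hypothetical data).
rng = np.random.default_rng(0)
D, K = 128, 1024
B = rng.standard_normal((D, K))
B /= np.linalg.norm(B, axis=0)          # enforce ||b_k||_2 = 1
x = rng.standard_normal(D)
u = sparse_code(x, B, lam=0.15)
print(np.count_nonzero(u), "active bases")
```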

The goal of this paper is to achieve state-of-the-art performance on large-scale visual recognition that is comparable to the work of Yang et al. [24], but with a significant improvement in efficiency. To this end, we propose Generalized Lasso based Approximation of Sparse coding, GLAS for short. Specifically, we encode the distribution of each dimension in sparse codes with the slice transform representation [9] and learn a piece-wise linear mapping function with the generalized lasso [21] to obtain the best fit to $\ell_1$ regularized sparse coding. We further propose an efficient post-refinement procedure to capture the dependency between overcomplete bases. The effectiveness of our approach is demonstrated on several challenging object and scene category datasets, showing performance comparable to Yang et al. [24] and better than other fast algorithms that obtain sparse codes [22].

While there have been several supervised dictionary learning methods for sparse coding that obtain more discriminative sparse representations [16, 25], they have not been evaluated on visual recognition with many object categories due to their computational challenges. Furthermore, Ranzato et al. [19] have empirically shown that unsupervised learning of visual features can obtain a more general and effective representation. Therefore, in this paper, we focus on learning a fast approximation of sparse coding in an unsupervised manner.

The paper is organized as follows: Section 2 reviews related work, including the linear spatial pyramid combined with sparse coding and other fast algorithms for obtaining sparse codes. Section 3 presents GLAS. This is followed by experimental results on several challenging categorization datasets in Section 4. Section 5 concludes the paper with discussion and future work.

2 Related Work

2.1 Linear Spatial Pyramid Matching Using Sparse Coding

This section reviews the linear spatial pyramid matching based on sparse coding by Yang et al. [24]. Given a collection of $N$ local feature descriptors randomly sampled from training images, $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$, an overcomplete dictionary $B = [b_1, b_2, \ldots, b_K] \in \mathbb{R}^{D \times K}$ is learned by

$$\min_{B, U} \sum_{i=1}^{N} \|x_i - B u_i\|_2^2 + \lambda \|u_i\|_1 \quad \text{s.t.} \quad \|b_k\|_2^2 \le 1, \; k = 1, 2, \ldots, K. \quad (1)$$

The cost function above is a combination of the reconstruction error and the $\ell_1$ sparsity penalty, which is controlled by $\lambda$. The $\ell_2$ norm of each $b_k$ is constrained to be less than or equal to 1 to avoid a trivial solution. Since both $B$ and $U = [u_1, u_2, \ldots, u_N]$ are unknown a priori, an alternating optimization technique is often used [14] to optimize over the two parameter sets.

Under the spatial pyramid matching framework, each image is divided into a set of sub-regions $r = [r_1, r_2, \ldots, r_R]$. For example, if $1 \times 1$, $2 \times 2$ and $4 \times 4$ partitions are used on an image, we have 21 sub-regions. Then, we compute the sparse solutions of all local feature descriptors appearing in each sub-region $r_j$, denoted as $U_{r_j}$, by

$$\min_{U_{r_j}} \|X_{r_j} - B U_{r_j}\|_2^2 + \lambda \|U_{r_j}\|_1. \quad (2)$$

The sparse solutions are max pooled for each sub-region and concatenated with those of the other sub-regions to build a statistic of the image (see the sketch at the end of this subsection):

$$h = [\max(|U_{r_1}|)^\top, \max(|U_{r_2}|)^\top, \ldots, \max(|U_{r_R}|)^\top]^\top, \quad (3)$$

where $\max(\cdot)$ finds the maximum value in each row of a matrix and returns a column vector. Finally, a linear SVM is trained on the set of image statistics for classification.

The main advantage of using sparse coding is that state-of-the-art results can be achieved with a simple linear classifier, as reported in [24]. Compared to kernel-based methods, this dramatically speeds up training and testing of the classifier. However, the step of finding a sparse code for each local descriptor now becomes the major bottleneck. Using the efficient sparse coding algorithm based on feature-sign search [14], the time to compute the solution for one local descriptor $u$ is $O(KZ)$, where $Z$ is the number of non-zeros in $u$. This paper proposes an approximation method whose time complexity reduces to $O(K)$. With the post-refinement procedure, its time complexity is $O(K + Z^2)$, which is still much lower than $O(KZ)$.
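The pooling step of Eqns. (2)-(3) reduces to a few lines of array code. The sketch below assumes the codes $U$ and the per-region descriptor indices are already available; names such as `region_ids` are hypothetical.

```python
import numpy as np

def spatial_max_pool(U, region_ids, num_regions):
    """Eqn. (3): max-pool absolute sparse codes within each sub-region.

    U          : K x N matrix of codes for N descriptors (Eqn. (2)).
    region_ids : list of index arrays, one per pyramid sub-region
                 (21 regions for 1x1 + 2x2 + 4x4 partitions).
    Returns the concatenated K * num_regions image statistic h.
    """
    K = U.shape[0]
    h = np.zeros(K * num_regions)
    for j in range(num_regions):
        cols = region_ids[j]
        if len(cols) > 0:
            # max over rows of the absolute coefficients in region j
            h[j * K:(j + 1) * K] = np.abs(U[:, cols]).max(axis=1)
    return h
```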
2.2 Predictive Sparse Decomposition

Predictive sparse decomposition (PSD), described in [10, 11], is a feedforward network that applies a non-linear mapping function to linearly transformed input data to match the optimal sparse coding solution as accurately as possible. The feedforward network is defined as $\hat{u}_i = G g(W x_i, \theta)$, where $g(z, \theta)$ denotes a non-linear parametric mapping function which can be of any form; common choices include the hyperbolic tangent, $\tanh(z + \theta)$, and soft shrinkage, $\mathrm{sign}(z) \max(|z| - \theta, 0)$. The function is applied to the linearly transformed data $W x_i$, and the result is subsequently scaled by a diagonal matrix $G$.
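A minimal sketch of this feedforward coder with the soft shrinkage non-linearity follows, assuming the parameters $W$, $G$ and $\theta$ have already been trained; the array shapes are our own convention.

```python
import numpy as np

def psd_encode(X, W, G_diag, theta):
    """PSD feedforward coder u_hat = G g(W x, theta) with soft shrinkage.

    X      : D x N input descriptors.
    W      : K x D learned filter bank.
    G_diag : length-K diagonal of the scaling matrix G.
    theta  : length-K shrinkage thresholds.
    """
    Z = W @ X                                            # K x N responses
    shrunk = np.sign(Z) * np.maximum(np.abs(Z) - theta[:, None], 0.0)
    return G_diag[:, None] * shrunk                      # scale by diagonal G
```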

Given training samples $\{x_i\}_{i=1}^{N}$, the parameters can be estimated either jointly with or separately from the dictionary $B$. When learning jointly, we minimize the cost function given below:

$$\min_{B, G, W, \theta, U} \sum_{i=1}^{N} \|x_i - B u_i\|_2^2 + \lambda \|u_i\|_1 + \gamma \|u_i - G g(W x_i, \theta)\|_2^2. \quad (4)$$

When learning separately, $B$ and $U$ are obtained with Eqn. (1) first. Then, the remaining parameters $G$, $W$ and $\theta$ are estimated by minimizing the last term of Eqn. (4) only. Gregor and LeCun [7] have later proposed a better, but iterative, approximation scheme for $\ell_1$ regularized sparse coding.

One downside of the parametric approach is that its accuracy largely depends on how well the parametric function fits the target statistical distribution, as argued by Hel-Or and Shaked [9]. This paper explores a non-parametric approach which can fit any distribution as long as the available data samples are representative. The advantage of our approach over the parametric approach is that we do not need to seek an appropriate parametric function for each distribution. This is particularly useful in visual recognition with multiple feature types, as the function form for each feature type is estimated automatically from data. We demonstrate this with two different local descriptor types in our experiments.

2.3 Locality-constrained Linear Coding

Another notable work that overcomes the bottleneck of the local descriptor coding step is locality-constrained linear coding (LLC) proposed by Wang et al. [22], a fast version of local coordinate coding [26]. Given a local feature descriptor $x_i$, LLC searches for the $M$ nearest dictionary bases of $x_i$; these nearest bases stacked in columns are denoted as $B_{\phi_i} \in \mathbb{R}^{D \times M}$, where $\phi_i$ is the index list of the bases. Then, the coefficients $u_{\phi_i} \in \mathbb{R}^{M \times 1}$ whose linear combination with $B_{\phi_i}$ reconstructs $x_i$ are solved for by:

$$\min_{u_{\phi_i}} \|x_i - B_{\phi_i} u_{\phi_i}\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top u_{\phi_i} = 1. \quad (5)$$

This is a least squares problem which can be solved quite efficiently. The final sparse code $u_i$ is obtained by setting its elements indexed by $\phi_i$ to $u_{\phi_i}$. The time complexity of LLC is $O(K + M^2)$, excluding the time required to find the $M$ nearest neighbours. While it is fast, the resulting sparse solutions are not as discriminative as the ones obtained by sparse coding. This may be because $M$ is fixed across all local feature descriptors: some descriptors may need more bases for accurate representation, while others may need fewer bases for more distinctiveness. In contrast, the number of bases selected by our post-refinement procedure to handle the mutual inhibition differs for each local descriptor.
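For reference, Eqn. (5) admits a closed-form solution on the local dictionary; the sketch below follows the standard covariance-trick derivation used by LLC implementations, with the small ridge term added for numerical stability being our own assumption.

```python
import numpy as np

def llc_encode(x, B, M=5):
    """LLC code (Eqn. (5)): least squares over the M nearest bases
    with a sum-to-one constraint, solved in closed form."""
    K = B.shape[1]
    dists = np.linalg.norm(B - x[:, None], axis=0)
    phi = np.argsort(dists)[:M]                 # indices of nearest bases
    Bp = B[:, phi]                              # D x M local dictionary
    C = (Bp - x[:, None]).T @ (Bp - x[:, None]) # M x M data covariance
    C += 1e-8 * np.trace(C) * np.eye(M)         # ridge for stability
    w = np.linalg.solve(C, np.ones(M))
    w /= w.sum()                                # enforce 1^T u = 1
    u = np.zeros(K)
    u[phi] = w
    return u
```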
3 Generalized Lasso based Approximation of Sparse Coding

This section describes GLAS. We first learn a dictionary from a collection of local feature descriptors as given in Eqn. (1). Then, based on the slice transform representation, we fit a piece-wise linear mapping function with the generalized lasso to approximate the optimal sparse solutions of the local feature descriptors under $\ell_1$ regularized sparse coding. Finally, we propose an efficient post-refinement procedure to perform the mutual inhibition.

3.1 Slice Transform Representation

The slice transform representation was introduced by Hel-Or and Shaked [9] as a way to discretize a function space so as to fit a piece-wise linear function for the purpose of image denoising, and was later adopted by Adler et al. [1] for single image super-resolution. In this paper, we utilise the representation to approximate sparse coding, so that sparse codes for local feature descriptors are obtained as fast as possible.

Given a local descriptor $x$, we linearly transform it with $B$ to obtain $z$. For the moment, we consider just one dimension of $z$, denoted as $z$, which is a real value lying in a half-open interval $[a, b)$. The interval is divided into $Q - 1$ equal-sized bins whose boundaries form a vector $q = [q_1, q_2, \ldots, q_Q]$ such that $a = q_1 < q_2 < \cdots < q_Q = b$.

Figure 1: Different approaches to fitting a piece-wise linear mapping function: regularized least squares (RLS) in red (see Eqn. (8)), the lasso (L1) in magenta (see Eqn. (9)), and GLAS in green (see Eqn. (10)). (a) All three methods achieve a good fit. (b) A case where L1 fails to extrapolate well at the end and RLS tends to align itself to $q$ (in black). (c) A case where data samples around 0.25 are removed artificially to illustrate that RLS fails to interpolate, as no neighboring prior is used. In contrast, GLAS can both interpolate and extrapolate well in the case of missing or noisy data.

The interval into which the value of $z$ falls is expressed as $\pi(z) = j$ if $z \in [q_{j-1}, q_j)$, and its corresponding residue is calculated by

$$r(z) = \frac{z - q_{\pi(z)-1}}{q_{\pi(z)} - q_{\pi(z)-1}}.$$

Based on the above, we can re-express $z$ as

$$z = (1 - r(z)) q_{\pi(z)-1} + r(z) q_{\pi(z)} = S_q(z) q, \quad (6)$$

where $S_q(z) = [0, \ldots, 0, 1 - r(z), r(z), 0, \ldots, 0]$. If we now come back to the multivariate case of $z = B^\top x$, then we have $z = [S_q(z_1) q, S_q(z_2) q, \ldots, S_q(z_K) q]^\top$, where $z_k$ denotes the $k$-th dimension of $z$. We then replace the boundary vector $q$ with per-basis vectors $p = \{p_1, p_2, \ldots, p_K\}$ such that the resulting vector approximates the optimal sparse solution of $x$ under $\ell_1$ regularized sparse coding as closely as possible. This is written as

$$\hat{u} = [S_q(z_1) p_1, S_q(z_2) p_2, \ldots, S_q(z_K) p_K]^\top. \quad (7)$$

Hel-Or and Shaked [9] have formulated the problem of learning each $p_k$ as regularized least squares, either independently in a transform domain or jointly in a spatial domain. Unlike their setting, we have a significantly larger number of bases, which makes joint optimization of all $p_k$ difficult. Moreover, since we are interested in approximating the sparse solutions, which live in the transform domain, we learn each $p_k$ independently. Given $N$ local descriptors $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$ and their corresponding sparse solutions $U = [u_1, u_2, \ldots, u_N] = [y_1, y_2, \ldots, y_K]^\top \in \mathbb{R}^{K \times N}$ obtained with $\ell_1$ regularized sparse coding, we have the optimization problem

$$\min_{p_k} \|y_k - S_k p_k\|_2^2 + \alpha \|q - p_k\|_2^2, \quad (8)$$

where $S_k = S_q(z_k)$ stacks the slice transform rows of the $k$-th responses over all $N$ samples. The regularization of the second term is essential to avoid singularity when computing the inverse, and its consequence is that $p_k$ is encouraged to align itself to $q$ when not many data samples are available. This might be a reasonable prior for image denoising [9], but it is not desirable for approximating sparse coding, as we would like to suppress most of the coefficients in $u$ to zero. Figure 1 shows, for one dimension, the distribution of sparse coefficients against $z$ obtained from a collection of SIFT descriptors; $q$ does not resemble this distribution. This motivates us to look at the generalized lasso [21] as an alternative for obtaining a better fit of the distribution of the coefficients.
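Once the lookup tables $p_k$ are learned (Section 3.2), the approximate coding step is just binning plus linear interpolation. The sketch below implements Eqn. (7) in the $O(K)$ lookup form derived later as Eqn. (13), followed by the $\ell_1$ normalization used in Section 3.2; how out-of-range responses are handled (linear extrapolation from the boundary bins here) is our own choice.

```python
import numpy as np

def glas_encode(x, B, q, P):
    """Approximate sparse coding by per-dimension piece-wise linear maps.

    B : D x K dictionary, q : length-Q shared bin boundaries,
    P : K x Q learned mapping tables (one row p_k per basis).
    Implements Eqn. (13): linear interpolation of p_k at z_k = (B^T x)_k.
    """
    z = B.T @ x
    j = np.clip(np.searchsorted(q, z, side='right'), 1, len(q) - 1)  # pi(z)
    r = (z - q[j - 1]) / (q[j] - q[j - 1])                           # r(z)
    k = np.arange(len(z))
    u = (1.0 - r) * P[k, j - 1] + r * P[k, j]
    norm = np.abs(u).sum()
    return u / norm if norm > 0 else u          # l1 normalization
```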

3.2 Generalized Lasso

In the previous section, we argued that regularized least squares as stated in Eqn. (8) does not give the desired result; instead, most intervals need to be set to zero. This naturally leads us to consider $\ell_1$ regularization, also known as the lasso, which is formulated as:

$$\min_{p_k} \|y_k - S_k p_k\|_2^2 + \alpha \|p_k\|_1. \quad (9)$$

However, the drawback is that the learnt piece-wise linear function may become unstable when training data is noisy or missing, as illustrated in Figure 1 (b) and (c). It turns out that $\ell_1$ trend filtering [12], a special case of the generalized lasso [21], can overcome this problem. This is expressed as

$$\min_{p_k} \|y_k - S_k p_k\|_2^2 + \alpha \|D p_k\|_1, \quad (10)$$

where $D \in \mathbb{R}^{(Q-2) \times Q}$ is referred to as a penalty matrix and is the second-order difference operator

$$D = \begin{bmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{bmatrix}. \quad (11)$$

To solve the above optimization problem, we can turn it into a standard lasso problem [21]. Since $D$ is not invertible, the key is to augment $D$ with $A \in \mathbb{R}^{2 \times Q}$ to build a square matrix $\tilde{D} = [D; A] \in \mathbb{R}^{Q \times Q}$ such that $\mathrm{rank}(\tilde{D}) = Q$ and the rows of $A$ are orthogonal to the rows of $D$. To satisfy these constraints, $A$ can for example be set to $[1, 2, \ldots, Q; 2, 3, \ldots, Q+1]$. If we let $\theta = [\theta_1; \theta_2] = \tilde{D} p_k$ where $\theta_1 = D p_k$ and $\theta_2 = A p_k$, then $S_k p_k = S_k \tilde{D}^{-1} \theta = S_{k1} \theta_1 + S_{k2} \theta_2$, where $[S_{k1}, S_{k2}]$ is the corresponding column partition of $S_k \tilde{D}^{-1}$. After some substitutions, we see that $\theta_2$ can be solved by $\theta_2 = (S_{k2}^\top S_{k2})^{-1} S_{k2}^\top (y_k - S_{k1} \theta_1)$, given that $\theta_1$ is solved already. Now, to solve $\theta_1$, we have the following lasso problem:

$$\min_{\theta_1} \|(I - P) y_k - (I - P) S_{k1} \theta_1\|_2^2 + \alpha \|\theta_1\|_1, \quad (12)$$

where $P = S_{k2} (S_{k2}^\top S_{k2})^{-1} S_{k2}^\top$. Having computed both $\theta_1$ and $\theta_2$, we recover $p_k = \tilde{D}^{-1} \theta$. Further details can be found in [21].

Given the learnt $p$, we can approximate the sparse solution of $x$ by Eqn. (7). However, explicitly computing $S_q(z)$ and multiplying it by $p$ is somewhat redundant. Thus, we can alternatively compute each component of $\hat{u}$ as

$$\hat{u}_k = (1 - r(z_k)) \, p_k(\pi(z_k) - 1) + r(z_k) \, p_k(\pi(z_k)), \quad (13)$$

whose time complexity is $O(K)$. In Eqn. (13), since we essentially use each $p_k$ as a lookup table, the complexity is independent of $Q$. This is followed by $\ell_1$ normalization of $\hat{u}$.

While $\hat{u}$ can readily be used for the spatial max pooling stated in Eqn. (3), it does not yet capture any explaining-away effect, where the coefficients of correlated bases mutually inhibit each other to remove redundancy. This is because each $p_k$ is estimated independently in the transform domain [9]. In the next section, we propose an efficient post-refinement technique for mutual inhibition between the bases.

3.3 Capturing Dependency Between Bases

To handle the mutual inhibition between overcomplete bases, this section explains how to refine the sparse codes by solving regularized least squares on a significantly smaller active basis set. Given a local descriptor $x$ and its initial sparse code $\hat{u}$ estimated with the above method, we set the non-zero components of the code to be active. Denoting the set of these active components as $\phi$, we have $\hat{u}_\phi$ and $B_\phi$, the corresponding subsets of the sparse code and the dictionary bases respectively. The goal is to compute a refined code $\hat{v}_\phi$ such that $B_\phi \hat{v}_\phi$ reconstructs $x$ as accurately as possible. We formulate this as regularised least squares:

$$\min_{\hat{v}_\phi} \|x - B_\phi \hat{v}_\phi\|_2^2 + \beta \|\hat{v}_\phi - \hat{u}_\phi\|_2^2, \quad (14)$$

where $\beta$ is the weight of the regularisation. This is convex and has the analytical solution $\hat{v}_\phi = (B_\phi^\top B_\phi + \beta I)^{-1} (B_\phi^\top x + \beta \hat{u}_\phi)$. The intuition behind this formulation is that the initial sparse code $\hat{u}$ is a good starting point for refinement, from which the reconstruction error is further reduced by allowing redundant bases to compete against each other. Empirically, the number of active components for each $\hat{u}$ is substantially small compared to the whole basis set, hence the linear system to be solved is much smaller and computationally cheap. We also make sure that we do not deviate too much from the initial solution by introducing the regularization on $\hat{v}_\phi$. This refinement procedure may look similar to LLC [22]. However, in our case, we do not preset the number of active bases; it is determined by the non-zero components of $\hat{u}$. More importantly, we base our final solution on $\hat{u}$ and do not perform a nearest neighbor search. With this refinement procedure, the total time complexity becomes $O(K + Z^2)$. We refer to GLAS with this post-refinement procedure as GLAS+.
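Two hedged sketches of the pieces above. First, fitting one lookup table $p_k$ under Eqn. (10): rather than the paper's reduction to the lasso of Eqn. (12), this sketch hands the objective to the generic convex solver cvxpy, purely for clarity; the default `alpha` follows the experimental setting of Section 4.

```python
import numpy as np
import cvxpy as cp

def fit_lookup_table(y_k, S_k, alpha=0.1):
    """Fit one p_k by Eqn. (10) with the second-difference penalty D.

    y_k : length-N targets (k-th row of U from l1 sparse coding).
    S_k : N x Q slice-transform design matrix for basis k.
    The paper instead reduces Eqn. (10) to a lasso (Eqn. (12));
    a generic solver is used here only for exposition.
    """
    Q = S_k.shape[1]
    # (Q-2) x Q second-order difference matrix of Eqn. (11)
    D2 = (np.diag(np.ones(Q))[:-2]
          - 2 * np.diag(np.ones(Q - 1), 1)[:-2]
          + np.diag(np.ones(Q - 2), 2)[:-2])
    p = cp.Variable(Q)
    objective = cp.sum_squares(y_k - S_k @ p) + alpha * cp.norm1(D2 @ p)
    cp.Problem(cp.Minimize(objective)).solve()
    return p.value
```

Second, the post-refinement of Eqn. (14) on the active set, using the analytical solution given above; the default `beta` again follows Section 4.

```python
import numpy as np

def glas_refine(x, B, u_hat, beta=0.25):
    """Post-refinement (Eqn. (14)) on the active set of u_hat:
    v_phi = (B_phi^T B_phi + beta I)^{-1} (B_phi^T x + beta u_phi)."""
    phi = np.flatnonzero(u_hat)                 # active bases only
    if phi.size == 0:
        return u_hat
    Bp = B[:, phi]                              # D x Z sub-dictionary, Z small
    v_phi = np.linalg.solve(Bp.T @ Bp + beta * np.eye(phi.size),
                            Bp.T @ x + beta * u_hat[phi])
    v = np.zeros_like(u_hat)
    v[phi] = v_phi
    return v
```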

4 Experimental Results

This section evaluates GLAS and GLAS+ on several challenging categorization datasets. To learn the mapping function, we have used 50,000 local descriptors as data samples. The parameters $Q$, $\alpha$ and $\beta$ are fixed to 10, 0.1 and 0.25 respectively in all experiments, unless otherwise stated. For comparison, we have implemented the methods discussed in Section 2. SC is our re-implementation of Yang et al. [24]. LLC is locality-constrained linear coding proposed by Wang et al. [22]; the number of nearest neighbors is set to 5. PSD is predictive sparse decomposition [11]; the soft shrinkage function is used as its parametric mapping function. We also include KM, which builds its codebook with k-means clustering and adopts hard assignment as its local descriptor coding. For all methods, exactly the same local feature descriptors, spatial max pooling technique and linear SVM are used, so that only the local feature descriptor coding techniques are compared.

As for the descriptors, SIFT [15] and Local Self-Similarity [20] are used. SIFT is a histogram of gradient directions computed over an image patch, capturing appearance information; we have sampled a patch at every 8 pixels. In contrast, Local Self-Similarity computes the correlation between a small image patch of interest and its surrounding region, capturing the geometric layout of a local region. Spatial max pooling with $1 \times 1$, $2 \times 2$ and $4 \times 4$ image partitions is used. The implementation is all done in MATLAB for fair comparison.

Table 1: Recognition accuracy on Caltech-101 for SIFT (128 dim.) [15] and Local Self-Similarity (30 dim.) [20]. Rows compare 15 and 30 training images per class across KM, LLC [22], PSD [11], SC [24], GLAS and GLAS+, together with the time (sec) taken to process 1000 local descriptors. The dictionary sizes for all methods are set to 1024. [Numeric entries are not recoverable from the source text.]

4.1 Caltech-101

The Caltech-101 dataset [5] consists of 9144 images divided into 101 object categories. The images are scaled down, preserving their aspect ratios. We train with 15/30 images per class and test with 15 images per class. The dictionary size of each method is set to 1024 for both SIFT and Local Self-Similarity. The results are averaged over eight random training and testing splits and are reported in Table 1.

For SIFT, GLAS+ is consistently better than GLAS, demonstrating the effectiveness of the mutual inhibition performed by the post-refinement procedure. Both GLAS and GLAS+ perform better than the other fast algorithms that produce sparse codes. In addition, GLAS and GLAS+ perform competitively against SC; in fact, GLAS+ is slightly better when 30 training images per class are used. While the sparse codes for both GLAS and GLAS+ are learned from the solutions of SC, the approximated codes are not exactly the same as those of SC. Moreover, SC sometimes produces unstable codes due to the non-smooth convex property of the $\ell_1$ norm, as previously observed in [6]. In contrast, GLAS+ approximates its sparse codes with a relatively smooth piece-wise linear mapping function learned with the generalized lasso (note that the $\ell_1$ penalty discourages changes in the shape of the function) and performs smooth post-refinement. We suspect these differences may be contributing to the slightly better results of GLAS+ on this dataset.

Although PSD performs quite close to GLAS for SIFT, this is not the case for Local Self-Similarity. GLAS outperforms PSD, probably because the distribution of sparse codes is not captured well by a simple shrinkage function; GLAS may therefore be effective for a wider range of distributions. This is useful for recognition using multiple feature types where speed is critical. GLAS performs worse than SC, but GLAS+ closes the gap between GLAS and SC. We suspect that because Local Self-Similarity (30 dim.) is of relatively low dimension compared to SIFT (128 dim.), the mutual inhibition becomes more important. This might also explain why LLC performs reasonably well for this descriptor.

Table 1 also reports the computational time taken to process 1000 local descriptors with each method. GLAS and GLAS+ are slower than KM and PSD, but slightly faster than LLC and significantly faster than SC. This demonstrates the practical importance of our approach, where competitive recognition results are achieved with fast computation.

Figure 2: (a) Varying $Q$, the number of bins used to quantize the interval of each sparse code component. (b) Varying $\alpha$, the parameter that controls the weight of the penalty in the generalized lasso. (c) When some data samples are missing, GLAS is more robust than the regularized least squares of Eqn. (8).

Different values of $Q$, $\alpha$ and $\beta$ are evaluated one parameter at a time. Figure 2 (a) shows the results for different $Q$: the results are very stable beyond 10 bins, and since sparse codes are computed by Eqn. (13), the time complexity is unaffected by the choice of $Q$. Figure 2 (b) shows the results for different $\alpha$, which are also very stable, and we observe similar stability for $\beta$. We also validate whether the generalized lasso of Eqn. (10) is more robust than the regularized least squares solution of Eqn. (8) when some data samples are missing. When learning each $p_k$, we artificially remove data samples from an interval centered around a randomly sampled point, as also illustrated in Figure 1 (c). We evaluate with different numbers of data samples removed, expressed as percentages of the whole data sample set. The results are shown in Figure 2 (c): the performance of RLS drops significantly as the amount of missing data increases, whereas both GLAS and GLAS+ are not affected much.

4.2 Caltech-256

Caltech-256 [8] contains 30,607 images and 256 object categories in total. Like Caltech-101, we scale the images down, preserving their aspect ratios. The results are averaged over eight random training and testing splits and are reported in Table 2; we use 25 testing images per class. This time, for SIFT, GLAS performs slightly worse than SC, but GLAS+ outperforms SC, probably for the same reasons given for Caltech-101. For Local Self-Similarity, results similar to Caltech-101 are obtained. The performance of PSD is close to KM and is outperformed by GLAS, again suggesting an inadequate fit of the sparse codes. LLC performs slightly better than GLAS, but not better than GLAS+. While SC performs best, the performance of GLAS+ is quite close to it.
We also plot the computational time taken by each method against its achieved accuracy, on SIFT and Local Self-Similarity, in Figure 3 (a) and (b) respectively.

Table 2: Recognition accuracy on Caltech-256, in the same format as Table 1. The dictionary sizes are all set to 2048 for SIFT and 1024 for Local Self-Similarity. [Numeric entries are not recoverable from the source text.]

Figure 3: Computational time vs. average recognition accuracy. (a) and (b) are SIFT and Local Self-Similarity respectively, evaluated on Caltech-256 with 30 training images and the dictionary sizes of Table 2. (c) is SIFT evaluated on 15 Scenes with a dictionary size of 1024.

4.3 15 Scenes

The 15 Scenes dataset [13] contains 4485 images divided into 15 scene classes, ranging from indoor to outdoor scenes. 100 images per class are used for training and the rest for testing. We used SIFT to learn 1024 dictionary bases for each method. The results are plotted against computational time in Figure 3 (c). The result of GLAS+ (80.6%) is very similar to that of SC (80.7%), yet the former is significantly faster. In summary, our approach works well on three different challenging datasets.

5 Conclusion

This paper has presented GLAS, an approximation of $\ell_1$ regularized sparse coding based on the generalized lasso. It is further extended with a post-refinement procedure to handle the mutual inhibition between bases, which is essential in an overcomplete setting. The experiments have shown competitive performance of GLAS against SC with a significant computational speed up. We have also demonstrated the effectiveness of GLAS on two local descriptor types, namely SIFT and Local Self-Similarity, where LLC and PSD each perform well on only one type. GLAS is not restricted to approximating $\ell_1$ regularized sparse coding, but should be applicable to other variants of sparse coding in general. For example, it may be interesting to apply it to Laplacian sparse coding [6], which achieves smoother sparse codes than $\ell_1$ sparse coding.

Acknowledgment

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] A. Adler, Y. Hel-Or, and M. Elad. A Shrinkage Learning Approach for Single Image Super-Resolution with Overcomplete Representations. In ECCV.
[2] D.L. Donoho. For Most Large Underdetermined Systems of Linear Equations the Minimal l1-norm Solution is also the Sparsest Solution. Communications on Pure and Applied Mathematics.
[3] D.L. Donoho and M. Elad. Optimally Sparse Representation in General (Nonorthogonal) Dictionaries via l1 Minimization. PNAS, 100(5).
[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least Angle Regression. Annals of Statistics.
[5] L. Fei-Fei, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In CVPR Workshop.
[6] S. Gao, W. Tsang, L. Chia, and P. Zhao. Local Features Are Not Lonely: Laplacian Sparse Coding for Image Classification. In CVPR.
[7] K. Gregor and Y. LeCun. Learning Fast Approximations of Sparse Coding. In ICML.
[8] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Technical Report, California Institute of Technology.
[9] Y. Hel-Or and D. Shaked. A Discriminative Approach for Wavelet Denoising. TIP.
[10] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the Best Multi-Stage Architecture for Object Recognition? In ICCV.
[11] K. Kavukcuoglu, M. Ranzato, and Y. LeCun. Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition. Technical Report CBLL-TR, Computational and Biological Learning Lab, Courant Institute, NYU.
[12] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. l1 Trend Filtering. SIAM Review.
[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In CVPR.
[14] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient Sparse Coding Algorithms. In NIPS.
[15] D.G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV.
[16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised Dictionary Learning. In NIPS.
[17] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation. In ECCV.
[18] B.A. Olshausen and D.J. Field. Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37.
[19] M. Ranzato, F.J. Huang, Y. Boureau, and Y. LeCun. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. In CVPR.
[20] E. Shechtman and M. Irani. Matching Local Self-Similarities across Images and Videos. In CVPR.
[21] R. Tibshirani and J. Taylor. The Solution Path of the Generalized Lasso. The Annals of Statistics.
[22] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained Linear Coding for Image Classification. In CVPR.
[23] J. Yang, J. Wright, T. Huang, and Y. Ma. Image Super-Resolution via Sparse Representation. TIP.
[24] J. Yang, K. Yu, Y. Gong, and T.S. Huang. Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification. In CVPR.
[25] J. Yang, K. Yu, and T. Huang. Supervised Translation-Invariant Sparse Coding. In CVPR.
[26] K. Yu, T. Zhang, and Y. Gong. Nonlinear Learning Using Local Coordinate Coding. In NIPS.


Extended Dictionary Learning : Convolutional and Multiple Feature Spaces Extended Dictionary Learning : Convolutional and Multiple Feature Spaces Konstantina Fotiadou, Greg Tsagkatakis & Panagiotis Tsakalides kfot@ics.forth.gr, greg@ics.forth.gr, tsakalid@ics.forth.gr ICS-

More information

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012)

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012) Three things everyone should know to improve object retrieval Relja Arandjelović and Andrew Zisserman (CVPR 2012) University of Oxford 2 nd April 2012 Large scale object retrieval Find all instances of

More information

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Presented by Hu Han Jan. 30 2014 For CSE 902 by Prof. Anil K. Jain: Selected

More information

arxiv: v1 [cs.cv] 2 Sep 2013

arxiv: v1 [cs.cv] 2 Sep 2013 A Study on Unsupervised Dictionary Learning and Feature Encoding for Action Classification Xiaojiang Peng 1,2, Qiang Peng 1, Yu Qiao 2, Junzhou Chen 1, and Mehtab Afzal 1 1 Southwest Jiaotong University,

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Codebook Graph Coding of Descriptors

Codebook Graph Coding of Descriptors Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'5 3 Codebook Graph Coding of Descriptors Tetsuya Yoshida and Yuu Yamada Graduate School of Humanities and Science, Nara Women s University, Nara,

More information

REJECTION-BASED CLASSIFICATION FOR ACTION RECOGNITION USING A SPATIO-TEMPORAL DICTIONARY. Stefen Chan Wai Tim, Michele Rombaut, Denis Pellerin

REJECTION-BASED CLASSIFICATION FOR ACTION RECOGNITION USING A SPATIO-TEMPORAL DICTIONARY. Stefen Chan Wai Tim, Michele Rombaut, Denis Pellerin REJECTION-BASED CLASSIFICATION FOR ACTION RECOGNITION USING A SPATIO-TEMPORAL DICTIONARY Stefen Chan Wai Tim, Michele Rombaut, Denis Pellerin Univ. Grenoble Alpes, GIPSA-Lab, F-38000 Grenoble, France ABSTRACT

More information

Combining Selective Search Segmentation and Random Forest for Image Classification

Combining Selective Search Segmentation and Random Forest for Image Classification Combining Selective Search Segmentation and Random Forest for Image Classification Gediminas Bertasius November 24, 2013 1 Problem Statement Random Forest algorithm have been successfully used in many

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Multiple Kernel Learning for Emotion Recognition in the Wild

Multiple Kernel Learning for Emotion Recognition in the Wild Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009 Analysis: TextonBoost and Semantic Texton Forests Daniel Munoz 16-721 Februrary 9, 2009 Papers [shotton-eccv-06] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context

More information

Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning

Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning Izumi Suzuki, Koich Yamada, Muneyuki Unehara Nagaoka University of Technology, 1603-1, Kamitomioka Nagaoka, Niigata

More information

Learning Dictionaries of Discriminative Image Patches

Learning Dictionaries of Discriminative Image Patches Downloaded from orbit.dtu.dk on: Nov 22, 2018 Learning Dictionaries of Discriminative Image Patches Dahl, Anders Bjorholm; Larsen, Rasmus Published in: Proceedings of the British Machine Vision Conference

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction

Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction Atul Kanaujia, CBIM, Rutgers Cristian Sminchisescu, TTI-C Dimitris Metaxas,CBIM, Rutgers 3D Human Pose Inference Difficulties Towards

More information