Learning Discriminative and Shareable Features for Scene Classification


Learning Discriminative and Shareable Features for Scene Classification

Zhen Zuo, Gang Wang, Bing Shuai, Lifan Zhao, Qingxiong Yang, and Xudong Jiang

Nanyang Technological University, Singapore; Advanced Digital Sciences Center, Singapore; City University of Hong Kong

Abstract. In this paper, we propose to learn a discriminative and shareable feature transformation filter bank that transforms local image patches (represented as raw pixel values) into features for scene image classification. The learned filters are expected to: (1) encode common visual patterns of a flexible number of categories; (2) encode discriminative and class-specific information. For each category, a subset of the filters is activated in a data-adaptive manner, while sharing of filters among different categories is also allowed. The discriminative power of the filter bank is further enhanced by enforcing features from the same category to be close to each other in the feature space, and features from different categories to be far away from each other. Experimental results on three challenging scene image classification datasets indicate that our features achieve very promising performance. Furthermore, our features are strongly complementary to the state-of-the-art ConvNets feature.

Keywords: Feature learning, Discriminant analysis, Information sharing, Scene Classification

1 Introduction

Generating robust, informative, and compact local features has been considered one of the most critical factors for good performance in computer vision. In the last decade, numerous hand-crafted features, such as SIFT [1] and HOG [2], have ruled the local image representation area. Recently, a number of papers [3-9] have been published that learn feature representations directly from pixel values, aiming to extract data-adaptive features that better fit the data. However, most of these works operate in an unsupervised way without considering class label information. We argue that extracting discriminative features is important for classification: information on local patches is usually redundant, and the features that are discriminative for classification should be extracted. In this paper, we develop a method to learn a transformation filter bank that transforms the pixel values of local image patches into features, which we call Discriminative and Shareable Feature Learning (DSFL). As shown in Fig. 1, we aim to learn an over-complete filter bank which is able to cover the variances of images from different classes, while keeping the shareable correlation among different classes and the discriminative power of each category.

[Fig. 1 here: the global filter bank w_1, ..., w_D, with each class (Class 1, Class 2, ..., Class C) activating its own subset of the filters.]

Fig. 1. Illustration of DSFL. w_1, ..., w_D represent the filters in the global filter bank W. For each class, we force it to activate a small subset of filters to learn class-specific patterns, and different classes can share the same filters to learn shareable patterns. Finally, the feature of an image patch x_i can be represented as f_i = F(W x_i). (Best viewed in color)

To build such a global filter bank, an intuitive way is to independently learn a filter bank for each class and concatenate them together. However, if filters learned from different classes are not shared, the number of filters increases linearly with the number of categories, which is not desirable for local feature representation. To learn a more compact global filter bank, we force each category to activate only a subset of the global filters during the learning procedure. Beyond reducing feature dimensions, sharing filters can also lead to more robust features. Images belonging to different classes do share some information in common (e.g. in scene classification, both "computer room" and "office" contain "computer" and "desk"). The amount of information shared depends on the similarity between the categories. Hence, we allow filters to be shared, meaning that the same filters can be activated by a number of categories. We introduce a binary selection variable vector to adaptively select which filters to share, and among which categories. To improve the discriminative power, we introduce a discriminative term that forces features from the same category to be close and features from different categories to be far away (e.g. there might be patches corresponding to "bookshelf" in "office", which can hardly be found in "computer room"). However, not all patches from the same category are close, as they are very diverse. Hence, we introduce a method to select discriminative exemplars from each category, and require a feature to be similar to a subgroup of the exemplars from the same category. Furthermore, not all local patches from different classes should be forced to be separable; thus, we relax the discriminative term to allow similar patches to be shared across different classes, and focus on separating the less similar patches from different classes. We tested our method on three widely used scene image classification datasets: Scene 15, UIUC Sports, and MIT 67 Indoor. The experimental results show that our features outperform most existing ones. By combining our feature with the ConvNets [3, 10] features (supervised, pre-trained on ImageNet [11]), we achieve state-of-the-art results on Scene 15, UIUC Sports, and MIT 67 Indoor, with classification accuracies of 92.81%, 96.78%, and 76.23% respectively.

2 Related Works

Our work focuses on learning local feature descriptors. Hand-crafted features including SIFT [1], HOG [2], GIST [12], and LBP [13] have been popularly used in this area. However, even though they are very powerful, they can hardly capture any information beyond what has been defined by prior knowledge. In this paper, we aim to learn a data-adaptive local feature representation. Recently, directly learning features from image pixel values [4-9, 14-18] has emerged as a hot research topic in computer vision because it is able to learn data-adaptive features, and many such methods have achieved superior performance on important computer vision tasks such as digit image recognition [6] and action recognition [17]. However, most existing feature learning works adopt unsupervised learning methods to learn filters for feature extraction. Different from them, we argue that discriminative information can be critical for classification and that discriminative patterns can be learned. We experimentally show that our discriminative feature learning works better than unsupervised feature learning on scene datasets by encoding shareable and discriminative class correlation cues into the feature representation. In the supervised feature learning line, the ConvNets [3] is a very deep feature learning structure (5 convolutional layers, 2 fully connected layers, and 1 softmax layer) that focuses on progressively learning multiple levels of visual patterns. When pre-trained on ImageNet, it is the state-of-the-art feature extractor for many tasks [10, 19, 20]. In contrast, our DSFL focuses on encoding the shareable and discriminative correlation among different classes into each layer's feature transformation. In Section 4, we show that our DSFL learns significant complementary information to this powerful feature, and that by combining with it, we can improve the current state of the art on all three scene classification datasets.

There are also related papers that try to extract discriminative representations from images. For example, [21-24] learn discriminative dictionaries to encode local image features. Another line of work [25-28], which represents scene images in terms of weakly-supervised mined discriminative parts, has gained increasing popularity and success. The basic idea is to build a discriminative framework and use it to mine a set of representative and distinct parts (multi-scale patches) for every class; images can then be represented by the max-pooled responses of such mid-level patterns. Different from these works, we focus on discriminatively learning filters that transform local image patches into features, while allowing the local feature transformation filters to be shared between different categories. To the best of our knowledge, this has not been done before. Furthermore, in [29, 30], object part filters at the middle level are shared to represent a large number of object categories for object detection. Compared to them, our training examples (image patches) do not have strong supervised labels except image-level class labels, so we develop an exemplar selection scheme and a nearest neighbour based maximum margin method to make the learning more robust to noise.

3 Discriminative and Shareable Feature Learning

In this section, we first describe the three components of our Discriminative and Shareable Feature Learning (DSFL) framework, and then provide an alternating optimization strategy to solve the resulting problem.

3.1 DSFL Learning Components

We aim to learn features that preserve the information of the original data, are shareable, and are discriminative. To achieve these goals, the DSFL framework has three learning components. We write x ∈ R^{D_0} for the vector of raw pixel values of an image patch. Given a number of x from different categories, we aim to learn a feature transformation filter bank W ∈ R^{D × D_0} (each row represents one filter, and there are D filters). By multiplying W with x_i and applying an activation function F(·), we generate the feature f_i = F(W x_i), which should be discriminative and as compact as possible. For this purpose, W should be learned to encode information that is discriminative among classes, using only a small number of rows (filters). In our learning framework, we force each class to activate a subset of filters in W to learn class-specific patterns, and we allow different classes to share filters to reduce the number of filters.

The Global Reconstruction Term. To ensure that the feature transformation matrix W ∈ R^{D × D_0} preserves the information hidden in the original data, we use a global reconstruction term that minimizes the error between the reconstructed data and the original data. The cost function is:

L_u = Σ_{i=1}^{N} L_u(x_i, W) + λ_1 Σ_{i=1}^{N} ||f_i||_1,
where L_u(x_i, W) = ||x_i − W^T W x_i||_2^2, f_i = F(W x_i), and F(·) = abs(·)        (1)

where N is the total number of training patches. L_u is the empirical loss with respect to the global filter bank W and an unlabelled training patch x_i, and W^T W x_i denotes the reconstruction of x_i. This auto-encoder [4, 31] style reconstruction penalty not only prevents W from degenerating, but also allows W to be over-complete. The term ||f_i||_1 enforces sparsity of the learned feature f_i. Following [5, 17], we set F(·) = abs(·); the sparsity term ||f_i||_1 then reduces to the sum over all dimensions of f_i.
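To make Eq. (1) concrete, the following is a minimal NumPy sketch of the reconstruction-plus-sparsity cost for a batch of patches; the variable names and the toy data are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def global_reconstruction_cost(W, X, lambda1):
    """Eq. (1): reconstruction error plus L1 sparsity of f = abs(W x).

    W : (D, D0) filter bank, one filter per row.
    X : (N, D0) raw patch vectors, one patch per row.
    """
    F = np.abs(X @ W.T)                   # features f_i = abs(W x_i), shape (N, D)
    X_rec = (X @ W.T) @ W                 # row i is (W^T W x_i)^T, shape (N, D0)
    rec_err = np.sum((X - X_rec) ** 2)    # sum_i ||x_i - W^T W x_i||_2^2
    sparsity = np.sum(F)                  # sum_i ||f_i||_1 (abs already applied)
    return rec_err + lambda1 * sparsity

# Toy usage: 100 random 16x16 patches, 400 filters (over-complete since 400 > 256).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 256))
W = 0.01 * rng.standard_normal((400, 256))
print(global_reconstruction_cost(W, X, lambda1=0.1))
```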

Shareable Constraint Term. Equation 1 alone can only learn a generative W without encoding any class-specific information. One way to overcome this limitation is to force a subset of filters to respond only to a specific class. Thus, we propose a constraint term which ensures that only a subset of the filters is activated by each class, while the same filters can potentially be activated by multiple classes. For each class c, we write α^c ∈ R^D for a vector indicating the selection status of the rows of W: if α^c_d = 1, d = 1, ..., D, then the d-th row of W is activated, otherwise it is not. We write A^c = diag(α^c) for convenience. The cost of our shareable constraint term for class c is:

L^c_sha = Σ_{j=1}^{N_c} L_sha(x^c_j, A^c W) + λ_2 ||α^c||_0
s.t. α^c_d ∈ {0, 1}, d = 1, ..., D,
where L_sha(x^c_j, A^c W) = ||x^c_j − (A^c W)^T (A^c W) x^c_j||_2^2        (2)

where N_c is the number of training patches from class c, and C is the total number of classes. Similar to L_u, L^c_sha is a reconstruction cost, here with respect to the filter bank subset A^c W and the training patches x^c_j of class c. We apply the l_0 norm to α^c to force each class to activate a small number of rows. Consequently, if the d-th element of α^c is set to 1 only for class c, then the d-th row of W will be activated and learned only with training patches from class c. If the d-th element is set to 1 for classes c_1 and c_2, then the d-th row of W is a shareable filter, which is activated and learned with training data from both classes. When α^c is updated in each iteration, the corresponding training data for each filter are also updated.
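The per-class cost of Eq. (2) differs from Eq. (1) only by the binary row-selection mask; a small sketch under the same assumptions (here `alpha_c` is an assumed 0/1 vector of length D) could look like this.

```python
import numpy as np

def shareable_cost(W, X_c, alpha_c, lambda2):
    """Eq. (2): per-class reconstruction with the masked filter bank A^c W.

    W       : (D, D0) global filter bank.
    X_c     : (N_c, D0) patches of class c.
    alpha_c : (D,) binary selection vector (1 = filter active for class c).
    """
    Wc = alpha_c[:, None] * W                 # A^c W: zero out non-selected rows
    X_rec = (X_c @ Wc.T) @ Wc                 # (A^c W)^T (A^c W) x_j, row-wise
    rec_err = np.sum((X_c - X_rec) ** 2)
    return rec_err + lambda2 * np.count_nonzero(alpha_c)   # + lambda_2 ||alpha^c||_0
```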

Discriminative Regularization Term. To enhance the discriminative power of the feature descriptors, we further introduce a discriminative term based on the assumption that discriminative features should be close to features from the same category and far away from features from different categories in the feature space. In the image-level scenario [32, 33], labels are consistent with the targets. At the patch level, however, local features from the same class are inherently diverse, and directly forcing all of them to be similar to each other is not suitable. Similar to [34-36], we adopt a nearest neighbour based patch-to-class distance metric to enforce discrimination. For a training patch x^c_j, its positive nearest neighbour patch set, drawn from the same category c, is denoted Γ^c(x^c_j); its negative nearest neighbour patch set, drawn from the categories other than c, is denoted Γ̄^c(x^c_j). The k-th nearest neighbour in the two sets is written Γ^c_k(x^c_j) and Γ̄^c_k(x^c_j) respectively. In the class-specific feature space of class c (transformed by A^c W), the feature representations of the k-th positive and negative nearest neighbour patches are denoted Γ^c_k(f^c_j) = F(A^c W Γ^c_k(x^c_j)) and Γ̄^c_k(f^c_j) = F(A^c W Γ̄^c_k(x^c_j)) correspondingly. We aim to minimize the distance between each feature and its positive nearest neighbours, while maximizing the distance between each feature and its negative nearest neighbours. Furthermore, according to maximum margin theory, we should focus on the hard training samples. Hence, we develop a hinge-loss like objective function to learn A^c W:

L^c_dis = Σ_{j=1}^{N_c} max( δ + Dis(x^c_j, Γ^c(x^c_j)) − Dis(x^c_j, Γ̄^c(x^c_j)), 0 ),
where Dis(x^c_j, Γ^c(x^c_j)) = (1/K) Σ_{k=1}^{K} ||f^c_j − Γ^c_k(f^c_j)||_2^2
and Dis(x^c_j, Γ̄^c(x^c_j)) = (1/K) Σ_{k=1}^{K} ||f^c_j − Γ̄^c_k(f^c_j)||_2^2        (3)

in which δ is the margin, set to 1 in our experiments, and K is the number of nearest neighbours in each nearest neighbour patch set, fixed to 5.

However, there are two limitations of the above nearest neighbour based learning method. Firstly, as mentioned in [36], patch-level nearest neighbour search is likely to be dominated by noisy feature patches. Thus, some of the retrieved nearest neighbours in Equation 3 might not carry discriminative patterns, and consequently the performance will suffer. Secondly, it is expensive to search for nearest neighbours in the whole patch set. A straightforward solution is to apply clustering and use the cluster centroids as the exemplars of each class [36]. However, conventional clustering methods may treat non-informative dominant patterns as inliers of clusters, while treating informative class-specific patterns as outliers. Thus, we propose a method to select discriminative exemplars for each category. Inspired by the image-level exemplar selection method of [37], we propose an exemplar selection method that is suitable for patch-level patterns. We first define the coverage set of a patch x. Let X be the original global patch set, composed of patches densely extracted from all the training images. For each patch x ∈ X, we search for its M nearest neighbours in X, and define these M patches as the coverage set of x. Then, for each class c, we define its exemplar patches as the ones that cannot easily be covered by patches from the many classes other than c. To reach this goal, we design a patch-to-database (P2D) distance to measure the discriminative power of a patch x^c_j from class c:

P2D(x^c_j) = (1/(C−1)) · (1/N_c̄) Σ_{n=1}^{N_c̄} ||x^c_j − x^c̄_n||_2        (4)

where x^c̄_n is a patch from the classes c̄ ≠ c, N_c̄ is the number of patches from classes c̄ ≠ c whose coverage sets contain x^c_j, and C is the number of classes. If P2D(x^c_j) is small, it means that x^c_j represents a common pattern among many classes and should be removed; otherwise, it should be kept as a discriminative exemplar.
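As one possible reading of Eq. (4), and as a runnable counterpart of the selection procedure summarized in Algorithm 1 below, the following sketch uses brute-force nearest-neighbour search; the helper names and the O(N^2) distance matrix are simplifying assumptions for illustration.

```python
import numpy as np

def select_exemplars(X, labels, M=10, eps=0.10):
    """Patch-level exemplar selection (cf. Eq. (4) and Algorithm 1).

    X      : (N, D0) all training patches.
    labels : (N,) class index of each patch.
    M      : coverage set size.
    eps    : fraction of top-ranked patches kept per class.
    Returns a dict mapping class -> indices of its selected exemplars.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise L2
    np.fill_diagonal(dists, np.inf)
    coverage = np.argsort(dists, axis=1)[:, :M]     # coverage set of every patch
    C = len(np.unique(labels))

    exemplars = {}
    for c in np.unique(labels):
        idx_c = np.where(labels == c)[0]
        scores = np.zeros(len(idx_c))
        for t, j in enumerate(idx_c):
            # patches of other classes whose coverage set contains x_j
            covering = [n for n in range(len(X))
                        if labels[n] != c and j in coverage[n]]
            if covering:                            # Eq. (4)
                scores[t] = np.mean([dists[j, n] for n in covering]) / (C - 1)
            else:                                   # covered by no other class:
                scores[t] = np.inf                  # maximally discriminative
        keep = max(1, int(np.ceil(eps * len(idx_c))))
        order = np.argsort(-scores)                 # rank by descending P2D
        exemplars[c] = idx_c[order[:keep]]
    return exemplars
```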

Algorithm 1: Discriminative Exemplar Selection
Input:
  X: global patch set
  X^c: patch set of class c
  ε: threshold for selecting discriminative exemplars
  M: number of patches in each coverage set
Output:
  E^c: exemplars of class c
1. Calculate the coverage set of each patch in X.
for c = 1 to C do
  2. For each patch in X^c, calculate its P2D distance based on Equation 4.
  3. Rank the patches in X^c in descending order of their P2D distances.
  4. Select the top ε percent of the ranked patches as the exemplars E^c.
end
return E^c

For each class, we rank the patches based on their P2D(·) distances in descending order, and select the top 10% of them as discriminative exemplars. The selection procedure is summarized in Algorithm 1. The exemplars replace the original patch set and are used to search for the nearest neighbours in Equation 3. Specifically, for each training patch x^c_j, we search for its positive nearest neighbour set Γ^c(x^c_j) among the exemplars of class c, and for its negative nearest neighbour set Γ̄^c(x^c_j) among the exemplars belonging to the classes other than c.

3.2 DSFL Objective Function and Optimization

Combining the global unsupervised reconstruction term L_u, the shareable constraint term L^c_sha, and the discriminative regularization L^c_dis, we write the objective function of DSFL as:

min_{W, α^c}  L_u + γ Σ_{c=1}^{C} L^c_sha + η Σ_{c=1}^{C} L^c_dis

where
L_u = Σ_{i=1}^{N} L_u(x_i, W) + λ_1 Σ_{i=1}^{N} ||f_i||_1
L^c_sha = Σ_{j=1}^{N_c} L_sha(x^c_j, A^c W) + λ_2 ||α^c||_0
L^c_dis = Σ_{j=1}^{N_c} max( δ + Dis(x^c_j, Γ^c(x^c_j)) − Dis(x^c_j, Γ̄^c(x^c_j)), 0 )
s.t. α^c_d ∈ {0, 1}, d = 1, ..., D        (5)

In Equation 5, when α^c is fixed, the problem is convex in W, and when W is fixed, a suboptimal α^c can also be obtained. However, the function cannot be jointly optimized. Thus, we adopt an alternating optimization strategy to iteratively update W and each α^c.
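Before detailing the two update steps (Eqs. (6) and (7) below), a schematic of the alternating loop might look like the following sketch. The gradient-descent update of W is only a stand-in for the L-BFGS solve, it uses just the reconstruction terms for brevity, and all names are assumptions rather than the authors' code.

```python
import numpy as np

def masked_recon_cost(W, X_c, alpha_c):
    Wc = alpha_c[:, None] * W
    R = (X_c @ Wc.T) @ Wc
    return np.sum((X_c - R) ** 2)

def update_alpha_greedy(W, X_c, lambda2, loss_tol):
    """Greedy filter activation for one class (stand-in for Eq. (7)): repeatedly
    add the single filter that most reduces the masked reconstruction cost,
    stopping once the cost drops below loss_tol or no filter helps."""
    D = W.shape[0]
    alpha = np.zeros(D)
    recon = masked_recon_cost(W, X_c, alpha)
    while recon > loss_tol and alpha.sum() < D:
        best_d, best_obj = None, recon + lambda2 * alpha.sum()
        for d in np.where(alpha == 0)[0]:
            trial = alpha.copy()
            trial[d] = 1.0
            obj = masked_recon_cost(W, X_c, trial) + lambda2 * trial.sum()
            if obj < best_obj:
                best_d, best_obj = d, obj
        if best_d is None:          # activating another filter no longer helps
            break
        alpha[best_d] = 1.0
        recon = masked_recon_cost(W, X_c, alpha)
    return alpha

def update_W_gradient(W, X, lr=1e-3, steps=30):
    """Gradient-descent stand-in for the L-BFGS solve of Eq. (6), kept to the
    unsupervised reconstruction part only (the full step would also include
    L_sha and L_dis)."""
    for _ in range(steps):
        R = (X @ W.T) @ W
        E = R - X                                   # residuals, shape (N, D0)
        grad = ((W @ X.T) @ E + (W @ E.T) @ X) / len(X)
        W = W - lr * grad
    return W

def dsfl_alternate(X, labels, D=400, rounds=5, lambda2=0.1, loss_tol=1e-2):
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((D, X.shape[1]))
    W = update_W_gradient(W, X)                     # unsupervised initialization
    alphas = {}
    for _ in range(rounds):                         # ~5 rounds in the paper
        for c in np.unique(labels):
            alphas[c] = update_alpha_greedy(W, X[labels == c], lambda2, loss_tol)
        W = update_W_gradient(W, X)
    return W, alphas
```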

Fix α^c to update W:

min_W  Σ_{i=1}^{N} L_u(x_i, W) + λ_1 Σ_{i=1}^{N} ||f_i||_1 + γ Σ_{c=1}^{C} Σ_{j=1}^{N_c} L_sha(x^c_j, A^c W) + η Σ_{c=1}^{C} L^c_dis        (6)

As mentioned in Section 3.1, ||f_i||_1 reduces to a sum over the dimensions of f_i; thus, Equation 6 can easily be optimized by unconstrained solvers such as L-BFGS.

Fix W to update α^c:

min_{α^c}  Σ_{j=1}^{N_c} L_sha(x^c_j, A^c W) + λ_2 ||α^c||_0 + η L^c_dis        (7)

For the optimization of α^c, we update one α^c at a time for the c-th class and fix α^c̄ (c̄ ≠ c). To obtain such binary filter selection indicators, we apply a greedy optimization method. We first set all elements of α^c to 0, then search for the single best filter that minimizes Equation 7, and activate that filter by setting the corresponding element of α^c to 1. Afterwards, based on the previously activated filters, we search for the next filter that further minimizes the cost function. After several rounds of searching, when the loss L_sha is smaller than a threshold, the optimization of α^c terminates; we stop updating α^c and feed the renewed α^c back into Equation 6 to further optimize W. The learning algorithm and initialization procedure are shown in Algorithm 2. The alternating optimization terminates when the values of both W and α^c converge (about 5 rounds).

3.3 Hierarchical Extension of DSFL

DSFL can easily be stacked to extract features at multiple levels. Features at a lower level may represent edges and lines, while features at a higher level may represent object parts, etc. In our implementation, we stack one more layer on top of the basic DSFL structure (adding further layers can slightly improve the performance, but the computational cost is high, so we apply a two-layer DSFL as a compromise). In the first-layer DSFL network, 400-dimensional features are learned from 16x16-pixel raw image patches, which are densely extracted from the original/resized images with step size 4. In the second layer, another 400-dimensional feature is learned based on the first-layer features. To obtain the inputs of the second layer, we concatenate the first-layer features densely extracted within 32x32 image areas, apply PCA to reduce the dimension to 300, and feed the result to the second layer. Finally, we combine the features learned from both layers as our DSFL feature.
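A rough, self-contained illustration of this two-layer extraction is sketched below; the 5x5 grouping of first-layer patches per 32x32 area, the area stride, and the random placeholder filter banks are assumptions made only for the example.

```python
import numpy as np

def extract_dense(img, patch, step):
    """Densely extract (patch x patch) patches with the given stride."""
    H, W = img.shape
    feats = []
    for y in range(0, H - patch + 1, step):
        for x in range(0, W - patch + 1, step):
            feats.append(img[y:y + patch, x:x + patch].ravel())
    return np.asarray(feats)

def pca_project(X, dim):
    """Plain PCA projection (the paper reduces to 300 dims; a toy example may
    have lower rank, so the dimension is capped accordingly)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:min(dim, Vt.shape[0])].T

rng = np.random.default_rng(0)
img = rng.standard_normal((128, 128))          # stand-in gray-scale image

# Layer 1: 400 filters on 16x16 raw patches (W1 would come from DSFL training).
X1 = extract_dense(img, patch=16, step=4)
W1 = 0.01 * rng.standard_normal((400, 256))
F1 = np.abs(X1 @ W1.T)

# Layer 2 input: concatenate the 5x5 block of layer-1 features covering each
# 32x32 area (area stride 8 px here, i.e. 2 patch steps), then PCA-reduce.
gy = gx = (128 - 16) // 4 + 1
G1 = F1.reshape(gy, gx, -1)
blocks = [G1[by:by + 5, bx:bx + 5].ravel()
          for by in range(0, gy - 4, 2) for bx in range(0, gx - 4, 2)]
X2 = pca_project(np.asarray(blocks), dim=300)

# Layer 2: another 400-filter bank (learned by DSFL on the reduced inputs).
W2 = 0.01 * rng.standard_normal((400, X2.shape[1]))
F2 = np.abs(X2 @ W2.T)
# The final DSFL representation keeps the features of both layers (F1 and F2).
```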

Algorithm 2: DSFL: Discriminative and Shareable Feature Learning
Input:
  x_i: unlabelled training patches
  x^c_j: image-level labelled training patches from class c
  D: number of filters in the global filter bank
  γ, η, λ_1, λ_2: trade-off parameters controlling the weight of the shareable term, the discriminative term, and the sparsity
Output:
  W: global filter bank (feature transformation matrix)
1. Initialize α^c = 0^T.
2. Set W as a random D × D_0 matrix.
3. Learn W with only the unsupervised term L_u as the initialization of W for DSFL.
4. Select exemplars for each class based on Equation 4.
5. Search the positive and negative nearest neighbour exemplar sets for each x^c_j.
while W and α^c have not converged do
  for c = 1 to C do
    6. Fix W and solve Equation 7 by updating α^c.
  end
  7. Fix α^c, c = 1, ..., C, and solve Equation 6 by updating W.
end
return W

4 Experiments and Analysis

4.1 Datasets and Experiment Settings

We tested our DSFL method on three widely used scene image classification datasets: Scene 15 [38], UIUC Sports [39], and MIT 67 Indoor [40]. In order to make fair comparisons with other types of features, we only used gray-scale information for all these datasets. We tested on all three datasets with the most standard settings: on Scene 15, we randomly selected 100 images per category for training and used the rest for testing; on UIUC Sports, we randomly selected 70 images per class for training and 60 images per class for testing; on MIT 67 Indoor, we followed the original splits of [40], which use around 80 training images and 20 testing images per category. For UIUC Sports and MIT 67 Indoor, since the resolution of the original images is too high for learning local features efficiently, we resized them to have at most 300 pixels along the smaller axis. For Scene 15 and UIUC Sports, we randomly split the training and testing data 5 times and report the average accuracy over these 5 rounds. For all local features, we densely extracted features at six scales with rescaling factors 2^{-i/2}, i = 0, 1, ..., 5. Specifically, RICA [4] and DSFL features were extracted with step size 3 for the first layer and step size 6 for the second layer; SIFT features [1] were extracted from 16x16 patches with stride 3; HOG2x2 features [41] were extracted based on cells of size 8x8 with a stride of 1 cell; LBP features [13] were extracted from cells of size 8x8.
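As a small illustration of this dense multi-scale extraction (with the rescaling factors 2^{-i/2} given above), one could proceed as follows; `scipy.ndimage.zoom` is assumed available for image rescaling, and the defaults mirror the first-layer setting.

```python
import numpy as np
from scipy.ndimage import zoom   # assumed available for rescaling

def multiscale_dense_patches(img, patch=16, step=3, n_scales=6):
    """Dense patch extraction at six scales with rescaling factors 2^(-i/2)."""
    all_patches = []
    for i in range(n_scales):
        factor = 2.0 ** (-i / 2.0)
        scaled = zoom(img, factor) if factor != 1.0 else img
        H, W = scaled.shape
        for y in range(0, H - patch + 1, step):
            for x in range(0, W - patch + 1, step):
                all_patches.append(scaled[y:y + patch, x:x + patch].ravel())
    return np.asarray(all_patches)
```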

Methods                      Scene 15    UIUC Sports    MIT 67 Indoor
GIST [12]                    73.28%      –              –
CENTRIST [42]                83.10%      78.50%         36.90%
SIFT [1]                     82.06%      85.12%         45.86%
HOG2x2 [41]                  81.58%      83.96%         43.76%
LBP [13]                     82.95%      80.04%         39.25%
RICA [4]                     79.85%      82.14%         47.89%
DSFL                         84.19%      86.45%         52.24%
DeCAF [3, 10]                87.99%      93.96%         58.52%
SIFT [1] + DeCAF [3, 10]     89.90%      95.05%         70.51%
DSFL + DeCAF [3, 10]         92.81%      96.78%         76.23%

Table 1. Comparison results between our feature and other features. (DeCAF is the feature learned by the deep ConvNets pre-trained on ImageNet.)

For each training image, we randomly picked 400 patches (200 for MIT 67 Indoor) and used them as training data to learn W. In the objective function, Equation 5, the margin δ was fixed to 1, and we learnt the weight parameters λ_1, λ_2, γ, and η sequentially by cross validation. In Algorithm 1, the exemplar selection threshold ε was set to 10%, and the coverage set size M was set to 10. In Algorithm 2, the maximum number of iterations for updating W and α^c was set to 5. We evaluated our local features within the LLC framework [43], which uses locality-constrained linear coding to encode local features, followed by max pooling and a linear SVM. The size of the codebook was fixed to 2000, and each image was divided into 1x1, 2x2, and 4x4 spatial pooling regions [38]. We have also tested other frameworks with different coding strategies (e.g. vector quantization) and pooling schemes (e.g. average pooling); our DSFL consistently outperforms traditional local features under these settings as well.
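This evaluation pipeline can be sketched as approximated LLC coding [43] over the k nearest codewords followed by spatial-pyramid max pooling; the codebook, k, and the regularizer value below are illustrative assumptions. The pooled image vector would then be fed to a linear SVM.

```python
import numpy as np

def llc_code(x, B, k=5, beta=1e-4):
    """Approximated locality-constrained linear coding of one descriptor x
    over codebook B (n_words, d), using its k nearest codewords."""
    d2 = np.sum((B - x) ** 2, axis=1)
    nn = np.argsort(d2)[:k]
    z = B[nn] - x                        # shift selected bases to the origin
    C = z @ z.T                          # local covariance
    C += beta * np.trace(C) * np.eye(k)  # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                         # sum-to-one constraint
    code = np.zeros(B.shape[0])
    code[nn] = w
    return code

def spm_max_pool(codes, coords, img_hw, levels=(1, 2, 4)):
    """Max-pool codes over 1x1, 2x2 and 4x4 spatial regions and concatenate."""
    H, W = img_hw
    pooled = []
    for g in levels:
        cell = np.zeros((g, g, codes.shape[1]))
        for c, (y, x) in zip(codes, coords):
            gy = min(int(y * g / H), g - 1)
            gx = min(int(x * g / W), g - 1)
            cell[gy, gx] = np.maximum(cell[gy, gx], c)
        pooled.append(cell.reshape(-1, codes.shape[1]))
    feat = np.concatenate(pooled).ravel()
    return feat / (np.linalg.norm(feat) + 1e-12)
```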

4.2 Comparison with Other Features

As shown in Table 1, we compared our DSFL with popular features that have shown good performance on scene image classification: SIFT [1], GIST [12], CENTRIST [42], HOG [2, 44], and LBP [13]. Our DSFL feature outperforms all of the hand-crafted features. We also compared our DSFL with RICA [4], the baseline unsupervised feature learning method that does not encode any discriminative or class-specific information. As shown in Table 1, our method consistently and significantly outperforms RICA. We also tested the performance of using only the features learned by the first partially connected layer; for the three datasets, the results were 82.61%, 83.92%, and 47.16%, which are less powerful than the two-layer features.

In Table 1, the DeCAF feature [10] is an implementation of the 7-layer ConvNets [3]. Here we used the 6-th layer DeCAF feature. According to [10, 20], the 6-th layer feature empirically leads to better results than the 7-th layer feature. On the three datasets, we also tested the 7-th layer feature and obtained 87.35%, 93.44%, and 58.27% respectively; thus, the 6-th layer DeCAF features were used for evaluation. Although this pre-trained DeCAF feature is very powerful, directly comparing our feature with it is not fair: we do not utilize the huge amount of image data from ImageNet [11], we do not use color information, and we focus on local feature representation rather than global image representation. The ConvNets was trained on ImageNet with a large amount of object images, so we expect the features learned by these two frameworks to be complementary. In Fig. 2, we test on MIT 67 Indoor to show this complementary effect. In the first two rows, our DSFL works better than DeCAF, and we show testing images that are correctly classified by DSFL but wrongly classified by DeCAF. In the last two rows, DeCAF outperforms DSFL, and we show testing images that DSFL fails to recognize but DeCAF does. To quantitatively analyze the complementary effect, we combined our DSFL with the DeCAF feature. As shown in the last row of Table 1, we obtain much better performance than using the powerful ConvNets features alone, producing state-of-the-art results. We also tested the combination of SIFT and DeCAF; its accuracy is not as good as that of DSFL combined with DeCAF, which indicates that our DSFL learns more effective complementary information by considering data-adaptive information.

[Fig. 2 panels, per-class accuracy on MIT 67 Indoor: auditorium — DSFL 66.70%, DeCAF 38.89%; corridor — DSFL 57.14%, DeCAF 47.62%; bowling — DSFL 75.00%, DeCAF –; winecellar — DSFL 23.81%, DeCAF 76.19%.]

Fig. 2. Comparison results on MIT 67 Indoor. The first two rows show two categories on which DSFL works better than DeCAF; the last two rows show classes that are better represented by DeCAF. DSFL and DeCAF are complementary, and combining them leads to better scene classification results.

Traditional hand-crafted features such as SIFT usually extract Gabor-like patterns, most of which can be learned by the lower layers of ConvNets. However, since ConvNets relies on backpropagation over huge training datasets for optimization, the bottom layers of the network are usually not well trained. In contrast, we explicitly use supervised information to train bottom-layer features. Our method is therefore more suitable for relatively small datasets, as evidenced by the experimental results, while previous attempts to train a CNN classifier on small datasets have usually failed. These two lines of work are thus expected to be complementary.

Methods                                         Scene 15   UIUC Sports   MIT 67 Indoor
ROI + GIST [40]                                 –          –             –
DPM [45]                                        –          –             –
Object Bank [46]                                80.90%     76.30%        37.60%
Discriminative Patches [47]                     –          –             –
LDC [36]                                        80.30%     –             –
macrofeatures [48]                              84.30%     –             –
Visual Concepts + 3 combined features [25]      83.40%     84.80%        46.40%
MMDL + 5 combined features [49]                 86.35%     88.47%        50.15%
Discriminative Part Detector [27]               86.00%     86.40%        51.40%
LScSPM [50]                                     89.78%     85.27%        –
IFV [28]                                        –          –             –
MLrep + IFV [26]                                –          –             –
DSFL + DeCAF [3, 10]                            92.81%     96.78%        76.23%

Table 2. Comparison results of our method and other popular methods on Scene 15, UIUC Sports, and MIT 67 Indoor.

We also compared our method (combining DSFL and DeCAF) with other methods applied to these three scene datasets. As shown in Table 2, our method achieves the highest accuracy on all three datasets. Note that Visual Elements [26] utilizes numerous patches extracted at scales ranging from 80x80 to the full image size, with the patches represented by standard HOG [2] plus an 8x8 color image in L*a*b space together with very high dimensional IFV [28] features, while MMDL [49] combines 5 types of features at 3 scales. Furthermore, most of the previous works are based on hand-crafted local feature descriptors, which means that our learned DSFL features can be combined with them to achieve better results. For example, LScSPM [50] focuses on coding, which could be used to encode our DSFL features.

4.3 Analysis of the Effect of Different Components

In this section, we compare our shareable and discriminative learning method to the baseline that does not encode such information, which is equivalent to the RICA method of [4].

We first show visualizations of the filters learned on UIUC Sports in Fig. 3(a) and Fig. 3(b). We can see that our DSFL captures more sharply localized patterns, corresponding to more class-specific visual information.

[Fig. 3 panels: (a) RICA filters, (b) DSFL filters.]

Fig. 3. Visualization of the filters learned by RICA and by our DSFL on the UIUC Sports dataset.

Effect of learning a shareable filter bank. We tested the DSFL with and without the feature sharing terms, and report the intermediate results in Table 3. The first row of the table shows the baseline unsupervised RICA features learned by solving Equation 1. In the second row, L_u + L_sha corresponds to features learned with Equation 2. The improvement in accuracy shows that learning shareable features is effective for classification. However, if we remove the global reconstruction error term L_u and keep only the shareable terms, as shown in the third row, the performance drops dramatically.

Effect of Discriminative Regularization and Exemplar Selection. From the fourth and fifth rows of Table 3, we find that without selecting exemplars for learning, we cannot achieve much improvement, because noisy training examples may overwhelm the useful discriminative patterns. However, once we learn using the selected exemplars, our method achieves a significant improvement in classification accuracy. This shows that discriminative exemplar selection is critical in our learning framework. Furthermore, using only 10% of the whole patch set dramatically increases the efficiency of the subsequent nearest neighbour search. Thus, our exemplar selection method is both effective and efficient.

Effect of the Size of the Filter Bank. To further analyze the influence of the number of filters, we test on the Scene 15 dataset with 128, 256, 512, 1024, and 2048 filters for the DSFL. The results are shown in Fig. 4. At the beginning, when the size is small, the learned features are relatively weak. When the number of filters increases and W becomes over-complete, the performance is substantially improved.

Methods                                  Scene 15   UIUC Sports   MIT 67 Indoor
L_u (RICA [4])                           79.85%     82.14%        47.89%
L_u + L_sha                              82.01%     83.67%        49.70%
L_sha                                    72.69%     72.52%        24.12%
L_u + L_sha + L_dis (without Exemplar)   82.50%     83.43%        51.28%
L_u + L_sha + L_dis (Full DSFL)          84.19%     86.45%        52.24%

Table 3. Analysis of the effect of each component.

[Fig. 4 here: classification accuracy (%) on Scene 15 as a function of the number of filters.]

Fig. 4. Results of varying the number of filters on Scene 15.

Thus, learning an over-complete filter bank does help to obtain a better feature representation, because the resulting filter bank captures more information. However, when the number of filters increases further, the performance does not change much, while the learning process becomes extremely slow. In our experiments, we use 400 filters as a compromise between efficiency and accuracy.

5 Conclusion

In this paper, we propose a weakly supervised feature learning method, called DSFL, which learns a discriminative and shareable filter bank to transform local image patches into features. In DSFL, we learn a flexible number of shared filters to represent common patterns shared across different categories. To enhance the discriminative power, we force features from the same class to be locally similar, while features from different classes are forced to be separable. We test our method on three widely used scene image classification benchmark datasets, and the results consistently show that our learned features outperform most existing features. By combining our features with the ConvNets features pre-trained on ImageNet, we greatly enhance the representation and achieve state-of-the-art scene classification results. In the future, we will integrate our learning method with deeper learning structures to extract multi-level features for more effective classification.

References

1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2) (2004)
2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. (2005)
3. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012)
4. Le, Q.V., Karpenko, A., Ngiam, J., Ng, A.Y.: ICA with reconstruction cost for efficient overcomplete feature learning. In: NIPS. (2011)
5. Zou, W.Y., Zhu, S.Y., Ng, A.Y., Yu, K.: Deep learning of invariant features via simulated fixations in video. In: NIPS. (2012)
6. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation (2006)
7. Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: International Conference on Artificial Intelligence and Statistics. (2011)
8. Sohn, K., Jung, D.Y., Lee, H., Hero, A.O.: Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In: ICCV. (2011)
9. Zuo, Z., Wang, G.: Learning discriminative hierarchical features for object recognition. IEEE Signal Processing Letters 21(9) (2014)
10. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: ICML. (2014)
11. Deng, J., Berg, A.C., Li, K., Fei-Fei, L.: What does classifying more than 10,000 image categories tell us? In: ECCV. (2010)
12. Oliva, A., Torralba, A.: Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research 155 (2006)
13. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7) (2002)
14. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: ICCV. (2009)
15. Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of spatio-temporal features. In: ECCV. (2010)
16. Le, Q.V., Ranzato, M.A., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. In: ICML. (2012)
17. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR. (2011)
18. Wang, A., Lu, J., Wang, G., Cai, J., Cham, T.J.: Multi-modal unsupervised feature learning for RGB-D scene labeling. In: ECCV. (2014)
19. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint (2013)
20. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint (2013)

21. Jiang, Z., Lin, Z., Davis, L.S.: Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In: CVPR. (2011)
22. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Supervised dictionary learning. In: NIPS. (2008)
23. Yang, M., Zhang, L., Feng, X., Zhang, D.: Fisher discrimination dictionary learning for sparse representation. In: ICCV. (2011)
24. Kong, S., Wang, D.: A dictionary learning approach for classification: separating the particularity and the commonality. In: ECCV. (2012)
25. Li, Q., Wu, J., Tu, Z.: Harvesting mid-level visual concepts from large-scale internet images. In: CVPR. (2013)
26. Doersch, C., Gupta, A., Efros, A.A.: Mid-level visual element discovery as discriminative mode seeking. In: NIPS. (2013)
27. Sun, J., Ponce, J., et al.: Learning discriminative part detectors for image classification and cosegmentation. In: ICCV. (2013)
28. Juneja, M., Vedaldi, A., Jawahar, C., Zisserman, A.: Blocks that shout: Distinctive parts for scene classification. In: CVPR. (2013)
29. Song, H.O., Zickler, S., Althoff, T., Girshick, R., Fritz, M., Geyer, C., Felzenszwalb, P., Darrell, T.: Sparselet models for efficient multiclass object detection. In: ECCV. (2012)
30. Song, H.O., Darrell, T., Girshick, R.B.: Discriminatively activated sparselets. In: ICML. (2013)
31. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786) (2006)
32. Wang, G., Forsyth, D., Hoiem, D.: Improved object categorization and detection using comparative object similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(10) (2013)
33. Wang, Z., Gao, S., Chia, L.T.: Learning class-to-image distance via large margin and L1-norm regularization. In: ECCV. (2012)
34. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: CVPR. (2008)
35. McCann, S., Lowe, D.G.: Local naive Bayes nearest neighbor for image classification. In: CVPR. (2012)
36. Wang, Z., Feng, J., Yan, S., Xi, H.: Linear distance coding for image classification. IEEE Transactions on Image Processing 22(2) (2013)
37. Yao, B., Fei-Fei, L.: Action recognition with exemplar based 2.5D graph matching. In: ECCV. (2012)
38. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR. Volume 2. (2006)
39. Li, L.J., Fei-Fei, L.: What, where and who? Classifying events by scene and object recognition. In: ICCV. (2007)
40. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: CVPR. (2009)
41. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: CVPR. (2010)
42. Wu, J., Rehg, J.M.: CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8) (2011)
43. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: CVPR. (2010)
44. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9) (2010)

45. Pandey, M., Lazebnik, S.: Scene recognition and weakly supervised object localization with deformable part-based models. In: ICCV. (2011)
46. Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.: Object bank: A high-level image representation for scene classification & semantic feature sparsification. In: NIPS. (2010)
47. Singh, S., Gupta, A., Efros, A.A.: Unsupervised discovery of mid-level discriminative patches. In: ECCV. (2012)
48. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: CVPR. (2010)
49. Wang, X., Wang, B., Bai, X., Liu, W., Tu, Z.: Max-margin multiple-instance dictionary learning. In: ICML. (2013)
50. Gao, S., Tsang, I.H., Chia, L.T.: Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1) (2013)


More information

New Fuzzy Object Segmentation Algorithm for Video Sequences *

New Fuzzy Object Segmentation Algorithm for Video Sequences * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 521-537 (2008) New Fuzzy Obet Segmentation Algorithm for Video Sequenes * KUO-LIANG CHUNG, SHIH-WEI YU, HSUEH-JU YEH, YONG-HUAI HUANG AND TA-JEN YAO Department

More information

Approximate logic synthesis for error tolerant applications

Approximate logic synthesis for error tolerant applications Approximate logi synthesis for error tolerant appliations Doohul Shin and Sandeep K. Gupta Eletrial Engineering Department, University of Southern California, Los Angeles, CA 989 {doohuls, sandeep}@us.edu

More information

Towards Optimal Naive Bayes Nearest Neighbor

Towards Optimal Naive Bayes Nearest Neighbor Towards Optimal Naive Bayes Nearest Neighbor Régis Behmo 1, Paul Marombes 1,2, Arnak Dalalyan 2,andVéronique Prinet 1 1 NLPR / LIAMA, Institute of Automation, Chinese Aademy of Sienes 2 IMAGINE, LIGM,

More information

Using Augmented Measurements to Improve the Convergence of ICP

Using Augmented Measurements to Improve the Convergence of ICP Using Augmented Measurements to Improve the onvergene of IP Jaopo Serafin, Giorgio Grisetti Dept. of omputer, ontrol and Management Engineering, Sapienza University of Rome, Via Ariosto 25, I-0085, Rome,

More information

Neurocomputing. Set-label modeling and deep metric learning on person re-identification

Neurocomputing. Set-label modeling and deep metric learning on person re-identification Neuroomputing 151 (2015) 1283 1292 Contents lists available at SieneDiret Neuroomputing journal homepage: www.elsevier.om/loate/neuom Set-label modeling and deep metri learning on person re-identifiation

More information

Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition

Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition Spatio-Temporal Naive-Bayes Nearest-Neighbor () for Skeleton-Based Ation Reognition Junwu Weng Chaoqun Weng Junsong Yuan Shool of Eletrial and Eletroni Engineering Nanyang Tehnologial University, Singapore

More information

Weak Dependence on Initialization in Mixture of Linear Regressions

Weak Dependence on Initialization in Mixture of Linear Regressions Proeedings of the International MultiConferene of Engineers and Computer Sientists 8 Vol I IMECS 8, Marh -6, 8, Hong Kong Weak Dependene on Initialization in Mixture of Linear Regressions Ryohei Nakano

More information

Deep Neural Networks:

Deep Neural Networks: Deep Neural Networks: Part II Convolutional Neural Network (CNN) Yuan-Kai Wang, 2016 Web site of this course: http://pattern-recognition.weebly.com source: CNN for ImageClassification, by S. Lazebnik,

More information

Accommodations of QoS DiffServ Over IP and MPLS Networks

Accommodations of QoS DiffServ Over IP and MPLS Networks Aommodations of QoS DiffServ Over IP and MPLS Networks Abdullah AlWehaibi, Anjali Agarwal, Mihael Kadoh and Ahmed ElHakeem Department of Eletrial and Computer Department de Genie Eletrique Engineering

More information

Multi-Piece Mold Design Based on Linear Mixed-Integer Program Toward Guaranteed Optimality

Multi-Piece Mold Design Based on Linear Mixed-Integer Program Toward Guaranteed Optimality INTERNATIONAL CONFERENCE ON MANUFACTURING AUTOMATION (ICMA200) Multi-Piee Mold Design Based on Linear Mixed-Integer Program Toward Guaranteed Optimality Stephen Stoyan, Yong Chen* Epstein Department of

More information

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

More information

HEXA: Compact Data Structures for Faster Packet Processing

HEXA: Compact Data Structures for Faster Packet Processing Washington University in St. Louis Washington University Open Sholarship All Computer Siene and Engineering Researh Computer Siene and Engineering Report Number: 27-26 27 HEXA: Compat Data Strutures for

More information

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar Plot-to-trak orrelation in A-SMGCS using the target images from a Surfae Movement Radar G. Golino Radar & ehnology Division AMS, Italy ggolino@amsjv.it Abstrat he main topi of this paper is the formulation

More information

Polyhedron Volume-Ratio-based Classification for Image Recognition

Polyhedron Volume-Ratio-based Classification for Image Recognition Polyhedron Volume-Ratio-based Classifiation for Image Reognition Qingxiang Feng, Jeng-Shyang Pan, Senior Member, IEEE, Jar-Ferr Yang, Fellow, IEEE and Yang-ing Chou Abstrat In this paper, a novel method,

More information

Facility Location: Distributed Approximation

Facility Location: Distributed Approximation Faility Loation: Distributed Approximation Thomas Mosibroda Roger Wattenhofer Distributed Computing Group PODC 2005 Where to plae ahes in the Internet? A distributed appliation that has to dynamially plae

More information

Relevance for Computer Vision

Relevance for Computer Vision The Geometry of ROC Spae: Understanding Mahine Learning Metris through ROC Isometris, by Peter A. Flah International Conferene on Mahine Learning (ICML-23) http://www.s.bris.a.uk/publiations/papers/74.pdf

More information

An Alternative Approach to the Fuzzifier in Fuzzy Clustering to Obtain Better Clustering Results

An Alternative Approach to the Fuzzifier in Fuzzy Clustering to Obtain Better Clustering Results An Alternative Approah to the Fuzziier in Fuzzy Clustering to Obtain Better Clustering Results Frank Klawonn Department o Computer Siene University o Applied Sienes BS/WF Salzdahlumer Str. 46/48 D-38302

More information

Contents Contents...I List of Tables...VIII List of Figures...IX 1. Introduction Information Retrieval... 8

Contents Contents...I List of Tables...VIII List of Figures...IX 1. Introduction Information Retrieval... 8 Contents Contents...I List of Tables...VIII List of Figures...IX 1. Introdution... 1 1.1. Internet Information...2 1.2. Internet Information Retrieval...3 1.2.1. Doument Indexing...4 1.2.2. Doument Retrieval...4

More information

Multivariate Texture-based Segmentation of Remotely Sensed. Imagery for Extraction of Objects and Their Uncertainty

Multivariate Texture-based Segmentation of Remotely Sensed. Imagery for Extraction of Objects and Their Uncertainty Multivariate Texture-based Segmentation of Remotely Sensed Imagery for Extration of Objets and Their Unertainty Arko Luieer*, Alfred Stein* & Peter Fisher** * International Institute for Geo-Information

More information

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks International Journal of Advanes in Computer Networks and Its Seurity IJCNS A Load-Balaned Clustering Protool for Hierarhial Wireless Sensor Networks Mehdi Tarhani, Yousef S. Kavian, Saman Siavoshi, Ali

More information

Semantic Concept Detection Using Weighted Discretization Multiple Correspondence Analysis for Disaster Information Management

Semantic Concept Detection Using Weighted Discretization Multiple Correspondence Analysis for Disaster Information Management Semanti Conept Detetion Using Weighted Disretization Multiple Correspondene Analysis for Disaster Information Management Samira Pouyanfar and Shu-Ching Chen Shool of Computing and Information Sienes Florida

More information

arxiv: v1 [cs.lg] 20 Dec 2013

arxiv: v1 [cs.lg] 20 Dec 2013 Unsupervised Feature Learning by Deep Sparse Coding Yunlong He Koray Kavukcuoglu Yun Wang Arthur Szlam Yanjun Qi arxiv:1312.5783v1 [cs.lg] 20 Dec 2013 Abstract In this paper, we propose a new unsupervised

More information

FOREGROUND OBJECT EXTRACTION USING FUZZY C MEANS WITH BIT-PLANE SLICING AND OPTICAL FLOW

FOREGROUND OBJECT EXTRACTION USING FUZZY C MEANS WITH BIT-PLANE SLICING AND OPTICAL FLOW FOREGROUND OBJECT EXTRACTION USING FUZZY C EANS WITH BIT-PLANE SLICING AND OPTICAL FLOW SIVAGAI., REVATHI.T, JEGANATHAN.L 3 APSG, SCSE, VIT University, Chennai, India JRF, DST, Dehi, India. 3 Professor,

More information

Transition Detection Using Hilbert Transform and Texture Features

Transition Detection Using Hilbert Transform and Texture Features Amerian Journal of Signal Proessing 1, (): 35-4 DOI: 1.593/.asp.1.6 Transition Detetion Using Hilbert Transform and Texture Features G. G. Lashmi Priya *, S. Domni Department of Computer Appliations, National

More information

特集 Road Border Recognition Using FIR Images and LIDAR Signal Processing

特集 Road Border Recognition Using FIR Images and LIDAR Signal Processing デンソーテクニカルレビュー Vol. 15 2010 特集 Road Border Reognition Using FIR Images and LIDAR Signal Proessing 高木聖和 バーゼル ファルディ Kiyokazu TAKAGI Basel Fardi ヘンドリック ヴァイゲル Hendrik Weigel ゲルド ヴァニーリック Gerd Wanielik This paper

More information

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425)

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425) Automati Physial Design Tuning: Workload as a Sequene Sanjay Agrawal Mirosoft Researh One Mirosoft Way Redmond, WA, USA +1-(425) 75-357 sagrawal@mirosoft.om Eri Chu * Computer Sienes Department University

More information

Shape Outlier Detection Using Pose Preserving Dynamic Shape Models

Shape Outlier Detection Using Pose Preserving Dynamic Shape Models Shape Outlier Detetion Using Pose Preserving Dynami Shape Models Chan-Su Lee Ahmed Elgammal Department of Computer Siene, Rutgers University, Pisataway, NJ 8854 USA CHANSU@CS.RUTGERS.EDU ELGAMMAL@CS.RUTGERS.EDU

More information

timestamp, if silhouette(x, y) 0 0 if silhouette(x, y) = 0, mhi(x, y) = and mhi(x, y) < timestamp - duration mhi(x, y), else

timestamp, if silhouette(x, y) 0 0 if silhouette(x, y) = 0, mhi(x, y) = and mhi(x, y) < timestamp - duration mhi(x, y), else 3rd International Conferene on Multimedia Tehnolog(ICMT 013) An Effiient Moving Target Traking Strateg Based on OpenCV and CAMShift Theor Dongu Li 1 Abstrat Image movement involved bakground movement and

More information

Machine Vision. Laboratory Exercise Name: Student ID: S

Machine Vision. Laboratory Exercise Name: Student ID: S Mahine Vision 521466S Laoratory Eerise 2011 Name: Student D: General nformation To pass these laoratory works, you should answer all questions (Q.y) with an understandale handwriting either in English

More information

A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks

A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks Abouberine Ould Cheikhna Department of Computer Siene University of Piardie Jules Verne 80039 Amiens Frane Ould.heikhna.abouberine @u-piardie.fr

More information

Probabilistic Classification of Image Regions using an Observation-Constrained Generative Approach

Probabilistic Classification of Image Regions using an Observation-Constrained Generative Approach 9 Probabilisti Classifiation of mage Regions using an Observation-Constrained Generative Approah Sanjiv Kumar, Alexander C. Loui 2, and Martial Hebert The Robotis nstitute, Carnegie Mellon University,

More information

Time delay estimation of reverberant meeting speech: on the use of multichannel linear prediction

Time delay estimation of reverberant meeting speech: on the use of multichannel linear prediction University of Wollongong Researh Online Faulty of Informatis - apers (Arhive) Faulty of Engineering and Information Sienes 7 Time delay estimation of reverberant meeting speeh: on the use of multihannel

More information

Accepted Manuscript. Domain Class Consistency based Transfer Learning for Image Classification Across Domains. Lei Zhang, Jian Yang, David Zhang

Accepted Manuscript. Domain Class Consistency based Transfer Learning for Image Classification Across Domains. Lei Zhang, Jian Yang, David Zhang Aepted Manusript Domain lass onsisteny based ransfer Learning for Image lassifiation Aross Domains Lei Zhang Jian Yang David Zhang PII: 000-055(6)335-9 DOI: 0.06/j.ins.07.08.034 Referene: IN 3038 o appear

More information

Learning Deep Context-Network Architectures for Image Annotation

Learning Deep Context-Network Architectures for Image Annotation 1 Learning Deep Context-Network Arhitetures for Image Annotation Mingyuan Jiu and Hihem Sahbi arxiv:1803.08794v1 [s.cv] 23 Mar 2018 Abstrat Context plays an important role in visual pattern reognition

More information

Adapting Instance Weights For Unsupervised Domain Adaptation Using Quadratic Mutual Information And Subspace Learning

Adapting Instance Weights For Unsupervised Domain Adaptation Using Quadratic Mutual Information And Subspace Learning Adapting Instane Weights For Unsupervised Domain Adaptation Using Quadrati Mutual Information And ubspae Learning M.N.A. Khan, Douglas R. Heisterkamp Department of Computer iene Oklahoma tate University,

More information

Multi-modal Clustering for Multimedia Collections

Multi-modal Clustering for Multimedia Collections Multi-modal Clustering for Multimedia Colletions Ron Bekkerman and Jiwoon Jeon Center for Intelligent Information Retrieval University of Massahusetts at Amherst, USA {ronb jeon}@s.umass.edu Abstrat Most

More information

Segmentation of brain MR image using fuzzy local Gaussian mixture model with bias field correction

Segmentation of brain MR image using fuzzy local Gaussian mixture model with bias field correction IOSR Journal of VLSI and Signal Proessing (IOSR-JVSP) Volume 2, Issue 2 (Mar. Apr. 2013), PP 35-41 e-issn: 2319 4200, p-issn No. : 2319 4197 Segmentation of brain MR image using fuzzy loal Gaussian mixture

More information

Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

Fast Rigid Motion Segmentation via Incrementally-Complex Local Models Fast Rigid Motion Segmentation via Inrementally-Complex Loal Models Fernando Flores-Mangas Allan D. Jepson Department of Computer Siene, University of Toronto {mangas,jepson}@s.toronto.edu Abstrat The

More information

Learning Visual Semantics: Models, Massive Computation, and Innovative Applications

Learning Visual Semantics: Models, Massive Computation, and Innovative Applications Learning Visual Semantics: Models, Massive Computation, and Innovative Applications Part II: Visual Features and Representations Liangliang Cao, IBM Watson Research Center Evolvement of Visual Features

More information

An Edge-based Clustering Algorithm to Detect Social Circles in Ego Networks

An Edge-based Clustering Algorithm to Detect Social Circles in Ego Networks JOURNAL OF COMPUTERS, VOL. 8, NO., OCTOBER 23 2575 An Edge-based Clustering Algorithm to Detet Soial Cirles in Ego Networks Yu Wang Shool of Computer Siene and Tehnology, Xidian University Xi an,77, China

More information

Comparing Images Under Variable Illumination

Comparing Images Under Variable Illumination ( This paper appeared in CVPR 8. IEEE ) Comparing Images Under Variable Illumination David W. Jaobs Peter N. Belhumeur Ronen Basri NEC Researh Institute Center for Computational Vision and Control The

More information

arxiv: v1 [cs.cv] 6 Jul 2016

arxiv: v1 [cs.cv] 6 Jul 2016 arxiv:607.079v [cs.cv] 6 Jul 206 Deep CORAL: Correlation Alignment for Deep Domain Adaptation Baochen Sun and Kate Saenko University of Massachusetts Lowell, Boston University Abstract. Deep neural networks

More information

Convolutional Recurrent Neural Networks: Learning Spatial Dependencies for Image Representation

Convolutional Recurrent Neural Networks: Learning Spatial Dependencies for Image Representation Convolutional Recurrent Neural Networks: Learning Spatial Dependencies for Image Representation Zhen Zuo 1, Bing Shuai 1, Gang Wang 1,2, Xiao Liu 1, Xingxing Wang 1, Bing Wang 1, Yushi Chen 3 1 Nanyang

More information

System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications

System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications System-Level Parallelism and hroughput Optimization in Designing Reonfigurable Computing Appliations Esam El-Araby 1, Mohamed aher 1, Kris Gaj 2, arek El-Ghazawi 1, David Caliga 3, and Nikitas Alexandridis

More information

DOMAIN ADAPTATION BY ITERATIVE IMPROVEMENT OF SOFT-LABELING AND MAXIMIZATION OF NON-PARAMETRIC MUTUAL INFORMATION. M.N.A. Khan, Douglas R.

DOMAIN ADAPTATION BY ITERATIVE IMPROVEMENT OF SOFT-LABELING AND MAXIMIZATION OF NON-PARAMETRIC MUTUAL INFORMATION. M.N.A. Khan, Douglas R. DOMAIN ADAPTATION BY ITERATIVE IMPROVEMENT OF SOFT-LABELING AND MAXIMIZATION OF NON-PARAMETRIC MUTUAL INFORMATION M.N.A. Khan, Douglas R. Heisterkamp Department of Computer Siene Oklahoma State University,

More information

A PROTOTYPE OF INTELLIGENT VIDEO SURVEILLANCE CAMERAS

A PROTOTYPE OF INTELLIGENT VIDEO SURVEILLANCE CAMERAS INTERNATIONAL JOURNAL OF INFORMATION AND SYSTEMS SCIENCES Volume 1, 3, Number 1, 3, Pages 1-22 365-382 2007 Institute for Sientifi Computing and Information A PROTOTYPE OF INTELLIGENT VIDEO SURVEILLANCE

More information

Part Localization by Exploiting Deep Convolutional Networks

Part Localization by Exploiting Deep Convolutional Networks Part Localization by Exploiting Deep Convolutional Networks Marcel Simon, Erik Rodner, and Joachim Denzler Computer Vision Group, Friedrich Schiller University of Jena, Germany www.inf-cv.uni-jena.de Abstract.

More information