Adaptive Feature Selection via Boosting-like Sparsity Regularization

Libin Wang, Zhenan Sun, Tieniu Tan
Center for Research on Intelligent Perception and Computing, NLPR, Beijing, China
Email: {lbwang, znsun, tnt}@nlpr.ia.ac.cn

Abstract

To efficiently select a discriminative and complementary subset from a large feature pool, we propose a two-stage learning strategy that considers both samples and their features, namely sample selection followed by feature selection. The objective functions of both stages are consistent with a large margin loss. At the first stage, the support samples are selected by a Support Vector Machine (SVM). At the second stage, a Boosting-like Sparsity Regularization (SRBoost) algorithm is presented to select a small number of complementary features. In detail, each weak learner is composed of a few features selected by a sparsity enforcing model, and an intermediate variable is used to reweight the corresponding samples. Extensive experimental results on the CASIA-IrisV4.0 database demonstrate that our method outperforms the state-of-the-art methods.

Keywords-feature selection; Boosting; sparse

I. INTRODUCTION

Feature selection aims to select a small subset of compact and discriminative features. In biometrics, object classification and recognition, an image is usually represented by local feature descriptors, such as SIFT [10] and Ordinal Measures (OM) [13], which are extracted at every pixel by certain filters. The resulting feature pool is large and over-complete for describing the image itself, so feature selection is introduced to deal with the high dimensional data. Many related algorithms have been presented during the past several decades. Among them, AdaBoost [6], [14] is a class of successful methods that heuristically select a new feature (weak learner) on the reweighted samples.
Recently, sparsity enforcing models [7][15][11] have attracted great attention and achieved competitive performance, especially when the number of training samples is small. These sparse models commonly address the feature selection problem as an $\ell_0$ or $\ell_1$ regularized optimization, which enforces the weights of features to be sparse. Besides the regularization term, the loss function is another significant element. Destrero et al. [5] adopt Least Squares (LS) directly as the loss function with application to face detection. He et al. [7] propose a correntropy based robust estimation loss to tackle non-Gaussian noise. Wang et al. [15] formulate feature selection as a Linear Programming (LP) model with a large margin loss, which is robust to noise and outliers as well. In the above models, complementary features are not explicitly taken into consideration, such that similar features will share large weights simultaneously. Moreover, the optimization of a sparse model usually involves matrix computations [11][7], which are time-consuming in general.

Figure 1: Flowchart of the proposed method. (1) Sample selection (steps 1-2). The points inside the dashed ellipse are the selected samples. (2) Feature selection (steps 3-4) by SRBoost. The bold dashed lines ($S_1$, $S_2$) are the features selected by the Simplex algorithm (the polyhedron), respectively.

In summary, although sparsity regularization methods have achieved promising performance, they have some limitations: they do not consider the distribution of samples, and the complementarity of features is not explicitly taken into account. To address these problems, we propose a two-stage learning strategy consisting of sample selection and Boosting-like learning. The loss functions of the two stages are consistent, which promotes the performance. At the first stage, the support samples are selected by SVM, which uses the Hinge loss function with a large margin principle.
At the second stage, a Boosting-like sparsity regularization (SRBoost) algorithm is designed to select the complementary features. In detail, SRBoost iteratively selects a small number of features by a sparsity enforcing model, to which a large margin loss is added as well. The selected features are complementary because the features selected in each iteration classify training samples with different weights. In general, the proposed model with the large margin loss can be formulated as a linear programming problem, which can be efficiently solved by an iterative Simplex algorithm. Figure 1 illustrates the flowchart of the proposed method.

II. BOOSTING-LIKE SPARSITY REGULARIZATION MODEL

A. Notations and primary settings

Without loss of generality, we consider a binary classification problem, because for feature selection a multi-class problem can be converted into an intra-class matching class and an inter-class matching class. Assuming that class labels are the results of a linear mapping of the feature space, we learn the linear function by minimizing the mean squared error. Here $y = \{y_j \mid y_j \in \{+1, -1\}\}$ denotes the class labels, and $X = \{x \mid x \in \{x^+, x^-\}\}$ denotes a data set of $D$-dimensional features, wherein $x^+$ and $x^-$ represent the positive and negative samples respectively. The linear decision hyperplane is $y - Xw = 0$, where $w$ represents the weight vector.

B. Sample selection

In the first stage, sample selection is applied as a preprocessing strategy to reduce the scale of the training set while maintaining the distribution of samples for classification. Sampling techniques may be an off-the-shelf solution. The classic statistical bootstrap and the n out of m bootstrap are important general resampling approaches. However, in order to preserve the distribution of the original data, sampling has to be repeated a sufficient number of times. In this paper, we instead take advantage of the Support Vector Machine (SVM). Since the output of an SVM is a function of the selected support vectors, sample selection is implemented here in a supervised way. Considering the efficiency, we use the linear SVM without kernels to select samples. The objective function takes the form [2]:

$$L(w, b, a) = \frac{1}{2}\|w\|_2^2 - \sum_{n=1}^{N} a_n \{ y_n (w^T x_n + b) - 1 \} \tag{1}$$

The linear SVM can be efficiently solved by the Sequential Minimal Optimization (SMO) algorithm [4], [3].
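As a rough illustration of this sample selection step, the sketch below splits a training set into support samples and remaining samples once a linear SVM has been trained. It assumes that $w$ and $b$ are already available from some solver (LIBSVM [3] in the paper); the function name `split_support_samples`, the tolerance `tol` and the toy data are hypothetical. The criterion $y_n (w^T x_n + b) \le 1$ follows from the KKT conditions of a soft-margin SVM, so it approximates the set of samples with $a_n > 0$ rather than reading the multipliers from the solver.

```python
def split_support_samples(X, y, w, b, tol=1e-9):
    """Split sample indices into (support, rest) for a trained linear SVM.

    Assumption: a sample counts as a support sample when it lies on or
    inside the margin, i.e. y_n * (w^T x_n + b) <= 1.
    """
    support, rest = [], []
    for n, (x_n, y_n) in enumerate(zip(X, y)):
        score = sum(w_i * x_i for w_i, x_i in zip(w, x_n)) + b
        if y_n * score <= 1.0 + tol:
            support.append(n)   # close to the decision boundary
        else:
            rest.append(n)      # can serve as a validation set
    return support, rest


# Hypothetical toy data: two groups on a line, separated at x = 0.
X = [[-3.0, 0.0], [-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
y = [-1, -1, -1, 1, 1, 1]
# Max-margin hyperplane for this toy set (assumed to come from an SVM solver).
w, b = [1.0, 0.0], 0.0

support, rest = split_support_samples(X, y, w, b)
print("support samples:", support)
print("remaining samples:", rest)
```

Only the two samples nearest to the boundary are kept here, which mirrors the idea that feature selection can run on a small subset while the remaining samples are held out.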
In addition, the support vectors have the useful property that they lie close to the decision boundary, so they reflect the distribution of the samples that are relatively hard to classify. It is therefore reasonable to deploy the following feature selection on the selected samples only. Figure 1 (steps 1-2) shows the process of sample selection. Generally, the time spent on sample selection is worthwhile compared with the cost of the following selection step. It is worth mentioning that the Hinge loss of SVM follows the large margin criterion, which relates it closely to the following feature selection. Furthermore, the rest of the training samples can be used as a validation set for cross validation.

C. SRBoost

1) Learning weak classifiers: This step aims to construct weak learners as in AdaBoost; the difference is that the weak learners are learnt rather than hand-crafted. Specifically, the first step of the second learning stage is to select features by sparsity regularization, and the few learnt features constitute one weak classifier. As previously mentioned, the linear decision hyperplane is $y - Xw = 0$. The original sparsity enforcing methods [5][9] can be summarized as:

$$w^* = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1 \tag{2}$$

To further improve the performance, a robust estimator $\phi(\cdot)$ is introduced to deal with non-Gaussian noise [7]:

$$w^* = \arg\min_w \sum_{i=1}^{N} \phi(y_i - X_i w) + \lambda \|w\|_1 \tag{3}$$

The robust functions have the property that $\phi(x)$ remains stable (bounded) even if the independent variable $x$ is very large, e.g., $\phi(x) = 1 - \exp(-x^2)$ [7], which is different from the original least squares loss.

In order to be consistent with the objective function of sample selection (Equation (1)), we again employ a large margin loss function [15] for learning weak classifiers. Considering the Boosting-like strategy, a weighting term $k$ is introduced to update the samples. Therefore, in this paper, we present the sparsity regularization model:

$$\begin{aligned} \min_{w,\,\xi}\quad & w^T \mathbf{1} + \lambda\,(k^T \xi) \\ \text{s.t.}\quad & w^T x_j^+ \le C_+ + \xi_j, \quad j = 1 \dots N_+, \\ & w^T x_j^- \ge C_- - \xi_j, \quad j = 1 \dots N_-, \\ & w_i \ge 0,\ \xi_j \ge 0, \quad i = 1 \dots D,\ j = 1 \dots N. \end{aligned} \tag{4}$$

where $\mathbf{1}$ is a vector whose elements are all 1, $C_+$ and $C_-$ are empirically determined constants, and $\lambda$ is a regularization parameter balancing the two parts of the objective function. $k$ is a weighting term that controls the variation of the learnt $\xi$. The objective function is an $\ell_1$ minimization with nonnegative constraints, therefore the weight vector $w$ is sparse. The constraints classify the samples in a supervised way; additionally, loss functions with such constraints also follow the large margin principle [15], which is consistent with the above sample selection step. Finally, the features are selected according to the weights.

The important term in our model is the slack variable $\xi$. It has almost the same effect as the robust estimator $\phi$, but it is learnt adaptively. For example, if one sample is corrupted by large noise, then the corresponding slack variable $\xi$ becomes large automatically. In this case, the response values of noisy samples are suppressed to ensure the learning performance.
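To make the structure of the model in Equation (4) concrete, the following sketch assembles it in the standard inequality form $\min c^T z$ subject to $A z \le b$, $z \ge 0$, with $z = [w, \xi]$. It is only an illustration under stated assumptions: the function names `build_srboost_lp` and `is_feasible`, the list-based data layout and the toy numbers are hypothetical, and no solver is included.

```python
def build_srboost_lp(X_pos, X_neg, k, lam, C_pos, C_neg):
    """Return (c, A, b) for  min c^T z  s.t.  A z <= b, z >= 0,  z = [w, xi]."""
    D = len(X_pos[0])
    N_pos, N_neg = len(X_pos), len(X_neg)
    N = N_pos + N_neg
    # Objective of Eq. (4): 1^T w + lambda * k^T xi
    c = [1.0] * D + [lam * k_j for k_j in k]
    A, b = [], []
    # Positive samples:  w^T x_j - xi_j <= C_+
    for j, x in enumerate(X_pos):
        slack = [0.0] * N
        slack[j] = -1.0
        A.append(list(x) + slack)
        b.append(C_pos)
    # Negative samples:  w^T x_j >= C_- - xi_j,  i.e.  -w^T x_j - xi_j <= -C_-
    for j, x in enumerate(X_neg):
        slack = [0.0] * N
        slack[N_pos + j] = -1.0
        A.append([-v for v in x] + slack)
        b.append(-C_neg)
    return c, A, b


def is_feasible(z, A, b, tol=1e-9):
    """Check z >= 0 and A z <= b up to a small tolerance."""
    nonneg = all(v >= -tol for v in z)
    rows = all(sum(a * v for a, v in zip(row, z)) <= b_i + tol
               for row, b_i in zip(A, b))
    return nonneg and rows


# Hypothetical toy problem: one positive and one negative sample, D = 2.
c, A, b = build_srboost_lp([[0.2, 0.3]], [[0.9, 0.7]],
                           k=[0.5, 0.5], lam=1.0, C_pos=0.4, C_neg=0.8)
# w = 0 with xi_j = C_- on the negative sample is always a feasible start.
z0 = [0.0, 0.0, 0.0, 0.8]
print(c, A, b, is_feasible(z0, A, b))
```

The triple `(c, A, b)` can then be handed to any LP solver. For example, SciPy's `scipy.optimize.linprog(c, A_ub=A, b_ub=b)` uses nonnegative variable bounds by default, which matches the last constraint of Equation (4).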

From another perspective, this slack variable can be viewed as the classification rate of the training samples. The smaller the slack variable is, the more confident we are that the corresponding sample is classified correctly. This is key to the following Boosting-like strategy.

Algorithm 1 Boosting-like Sparsity Regularized Feature Selection (SRBoost)
1: Input: Data $X = \{X^+ \in R^{N_+ \times D}, X^- \in R^{N_- \times D}\}$. Output: Weight vector $W \in R^D$. Initialization: $k = 1/N$.
2: sample selection: $\hat{X} \leftarrow$ solving Eqn. (1);
3: for $m = 1 : M$ do
4:     $w^{(m)} \leftarrow$ solving Eqn. (4) by $\hat{X}$;
5:     $k^{(m+1)} \leftarrow$ solving Eqn. (5) or (6);
6: end for
7: $W = \sum_{m=1}^{M} w^{(m)}$

2) Sample reweighting: The goal of this step is to reweight the training samples as in AdaBoost. Specifically, the second step of the second learning stage is boosting the selected features. Complementary features are not explicitly taken into consideration in sparsity enforcing selection methods, thus some similar features share almost the same large weight. Complementarity therefore needs to be modeled directly to reduce the redundancy. Generally, a pair of complementary features can classify different training samples. To implement this idea, we adopt a sample reweighting strategy inspired by the success of AdaBoost [6]. Here we deploy two different functions to update the weights of samples, i.e., a linear and an exponential penalty:

$$k^{(m+1)} = \xi^{(m)} / (\mathbf{1}^T \xi^{(m)}) \tag{5}$$

$$k^{(m+1)} = \exp(\rho\, \xi^{(m)}) / Z \tag{6}$$

where $\rho$ is a learning rate and $Z$ is the normalization factor. From the term $k_j \xi_j$ minimized in the objective function of (4), we can see that the larger $k_j$ is, the smaller the learnt $\xi_j$ is, which means that samples with large weights should be correctly classified.
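The two update rules in Equations (5) and (6) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the uniform fallback when all slack variables are zero, and the subtraction of the maximum before exponentiation (a numerical safeguard that cancels in the normalization) are assumptions added here.

```python
import math


def linear_update(xi):
    """Eq. (5): k_j = xi_j / sum_j xi_j."""
    total = sum(xi)
    if total == 0:
        # Assumption: fall back to uniform weights when every slack is zero.
        return [1.0 / len(xi)] * len(xi)
    return [v / total for v in xi]


def exponential_update(xi, rho):
    """Eq. (6): k_j = exp(rho * xi_j) / Z, with Z normalizing the sum to 1."""
    top = max(xi)
    # Shifting by max(xi) avoids overflow and does not change the ratio.
    expo = [math.exp(rho * (v - top)) for v in xi]
    z = sum(expo)
    return [e / z for e in expo]


# Hypothetical slack values for three samples; the last one is the hardest.
xi = [0.0, 1.0, 3.0]
print("linear:      ", linear_update(xi))
print("exp, rho=0.1:", exponential_update(xi, 0.1))
print("exp, rho=3:  ", exponential_update(xi, 3.0))
```

On these toy values a small $\rho$ gives the hardest sample less weight than the linear rule does, and a large $\rho$ gives it more, which is the gentle versus severe behaviour of the two penalties.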
Under Equation (5) or (6), the samples with a large classification error $\xi_j^{(m)}$ in the previous iteration receive a large $k_j^{(m+1)}$; larger weights are thus assigned to these samples in the next round, so that they are forced to be classified with a smaller error $\xi_j^{(m+1)}$ in the current iteration. In other words, the features selected in different iterations are complementary and deal with different samples. In the linear case, the update rate of samples is fixed without parameters, whereas the exponential penalty is more flexible thanks to a tunable learning ratio. In the exponential case, different values of the update ratio $\rho$ give different levels of penalty. If $\rho$ is small, the weights are updated more gently than in the linear case; if $\rho$ is large, they are updated more severely. Figure 1 (steps 3-4) shows the flowchart of SRBoost. Finally, the selected features are the intersection of the results of the $M$ iterations. The entire model is an LP problem, thus it can be efficiently solved by the Simplex algorithm. Algorithm 1 describes the proposed SRBoost in brief.

In summary, the proposed two-stage learning strategy considers both efficiency and effectiveness, as shown in Algorithm 1. Sample selection is introduced to reduce the computational complexity of feature selection while preserving the local distribution of samples close to the decision boundary. In particular, the second stage, SRBoost, combines the advantages of sparsity regularization and AdaBoost-like methods. In addition, the two stages are closely related through a consistent large margin loss function.

III. EXPERIMENTAL RESULTS

To verify the performance of our method, we conduct experiments on feature selection for iris images [12], [15], because the local feature descriptors of biometrics are typically high dimensional.

A. Datasets. We evaluate our method on two subsets of the CASIA-IrisV4.0 database [1].
CASIA-Iris-Thousand (Thousand) contains 20,000 iris images from 1,000 subjects. We use the Distance subset to verify the generalization of these methods. Both are challenging databases.

B. Settings. We use the same settings as in [15]. The iris images are all normalized to the size of 7 54 without preprocessing. 500 iris images from 25 subjects (10 images per eye) in the Thousand database are used for training. We generate 2,250 intra-class matching scores as the positive samples and 4,900 inter-class matching scores as the negative samples, and the rest of the Thousand subset serves as the test set. We adopt the regional OM [13][8] as our local feature, and the matching scores are computed by Hamming distance. 47,42 regional OM features are extracted for selection. We select 15 features for comparison, which is enough for competitive performance. $C_+$ and $C_-$ are set to .4 and .8 respectively. The algorithms involved in the comparison are GentleBoost [6], the traditional $\ell_1$ regularized sparse method [5], and RRLP [15].

C. Evaluations. We analyze the experimental results from two aspects: the learning stage, and the analysis of the results, including parameter selection, accuracy and efficiency.

Figure 2: The results of feature selection at the first two iterations. (Horizontal axis: the index of features; vertical axis: the weight of features; curves: results of the 1st iteration and results of the 2nd iteration.)

Table I: Comparative results on the Thousand database

Methods             EER     @=%
GentleBoost [6]     .96     .297
l1 [5]              .9      .285
RRLP [15]           .85     .244
Proposed SRBoost    .42     .77

Table II: Comparative results on the Distance database

Methods             EER     @=%
GentleBoost [6]     .756    275
l1 [5]              .689    291
RRLP [15]           .641    197
Proposed SRBoost    .613    898

1) Learning stage: We train the models to select several numbers of features on the Thousand database for iris recognition.

Sample selection: The first step is sample selection. In total, the number of training samples $N$ is 7,150, including $N_+$ = 2,250 and $N_-$ = 4,900. The support samples are selected by linear SVM [3] with default parameters. Here we only focus on the samples rather than the performance of SVM. Then 147 support vectors are extracted, which are only about 2% of the original training samples. The other samples, distributed away from the decision boundary, are not so crucial for feature selection. They are used as a validation set in the following stage to select the optimal parameters of SRBoost.

SRBoost learning: In the inner loop, a sparsity regularized feature selection via LP is implemented with fixed $C_+$ and $C_-$. Initially, the weights of samples are set to $1/N$. Then the sample weights are updated at each iteration according to the learned slack variable $\xi$ via Equation (5) or (6). The former, linear update function is convenient because it has no extra parameters, whereas the latter, exponential function is more flexible thanks to the tunable update ratio $\rho$. A large $\rho$ is suitable for cases with a small number of Boosting-like iterations. For computational convenience, we run two rounds of iterations, which is also enough for competitive results. For example, if $\rho$ = 3, then the numbers of features selected at the two iterations are 27 and 19, respectively.
Figure 2 illustrates the feature selection results at the first two iterations. As shown in Figure 2, the features are different and complementary to some extent. Finally, for a fair comparison, we select 15 features for all algorithms. For Lasso and RRLP, we select the top 15 features by the absolute value of the weights, while in the proposed SRBoost algorithm we select 8 and 7 features in the two iterations respectively. SVM is applied as the classifier for iris recognition.

2) Performance analysis: In biometrics, ROC curves and the Equal Error Rate (EER) are usually employed as measurements of performance. EER is the rate at which the False Accept Rate (FAR) and the False Reject Rate (FRR) are equal on the ROC curve. The smaller the EER is, the better the performance is.

Parameter selection: Firstly, we study the impact of different update functions on the performance. Three models are trained, with the linear case and the exponential case ($\rho$ = 1, $\rho$ = 3) respectively. For simplicity, the first 50 classes (left eyes and right eyes of 25 subjects) in the Thousand database, excluding the training data, are selected to test the performance. As shown in Figure 3(a), the exponential update function generally performs better than the linear one, and the large update ratio obtains the best results. This is because only two Boosting iterations are carried out, so a severe update of the sample weights ensures better complementarity of the features: the features selected at the second iteration are more prone to classify the hard samples misclassified at the previous iteration. To further support this explanation, the same three models are applied to the Distance database, which is different from the training data. Figure 3(b) shows similar results. The performance of the three models is closer than on the Thousand database because of generalization, i.e., the capacity of the trained models is not so strong, which suggests that the quality of the two subsets differs greatly.
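The EER defined above can be estimated from two lists of matching scores with a simple threshold sweep. The sketch below is illustrative only: it assumes distance-like scores such as Hamming distances (a smaller score means a better match), sweeps only the observed score values, and reports the mean of FAR and FRR at the threshold where they are closest. The function name and the toy scores are hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER for distance scores (smaller = more similar).

    genuine:  intra-class matching scores
    impostor: inter-class matching scores
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        # An impostor is falsely accepted when its distance is <= t.
        far = sum(s <= t for s in impostor) / len(impostor)
        # A genuine pair is falsely rejected when its distance is > t.
        frr = sum(s > t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores with one overlapping pair.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.5, 0.7, 0.8]
print("EER:", equal_error_rate(genuine, impostor))
```

With many scores, interpolating between neighbouring thresholds would give a smoother estimate; the coarse sweep is kept here for brevity.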
Comparative results: Secondly, we compare the proposed method with three other state-of-the-art algorithms. The remaining images of the Thousand database all form the testing set, which yields 87,750 intra-class matchings and 19275 inter-class matchings. This number of samples is sufficient for testing the algorithms. We adopt the exponential update function ($\rho$ = 3) based on the analysis above. As shown in Figure 3(c), RRLP obtains better results than the $\ell_1$ sparse method, because it deploys a more robust large margin based loss function. The proposed SRBoost performs best, which indicates that the Boosting strategy works and that it is necessary to explicitly consider the complementarity of features. The EER and the rate at the fixed operating point (second column) are listed in Table I. The EER of our method is improved by nearly 50% compared with the other methods. Considering the applicability, the rate at the fixed operating point is .77, which is much lower than that of the classical methods.

Figure 3: The ROC curves of feature selection under two kinds of update functions (a) on the Thousand database and (b) on the Distance database. The ROC curves of feature selection compared with other methods (c) on the Thousand database and (d) on the Distance database. The training data sets are from the Thousand database. (Legends: ρ=1 exp, ρ=3 exp, Linear and EER curve in (a) and (b); GentleBoost, L1, RRLP, SRBoost and EER curve in (c) and (d).)

To verify the generalization, we also conduct the same experiments on the Distance database, on which we generate 4766 intra-class matchings and 32896 inter-class matchings. ROC curves are shown in Figure 3(d). The recognition rates are consistent with the results on the Thousand database, and SRBoost again obtains the best performance. Although the results on the Distance database are not as good as those on the Thousand database, our algorithm still shows its potential for good generalization.

IV. CONCLUSION

In this paper, we have proposed a two-stage learning strategy, consisting of sample selection and feature selection, to select features. Our method considers the samples and their high dimensional features simultaneously, and the loss functions of both stages are consistent with the large margin principle. At the first stage, the support samples are selected by SVM with regard to the distribution of the training samples. At the second stage, a Boosting-like sparsity regularization (SRBoost) algorithm is presented to select a small number of complementary features. The experimental results on the CASIA-IrisV4.0 database have demonstrated that our method outperforms the state-of-the-art methods.

ACKNOWLEDGMENT

This work is funded by the National Basic Research Program of China (212CB3163), National Natural Science Foundation of China (Grant No.
61273272, 6113155), International S&T Cooperation Program of China (Grant No. 21DFB1411) and Instrument Developing Project of the Chinese Academy of Sciences (Grant No. YZ21266).

REFERENCES

[1] CASIA Iris-V4.0 Database, http://biometrics.idealtest.org/.
[2] C. M. Bishop, Pattern Recognition and Machine Learning, 2006.
[3] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM TIST, vol. 2, no. 3, p. 27, 2011.
[4] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[5] A. Destrero, C. De Mol, F. Odone, and A. Verri, A regularized framework for feature selection in face detection and authentication, IJCV, vol. 83, pp. 164-177, 2009.
[6] J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting, Annals of Statistics, vol. 28, p. 2, 1998.
[7] R. He, T. Tan, L. Wang, and W.-S. Zheng, $\ell_{2,1}$ regularized correntropy for robust feature selection, in CVPR, 2012, pp. 2504-2511.
[8] Z. He, Z. Sun, T. Tan, X. Qiu, C. Zhong, and W. Dong, Boosting ordinal features for accurate and fast iris recognition, in CVPR, June 2008, pp. 1-8.
[9] Y. Liang, S. Liao, L. Wang, and B. Zou, Exploring regularized feature selection for person specific face verification, in ICCV, Nov. 2011, pp. 1676-1683.
[10] D. G. Lowe, Distinctive image features from scale-invariant keypoints, IJCV, vol. 60, no. 2, pp. 91-110, 2004.
[11] F. Nie, H. Huang, X. Cai, and C. H. Q. Ding, Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization, in NIPS, 2010, pp. 1813-1821.
[12] J. Pillai, V. Patel, R. Chellappa, and N. Ratha, Secure and robust iris recognition using random projections and sparse representations, TPAMI, vol. 33, no. 9, pp. 1877-1893, 2011.
[13] Z. Sun and T. Tan, Ordinal measures for iris recognition, TPAMI, vol. 31, no. 12, pp. 2211-2226, Dec. 2009.
[14] P. A. Viola, M. J. Jones, and D. Snow, Detecting pedestrians using patterns of motion and appearance, IJCV, vol. 63, no. 2, pp. 153-161, 2005.
[15] L. Wang, Z. Sun, and T. Tan, Robust regularized feature selection for iris recognition via linear programming, in ICPR, Nov. 2012, pp. 3358-3361.