Multi-task Joint Feature Selection for Multi-label Classification


Chinese Journal of Electronics, Vol.24, No.2, Apr. 2015

Multi-task Joint Feature Selection for Multi-label Classification

HE Zhifen(1,2), YANG Ming(1,2) and LIU Huidong(2)

(1. School of Mathematical Sciences, Nanjing Normal University, Nanjing 210023, China)
(2. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China)

Abstract  In multi-label learning, each instance may be associated with a set of class labels simultaneously. We propose a novel multi-label classification approach named MFSM (Multi-task joint feature selection for multi-label classification). MFSM first computes an asymmetric label correlation matrix in the label space. The multi-label learning problem is then formulated as a joint optimization problem with two regularization terms: one exploits the label correlations, while the other selects the similar sparse features shared among the multiple classification tasks (one task per label). The model can be reformulated as an equivalent smooth convex optimization problem, which is solved by Nesterov's method. Experiments on sixteen benchmark multi-label data sets demonstrate that our method outperforms state-of-the-art multi-label learning algorithms.

Key words  Multi-label learning, Multi-task learning, Feature selection, Label correlations.

I. Introduction

Multi-label learning (MLL) is a hot research topic in machine learning, pattern recognition and related areas. In MLL, each instance is represented by a feature vector and may belong to multiple labels; the task is to predict the set of class labels of an unseen instance. In single-label learning (SLL), which covers both two-class and multi-class learning, each object is represented by a single instance and is associated with exactly one class label, regardless of how many classes there are. Thus, SLL is essentially a degenerate version of MLL obtained by restricting each instance to a single class label. The concept of MLL originated in research on document categorization [1,2], where a document may be assigned to several predefined categories at the same time, such as diet and health. In recent years, MLL has also arisen in many other real-world application domains. For example, in automatic image annotation, an image may be annotated with a set of semantic concepts [3,4], such as sunset, beach and sea. In bioinformatics, a gene sequence may be correlated with multiple functional classes [5,6], such as transcription, metabolism and protein synthesis. In music categorization, a music clip may be tagged with more than one concept label [7], such as happy and joyful. Numerous algorithms have been put forward for MLL, such as the boosting-based text categorization algorithm BoosTexter [1], the kernel method RankSvm [6], the multi-label neural network BPMLL [5], the multi-label lazy learner MLkNN [8], multi-label naive Bayes MLNB [9] and the multi-label core vector machine Rank-CVM [10]. The most common and direct approach is to transform the MLL problem into a series of independent binary classification problems [5,6,10,11]; however, this ignores the label correlations. It is widely accepted that exploiting label correlations is an important issue in MLL [5,6,12-15]. For instance, a document tagged with Olympic Games and Basketball would likely also be labeled Sports, while a music clip tagged happy would be unlikely to be labeled sad.
There are many approaches to exploiting label correlations; according to the order of the correlations considered, they can be roughly categorized into first-order, second-order and high-order strategies [12,16]. For example, Fürnkranz et al. [17] considered pairwise correlations between labels. Huang et al. [14] tried to discover label correlations automatically and concluded that the correlations among labels are usually asymmetric. As in SLL, MLL may also suffer from the curse of dimensionality, since it often involves high-dimensional data sets. Zhang et al. [9] adapted the naive Bayes algorithm to MLL, incorporating feature extraction based on principal component analysis and feature selection based on a genetic algorithm. Ji et al. [18] considered multi-label classification and dimensionality reduction simultaneously in the objective function.

Manuscript Received Nov. 2013; Accepted June 2014. This work is supported by the National Natural Science Foundation of China (No. , No.60036), the Natural Science Foundation of Jiangsu Province of China (No.BK20782), and the Key (Major) Program of the Natural Science Foundation of Jiangsu Province of China (No.BK20005). © 2015 Chinese Institute of Electronics. DOI:10.1049/cje

Nevertheless, the label correlations are not taken into consideration. Zhang et al. [19] studied dimensionality reduction in MLL, maximizing the dependence between instances and their corresponding labels via the Hilbert-Schmidt independence criterion in a lower-dimensional feature space. However, this method only performs dimensionality reduction, without addressing the construction of a multi-label classifier. There are therefore three major challenges in MLL: a) how to build an effective multi-label classifier that predicts the label set of an unknown instance; b) how to effectively exploit label correlations; and c) how to reduce the dimensionality of high-dimensional data so as to improve the generalization performance of the MLL system. In this paper, we address all three problems. We introduce a novel multi-label classification algorithm named MFSM. The high-order asymmetric label correlations are first obtained by l1 sparse coding in the label space. The MLL problem is then formulated as a joint optimization problem that incorporates a label correlation term and an l2,1-norm regularization term for selecting the common features shared among the multiple classifiers. Finally, since this optimization problem is convex but non-smooth, we reformulate it as an equivalent smooth convex problem [20], which is solved by Nesterov's method [21]. In summary, the major contributions of this paper are threefold: a) MFSM enriches MLL research with a novel multi-label classifier; b) the technique of multi-task feature learning is successfully applied to MLL; c) the proposed formulation incorporates the correlations among class labels.

II. Related Works

Over the past decade, MLL, which deals with instances having multiple labels, has received much attention in machine learning, data mining and related fields. A number of MLL approaches have been developed, and they can be roughly divided into two categories [12]:

1) Problem transformation methods: These methods convert the MLL problem into other well-established learning problems, such as binary classification, multi-class classification and label ranking. Boutell et al. [3] introduced the Binary relevance (BR) method for MLL. BR methods can be parallelized, but they have notable drawbacks: label correlations are not taken into account, and large label spaces become problematic. Read et al. [22] developed the Classifier chains (CC) algorithm on the basis of BR. CC does consider label correlations, but the correlations it captures are arbitrary because of the randomly permuted chain; furthermore, errors may propagate along the chain if the first one or more classifiers perform poorly. To overcome these shortcomings, an Ensemble of classifier chains (ECC) [22] was proposed. The Label powerset (LP) method treats each distinct label subset occurring in the training set as a separate class value and then builds a multi-class classifier. LP takes label correlations into account, but it suffers from the large number of resulting class values and cannot predict unseen label sets. To deal with these problems, Tsoumakas et al. [23] built an ensemble of LP classifiers (RAkEL). First, a number of small subsets are randomly picked from the initial set of labels.
Multiple multi-class classifiers are then constructed using the LP method, and for each instance the outputs of these classifiers are combined by majority voting or thresholding to obtain the predicted label set. The basic idea of Label ranking (LR) [24] is to convert the MLL problem into a label ranking problem via pairwise comparison. Nevertheless, it is difficult to determine the threshold needed to correctly estimate the label set of a predicted instance. On the basis of LR, Fürnkranz et al. [17] therefore added to the label set of each instance an extra virtual class label that acts as a bi-partition point between the relevant and irrelevant labels.

2) Algorithm adaptation methods: These methods extend well-known learning algorithms so that they handle multi-label data directly. Schapire et al. [1] extended the popular AdaBoost algorithm to text categorization, proposing AdaBoost.MH and AdaBoost.MR, which minimize hamming loss and ranking loss, respectively. Zhang et al. [8] adapted the classic k-nearest neighbor algorithm to multi-label data. Many approaches are variants of the classic support vector machine, such as RankSvm [6], Rank-CVM [10], OVR-ESVM [11] and MLLOC [13]. Zhang et al. [5] designed a multi-label algorithm based on the traditional back-propagation neural network.

Multi-task learning (MTL) [20,25-27] learns multiple related tasks simultaneously rather than learning each task independently. It aims to exploit the information shared across related tasks to improve the performance of the MTL system; thus, exploiting the relatedness among tasks is a key issue in MTL [27]. In this paper, we decompose the MLL problem into multiple binary problems and treat each binary classification problem as a learning task over the same input features. Meanwhile, we consider high-order label correlations and use the l2,1-norm to capture the similar sparse structures shared among the tasks. Our proposed approach is presented in the next section.

III. The Proposed Framework

We begin by introducing some notation. Given a training set D = {(x_1, Y_1), (x_2, Y_2), ..., (x_n, Y_n)} with n instances, x_i ∈ R^d is a single instance and Y_i ∈ {+1, -1}^K is its binary label vector, where Y_ij = +1 if x_i belongs to the jth label and -1 otherwise, and K denotes the number of class labels. We denote by X = [x_1, x_2, ..., x_n]^T ∈ R^{n×d} and Y = [Y_1, Y_2, ..., Y_n]^T ∈ R^{n×K} the data matrix and the class label indicator matrix, respectively.

1. Constructing the label correlation matrix

Sparse representation (SR) was first used for signal representation and compression, and has since been widely applied in signal processing, machine learning and other areas.

Given a signal c ∈ R^n and a matrix C = [c_1, c_2, ..., c_K] ∈ R^{n×K}, SR aims to represent c using as few elements of C as possible [28]. The objective function can be defined as

  min_s ||s||_0   s.t.  c = Cs                                (1)

where s ∈ R^K is the sparse coefficient vector. Unfortunately, the problem in Eq.(1) is not convex. Some studies [28,29] showed that if the solution of Eq.(1) is sparse, Eq.(1) can be transformed into

  min_s ||s||_1   s.t.  c = Cs                                (2)

In general, the constraint in Eq.(2) does not hold exactly because of noise. A robust extension adds a noise term [30] ε ∈ R^n, i.e. c = Cs + ε. The constrained optimization problem can then be rewritten as an unconstrained one by introducing a tradeoff parameter, which gives

  min_s̃ (1/2)||c - C̃s̃||_2^2 + α||s̃||_1                        (3)

where C̃ = [C, I_n] ∈ R^{n×(K+n)}, I_n is the n×n identity matrix, s̃ = [s^T, ε^T]^T ∈ R^{K+n}, and α is the regularization parameter that trades off the two terms.

In MLL, label correlations are usually asymmetric [14]. Moreover, not all labels are related [13]; in many cases only some of the labels are relevant, especially when the number of class labels is very large. We therefore compute a sparse representation of each label vector via Eq.(3), characterizing the correlations of each label against the remaining labels. In this work, we assume that the self-label-correlation coefficient of each label is zero. The label correlation matrix W, whose diagonal elements are all zero, is then constructed from the sparse representation vector of each label. The complete procedure is given in Algorithm 1.

Algorithm 1  Learning the label correlation matrix
Input: the label indicator matrix Y of the training set
Output: label correlation matrix W ∈ R^{K×K} with all diagonal elements zero
1: Set C = [c_1, c_2, ..., c_K] ∈ R^{n×K}, where c_ik = 1 if Y_ik = +1 and 0 otherwise; then normalize each column c_k of C to c_k/||c_k||, k = 1, 2, ..., K
2: For k = 1 to K
3:   Obtain the sparse representation s ∈ R^{K+n} of the label vector c_k, with the kth element fixed at zero, by solving
       min_s (1/2)||c_k - C̃s||_2^2 + α||s||_1   s.t.  s_k = 0
4:   Set W_lk = s_l for 1 ≤ l ≤ K
5: End for

2. The MFSM approach

In this work, we learn a set of K linear functions {f_1, f_2, ..., f_K} (one per label), where f_k(x) = a_k^T x + b_k, and a_k ∈ R^d and b_k ∈ R denote the weight vector and the bias for the kth class label, k = 1, 2, ..., K. We assume that both the data matrix X and the label indicator matrix Y are centered, so that all bias terms {b_k} are zero. We treat the construction of each linear classifier as a learning task and train the K classifiers jointly in a single optimization problem. Furthermore, to effectively exploit label correlations, a regularization term is added, giving the objective

  min_{a_k} (1/2) Σ_{i=1}^n Σ_{k=1}^K (a_k^T x_i - Y_ik)^2 + (λ_1/2) Σ_{k=1}^K ||Xa_k - Σ_{j=1}^K W_jk Xa_j||_2^2      (4)

where λ_1 is a regularization parameter balancing the two terms. The first term is the least-squares loss; the second is a reconstruction error term which ensures that each label can be linearly represented by the other, related, class labels. As noted above, the curse of dimensionality may also occur in MLL: in a high-dimensional feature space, usually only a small subset of features is useful for building the classifier.
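Since Algorithm 1 above amounts to one l1-regularized regression per label, it is small enough to sketch directly. The following Python sketch is a hypothetical illustration, not the authors' implementation: the function name is ours, and it delegates Eq.(3) to scikit-learn's Lasso (whose objective is Eq.(3) rescaled by 1/n), while the constraint s_k = 0 is enforced simply by zeroing the kth dictionary column.

```python
import numpy as np
from sklearn.linear_model import Lasso

def label_correlation_matrix(Y, alpha=0.5):
    """Sketch of Algorithm 1: asymmetric label correlation matrix W
    via l1 sparse coding in the label space (Eq.(3)).
    Y is the (n, K) label matrix with entries in {+1, -1}."""
    n, K = Y.shape
    C = (Y == 1).astype(float)                      # step 1: c_ik = 1 iff Y_ik = +1
    C = C / np.maximum(np.linalg.norm(C, axis=0), 1e-12)  # unit-norm columns
    C_tilde = np.hstack([C, np.eye(n)])             # [C, I_n]; I_n absorbs the noise term
    W = np.zeros((K, K))
    for k in range(K):                              # steps 2-5: one sparse code per label
        D = C_tilde.copy()
        D[:, k] = 0.0                               # enforces s_k = 0 (zero diagonal of W)
        # sklearn's Lasso minimizes (1/(2n))||y - Ds||^2 + a||s||_1,
        # so a = alpha/n matches the scaling of Eq.(3).
        s = Lasso(alpha=alpha / n, fit_intercept=False,
                  max_iter=10000).fit(D, C[:, k]).coef_
        W[:, k] = s[:K]                             # step 4: W_lk = s_l, 1 <= l <= K
    return W
```

Zeroing column k is equivalent to the constraint s_k = 0: the kth coefficient then has no effect on the residual and only incurs the l1 penalty, so the optimum sets it to zero.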
In this paper, multiple tasks sharing the same instances and feature space are trained simultaneously, so there is an intrinsic relationship among the tasks. We add an l2,1-norm penalty to select the common discriminative features across the multiple linear classifiers. MFSM can then be expressed as

  min_{a_k} (1/2) Σ_{i=1}^n Σ_{k=1}^K (a_k^T x_i - Y_ik)^2 + (λ_1/2) Σ_{k=1}^K ||Xa_k - Σ_{j=1}^K W_jk Xa_j||_2^2 + λ_2 ||A||_{2,1}      (5)

where A = [a_1, a_2, ..., a_K] = [a^1, a^2, ..., a^d]^T ∈ R^{d×K}, ||A||_{2,1} = Σ_{i=1}^d ||a^i||_2 is the l2,1-norm of A, which enforces similar sparsity patterns across the K binary classifiers, and λ_1 and λ_2 are regularization parameters balancing the three terms. Eq.(5) can be rewritten as

  min_{A ∈ R^{d×K}} q(A) + λ_2 ||A||_{2,1}                      (6)

where q(A) = (1/2)||XA - Y||_F^2 + (λ_1/2) tr(XAMA^T X^T), M = (I_K - W)(I_K - W)^T, W is the label correlation matrix, I_K is the K×K identity matrix, and tr(·) denotes the trace of a matrix.

3. Problem solution by Nesterov's method

The problem in Eq.(6) is a non-smooth convex problem because ||A||_{2,1} is non-differentiable. Adopting the method of Ref.[20], it can be reformulated as the equivalent constrained smooth convex optimization problem

  min_{(u,A) ∈ Ω} g(u, A) = q(A) + λ_2 Σ_{i=1}^d u_i             (7)

where u = [u_1, u_2, ..., u_d]^T and Ω = {(u, A) : ||a^i||_2 ≤ u_i, i = 1, 2, ..., d}, which can then be solved by Nesterov's method [21]. Since q(A) is a smooth convex function, the optimization problem in Eq.(7) is closed and convex [20]. Nesterov's method maintains a sequence of approximate solutions {(u_t, A_t)} and a sequence of search points {(v_t, S_t)}, where t denotes the tth iteration.

The search point (v_t, S_t) is defined as

  (v_t, S_t) = (u_t + α_t(u_t - u_{t-1}), A_t + α_t(A_t - A_{t-1}))        (8)

where α_t is the combination coefficient. The approximate solution (u_{t+1}, A_{t+1}) is computed as

  (u_{t+1}, A_{t+1}) = π_Ω(v_t - γ_t ∇_{v_t} g(v_t, S_t), S_t - γ_t ∇_{S_t} g(v_t, S_t))        (9)

where γ_t is the step size and π_Ω(v, S) is the Euclidean projection [32] of (v, S) onto the convex set Ω:

  π_Ω(v, S) = argmin_{(u,A) ∈ Ω} (1/2)||A - S||_F^2 + (1/2)||u - v||_2^2        (10)

The partial derivative of g(v, S) with respect to v is

  ∇_v g(v, S) = λ_2 1                                           (11)

where 1 ∈ R^d is the vector of all ones, and the partial derivative of g(v, S) with respect to S is

  ∇_S g(v, S) = X^T(XS - Y) + λ_1 X^T XSM                        (12)

The procedure for solving Eq.(7) by Nesterov's method is given in Algorithm 2, where g_{γ,v,S}(u, A) is defined by

  g_{γ,v,S}(u, A) = g(v, S) + ⟨∇_v g(v, S), u - v⟩ + Σ_{i=1}^d Σ_{j=1}^K (∇_S g(v, S))_ij (A - S)_ij + (1/2γ)||u - v||_2^2 + (1/2γ)||A - S||_F^2        (13)

The time complexity of solving Eq.(7) by Nesterov's method is O((ndK + dK)/√ε), where n, d and K denote the numbers of instances, features and class labels, respectively, and ε is the error tolerance; see Ref.[21] for the detailed analysis.

Algorithm 2  Problem solution of MFSM
Input: g(·,·), Ω, γ_0 > 0, (u_0, A_0)
Output: (u, A), where u ∈ R^d and A ∈ R^{d×K}
1: Initialize (u_1, A_1) = (u_0, A_0), β_{-1} = 0, β_0 = 1
2: For t = 1, 2, ... do
3:   Set α_t = (β_{t-2} - 1)/β_{t-1}
4:   Compute (v_t, S_t) using Eq.(8)
5:   For i = 0, 1, ... do
6:     Set γ = 2^{-i} γ_{t-1}, v'_t = v_t - γ∇_{v_t} g(v_t, S_t), S'_t = S_t - γ∇_{S_t} g(v_t, S_t)
7:     Compute (u_{t+1}, A_{t+1}) = π_Ω(v'_t, S'_t)
8:     If g(u_{t+1}, A_{t+1}) ≤ g_{γ,v_t,S_t}(u_{t+1}, A_{t+1}) then
9:       Set γ_t = γ and break
10:    End if
11:  End for
12:  Set β_t = (1 + sqrt(1 + 4β_{t-1}^2))/2
13:  If converged then
14:    Set (u, A) = (u_t, A_t) and terminate
15:  End if
16: End for

In the testing phase, given an unseen instance x, its label set is predicted as

  h(x) = {k | f_k(x) ≥ t(x), 1 ≤ k ≤ K}                          (14)

where t(x) is a threshold function. There are several ways to choose t(x) [3,5,6,8]; a simple one is to set t(x) to a constant [3,8]. Motivated by Refs.[5,6], however, we here learn t(x) with a linear regression function.

IV. Experiments

1. Experimental setting

To validate the effectiveness of the proposed approach, we compare MFSM with eight state-of-the-art MLL algorithms: MLkNN [8], RankSVM [6], BPMLL [5], MLNB [9], MDDM [19], ECC [22], RAkEL [23] and MAHR [14], where ECC and RAkEL are run on the MULAN library [33] and the other algorithms are implemented in Matlab. The algorithms are evaluated on sixteen multi-label data sets, of which eleven belong to the Yahoo collection. Table 1 summarizes the data sets, reporting for each the numbers of instances, training and test examples, features and labels, as well as the domain: Medical, Slashdot, Langlog and Yahoo are text data, while Human and Plant are biology data.

In the experiments, five commonly-used multi-label evaluation criteria are employed: Average precision, One-Error, Micro-F1, Macro-F1 and Macro-AUC. The first two are instance-based and the last three are label-based; their detailed definitions can be found in Ref.[12].

2. Parameter selection

The parameters of the eight comparison methods are set as suggested in the corresponding literature. MFSM has three parameters to tune: α, λ_1 and λ_2. In this paper, λ_1 is selected from {0.0001, 0.001, 0.01, 0.1, 0.2, ..., 1}, while α and λ_2 are varied from 0 to 1 with a step of 0.1. We tune MFSM on some of the data sets by grid search with five-fold cross-validation.
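To make the training procedure concrete, the following Python sketch assembles Eqs.(8)-(13) into Algorithm 2. It is a minimal sketch under stated assumptions, not the paper's Matlab code: the function names, the initialization γ_0 = 1, the iteration cap and the convergence test are our choices, and the projection of Eq.(10) is computed row by row in closed form as a projection onto the second-order cone.

```python
import numpy as np

def project_onto_omega(v, S):
    """Euclidean projection of (v, S) onto Omega = {(u, A): ||a^i||_2 <= u_i}
    (Eq.(10)), row by row onto the second-order cone."""
    u, A = np.zeros_like(v), np.zeros_like(S)
    for i in range(S.shape[0]):
        w = np.linalg.norm(S[i])
        if w <= v[i]:                          # already feasible: keep the point
            u[i], A[i] = v[i], S[i]
        elif w > -v[i]:                        # partial projection onto the cone
            u[i] = (w + v[i]) / 2.0
            A[i] = (u[i] / w) * S[i]
        # else: the projection is the origin, u[i] = 0 and A[i] = 0
    return u, A

def mfsm_train(X, Y, W, lam1=0.001, lam2=0.7, max_iter=500, tol=1e-6):
    """Sketch of Algorithm 2: Nesterov's accelerated gradient with a
    backtracking line search for Eq.(7). X: (n, d); Y: (n, K) in {+1, -1};
    W: (K, K) label correlation matrix from Algorithm 1."""
    n, d = X.shape
    K = Y.shape[1]
    M = (np.eye(K) - W) @ (np.eye(K) - W).T    # M = (I_K - W)(I_K - W)^T
    XtX, XtY = X.T @ X, X.T @ Y

    def q(A):                                  # smooth part of Eq.(6)
        R = X @ A
        return 0.5 * np.linalg.norm(R - Y, 'fro') ** 2 \
             + 0.5 * lam1 * np.trace(R @ M @ R.T)
    def g(u, A):                               # objective of Eq.(7)
        return q(A) + lam2 * u.sum()
    def grad_S(S):                             # Eq.(12)
        return XtX @ S - XtY + lam1 * XtX @ S @ M
    grad_v = lam2 * np.ones(d)                 # Eq.(11)

    u = u_prev = np.zeros(d)
    A = A_prev = np.zeros((d, K))
    beta_prev2, beta_prev = 0.0, 1.0           # beta_{-1} and beta_0
    gamma = 1.0                                # assumed gamma_0
    for t in range(max_iter):
        alpha = (beta_prev2 - 1.0) / beta_prev
        v = u + alpha * (u - u_prev)           # search point, Eq.(8)
        S = A + alpha * (A - A_prev)
        gS, g_vS = grad_S(S), g(v, S)
        for i in range(60):                    # line search: shrink the step
            step = gamma / 2 ** i
            u_new, A_new = project_onto_omega(v - step * grad_v, S - step * gS)
            du, dA = u_new - v, A_new - S      # bound below evaluates Eq.(13)
            bound = g_vS + grad_v @ du + np.sum(gS * dA) \
                  + (du @ du + np.sum(dA * dA)) / (2 * step)
            if g(u_new, A_new) <= bound:
                gamma = step
                break
        if np.linalg.norm(A_new - A, 'fro') <= tol * max(1.0, np.linalg.norm(A, 'fro')):
            return u_new, A_new                # converged
        u_prev, A_prev, u, A = u, A, u_new, A_new
        beta_prev2 = beta_prev
        beta_prev = (1.0 + np.sqrt(1.0 + 4.0 * beta_prev ** 2)) / 2.0
    return u, A
```

Prediction then follows Eq.(14): an unseen x receives every label k with a_k^T x ≥ t(x). In this sketch one could simply take t(x) = 0 on centered data, whereas the paper fits t(x) by linear regression.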
Experimental results show that MFSM achieves stable performance with α = 0.5, λ_1 = 0.001 and λ_2 = 0.7; hence MFSM is run with these values on all data sets in the experiments. Furthermore, the common features shared among the multiple classifiers are learned through the l2,1-norm regularization: a feature is selected when the corresponding row of the matrix A is not identically zero. The number of selected features is thus not a separate parameter but is controlled by λ_2.

3. Experimental results

Tables 2-6 present the experimental results of MFSM and the eight compared approaches on the sixteen multi-label data sets in terms of the five evaluation metrics, where the best result on each data set among the nine algorithms is highlighted in boldface. Due to space limits, we only list the average results over the eleven Yahoo data sets.

The experimental results in Tables 2 and 3 show that the proposed MFSM performs significantly better than the compared methods on all data sets in terms of the two instance-based evaluation criteria; in particular, MFSM outperforms the high-order algorithms (ECC, RAkEL and MAHR). The comparison results for the three label-based criteria are summarized in Tables 4-6. In Table 4, MFSM ranks second on the Medical data set, but the gap to MAHR is very small, and it achieves the best performance on all remaining data sets. As shown in Table 5, MFSM outperforms BPMLL on most data sets, with a Macro-F1 difference of less than 0.01 on Slashdot; on Medical it is inferior only to MAHR, and on the other data sets it is clearly superior to all compared algorithms. In Table 6, there is little difference between MFSM and MAHR on Medical; although MFSM ranks third on Langlog, its gap to the best algorithm is not significant, and it beats the other six methods.

Table 2. Experimental results in terms of Average precision (the bigger the better)
Table 3. Experimental results in terms of One-Error (the smaller the better)
Table 4. Experimental results in terms of Micro-F1 (the bigger the better)
Table 5. Experimental results in terms of Macro-F1 (the bigger the better)
Table 6. Experimental results in terms of Macro-AUC (the bigger the better)
(Each table covers the Medical, Slashdot, Langlog, Human, Plant and Yahoo data sets.)

For a further statistical comparison, we compare all methods against each other over the multiple data sets with the Friedman test [34] at the 5% significance level. For each evaluation criterion, the null hypothesis that all algorithms perform equally is rejected, so post-hoc tests are needed to determine which algorithms differ significantly. We use the Nemenyi test, under which the performance of two algorithms is significantly different if the difference between their average ranks is not less than the critical difference (CD) [16,34].
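The CD value used in Fig.1 can be reproduced from the formula in Ref.[34]; taking the critical value q_0.05 = 3.102 for nine algorithms from Demšar's table gives

  CD = q_α sqrt(k(k+1)/(6N)) = 3.102 × sqrt(9×10/(6×16)) ≈ 3.00

with k = 9 algorithms and N = 16 data sets.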

Fig.1 gives the statistical comparisons of all algorithms against each other over the multiple data sets for the five evaluation criteria. In each subfigure, the critical difference (CD = 3.00, for 9 algorithms and 16 data sets) is drawn above the axis, the average rank of each algorithm is plotted on the axis (higher ranks to the left), and groups of algorithms that are not significantly different are connected by a bold line. Each algorithm takes part in 40 comparisons (8 comparing algorithms, 5 criteria). MFSM achieves statistically comparable performance in 30% of the cases and better performance in 70%, and it is never inferior, i.e. no algorithm outperforms MFSM. These experimental results and statistical analyses indicate that MFSM outperforms the other state-of-the-art MLL algorithms.

Fig.1. Performance comparisons of all algorithms against each other in terms of each evaluation criterion: (a) Average precision; (b) One-Error; (c) Micro-F1; (d) Macro-F1; (e) Macro-AUC

V. Conclusions

This paper presents a novel and effective multi-label classification method named MFSM, which considers multi-label classifier construction, label correlations and feature selection simultaneously. First, the high-order asymmetric label correlations are obtained by l1 sparse coding in the label space. Second, by taking label correlations into account, the major shortcoming of converting the MLL problem into multiple independent binary classification problems is overcome. Third, the l2,1-norm regularization term selects similar sparsity patterns across the multiple tasks. Finally, the resulting joint optimization formulation is reformulated as an equivalent smooth convex optimization problem and solved by Nesterov's method. The experimental results on sixteen data sets verify the effectiveness of the proposed method.

References
[1] R.E. Schapire and Y. Singer, BoosTexter: A boosting-based system for text categorization, Machine Learning, Vol.39, No.2/3, pp.135-168, 2000.
[2] Y.Y. Jiang, P. Li and Q. Wang, Labeled LDA model based on shared background topics, Acta Electronica Sinica, Vol.41, No.9, 2013. (in Chinese)
[3] M.R. Boutell, J. Luo, et al., Learning multi-label scene classification, Pattern Recognition, Vol.37, No.9, pp.1757-1771, 2004.
[4] C. Wang, S. Yan, et al., Multi-label sparse coding for automatic image annotation, IEEE Conference on Computer Vision and Pattern Recognition, Miami, Florida, USA, 2009.
[5] M.L. Zhang and Z.H. Zhou, Multilabel neural networks with applications to functional genomics and text categorization, IEEE Transactions on Knowledge and Data Engineering, Vol.18, No.10, pp.1338-1351, 2006.
[6] A. Elisseeff and J. Weston, A kernel method for multi-labelled classification, Proc. of the Fifteenth Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, 2001.
[7] K. Trohidis, G. Tsoumakas, et al., Multilabel classification of music into emotions, Proc. of the 9th International Conference on Music Information Retrieval, Philadelphia, PA, USA, 2008.
[8] M.L. Zhang and Z.H. Zhou, ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognition, Vol.40, No.7, pp.2038-2048, 2007.
[9] M.L. Zhang, J.M. Peña and V. Robles, Feature selection for multi-label naive Bayes classification, Information Sciences, Vol.179, No.19, pp.3218-3229, 2009.
[10] J.H. Xu, Fast multi-label core vector machine, Pattern Recognition, Vol.46, No.3, pp.885-898, 2013.
[11] J.H. Xu, An extended one-versus-rest support vector machine for multi-label classification, Neurocomputing, Vol.74, No.17, 2011.
[12] M.L. Zhang and Z.H. Zhou, A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering, Vol.26, No.8, pp.1819-1837, 2014.
[13] S.J. Huang and Z.H. Zhou, Multi-label learning by exploiting label correlations locally, Proc. of the 26th AAAI Conference on Artificial Intelligence, Toronto, Canada, 2012.

[14] S.J. Huang, Y. Yu and Z.H. Zhou, Multi-label hypothesis reuse, Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 2012.
[15] M.L. Zhang and K. Zhang, Multi-label learning by exploiting label dependency, Proc. of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., USA, 2010.
[16] M.L. Zhang, LIFT: Multi-label learning with label-specific features, Proc. of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 2011.
[17] J. Fürnkranz, E. Hüllermeier, et al., Multilabel classification via calibrated label ranking, Machine Learning, Vol.73, No.2, pp.133-153, 2008.
[18] S.W. Ji and J.P. Ye, Linear dimensionality reduction for multi-label classification, Proc. of the 21st International Joint Conference on Artificial Intelligence, California, USA, 2009.
[19] Y. Zhang and Z.H. Zhou, Multilabel dimensionality reduction via dependence maximization, ACM Transactions on Knowledge Discovery from Data, Vol.4, No.3, pp.1-21, 2010.
[20] J. Liu, S.W. Ji and J.P. Ye, Multi-task feature learning via efficient l2,1-norm minimization, Proc. of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, pp.339-348, 2009.
[21] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Holland, 2004.
[22] J. Read, B. Pfahringer, G. Holmes, et al., Classifier chains for multi-label classification, Machine Learning, Vol.85, No.3, pp.333-359, 2011.
[23] G. Tsoumakas, I. Katakis and I. Vlahavas, Random k-labelsets for multilabel classification, IEEE Transactions on Knowledge and Data Engineering, Vol.23, No.7, pp.1079-1089, 2011.
[24] E. Hüllermeier, J. Fürnkranz, et al., Label ranking by learning pairwise preferences, Artificial Intelligence, Vol.172, No.16, pp.1897-1916, 2008.
[25] R. Caruana, Multitask learning, Machine Learning, Vol.28, No.1, pp.41-75, 1997.
[26] A. Argyriou, T. Evgeniou and M. Pontil, Multi-task feature learning, Proc. of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, B.C., Canada, pp.41-48, 2007.
[27] S. Ben-David and R. Schuller, Exploiting task relatedness for multiple task learning, Proc. of the 16th Annual Conference on Learning Theory, Washington, D.C., USA, 2003.
[28] L.S. Qiao, S.C. Chen and X.Y. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognition, Vol.43, No.1, pp.331-341, 2010.
[29] D.L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, Vol.52, No.4, pp.1289-1306, 2006.
[30] J. Wright, A.Y. Yang, et al., Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.31, No.2, pp.210-227, 2009.
[31] Y.H. Guo and W. Xue, Probabilistic multi-label classification with sparse feature learning, Proc. of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 2013.
[32] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
[33] G. Tsoumakas, E.S. Xioufis, J. Vilcek, et al., Mulan: A Java library for multi-label learning, The Journal of Machine Learning Research, Vol.12, pp.2411-2414, 2011.
[34] J. Demšar, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research, Vol.7, pp.1-30, 2006.

HE Zhifen was born in 1988. She is currently pursuing the Ph.D. degree at the School of Mathematical Sciences, Nanjing Normal University.
Her research interests include machine learning and pattern recognition. (Email: hzfnjnu@gmail.com)

YANG Ming (corresponding author) was born in 1964. He received the Ph.D. degree from Southeast University. He is currently a professor at the School of Computer Science and Technology, Nanjing Normal University. His research interests include data mining, pattern recognition and machine learning. (Email: m.yang@njnu.edu.cn)

LIU Huidong was born in 1987. He received the M.S. degree from Nanjing Normal University in 2013. His research interests include machine learning, pattern recognition and their applications to face recognition, image processing, etc.


CONTENT ADAPTIVE SCREEN IMAGE SCALING

CONTENT ADAPTIVE SCREEN IMAGE SCALING CONTENT ADAPTIVE SCREEN IMAGE SCALING Yao Zhai (*), Qifei Wang, Yan Lu, Shipeng Li University of Science and Technology of China, Hefei, Anhui, 37, China Microsoft Research, Beijing, 8, China ABSTRACT

More information

Discriminative sparse model and dictionary learning for object category recognition

Discriminative sparse model and dictionary learning for object category recognition Discriative sparse model and dictionary learning for object category recognition Xiao Deng and Donghui Wang Institute of Artificial Intelligence, Zhejiang University Hangzhou, China, 31007 {yellowxiao,dhwang}@zju.edu.cn

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Int. J. Advance Soft Compu. Appl, Vol. 9, No. 1, March 2017 ISSN 2074-8523 The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Loc Tran 1 and Linh Tran

More information

Diagonal Principal Component Analysis for Face Recognition

Diagonal Principal Component Analysis for Face Recognition Diagonal Principal Component nalysis for Face Recognition Daoqiang Zhang,2, Zhi-Hua Zhou * and Songcan Chen 2 National Laboratory for Novel Software echnology Nanjing University, Nanjing 20093, China 2

More information

HFCT: A Hybrid Fuzzy Clustering Method for Collaborative Tagging

HFCT: A Hybrid Fuzzy Clustering Method for Collaborative Tagging 007 International Conference on Convergence Information Technology HFCT: A Hybrid Fuzzy Clustering Method for Collaborative Tagging Lixin Han,, Guihai Chen Department of Computer Science and Engineering,

More information

Stepwise Nearest Neighbor Discriminant Analysis

Stepwise Nearest Neighbor Discriminant Analysis Stepwise Nearest Neighbor Discriminant Analysis Xipeng Qiu and Lide Wu Media Computing & Web Intelligence Lab Department of Computer Science and Engineering Fudan University, Shanghai, China xpqiu,ldwu@fudan.edu.cn

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Multi-Label Classification with Conditional Tree-structured Bayesian Networks

Multi-Label Classification with Conditional Tree-structured Bayesian Networks Multi-Label Classification with Conditional Tree-structured Bayesian Networks Original work: Batal, I., Hong C., and Hauskrecht, M. An Efficient Probabilistic Framework for Multi-Dimensional Classification.

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

Rough Set Approach to Unsupervised Neural Network based Pattern Classifier

Rough Set Approach to Unsupervised Neural Network based Pattern Classifier Rough Set Approach to Unsupervised Neural based Pattern Classifier Ashwin Kothari, Member IAENG, Avinash Keskar, Shreesha Srinath, and Rakesh Chalsani Abstract Early Convergence, input feature space with

More information

An Efficient Probabilistic Framework for Multi-Dimensional Classification

An Efficient Probabilistic Framework for Multi-Dimensional Classification An Efficient Probabilistic Framework for Multi-Dimensional Classification Iyad Batal Computer Science Dept. University of Pittsburgh iyad@cs.pitt.edu Charmgil Hong Computer Science Dept. University of

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

An R Package flare for High Dimensional Linear Regression and Precision Matrix Estimation

An R Package flare for High Dimensional Linear Regression and Precision Matrix Estimation An R Package flare for High Dimensional Linear Regression and Precision Matrix Estimation Xingguo Li Tuo Zhao Xiaoming Yuan Han Liu Abstract This paper describes an R package named flare, which implements

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

KBSVM: KMeans-based SVM for Business Intelligence

KBSVM: KMeans-based SVM for Business Intelligence Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) December 2004 KBSVM: KMeans-based SVM for Business Intelligence

More information

A New Method For Forecasting Enrolments Combining Time-Variant Fuzzy Logical Relationship Groups And K-Means Clustering

A New Method For Forecasting Enrolments Combining Time-Variant Fuzzy Logical Relationship Groups And K-Means Clustering A New Method For Forecasting Enrolments Combining Time-Variant Fuzzy Logical Relationship Groups And K-Means Clustering Nghiem Van Tinh 1, Vu Viet Vu 1, Tran Thi Ngoc Linh 1 1 Thai Nguyen University of

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

Leave-One-Out Support Vector Machines

Leave-One-Out Support Vector Machines Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm

More information