Adaptive Unsupervised Feature Selection with Structure Regularization


Minnan Luo, Feiping Nie, Xiaojun Chang, Yi Yang, Alexander G. Hauptmann and Qinghua Zheng

Abstract: Feature selection is one of the most important dimension reduction techniques owing to its efficiency and interpretability. Since practical data at large scale are usually collected without labels, and labelling these data is dramatically expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometric structure of the original feature space and produce a faithful feature subset that preserves the intrinsic structure accurately. In this paper, we characterize the intrinsic local structure by an adaptive reconstruction graph and simultaneously consider its multi-connected-component (multi-cluster) structure by imposing a rank constraint on the corresponding Laplacian matrix. To achieve a desirable feature subset, we learn the optimal reconstruction graph and selective matrix simultaneously, instead of using a predetermined graph. We exploit an efficient alternative optimization algorithm to solve the proposed challenging problem, together with theoretical analyses on its convergence and computational complexity. Finally, extensive experiments on the clustering task are conducted over several benchmark datasets to verify the effectiveness and superiority of the proposed unsupervised feature selection algorithm.

Index Terms: Dimension Reduction, Unsupervised Feature Selection, Adaptive Neighbors, Local Linear Embedding, Structure Regularization.

This work was supported in part by the National Science Foundation of China under Grant , Grant , and Grant , in part by the Data to Decisions Cooperative Research Centre, in part by the National Key Research and Development Program of China under Grant 2016YFB , and in part by the China Post-Doctoral Science Foundation under Grant 2015M . Corresponding authors: Feiping Nie, Xiaojun Chang. Minnan Luo and Qinghua Zheng are with the SPKLSTN Lab, Department of Computer Science, Xi'an Jiaotong University, Shaanxi, China ({nluo,qhzheng}@mail.xjtu.edu.cn). Feiping Nie is with the Center for OPTical Imagery Analysis and Learning, Northwestern Polytechnical University, China (feipingnie@gmail.com). Xiaojun Chang and Alexander G. Hauptmann are with the School of Computer Science, Carnegie Mellon University, PA, USA (cxj273@gmail.com, alex@cs.cmu.edu). Yi Yang is with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Australia (yee.i.yang@gmail.com).

I. INTRODUCTION

HIGH-DIMENSIONAL data have been generated commonly in many scientific domains such as visual category recognition, gene expression array analysis and time series prediction. However, it is usually computationally expensive to directly handle high-dimensional data with massive storage cost. In fact, the intrinsic dimensionality [1] of data is often suspected to be much lower for several reasons, such as the irrelevant and correlated features involved in the high-dimensional feature space [1], [2], [3], [4]. Two kinds of approaches, namely feature extraction and feature selection, are typically investigated to learn a low-dimensional compact representation of data [5], [6], [7], [8], [9].
The former reduces the dimension by combining the original features to construct new features, while the latter consists in dropping some irrelevant or weakly relevant features. In this paper, we focus on feature selection because it not only promotes computational efficiency and improves performance significantly, but also keeps the interpretability of the reduced description [10], [11], [12], [13]. From the perspective of utilizing label information, feature selection algorithms are mainly categorized as supervised learning [14], [15], semi-supervised learning [16], [17], and unsupervised learning [18], [19], [15]. The first two types of feature selection depend, to some extent, on label information to guide the feature evaluation by encoding the discriminative information of features in labels. However, large-scale data are usually collected without labels in many practical applications, and annotating these unlabeled data is dramatically expensive and time-consuming [20]. As a result, unsupervised feature selection becomes more ubiquitous and challenging for many real-world applications [21]. In the past decades, unsupervised feature selection has attracted much attention from researchers and significantly facilitates many machine learning tasks such as classification, clustering, retrieval and ranking [22], [23], [24], [25], [26], [27], [28]. Without label information, the fundamental issue in unsupervised feature selection is how to model the manifold geometry structure of the whole feature set and produce a faithful feature subset which preserves the intrinsic structure accurately [29], [30]. Typically, the structure is characterized in the form of a graph, including, but not limited to, the pairwise similarity graph [31], [32], the K-nearest neighbor (KNN) graph [33], local total scatter and between-class scatter matrices [34], and the local symmetries of linear reconstruction in Locally Linear Embedding (LLE) [35], [36], [37]. Note that these graph-based structures are commonly calculated in the original high-dimensional feature space. Once the graph is determined, it is fixed during the procedure of feature selection. As a result, feature selection models rely heavily on the predetermined graph [32]. In particular, the similarity graph based on pairwise distances might not be the optimal one and may fail to capture the underlying geometric structure accurately for the following two reasons. On one hand, the similarity graph is usually constructed through a Gaussian function, which makes the model highly sensitive to parameter tuning [38], [39]. On the other hand, the pairwise distance measured in a high-dimensional feature space might not be qualitatively meaningful due to the curse of dimensionality, i.e., data points in high-dimensional space are approximately equidistant from each other [40], [41], [42].

Instead of estimating pairwise distances, LLE characterizes the local geometric properties by linear coefficients that reconstruct each data point from its neighbors, namely reconstruction weights [35]. However, the reconstruction-weight based graph assumes that all the data points are connected as just one connected component, i.e., the structure information of the graph is ignored completely. In particular, for the clustering task of partitioning the data into c clusters, the desired neighbor assignment is that the graph has exactly c connected components [39]. In this paper, we follow LLE and propose a novel unsupervised feature selection model from a new point of view. Our model characterizes the intrinsic geometric properties of each neighborhood by adaptive reconstruction weights instead of predetermined ones, and simultaneously we impose a rank constraint on the Laplacian matrix of the reconstruction-weight based graph to achieve an ideal neighbor assignment. In such a way, the number of connected components in the graph is exactly the number of clusters, and each connected component corresponds to one cluster. Additionally, we introduce a selective matrix to specify the desired feature subset by equipping the general transformation matrix with an explainable structure. This strategy makes regularization on the general transformation matrix unnecessary and eliminates the difficulties in tuning the trade-off parameter. Extensive experimental results on several benchmark datasets demonstrate the effectiveness and superiority of the proposed method. In summary, the main contributions of this paper are as follows:
1) We characterize the intrinsic geometric structure of each neighborhood through a nonnegative reconstruction-weight graph and impose a rank constraint on its Laplacian matrix to achieve an ideal neighbor assignment.
2) Instead of using a predetermined reconstruction graph for feature selection, we introduce a selective matrix without regularization and learn the optimal feature subset together with the adaptive reconstruction graph simultaneously.
3) An efficient alternative optimization algorithm is exploited to solve the proposed challenging problem, with theoretical analyses on its convergence and computational complexity.

The remainder of this paper is organized as follows. We briefly review related work on unsupervised feature selection in Section II. In Section III, we first introduce a selective matrix to specify the selected feature subset and then propose a novel model that learns the optimal feature subset with structure regularization of an adaptive reconstruction weight matrix. We exploit an efficient alternating algorithm in Section IV to solve the proposed challenging problem, followed by theoretical analyses on its convergence and computational complexity in Section V. In Section VI, we conduct extensive experiments over several benchmark data sets to illustrate the effectiveness and superiority of the proposed method. Section VII concludes this work.

Notations and Definitions: We employ the usual notations throughout this paper. Reals are written as lowercase letters; vectors are denoted by boldface lowercase letters while matrices are presented as uppercase letters. 0 and 1 are column vectors whose elements are all zero and one, respectively.
For any d-dimensional vector $z = [z_1, z_2, \dots, z_d]^\top \in \mathbb{R}^d$, its $\ell_p$-norm is defined as $\|z\|_p = \big(\sum_{i=1}^{d} |z_i|^p\big)^{1/p}$. For any matrix $A = (a_{ij}) \in \mathbb{R}^{p \times q}$, we denote its $i$-th row and $j$-th column by $a^i$ and $a_j$, respectively. The Frobenius norm of $A$ is defined as $\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2} = \sqrt{\operatorname{Tr}(A A^\top)}$, where $\operatorname{Tr}(\cdot)$ refers to the trace of a matrix. $D = \operatorname{diag}(d_1, d_2, \dots, d_p) \in \mathbb{R}^{p \times p}$ is a diagonal matrix with $(i,i)$-th diagonal element $d_i$ ($\forall i$). We suppose dataset $X = \{x_i : i = 1, 2, \dots, n\}$ consists of $n$ data points, where each data point $x_i$ is represented by a $d$-dimensional feature vector. For each data point $x_i$ ($i = 1, 2, \dots, n$), we denote its K-nearest-neighbor set by $N_K(x_i) = \{x_{i(k)} : k = 1, 2, \dots, K\}$. Let $F = \{f_1, f_2, \dots, f_d\}$ be the set of $d$ features of dataset $X$. We assume the selected feature subset of size $m$ ($m \ll d$) is denoted by $F_I = \{f_{I(1)}, f_{I(2)}, \dots, f_{I(m)}\} \subseteq F$, where $I$ is a permutation of $1, 2, \dots, d$.

II. RELATED WORKS

A. Unsupervised Feature Selection

Unsupervised feature selection aims to find a relevant subset of features that contains the most discriminative information and preserves the intrinsic geometric structure as much as possible [33], [43]. In the past decades, studies on unsupervised feature selection have mainly fallen into three groups: filter, wrapper and embedded approaches [44], [18]. Algorithms based on the filter model separate the evaluation of features from the learning tasks, i.e., they select the optimal features by only examining intrinsic properties of the data. For example, under the assumption that two data points belong to the same class if they are close to each other, He et al. [33] defined a Laplacian Score to describe the locality preserving power of a feature. However, this method neglects the possible correlation between different features [43]. Zhao et al. [15] exploited a unified framework for supervised and unsupervised feature selection based on spectral graph theory. Liu et al. [45] built on information measurement and performed the selection procedure as feature clustering in a hierarchically agglomerative way. Yang et al. [34] incorporated discriminative analysis and $\ell_{2,1}$-norm minimization into a joint framework for unsupervised feature selection; Li et al. [46] further performed spectral clustering with a nonnegative constraint for more accurate cluster labels. Considering the unreasonable constraint on the feature selection transformation matrix in [34], [46], Qian et al. [47] proposed a pseudo-cluster-label based feature selection with local-learning regularized robust non-negative matrix factorization. Wang et al. [48] embedded feature selection into a clustering algorithm through sparse learning without the transformation. However, these methods might fail to capture the intrinsic structure accurately since they depend on a predetermined similarity matrix computed in the original feature space with Gaussian functions.

Additionally, Masaeli et al. [49] converted transformation-based methods, including linear discriminant analysis (LDA) and the Hilbert-Schmidt Independence Criterion (HSIC), into two new feature selection algorithms through $\ell_1/\ell_\infty$ regularization. Tabakhi et al. [50] proposed a filter-based multivariate method which uses ant colony optimization to compute the similarity between features. Note that the filter model is relatively computationally efficient but might fail to select the most informative features for a particular learning task [19]. In contrast, approaches based on the wrapper model require a predetermined learning algorithm and wrap the procedure of feature selection around it to evaluate relevant features. For example, Guyon et al. [51] refined the optimal feature subsets by combinatorial search using SVMs in a wrapper approach; Maldonado et al. [52] built on SVMs with kernel functions and removed redundant features according to the number of errors on a validation subset; Dy et al. [21] exploited an Expectation-Maximization (EM) clustering algorithm and evaluated the candidate feature subsets through two different performance criteria, i.e., scatter separability and maximum likelihood. Wrapper-based methods perform better than filter models [53]. However, as argued in [19], [46], wrapper models are usually computationally more expensive and prone to over-fitting. Taking the advantages of both filter and wrapper based methods, embedded methods incorporate feature selection into a joint framework of model construction [48], [19]. For example, Maugis et al. [54] took the relation between the irrelevant variables and the clustering variables into consideration, and regarded feature selection as a model-based clustering problem. Cai et al. [43] developed a multi-cluster feature selection algorithm based on manifold learning and $\ell_1$-norm regularization. However, this method evaluates features individually and cannot handle redundant features [55]. Weston et al. [56] formulated feature selection and pattern classification objectives in a single optimization by minimizing an $\ell_0$-norm regularizer. Zhao et al. [55] proposed a spectral feature selection method that identifies feature redundancy based on a sparse multi-output regression with an $\ell_{2,1}$-norm constraint. However, this method might deteriorate the performance since it conducts spectral regression after manifold learning. For this issue, Zhu et al. [57] performed unsupervised spectral feature selection with manifold learning and regression simultaneously, and further developed a joint graph sparse coding model for robust unsupervised feature selection [19]. However, this method employed a predetermined radial basis kernel function to measure the similarity between two samples, which might fail to characterize the intrinsic local structure of data [35]. Zeng et al. [58] incorporated the updating of the graph Laplacian into the built-in regularization of a learning-based clustering algorithm to improve the quality of structure learning. Instead of using all input features to characterize the underlying structure of data, Du et al. [32] perform feature selection iteratively with a refined structure of data. However, the computational cost of this method is daunting for large-scale data since it involves the pairwise similarity matrix as well as the reconstruction of each data point from all the other data points.
B. Locally Linear Embedding (LLE)

The primary focus of LLE is to recover the global nonlinear structure of high-dimensional data from linear fits and to learn a neighborhood-preserving embedding by exploiting the local symmetries of linear reconstructions [35]. Specifically, LLE characterizes the local geometry of each neighborhood by linear coefficients $v^{(i)} = [v_{i(1)}, v_{i(2)}, \dots, v_{i(K)}]^\top \in \mathbb{R}^K$ that reconstruct the $i$-th data point from its neighbors. In other words, the coefficient $v_{i(k)}$ summarizes the contribution of neighbor $x_{i(k)}$ to the reconstruction of data point $x_i$. The weights $v_{i(k)}$ are estimated by minimizing the reconstruction errors over all data points, i.e.,

$$\min_{\sum_{k=1}^{K} v_{i(k)} = 1\,(\forall i)} \; T(V) = \sum_{i=1}^{n} \Big\| x_i - \sum_{k=1}^{K} v_{i(k)} x_{i(k)} \Big\|_2^2, \qquad (1)$$

where the sparse weight matrix $V = [v_1, v_2, \dots, v_n] \in \mathbb{R}^{n \times n}$, with $v_i = [v_{i1}, v_{i2}, \dots, v_{in}]^\top \in \mathbb{R}^n$, collects all contribution coefficients such that $v_{ij} = v_{i(k)}$ if $j = i(k)$ and $x_{i(k)} \in N_K(x_i)$, and $v_{ij} = 0$ otherwise. It has been pointed out in [35] that the weight matrix $V$ obeys an important symmetry, and thus the reconstruction weights characterize the intrinsic geometric properties of each neighborhood effectively. By symmetry, we define an undirected graph $G = (X, V)$, where dataset $X$ denotes the set of vertices and $V$ collects all edge weights, i.e., an edge that connects vertices $x_i$ and $x_j$ is represented by the contribution coefficient $v_{ij}$ ($i, j = 1, 2, \dots, n$). LLE assumes that the same weights that reconstruct the data points in the high-dimensional space $\mathbb{R}^d$ should also reconstruct their embedded manifold coordinates in the low-dimensional space $\mathbb{R}^m$ ($m \ll d$). As a result, it determines the corresponding $m$-dimensional vector $y_i$ for data point $x_i \in \mathbb{R}^d$ by solving the following optimization problem:

$$\min_{y_i\,(\forall i)} \; \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} v_{ij} y_j \Big\|_2^2. \qquad (2)$$

However, this strategy estimates the input-to-manifold mapping solely from the training data points. As a result, it fails to project new data points outside the training dataset [36]. For this issue, some studies [36], [32] introduce a transformation matrix $W \in \mathbb{R}^{d \times m}$ to project a $d$-dimensional data point to an $m$-dimensional one through $y_i = W^\top x_i$, where $W$ is estimated by solving the following optimization problem with respect to the fixed weight matrix $V$:

$$\min_{W} \; \sum_{i=1}^{n} \Big\| W^\top x_i - \sum_{j=1}^{n} v_{ij} W^\top x_j \Big\|_2^2. \qquad (3)$$

However, these approaches employ a general transformation matrix to reduce the feature dimension by extracting new coordinates (features). For the task of unsupervised feature selection, we instead aim to select a discriminative subset of the original features from a larger feature pool for better interpretability. Additionally, it is noteworthy that previous approaches based on LLE usually ignore the structure of the graph $G$; in particular, for the clustering task of partitioning the data into $c$ clusters, the desired neighbor assignment is that the graph $G$ has exactly $c$ connected components [39].
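The reconstruction weights in Eq. (1) decouple over data points and can be obtained in closed form from a local Gram system, as in standard LLE. The following is a minimal sketch of that computation for a single point; it assumes Euclidean K-nearest neighbors, adds a small ridge term for numerical stability, and uses illustrative names rather than the authors' code.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights of one point from its K nearest neighbors (Eq. (1)).

    x         : (d,)   data point
    neighbors : (d, K) columns are the K nearest neighbors of x
    reg       : small ridge term for numerical stability when K > d
    Solves  min_v ||x - neighbors @ v||^2  s.t.  v.sum() == 1  via the local
    Gram matrix, as in standard LLE.
    """
    K = neighbors.shape[1]
    diff = neighbors - x[:, None]              # shift the neighborhood to the origin
    G = diff.T @ diff                          # local Gram matrix (K x K)
    G = G + reg * np.trace(G) * np.eye(K) / K  # regularize (common LLE practice)
    v = np.linalg.solve(G, np.ones(K))         # solve G v = 1
    return v / v.sum()                         # enforce the sum-to-one constraint
```

Calling this for every point (e.g., `v = lle_weights(X[:, i], X[:, nbr_idx[i]])`) assembles the sparse weight matrix V used throughout the rest of the paper.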

In other words, the multi-connected-component information is completely ignored by previous LLE-based unsupervised feature selection algorithms.

III. PROPOSED METHODOLOGY

In this section, we develop a novel model for unsupervised feature selection with an adaptive reconstruction-weight graph. This method replaces the transformation matrix with a selective matrix and incorporates the adaptive reconstruction graph into the procedure of feature selection.

A. Selective Matrix

To specify the selected features in the procedure of feature selection, we equip the conventional transformation matrix with an explainable structure, namely the selective matrix. Formally, for feature subset $F_I$, we define the corresponding selective matrix as $W_I = [w_{I(1)}, w_{I(2)}, \dots, w_{I(m)}] \in \{0, 1\}^{d \times m}$, where $w_{I(i)} \in \mathbb{R}^d$ ($i = 1, 2, \dots, m$) is a column vector whose components are all 0 except for a 1 in the $I(i)$-th position. We observe that $W_I$ is indeed a column-full-rank transformation matrix. With the selective matrix $W_I$, the procedure of original feature selection is expressed as transforming a $d$-dimensional data point $x$ into an $m$-dimensional data point $y$ through

$$y = W_I^\top x, \qquad (4)$$

where $m \ll d$. Thanks to its particular structure, the selective matrix $W_I$ offers three significant advantages. First, it makes the feature selection algorithm faster due to its extreme sparsity [59]. Second, the feature selection results based on $W_I$ are much more explainable than those of a general transformation matrix. Last but not least, the particular structure makes norm-based regularization on the transformation matrix unnecessary, which explicitly avoids the problem of trade-off parameter tuning and its expensive computation. Based on the selective matrix $W_I \in \mathbb{R}^{d \times m}$, we formulate the reconstruction error of the original feature subset $F_I$ as

$$L(W_I, V) = \sum_{i=1}^{n} \Big\| W_I^\top x_i - \sum_{k=1}^{K} v_{i(k)} W_I^\top x_{i(k)} \Big\|_2^2 \qquad (5)$$
$$= \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2, \qquad (6)$$

where $X_i = [x_{i(1)}, x_{i(2)}, \dots, x_{i(K)}] \in \mathbb{R}^{d \times K}$ collects the $K$ nearest neighbors of data point $x_i$, and the coefficient vector $v^{(i)} = [v_{i(1)}, v_{i(2)}, \dots, v_{i(K)}]^\top \in \mathbb{R}^K$ represents the contribution weights of the $K$ nearest neighbors of data point $x_i$.

B. Structure Regularization of the Reconstruction Graph

For a desirable neighbor assignment (mentioned in the previous section), the contribution weights $V$ should be constrained such that the graph $G = (X, V)$ has exactly $c$ connected components [39]. In fact, the following important property describes the connected-component information of a graph with nonnegative weights.

Theorem 1 ([60]). A graph $G = (X, V)$ with $v_{ij} \ge 0$ ($\forall i, j$) has $c$ connected components if and only if the algebraic multiplicity of eigenvalue 0 of the corresponding Laplacian matrix $L_V$ is $c$, where the Laplacian matrix is defined as $L_V = D_V - (V + V^\top)/2 \in \mathbb{R}^{n \times n}$, and $D_V \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose $(i,i)$-th diagonal element is $\sum_{j=1}^{n} (v_{ij} + v_{ji})/2$ for $i = 1, 2, \dots, n$.

According to Theorem 1, if there are $c$ connected subgraphs (clusters) in graph $G = (X, V)$, the equation $\operatorname{rank}(L_V) = n - c$ must be satisfied by the corresponding Laplacian matrix $L_V$. Motivated by Theorem 1, we impose the constraint $\operatorname{rank}(L_V) = n - c$ on the Laplacian matrix of graph $G$, and formulate the idea of adaptive unsupervised feature selection with structure regularization as the following optimization problem:

$$\min_{W_I, V} \; L(W_I, V) = \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 \qquad (7)$$
$$\text{s.t.} \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i), \quad \operatorname{rank}(L_V) = n - c,$$

where $\mathbf{0}$ and $\mathbf{1}$ denote K-dimensional vectors whose entries are all 0 and 1, respectively.
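As a small numerical illustration of Theorem 1 (not part of the original paper), the sketch below builds $L_V = D_V - (V + V^\top)/2$ for a toy nonnegative weight matrix with two obvious connected components and checks that the multiplicity of the zero eigenvalue equals the number of components.

```python
import numpy as np

def laplacian(V):
    """L_V = D_V - (V + V^T)/2 for a nonnegative weight matrix V (Theorem 1)."""
    S = (V + V.T) / 2.0
    return np.diag(S.sum(axis=1)) - S

# Toy graph with two obvious connected components: {0, 1} and {2, 3}.
V = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
L = laplacian(V)
eigvals = np.linalg.eigvalsh(L)
c = int(np.sum(eigvals < 1e-10))    # multiplicity of eigenvalue 0
print(c, np.linalg.matrix_rank(L))  # prints 2 and 2, i.e. rank(L_V) = n - c
```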
This formulation characterizes the intrinsic structure of unsupervised data through geometric reconstruction, similar to LLE, instead of a pairwise similarity matrix based on Gaussian functions. Beyond LLE, we impose a nonnegativity constraint on the contribution coefficients, which is more advantageous since it further ensures the sparsity of the coefficients [61]. Importantly, we constrain the structure of the reconstruction graph such that the number of connected components in the graph is exactly the number of clusters. Due to the difficulty of handling the rank constraint on the structure of graph $G$, we follow the strategy used in [39] and reformulate optimization problem (7) as an easier one. Let $\sigma_i(L_V) \ge 0$ be the $i$-th smallest eigenvalue of the Laplacian matrix $L_V$ ($i = 1, 2, \dots, n$). Then the constraint $\operatorname{rank}(L_V) = n - c$ is satisfied if the smallest $c$ eigenvalues of $L_V$ are zero. As a result, optimization problem (7) is equivalent to the following problem for a large enough value of $\lambda$:

$$\min_{W_I, V} \; \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \lambda \sum_{i=1}^{c} \sigma_i(L_V) \qquad (8)$$
$$\text{s.t.} \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i).$$

Furthermore, we assign each data point $x_i$ in the graph $G$ a vector $q_i \in \mathbb{R}^c$ ($i = 1, 2, \dots, n$), and collect all the vectors into a matrix $Q = [q_1, q_2, \dots, q_n]^\top \in \mathbb{R}^{n \times c}$. According to Ky Fan's theorem [62], we have

$$\sum_{i=1}^{c} \sigma_i(L_V) = \min_{Q^\top Q = I_c} \operatorname{Tr}(Q^\top L_V Q), \qquad (9)$$

where $I_c$ is the $c \times c$ identity matrix. As a result, we introduce the variable $Q$ into optimization problem (8) and further reformulate it as the following problem:

$$\min_{W_I, Q, V} \; \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \lambda \operatorname{Tr}(Q^\top L_V Q) \qquad (10)$$
$$\text{s.t.} \quad Q^\top Q = I_c; \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i).$$
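The Ky Fan identity in Eq. (9) is easy to check numerically: for a symmetric Laplacian, the minimum of $\operatorname{Tr}(Q^\top L_V Q)$ over orthonormal $Q \in \mathbb{R}^{n \times c}$ is attained by the eigenvectors of the $c$ smallest eigenvalues. Below is a quick sanity check on a random toy graph (the data and names are assumptions, used only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 3
V = rng.random((n, n))                # toy nonnegative weights (assumed data)
S = (V + V.T) / 2.0
L = np.diag(S.sum(axis=1)) - S        # Laplacian as in Theorem 1

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
Q = eigvecs[:, :c]                    # eigenvectors of the c smallest eigenvalues
lhs = eigvals[:c].sum()               # sum of the c smallest eigenvalues
rhs = np.trace(Q.T @ L @ Q)           # Tr(Q^T L_V Q) at this minimizer
print(np.isclose(lhs, rhs))           # True, illustrating Eq. (9)
```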

In summary, for the clustering task with unsupervised feature selection, optimization problems (7) and (10) are significant in two respects. On one hand, they explicitly take the structure of graph $G$ into consideration and potentially learn the optimal local geometric structure by adaptively assigning neighbors to each data point. On the other hand, instead of using a predetermined weight matrix $V$ to characterize the intrinsic manifold structure of the whole feature space, the proposed model attempts to learn the contribution weights and the most informative feature subset simultaneously.

IV. OPTIMIZATION PROCEDURE

In this section, we exploit an efficient alternative optimization algorithm to address the proposed challenging problem.

A. Optimize Q

With variables $V$ and $W_I$ fixed, optimization problem (10) becomes the following problem with respect to variable $Q$:

$$\min_{Q^\top Q = I_c} \operatorname{Tr}(Q^\top L_V Q). \qquad (11)$$

Subject to the constraint $Q^\top Q = I_c$, we arrive at the following eigenvalue problem via Lagrangian optimization:

$$L_V Q = Q \Lambda, \qquad (12)$$

where $\Lambda$ is the diagonal Lagrange multiplier matrix [36]. As a result, the optimal solution $Q$ of problem (11) is formed by the $c$ eigenvectors of the current Laplacian matrix $L_V$ corresponding to its $c$ smallest eigenvalues.

B. Optimize $W_I$

With variables $V$ and $Q$ fixed, optimization problem (10) turns into seeking the feature subset $F_I \subseteq F$ by solving the following problem with respect to the variable $W_I \in \{0, 1\}^{d \times m}$:

$$\min_{W_I} \; L(W_I) = \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2. \qquad (13)$$

Due to the constraint $v^{(i)\top} \mathbf{1} = 1$ ($\forall i$), the reconstruction error in terms of feature subset $F_I$ can be rearranged as

$$L(W_I) = \sum_{i=1}^{n} \Big\| \sum_{k=1}^{K} v_{i(k)} \big( W_I^\top x_i - W_I^\top x_{i(k)} \big) \Big\|_2^2 = \operatorname{Tr}\big( W_I^\top X_V W_I \big) \qquad (14)$$
$$= \sum_{k=1}^{m} w_{I(k)}^\top X_V w_{I(k)}, \qquad (15)$$

where $X_V = X (I_n - V)^\top (I_n - V) X^\top \in \mathbb{R}^{d \times d}$. Considering the specific structure of the selective matrix $W_I$, for each feature $f_i \in F$ we define its feature-level reconstruction score based on the current graph $G = (X, V)$ by

$$\operatorname{score}(f_i) = w_i^\top X_V w_i, \qquad (16)$$

where $w_i \in \{0, 1\}^d$ ($i = 1, 2, \dots, d$) is an indicator vector whose components are all 0 except for a 1 in the $i$-th position. Based on this definition, the task of feature selection in optimization problem (13) is to seek the feature subset $F_I^* = \{f_{I^*(1)}, f_{I^*(2)}, \dots, f_{I^*(m)}\} \subseteq F$ with the minimum total score. Recalling the definition of vector $w_i$, the reconstruction score of feature $f_i$ equals the $(i,i)$-th diagonal element of matrix $X_V$, denoted by $(X_V)_{i,i}$ ($i = 1, 2, \dots, d$). As a result, the globally optimal feature subset $F_I^*$ is given by

$$F_I^* = \arg\min_{F_I \subseteq F, |F_I| = m} \sum_{f_j \in F_I} \operatorname{score}(f_j) \qquad (17)$$
$$= \arg\min_{F_I \subseteq F, |F_I| = m} \sum_{j=1}^{m} (X_V)_{I(j), I(j)}. \qquad (18)$$

This implies that the global optimum $W_I^*$ of problem (13) can be obtained efficiently by locating the $m$ smallest diagonal elements of matrix $X_V$.
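A minimal sketch of this $W_I$ update follows. It computes the residuals $x_i - X_i v^{(i)}$ directly, so that each per-feature score equals the corresponding diagonal entry of $X_V$, and returns the indices of the $m$ smallest scores. The array layout (columns as data points) and variable names are assumptions, not the authors' implementation.

```python
import numpy as np

def select_features(X, nbr_idx, V_local, m):
    """Update of W_I: pick the m features with the smallest reconstruction scores.

    X        : (d, n) data matrix, columns are data points
    nbr_idx  : (n, K) indices of the K nearest neighbors of each point
    V_local  : (n, K) nonnegative reconstruction weights v^{(i)} (rows sum to 1)
    m        : number of features to keep
    Returns the indices I of the selected features (columns of W_I).
    """
    # Residuals r_i = x_i - X_i v^{(i)}; then X_V = R R^T and score(f_j) = (X_V)_{jj}.
    R = np.stack([X[:, i] - X[:, nbr_idx[i]] @ V_local[i]
                  for i in range(X.shape[1])], axis=1)   # (d, n) residual matrix
    scores = np.einsum('ij,ij->i', R, R)                 # diagonal of R R^T
    return np.argsort(scores)[:m]                        # m smallest scores
```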
C. Optimize V

When variables $W_I$ and $Q$ are fixed, optimization problem (10) turns into the following problem with respect to the sparse weight matrix $V$:

$$\min_{V} \; \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \lambda \operatorname{Tr}(Q^\top L_V Q) \qquad (19)$$
$$\text{s.t.} \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i).$$

To solve this problem, we first rearrange the second term of objective function (19) as

$$\operatorname{Tr}(Q^\top L_V Q) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \| q_i - q_j \|_2^2 \, v_{ij} = \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K} \| q_i - q_{i(k)} \|_2^2 \, v_{i(k)}.$$

Due to the independence between different data points, we solve optimization problem (19) with respect to each variable $v^{(i)}$ ($i = 1, 2, \dots, n$) individually via the following problem:

$$\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \; \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \frac{\lambda}{2} \sum_{k=1}^{K} \| q_i - q_{i(k)} \|_2^2 \, v_{i(k)}.$$

Let the objective function of this optimization problem be $g(v^{(i)})$, and define $p_i = [p_{i(1)}, p_{i(2)}, \dots, p_{i(K)}]^\top \in \mathbb{R}^K$ with $p_{i(k)} = \| q_i - q_{i(k)} \|_2^2$ for $k = 1, 2, \dots, K$. Then $\sum_{k=1}^{K} \| q_i - q_{i(k)} \|_2^2 \, v_{i(k)} = v^{(i)\top} p_i$ holds for $i = 1, 2, \dots, n$. Denoting the terms of $g(v^{(i)})$ that are independent of $v^{(i)}$ by $H$, we rearrange the objective function $g(v^{(i)})$ as follows:

$$g(v^{(i)}) = \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \frac{\lambda}{2} v^{(i)\top} p_i$$
$$= v^{(i)\top} X_i^\top W_I W_I^\top X_i v^{(i)} - 2 v^{(i)\top} X_i^\top W_I W_I^\top x_i + \frac{\lambda}{2} v^{(i)\top} p_i + H$$
$$= v^{(i)\top} A v^{(i)} - \lambda v^{(i)\top} b + H,$$

where $A = X_i^\top W_I W_I^\top X_i \in \mathbb{R}^{K \times K}$ and $b = \frac{2}{\lambda} X_i^\top W_I W_I^\top x_i - \frac{1}{2} p_i \in \mathbb{R}^K$.
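For concreteness, here is a small sketch assembling $p_i$, $A$ and $b$ for one data point under the definitions above; the variable names and shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def subproblem_data(x_i, X_i, q_i, Q_i, W_I, lam):
    """Quantities of the per-point problem for one data point.

    x_i : (d,)   the data point          X_i : (d, K) its K nearest neighbors
    q_i : (c,)   embedding of x_i        Q_i : (K, c) embeddings of the neighbors
    W_I : (d, m) selective matrix        lam : regularization parameter lambda
    Returns A (K x K) and b (K,) such that  phi_i(v) = v^T A v - lam * v^T b.
    """
    p_i = np.sum((q_i[None, :] - Q_i) ** 2, axis=1)      # p_{i(k)} = ||q_i - q_{i(k)}||^2
    Z = W_I.T @ X_i                                      # (m, K) projected neighbors
    A = Z.T @ Z                                          # X_i^T W_I W_I^T X_i
    b = (2.0 / lam) * (Z.T @ (W_I.T @ x_i)) - 0.5 * p_i  # as defined in the text
    return A, b
```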

As a result, optimization problem (19) reduces to the following $n$ sub-problems of a simple form:

$$\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \; \varphi_i(v^{(i)}) = v^{(i)\top} A v^{(i)} - \lambda v^{(i)\top} b, \qquad (20)$$

for $i = 1, 2, \dots, n$. To solve this optimization problem, we exploit an Accelerated Projected Gradient (APG) algorithm for its simplicity and efficiency [63]. Based on the APG method, we introduce an auxiliary variable $z$ to convert optimization problem (20) into an easier one, while making the auxiliary variable approximate and converge to the solution $v^{(i)}$ during the iterative optimization. Specifically, in the $t$-th iteration, let the second-order Taylor expansion of the objective function $\varphi_i(\cdot)$ around an auxiliary variable $z^t$ be

$$\varphi_i^L(v^{(i)}) = \varphi_i(z^t) + \big\langle \nabla \varphi_i(z^t), v^{(i)} - z^t \big\rangle + \frac{L}{2} \big\| v^{(i)} - z^t \big\|_2^2,$$

where $L > 0$ is a constant. We update $v^{(i)}$ by solving the easier optimization problem

$$\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \varphi_i^L(v^{(i)}), \qquad (21)$$

which is equivalent to a Euclidean projection problem onto the simplex, i.e.,

$$v^{(i)}_{t+1} = \arg\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \big\| v^{(i)} - h_i^t \big\|_2^2, \qquad (22)$$

where $h_i^t = z^t - \frac{1}{L} \nabla \varphi_i(z^t) = z^t - \frac{1}{L} \big( 2 A z^t - \lambda b \big) \in \mathbb{R}^K$. Note that several efficient algorithms for the Euclidean projection onto the simplex have been developed in the past decades; interested readers may refer to [64], [65], [66], [67] for more details. In this paper, we solve this problem with the fast algorithm proposed in [64]. To make the auxiliary variable approximate $v^{(i)}$, we update $z^t$ according to

$$z^{t+1} = v^{(i)}_{t+1} + \frac{\tau_t - 1}{\tau_{t+1}} \big( v^{(i)}_{t+1} - v^{(i)}_t \big), \qquad (23)$$

where the acceleration coefficient $\tau$ is updated through

$$\tau_{t+1} = \frac{1 + \sqrt{1 + 4\tau_t^2}}{2}. \qquad (24)$$

We summarize the APG algorithm for optimization problem (20) in Algorithm 1, where the convergence criterion is that the relative change of $\| v^{(i)} \|_2$ is small enough. It has been pointed out in [63] that the APG algorithm converges both theoretically and experimentally. Finally, we present the overall alternative optimization algorithm for the proposed unsupervised feature selection in Algorithm 2, where the initial weight matrix $V^0$ is calculated by solving the LLE problem (1).

Algorithm 1: APG algorithm for optimization problem (20).
Input: $X$, $W_I$, $Q$. Output: $v^{(i)}$ ($i = 1, 2, \dots, n$). Initialize: $v^{(i)}_0 = z^0$, $\tau_0 = 1$.
1: while not converged do
2:   Update $v^{(i)}_{t+1}$ ($\forall i$) by solving problem (22);
3:   Update $\tau_{t+1} = \frac{1}{2}\big(1 + \sqrt{1 + 4\tau_t^2}\big)$;
4:   Update $z^{t+1} = v^{(i)}_{t+1} + \frac{\tau_t - 1}{\tau_{t+1}} \big( v^{(i)}_{t+1} - v^{(i)}_t \big)$;
5: end while

Algorithm 2: Unsupervised feature selection with adaptive neighbors (ANFS).
Input: $X$, $m$, $c$, $K$, $\lambda$. Output: $F_I$. Initialize: $t = 0$; $V^0$.
1: while not converged do
2:   Update $Q^{t+1}$ as the $c$ eigenvectors of the current Laplacian matrix $L_{V^t}$ corresponding to its $c$ smallest eigenvalues;
3:   Update $W_I^{t+1}$ by locating the $m$ smallest diagonal elements of matrix $X_{V^t} = X (I_n - V^t)^\top (I_n - V^t) X^\top$;
4:   Update $V^{t+1}$ by the APG algorithm (Algorithm 1);
5:   $t = t + 1$;
6: end while
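The following is a minimal sketch of Algorithm 1 for one sub-problem (20), reusing the quantities A and b from the previous sketch and using a standard sort-based Euclidean projection onto the probability simplex in the spirit of [64], [65]. The step size corresponds to $1/L$ and must be supplied by the caller; this is an illustrative implementation under these assumptions, not the authors' code.

```python
import numpy as np

def project_simplex(h):
    """Euclidean projection of h onto {v : v >= 0, sum(v) = 1} (sort-based)."""
    u = np.sort(h)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(h) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(h - theta, 0.0)

def apg_update_v(A, b, lam, step, n_iter=100, tol=1e-6):
    """APG (Algorithm 1) for one sub-problem: min_v  v^T A v - lam * v^T b  on the simplex."""
    K = A.shape[0]
    v = np.full(K, 1.0 / K)                   # feasible starting point
    z, tau = v.copy(), 1.0
    for _ in range(n_iter):
        grad = 2.0 * A @ z - lam * b                             # gradient of phi_i at z
        v_new = project_simplex(z - step * grad)                 # Eq. (22), step = 1/L
        tau_new = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0    # Eq. (24)
        z = v_new + (tau - 1.0) / tau_new * (v_new - v)          # Eq. (23)
        if np.linalg.norm(v_new - v) < tol * max(np.linalg.norm(v), 1e-12):
            v = v_new
            break
        v, tau = v_new, tau_new
    return v
```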
V. THEORETICAL ANALYSIS ON CONVERGENCE AND COMPLEXITY

To illustrate the efficiency of the proposed Algorithm 2, in this section we theoretically analyze its convergence and computational complexity.

A. Convergence Analysis

The convergence of Algorithm 2 is established by the following theorem.

Theorem 2. The alternate updating rules in Algorithm 2 monotonically decrease the objective function value of optimization problem (10) in each iteration until convergence.

Proof. In each iteration, we obtain the globally optimal selective matrix $W_I^{t+1}$ by solving $W_I^{t+1} = \arg\min_{W_I} \sum_{i=1}^{n} \| W_I^\top x_i - W_I^\top X_i v^{(i)}_t \|_2^2$, where $v^{(i)}_t$ ($i = 1, 2, \dots, n$) are the current weight vectors. As a result, we have the inequality

$$\sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_t \big\|_2^2 \le \sum_{i=1}^{n} \big\| (W_I^{t})^\top x_i - (W_I^{t})^\top X_i v^{(i)}_t \big\|_2^2. \qquad (25)$$

Since the variable $Q^{t+1}$ is updated by solving $Q^{t+1} = \arg\min_{Q^\top Q = I_c} \operatorname{Tr}(Q^\top L_{V^t} Q)$ with $V^t$ fixed, we obtain the inequality

$$\operatorname{Tr}\big( (Q^{t+1})^\top L_{V^t} Q^{t+1} \big) \le \operatorname{Tr}\big( (Q^{t})^\top L_{V^t} Q^{t} \big). \qquad (26)$$

The update of the weight matrix $V^{t+1} \in \mathbb{R}^{n \times n}$ can be divided into $n$ independent sub-problems with respect to the $K$-dimensional vectors $v^{(i)}$ ($i = 1, 2, \dots, n$), each of which is solved by the APG algorithm (Algorithm 1); the APG algorithm converges quickly [68], [63]. Consequently, we have the inequality

$$\sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t+1} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t+1}} Q^{t+1} \big)$$
$$\le \sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t}} Q^{t+1} \big). \qquad (27)$$

Based on inequalities (25), (26) and (27), we arrive at

$$\sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t+1} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t+1}} Q^{t+1} \big)$$
$$\le \sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t}} Q^{t+1} \big)$$
$$\le \sum_{i=1}^{n} \big\| (W_I^{t})^\top x_i - (W_I^{t})^\top X_i v^{(i)}_{t} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t})^\top L_{V^{t}} Q^{t} \big).$$

As a result, the objective function value of optimization problem (10) decreases monotonically in each iteration until the algorithm converges. This completes the proof.

B. Complexity Analysis

It is noteworthy that the proposed Algorithm 2 consists of three updating rules, each of which is computationally efficient. For the update of variable $Q$, the main cost lies in calculating the $c$ eigenvectors of the Laplacian matrix $L_V \in \mathbb{R}^{n \times n}$, with complexity $O(n^2 c)$. In fact, due to the $K$-sparsity of $L_V$, i.e., each column of $L_V$ has only $K \ll n$ nonzero elements, the complexity reduces to $O(Kcn)$ in practice. For the update of variable $W_I$, we exploit the $K$-sparsity of the weight matrix $V$ and calculate the diagonal elements of matrix $X_V \in \mathbb{R}^{d \times d}$ with complexity $O(Kdn)$; moreover, to obtain the optimal feature subset of size $m$, we seek the $m$ smallest diagonal elements with complexity $O(m \log d)$. As a result, the computational complexity of updating $W_I$ is $O(Kdn + m \log d) \approx O(Kdn)$, since $\log d \ll d$ and $m \ll n$ hold in practice. To update the $K$-sparse weight vectors $v^{(i)}$ with the APG algorithm (Algorithm 1), the main cost comes from the calculation of $A$ and $b$ with complexity $O(Kdm)$, as well as from solving a Euclidean projection problem onto the simplex with complexity $O(K \log K)$ [64]. Thus, the computational complexity of updating $V$ is $O((Kdm + K \log K) n t) \approx O(Kdmnt)$, since $K \log K \ll Kdm$ holds in practice, where $t$ denotes the number of iterations of the APG algorithm. Therefore, the complexity of each iteration of Algorithm 2 is $O(Kcn + Kdn + Kdmnt) \approx O(Kdmnt)$, since $c \ll d$ in practice. Considering the fast convergence of the APG algorithm, the proposed alternative Algorithm 2 performs efficiently. Note that the computational cost of the proposed algorithm is linear in the number of data points $n$, and thus our algorithm is capable of handling large-scale data.

VI. EXPERIMENTAL ANALYSIS

For a fair comparison, we follow previous work on unsupervised feature selection and evaluate the performance of our proposed ANFS for clustering on several benchmark data sets, covering face recognition, human motion recognition and handwritten digit recognition.

A. Experimental Setup

We conduct experiments on eight benchmark data sets, including three face datasets (the Japanese Female Facial Expression database (JAFFE) [69], the UMIST face data set [70] and the Pointing4 dataset [71]), one motion dataset (the HumanEVA data set [72]), one object recognition dataset (the Columbia Object Image Library-20 (COIL-20) data set [73]), two handwritten digit datasets (the USPS dataset and the MNIST dataset [74]) and one text dataset (WebKB, collected by the University of Texas [75]). All of the image data sets were downloaded from different web sites for a comprehensive evaluation. In the experiments, we normalize each initial feature into [0, 1]. The details of these data sets are summarized in Table I.

TABLE I: Datasets description.
Data set     # of samples   # of features   # of classes
JAFFE
UMIST
HumanEVA     10,
COIL20       1,440          1,
USPS         9,
MNIST        5,
Pointing4    2,790          1,
WebKB        814            4,029           7

Note that none of the datasets used in this paper have official splits. Hence, we employ 5-fold cross-validation to split the datasets.
To be more specific, we partition each dataset into five subsets. We then select one subset for testing and use the remaining subsets for learning the optimal parameters. With the selected features, we independently repeat k-means clustering 50 times and report the average results together with the variance. Additionally, c is set to the true number of classes for each dataset, and the neighborhood size K is empirically set to 5 for all the datasets.

B. Compared Algorithms

To illustrate the effectiveness of our proposed ANFS for feature selection, we compare it with one baseline and several unsupervised feature selection methods. The compared algorithms are enumerated as follows:
1) All-Fea: All original features are preserved, as the baseline in the experiments.
2) Max Variance (MaxVar): Features are ranked according to the variance magnitude of each feature in descending order, followed by selection of the highest-ranked features.
3) Laplacian Score (LS) [2]: Features most consistent with the Gaussian Laplacian matrix are selected to best preserve the local manifold structure.
4) Multi-Cluster Feature Selection (MCFS) [43]: This unsupervised approach selects the features that best preserve the multi-cluster structure of the data. Features are selected using spectral regression with $\ell_1$-norm regularization.
5) Unsupervised Discriminative Feature Selection (UDFS) [34]: Under the assumption that the class label of input data can be predicted by a linear classifier, discriminative analysis and $\ell_{2,1}$-norm minimization are incorporated into a joint framework for unsupervised feature selection.
6) Nonnegative Discriminative Feature Selection (NDFS) [76]: NDFS exploits local discriminative information and feature correlations simultaneously. Besides, the manifold structure information is also considered jointly.
7) Unsupervised Feature Selection Using Trace Ratio Criterion (TRACK) [77]: This unsupervised feature selection algorithm integrates an unsupervised trace ratio formulation with structured sparsity-inducing norm regularization.

8) Unsupervised Feature Analysis with Class Margin Optimization (MMC) [53]: MMC integrates the Maximum Margin Criterion with a sparsity-based model into a joint framework, where the class margin and feature correlation are taken into account at the same time.

Following related studies on clustering, we leverage two popular evaluation metrics, namely clustering accuracy (ACC) and normalized mutual information (NMI), to measure the clustering performance after the various unsupervised feature selection methods.

C. Results and Analysis

In this section, we empirically evaluate the performance of the nine compared unsupervised feature selection algorithms in terms of ACC and NMI. Additionally, to demonstrate the effectiveness of the structure regularization on the reconstruction graph, we also set $\lambda = 0$ in the proposed model (8) and learn the optimal feature subset by solving

$$\min_{W_I} \; L(W_I, V) = \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2,$$

where the fixed weight vectors $v^{(i)}$ ($i = 1, 2, \dots, n$) are learned by LLE. To illustrate the influence of different initializations of the local reconstruction coefficients $V$ on the performance of the proposed algorithm, we also report the experimental results of ANFS with random initialization of $V$, namely ANFS (RI). The results under the two evaluation metrics are shown in Table II and Table III, respectively. From the experimental results, we have the following observations:
1) Compared with the baseline method (All-Fea), feature selection is able to enhance performance by removing noisy and redundant information. Besides, it can also make the subsequent processing more efficient by selecting a subset of the original features, which is crucial for high-dimensional data processing.
2) Better performance is expected when the local geometric structure of the data distribution is explored; for example, LS, MCFS, UDFS and NDFS generally perform much better than the baseline method.
3) MMC achieves the second-best performance in terms of both ACC and NMI on almost all datasets by jointly optimizing the maximum margin criterion and the sparsity-based model. TRACK achieves further improvement by harnessing the discriminant power of the trace ratio criterion to select discriminative features.
4) ANFS with $\lambda = 0$ performs worse than ANFS and ANFS (RI). This observation, on the one hand, illustrates the necessity of the structure regularization; on the other hand, it shows the effectiveness of the adaptive neighbor assignment.
5) ANFS and ANFS (RI) consistently outperform the other alternatives on the eight benchmark datasets. Hence, it is beneficial to incorporate the structure learning of the reconstruction graph into the procedure of adaptive neighbor assignment.
6) ANFS with LLE initialization of $V$ and ANFS with random initialization of $V$ perform comparably on the eight benchmark datasets. Consequently, the proposed algorithm can always achieve good local optima with different initializations.

To ensure that our improvement is statistically significant, we report non-parametric pairwise Wilcoxon tests in Tables IV and V. We compute the p-value of ANFS against each other method at a fixed significance level $\alpha$. From the results shown in Tables IV and V, we can see that our method achieves statistically significant improvements.
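As a sketch of the evaluation protocol described above, the snippet below runs k-means on the selected features, computes ACC via an optimal one-to-one matching of cluster labels to classes (Hungarian algorithm) and NMI, and indicates how a pairwise Wilcoxon signed-rank test can be applied to per-run scores. The use of scikit-learn and SciPy here is an assumption (the paper does not specify its tooling), and the data layout (rows as samples) is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment
from scipy.stats import wilcoxon

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: best one-to-one matching of cluster labels to classes."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == k) & (y_true == c))
    row, col = linear_sum_assignment(cost)   # Hungarian algorithm on negated counts
    return -cost[row, col].sum() / len(y_true)

def evaluate(X_sel, y_true, c, runs=50, seed=0):
    """Repeat k-means on the selected features and report mean ACC / NMI."""
    accs, nmis = [], []
    for r in range(runs):
        y_pred = KMeans(n_clusters=c, n_init=10, random_state=seed + r).fit_predict(X_sel)
        accs.append(clustering_acc(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred))
    return np.mean(accs), np.mean(nmis)

# Paired significance test between two methods' per-run ACC scores, e.g.:
# stat, p_value = wilcoxon(acc_runs_anfs, acc_runs_baseline)
```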
To verify the efficiency of the proposed model, we also report in Table VI the running time of the different feature selection algorithms on the different datasets. Since the filter-based MaxVar and LS algorithms ignore the possible correlation between different features and separate the evaluation of features from the clustering task, they take less time than the other algorithms. However, this is also the reason why they achieve poorer performance than the other algorithms (see Table II and Table III for details). In contrast to the other wrapper and embedded algorithms, our model achieves a significant improvement in speed, where the adoption of the selective matrix as well as the update of $V$ in $\mathbb{R}^K$ (rather than in $\mathbb{R}^n$) play important roles.

D. Parameter Sensitivity

In contrast to other unsupervised feature selection algorithms, our proposed ANFS requires only two parameters to be set in advance, namely $\lambda$ and the number of selected features. In the previous experiments, we estimate the best parameters using 5-fold cross-validation. To illustrate the influence of the parameters on the performance of feature selection, we conduct experiments to analyze the parameter sensitivity in terms of clustering accuracy. Experimental results on the eight benchmark datasets are shown in Figure 1. The regularization parameter $\lambda$ is tuned over $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. Empirically, we tune the number of selected features from half of the total number of features to the full feature size. From the results we can see that the performance changes differently with respect to different parameters on the various datasets; how to identify the optimal parameter values is data dependent. This demonstrates the need to use cross-validation to select the best parameters for each dataset. We have similar observations in terms of normalized mutual information.

E. Convergence Study

To solve the objective function, we have developed an efficient iterative algorithm. In the previous section, we proved the convergence of the proposed algorithm; in this section, we experimentally study its speed of convergence. We set the regularization parameter $\lambda$ to 1, which is a median value in the tuned range. The convergence curves on the eight benchmark datasets are shown in Figure 2.

TABLE II: Performance comparison in terms of clustering accuracy (ACC) ± std% for different selection methods over different data sets. Performance is reported in percentages.
Dataset     JAFFE  Umist  HumanEVA  Coil20  USPS  MNIST  Pointing4  WebKB
All-Fea      84.5 ± ± ± ± ± ± ± ± 1.4
MaxVar       81.6 ± ± ± ± ± ± ± ± 1.3
LS           86.3 ± ± ± ± ± ± ± ± 1.5
MCFS         86.9 ± ± ± ± ± ± ± ± 1.4
UDFS         88.1 ± ± ± ± ± ± ± ± 1.7
NDFS         88.9 ± ± ± ± ± ± ± ± 1.4
TRACK        89.8 ± ± ± ± ± ± ± ± 1.4
MMC          91.3 ± ± ± ± ± ± ± ± 1.6
ANFS(λ = 0)  90.8 ± ± ± ± ± ± ± ± 1.7
ANFS(RI)     93.3 ± ± ± ± ± ± ± ± 1.8
ANFS         93.5 ± ± ± ± ± ± ± ± 1.5

TABLE III: Performance comparison in terms of normalized mutual information (NMI) ± std% for different selection methods over different data sets. Performance is reported in percentages.
Dataset     JAFFE  Umist  HumanEVA  Coil20  USPS  MNIST  Pointing4  WebKB
All-Fea      90.3 ± ± ± ± ± ± ± ± 3.3
MaxVar       87.9 ± ± ± ± ± ± ± ± 2.8
LS           91.8 ± ± ± ± ± ± ± ± 3.4
MCFS         93.1 ± ± ± ± ± ± ± ± 3.1
UDFS         93.9 ± ± ± ± ± ± ± ± 3.5
NDFS         94.8 ± ± ± ± ± ± ± ± 2.9
TRACK        95.9 ± ± ± ± ± ± ± ± 1.9
MMC          97.1 ± ± ± ± ± ± ± ± 1.8
ANFS(λ = 0)  96.7 ± ± ± ± ± ± ± ± 2.0
ANFS(RI)     97.8 ± ± ± ± ± ± ± ± 1.9
ANFS         98.5 ± ± ± ± ± ± ± ± 1.7

TABLE IV: p-value of pairwise Wilcoxon test of ANFS against the other algorithms in terms of clustering accuracy.
Data Set   All-Fea  MaxVar  LS     MCFS   UDFS   NDFS   TRACK  MMC
JAFFE      3e-11    5e-11   2e-11  3e-11  5e-11  2e-11  4e-11  2e-11
UMIST      2e-11    4e-11   3e-11  2e-11  5e-11  3e-11  2e-11  4e-11
HumanEVA   2e-11    5e-11   3e-11  5e-11  2e-11  5e-11  4e-11  5e-11
Coil20     3e-11    5e-11   3e-11  2e-11  4e-11  5e-11  3e-11  2e-11
USPS       4e-11    3e-11   2e-11  3e-11  5e-11  3e-11  4e-11  2e-11
MNIST      5e-11    3e-11   6e-11  4e-11  3e-11  5e-11  4e-11  3e-11
Pointing4  3e-11    6e-11   4e-11  2e-11  5e-11  3e-11  4e-11  3e-11
WebKB      5e-11    3e-11   2e-11  5e-11  4e-11  5e-11  3e-11  2e-11

TABLE V: p-value of pairwise Wilcoxon test of ANFS against the other algorithms in terms of normalized mutual information.
Data Set   All-Fea  MaxVar  LS     MCFS   UDFS   NDFS   TRACK  MMC
JAFFE      3e-11    5e-11   4e-11  3e-11  3e-11  5e-11  3e-11  4e-11
UMIST      5e-11    3e-11   6e-11  3e-11  5e-11  6e-11  4e-11  3e-11
HumanEVA   4e-11    5e-11   3e-11  2e-11  6e-11  4e-11  3e-11  2e-11
Coil20     5e-11    4e-11   4e-11  3e-11  5e-11  5e-11  3e-11  5e-11
USPS       6e-11    3e-11   4e-11  3e-11  6e-11  3e-11  4e-11  6e-11
MNIST      4e-11    5e-11   2e-11  5e-11  5e-11  3e-11  4e-11  5e-11
Pointing4  7e-11    3e-11   4e-11  5e-11  3e-11  4e-11  5e-11  4e-11
WebKB      3e-11    5e-11   6e-11  4e-11  4e-11  3e-11  5e-11  4e-11

TABLE VI: Speed test for different feature selection algorithms over different data sets. Performance is reported in seconds.
Dataset  JAFFE  Umist  HumanEVA  Coil20  USPS  MNIST  Pointing4  WebKB
MaxVar   ± ± ± ± ± ± ± ±
LS       ± ± ± ± ± ± ± ±
MCFS     ± ± ± ± ± ± ± ±
UDFS     ± ± ± ± ± ± ± ±
NDFS     ± ± ± ± ± ± ± ±
TRACK    ± ± ± ± ± ± ± ±
MMC      ± ± ± ± ± ± ± ±
ANFS     ± ± ± ± ± ± ± ± 1.048

Fig. 1: Performance variations of the proposed method with respect to different values of λ and the number of selected features on (a) JAFFE, (b) UMIST, (c) HumanEVA, (d) Coil20, (e) USPS, (f) MNIST, (g) Pointing4 and (h) WebKB.

Fig. 2: Convergence curves of the proposed algorithm on the same eight datasets.

We can see that the proposed algorithm converges within 20 iterations for all the datasets, validating that the proposed algorithm is effective and converges quickly.

VII. CONCLUSIONS

In this paper, we propose a novel unsupervised feature selection model from a new perspective by learning the optimal feature subset and the adaptive reconstruction graph simultaneously. The LLE-based reconstruction graph is estimated to characterize the intrinsic local geometric properties, subject to a structure regularization; in such a way, the number of connected components in the reconstruction graph equals the number of clusters in the clustering task. Additionally, we introduce a selective matrix to specify the selected features without norm-based regularization, which eliminates the difficulties in tuning the trade-off parameter of the regularization. An efficient alternative optimization algorithm is exploited to address the proposed problem, together with theoretical analyses of its convergence and computational complexity. Extensive experimental results on several benchmark data sets demonstrate the effectiveness and superiority of the proposed approach. In this paper, we focus on the scenario in which the data points have a single view and serve a single task. In future work, it would be interesting to extend the proposed method to unsupervised feature selection for multi-view and multi-task data [78] and to other applications, e.g., co-saliency detection [79], [80].

REFERENCES

[1] M. Ben-Bassat, "Pattern recognition and reduction of dimensionality," Handbook of Statistics, vol. 2, 1982.

[2] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in NIPS.
[3] X. Chang, F. Nie, Y. Yang, C. Zhang, and H. Huang, "Convex sparse PCA for unsupervised feature learning," TKDD, vol. 11, no. 1, pp. 3:1-3:16.
[4] X. Chang, F. Nie, S. Wang, Y. Yang, X. Zhou, and C. Zhang, "Compound rank-k projections for bilinear analysis," IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 7.
[5] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Springer Science & Business Media, 2012.
[6] V. D. Silva and J. B. Tenenbaum, "Global versus local methods in nonlinear dimensionality reduction," in Advances in Neural Information Processing Systems, 2002.
[7] M. H. Law and A. K. Jain, "Incremental nonlinear dimensionality reduction by manifold learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 3.
[8] S. Li and D. Wei, "Extremely high-dimensional feature selection via feature generating samplings," IEEE Trans. Cybern., vol. 44, no. 6.
[9] X. Chang and Y. Yang, "Semi-supervised feature analysis by mining correlations among multiple tasks," IEEE Transactions on Neural Networks and Learning Systems.
[10] C. Freeman, D. Kulic, and O. Basir, "Feature-selected tree-based classification," IEEE Trans. Cybern., vol. 43, no. 6.
[11] E. Romero and J. M. Sopena, "Performing feature selection with multilayer perceptrons," IEEE Trans. Neural Netw., vol. 19, no. 3.
[12] S. Guérif, "Unsupervised variable selection: when random rankings sound as irrelevancy," in JMLR W&P.
[13] X. Chang, F. Nie, Y. Yang, and H. Huang, "A convex formulation for semi-supervised multi-label feature selection," in AAAI.
[14] J. M. Pena and R. Nilsson, "On the complexity of discrete feature selection for optimal classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8.
[15] Z. Zhao and H. Liu, "Spectral feature selection for supervised and unsupervised learning," in Proceedings of the 24th International Conference on Machine Learning, 2007.
[16] Y. Han, Y. Yang, Y. Yan, Z. Ma, N. Sebe, and X. Zhou, "Semi-supervised feature selection via spline regression for video semantic recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2.
[17] Z. Zhao and H. Liu, "Semi-supervised feature selection via spectral analysis," in SDM. SIAM, 2007.
[18] L. Wolf and A. Shashua, "Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach," J. Mach. Learn. Res., vol. 6.
[19] X. Zhu, X. Li, S. Zhang, C. Ju, and X. Wu, "Robust joint graph sparse coding for unsupervised spectral feature selection," IEEE Trans. Neural Netw. Learn. Syst., to appear.
[20] Q. Cheng, H. Zhou, and J. Cheng, "The Fisher-Markov selector: fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 6.
[21] J. G. Dy and C. E. Brodley, "Feature selection for unsupervised learning," J. Mach. Learn. Res., vol. 5.
[22] T. Hancock and H. Mamitsuka, "Boosted network classifiers for local feature selection," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 11.
[23] Y. Li, J. Si, G. Zhou, S. Huang, and S. Chen, "FREL: A stable feature selection algorithm," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 7.
[24] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3.
[25] L. Laporte, R. Flamary, S. Canu, S. Déjean, and J. Mothe, "Nonconvex regularizations for feature selection in ranking with sparse SVM," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 6.
[26] D. Chakraborty and N. R. Pal, "Selecting useful groups of features in a connectionist framework," IEEE Trans. Neural Netw., vol. 19, no. 3.
[27] W. Yang, Y. Gao, Y. Shi, and L. Cao, "MRM-Lasso: A sparse multiview feature selection method via low-rank analysis," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 11.
[28] H.-J. Lai, Y. Pan, Y. Tang, and R. Yu, "FSMRank: Feature selection algorithm for learning to rank," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons.
[30] X. He and X. Niyogi, "Locality preserving projections," in Neural Information Processing Systems, vol. 16, 2004.
[31] Z. Zhao, L. Wang, H. Liu, and J. Ye, "On similarity preserving feature selection," IEEE Trans. Knowl. Data Eng., vol. 25, no. 3.
[32] L. Du and Y.-D. Shen, "Unsupervised feature selection with adaptive structure learning," in ACM SIGKDD, 2015.
[33] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in NIPS.
[34] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, "ℓ2,1-norm regularized discriminative feature selection for unsupervised learning," in IJCAI.
[35] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500.
[36] Y. Fu and T. S. Huang, "Locally linear embedded eigenspace analysis," IFP-TR, Univ. of Illinois at Urbana-Champaign, Tech. Rep.
[37] L. K. Saul and S. T. Roweis, "Think globally, fit locally: unsupervised learning of low dimensional manifolds," J. Mach. Learn. Res., vol. 4.
[38] S. Xiang, F. Nie, C. Zhang, and C. Zhang, "Nonlinear dimensionality reduction with local spline embedding," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9.
[39] F. Nie, X. Wang, and H. Huang, "Clustering and projected clustering with adaptive neighbors," in ACM SIGKDD.
[40] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, On the Surprising Behavior of Distance Metrics in High Dimensional Space. Springer.
[41] C. Sammut and G. I. Webb, Encyclopedia of Machine Learning. Springer Science & Business Media.
[42] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," ACM SIGKDD, vol. 6, no. 1.
[43] D. Cai, C. Zhang, and X. He, "Unsupervised feature selection for multi-cluster data," in ACM SIGKDD.
[44] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, no. 1.
[45] H. Liu, X. Wu, and S. Zhang, "Feature selection using hierarchical feature clustering," in ACM CIKM.
[46] Z. Li, J. Liu, Y. Yang, X. Zhou, and H. Lu, "Clustering-guided sparse structural learning for unsupervised feature selection," IEEE Trans. Knowl. Data Eng., vol. 26, no. 9.
[47] M. Qian and C. Zhai, "Robust unsupervised feature selection," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013.
[48] S. Wang, J. Tang, and H. Liu, "Embedded unsupervised feature selection," in AAAI.
[49] M. Masaeli, J. G. Dy, and G. M. Fung, "From transformation-based dimensionality reduction to feature selection," in ICML.
[50] S. Tabakhi, P. Moradi, and F. Akhlaghian, "An unsupervised feature selection algorithm based on ant colony optimization," Engineering Applications of Artificial Intelligence, vol. 32.
[51] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, no. 1-3.
[52] S. Maldonado and R. Weber, "A wrapper method for feature selection using support vector machines," Inf. Sci., vol. 179, no. 13.
[53] S. Wang, F. Nie, X. Chang, L. Yao, X. Li, and Q. Z. Sheng, "Unsupervised feature analysis with class margin optimization," in Machine Learning and Knowledge Discovery in Databases. Springer, 2015.
[54] C. Maugis, G. Celeux, and M.-L. Martin-Magniette, "Variable selection for clustering with Gaussian mixture models," Biometrics, vol. 65, no. 3.
[55] Z. Zhao, L. Wang, H. Liu et al., "Efficient spectral feature selection with minimum redundancy," in AAAI.
[56] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," J. Mach. Learn. Res., vol. 3.
[57] X. Zhu, X. Wu, W. Ding, and S. Zhang, "Feature selection by joint graph sparse coding," in ICDM.
[58] H. Zeng and Y.-m. Cheung, "Feature selection and kernel learning for local learning-based clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, 2011.

Feiping Nie received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China. He is currently a Professor with the Center for OPTical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an, China.
His current research interests include machine learning and its application fields, such as pattern recognition, data mining, computer vision, image processing, and information retrieval.

Xiaojun Chang received the Ph.D. degree in computer science from the Centre for Quantum Computation and Intelligent Systems (QCIS), University of Technology Sydney, Australia. He has been working as a Research Associate in the Language Technologies Institute of Carnegie Mellon University. His main research interests include machine learning, data mining, and computer vision.

Yi Yang received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China. He is currently an Associate Professor with the University of Technology Sydney, Australia. He was a Post-Doctoral Researcher with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. His current research interests include machine learning and its applications to multimedia content analysis and computer vision, such as multimedia indexing and retrieval, surveillance video analysis, and video semantics understanding.

Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA. He is currently with the faculty of the Department of Computer Science and the Language Technologies Institute, CMU. His research interests include man-machine communication, natural language processing, speech understanding and synthesis, video analysis, and machine learning.

Minnan Luo received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China. She is currently an Assistant Professor in the School of Electronic and Information Engineering at Xi'an Jiaotong University. She was a Post-Doctoral Researcher with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Her research interests include machine learning and optimization, video analysis, cross-media retrieval, and fuzzy system identification and modeling.

Qinghua Zheng received the B.S. degree in computer software in 1990, the M.S. degree in computer organization and architecture in 1993, and the Ph.D. degree in systems engineering in 1997, all from Xi'an Jiaotong University, China. He was a postdoctoral researcher at Harvard University. He is currently a Professor at Xi'an Jiaotong University and the dean of the Department of Computer Science. His research areas include computer network security, intelligent e-learning theory and algorithms, multimedia e-learning, and trustworthy software.
