Adaptive Unsupervised Feature Selection with Structure Regularization


Minnan Luo, Feiping Nie, Xiaojun Chang, Yi Yang, Alexander G. Hauptmann and Qinghua Zheng

Abstract: Feature selection is one of the most important dimension reduction techniques owing to its efficiency and interpretability. Since practical data at large scale are usually collected without labels, and labelling these data is dramatically expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometric structure of the original feature space and produce a faithful feature subset that preserves the intrinsic structure accurately. In this paper, we characterize the intrinsic local structure by an adaptive reconstruction graph and simultaneously consider its multi-connected-component (multi-cluster) structure by imposing a rank constraint on the corresponding Laplacian matrix. To achieve a desirable feature subset, we learn the optimal reconstruction graph and selective matrix simultaneously, instead of using a predetermined graph. We exploit an efficient alternative optimization algorithm to solve the proposed challenging problem, together with theoretical analyses on its convergence and computational complexity. Finally, extensive experiments on the clustering task are conducted over several benchmark datasets to verify the effectiveness and superiority of the proposed unsupervised feature selection algorithm.

Index Terms: Dimension Reduction, Unsupervised Feature Selection, Adaptive Neighbors, Local Linear Embedding, Structure Regularization.

This work was supported in part by the National Science Foundation of China under Grant , Grant , and Grant , in part by the Data to Decisions Cooperative Research Centre, in part by the National Key Research and Development Program of China under Grant 2016YFB , and in part by the China Post-Doctoral Science Foundation under Grant 2015M . Corresponding authors: Feiping Nie, Xiaojun Chang. Minnan Luo and Qinghua Zheng are with the SPKLSTN Lab, Department of Computer Science, Xi'an Jiaotong University, Shaanxi, China ({nluo,qhzheng}@mail.xjtu.edu.cn). Feiping Nie is with the Center for OPTical Imagery Analysis and Learning, Northwestern Polytechnical University, China (feipingnie@gmail.com). Xiaojun Chang and Alexander G. Hauptmann are with the School of Computer Science, Carnegie Mellon University, PA, USA (cxj273@gmail.com, alex@cs.cmu.edu). Yi Yang is with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Australia (yee.i.yang@gmail.com).

I. INTRODUCTION

HIGH-DIMENSIONAL data have been generated commonly in many scientific domains such as visual category recognition, gene expression array analysis and time series prediction. However, it is usually computationally expensive to directly handle high-dimensional data with massive storage cost. In fact, the intrinsic dimensionality [1] of data is often suspected to be much lower for several reasons, such as the irrelevant and correlated features involved in the high-dimensional feature space [1], [2], [3], [4]. Two kinds of approaches, namely feature extraction and feature selection, are typically investigated to learn a low-dimensional compact representation of data [5], [6], [7], [8], [9].
The former reduces the dimension by combining the original features to construct new features, while the latter consists in dropping some irrelevant or weakly relevant features. In this paper, we focus on feature selection because it not only promotes computational efficiency and improves performance significantly, but also keeps the interpretability of the reduced description [10], [11], [12], [13]. From the perspective of utilizing label information, feature selection algorithms are mainly categorized as supervised learning [14], [15], semi-supervised learning [16], [17], and unsupervised learning [18], [19], [15]. The first two types of feature selection depend, to some extent, on label information to guide the feature evaluation by encoding the discriminative information of features in labels. However, large-scale data are usually collected without labels in many practical applications, and annotating these unlabeled data is dramatically expensive and time-consuming [20]. As a result, unsupervised feature selection becomes more ubiquitous and challenging for many real-world applications [21]. In the past decades, unsupervised feature selection has attracted much attention from researchers and significantly facilitates many machine learning tasks such as classification, clustering, retrieval and ranking [22], [23], [24], [25], [26], [27], [28]. Without label information, the fundamental issue in unsupervised feature selection is how to model the manifold geometry structure of the whole feature set and produce a faithful feature subset which preserves the intrinsic structure accurately [29], [30]. Typically, the structure is characterized in the form of a graph, including, but not limited to, the pairwise similarity graph [31], [32], the K-nearest neighbor (KNN) graph [33], local total scatter and between-class scatter matrices [34], and the local symmetries of linear reconstruction in Locally Linear Embedding (LLE) [35], [36], [37]. Note that these graph-based structures are commonly calculated in the original high-dimensional feature space. Once the graph is determined, it is fixed during the procedure of feature selection. As a result, feature selection models rely heavily on the predetermined graph [32]. In particular, the similarity graph based on pairwise distances might not be the optimal one and may fail to capture the underlying geometric structure accurately for the following two reasons. On one hand, the similarity graph is usually constructed through a Gaussian function, which makes the model highly sensitive to parameter tuning [38], [39]. On the other hand, the pairwise distance measured in a high-dimensional feature space might not be qualitatively meaningful due to the curse of dimensionality, i.e., data points in high-dimensional space are approximately equidistant from each other [40], [41], [42].

Instead of estimating pairwise distances, LLE characterizes the local geometric properties by linear coefficients that reconstruct each data point from its neighbors, namely reconstruction weights [35]. However, the reconstruction-weight based graph assumes that all the data points are connected as just one connected component, i.e., the structure information of the graph is ignored completely. In particular, for the clustering task of partitioning the data into c clusters, the desired neighbor assignment is that the graph has exactly c connected components [39]. In this paper, we follow LLE and propose a novel unsupervised feature selection model from a new point of view. Our model characterizes the intrinsic geometric properties of each neighborhood by adaptive reconstruction weights instead of predetermined ones, and simultaneously we impose a rank constraint on the Laplacian matrix of the reconstruction-weight based graph to achieve an ideal neighbor assignment. In such a way, the number of connected components in the graph is exactly the number of clusters, and each connected component corresponds to one cluster. Additionally, we introduce a selective matrix to specify the desired feature subset by equipping the general transformation matrix with an explainable structure. This strategy makes regularization on the general transformation matrix unnecessary and eliminates the difficulties in tuning the trade-off parameter. Extensive experimental results on several benchmark datasets demonstrate the effectiveness and superiority of the proposed method. In summary, the main contributions of this paper are as follows:
1) We characterize the intrinsic geometric structure of each neighborhood through a nonnegative reconstruction-weight graph and impose a rank constraint on its Laplacian matrix to achieve an ideal neighbor assignment.
2) Instead of using a predetermined reconstruction graph for feature selection, we introduce a selective matrix without regularization and learn the optimal feature subset together with the adaptive reconstruction graph simultaneously.
3) An efficient alternative optimization algorithm is exploited to solve the proposed challenging problem, with theoretical analyses on its convergence and computational complexity.

The remainder of this paper is organized as follows. We briefly review related work on unsupervised feature selection in Section II. In Section III, we first introduce a selective matrix to specify the selected feature subset and then propose a novel model that learns the optimal feature subset with structure regularization of an adaptive reconstruction weight matrix. We exploit an efficient alternating algorithm in Section IV to solve the proposed challenging problem, followed by theoretical analyses on its convergence and computational complexity in Section V. In Section VI, we conduct extensive experiments over several benchmark data sets to illustrate the effectiveness and superiority of the proposed method. Section VII concludes this work.

Notations and Definitions: We employ the usual notations throughout this paper. Reals are written as lowercase letters; vectors are denoted by boldface lowercase letters while matrices are presented as uppercase letters. 0 and 1 are column vectors whose elements are all zero and one, respectively.
For any d-dimensional vector $z = [z_1, z_2, \dots, z_d]^\top \in \mathbb{R}^d$, its $\ell_p$-norm is defined as $\|z\|_p = \big(\sum_{i=1}^{d} |z_i|^p\big)^{1/p}$. For any matrix $A = (a_{ij}) \in \mathbb{R}^{p \times q}$, we denote its $i$-th row and $j$-th column by $a^i$ and $a_j$, respectively. The Frobenius norm of $A$ is defined as $\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2} = \sqrt{\operatorname{Tr}(A A^\top)}$, where $\operatorname{Tr}(\cdot)$ refers to the trace of a matrix. $D = \operatorname{diag}(d_1, d_2, \dots, d_p) \in \mathbb{R}^{p \times p}$ is a diagonal matrix with $(i,i)$-th diagonal element $d_i$ ($\forall i$). We suppose dataset $X = \{x_i : i = 1, 2, \dots, n\}$ consists of $n$ data points, where each data point $x_i$ is represented by a $d$-dimensional feature vector. For each data point $x_i$ ($i = 1, 2, \dots, n$), we denote its K-nearest-neighbor set by $N_K(x_i) = \{x_{i(k)} : k = 1, 2, \dots, K\}$. Let $F = \{f_1, f_2, \dots, f_d\}$ be the set of $d$ features of dataset $X$. We assume the selected feature subset of size $m$ ($m \ll d$) is denoted by $F_I = \{f_{I(1)}, f_{I(2)}, \dots, f_{I(m)}\} \subseteq F$, where $I$ is a permutation of $1, 2, \dots, d$.

II. RELATED WORKS

A. Unsupervised Feature Selection

Unsupervised feature selection aims to find a relevant subset of features that contains the most discriminative information and preserves the intrinsic geometric structure as much as possible [33], [43]. In the past decades, studies on unsupervised feature selection have mainly fallen into three groups: filter, wrapper and embedded approaches [44], [18]. Algorithms based on the filter model separate the evaluation of features from the learning tasks, i.e., they select the optimal features by only examining intrinsic properties of the data. For example, under the assumption that two data points belong to the same class if they are close to each other, He et al. [33] defined a Laplacian Score to describe the locality preserving power of a feature. However, this method neglects the possible correlation between different features [43]. Zhao et al. [15] exploited a unified framework for supervised and unsupervised feature selection based on spectral graph theory. Liu et al. [45] built on information measurement and performed the selection procedure as feature clustering in a hierarchically agglomerative way. Yang et al. [34] incorporated discriminative analysis and $\ell_{2,1}$-norm minimization into a joint framework for unsupervised feature selection; Li et al. [46] further performed spectral clustering with a nonnegative constraint for more accurate cluster labels. Considering the unreasonable constraint on the feature selection transformation matrix in [34], [46], Qian et al. [47] proposed a pseudo-cluster-label based feature selection with local-learning regularized robust non-negative matrix factorization. Wang et al. [48] embedded feature selection into a clustering algorithm through sparse learning without the transformation. However, these methods might fail to capture the intrinsic structure accurately since they depend on a predetermined similarity matrix computed in the original feature space with Gaussian functions.

Additionally, Masaeli et al. [49] converted transformation-based methods, including linear discriminant analysis (LDA) and the Hilbert-Schmidt Independence Criterion (HSIC), into two new feature selection algorithms through $\ell_1/\ell_\infty$ regularization. Tabakhi et al. [50] proposed a filter-based multivariate method which uses ant colony optimization to compute the similarity between features. Note that the filter model is relatively computationally efficient but might fail to select the most informative features for a particular learning task [19]. In contrast, approaches based on the wrapper model require a predetermined learning algorithm and wrap the procedure of feature selection around it to evaluate relevant features. For example, Guyon et al. [51] refined the optimal feature subsets by combinatorial search using SVMs in a wrapper approach; Maldonado et al. [52] built on SVMs with kernel functions and removed redundant features according to the number of errors on a validation subset; Dy et al. [21] exploited an Expectation-Maximization (EM) clustering algorithm and evaluated the candidate feature subsets through two different performance criteria, i.e., scatter separability and maximum likelihood. Wrapper-based methods perform better than filter models [53]. However, as argued in [19], [46], wrapper models are usually computationally more expensive and prone to over-fitting. Taking the advantages of both filter and wrapper based methods, embedded methods incorporate feature selection into a joint framework of model construction [48], [19]. For example, Maugis et al. [54] took the relation between the irrelevant variables and the clustering variables into consideration, and regarded feature selection as a model-based clustering problem. Cai et al. [43] developed a multi-cluster feature selection algorithm based on manifold learning and $\ell_1$-norm regularization. However, this method evaluates features individually and cannot handle redundant features [55]. Weston et al. [56] formulated feature selection and pattern classification objectives in a single optimization by minimizing an $\ell_0$-norm regularizer. Zhao et al. [55] proposed a spectral feature selection method that identifies feature redundancy based on a sparse multi-output regression with an $\ell_{2,1}$-norm constraint. However, this method might deteriorate the performance since it conducts spectral regression after manifold learning. For this issue, Zhu et al. [57] performed unsupervised spectral feature selection with manifold learning and regression simultaneously, and further developed a joint graph sparse coding model for robust unsupervised feature selection [19]. However, this method employed a predetermined radial basis kernel function to measure the similarity between two samples, which might fail to characterize the intrinsic local structure of data [35]. Zeng et al. [58] incorporated the updating of the graph Laplacian into the built-in regularization of a learning-based clustering algorithm to improve the quality of structure learning. Instead of using all input features to characterize the underlying structure of data, Du et al. [32] perform feature selection iteratively with a refined structure of data. However, the computational cost of this method is daunting for large-scale data since it involves the pairwise similarity matrix as well as the reconstruction of each data point from all the other data points.
B. Locally Linear Embedding (LLE)

The primary focus of LLE is to recover the global nonlinear structure of high-dimensional data from linear fits and to learn a neighborhood-preserving embedding by exploiting the local symmetries of linear reconstructions [35]. Specifically, LLE characterizes the local geometry of each neighborhood by linear coefficients $v^{(i)} = [v_{i(1)}, v_{i(2)}, \dots, v_{i(K)}]^\top \in \mathbb{R}^K$ that reconstruct the $i$-th data point from its neighbors. In other words, the coefficient $v_{i(k)}$ summarizes the contribution of neighbor $x_{i(k)}$ to the reconstruction of data point $x_i$. The weights $v_{i(k)}$ are estimated by minimizing the reconstruction errors over all data points, i.e.,

$$\min_{\sum_{k=1}^{K} v_{i(k)} = 1\,(\forall i)} \; T(V) = \sum_{i=1}^{n} \Big\| x_i - \sum_{k=1}^{K} v_{i(k)} x_{i(k)} \Big\|_2^2, \qquad (1)$$

where the sparse weight matrix $V = [v_1, v_2, \dots, v_n] \in \mathbb{R}^{n \times n}$, with $v_i = [v_{i1}, v_{i2}, \dots, v_{in}]^\top \in \mathbb{R}^n$, collects all contribution coefficients such that $v_{ij} = v_{i(k)}$ if $j = i(k)$ and $x_{i(k)} \in N_K(x_i)$, and $v_{ij} = 0$ otherwise. It has been pointed out in [35] that the weight matrix $V$ obeys an important symmetry, and thus the reconstruction weights characterize the intrinsic geometric properties of each neighborhood effectively. By symmetry, we define an undirected graph $G = (X, V)$, where dataset $X$ denotes the set of vertices and $V$ collects all edge weights, i.e., an edge that connects vertices $x_i$ and $x_j$ is represented by the contribution coefficient $v_{ij}$ ($i, j = 1, 2, \dots, n$). LLE assumes that the same weights that reconstruct the data points in the high-dimensional space $\mathbb{R}^d$ should also reconstruct their embedded manifold coordinates in the low-dimensional space $\mathbb{R}^m$ ($m \ll d$). As a result, it determines the corresponding $m$-dimensional vector $y_i$ for data point $x_i \in \mathbb{R}^d$ by solving the following optimization problem:

$$\min_{y_i\,(\forall i)} \; \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} v_{ij} y_j \Big\|_2^2. \qquad (2)$$

However, this strategy estimates the input-to-manifold mapping solely from the training data points. As a result, it fails to project new data points outside the training dataset [36]. For this issue, some studies [36], [32] introduce a transformation matrix $W \in \mathbb{R}^{d \times m}$ to project a $d$-dimensional data point to an $m$-dimensional one through $y_i = W^\top x_i$, where $W$ is estimated by solving the following optimization problem with respect to the fixed weight matrix $V$:

$$\min_{W} \; \sum_{i=1}^{n} \Big\| W^\top x_i - \sum_{j=1}^{n} v_{ij} W^\top x_j \Big\|_2^2. \qquad (3)$$

However, these approaches employ a general transformation matrix to reduce the feature dimension by extracting new coordinates (features). For the task of unsupervised feature selection, we instead aim to select a discriminative subset of the original features from a larger feature pool for better interpretability. Additionally, it is noteworthy that previous approaches based on LLE usually ignore the structure of the graph $G$; in particular, for the clustering task of partitioning the data into $c$ clusters, the desired neighbor assignment is that the graph $G$ has exactly $c$ connected components [39].
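The reconstruction weights in Eq. (1) decouple over data points and can be obtained in closed form from a local Gram system, as in standard LLE. The following is a minimal sketch of that computation for a single point; it assumes Euclidean K-nearest neighbors, adds a small ridge term for numerical stability, and uses illustrative names rather than the authors' code.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights of one point from its K nearest neighbors (Eq. (1)).

    x         : (d,)   data point
    neighbors : (d, K) columns are the K nearest neighbors of x
    reg       : small ridge term for numerical stability when K > d
    Solves  min_v ||x - neighbors @ v||^2  s.t.  v.sum() == 1  via the local
    Gram matrix, as in standard LLE.
    """
    K = neighbors.shape[1]
    diff = neighbors - x[:, None]              # shift the neighborhood to the origin
    G = diff.T @ diff                          # local Gram matrix (K x K)
    G = G + reg * np.trace(G) * np.eye(K) / K  # regularize (common LLE practice)
    v = np.linalg.solve(G, np.ones(K))         # solve G v = 1
    return v / v.sum()                         # enforce the sum-to-one constraint
```

Calling this for every point (e.g., `v = lle_weights(X[:, i], X[:, nbr_idx[i]])`) assembles the sparse weight matrix V used throughout the rest of the paper.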

In other words, the multi-connected-component information is completely ignored by previous LLE-based unsupervised feature selection algorithms.

III. PROPOSED METHODOLOGY

In this section, we develop a novel model for unsupervised feature selection with an adaptive reconstruction-weight graph. This method replaces the transformation matrix with a selective matrix and incorporates the adaptive reconstruction graph into the procedure of feature selection.

A. Selective Matrix

To specify the selected features in the procedure of feature selection, we equip the conventional transformation matrix with an explainable structure, namely the selective matrix. Formally, for feature subset $F_I$, we define the corresponding selective matrix as $W_I = [w_{I(1)}, w_{I(2)}, \dots, w_{I(m)}] \in \{0, 1\}^{d \times m}$, where $w_{I(i)} \in \mathbb{R}^d$ ($i = 1, 2, \dots, m$) is a column vector whose components are all 0 except for a 1 in the $I(i)$-th position. We observe that $W_I$ is indeed a column-full-rank transformation matrix. With the selective matrix $W_I$, the procedure of original feature selection is expressed as transforming a $d$-dimensional data point $x$ into an $m$-dimensional data point $y$ through

$$y = W_I^\top x, \qquad (4)$$

where $m \ll d$. Thanks to its particular structure, the selective matrix $W_I$ offers three significant advantages. First, it makes the feature selection algorithm faster due to its extreme sparsity [59]. Second, the feature selection results based on $W_I$ are much more explainable than those of a general transformation matrix. Last but not least, the particular structure makes norm-based regularization on the transformation matrix unnecessary, which explicitly avoids the problem of trade-off parameter tuning and its expensive computation. Based on the selective matrix $W_I \in \mathbb{R}^{d \times m}$, we formulate the reconstruction error of the original feature subset $F_I$ as

$$L(W_I, V) = \sum_{i=1}^{n} \Big\| W_I^\top x_i - \sum_{k=1}^{K} v_{i(k)} W_I^\top x_{i(k)} \Big\|_2^2 \qquad (5)$$
$$= \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2, \qquad (6)$$

where $X_i = [x_{i(1)}, x_{i(2)}, \dots, x_{i(K)}] \in \mathbb{R}^{d \times K}$ collects the $K$ nearest neighbors of data point $x_i$, and the coefficient vector $v^{(i)} = [v_{i(1)}, v_{i(2)}, \dots, v_{i(K)}]^\top \in \mathbb{R}^K$ represents the contribution weights of the $K$ nearest neighbors of data point $x_i$.

B. Structure Regularization of the Reconstruction Graph

For a desirable neighbor assignment (mentioned in the previous section), the contribution weights $V$ should be constrained such that the graph $G = (X, V)$ has exactly $c$ connected components [39]. In fact, the following important property describes the connected-component information of a graph with nonnegative weights.

Theorem 1 ([60]). A graph $G = (X, V)$ with $v_{ij} \ge 0$ ($\forall i, j$) has $c$ connected components if and only if the algebraic multiplicity of eigenvalue 0 of the corresponding Laplacian matrix $L_V$ is $c$, where the Laplacian matrix is defined as $L_V = D_V - (V + V^\top)/2 \in \mathbb{R}^{n \times n}$, and $D_V \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose $(i,i)$-th diagonal element is $\sum_{j=1}^{n} (v_{ij} + v_{ji})/2$ for $i = 1, 2, \dots, n$.

According to Theorem 1, if there are $c$ connected subgraphs (clusters) in graph $G = (X, V)$, the equation $\operatorname{rank}(L_V) = n - c$ must be satisfied by the corresponding Laplacian matrix $L_V$. Motivated by Theorem 1, we impose the constraint $\operatorname{rank}(L_V) = n - c$ on the Laplacian matrix of graph $G$, and formulate the idea of adaptive unsupervised feature selection with structure regularization as the following optimization problem:

$$\min_{W_I, V} \; L(W_I, V) = \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 \qquad (7)$$
$$\text{s.t.} \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i), \quad \operatorname{rank}(L_V) = n - c,$$

where $\mathbf{0}$ and $\mathbf{1}$ denote K-dimensional vectors whose entries are all 0 and 1, respectively.
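As a small numerical illustration of Theorem 1 (not part of the original paper), the sketch below builds $L_V = D_V - (V + V^\top)/2$ for a toy nonnegative weight matrix with two obvious connected components and checks that the multiplicity of the zero eigenvalue equals the number of components.

```python
import numpy as np

def laplacian(V):
    """L_V = D_V - (V + V^T)/2 for a nonnegative weight matrix V (Theorem 1)."""
    S = (V + V.T) / 2.0
    return np.diag(S.sum(axis=1)) - S

# Toy graph with two obvious connected components: {0, 1} and {2, 3}.
V = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
L = laplacian(V)
eigvals = np.linalg.eigvalsh(L)
c = int(np.sum(eigvals < 1e-10))    # multiplicity of eigenvalue 0
print(c, np.linalg.matrix_rank(L))  # prints 2 and 2, i.e. rank(L_V) = n - c
```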
This formulation characterizes the intrinsic structure of unsupervised data through geometric reconstruction, similar to LLE, instead of a pairwise similarity matrix based on Gaussian functions. Beyond LLE, we impose a nonnegativity constraint on the contribution coefficients, which is more advantageous since it further ensures the sparsity of the coefficients [61]. Importantly, we constrain the structure of the reconstruction graph such that the number of connected components in the graph is exactly the number of clusters. Due to the difficulty of handling the rank constraint on the structure of graph $G$, we follow the strategy used in [39] and reformulate optimization problem (7) as an easier one. Let $\sigma_i(L_V) \ge 0$ be the $i$-th smallest eigenvalue of the Laplacian matrix $L_V$ ($i = 1, 2, \dots, n$). Then the constraint $\operatorname{rank}(L_V) = n - c$ is satisfied if the smallest $c$ eigenvalues of $L_V$ are zero. As a result, optimization problem (7) is equivalent to the following problem for a large enough value of $\lambda$:

$$\min_{W_I, V} \; \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \lambda \sum_{i=1}^{c} \sigma_i(L_V) \qquad (8)$$
$$\text{s.t.} \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i).$$

Furthermore, we assign each data point $x_i$ in the graph $G$ a vector $q_i \in \mathbb{R}^c$ ($i = 1, 2, \dots, n$), and collect all the vectors into a matrix $Q = [q_1, q_2, \dots, q_n]^\top \in \mathbb{R}^{n \times c}$. According to Ky Fan's theorem [62], we have

$$\sum_{i=1}^{c} \sigma_i(L_V) = \min_{Q^\top Q = I_c} \operatorname{Tr}(Q^\top L_V Q), \qquad (9)$$

where $I_c$ is the $c \times c$ identity matrix. As a result, we introduce the variable $Q$ into optimization problem (8) and further reformulate it as the following problem:

$$\min_{W_I, Q, V} \; \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \lambda \operatorname{Tr}(Q^\top L_V Q) \qquad (10)$$
$$\text{s.t.} \quad Q^\top Q = I_c; \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i).$$
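The Ky Fan identity in Eq. (9) is easy to check numerically: for a symmetric Laplacian, the minimum of $\operatorname{Tr}(Q^\top L_V Q)$ over orthonormal $Q \in \mathbb{R}^{n \times c}$ is attained by the eigenvectors of the $c$ smallest eigenvalues. Below is a quick sanity check on a random toy graph (the data and names are assumptions, used only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 3
V = rng.random((n, n))                # toy nonnegative weights (assumed data)
S = (V + V.T) / 2.0
L = np.diag(S.sum(axis=1)) - S        # Laplacian as in Theorem 1

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
Q = eigvecs[:, :c]                    # eigenvectors of the c smallest eigenvalues
lhs = eigvals[:c].sum()               # sum of the c smallest eigenvalues
rhs = np.trace(Q.T @ L @ Q)           # Tr(Q^T L_V Q) at this minimizer
print(np.isclose(lhs, rhs))           # True, illustrating Eq. (9)
```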

In summary, for the clustering task with unsupervised feature selection, optimization problems (7) and (10) are significant in two respects. On one hand, they explicitly take the structure of graph $G$ into consideration and potentially learn the optimal local geometric structure by adaptively assigning neighbors to each data point. On the other hand, instead of using a predetermined weight matrix $V$ to characterize the intrinsic manifold structure of the whole feature space, the proposed model attempts to learn the contribution weights and the most informative feature subset simultaneously.

IV. OPTIMIZATION PROCEDURE

In this section, we exploit an efficient alternative optimization algorithm to address the proposed challenging problem.

A. Optimize Q

With variables $V$ and $W_I$ fixed, optimization problem (10) becomes the following problem with respect to variable $Q$:

$$\min_{Q^\top Q = I_c} \operatorname{Tr}(Q^\top L_V Q). \qquad (11)$$

Subject to the constraint $Q^\top Q = I_c$, we arrive at the following eigenvalue problem via Lagrangian optimization:

$$L_V Q = Q \Lambda, \qquad (12)$$

where $\Lambda$ is the diagonal Lagrange multiplier matrix [36]. As a result, the optimal solution $Q$ of problem (11) is formed by the $c$ eigenvectors of the current Laplacian matrix $L_V$ corresponding to its $c$ smallest eigenvalues.

B. Optimize $W_I$

With variables $V$ and $Q$ fixed, optimization problem (10) turns into seeking the feature subset $F_I \subseteq F$ by solving the following problem with respect to the variable $W_I \in \{0, 1\}^{d \times m}$:

$$\min_{W_I} \; L(W_I) = \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2. \qquad (13)$$

Due to the constraint $v^{(i)\top} \mathbf{1} = 1$ ($\forall i$), the reconstruction error in terms of feature subset $F_I$ can be rearranged as

$$L(W_I) = \sum_{i=1}^{n} \Big\| \sum_{k=1}^{K} v_{i(k)} \big( W_I^\top x_i - W_I^\top x_{i(k)} \big) \Big\|_2^2 = \operatorname{Tr}\big( W_I^\top X_V W_I \big) \qquad (14)$$
$$= \sum_{k=1}^{m} w_{I(k)}^\top X_V w_{I(k)}, \qquad (15)$$

where $X_V = X (I_n - V)^\top (I_n - V) X^\top \in \mathbb{R}^{d \times d}$. Considering the specific structure of the selective matrix $W_I$, for each feature $f_i \in F$ we define its feature-level reconstruction score based on the current graph $G = (X, V)$ by

$$\operatorname{score}(f_i) = w_i^\top X_V w_i, \qquad (16)$$

where $w_i \in \{0, 1\}^d$ ($i = 1, 2, \dots, d$) is an indicator vector whose components are all 0 except for a 1 in the $i$-th position. Based on this definition, the task of feature selection in optimization problem (13) is to seek the feature subset $F_I^* = \{f_{I^*(1)}, f_{I^*(2)}, \dots, f_{I^*(m)}\} \subseteq F$ with the minimum total score. Recalling the definition of vector $w_i$, the reconstruction score of feature $f_i$ equals the $(i,i)$-th diagonal element of matrix $X_V$, denoted by $(X_V)_{i,i}$ ($i = 1, 2, \dots, d$). As a result, the globally optimal feature subset $F_I^*$ is given by

$$F_I^* = \arg\min_{F_I \subseteq F, |F_I| = m} \sum_{f_j \in F_I} \operatorname{score}(f_j) \qquad (17)$$
$$= \arg\min_{F_I \subseteq F, |F_I| = m} \sum_{j=1}^{m} (X_V)_{I(j), I(j)}. \qquad (18)$$

This implies that the global optimum $W_I^*$ of problem (13) can be obtained efficiently by locating the $m$ smallest diagonal elements of matrix $X_V$.
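A minimal sketch of this $W_I$ update follows. It computes the residuals $x_i - X_i v^{(i)}$ directly, so that each per-feature score equals the corresponding diagonal entry of $X_V$, and returns the indices of the $m$ smallest scores. The array layout (columns as data points) and variable names are assumptions, not the authors' implementation.

```python
import numpy as np

def select_features(X, nbr_idx, V_local, m):
    """Update of W_I: pick the m features with the smallest reconstruction scores.

    X        : (d, n) data matrix, columns are data points
    nbr_idx  : (n, K) indices of the K nearest neighbors of each point
    V_local  : (n, K) nonnegative reconstruction weights v^{(i)} (rows sum to 1)
    m        : number of features to keep
    Returns the indices I of the selected features (columns of W_I).
    """
    # Residuals r_i = x_i - X_i v^{(i)}; then X_V = R R^T and score(f_j) = (X_V)_{jj}.
    R = np.stack([X[:, i] - X[:, nbr_idx[i]] @ V_local[i]
                  for i in range(X.shape[1])], axis=1)   # (d, n) residual matrix
    scores = np.einsum('ij,ij->i', R, R)                 # diagonal of R R^T
    return np.argsort(scores)[:m]                        # m smallest scores
```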
C. Optimize V

When variables $W_I$ and $Q$ are fixed, optimization problem (10) turns into the following problem with respect to the sparse weight matrix $V$:

$$\min_{V} \; \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \lambda \operatorname{Tr}(Q^\top L_V Q) \qquad (19)$$
$$\text{s.t.} \quad v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1 \; (\forall i).$$

To solve this problem, we first rearrange the second term of objective function (19) as

$$\operatorname{Tr}(Q^\top L_V Q) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \| q_i - q_j \|_2^2 \, v_{ij} = \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K} \| q_i - q_{i(k)} \|_2^2 \, v_{i(k)}.$$

Due to the independence between different data points, we solve optimization problem (19) with respect to each variable $v^{(i)}$ ($i = 1, 2, \dots, n$) individually via the following problem:

$$\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \; \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \frac{\lambda}{2} \sum_{k=1}^{K} \| q_i - q_{i(k)} \|_2^2 \, v_{i(k)}.$$

Let the objective function of this optimization problem be $g(v^{(i)})$, and define $p_i = [p_{i(1)}, p_{i(2)}, \dots, p_{i(K)}]^\top \in \mathbb{R}^K$ with $p_{i(k)} = \| q_i - q_{i(k)} \|_2^2$ for $k = 1, 2, \dots, K$. Then $\sum_{k=1}^{K} \| q_i - q_{i(k)} \|_2^2 \, v_{i(k)} = v^{(i)\top} p_i$ holds for $i = 1, 2, \dots, n$. Denoting the terms of $g(v^{(i)})$ that are independent of $v^{(i)}$ by $H$, we rearrange the objective function $g(v^{(i)})$ as follows:

$$g(v^{(i)}) = \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2 + \frac{\lambda}{2} v^{(i)\top} p_i$$
$$= v^{(i)\top} X_i^\top W_I W_I^\top X_i v^{(i)} - 2 v^{(i)\top} X_i^\top W_I W_I^\top x_i + \frac{\lambda}{2} v^{(i)\top} p_i + H$$
$$= v^{(i)\top} A v^{(i)} - \lambda v^{(i)\top} b + H,$$

where $A = X_i^\top W_I W_I^\top X_i \in \mathbb{R}^{K \times K}$ and $b = \frac{2}{\lambda} X_i^\top W_I W_I^\top x_i - \frac{1}{2} p_i \in \mathbb{R}^K$.
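For concreteness, here is a small sketch assembling $p_i$, $A$ and $b$ for one data point under the definitions above; the variable names and shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def subproblem_data(x_i, X_i, q_i, Q_i, W_I, lam):
    """Quantities of the per-point problem for one data point.

    x_i : (d,)   the data point          X_i : (d, K) its K nearest neighbors
    q_i : (c,)   embedding of x_i        Q_i : (K, c) embeddings of the neighbors
    W_I : (d, m) selective matrix        lam : regularization parameter lambda
    Returns A (K x K) and b (K,) such that  phi_i(v) = v^T A v - lam * v^T b.
    """
    p_i = np.sum((q_i[None, :] - Q_i) ** 2, axis=1)      # p_{i(k)} = ||q_i - q_{i(k)}||^2
    Z = W_I.T @ X_i                                      # (m, K) projected neighbors
    A = Z.T @ Z                                          # X_i^T W_I W_I^T X_i
    b = (2.0 / lam) * (Z.T @ (W_I.T @ x_i)) - 0.5 * p_i  # as defined in the text
    return A, b
```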

As a result, optimization problem (19) reduces to the following $n$ sub-problems of a simple form:

$$\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \; \varphi_i(v^{(i)}) = v^{(i)\top} A v^{(i)} - \lambda v^{(i)\top} b, \qquad (20)$$

for $i = 1, 2, \dots, n$. To solve this optimization problem, we exploit an Accelerated Projected Gradient (APG) algorithm for its simplicity and efficiency [63]. Based on the APG method, we introduce an auxiliary variable $z$ to convert optimization problem (20) into an easier one, while making the auxiliary variable approximate and converge to the solution $v^{(i)}$ during the iterative optimization. Specifically, in the $t$-th iteration, let the second-order Taylor expansion of the objective function $\varphi_i(\cdot)$ around an auxiliary variable $z^t$ be

$$\varphi_i^L(v^{(i)}) = \varphi_i(z^t) + \big\langle \nabla \varphi_i(z^t), v^{(i)} - z^t \big\rangle + \frac{L}{2} \big\| v^{(i)} - z^t \big\|_2^2,$$

where $L > 0$ is a constant. We update $v^{(i)}$ by solving the easier optimization problem

$$\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \varphi_i^L(v^{(i)}), \qquad (21)$$

which is equivalent to a Euclidean projection problem onto the simplex, i.e.,

$$v^{(i)}_{t+1} = \arg\min_{v^{(i)} \ge 0, \; v^{(i)\top} \mathbf{1} = 1} \big\| v^{(i)} - h_i^t \big\|_2^2, \qquad (22)$$

where $h_i^t = z^t - \frac{1}{L} \nabla \varphi_i(z^t) = z^t - \frac{1}{L} \big( 2 A z^t - \lambda b \big) \in \mathbb{R}^K$. Note that several efficient algorithms for the Euclidean projection onto the simplex have been developed in the past decades; interested readers may refer to [64], [65], [66], [67] for more details. In this paper, we solve this problem with the fast algorithm proposed in [64]. To make the auxiliary variable approximate $v^{(i)}$, we update $z^t$ according to

$$z^{t+1} = v^{(i)}_{t+1} + \frac{\tau_t - 1}{\tau_{t+1}} \big( v^{(i)}_{t+1} - v^{(i)}_t \big), \qquad (23)$$

where the acceleration coefficient $\tau$ is updated through

$$\tau_{t+1} = \frac{1 + \sqrt{1 + 4\tau_t^2}}{2}. \qquad (24)$$

We summarize the APG algorithm for optimization problem (20) in Algorithm 1, where the convergence criterion is that the relative change of $\| v^{(i)} \|_2$ is small enough. It has been pointed out in [63] that the APG algorithm converges both theoretically and experimentally. Finally, we present the overall alternative optimization algorithm for the proposed unsupervised feature selection in Algorithm 2, where the initial weight matrix $V^0$ is calculated by solving the LLE problem (1).

Algorithm 1: APG algorithm for optimization problem (20).
Input: $X$, $W_I$, $Q$. Output: $v^{(i)}$ ($i = 1, 2, \dots, n$). Initialize: $v^{(i)}_0 = z^0$, $\tau_0 = 1$.
1: while not converged do
2:   Update $v^{(i)}_{t+1}$ ($\forall i$) by solving problem (22);
3:   Update $\tau_{t+1} = \frac{1}{2}\big(1 + \sqrt{1 + 4\tau_t^2}\big)$;
4:   Update $z^{t+1} = v^{(i)}_{t+1} + \frac{\tau_t - 1}{\tau_{t+1}} \big( v^{(i)}_{t+1} - v^{(i)}_t \big)$;
5: end while

Algorithm 2: Unsupervised feature selection with adaptive neighbors (ANFS).
Input: $X$, $m$, $c$, $K$, $\lambda$. Output: $F_I$. Initialize: $t = 0$; $V^0$.
1: while not converged do
2:   Update $Q^{t+1}$ as the $c$ eigenvectors of the current Laplacian matrix $L_{V^t}$ corresponding to its $c$ smallest eigenvalues;
3:   Update $W_I^{t+1}$ by locating the $m$ smallest diagonal elements of matrix $X_{V^t} = X (I_n - V^t)^\top (I_n - V^t) X^\top$;
4:   Update $V^{t+1}$ by the APG algorithm (Algorithm 1);
5:   $t = t + 1$;
6: end while
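The following is a minimal sketch of Algorithm 1 for one sub-problem (20), reusing the quantities A and b from the previous sketch and using a standard sort-based Euclidean projection onto the probability simplex in the spirit of [64], [65]. The step size corresponds to $1/L$ and must be supplied by the caller; this is an illustrative implementation under these assumptions, not the authors' code.

```python
import numpy as np

def project_simplex(h):
    """Euclidean projection of h onto {v : v >= 0, sum(v) = 1} (sort-based)."""
    u = np.sort(h)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(h) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(h - theta, 0.0)

def apg_update_v(A, b, lam, step, n_iter=100, tol=1e-6):
    """APG (Algorithm 1) for one sub-problem: min_v  v^T A v - lam * v^T b  on the simplex."""
    K = A.shape[0]
    v = np.full(K, 1.0 / K)                   # feasible starting point
    z, tau = v.copy(), 1.0
    for _ in range(n_iter):
        grad = 2.0 * A @ z - lam * b                             # gradient of phi_i at z
        v_new = project_simplex(z - step * grad)                 # Eq. (22), step = 1/L
        tau_new = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0    # Eq. (24)
        z = v_new + (tau - 1.0) / tau_new * (v_new - v)          # Eq. (23)
        if np.linalg.norm(v_new - v) < tol * max(np.linalg.norm(v), 1e-12):
            v = v_new
            break
        v, tau = v_new, tau_new
    return v
```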
V. THEORETICAL ANALYSIS ON CONVERGENCE AND COMPLEXITY

To illustrate the efficiency of the proposed Algorithm 2, in this section we theoretically analyze its convergence and computational complexity.

A. Convergence Analysis

The convergence of Algorithm 2 is established by the following theorem.

Theorem 2. The alternate updating rules in Algorithm 2 monotonically decrease the objective function value of optimization problem (10) in each iteration until convergence.

Proof. In each iteration, we obtain the globally optimal selective matrix $W_I^{t+1}$ by solving $W_I^{t+1} = \arg\min_{W_I} \sum_{i=1}^{n} \| W_I^\top x_i - W_I^\top X_i v^{(i)}_t \|_2^2$, where $v^{(i)}_t$ ($i = 1, 2, \dots, n$) are the current weight vectors. As a result, we have the inequality

$$\sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_t \big\|_2^2 \le \sum_{i=1}^{n} \big\| (W_I^{t})^\top x_i - (W_I^{t})^\top X_i v^{(i)}_t \big\|_2^2. \qquad (25)$$

Since the variable $Q^{t+1}$ is updated by solving $Q^{t+1} = \arg\min_{Q^\top Q = I_c} \operatorname{Tr}(Q^\top L_{V^t} Q)$ with $V^t$ fixed, we obtain the inequality

$$\operatorname{Tr}\big( (Q^{t+1})^\top L_{V^t} Q^{t+1} \big) \le \operatorname{Tr}\big( (Q^{t})^\top L_{V^t} Q^{t} \big). \qquad (26)$$

The update of the weight matrix $V^{t+1} \in \mathbb{R}^{n \times n}$ can be divided into $n$ independent sub-problems with respect to the $K$-dimensional vectors $v^{(i)}$ ($i = 1, 2, \dots, n$), each of which is solved by the APG algorithm (Algorithm 1); the APG algorithm converges quickly [68], [63]. Consequently, we have the inequality

$$\sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t+1} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t+1}} Q^{t+1} \big)$$
$$\le \sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t}} Q^{t+1} \big). \qquad (27)$$

Based on inequalities (25), (26) and (27), we arrive at

$$\sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t+1} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t+1}} Q^{t+1} \big)$$
$$\le \sum_{i=1}^{n} \big\| (W_I^{t+1})^\top x_i - (W_I^{t+1})^\top X_i v^{(i)}_{t} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t+1})^\top L_{V^{t}} Q^{t+1} \big)$$
$$\le \sum_{i=1}^{n} \big\| (W_I^{t})^\top x_i - (W_I^{t})^\top X_i v^{(i)}_{t} \big\|_2^2 + \lambda \operatorname{Tr}\big( (Q^{t})^\top L_{V^{t}} Q^{t} \big).$$

As a result, the objective function value of optimization problem (10) decreases monotonically in each iteration until the algorithm converges. This completes the proof.

B. Complexity Analysis

It is noteworthy that the proposed Algorithm 2 consists of three updating rules, each of which is computationally efficient. For the update of variable $Q$, the main cost lies in calculating the $c$ eigenvectors of the Laplacian matrix $L_V \in \mathbb{R}^{n \times n}$, with complexity $O(n^2 c)$. In fact, due to the $K$-sparsity of $L_V$, i.e., each column of $L_V$ has only $K \ll n$ nonzero elements, the complexity reduces to $O(Kcn)$ in practice. For the update of variable $W_I$, we exploit the $K$-sparsity of the weight matrix $V$ and calculate the diagonal elements of matrix $X_V \in \mathbb{R}^{d \times d}$ with complexity $O(Kdn)$; moreover, to obtain the optimal feature subset of size $m$, we seek the $m$ smallest diagonal elements with complexity $O(m \log d)$. As a result, the computational complexity of updating $W_I$ is $O(Kdn + m \log d) \approx O(Kdn)$, since $\log d \ll d$ and $m \ll n$ hold in practice. To update the $K$-sparse weight vectors $v^{(i)}$ with the APG algorithm (Algorithm 1), the main cost comes from the calculation of $A$ and $b$ with complexity $O(Kdm)$, as well as from solving a Euclidean projection problem onto the simplex with complexity $O(K \log K)$ [64]. Thus, the computational complexity of updating $V$ is $O((Kdm + K \log K) n t) \approx O(Kdmnt)$, since $K \log K \ll Kdm$ holds in practice, where $t$ denotes the number of iterations of the APG algorithm. Therefore, the complexity of each iteration of Algorithm 2 is $O(Kcn + Kdn + Kdmnt) \approx O(Kdmnt)$, since $c \ll d$ in practice. Considering the fast convergence of the APG algorithm, the proposed alternative Algorithm 2 performs efficiently. Note that the computational cost of the proposed algorithm is linear in the number of data points $n$, and thus our algorithm is capable of handling large-scale data.

VI. EXPERIMENTAL ANALYSIS

For a fair comparison, we follow previous work on unsupervised feature selection and evaluate the performance of our proposed ANFS for clustering on several benchmark data sets, covering face recognition, human motion recognition and handwritten digit recognition.

A. Experimental Setup

We conduct experiments on eight benchmark data sets, including three face datasets (the Japanese Female Facial Expression database (JAFFE) [69], the UMIST face data set [70] and the Pointing4 dataset [71]), one motion dataset (the HumanEVA data set [72]), one object recognition dataset (the Columbia Object Image Library-20 (COIL-20) data set [73]), two handwritten digit datasets (the USPS dataset and the MNIST dataset [74]) and one text dataset (WebKB, collected by the University of Texas [75]). All of the image data sets were downloaded from different web sites for a comprehensive evaluation. In the experiments, we normalize each initial feature into [0, 1]. The details of these data sets are summarized in Table I.

TABLE I: Datasets description.
Data set     # of samples   # of features   # of classes
JAFFE
UMIST
HumanEVA     10,
COIL20       1,440          1,
USPS         9,
MNIST        5,
Pointing4    2,790          1,
WebKB        814            4,029           7

Note that none of the datasets used in this paper have official splits. Hence, we employ 5-fold cross-validation to split the datasets.
To be more specific, we partition each dataset into five subsets. We then select one subset for testing and use the remaining subsets for learning the optimal parameters. With the selected features, we independently repeat k-means clustering 50 times and report the average results together with the variance. Additionally, c is set to the true number of classes for each dataset, and the neighborhood size K is empirically set to 5 for all the datasets.

B. Compared Algorithms

To illustrate the effectiveness of our proposed ANFS for feature selection, we compare it with one baseline and several unsupervised feature selection methods. The compared algorithms are enumerated as follows:
1) All-Fea: All original features are preserved, as the baseline in the experiments.
2) Max Variance (MaxVar): Features are ranked according to the variance magnitude of each feature in descending order, followed by selection of the highest-ranked features.
3) Laplacian Score (LS) [2]: Features most consistent with the Gaussian Laplacian matrix are selected to best preserve the local manifold structure.
4) Multi-Cluster Feature Selection (MCFS) [43]: This unsupervised approach selects the features that best preserve the multi-cluster structure of the data. Features are selected using spectral regression with $\ell_1$-norm regularization.
5) Unsupervised Discriminative Feature Selection (UDFS) [34]: Under the assumption that the class label of input data can be predicted by a linear classifier, discriminative analysis and $\ell_{2,1}$-norm minimization are incorporated into a joint framework for unsupervised feature selection.
6) Nonnegative Discriminative Feature Selection (NDFS) [76]: NDFS exploits local discriminative information and feature correlations simultaneously. Besides, the manifold structure information is also considered jointly.
7) Unsupervised Feature Selection Using Trace Ratio Criterion (TRACK) [77]: This unsupervised feature selection algorithm integrates an unsupervised trace ratio formulation with structured sparsity-inducing norm regularization.

8) Unsupervised Feature Analysis with Class Margin Optimization (MMC) [53]: MMC integrates the Maximum Margin Criterion with a sparsity-based model into a joint framework, where the class margin and feature correlation are taken into account at the same time.

Following related studies on clustering, we leverage two popular evaluation metrics, namely clustering accuracy (ACC) and normalized mutual information (NMI), to measure the clustering performance after the various unsupervised feature selection methods.

C. Results and Analysis

In this section, we empirically evaluate the performance of the nine compared unsupervised feature selection algorithms in terms of ACC and NMI. Additionally, to demonstrate the effectiveness of the structure regularization on the reconstruction graph, we also set $\lambda = 0$ in the proposed model (8) and learn the optimal feature subset by solving

$$\min_{W_I} \; L(W_I, V) = \sum_{i=1}^{n} \big\| W_I^\top x_i - W_I^\top X_i v^{(i)} \big\|_2^2,$$

where the fixed weight vectors $v^{(i)}$ ($i = 1, 2, \dots, n$) are learned by LLE. To illustrate the influence of different initializations of the local reconstruction coefficients $V$ on the performance of the proposed algorithm, we also report the experimental results of ANFS with random initialization of $V$, namely ANFS (RI). The results under the two evaluation metrics are shown in Table II and Table III, respectively. From the experimental results, we have the following observations:
1) Compared with the baseline method (All-Fea), feature selection is able to enhance performance by removing noisy and redundant information. Besides, it can also make the subsequent processing more efficient by selecting a subset of the original features, which is crucial for high-dimensional data processing.
2) Better performance is expected when the local geometric structure of the data distribution is explored; for example, LS, MCFS, UDFS and NDFS generally perform much better than the baseline method.
3) MMC achieves the second-best performance in terms of both ACC and NMI on almost all datasets by jointly optimizing the maximum margin criterion and the sparsity-based model. TRACK achieves further improvement by harnessing the discriminant power of the trace ratio criterion to select discriminative features.
4) ANFS with $\lambda = 0$ performs worse than ANFS and ANFS (RI). This observation, on the one hand, illustrates the necessity of the structure regularization; on the other hand, it shows the effectiveness of the adaptive neighbor assignment.
5) ANFS and ANFS (RI) consistently outperform the other alternatives on the eight benchmark datasets. Hence, it is beneficial to incorporate the structure learning of the reconstruction graph into the procedure of adaptive neighbor assignment.
6) ANFS with LLE initialization of $V$ and ANFS with random initialization of $V$ perform comparably on the eight benchmark datasets. Consequently, the proposed algorithm can always achieve good local optima with different initializations.

To ensure that our improvement is statistically significant, we report non-parametric pairwise Wilcoxon tests in Tables IV and V. We compute the p-value of ANFS against each other method at a fixed significance level $\alpha$. From the results shown in Tables IV and V, we can see that our method achieves statistically significant improvements.
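As a sketch of the evaluation protocol described above, the snippet below runs k-means on the selected features, computes ACC via an optimal one-to-one matching of cluster labels to classes (Hungarian algorithm) and NMI, and indicates how a pairwise Wilcoxon signed-rank test can be applied to per-run scores. The use of scikit-learn and SciPy here is an assumption (the paper does not specify its tooling), and the data layout (rows as samples) is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment
from scipy.stats import wilcoxon

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: best one-to-one matching of cluster labels to classes."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == k) & (y_true == c))
    row, col = linear_sum_assignment(cost)   # Hungarian algorithm on negated counts
    return -cost[row, col].sum() / len(y_true)

def evaluate(X_sel, y_true, c, runs=50, seed=0):
    """Repeat k-means on the selected features and report mean ACC / NMI."""
    accs, nmis = [], []
    for r in range(runs):
        y_pred = KMeans(n_clusters=c, n_init=10, random_state=seed + r).fit_predict(X_sel)
        accs.append(clustering_acc(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred))
    return np.mean(accs), np.mean(nmis)

# Paired significance test between two methods' per-run ACC scores, e.g.:
# stat, p_value = wilcoxon(acc_runs_anfs, acc_runs_baseline)
```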
To verify the efficiency of the proposed model, we also report in Table VI the running time of the different feature selection algorithms on the different datasets. Since the filter-based MaxVar and LS algorithms ignore the possible correlation between different features and separate the evaluation of features from the clustering task, they take less time than the other algorithms. However, this is also the reason why they achieve poorer performance than the other algorithms (see Table II and Table III for details). In contrast to the other wrapper and embedded algorithms, our model achieves a significant improvement in speed, where the adoption of the selective matrix as well as the update of $V$ in $\mathbb{R}^K$ (rather than in $\mathbb{R}^n$) play important roles.

D. Parameter Sensitivity

In contrast to other unsupervised feature selection algorithms, our proposed ANFS requires only two parameters to be set in advance, namely $\lambda$ and the number of selected features. In the previous experiments, we estimate the best parameters using 5-fold cross-validation. To illustrate the influence of the parameters on the performance of feature selection, we conduct experiments to analyze the parameter sensitivity in terms of clustering accuracy. Experimental results on the eight benchmark datasets are shown in Figure 1. The regularization parameter $\lambda$ is tuned over $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. Empirically, we tune the number of selected features from half of the total number of features to the full feature size. From the results we can see that the performance changes differently with respect to different parameters on the various datasets; how to identify the optimal parameter values is data dependent. This demonstrates the need to use cross-validation to select the best parameters for each dataset. We have similar observations in terms of normalized mutual information.

E. Convergence Study

To solve the objective function, we have developed an efficient iterative algorithm. In the previous section, we proved the convergence of the proposed algorithm; in this section, we experimentally study its speed of convergence. We set the regularization parameter $\lambda$ to 1, which is a median value in the tuned range. The convergence curves on the eight benchmark datasets are shown in Figure 2.

TABLE II: Performance comparison in terms of clustering accuracy (ACC) ± std% for different selection methods over different data sets. Performance is reported in percentages.
Dataset     JAFFE  Umist  HumanEVA  Coil20  USPS  MNIST  Pointing4  WebKB
All-Fea      84.5 ± ± ± ± ± ± ± ± 1.4
MaxVar       81.6 ± ± ± ± ± ± ± ± 1.3
LS           86.3 ± ± ± ± ± ± ± ± 1.5
MCFS         86.9 ± ± ± ± ± ± ± ± 1.4
UDFS         88.1 ± ± ± ± ± ± ± ± 1.7
NDFS         88.9 ± ± ± ± ± ± ± ± 1.4
TRACK        89.8 ± ± ± ± ± ± ± ± 1.4
MMC          91.3 ± ± ± ± ± ± ± ± 1.6
ANFS(λ = 0)  90.8 ± ± ± ± ± ± ± ± 1.7
ANFS(RI)     93.3 ± ± ± ± ± ± ± ± 1.8
ANFS         93.5 ± ± ± ± ± ± ± ± 1.5

TABLE III: Performance comparison in terms of normalized mutual information (NMI) ± std% for different selection methods over different data sets. Performance is reported in percentages.
Dataset     JAFFE  Umist  HumanEVA  Coil20  USPS  MNIST  Pointing4  WebKB
All-Fea      90.3 ± ± ± ± ± ± ± ± 3.3
MaxVar       87.9 ± ± ± ± ± ± ± ± 2.8
LS           91.8 ± ± ± ± ± ± ± ± 3.4
MCFS         93.1 ± ± ± ± ± ± ± ± 3.1
UDFS         93.9 ± ± ± ± ± ± ± ± 3.5
NDFS         94.8 ± ± ± ± ± ± ± ± 2.9
TRACK        95.9 ± ± ± ± ± ± ± ± 1.9
MMC          97.1 ± ± ± ± ± ± ± ± 1.8
ANFS(λ = 0)  96.7 ± ± ± ± ± ± ± ± 2.0
ANFS(RI)     97.8 ± ± ± ± ± ± ± ± 1.9
ANFS         98.5 ± ± ± ± ± ± ± ± 1.7

TABLE IV: p-value of pairwise Wilcoxon test of ANFS against the other algorithms in terms of clustering accuracy.
Data Set   All-Fea  MaxVar  LS     MCFS   UDFS   NDFS   TRACK  MMC
JAFFE      3e-11    5e-11   2e-11  3e-11  5e-11  2e-11  4e-11  2e-11
UMIST      2e-11    4e-11   3e-11  2e-11  5e-11  3e-11  2e-11  4e-11
HumanEVA   2e-11    5e-11   3e-11  5e-11  2e-11  5e-11  4e-11  5e-11
Coil20     3e-11    5e-11   3e-11  2e-11  4e-11  5e-11  3e-11  2e-11
USPS       4e-11    3e-11   2e-11  3e-11  5e-11  3e-11  4e-11  2e-11
MNIST      5e-11    3e-11   6e-11  4e-11  3e-11  5e-11  4e-11  3e-11
Pointing4  3e-11    6e-11   4e-11  2e-11  5e-11  3e-11  4e-11  3e-11
WebKB      5e-11    3e-11   2e-11  5e-11  4e-11  5e-11  3e-11  2e-11

TABLE V: p-value of pairwise Wilcoxon test of ANFS against the other algorithms in terms of normalized mutual information.
Data Set   All-Fea  MaxVar  LS     MCFS   UDFS   NDFS   TRACK  MMC
JAFFE      3e-11    5e-11   4e-11  3e-11  3e-11  5e-11  3e-11  4e-11
UMIST      5e-11    3e-11   6e-11  3e-11  5e-11  6e-11  4e-11  3e-11
HumanEVA   4e-11    5e-11   3e-11  2e-11  6e-11  4e-11  3e-11  2e-11
Coil20     5e-11    4e-11   4e-11  3e-11  5e-11  5e-11  3e-11  5e-11
USPS       6e-11    3e-11   4e-11  3e-11  6e-11  3e-11  4e-11  6e-11
MNIST      4e-11    5e-11   2e-11  5e-11  5e-11  3e-11  4e-11  5e-11
Pointing4  7e-11    3e-11   4e-11  5e-11  3e-11  4e-11  5e-11  4e-11
WebKB      3e-11    5e-11   6e-11  4e-11  4e-11  3e-11  5e-11  4e-11

TABLE VI: Speed test for different feature selection algorithms over different data sets. Performance is reported in seconds.
Dataset  JAFFE  Umist  HumanEVA  Coil20  USPS  MNIST  Pointing4  WebKB
MaxVar   ± ± ± ± ± ± ± ±
LS       ± ± ± ± ± ± ± ±
MCFS     ± ± ± ± ± ± ± ±
UDFS     ± ± ± ± ± ± ± ±
NDFS     ± ± ± ± ± ± ± ±
TRACK    ± ± ± ± ± ± ± ±
MMC      ± ± ± ± ± ± ± ±
ANFS     ± ± ± ± ± ± ± ± 1.048

Fig. 1: Performance variations of the proposed method with respect to different values of λ and the number of selected features on (a) JAFFE, (b) UMIST, (c) HumanEVA, (d) Coil20, (e) USPS, (f) MNIST, (g) Pointing4 and (h) WebKB.

Fig. 2: Convergence curves of the proposed algorithm on the same eight datasets.

We can see that the proposed algorithm converges within 20 iterations for all the datasets, validating that the proposed algorithm is effective and converges quickly.

VII. CONCLUSIONS

In this paper, we propose a novel unsupervised feature selection model from a new perspective by learning the optimal feature subset and the adaptive reconstruction graph simultaneously. The LLE-based reconstruction graph is estimated to characterize the intrinsic local geometric properties, subject to a structure regularization; in such a way, the number of connected components in the reconstruction graph equals the number of clusters in the clustering task. Additionally, we introduce a selective matrix to specify the selected features without norm-based regularization, which eliminates the difficulties in tuning the trade-off parameter of the regularization. An efficient alternative optimization algorithm is exploited to address the proposed problem, together with theoretical analyses of its convergence and computational complexity. Extensive experimental results on several benchmark data sets demonstrate the effectiveness and superiority of the proposed approach. In this paper, we focus on the scenario in which the data points have a single view and serve a single task. In future work, it would be interesting to extend the proposed method to unsupervised feature selection for multi-view and multi-task data [78] and to other applications, e.g., co-saliency detection [79], [80].

REFERENCES

[1] M. Ben-Bassat, "Pattern recognition and reduction of dimensionality," Handbook of Statistics, vol. 2, 1982.

[2] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in NIPS.
[3] X. Chang, F. Nie, Y. Yang, C. Zhang, and H. Huang, "Convex sparse PCA for unsupervised feature learning," TKDD, vol. 11, no. 1, pp. 3:1-3:16.
[4] X. Chang, F. Nie, S. Wang, Y. Yang, X. Zhou, and C. Zhang, "Compound rank-k projections for bilinear analysis," IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 7.
[5] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Springer Science & Business Media, 2012.
[6] V. D. Silva and J. B. Tenenbaum, "Global versus local methods in nonlinear dimensionality reduction," in Advances in Neural Information Processing Systems, 2002.
[7] M. H. Law and A. K. Jain, "Incremental nonlinear dimensionality reduction by manifold learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 3.
[8] S. Li and D. Wei, "Extremely high-dimensional feature selection via feature generating samplings," IEEE Trans. Cybern., vol. 44, no. 6.
[9] X. Chang and Y. Yang, "Semi-supervised feature analysis by mining correlations among multiple tasks," IEEE Transactions on Neural Networks and Learning Systems.
[10] C. Freeman, D. Kulic, and O. Basir, "Feature-selected tree-based classification," IEEE Trans. Cybern., vol. 43, no. 6.
[11] E. Romero and J. M. Sopena, "Performing feature selection with multilayer perceptrons," IEEE Trans. Neural Netw., vol. 19, no. 3.
[12] S. Guérif, "Unsupervised variable selection: when random rankings sound as irrelevancy," in JMLR W&P.
[13] X. Chang, F. Nie, Y. Yang, and H. Huang, "A convex formulation for semi-supervised multi-label feature selection," in AAAI.
[14] J. M. Pena and R. Nilsson, "On the complexity of discrete feature selection for optimal classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8.
[15] Z. Zhao and H. Liu, "Spectral feature selection for supervised and unsupervised learning," in Proceedings of the 24th International Conference on Machine Learning, 2007.
[16] Y. Han, Y. Yang, Y. Yan, Z. Ma, N. Sebe, and X. Zhou, "Semi-supervised feature selection via spline regression for video semantic recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2.
[17] Z. Zhao and H. Liu, "Semi-supervised feature selection via spectral analysis," in SDM. SIAM, 2007.
[18] L. Wolf and A. Shashua, "Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach," J. Mach. Learn. Res., vol. 6.
[19] X. Zhu, X. Li, S. Zhang, C. Ju, and X. Wu, "Robust joint graph sparse coding for unsupervised spectral feature selection," IEEE Trans. Neural Netw. Learn. Syst., to appear.
[20] Q. Cheng, H. Zhou, and J. Cheng, "The Fisher-Markov selector: fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 6.
[21] J. G. Dy and C. E. Brodley, "Feature selection for unsupervised learning," J. Mach. Learn. Res., vol. 5.
[22] T. Hancock and H. Mamitsuka, "Boosted network classifiers for local feature selection," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 11.
[23] Y. Li, J. Si, G. Zhou, S. Huang, and S. Chen, "FREL: A stable feature selection algorithm," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 7.
[24] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3.
[25] L. Laporte, R. Flamary, S. Canu, S. Déjean, and J. Mothe, "Nonconvex regularizations for feature selection in ranking with sparse SVM," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 6.
[26] D. Chakraborty and N. R. Pal, "Selecting useful groups of features in a connectionist framework," IEEE Trans. Neural Netw., vol. 19, no. 3.
[27] W. Yang, Y. Gao, Y. Shi, and L. Cao, "MRM-Lasso: A sparse multiview feature selection method via low-rank analysis," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 11.
[28] H.-J. Lai, Y. Pan, Y. Tang, and R. Yu, "FSMRank: Feature selection algorithm for learning to rank," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons.
[30] X. He and X. Niyogi, "Locality preserving projections," in Neural Information Processing Systems, vol. 16, 2004.
[31] Z. Zhao, L. Wang, H. Liu, and J. Ye, "On similarity preserving feature selection," IEEE Trans. Knowl. Data Eng., vol. 25, no. 3.
[32] L. Du and Y.-D. Shen, "Unsupervised feature selection with adaptive structure learning," in ACM SIGKDD, 2015.
[33] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in NIPS.
[34] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, "ℓ2,1-norm regularized discriminative feature selection for unsupervised learning," in IJCAI.
[35] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500.
[36] Y. Fu and T. S. Huang, "Locally linear embedded eigenspace analysis," IFP-TR, Univ. of Illinois at Urbana-Champaign, Tech. Rep.
[37] L. K. Saul and S. T. Roweis, "Think globally, fit locally: unsupervised learning of low dimensional manifolds," J. Mach. Learn. Res., vol. 4.
[38] S. Xiang, F. Nie, C. Zhang, and C. Zhang, "Nonlinear dimensionality reduction with local spline embedding," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9.
[39] F. Nie, X. Wang, and H. Huang, "Clustering and projected clustering with adaptive neighbors," in ACM SIGKDD.
[40] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, On the Surprising Behavior of Distance Metrics in High Dimensional Space. Springer.
[41] C. Sammut and G. I. Webb, Encyclopedia of Machine Learning. Springer Science & Business Media.
[42] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," ACM SIGKDD, vol. 6, no. 1.
[43] D. Cai, C. Zhang, and X. He, "Unsupervised feature selection for multi-cluster data," in ACM SIGKDD.
[44] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, no. 1.
[45] H. Liu, X. Wu, and S. Zhang, "Feature selection using hierarchical feature clustering," in ACM CIKM.
[46] Z. Li, J. Liu, Y. Yang, X. Zhou, and H. Lu, "Clustering-guided sparse structural learning for unsupervised feature selection," IEEE Trans. Knowl. Data Eng., vol. 26, no. 9.
[47] M. Qian and C. Zhai, "Robust unsupervised feature selection," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013.
[48] S. Wang, J. Tang, and H. Liu, "Embedded unsupervised feature selection," in AAAI.
[49] M. Masaeli, J. G. Dy, and G. M. Fung, "From transformation-based dimensionality reduction to feature selection," in ICML.
[50] S. Tabakhi, P. Moradi, and F. Akhlaghian, "An unsupervised feature selection algorithm based on ant colony optimization," Engineering Applications of Artificial Intelligence, vol. 32.
[51] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, no. 1-3.
[52] S. Maldonado and R. Weber, "A wrapper method for feature selection using support vector machines," Inf. Sci., vol. 179, no. 13.
[53] S. Wang, F. Nie, X. Chang, L. Yao, X. Li, and Q. Z. Sheng, "Unsupervised feature analysis with class margin optimization," in Machine Learning and Knowledge Discovery in Databases. Springer, 2015.
[54] C. Maugis, G. Celeux, and M.-L. Martin-Magniette, "Variable selection for clustering with Gaussian mixture models," Biometrics, vol. 65, no. 3.
[55] Z. Zhao, L. Wang, H. Liu et al., "Efficient spectral feature selection with minimum redundancy," in AAAI.
[56] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," J. Mach. Learn. Res., vol. 3.
[57] X. Zhu, X. Wu, W. Ding, and S. Zhang, "Feature selection by joint graph sparse coding," in ICDM.
[58] H. Zeng and Y.-m. Cheung, "Feature selection and kernel learning for local learning-based clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, 2011.

Feiping Nie received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China. He is currently a Professor with the Center for OPTical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an, China.
His current research interests include machine learning and its application fields, such as pattern recognition, data mining, computer vision, image processing, and information retrieval.

Xiaojun Chang received the Ph.D. degree in computer science from the Centre for Quantum Computation and Intelligent Systems (QCIS), University of Technology Sydney, Australia. He has been working as a Research Associate in the Language Technologies Institute of Carnegie Mellon University. His main research interests include machine learning, data mining, and computer vision.

Yi Yang received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China. He is currently an Associate Professor with the University of Technology Sydney, Australia. He was a Post-Doctoral Researcher with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. His current research interests include machine learning and its applications to multimedia content analysis and computer vision, such as multimedia indexing and retrieval, surveillance video analysis, and video semantics understanding.

Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA. He is currently with the faculty of the Department of Computer Science and the Language Technologies Institute, CMU. His research interests include man-machine communication, natural language processing, speech understanding and synthesis, video analysis, and machine learning.

Minnan Luo received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China. She is currently an Assistant Professor in the School of Electronic and Information Engineering at Xi'an Jiaotong University. She was a Post-Doctoral Researcher with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Her research interests include machine learning and optimization, video analysis, cross-media retrieval, and fuzzy system identification and modeling.

Qinghua Zheng received the B.S. degree in computer software in 1990, the M.S. degree in computer organization and architecture in 1993, and the Ph.D. degree in systems engineering in 1997, all from Xi'an Jiaotong University, China. He was a postdoctoral researcher at Harvard University. He is currently a Professor at Xi'an Jiaotong University and the dean of the Department of Computer Science. His research areas include computer network security, intelligent e-learning theory and algorithms, multimedia e-learning, and trustworthy software.
