An Adaptive Semi-Supervised Feature Analysis for Video Semantic Recognition


Minnan Luo, Xiaojun Chang, Liqiang Nie, Yi Yang, Alexander Hauptmann and Qinghua Zheng

Corresponding author: Xiaojun Chang. This work was funded by the National Science Foundation of China, the National Key Research and Development Program of China, the Data to Decisions Cooperative Research Centre, and the China Postdoctoral Science Foundation. Minnan Luo and Qinghua Zheng are with the SPKLSTN Lab, Department of Computer Science, Xi'an Jiaotong University, Shaanxi, China, {minnluo,qhzheng}@mail.xjtu.edu.cn. Xiaojun Chang and Yi Yang are with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Australia. Liqiang Nie is with the School of Computer Science and Technology, Shandong University, nieliqiang@gmail.com. Alexander Hauptmann is with the School of Computer Science, Carnegie Mellon University, PA, USA, alex@cs.cmu.edu.

Abstract: Video semantic recognition usually suffers from the curse of dimensionality and the absence of enough high-quality labeled instances; thus semi-supervised feature selection has gained increasing attention for its efficiency and comprehensibility. Most of the previous methods assume that videos with close distance (neighbors) have similar labels and characterize the intrinsic local structure through a predetermined graph of both labeled and unlabeled data. However, besides the parameter tuning problem underlying the construction of the graph, the affinity measurement in the original feature space usually suffers from the curse of dimensionality. Additionally, the predetermined graph is separated from the procedure of feature selection, which might lead to downgraded performance for video semantic recognition. In this paper, we exploit a novel semi-supervised feature selection method from a new perspective. The primary assumption underlying our model is that instances with similar labels should have a larger probability of being neighbors. Instead of using a predetermined similarity graph, we incorporate the exploration of the local structure into the procedure of joint feature selection so as to learn the optimal graph simultaneously. Moreover, an adaptive loss function is exploited to measure the label fitness, which significantly enhances the model's robustness to videos with a small or substantial loss. We propose an efficient alternating optimization algorithm to solve the proposed challenging problem, together with analyses of its convergence and computational complexity in theory. Finally, extensive experimental results on benchmark datasets illustrate the effectiveness and superiority of the proposed approach on video semantic recognition related tasks.

Index Terms: Semi-Supervised Learning, Feature Selection, Video Semantic Recognition, Manifold Regularization

I. INTRODUCTION

With the explosion of videos along with the rapid development of the Internet, video semantic recognition has become a fundamental process that allows the creation of several applications [1], such as semantic video indexing and retrieval, object detection and tracking, video event detection and human activity analysis. However, these tasks usually suffer from the curse of dimensionality since the low-level visual features are often high-dimensional and noisy [2], [3].
As a result, it is crucial to exploit efficient dimensionality reduction technology to promote the performance of video semantic recognition. Typically, feature selection techniques reduce the dimensionality by selecting a small subset of the original features, and are thus capable of alleviating the curse of dimensionality while enhancing the comprehensibility of dimensionality reduction [4]. From the perspective of utilizing label information, feature selection algorithms can be roughly categorized into unsupervised, supervised and semi-supervised types. Unsupervised feature selection methods suppose that all of the data are collected without labels. They either involve maximizing clustering performance as quantified by some index [5]-[7], or consider the selection of features based on feature dependency and relevance [8]-[10]. Although unsupervised feature selection methods are capable of handling unlabeled data, they ignore the discriminative information from labels [11], [12]. Instead, supervised feature selection methods rely only on the labeled data and usually evaluate the importance of features based on the correlation between features and the class labels [13], such as Pearson correlation coefficients [14], Fisher score [15] and information gain [16]. Supervised feature selection usually achieves reliable performance with a sufficient number of labeled training data. However, in many practical applications of video semantic recognition, the vast majority of large-scale videos are collected without labels. Annotating these unlabeled videos is dramatically expensive and time-consuming. Therefore, semi-supervised learning for feature selection, which aims to learn a discriminative subset of the original features from both labeled and unlabeled data, is of potentially great practical significance [17]-[19].

In the framework of semi-supervised learning, most of the feature selection methods are based on one of the following two assumptions: (1) the smoothness (cluster) assumption, which assumes that data points are likely to have the same label if they are close to each other [20]; (2) the manifold assumption, which states that the prediction function for feature selection lives in a low-dimensional manifold of the higher-dimensional feature space [21]. Under the smoothness assumption, graph-based semi-supervised feature selection, which inherits excellent properties from spectral graph theory, has attracted great attention. For example, Zhao et al. [22] pioneered filter-based semi-supervised feature selection with particular graph theory and constructed cluster indicators to evaluate the feature relevance one by one. Alalga et al. [23] transformed the labeled part of the data into soft constraints and proposed a unified framework for semi-supervised multi-label feature selection based on a soft-constrained Laplacian score.

Under the manifold assumption, embedded feature selection methods attempt to use the underlying geometric structure of the data and integrate the selection into the training stage of the semi-supervised learning process. For example, Xu et al. [24] took advantage of a semi-supervised SVM with manifold regularization and conducted feature selection by maximizing the classification margin between different classes while simultaneously exploiting the geometry of the marginal distribution. Ma et al. [25] extended ℓ2,1-norm regularized feature selection to select relevant features with manifold fitting on the data structure.

It is noteworthy that the methods mentioned above usually characterize the intrinsic local structure with a predetermined graph Laplacian matrix based on a Gaussian function. However, besides the parameter tuning problem in the construction of the graph [22], the affinity measurement in the original feature space usually suffers from the curse of dimensionality [26]. As a result, the predetermined data affinity matrix, being separated from the procedure of feature selection, might not be the optimal one and may fail to capture the underlying geometric structure accurately [27]. As argued in [28], Laplacian regularization biases the solution towards a constant function and is short of extrapolating power. To address these issues, Shi et al. [28] employed Hessian regularization [29] instead of Laplacian regularization to exploit the intrinsic local geometry of the data manifold, and leveraged the ℓ2,1/2-matrix norm for joint sparse feature selection. However, this method is conducted in a transductive way [30] and thus cannot be used to predict out-of-sample testing data. Han et al. [2] encoded the local geometric structures of the training data as a linear combination of within-class scatter matrices and a spline scatter matrix. However, besides introducing more parameters, both of the mentioned methods separate the characterization of the local structure from the procedure of semi-supervised feature selection.

Note that there are also some semi-supervised feature selection methods which do not consider local structure characterization. For example, Chang et al. [31] associated each training data point with a pre-determined score and developed a convex semi-supervised multi-label feature selection framework without relying on graphs and eigen-decomposition. However, in many practical applications, it is hard to determine the score for each training data point. Xu et al. [11] proposed a max-relevance and min-redundancy criterion based on Pearson's correlation coefficient for semi-supervised feature selection. However, this filter-based method ignores the underlying learning algorithm which is utilized to train classifiers from the labeled data [32]. Ren et al. [33] introduced a search-based framework that randomly selects unlabeled data to form new training sets and iteratively adds the most frequently selected feature to the resulting feature subset. However, this method neglects the interaction among features as well as the interaction between the feature selection heuristics and the corresponding classifier [4].

In this paper, we develop an adaptive framework for semi-supervised feature selection from a new point of view.
Instead of using a predetermined affinity matrix to reflect the underlying local geometric structure of high-dimensional training data, the primary assumption underlying our model is that data points with similar labels should have a larger probability of being neighbors. This strategy is also used in [34] to guide multi-label classification with random forests, where it achieves better performance than using the feature space itself. Under this assumption, we incorporate the exploration of the local structure into the procedure of joint feature selection so as to learn the optimal graph simultaneously. To the best of our knowledge, this is the first work which explicitly learns the optimal graph in the procedure of semi-supervised feature selection. Additionally, since the vast majority of data in the framework of semi-supervised learning is collected without labels, it is significantly important in practice to improve the semi-supervised model's robustness to outliers. Note that most of the previous methods utilize a squared ℓ2-norm loss function to measure the error of the predicted labels, which is sensitive to outliers, since instances with significant loss dominate the objective function. Although an ℓ1-norm based loss function can relieve this issue to some extent, it is in turn sensitive to small losses [35]. For this issue, we leverage a more robust loss function which is capable of overcoming the disadvantages of both the ℓ1-norm and ℓ2-norm based loss functions [35]. Our model yields encouraging experimental results on several benchmark datasets and demonstrates its effectiveness and superiority on video semantic recognition related tasks.

In summary, we highlight the differences from previous methods and describe the main contributions of this paper as follows: 1) We incorporate the exploration of the optimal similarity matrix into the procedure of semi-supervised joint feature selection by adaptively assigning the nearest neighbors of each video. 2) For real-world applications, we exploit an adaptive loss function to measure the label fitness over both labeled and unlabeled data, which enhances the model's robustness to noise and outliers. 3) To solve the proposed non-smooth objective function, we exploit an efficient alternating algorithm and prove its convergence theoretically and experimentally.

The remainder of this paper is organized as follows. In Section II, we give a brief review of the traditional framework for graph-based semi-supervised feature selection. A novel method with adaptive neighbor assignment and adaptive loss measurement is proposed in Section III, together with an efficient algorithm for the proposed challenging problem. The theoretical study of the convergence and computational complexity of the proposed algorithm is presented in Section IV. In Section V, extensive experiments over several benchmark datasets are conducted to verify the effectiveness of the proposed method on video semantic recognition related tasks, including video concept detection and human activity analysis. Finally, conclusions are given in Section VI.

Notations and Terms: We employ the usual notations throughout this paper. For any d-dimensional vector z = [z_1, z_2, ..., z_d]^T in R^d, its ℓ_p-norm is defined as ‖z‖_p = (Σ_{i=1}^d |z_i|^p)^{1/p}. For any matrix A = (a_ij) in R^{p×q}, we denote its i-th row and j-th column by a^i and a_j, respectively. The Frobenius norm of A is defined as ‖A‖_F = (Σ_i Σ_j a_ij^2)^{1/2} = (Tr(AA^T))^{1/2}, where Tr(·) refers to the trace of a matrix. The ℓ_{2,1}-norm of A, which induces a joint sparsity pattern, is formulated as ‖A‖_{2,1} = Σ_{i=1}^p ‖a^i‖_2. Let D = diag(d_1, d_2, ..., d_p) in R^{p×p} be a diagonal matrix with (i,i)-th diagonal element d_i (for all i). Let 0 and 1 be column vectors whose elements are all zero and one, respectively.

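For readers who want to check these definitions numerically, the following minimal numpy sketch computes the Frobenius and ℓ2,1 norms of a small matrix; the array values are arbitrary illustrations:

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 2.0]])

fro = np.sqrt(np.sum(A ** 2))              # ||A||_F = sqrt(sum_ij a_ij^2) = sqrt(Tr(A A^T))
l21 = np.sum(np.linalg.norm(A, axis=1))    # ||A||_{2,1} = sum_i ||a^i||_2 (row-wise l2 norms)

print(fro, np.linalg.norm(A, 'fro'))       # both give sqrt(30)
print(l21)                                 # 5 + 0 + sqrt(5): zero rows contribute nothing,
                                           # which is why the l2,1 penalty favors row sparsity
```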
II. PRELIMINARY AND MOTIVATION

We first recall the traditional framework for graph-based semi-supervised feature selection, and then derive the motivation of our method formally.

In this paper, the training videos are collected into a matrix X = [x_1, x_2, ..., x_n] in R^{d×n}, where x_i in R^d refers to the low-level feature vector of the i-th video and n is the total number of videos in the training set. In the framework of semi-supervised learning, we assume the first m (m ≪ n) videos x_1, x_2, ..., x_m are labeled with label vectors y_i = [y_{i1}, y_{i2}, ..., y_{ic}] in {0,1}^c (i = 1, 2, ..., m), where c refers to the number of semantic categories. Specifically, y_ij = 1 when x_i is in the j-th class, and y_ij = 0 otherwise. The remaining videos x_{m+1}, x_{m+2}, ..., x_n are unlabeled. Note that we focus on the standard case where each video belongs to only one category, i.e., the model is formulated for the single-label case.

For feature selection over both labeled and unlabeled videos, the graph Laplacian is widely used to preserve the manifold structure of the video representation. Let G = {X, Ω} be an undirected weighted graph with n nodes of training videos, where Ω = (ω_ij) in R^{n×n} is the affinity graph (similarity matrix), with each entry ω_ij representing the similarity between videos x_i and x_j. Typically, the similarity matrix Ω is predetermined in the original high-dimensional feature space with a Gaussian function, i.e.,

ω_ij = exp(−‖x_i − x_j‖^2 / σ^2) if x_i ∈ N_K(x_j) or x_j ∈ N_K(x_i), and ω_ij = 0 otherwise,

where N_K(x_i) denotes the set of K-nearest neighbors of x_i in the original high-dimensional space. The variance parameter σ, empirically tuned in real-world applications, affects the performance significantly [36]. Based on graph theory, the corresponding Laplacian matrix is defined as L_Ω = D_Ω − (Ω^T + Ω)/2, where D_Ω is a diagonal matrix whose i-th diagonal element is Σ_{j=1}^n (ω_ij + ω_ji)/2 for i = 1, 2, ..., n.

Let f_i be the predicted label vector for the i-th video x_i (i = 1, 2, ..., n). Under the assumption that videos with close distance (neighbors) should have similar labels, manifold smoothness on the graph G is ensured by minimizing the following objective function [37]:

(1/2) Σ_{i=1}^n Σ_{j=1}^n ‖f_i − f_j‖^2 ω_ij = Tr(F^T L_Ω F),    (1)

where F = [F_l; F_u] denotes the predicted label matrix over the training videos. Specifically, F_l = [f_1, f_2, ..., f_m]^T and F_u = [f_{m+1}, f_{m+2}, ..., f_n]^T refer to the predicted label matrices of labeled and unlabeled videos, respectively. Combining this with the label fitness measured through a loss function L(W, F; X), traditional graph-Laplacian-based semi-supervised feature selection solves the following optimization problem:

min_{W,b,F} Tr(F^T L_Ω F) + γ L(W, F; X) + λ Ω(W)   s.t. F_l = Y_l,    (2)

where W in R^{d×c} is a transformation matrix; Y_l = [y_1, y_2, ..., y_m]^T in {0,1}^{m×c} denotes the label matrix over the labeled training videos; γ and λ are trade-off parameters which balance the manifold smoothness (the first term), the label fitness (the second term) and the regularization of the transformation matrix (the third term). The regularization term Ω(W) is used to encourage a specific structure of W. For example, in [25], Ω(W) = ‖W‖_{2,1} is imposed to induce sparsity across features and non-sparsity between classes, so that features discriminative for all classes receive larger weights. The constraint on labeled videos makes the predicted labels F_l consistent with the ground-truth label matrix Y_l.

Note that the similarity matrix Ω used in objective function (2) depends on neighbors pre-assigned according to a Gaussian function. On one hand, the empirically tuned parameter σ usually deteriorates the accuracy and informativeness of the similarity graph. On the other hand, the predetermined similarity matrix is based on norm-based distance measurement in the original high-dimensional feature space, which inevitably suffers from the curse of dimensionality and fails to capture the intrinsic geometric relations accurately. Additionally, the loss function L(W, F; X) is usually specified as a squared ℓ2-norm based distance measurement, which is known to be sensitive to outliers [38]-[40]. As a result, it is significantly important to develop a robust semi-supervised feature selection model which can simultaneously and adaptively characterize the underlying local geometric structure of both labeled and unlabeled videos.
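As a concrete illustration of this conventional pipeline (not of the method proposed below), the following sketch builds the Gaussian K-nearest-neighbor affinity matrix, its Laplacian, and the smoothness term of Eq. (1); all data and parameter values are toy placeholders:

```python
import numpy as np

def gaussian_knn_affinity(X, K=5, sigma=1.0):
    """Predetermined affinity of Section II: X is d x n, columns are videos."""
    n = X.shape[1]
    d2 = np.square(X[:, :, None] - X[:, None, :]).sum(axis=0)  # pairwise squared distances
    Omega = np.zeros((n, n))
    for i in range(n):
        knn = np.argsort(d2[i])[1:K + 1]                       # K nearest, excluding x_i itself
        Omega[i, knn] = np.exp(-d2[i, knn] / sigma ** 2)
    return np.maximum(Omega, Omega.T)   # keep w_ij if x_i in N_K(x_j) or x_j in N_K(x_i)

rng = np.random.default_rng(0)
X = rng.random((10, 40))                # 40 toy "videos" with 10-dim features
F = rng.random((40, 3))                 # toy predicted labels for 3 classes
Omega = gaussian_knn_affinity(X)
D = np.diag((Omega + Omega.T).sum(axis=1) / 2)
L = D - (Omega + Omega.T) / 2           # L_Omega = D_Omega - (Omega^T + Omega)/2
print(np.trace(F.T @ L @ F))            # equals 0.5 * sum_ij ||f_i - f_j||^2 w_ij
```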
III. THE PROPOSED METHODOLOGY

In this section, we exploit a semi-supervised feature selection model by incorporating the exploration of the local structure into the procedure of joint feature selection.

A. Optimal Similarity Graph

We adaptively assign neighbors to each video x_i with a probability vector s_i = [s_{i1}, s_{i2}, ..., s_{in}]^T in R^n, such that s_ij ≥ 0 and s_i^T 1 = 1 for i = 1, 2, ..., n. Instead of supposing that videos with close distance (neighbors) should have similar labels, in this paper we assume that close labels imply high similarity and vice versa. As a result, an optimal similarity graph is estimated through the following optimization problem:

min_{s_i ≥ 0, s_i^T 1 = 1} J_{r_i}(s_i, F) = (1/2) Σ_{j=1}^n ( ‖f_i − f_j‖_2^2 s_ij + r_i s_ij^2 ),    (3)

where the regularization term with parameter r_i is necessary because its absence leads to a trivial solution of problem (3), i.e., only the nearest video would be the neighbor of x_i, with probability 1. Conversely, the regularization term alone would make every video a neighbor of x_i with the same probability 1/n. Therefore, the objective function of (3) encourages close labels f_i and f_j, measured by ‖f_i − f_j‖_2^2, to have a larger probability s_ij, and vice versa. Over the entire training dataset, we denote J_r(S, F) = Σ_{i=1}^n J_{r_i}(s_i, F) and solve the following optimization problem for an optimal neighbor assignment:

min_S J_r(S, F) = (1/2) Σ_{i=1}^n Σ_{j=1}^n ( ‖f_i − f_j‖_2^2 s_ij + r_i s_ij^2 )   s.t. s_i ≥ 0, s_i^T 1 = 1 (for all i).    (4)

In practice, we set the initial similarity matrix S according to a Gaussian function and update the neighbor assignment based on the label difference of each pair of videos. This strategy is totally different from the traditional methods, which use a predetermined similarity graph to characterize the underlying geometric structure of the original features. Since the c-dimensional label space is of much lower dimension than the original d-dimensional feature space, the optimal similarity graph not only mitigates the curse of dimensionality to some extent, but also captures the intrinsic structure effectively.

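To make the role of r_i concrete, the two limiting cases of problem (3) can be written out explicitly; this is a routine verification rather than a derivation from the paper. Writing h_ij = ‖f_i − f_j‖_2^2:

```latex
% r_i -> 0: a linear program over the simplex; all mass goes to the closest label
\min_{s_i \ge 0,\; s_i^\top \mathbf{1} = 1} \sum_{j} h_{ij} s_{ij}
  \;\Longrightarrow\; s_{ij^\star} = 1 \text{ for } j^\star = \arg\min_j h_{ij},\ s_{ij} = 0 \text{ otherwise};
% r_i -> \infty: only the quadratic term matters; the minimizer is uniform
\min_{s_i \ge 0,\; s_i^\top \mathbf{1} = 1} \sum_{j} s_{ij}^2
  \;\Longrightarrow\; s_{ij} = 1/n \ \ \forall j.
```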
B. Optimal Graph Embedded Feature Selection

Based on the analysis above, we incorporate optimal similarity graph learning into the procedure of semi-supervised feature selection and specify the label fitness with an adaptive loss for better robustness. This idea can be formulated as a concise optimization problem as follows:

min_{W,b,F,S} J_r(S, F) + γ ‖X^T W + 1 b^T − F‖_σ + λ ‖W‖_{2,1}   s.t. F_l = Y_l; s_i ≥ 0, s_i^T 1 = 1 (for all i),    (5)

where γ and λ are trade-off parameters. Given any matrix A in R^{n×c} with rows a^i, the adaptive loss with parameter σ is defined as

‖A‖_σ = Σ_{i=1}^n (1 + σ) ‖a^i‖_2^2 / (‖a^i‖_2 + σ).    (6)

Note that the adaptive loss smoothly interpolates between the ℓ1-norm and ℓ2-norm based loss functions. It thus takes the advantages of both norm-based loss measurements and is more robust to videos with either a small or a substantial loss [35]. From the objective function of problem (5), we arrive at a new framework for semi-supervised feature selection. This strategy takes advantage of both the optimal neighbor assignment for the model's accuracy and the adaptive loss measurement for the model's robustness; we name it Optimal Graph Embedded Semi-supervised Feature Selection (OGE-SFS). Considering the convexity of the objective function with respect to the variables W, b, S and F, an alternating optimization algorithm can be readily employed to address the proposed challenging problem (5). Once W is obtained, we sort the d features according to ‖w^i‖_2 (1 ≤ i ≤ d) in descending order and select the top-ranked ones.
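Before moving to the optimization, a quick numerical sketch makes the interpolation behavior of the adaptive loss in Eq. (6) visible; the residual values below are arbitrary:

```python
import numpy as np

def adaptive_loss(r, sigma=0.1):
    """Adaptive loss of Eq. (6) for a single residual norm r = ||a^i||_2 >= 0."""
    return (1 + sigma) * r ** 2 / (r + sigma)

r = np.array([0.01, 0.1, 1.0, 10.0])
print("adaptive:  ", np.round(adaptive_loss(r), 4))
print("squared l2:", np.round(r ** 2, 4))
print("l1:        ", np.round(r, 4))
# For r << sigma the loss behaves like (1+sigma)/sigma * r^2 (quadratic, smooth at 0);
# for r >> sigma it behaves like (1+sigma) * r (linear), so large residuals from
# outlying videos no longer dominate the objective the way a squared loss would.
```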
Fig. 1: Flowchart of the alternating algorithm for OGE-SFS, alternating between adaptive neighbor assignment and the re-weighted algorithm for label prediction.

C. Optimization Strategy

The solution of optimization problem (5) can be approximately attained by updating S and {W, b, F} alternately, with the other block fixed. For a better understanding, the flowchart of the optimization procedure is shown in Figure 1, where we name the step of optimizing S with fixed {W, b, F} adaptive neighbor assignment, and exploit a re-weighted algorithm for label prediction by optimizing {W, b, F} with fixed S. In other words, the optimization strategy aims to learn the similarity matrix S and the predicted label matrix F simultaneously. Different from the traditional semi-supervised feature selection models, which predict labels given a predetermined similarity matrix, the proposed model is capable of achieving better performance since it also learns the optimal graph for feature selection. Subsequently, we introduce the details of the alternating optimization algorithm.

1) Adaptive Neighbor Assignment: With fixed W, b and F, optimization problem (5) decouples into the following n independent sub-problems, owing to the independence of the videos in the training dataset:

min_{s_i ≥ 0, s_i^T 1 = 1} (1/2) Σ_{j=1}^n ( ‖f_i − f_j‖_2^2 s_ij + r_i s_ij^2 )    (7)

for i = 1, 2, ..., n. After some evident reformulation, this problem is equivalent to the following compact form:

min_{s_i ≥ 0, s_i^T 1 = 1} ‖ s_i + (1/(2 r_i)) h_i ‖_2^2,    (8)

where h_i = [h_{i1}, h_{i2}, ..., h_{in}]^T is an n-dimensional vector corresponding to the i-th video and h_ij = ‖f_i − f_j‖_2^2 for j = 1, 2, ..., n. This optimization problem is a Euclidean projection onto the simplex. Following the strategy used in [41], we formulate the Lagrangian function of (8) as

L(s_i, η, ζ) = ‖ s_i + (1/(2 r_i)) h_i ‖_2^2 − η (s_i^T 1 − 1) − ζ^T s_i,    (9)

where η in R and ζ = [ζ_1, ζ_2, ..., ζ_n]^T in R^n are the Lagrangian multipliers. Let the optimal solution of problem (8) be s_i*, with optimal Lagrangian multipliers η* and ζ*. According to the KKT conditions, we have the optimal solution

s_ij* = ( −(1/(2 r_i)) h_ij + η* )_+    (10)

for j = 1, 2, ..., n. Because of the locality assumption that only the K nearest neighbors have a chance to connect to x_i, the optimal s_i* has only K nonzero elements. Without loss of generality, let h_{i1}, h_{i2}, ..., h_{in} be sorted in ascending order. We suppose the K-th smallest component of s_i* is larger than 0 and the (K+1)-th smallest component is less than or equal to 0, i.e.,

−(1/(2 r_i)) h_{iK} + η* > 0;   −(1/(2 r_i)) h_{i(K+1)} + η* ≤ 0.    (11)

Based on the constraint s_i^T 1 = 1 and Eq. (10), we have Σ_{j=1}^K ( −(1/(2 r_i)) h_ij + η* ) = 1, which yields η* = (1 + (1/(2 r_i)) Σ_{j=1}^K h_ij ) / K. As a result, the regularization parameter r_i satisfies the following inequality:

(K/2) h_{iK} − (1/2) Σ_{j=1}^K h_ij < r_i ≤ (K/2) h_{i(K+1)} − (1/2) Σ_{j=1}^K h_ij.    (12)

We set r_i = (K/2) h_{i(K+1)} − (1/2) Σ_{j=1}^K h_ij to obtain an optimal solution s_i* with exactly K nonzero components, i.e.,

s_ij* = ( −(1/(2 r_i)) h_ij + η* )_+    (13)

for j = 1, 2, ..., n. In this case, the optimal neighbor assignment s_i* depends only on the number of neighbors K for each video, which is more intuitive and easier to tune than the regularization parameter r_i, because K is an integer and has an explicit meaning in practice [41].

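The closed form (13) is straightforward to implement. The sketch below assigns K sparse neighbor probabilities per video from the current label matrix F; the variable names are ours, and a small constant guards against ties where r_i would vanish:

```python
import numpy as np

def assign_neighbors(F, K=5):
    """Adaptive neighbor assignment, Eqs. (12)-(13): row i of S gets exactly K
    nonzero probabilities that sum to one and decrease with label distance."""
    n = F.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        h = np.square(F - F[i]).sum(axis=1)            # h_ij = ||f_i - f_j||_2^2
        h[i] = np.inf                                  # a video is not its own neighbor
        idx = np.argsort(h)[:K + 1]                    # K nearest labels plus the (K+1)-th
        hK = h[idx[:K]]
        r_i = (K * h[idx[K]] - hK.sum()) / 2 + 1e-12   # upper bound of Eq. (12)
        eta = (1 + hK.sum() / (2 * r_i)) / K
        S[i, idx[:K]] = np.maximum(eta - hK / (2 * r_i), 0.0)
    return S
```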
2) Label Prediction with a Re-weighted Algorithm: For a cleaner presentation, let f_λ(W, b, F; S) = Tr(F^T L_S F) + λ‖W‖_{2,1} and g_i(W, b, F) = W^T x_i + b − f_i. With a fixed similarity matrix S, and noting from Eq. (1) that the F-dependent part of J_r(S, F) equals Tr(F^T L_S F), optimization problem (5) with respect to the variables W, b and F turns into

min_{W,b,F} φ = f_λ(W, b, F; S) + γ Σ_{i=1}^n (1 + σ) ‖g_i(W, b, F)‖_2^2 / (‖g_i(W, b, F)‖_2 + σ)   s.t. F_l = Y_l.    (14)

Following the adaptive loss minimization approach [35], the derivative of the objective of (14) with respect to {W, b, F} can be written as the derivative of f_λ(W, b, F; S) plus 2γ Σ_{i=1}^n d_i g_i(W, b, F) times the derivative of g_i(W, b, F), where

d_i = (1 + σ) ( ‖g_i(W, b, F)‖_2 + 2σ ) / ( 2 ( ‖g_i(W, b, F)‖_2 + σ )^2 )    (15)

for i = 1, 2, ..., n. Given d_i for the i-th video x_i, problem (14) can be reformulated, according to this derivation, as

min_{W,b,F} f_λ(W, b, F; S) + γ Σ_{i=1}^n d_i ‖g_i(W, b, F)‖_2^2   s.t. F_l = Y_l.    (16)

To address problem (14), we first calculate d_i according to Eq. (15) in terms of the current solution, and then update {W, b, F} by addressing problem (16). Therefore, the key step of the optimization procedure for problem (14) lies in solving the following problem:

min_{W,b,F} f_λ(W, b, F; S) + γ Tr(A^T D A)   s.t. F_l = Y_l,    (17)

where A = X^T W + 1 b^T − F and D is a diagonal matrix whose i-th diagonal element is

d_i = (1 + σ) ( ‖W^T x_i + b − f_i‖_2 + 2σ ) / ( 2 ( ‖W^T x_i + b − f_i‖_2 + σ )^2 ).    (18)

It can be verified that the objective function of problem (17) is jointly convex with respect to the variables W, b and F. Therefore, we set the derivative of the objective of (17) with respect to b to 0 and obtain

b = (1/(1^T D 1)) F^T D 1 − (1/(1^T D 1)) W^T X D 1.    (19)

Note that the derivative of ‖W‖_{2,1} with respect to W is R^{−1} W, where R = diag(r_11, r_22, ..., r_dd) is a diagonal matrix with (i,i)-th diagonal element r_ii = 2‖w^i‖_2 for i = 1, 2, ..., d. By substituting Eq. (19) into the objective of problem (17) and setting its derivative with respect to W to 0, we have

W = ( X N X^T + λ R^{−1} )^{−1} X N F,    (20)

where N = D − (1/(1^T D 1)) D 1 1^T D is the centering matrix.
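One inner pass of Algorithm 1 can be sketched as follows, with the trade-off γ folded into the weights d_i so that Eqs. (18)-(20) apply verbatim; this is a simplified illustration under our own conventions, not the authors' reference implementation:

```python
import numpy as np

def reweighted_step(X, F, W, b, gamma=1.0, lam=1.0, sigma=0.1):
    """Residual weights of Eq. (18), then the closed-form b and W of
    Eqs. (19)-(20). X is d x n; F is n x c; rows of the residual are g_i."""
    G = X.T @ W + b[None, :] - F
    g = np.linalg.norm(G, axis=1)
    d = gamma * (1 + sigma) * (g + 2 * sigma) / (2 * (g + sigma) ** 2)  # Eq. (18)
    N = np.diag(d) - np.outer(d, d) / d.sum()                           # centering matrix N
    Rinv = np.diag(1.0 / (2 * np.linalg.norm(W, axis=1) + 1e-12))
    W_new = np.linalg.solve(X @ N @ X.T + lam * Rinv, X @ N @ F)        # Eq. (20)
    b_new = (F - X.T @ W_new).T @ d / d.sum()                           # Eq. (19)
    return W_new, b_new, d
```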
Moreover, by substituting Eqs. (19) and (20) into the objective of problem (17), we arrive at Tr(F^T (L_S + γN − γN X^T (X N X^T + λ R^{−1})^{−1} X N) F) = Tr(F^T M F), where

M = L_S + γN − γN X^T (X N X^T + λ R^{−1})^{−1} X N = [ M_ll, M_lu ; M_ul, M_uu ].    (21)

As a result, the optimal label matrix F_u for the unlabeled videos can be predicted as

F_u = −M_uu^{−1} M_ul Y_l.    (22)

In summary, we describe the re-weighted optimization for problem (14) in Algorithm 1, and the overall alternating optimization algorithm for OGE-SFS in Algorithm 2.

Algorithm 1 Re-weighted algorithm for {W, b, F}
Initialize: W, b and F.
1: while not converged do
2:   Calculate the diagonal matrix D = diag(d_1, d_2, ..., d_n) according to Eq. (18);
3:   Update F by Eq. (22); update W by Eq. (20); update b by Eq. (19).
4: end while

Algorithm 2 Alternating optimization for OGE-SFS
Input: X, c, K, σ, λ and γ.
Output: W.
Initialize: W^0, S^0 and t = 0.
1: repeat
2:   Update W^{t+1}, b^{t+1} and F^{t+1} with fixed S^t by Algorithm 1;
3:   Update S^{t+1} with fixed F^{t+1} by Eq. (13);
4: until convergence

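Putting the two steps together gives a compact sketch of Algorithm 2. It reuses assign_neighbors and reweighted_step from the sketches above; the initialization and iteration counts are arbitrary choices of ours, and γ is folded into the weights as before:

```python
import numpy as np

def oge_sfs(X, Y_l, K=5, gamma=1.0, lam=1.0, sigma=0.1, T=20):
    """Alternating optimization for OGE-SFS (Algorithm 2). X is d x n with the
    first m columns labeled; Y_l is m x c. Returns features ranked by ||w^i||_2."""
    d_dim, n = X.shape
    m, c = Y_l.shape
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((d_dim, c))
    b = np.zeros(c)
    F = np.vstack([Y_l, np.full((n - m, c), 1.0 / c)])   # unlabeled rows start uniform
    for _ in range(T):
        S = assign_neighbors(F, K)                       # step 1: Eq. (13) with fixed F
        Ssym = (S + S.T) / 2
        L_S = np.diag(Ssym.sum(axis=1)) - Ssym
        W, b, d = reweighted_step(X, F, W, b, gamma, lam, sigma)  # step 2: Algorithm 1
        N = np.diag(d) - np.outer(d, d) / d.sum()
        Rinv = np.diag(1.0 / (2 * np.linalg.norm(W, axis=1) + 1e-12))
        M = L_S + N - N @ X.T @ np.linalg.solve(X @ N @ X.T + lam * Rinv, X @ N)
        F[m:] = -np.linalg.solve(M[m:, m:], M[m:, :m] @ Y_l)      # Eq. (22)
    return np.argsort(-np.linalg.norm(W, axis=1)), W, F
```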
IV. CONVERGENCE AND COMPLEXITY ANALYSIS

In this section, we provide a theoretical analysis of the convergence of the proposed alternating optimization algorithm, together with an analysis of its computational complexity.

A. Convergence Analysis

Lemma 1: [35] For any vectors x and y, we have

‖x‖_2^2/(‖x‖_2 + σ) − (‖y‖_2 + 2σ)‖x‖_2^2/(2(‖y‖_2 + σ)^2) ≤ ‖y‖_2^2/(‖y‖_2 + σ) − (‖y‖_2 + 2σ)‖y‖_2^2/(2(‖y‖_2 + σ)^2).

Theorem 1: The re-weighted updating rules in Algorithm 1 monotonically decrease the objective value of optimization problem (14) in each iteration.

Proof: Denote by W~, b~ and F~ the variables after one update of Algorithm 1. Based on the updating rules, we have

f_λ(W~, b~, F~; S) + γ Σ_{i=1}^n d_i ‖g_i(W~, b~, F~)‖_2^2 ≤ f_λ(W, b, F; S) + γ Σ_{i=1}^n d_i ‖g_i(W, b, F)‖_2^2,    (23)

where d_i = (1 + σ)(‖W^T x_i + b − f_i‖_2 + 2σ)/(2(‖W^T x_i + b − f_i‖_2 + σ)^2). On the other hand, applying Lemma 1 with x = g_i(W~, b~, F~) and y = g_i(W, b, F), multiplying by γ(1 + σ), and summing over i, then adding the result to (23), we arrive at

f_λ(W~, b~, F~; S) + γ Σ_{i=1}^n (1 + σ)‖g_i(W~, b~, F~)‖_2^2/(‖g_i(W~, b~, F~)‖_2 + σ) ≤ f_λ(W, b, F; S) + γ Σ_{i=1}^n (1 + σ)‖g_i(W, b, F)‖_2^2/(‖g_i(W, b, F)‖_2 + σ).    (24)

Therefore, the re-weighted updating rules in Algorithm 1 monotonically decrease the objective value of problem (14) in each iteration. The proof is completed.

Theorem 2: The alternating updating rules in Algorithm 2 monotonically decrease the objective value of optimization problem (5) in each iteration until convergence.

Proof: In iteration t, we update the variables W^{t+1}, b^{t+1} and F^{t+1} by Algorithm 1 while keeping S^t fixed. According to Theorem 1, we have

J_r(S^t, F^{t+1}) + γ‖X^T W^{t+1} + 1(b^{t+1})^T − F^{t+1}‖_σ + λ‖W^{t+1}‖_{2,1} ≤ J_r(S^t, F^t) + γ‖X^T W^t + 1(b^t)^T − F^t‖_σ + λ‖W^t‖_{2,1}.    (25)

On the other hand, with fixed F^{t+1}, we update the similarity matrix S by Eq. (13) and obtain

J_r(S^{t+1}, F^{t+1}) ≤ J_r(S^t, F^{t+1}).    (26)

Combining inequalities (25) and (26), we arrive at

J_r(S^{t+1}, F^{t+1}) + γ‖X^T W^{t+1} + 1(b^{t+1})^T − F^{t+1}‖_σ + λ‖W^{t+1}‖_{2,1} ≤ J_r(S^t, F^t) + γ‖X^T W^t + 1(b^t)^T − F^t‖_σ + λ‖W^t‖_{2,1}.

Hence, the objective value of optimization problem (5) decreases monotonically in each iteration until the algorithm converges. The proof is completed.

B. Computational Complexity

As shown in Algorithm 2, the alternating optimization algorithm for the proposed OGE-SFS consists of two steps, i.e., adaptive neighbor assignment and the re-weighted algorithm for label prediction. We analyze the computational cost of these two steps respectively. The main cost of the first step lies in calculating the predicted label distances between each pair of videos, with complexity O(n^2 c). Note that this cost is much smaller than that of the traditional graph Laplacian in the original feature space, O(n^2 d), since c ≪ d in many practical applications. During the label prediction step with a fixed similarity matrix, the major computational cost lies in the matrix inversion required to update W and F. Specifically, let P = X N X^T + λ R^{−1} in R^{d×d}, with computational cost O(nd^2), and Q = X N F in R^{d×c}, with cost O(ndc); the updating rule for W can be rewritten as W = P^{−1} Q. Note that the computational cost of the matrix inversion is O(d^3). Fortunately, this expensive inversion can be avoided [37]. Since updating W is equivalent to solving the optimization problem min_W Tr(W^T P W) − 2Tr(W^T Q), the optimal W can be obtained iteratively through gradient descent, i.e., W^{t+1} = W^t − α(P W^t − Q). In this way, the cost of the inversion reduces to O(T d^2 c), where T is the number of iterations. As a result, the total cost of updating W becomes O(nd^2) + O(T d^2), since c ≪ d and c ≪ n in practice. The same strategy can also be used for updating F.
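The inversion-free update mentioned above can be sketched directly; the step-size rule is a standard choice of ours and is not specified in the paper:

```python
import numpy as np

def solve_W_iteratively(P, Q, W0, T=100):
    """Minimize Tr(W^T P W) - 2 Tr(W^T Q) by gradient descent instead of W = P^{-1} Q.
    Each step costs O(d^2 c), so T steps cost O(T d^2 c) versus O(d^3) for inversion."""
    W = W0.copy()
    alpha = 1.0 / np.linalg.norm(P, 2)  # spectral-norm step size keeps the iteration stable
    for _ in range(T):
        W -= alpha * (P @ W - Q)        # gradient of the quadratic is 2(PW - Q), folded into alpha
    return W
```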
V. EXPERIMENTS

In this section, extensive experiments over several real-world video datasets are conducted to evaluate the performance of the proposed semi-supervised feature selection method on video semantic recognition related tasks, including video concept detection (VCD) and human action recognition (HAR).

A. Experimental Data

To evaluate the proposed algorithm, we briefly introduce the following five datasets and summarize their details in Table I.

Columbia Consumer Video dataset (CCV) [42]: This dataset, collected for visual content recognition, consists of 9,317 YouTube videos over 20 semantic categories, where the manual annotation was conducted on the Amazon MTurk platform. In the experiments, we directly use space-time interest point (STIP) features with a 5,000-dimensional bag-of-visual-words (BoW) representation as the low-level features of the videos [42].

Kodak consumer video dataset (Kodak) [43]: This set consists of 1,358 consumer video clips and 1,873 YouTube video clips, from which 3,590 key-frames were extracted and annotated in a rigorous fashion. All the key-frames belonging to the annotated concepts are used in the experiments. Following [2], we extract SIFT points and represent each key-frame as a 1,000-dimensional BoW histogram based on the clustered visual dictionary.

CareMedia dataset (CareMedia) [44]: This dataset records 15 geriatric patients' activities in the public spaces of a nursing home. We use a subset of the entire dataset, consisting of 3,913 video sequences recorded by one camera in the dining room. In the experiments, a 1,000-dimensional MoSIFT feature is extracted to represent each video sequence, for its consideration of both local appearance and local motion [45].

KTH action dataset (KTH) [46]: This video dataset consists of 2,391 video sequences covering 6 types of human actions. Following the STIP descriptor in [2], we represent each video sequence as a 1,000-dimensional STIP feature by concatenating several histograms from a space-time grid defined on the patch, generalizing the SIFT descriptor to space-time.

HMDB dataset: This dataset contains 7,000 manually annotated clips with 51 action categories, extracted from a variety of sources ranging from digitized movies to YouTube. Following the STIP descriptor in [2], we represent each video as a 1,000-dimensional STIP feature.

TABLE I: Dataset description. [Some entries were lost in extraction.]
Dataset    Task  # samples  # features  # training  # categories  # selected features
CCV        VCD   9,317      5,000       4,659       20            {1,000, 1,500, ..., 4,500, 5,000}
Kodak      VCD   3,590      1,000       2,000       22            {100, 200, ..., 900, 1,000}
CareMedia  HAR   3,913      1,000       1,000       5             {100, 200, ..., 900, 1,000}
KTH        HAR   2,391      1,000       1,000       6             {100, 200, ..., 900, 1,000}
HMDB51     HAR   7,000      1,000       (lost)      51            {100, 200, ..., 900, 1,000}

B. Experimental Settings

For the CCV dataset, we adopt the standard partition which splits the entire dataset into 4,659 training videos and 4,658 testing videos. Since there is no standard split for the other datasets, we randomly sample a certain number of videos as the training data; the remaining videos of each dataset serve as the corresponding testing data. The detailed splits for each dataset are included in Table I. The number of selected features for each dataset is also reported in Table I with regard to the particular feature size. To study the performance of semi-supervised feature selection, we set the ratio of labeled videos among the sampled training data to different values from {1%, 5%, 10%, 25%, 50%, 100%}.

To illustrate the effectiveness and superiority of our model's classification performance, we compare the proposed OGE-SFS with the following state-of-the-art methods. Note that All-Fea serves as a baseline without feature selection; SBMLR and FSNM are supervised feature selection methods; LUDLP and ASL perform semi-supervised classification with all features; the others are semi-supervised feature selection methods.

All features [All-Fea]: The original high-dimensional features without feature selection, used as a baseline in the experiments.

Sparse Multinomial Logistic Regression via Bayesian ℓ1 Regularization [47] [SBMLR]: It can be applied to feature selection by exploiting sparsity using a Laplace prior.

Feature Selection via Joint ℓ2,1-norm Minimization [48] [FSNM]: It employs joint ℓ2,1-norm minimization on both the loss function and the regularization to realize feature selection across all data points.

Learning from Labeled and Unlabeled Data with Label Propagation [49] [LUDLP]: This semi-supervised classification method propagates labels through the dataset along high-density areas defined by the unlabeled data.

Large-Scale Adaptive Semi-Supervised Learning [37] [ASL]: Instead of using graph information, this semi-supervised classification method suppresses the weights of boundary unlabeled points adaptively.
Locality Sensitive Semi-Supervised Feature Selection [50] [LSDF]: This is a semi-supervised feature selection approach based on within-class and between-class graph construction.

Discriminative Semi-Supervised Feature Selection via Manifold Regularization [24] [FS-Manifold]: It selects features by maximizing the classification margin between different classes while simultaneously exploiting the data geometry through manifold regularization.

Structural Feature Selection with Sparsity [25] [SFSS]: This semi-supervised feature selection algorithm incorporates joint feature selection and semi-supervised learning into a single framework. Correlations between different features are taken into consideration.

Semi-Supervised Feature Selection via Spline Regression [2] [S²FS²R]: The discriminative information of the labeled training videos and the local geometric structure of all the training videos are preserved by the combined semi-supervised scatters.

For fair comparison, the number of neighbors is set to five for all of the graph-based competitors. We repeat the training data sampling procedure 50 times and report the average performance with the corresponding variance for all of the datasets, where the performance of each repetition is achieved by tuning each method's regularization parameters with a grid-search strategy over the set {10^{-3}, 10^{-2}, ..., 10^3}. Specifically, we set the adaptive loss parameter to σ = 0.1 in the proposed model. With the selected features, we report classification results based on a χ²-kernel SVM, for its excellent performance on human action recognition [51]. In this paper, we leverage mean average precision (mAP) as the evaluation metric due to its stability and discriminating capability for classification.

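For reference, evaluation in this setting can be reproduced with off-the-shelf components. The snippet below wires a precomputed χ² kernel into an SVM and computes average precision on synthetic stand-in data; all sizes and labels are toy values, and per-class APs would be averaged into mAP:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score
from sklearn.metrics.pairwise import chi2_kernel

# Toy stand-ins for selected BoW features (nonnegative, as chi2 requires).
rng = np.random.default_rng(0)
Xtr, Xte = rng.random((80, 50)), rng.random((40, 50))
ytr, yte = rng.integers(0, 2, 80), rng.integers(0, 2, 40)

Ktr = chi2_kernel(Xtr, Xtr)            # precomputed chi-square kernel
Kte = chi2_kernel(Xte, Xtr)
clf = SVC(kernel="precomputed").fit(Ktr, ytr)
scores = clf.decision_function(Kte)
print("AP:", average_precision_score(yte, scores))   # mAP = mean of per-class APs
```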
C. Experimental Results

To study the influence of different ratios of labeled training data on classification performance, we set the ratio to various values in {1%, 5%, 10%, 25% and 50%} in the semi-supervised learning and report the mAP values for all of the datasets in Table II to Table VI, respectively, where the best results are highlighted in bold.

TABLE II: Comparison on the CCV dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; rows: All-Fea, SBMLR, FSNM, LUDLP, ASL, LSDF, FS-Manifold, SFSS, S²FS²R, OGE-SFS.]

TABLE III: Comparison on the Kodak dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

TABLE IV: Comparison on the CareMedia dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

TABLE V: Comparison on the KTH dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

From these tables, we reach the following observations and analysis:

1) All-Fea shows the worst performance among the feature selection based methods over all of the datasets. It is therefore necessary to exploit dimensionality reduction techniques to relieve the curse of dimensionality and mitigate the effects of noise.

2) For each method, the performance on each dataset improves as the ratio of labeled training videos increases. Since the discriminative information is usually encoded in the qualified labels, which are typically expensive to obtain, it is necessary to develop semi-supervised learning for video semantic recognition.

3) Semi-supervised classification methods, with or without feature selection, commonly perform better than supervised methods, owing to the intrinsic structure exploited from the unlabeled videos. We also observe that semi-supervised feature selection methods perform much better than supervised feature selection methods even with fewer unlabeled videos.

4) In general, semi-supervised feature selection algorithms (i.e., S²FS²R and OGE-SFS) outperform semi-supervised classification algorithms (i.e., LUDLP and ASL) on all the datasets, which confirms the necessity of feature selection for video semantic recognition tasks.

5) For each dataset, the proposed OGE-SFS consistently outperforms all the compared semi-supervised feature selection methods for varying numbers of labeled training videos, thanks to the adaptive neighbor assignment and adaptive loss measurement. Indeed, OGE-SFS learns the optimal similarity matrix adaptively and captures the intrinsic manifold structure of the original feature space more exactly.

TABLE VI: Comparison on the HMDB dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

To ensure that the improvement of the proposed algorithm in terms of feature selection is statistically significant, a non-parametric pair-wise Wilcoxon test is conducted on the five datasets. The experimental results are reported in Table VII to Table XI, where the victories are highlighted in bold. We compute the p-value of the proposed algorithm against each other algorithm at a fixed level of significance α. From the results shown in Table VII to Table XI, we observe that the proposed algorithm achieves statistically significant improvements.

TABLE VII: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on CCV.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  2e-11  5e-11  2e-11  3e-11  1e-11
SBMLR         4e-11  3e-11  5e-11  4e-11  2e-11  3e-11
FSNM          4e-11  1e-11  5e-11  3e-11  2e-11  2e-11
LSDF          5e-11  3e-11  2e-11  3e-11  4e-11  2e-11
FS-Manifold   3e-11  3e-11  4e-11  3e-11  2e-11  4e-11
SFSS          2e-11  4e-11  2e-11  7e-11  3e-11  3e-11
S²FS²R        2e-11  3e-11  4e-11  2e-11  2e-11  4e-11

TABLE VIII: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on Kodak.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  4e-11  2e-11  3e-11  2e-11  4e-11
SBMLR         2e-11  4e-11  4e-11  3e-11  2e-11  4e-11
FSNM          2e-11  4e-11  3e-11  4e-11  3e-11  2e-11
LSDF          3e-11  2e-11  4e-11  2e-11  4e-11  2e-11
FS-Manifold   4e-11  3e-11  2e-11  3e-11  2e-11  3e-11
SFSS          5e-11  3e-11  1e-11  3e-11  3e-11  1e-11
S²FS²R        3e-11  5e-11  2e-11  5e-11  3e-11  2e-11

TABLE IX: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on CareMedia.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  2e-11  3e-11  5e-11  4e-11  2e-11
SBMLR         5e-11  3e-11  4e-11  2e-11  5e-11  4e-11
FSNM          4e-11  3e-11  4e-11  6e-11  3e-11  2e-11
LSDF          4e-11  3e-11  4e-11  5e-11  4e-11  3e-11
FS-Manifold   4e-11  3e-11  2e-11  3e-11  2e-11  3e-11
SFSS          4e-11  2e-11  5e-11  4e-11  3e-11  2e-11
S²FS²R        2e-11  3e-11  2e-11  4e-11  2e-11  3e-11

TABLE X: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on KTH.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  3e-11  5e-11  3e-11  2e-11  5e-11
SBMLR         4e-11  2e-11  4e-11  3e-11  3e-11  4e-11
FSNM          3e-11  4e-11  5e-11  4e-11  3e-11  5e-11
LSDF          2e-11  5e-11  3e-11  4e-11  5e-11  3e-11
FS-Manifold   3e-11  2e-11  4e-11  4e-11  2e-11  5e-11
SFSS          4e-11  3e-11  2e-11  4e-11  5e-11  3e-11
S²FS²R        3e-11  2e-11  2e-11  3e-11  3e-11  4e-11

TABLE XI: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on HMDB.
Method        1%     5%     10%    25%    50%    100%
All-Fea       2e-11  4e-11  4e-11  2e-11  3e-11  3e-11
SBMLR         5e-11  4e-11  3e-11  2e-11  4e-11  5e-11
FSNM          4e-11  2e-11  3e-11  4e-11  2e-11  4e-11
LSDF          3e-11  3e-11  4e-11  2e-11  3e-11  3e-11
FS-Manifold   4e-11  3e-11  2e-11  4e-11  3e-11  3e-11
SFSS          4e-11  2e-11  3e-11  4e-11  2e-11  4e-11
S²FS²R        4e-11  3e-11  3e-11  4e-11  4e-11  5e-11

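The significance test itself is a one-liner with scipy; the mAP arrays below are hypothetical stand-ins for the 50 repetitions described above:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
map_ogesfs = 0.30 + 0.02 * rng.random(50)    # hypothetical mAP of OGE-SFS over 50 resamplings
map_baseline = 0.27 + 0.02 * rng.random(50)  # hypothetical mAP of a compared method
stat, p = wilcoxon(map_ogesfs, map_baseline, alternative="greater")
print(f"Wilcoxon p-value: {p:.2e}")          # a small p-value rejects "no improvement"
```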
D. Convergence Analysis

Fig. 2: Convergence curves of the proposed algorithm over all of the datasets (panels: CCV, Kodak, CareMedia, KTH, HMDB). [Plots lost in extraction.]

In the previous section, we verified the convergence of the proposed alternating optimization algorithm in theory. For practical applications, without loss of generality, we plot the convergence curves of the proposed algorithm in Figure 2 with the regularization parameters γ and λ fixed at 1 for all of the datasets. Thanks to the sparsity assumption on the adaptive similarity matrix S as well as the convex formulation in the variables W, b and F, the proposed alternating optimization algorithm converges to a local minimum within no more than 20 iterations on all of the datasets; in particular, fewer than 10 iterations are required on the CCV, Kodak and CareMedia datasets. As a result, the proposed algorithm is efficient and shows fast convergence in practice.

E. Sensitivity Analysis

Fig. 3: Sensitivity analysis of the regularization parameters γ and λ with 10% labeled training data for all datasets (mAP surfaces over the grid-search range of γ and λ; panels: CCV, Kodak, CareMedia, KTH, HMDB). [Plots lost in extraction.]

To demonstrate the robustness of the proposed algorithm, we study the influence of the parameters γ and λ with 10% labeled training data on all of the datasets. Figure 3 shows the joint sensitivity of the two regularization parameters λ and γ with respect to the performance of video semantic recognition, with the adaptive loss parameter fixed at σ = 0.1 and the size of the adaptive neighborhood at K = 5. We observe from the results on the five datasets that, on one hand, the proposed method is robust to different values of the regularization parameters λ and γ over a wide range; on the other hand, there seems to be no common rule to identify the best parameters for all applications, because of the varying properties of the datasets.

We also conduct experiments to test the influence of the number of nearest neighbors under the experimental settings of Subsection B. Table XII shows the performance of the proposed model with respect to different numbers of nearest neighbors, i.e., K = 5, 10, 20. From the experimental results, we can see that the best performance is generally obtained with K = 5 over all of the datasets.

TABLE XII: Influence of K (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; rows: OGE-SFS(5), OGE-SFS(10), OGE-SFS(20) for each of CCV, Kodak, CareMedia, KTH and HMDB.]
For a more detailed discussion of the influence of K on local graph-based models, interested readers may refer to [52], [53].

VI. CONCLUSION

In this paper, we proposed a novel semi-supervised feature analysis framework for video semantic recognition that integrates adaptive optimal similarity matrix learning into the procedure of feature selection. This method alleviates the sensitivity of the model performance to the input affinity matrix and captures the intrinsic manifold structure of the original feature space through adaptive neighbor assignment. Additionally, an adaptive loss function is adopted to enhance the model's robustness to noise and outliers. We exploit an efficient alternating optimization algorithm to solve the proposed challenging problem. The convergence and complexity of the proposed algorithm are analyzed in theory. Both the theoretical analysis and the experimental performance illustrate the effectiveness and superiority of the proposed approach on varying tasks of video semantic recognition. Note that we focus on the case where each data point belongs to only one category with a single view. In future work, it would be interesting to involve adaptive neighbor assignment in semi-supervised feature selection in the multi-label [23], [31] and multi-view [54], [55] cases, as well as in other interesting applications, e.g., co-saliency detection [56], [57].

REFERENCES

[1] X. Chang, Y.-L. Yu, Y. Yang, and E. P. Xing, "Semantic pooling for complex event analysis in untrimmed videos," IEEE Trans. Pattern Anal. Mach. Intell.
[2] Y. Han, Y. Yang, Y. Yan, Z. Ma, N. Sebe, and X. Zhou, "Semisupervised feature selection via spline regression for video semantic recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2, pp. 252-264, 2015.
[3] M. Luo, F. Nie, X. Chang, Y. Yang, A. Hauptmann, and Q. Zheng, "Avoiding optimal mean robust PCA/2DPCA with non-greedy ℓ1-norm maximization," in IJCAI, 2016.
[4] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Springer Science & Business Media, 2012.
[5] J. G. Dy and C. E. Brodley, "Feature selection for unsupervised learning," J. Mach. Learn. Res., vol. 5, 2004.
[6] Z. Zhao, X. He, D. Cai, L. Zhang, W. Ng, and Y. Zhuang, "Graph regularized feature selection with data reconstruction," IEEE Trans. Knowl. Data Eng., vol. 28, no. 3, 2016.
[7] H. Elghazel and A. Aussem, "Feature selection for unsupervised learning using random cluster ensembles," in ICDM, 2010.
[8] H. Arai, C. Maung, K. Xu, and H. Schweitzer, "Unsupervised feature selection by heuristic search with provable bounds on suboptimality," in AAAI, 2016.
[9] P. Mitra, C. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, 2002.
[10] H.-L. Wei and S. A. Billings, "Feature subset selection and ranking for data dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, 2007.
[11] J. Xu, B. Tang, H. He, and H. Man, "Semisupervised feature selection based on relevance and redundancy criteria," IEEE Trans. Neural Netw. Learn. Syst., to appear.
[12] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, "A multimedia retrieval framework based on semi-supervised ranking and relevance feedback," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, 2012.
[13] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, 2005.
[14] J. Lee Rodgers and W. A. Nicewander, "Thirteen ways to look at the correlation coefficient," The American Statistician, vol. 42, no. 1, 1988.
[15] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[17] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, Carnegie Mellon University, 2005.
[18] X. Chang and Y. Yang, "Semi-supervised feature analysis by mining correlations among multiple tasks," IEEE Trans. Neural Netw. Learn. Syst., 2016.
[19] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, 2006.
[20] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. MIT Press, 2006.
[21] J. Lafferty and L. Wasserman, "Statistical analysis of semi-supervised regression," in NIPS, 2007.
[22] Z. Zhao and H. Liu, "Semi-supervised feature selection via spectral analysis," in SDM, 2007.
[23] A. Alalga, K. Benabdeslem, and N. Taleb, "Soft-constrained Laplacian score for semi-supervised multi-label feature selection," Knowl. Inf. Syst., vol. 47, no. 1, 2016.
[24] Z. Xu, I. King, M. R. Lyu, and R. Jin, "Discriminative semi-supervised feature selection via manifold regularization," IEEE Trans. Neural Netw., vol. 21, no. 7, 2010.
[25] Z. Ma, F. Nie, Y. Yang, J. R. Uijlings, N. Sebe, and A. G. Hauptmann, "Discriminating joint feature analysis for multimedia data understanding," IEEE Trans. Multimedia, vol. 14, no. 6, 2012.
[26] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the surprising behavior of distance metrics in high dimensional space." Springer, 2001.
[27] S. Xiang, F. Nie, C. Zhang, and C. Zhang, "Nonlinear dimensionality reduction with local spline embedding," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, 2009.

[28] C. Shi, Q. Ruan, G. An, and R. Zhao, "Hessian semi-supervised sparse feature selection based on ℓ2,1/2-matrix norm," IEEE Trans. Multimedia, vol. 17, no. 1, pp. 16-28, 2015.
[29] K. I. Kim, F. Steinke, and M. Hein, "Semi-supervised regression using Hessian energy with an application to semi-supervised dimensionality reduction," in NIPS, 2009.
[30] O. Chapelle, B. Schölkopf, and A. Zien, A Discussion of Semi-Supervised Learning and Transduction. MIT Press, 2006.
[31] X. Chang, F. Nie, Y. Yang, and H. Huang, "A convex formulation for semi-supervised multi-label feature selection," in AAAI, 2014.
[32] Z. Xu, I. King, M. R.-T. Lyu, and R. Jin, "Discriminative semi-supervised feature selection via manifold regularization," IEEE Trans. Neural Netw., vol. 21, no. 7, 2010.
[33] J. Ren, Z. Qiu, W. Fan, H. Cheng, and S. Y. Philip, "Forward semi-supervised feature selection," in PAKDD, 2008.
[34] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, "Decision trees for hierarchical multi-label classification," Mach. Learn., vol. 73, no. 2, 2008.
[35] F. Nie, H. Wang, H. Huang, and C. H. Ding, "Adaptive loss minimization for semi-supervised elastic embedding," in IJCAI, 2013.
[36] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," IEEE Trans. Knowl. Data Eng., vol. 20, no. 1, 2008.
[37] D. Wang, F. Nie, and H. Huang, "Large-scale adaptive semi-supervised learning via unified inductive and transductive model," in ACM SIGKDD, 2014.
[38] F. Nie, H. Huang, C. Ding, D. Luo, and H. Wang, "Robust principal component analysis with non-greedy ℓ1-norm maximization," in IJCAI, 2011.
[39] C. Ding, D. Zhou, X. He, and H. Zha, "R1-PCA: rotational invariant ℓ1-norm principal component analysis for robust subspace factorization," in ICML, 2006.
[40] H. Liu, Y. Liu, and F. Sun, "Robust exemplar extraction using structured sparse coding," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 8, 2015.
[41] F. Nie, X. Wang, and H. Huang, "Clustering and projected clustering with adaptive neighbors," in ACM KDD, 2014.
[42] Y. Jiang, G. Ye, S. Chang, D. P. W. Ellis, and A. C. Loui, "Consumer video understanding: a benchmark database and an evaluation of human and machine performance," in ICMR, 2011.
[43] A. C. Loui, J. Luo, S. Chang, D. Ellis, W. Jiang, L. S. Kennedy, K. Lee, and A. Yanagawa, "Kodak's consumer video benchmark data set: concept definition and annotation," in ACM SIGMM MIR, 2007.
[44] [Online]. Available:
[45] M.-y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," Carnegie Mellon Univ., Tech. Rep. CMU-CS.
[46] C. Schüldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in ICPR, 2004.
[47] G. C. Cawley, N. L. C. Talbot, and M. Girolami, "Sparse multinomial logistic regression via Bayesian L1 regularisation," in NIPS, 2006.
[48] F. Nie, H. Huang, X. Cai, and C. H. Q. Ding, "Efficient and robust feature selection via joint ℓ2,1-norms minimization," in NIPS, 2010.
[49] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," Carnegie Mellon University, Tech. Rep. CMU-CALD-02-107, 2002.
[50] J. Zhao, K. Lu, and X. He, "Locality sensitive semi-supervised feature selection," Neurocomputing, vol. 71, no. 10-12, 2008.
[51] S. Wang, Y. Yang, Z. Ma, X. Li, C. Pang, and A. G. Hauptmann, "Action recognition by exploring data distribution and feature correlation," in CVPR, 2012.
[52] M. Maier, U. V. Luxburg, and M. Hein, "Influence of graph construction on graph-based clustering measures," in NIPS, 2008.
Hein, Influence of graph construction on graph-based clustering measures, in NIPS, 008. [53] M. H. Rohban and H. R. Rabiee, Supervised neighborhood graph construction for semi-supervised classification, Pattern Recognition, vol. 45, no. 4, pp , 01. [54] S. Sun, Multi-view laplacian support vector machines, in International Conference on Advanced Data Mining and Applications, 011, pp. 09. [55] A. Iosifidis, A. Tefas, and I. Pitas, Regularized extreme learning machine for multi-view semi-supervised action recognition, Neurocomputing, vol. 145, pp. 50 6, 014. [56] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, Detection of co-salient objects by looking deep and wide, International Journal of Computer Vision, vol. 10, no., pp. 15 3, 016. [57] D. Zhang, J. Han, J. Han, and L. Shao, Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining, IEEE Trans. Neural Netw. Learning Syst., vol. 7, no. 6, pp , 016. Minnan Luo received the Ph. D. degree from the Department of Computer Science and Technology, Tsinghua University, China, in 014. Currently, she is an Assistant Professor in the School of Electronic and Information Engineering at Xi an Jiaotong University. She was a Post-Doctoral Research with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Her research interests include machine learning and optimization, crossmedia retrieval and fuzzy system. Xiaojun Chang received the PhD degree in computer science from Centre for Quantum Computation and Intelligent Systems (QCIS), University of Technology Sydney, Australia in 016. He has been working as a visiting student at the language technology institute of Carnegie Mellon University. His main research interests include machine learning, data mining and computer vision. Liqiang Nie received the B.E. degree from the Xi an Jiaotong University and the Ph.D. degree from the National University of Singapore in 009 and 013, respectively. He is currently a Research Fellow with the School of Computing, National University of Singapore. His current research interests include information retrieval and healthcare analytics. Yi Yang received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 010. He is currently an associate professor with University of Technology Sydney, Australia. He was a Post-Doctoral Research with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. His current research interest include machine learning and its applications to multimedia content analysis and computer vision, such as multimedia indexing and retrieval, surveillance video analysis and video semantics understanding. Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universitat Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University, USA, in He is currently with the Department of Computer Science and the Language Technologies Institute, CMU. His research interests include natural language processing, speech understanding and synthesis, video analysis, and machine learning. Qinghua Zheng received the B.S. degree in computer software in 1990, the M.S. degree in computer organization and architecture in 1993, and the Ph.D. degree in system engineering in 1997 from Xi an Jiaotong University, China. He was a postdoctoral researcher at Harvard University in 00. 
He is currently a professor at Xi'an Jiaotong University and the dean of the Department of Computer Science. His research areas include computer network security, intelligent e-learning theory and algorithms, multimedia e-learning, and trustworthy software.
