An Adaptive Semi-Supervised Feature Analysis for Video Semantic Recognition


Minnan Luo, Xiaojun Chang, Liqiang Nie, Yi Yang, Alexander Hauptmann and Qinghua Zheng

Corresponding author: Xiaojun Chang. This work was funded by the National Science Foundation of China, the National Key Research and Development Program of China, the Data to Decisions Cooperative Research Centre, and the China Postdoctoral Science Foundation. Minnan Luo and Qinghua Zheng are with the SPKLSTN Lab, Department of Computer Science, Xi'an Jiaotong University, Shaanxi, China, {minnluo,qhzheng}@mail.xjtu.edu.cn. Xiaojun Chang and Yi Yang are with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Australia. Liqiang Nie is with the School of Computer Science and Technology, Shandong University, nieliqiang@gmail.com. Alexander Hauptmann is with the School of Computer Science, Carnegie Mellon University, PA, USA, alex@cs.cmu.edu.

Abstract: Video semantic recognition usually suffers from the curse of dimensionality and the absence of enough high-quality labeled instances; thus semi-supervised feature selection has gained increasing attention for its efficiency and comprehensibility. Most of the previous methods assume that videos with close distance (neighbors) have similar labels and characterize the intrinsic local structure through a predetermined graph of both labeled and unlabeled data. However, besides the parameter tuning problem underlying the construction of the graph, the affinity measurement in the original feature space usually suffers from the curse of dimensionality. Additionally, the predetermined graph is separated from the procedure of feature selection, which might lead to downgraded performance for video semantic recognition. In this paper, we exploit a novel semi-supervised feature selection method from a new perspective. The primary assumption underlying our model is that instances with similar labels should have a larger probability of being neighbors. Instead of using a predetermined similarity graph, we incorporate the exploration of the local structure into the procedure of joint feature selection so as to learn the optimal graph simultaneously. Moreover, an adaptive loss function is exploited to measure the label fitness, which significantly enhances the model's robustness to videos with a small or substantial loss. We propose an efficient alternating optimization algorithm to solve the proposed challenging problem, together with analyses of its convergence and computational complexity in theory. Finally, extensive experimental results on benchmark datasets illustrate the effectiveness and superiority of the proposed approach on video semantic recognition related tasks.

Index Terms: Semi-Supervised Learning, Feature Selection, Video Semantic Recognition, Manifold Regularization

I. INTRODUCTION

With the explosion of videos along with the rapid development of the Internet, video semantic recognition has become a fundamental process that allows the creation of several applications [1], such as semantic video indexing and retrieval, object detection and tracking, video event detection and human activity analysis. However, these tasks usually suffer from the curse of dimensionality since the low-level visual features are often high-dimensional and noisy [2], [3].
As a result, it is crucial to exploit efficient dimensionality reduction technology to promote the performance of video semantic recognition. Typically, feature selection techniques reduce the dimensionality by selecting a small subset of the original features, and are thus capable of alleviating the curse of dimensionality while enhancing the comprehensibility of dimensionality reduction [4]. From the perspective of utilizing label information, feature selection algorithms can be roughly categorized into unsupervised, supervised and semi-supervised types. Unsupervised feature selection methods suppose that all of the data are collected without labels. They either involve maximizing clustering performance as quantified by some index [5]-[7], or consider the selection of features based on feature dependency and relevance [8]-[10]. Although unsupervised feature selection methods are capable of handling unlabeled data, they ignore the discriminative information from labels [11], [12]. Instead, supervised feature selection methods rely only on the labeled data and usually evaluate the importance of features based on the correlation between features and the class labels [13], such as Pearson correlation coefficients [14], Fisher score [15] and information gain [16]. Supervised feature selection usually achieves reliable performance with a sufficient number of labeled training data. However, in many practical applications of video semantic recognition, the vast majority of large-scale videos are collected without labels. Annotating these unlabeled videos is dramatically expensive and time-consuming. Therefore, semi-supervised learning for feature selection, which aims to learn a discriminative subset of the original features from both labeled and unlabeled data, is of potentially great practical significance [17]-[19].

In the framework of semi-supervised learning, most of the feature selection methods are based on one of the following two assumptions: (1) the smoothness (cluster) assumption, which assumes that data points are likely to have the same label if they are close to each other [20]; (2) the manifold assumption, which states that the prediction function for feature selection lives in a low-dimensional manifold of the higher-dimensional feature space [21]. Under the smoothness assumption, graph-based semi-supervised feature selection, which inherits excellent properties from spectral graph theory, has attracted great attention. For example, Zhao et al. [22] pioneered filter-based semi-supervised feature selection with particular graph theory and constructed cluster indicators to evaluate the feature relevance one by one. Alalga et al. [23] transformed the labeled part of the data into soft constraints and proposed a unified framework for semi-supervised multi-label feature selection based on a soft-constrained Laplacian score.

Under the manifold assumption, embedded feature selection methods attempt to use the underlying geometric structure of the data and integrate the selection into the training stage of the semi-supervised learning process. For example, Xu et al. [24] took advantage of a semi-supervised SVM with manifold regularization and conducted feature selection by maximizing the classification margin between different classes while simultaneously exploiting the geometry of the marginal distribution. Ma et al. [25] extended ℓ2,1-norm regularized feature selection to select relevant features with manifold fitting on the data structure.

It is noteworthy that the methods mentioned above usually characterize the intrinsic local structure with a predetermined graph Laplacian matrix based on a Gaussian function. However, besides the parameter tuning problem in the construction of the graph [22], the affinity measurement in the original feature space usually suffers from the curse of dimensionality [26]. As a result, the predetermined data affinity matrix, being separated from the procedure of feature selection, might not be the optimal one and may fail to capture the underlying geometric structure accurately [27]. As argued in [28], Laplacian regularization biases the solution towards a constant function and is short of extrapolating power. To address these issues, Shi et al. [28] employed Hessian regularization [29] instead of Laplacian regularization to exploit the intrinsic local geometry of the data manifold, and leveraged the ℓ2,1/2-matrix norm for joint sparse feature selection. However, this method is conducted in a transductive way [30] and thus cannot be used to predict out-of-sample testing data. Han et al. [2] encoded the local geometric structures of the training data as a linear combination of within-class scatter matrices and a spline scatter matrix. However, besides introducing more parameters, both of the mentioned methods separate the characterization of the local structure from the procedure of semi-supervised feature selection.

Note that there are also some semi-supervised feature selection methods which do not consider local structure characterization. For example, Chang et al. [31] associated each training data point with a pre-determined score and developed a convex semi-supervised multi-label feature selection framework without relying on graphs and eigen-decomposition. However, in many practical applications, it is hard to determine the score for each training data point. Xu et al. [11] proposed a max-relevance and min-redundancy criterion based on Pearson's correlation coefficient for semi-supervised feature selection. However, this filter-based method ignores the underlying learning algorithm which is utilized to train classifiers from the labeled data [32]. Ren et al. [33] introduced a search-based framework that randomly selects unlabeled data to form new training sets and iteratively adds the most frequently selected feature to the resulting feature subset. However, this method neglects the interaction among features as well as the interaction between the feature selection heuristics and the corresponding classifier [4].

In this paper, we develop an adaptive framework for semi-supervised feature selection from a new point of view.
Instead of using a predetermined affinity matrix to reflect the underlying local geometric structure of high-dimensional training data, the primary assumption underlying our model is that data points with similar labels should have a larger probability of being neighbors. This strategy is also used in [34] to guide multi-label classification with random forests, where it achieves better performance than using the feature space itself. Under this assumption, we incorporate the exploration of the local structure into the procedure of joint feature selection so as to learn the optimal graph simultaneously. To the best of our knowledge, this is the first work which explicitly learns the optimal graph in the procedure of semi-supervised feature selection. Additionally, since the vast majority of data in the framework of semi-supervised learning is collected without labels, it is significantly important in practice to improve the semi-supervised model's robustness to outliers. Note that most of the previous methods utilize a squared ℓ2-norm loss function to measure the error of the predicted labels, which is sensitive to outliers, since instances with significant loss dominate the objective function. Although an ℓ1-norm based loss function can relieve this issue to some extent, it is in turn sensitive to small losses [35]. For this issue, we leverage a more robust loss function which is capable of overcoming the disadvantages of both the ℓ1-norm and ℓ2-norm based loss functions [35]. Our model yields encouraging experimental results on several benchmark datasets and demonstrates its effectiveness and superiority on video semantic recognition related tasks.

In summary, we highlight the differences from previous methods and describe the main contributions of this paper as follows: 1) We incorporate the exploration of the optimal similarity matrix into the procedure of semi-supervised joint feature selection by adaptively assigning the nearest neighbors of each video. 2) For real-world applications, we exploit an adaptive loss function to measure the label fitness over both labeled and unlabeled data, which enhances the model's robustness to noise and outliers. 3) To solve the proposed non-smooth objective function, we exploit an efficient alternating algorithm and prove its convergence theoretically and experimentally.

The remainder of this paper is organized as follows. In Section II, we give a brief review of the traditional framework for graph-based semi-supervised feature selection. A novel method with adaptive neighbor assignment and adaptive loss measurement is proposed in Section III, together with an efficient algorithm for the proposed challenging problem. The theoretical study of the convergence and computational complexity of the proposed algorithm is presented in Section IV. In Section V, extensive experiments over several benchmark datasets are conducted to verify the effectiveness of the proposed method on video semantic recognition related tasks, including video concept detection and human activity analysis. Finally, conclusions are given in Section VI.

Notations and Terms: We employ the usual notations throughout this paper. For any d-dimensional vector z = [z_1, z_2, ..., z_d]^T in R^d, its ℓ_p-norm is defined as ‖z‖_p = (Σ_{i=1}^d |z_i|^p)^{1/p}. For any matrix A = (a_ij) in R^{p×q}, we denote its i-th row and j-th column by a^i and a_j, respectively. The Frobenius norm of A is defined as ‖A‖_F = (Σ_i Σ_j a_ij^2)^{1/2} = (Tr(AA^T))^{1/2}, where Tr(·) refers to the trace of a matrix. The ℓ_{2,1}-norm of A, which induces a joint sparsity pattern, is formulated as ‖A‖_{2,1} = Σ_{i=1}^p ‖a^i‖_2. Let D = diag(d_1, d_2, ..., d_p) in R^{p×p} be a diagonal matrix with (i,i)-th diagonal element d_i (for all i). Let 0 and 1 be column vectors whose elements are all zero and one, respectively.

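For readers who want to check these definitions numerically, the following minimal numpy sketch computes the Frobenius and ℓ2,1 norms of a small matrix; the array values are arbitrary illustrations:

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 2.0]])

fro = np.sqrt(np.sum(A ** 2))              # ||A||_F = sqrt(sum_ij a_ij^2) = sqrt(Tr(A A^T))
l21 = np.sum(np.linalg.norm(A, axis=1))    # ||A||_{2,1} = sum_i ||a^i||_2 (row-wise l2 norms)

print(fro, np.linalg.norm(A, 'fro'))       # both give sqrt(30)
print(l21)                                 # 5 + 0 + sqrt(5): zero rows contribute nothing,
                                           # which is why the l2,1 penalty favors row sparsity
```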
II. PRELIMINARY AND MOTIVATION

We first recall the traditional framework for graph-based semi-supervised feature selection, and then derive the motivation of our method formally.

In this paper, the training videos are collected into a matrix X = [x_1, x_2, ..., x_n] in R^{d×n}, where x_i in R^d refers to the low-level feature vector of the i-th video and n is the total number of videos in the training set. In the framework of semi-supervised learning, we assume the first m (m ≪ n) videos x_1, x_2, ..., x_m are labeled with label vectors y_i = [y_{i1}, y_{i2}, ..., y_{ic}] in {0,1}^c (i = 1, 2, ..., m), where c refers to the number of semantic categories. Specifically, y_ij = 1 when x_i is in the j-th class, and y_ij = 0 otherwise. The remaining videos x_{m+1}, x_{m+2}, ..., x_n are unlabeled. Note that we focus on the standard case where each video belongs to only one category, i.e., the model is formulated for the single-label case.

For feature selection over both labeled and unlabeled videos, the graph Laplacian is widely used to preserve the manifold structure of the video representation. Let G = {X, Ω} be an undirected weighted graph with n nodes of training videos, where Ω = (ω_ij) in R^{n×n} is the affinity graph (similarity matrix), with each entry ω_ij representing the similarity between videos x_i and x_j. Typically, the similarity matrix Ω is predetermined in the original high-dimensional feature space with a Gaussian function, i.e.,

ω_ij = exp(−‖x_i − x_j‖^2 / σ^2) if x_i ∈ N_K(x_j) or x_j ∈ N_K(x_i), and ω_ij = 0 otherwise,

where N_K(x_i) denotes the set of K-nearest neighbors of x_i in the original high-dimensional space. The variance parameter σ, empirically tuned in real-world applications, affects the performance significantly [36]. Based on graph theory, the corresponding Laplacian matrix is defined as L_Ω = D_Ω − (Ω^T + Ω)/2, where D_Ω is a diagonal matrix whose i-th diagonal element is Σ_{j=1}^n (ω_ij + ω_ji)/2 for i = 1, 2, ..., n.

Let f_i be the predicted label vector for the i-th video x_i (i = 1, 2, ..., n). Under the assumption that videos with close distance (neighbors) should have similar labels, manifold smoothness on the graph G is ensured by minimizing the following objective function [37]:

(1/2) Σ_{i=1}^n Σ_{j=1}^n ‖f_i − f_j‖^2 ω_ij = Tr(F^T L_Ω F),    (1)

where F = [F_l; F_u] denotes the predicted label matrix over the training videos. Specifically, F_l = [f_1, f_2, ..., f_m]^T and F_u = [f_{m+1}, f_{m+2}, ..., f_n]^T refer to the predicted label matrices of labeled and unlabeled videos, respectively. Combining this with the label fitness measured through a loss function L(W, F; X), traditional graph-Laplacian-based semi-supervised feature selection solves the following optimization problem:

min_{W,b,F} Tr(F^T L_Ω F) + γ L(W, F; X) + λ Ω(W)   s.t. F_l = Y_l,    (2)

where W in R^{d×c} is a transformation matrix; Y_l = [y_1, y_2, ..., y_m]^T in {0,1}^{m×c} denotes the label matrix over the labeled training videos; γ and λ are trade-off parameters which balance the manifold smoothness (the first term), the label fitness (the second term) and the regularization of the transformation matrix (the third term). The regularization term Ω(W) is used to encourage a specific structure of W. For example, in [25], Ω(W) = ‖W‖_{2,1} is imposed to induce sparsity across features and non-sparsity between classes, so that features discriminative for all classes receive larger weights. The constraint on labeled videos makes the predicted labels F_l consistent with the ground-truth label matrix Y_l.

Note that the similarity matrix Ω used in objective function (2) depends on neighbors pre-assigned according to a Gaussian function. On one hand, the empirically tuned parameter σ usually deteriorates the accuracy and informativeness of the similarity graph. On the other hand, the predetermined similarity matrix is based on norm-based distance measurement in the original high-dimensional feature space, which inevitably suffers from the curse of dimensionality and fails to capture the intrinsic geometric relations accurately. Additionally, the loss function L(W, F; X) is usually specified as a squared ℓ2-norm based distance measurement, which is known to be sensitive to outliers [38]-[40]. As a result, it is significantly important to develop a robust semi-supervised feature selection model which can simultaneously and adaptively characterize the underlying local geometric structure of both labeled and unlabeled videos.
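As a concrete illustration of this conventional pipeline (not of the method proposed below), the following sketch builds the Gaussian K-nearest-neighbor affinity matrix, its Laplacian, and the smoothness term of Eq. (1); all data and parameter values are toy placeholders:

```python
import numpy as np

def gaussian_knn_affinity(X, K=5, sigma=1.0):
    """Predetermined affinity of Section II: X is d x n, columns are videos."""
    n = X.shape[1]
    d2 = np.square(X[:, :, None] - X[:, None, :]).sum(axis=0)  # pairwise squared distances
    Omega = np.zeros((n, n))
    for i in range(n):
        knn = np.argsort(d2[i])[1:K + 1]                       # K nearest, excluding x_i itself
        Omega[i, knn] = np.exp(-d2[i, knn] / sigma ** 2)
    return np.maximum(Omega, Omega.T)   # keep w_ij if x_i in N_K(x_j) or x_j in N_K(x_i)

rng = np.random.default_rng(0)
X = rng.random((10, 40))                # 40 toy "videos" with 10-dim features
F = rng.random((40, 3))                 # toy predicted labels for 3 classes
Omega = gaussian_knn_affinity(X)
D = np.diag((Omega + Omega.T).sum(axis=1) / 2)
L = D - (Omega + Omega.T) / 2           # L_Omega = D_Omega - (Omega^T + Omega)/2
print(np.trace(F.T @ L @ F))            # equals 0.5 * sum_ij ||f_i - f_j||^2 w_ij
```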
III. THE PROPOSED METHODOLOGY

In this section, we exploit a semi-supervised feature selection model by incorporating the exploration of the local structure into the procedure of joint feature selection.

A. Optimal Similarity Graph

We adaptively assign neighbors to each video x_i with a probability vector s_i = [s_{i1}, s_{i2}, ..., s_{in}]^T in R^n, such that s_ij ≥ 0 and s_i^T 1 = 1 for i = 1, 2, ..., n. Instead of supposing that videos with close distance (neighbors) should have similar labels, in this paper we assume that close labels imply high similarity and vice versa. As a result, an optimal similarity graph is estimated through the following optimization problem:

min_{s_i ≥ 0, s_i^T 1 = 1} J_{r_i}(s_i, F) = (1/2) Σ_{j=1}^n ( ‖f_i − f_j‖_2^2 s_ij + r_i s_ij^2 ),    (3)

where the regularization term with parameter r_i is necessary because its absence leads to a trivial solution of problem (3), i.e., only the nearest video would be the neighbor of x_i, with probability 1. Conversely, the regularization term alone would make every video a neighbor of x_i with the same probability 1/n. Therefore, the objective function of (3) encourages close labels f_i and f_j, measured by ‖f_i − f_j‖_2^2, to have a larger probability s_ij, and vice versa. Over the entire training dataset, we denote J_r(S, F) = Σ_{i=1}^n J_{r_i}(s_i, F) and solve the following optimization problem for an optimal neighbor assignment:

min_S J_r(S, F) = (1/2) Σ_{i=1}^n Σ_{j=1}^n ( ‖f_i − f_j‖_2^2 s_ij + r_i s_ij^2 )   s.t. s_i ≥ 0, s_i^T 1 = 1 (for all i).    (4)

In practice, we set the initial similarity matrix S according to a Gaussian function and update the neighbor assignment based on the label difference of each pair of videos. This strategy is totally different from the traditional methods, which use a predetermined similarity graph to characterize the underlying geometric structure of the original features. Since the c-dimensional label space is of much lower dimension than the original d-dimensional feature space, the optimal similarity graph not only mitigates the curse of dimensionality to some extent, but also captures the intrinsic structure effectively.

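To make the role of r_i concrete, the two limiting cases of problem (3) can be written out explicitly; this is a routine verification rather than a derivation from the paper. Writing h_ij = ‖f_i − f_j‖_2^2:

```latex
% r_i -> 0: a linear program over the simplex; all mass goes to the closest label
\min_{s_i \ge 0,\; s_i^\top \mathbf{1} = 1} \sum_{j} h_{ij} s_{ij}
  \;\Longrightarrow\; s_{ij^\star} = 1 \text{ for } j^\star = \arg\min_j h_{ij},\ s_{ij} = 0 \text{ otherwise};
% r_i -> \infty: only the quadratic term matters; the minimizer is uniform
\min_{s_i \ge 0,\; s_i^\top \mathbf{1} = 1} \sum_{j} s_{ij}^2
  \;\Longrightarrow\; s_{ij} = 1/n \ \ \forall j.
```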
B. Optimal Graph Embedded Feature Selection

Based on the analysis above, we incorporate optimal similarity graph learning into the procedure of semi-supervised feature selection and specify the label fitness with an adaptive loss for better robustness. This idea can be formulated as a concise optimization problem as follows:

min_{W,b,F,S} J_r(S, F) + γ ‖X^T W + 1 b^T − F‖_σ + λ ‖W‖_{2,1}   s.t. F_l = Y_l; s_i ≥ 0, s_i^T 1 = 1 (for all i),    (5)

where γ and λ are trade-off parameters. Given any matrix A in R^{n×c} with rows a^i, the adaptive loss with parameter σ is defined as

‖A‖_σ = Σ_{i=1}^n (1 + σ) ‖a^i‖_2^2 / (‖a^i‖_2 + σ).    (6)

Note that the adaptive loss smoothly interpolates between the ℓ1-norm and ℓ2-norm based loss functions. It thus takes the advantages of both norm-based loss measurements and is more robust to videos with either a small or a substantial loss [35]. From the objective function of problem (5), we arrive at a new framework for semi-supervised feature selection. This strategy takes advantage of both the optimal neighbor assignment for the model's accuracy and the adaptive loss measurement for the model's robustness; we name it Optimal Graph Embedded Semi-supervised Feature Selection (OGE-SFS). Considering the convexity of the objective function with respect to the variables W, b, S and F, an alternating optimization algorithm can be readily employed to address the proposed challenging problem (5). Once W is obtained, we sort the d features according to ‖w^i‖_2 (1 ≤ i ≤ d) in descending order and select the top-ranked ones.
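Before moving to the optimization, a quick numerical sketch makes the interpolation behavior of the adaptive loss in Eq. (6) visible; the residual values below are arbitrary:

```python
import numpy as np

def adaptive_loss(r, sigma=0.1):
    """Adaptive loss of Eq. (6) for a single residual norm r = ||a^i||_2 >= 0."""
    return (1 + sigma) * r ** 2 / (r + sigma)

r = np.array([0.01, 0.1, 1.0, 10.0])
print("adaptive:  ", np.round(adaptive_loss(r), 4))
print("squared l2:", np.round(r ** 2, 4))
print("l1:        ", np.round(r, 4))
# For r << sigma the loss behaves like (1+sigma)/sigma * r^2 (quadratic, smooth at 0);
# for r >> sigma it behaves like (1+sigma) * r (linear), so large residuals from
# outlying videos no longer dominate the objective the way a squared loss would.
```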
Fig. 1: Flowchart of the alternating algorithm for OGE-SFS, alternating between adaptive neighbor assignment and the re-weighted algorithm for label prediction.

C. Optimization Strategy

The solution of optimization problem (5) can be approximately attained by updating S and {W, b, F} alternately, with the other block fixed. For a better understanding, the flowchart of the optimization procedure is shown in Figure 1, where we name the step of optimizing S with fixed {W, b, F} adaptive neighbor assignment, and exploit a re-weighted algorithm for label prediction by optimizing {W, b, F} with fixed S. In other words, the optimization strategy aims to learn the similarity matrix S and the predicted label matrix F simultaneously. Different from the traditional semi-supervised feature selection models, which predict labels given a predetermined similarity matrix, the proposed model is capable of achieving better performance since it also learns the optimal graph for feature selection. Subsequently, we introduce the details of the alternating optimization algorithm.

1) Adaptive Neighbor Assignment: With fixed W, b and F, optimization problem (5) decouples into the following n independent sub-problems, owing to the independence of the videos in the training dataset:

min_{s_i ≥ 0, s_i^T 1 = 1} (1/2) Σ_{j=1}^n ( ‖f_i − f_j‖_2^2 s_ij + r_i s_ij^2 )    (7)

for i = 1, 2, ..., n. After some evident reformulation, this problem is equivalent to the following compact form:

min_{s_i ≥ 0, s_i^T 1 = 1} ‖ s_i + (1/(2 r_i)) h_i ‖_2^2,    (8)

where h_i = [h_{i1}, h_{i2}, ..., h_{in}]^T is an n-dimensional vector corresponding to the i-th video and h_ij = ‖f_i − f_j‖_2^2 for j = 1, 2, ..., n. This optimization problem is a Euclidean projection onto the simplex. Following the strategy used in [41], we formulate the Lagrangian function of (8) as

L(s_i, η, ζ) = ‖ s_i + (1/(2 r_i)) h_i ‖_2^2 − η (s_i^T 1 − 1) − ζ^T s_i,    (9)

where η in R and ζ = [ζ_1, ζ_2, ..., ζ_n]^T in R^n are the Lagrangian multipliers. Let the optimal solution of problem (8) be s_i*, with optimal Lagrangian multipliers η* and ζ*. According to the KKT conditions, we have the optimal solution

s_ij* = ( −(1/(2 r_i)) h_ij + η* )_+    (10)

for j = 1, 2, ..., n. Because of the locality assumption that only the K nearest neighbors have a chance to connect to x_i, the optimal s_i* has only K nonzero elements. Without loss of generality, let h_{i1}, h_{i2}, ..., h_{in} be sorted in ascending order. We suppose the K-th smallest component of s_i* is larger than 0 and the (K+1)-th smallest component is less than or equal to 0, i.e.,

−(1/(2 r_i)) h_{iK} + η* > 0;   −(1/(2 r_i)) h_{i(K+1)} + η* ≤ 0.    (11)

Based on the constraint s_i^T 1 = 1 and Eq. (10), we have Σ_{j=1}^K ( −(1/(2 r_i)) h_ij + η* ) = 1, which yields η* = (1 + (1/(2 r_i)) Σ_{j=1}^K h_ij ) / K. As a result, the regularization parameter r_i satisfies the following inequality:

(K/2) h_{iK} − (1/2) Σ_{j=1}^K h_ij < r_i ≤ (K/2) h_{i(K+1)} − (1/2) Σ_{j=1}^K h_ij.    (12)

We set r_i = (K/2) h_{i(K+1)} − (1/2) Σ_{j=1}^K h_ij to obtain an optimal solution s_i* with exactly K nonzero components, i.e.,

s_ij* = ( −(1/(2 r_i)) h_ij + η* )_+    (13)

for j = 1, 2, ..., n. In this case, the optimal neighbor assignment s_i* depends only on the number of neighbors K for each video, which is more intuitive and easier to tune than the regularization parameter r_i, because K is an integer and has an explicit meaning in practice [41].

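The closed form (13) is straightforward to implement. The sketch below assigns K sparse neighbor probabilities per video from the current label matrix F; the variable names are ours, and a small constant guards against ties where r_i would vanish:

```python
import numpy as np

def assign_neighbors(F, K=5):
    """Adaptive neighbor assignment, Eqs. (12)-(13): row i of S gets exactly K
    nonzero probabilities that sum to one and decrease with label distance."""
    n = F.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        h = np.square(F - F[i]).sum(axis=1)            # h_ij = ||f_i - f_j||_2^2
        h[i] = np.inf                                  # a video is not its own neighbor
        idx = np.argsort(h)[:K + 1]                    # K nearest labels plus the (K+1)-th
        hK = h[idx[:K]]
        r_i = (K * h[idx[K]] - hK.sum()) / 2 + 1e-12   # upper bound of Eq. (12)
        eta = (1 + hK.sum() / (2 * r_i)) / K
        S[i, idx[:K]] = np.maximum(eta - hK / (2 * r_i), 0.0)
    return S
```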
2) Label Prediction with a Re-weighted Algorithm: For a cleaner presentation, let f_λ(W, b, F; S) = Tr(F^T L_S F) + λ‖W‖_{2,1} and g_i(W, b, F) = W^T x_i + b − f_i. With a fixed similarity matrix S, and noting from Eq. (1) that the F-dependent part of J_r(S, F) equals Tr(F^T L_S F), optimization problem (5) with respect to the variables W, b and F turns into

min_{W,b,F} φ = f_λ(W, b, F; S) + γ Σ_{i=1}^n (1 + σ) ‖g_i(W, b, F)‖_2^2 / (‖g_i(W, b, F)‖_2 + σ)   s.t. F_l = Y_l.    (14)

Following the adaptive loss minimization approach [35], the derivative of the objective of (14) with respect to {W, b, F} can be written as the derivative of f_λ(W, b, F; S) plus 2γ Σ_{i=1}^n d_i g_i(W, b, F) times the derivative of g_i(W, b, F), where

d_i = (1 + σ) ( ‖g_i(W, b, F)‖_2 + 2σ ) / ( 2 ( ‖g_i(W, b, F)‖_2 + σ )^2 )    (15)

for i = 1, 2, ..., n. Given d_i for the i-th video x_i, problem (14) can be reformulated, according to this derivation, as

min_{W,b,F} f_λ(W, b, F; S) + γ Σ_{i=1}^n d_i ‖g_i(W, b, F)‖_2^2   s.t. F_l = Y_l.    (16)

To address problem (14), we first calculate d_i according to Eq. (15) in terms of the current solution, and then update {W, b, F} by addressing problem (16). Therefore, the key step of the optimization procedure for problem (14) lies in solving the following problem:

min_{W,b,F} f_λ(W, b, F; S) + γ Tr(A^T D A)   s.t. F_l = Y_l,    (17)

where A = X^T W + 1 b^T − F and D is a diagonal matrix whose i-th diagonal element is

d_i = (1 + σ) ( ‖W^T x_i + b − f_i‖_2 + 2σ ) / ( 2 ( ‖W^T x_i + b − f_i‖_2 + σ )^2 ).    (18)

It can be verified that the objective function of problem (17) is jointly convex with respect to the variables W, b and F. Therefore, we set the derivative of the objective of (17) with respect to b to 0 and obtain

b = (1/(1^T D 1)) F^T D 1 − (1/(1^T D 1)) W^T X D 1.    (19)

Note that the derivative of ‖W‖_{2,1} with respect to W is R^{−1} W, where R = diag(r_11, r_22, ..., r_dd) is a diagonal matrix with (i,i)-th diagonal element r_ii = 2‖w^i‖_2 for i = 1, 2, ..., d. By substituting Eq. (19) into the objective of problem (17) and setting its derivative with respect to W to 0, we have

W = ( X N X^T + λ R^{−1} )^{−1} X N F,    (20)

where N = D − (1/(1^T D 1)) D 1 1^T D is the centering matrix.
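One inner pass of Algorithm 1 can be sketched as follows, with the trade-off γ folded into the weights d_i so that Eqs. (18)-(20) apply verbatim; this is a simplified illustration under our own conventions, not the authors' reference implementation:

```python
import numpy as np

def reweighted_step(X, F, W, b, gamma=1.0, lam=1.0, sigma=0.1):
    """Residual weights of Eq. (18), then the closed-form b and W of
    Eqs. (19)-(20). X is d x n; F is n x c; rows of the residual are g_i."""
    G = X.T @ W + b[None, :] - F
    g = np.linalg.norm(G, axis=1)
    d = gamma * (1 + sigma) * (g + 2 * sigma) / (2 * (g + sigma) ** 2)  # Eq. (18)
    N = np.diag(d) - np.outer(d, d) / d.sum()                           # centering matrix N
    Rinv = np.diag(1.0 / (2 * np.linalg.norm(W, axis=1) + 1e-12))
    W_new = np.linalg.solve(X @ N @ X.T + lam * Rinv, X @ N @ F)        # Eq. (20)
    b_new = (F - X.T @ W_new).T @ d / d.sum()                           # Eq. (19)
    return W_new, b_new, d
```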
Moreover, by substituting Eqs. (19) and (20) into the objective of problem (17), we arrive at Tr(F^T (L_S + γN − γN X^T (X N X^T + λ R^{−1})^{−1} X N) F) = Tr(F^T M F), where

M = L_S + γN − γN X^T (X N X^T + λ R^{−1})^{−1} X N = [ M_ll, M_lu ; M_ul, M_uu ].    (21)

As a result, the optimal label matrix F_u for the unlabeled videos can be predicted as

F_u = −M_uu^{−1} M_ul Y_l.    (22)

In summary, we describe the re-weighted optimization for problem (14) in Algorithm 1, and the overall alternating optimization algorithm for OGE-SFS in Algorithm 2.

Algorithm 1 Re-weighted algorithm for {W, b, F}
Initialize: W, b and F.
1: while not converged do
2:   Calculate the diagonal matrix D = diag(d_1, d_2, ..., d_n) according to Eq. (18);
3:   Update F by Eq. (22); update W by Eq. (20); update b by Eq. (19).
4: end while

Algorithm 2 Alternating optimization for OGE-SFS
Input: X, c, K, σ, λ and γ.
Output: W.
Initialize: W^0, S^0 and t = 0.
1: repeat
2:   Update W^{t+1}, b^{t+1} and F^{t+1} with fixed S^t by Algorithm 1;
3:   Update S^{t+1} with fixed F^{t+1} by Eq. (13);
4: until convergence

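Putting the two steps together gives a compact sketch of Algorithm 2. It reuses assign_neighbors and reweighted_step from the sketches above; the initialization and iteration counts are arbitrary choices of ours, and γ is folded into the weights as before:

```python
import numpy as np

def oge_sfs(X, Y_l, K=5, gamma=1.0, lam=1.0, sigma=0.1, T=20):
    """Alternating optimization for OGE-SFS (Algorithm 2). X is d x n with the
    first m columns labeled; Y_l is m x c. Returns features ranked by ||w^i||_2."""
    d_dim, n = X.shape
    m, c = Y_l.shape
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((d_dim, c))
    b = np.zeros(c)
    F = np.vstack([Y_l, np.full((n - m, c), 1.0 / c)])   # unlabeled rows start uniform
    for _ in range(T):
        S = assign_neighbors(F, K)                       # step 1: Eq. (13) with fixed F
        Ssym = (S + S.T) / 2
        L_S = np.diag(Ssym.sum(axis=1)) - Ssym
        W, b, d = reweighted_step(X, F, W, b, gamma, lam, sigma)  # step 2: Algorithm 1
        N = np.diag(d) - np.outer(d, d) / d.sum()
        Rinv = np.diag(1.0 / (2 * np.linalg.norm(W, axis=1) + 1e-12))
        M = L_S + N - N @ X.T @ np.linalg.solve(X @ N @ X.T + lam * Rinv, X @ N)
        F[m:] = -np.linalg.solve(M[m:, m:], M[m:, :m] @ Y_l)      # Eq. (22)
    return np.argsort(-np.linalg.norm(W, axis=1)), W, F
```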
IV. CONVERGENCE AND COMPLEXITY ANALYSIS

In this section, we provide a theoretical analysis of the convergence of the proposed alternating optimization algorithm, together with an analysis of its computational complexity.

A. Convergence Analysis

Lemma 1: [35] For any vectors x and y, we have

‖x‖_2^2/(‖x‖_2 + σ) − (‖y‖_2 + 2σ)‖x‖_2^2/(2(‖y‖_2 + σ)^2) ≤ ‖y‖_2^2/(‖y‖_2 + σ) − (‖y‖_2 + 2σ)‖y‖_2^2/(2(‖y‖_2 + σ)^2).

Theorem 1: The re-weighted updating rules in Algorithm 1 monotonically decrease the objective value of optimization problem (14) in each iteration.

Proof: Denote by W~, b~ and F~ the variables after one update of Algorithm 1. Based on the updating rules, we have

f_λ(W~, b~, F~; S) + γ Σ_{i=1}^n d_i ‖g_i(W~, b~, F~)‖_2^2 ≤ f_λ(W, b, F; S) + γ Σ_{i=1}^n d_i ‖g_i(W, b, F)‖_2^2,    (23)

where d_i = (1 + σ)(‖W^T x_i + b − f_i‖_2 + 2σ)/(2(‖W^T x_i + b − f_i‖_2 + σ)^2). On the other hand, applying Lemma 1 with x = g_i(W~, b~, F~) and y = g_i(W, b, F), multiplying by γ(1 + σ), and summing over i, then adding the result to (23), we arrive at

f_λ(W~, b~, F~; S) + γ Σ_{i=1}^n (1 + σ)‖g_i(W~, b~, F~)‖_2^2/(‖g_i(W~, b~, F~)‖_2 + σ) ≤ f_λ(W, b, F; S) + γ Σ_{i=1}^n (1 + σ)‖g_i(W, b, F)‖_2^2/(‖g_i(W, b, F)‖_2 + σ).    (24)

Therefore, the re-weighted updating rules in Algorithm 1 monotonically decrease the objective value of problem (14) in each iteration. The proof is completed.

Theorem 2: The alternating updating rules in Algorithm 2 monotonically decrease the objective value of optimization problem (5) in each iteration until convergence.

Proof: In iteration t, we update the variables W^{t+1}, b^{t+1} and F^{t+1} by Algorithm 1 while keeping S^t fixed. According to Theorem 1, we have

J_r(S^t, F^{t+1}) + γ‖X^T W^{t+1} + 1(b^{t+1})^T − F^{t+1}‖_σ + λ‖W^{t+1}‖_{2,1} ≤ J_r(S^t, F^t) + γ‖X^T W^t + 1(b^t)^T − F^t‖_σ + λ‖W^t‖_{2,1}.    (25)

On the other hand, with fixed F^{t+1}, we update the similarity matrix S by Eq. (13) and obtain

J_r(S^{t+1}, F^{t+1}) ≤ J_r(S^t, F^{t+1}).    (26)

Combining inequalities (25) and (26), we arrive at

J_r(S^{t+1}, F^{t+1}) + γ‖X^T W^{t+1} + 1(b^{t+1})^T − F^{t+1}‖_σ + λ‖W^{t+1}‖_{2,1} ≤ J_r(S^t, F^t) + γ‖X^T W^t + 1(b^t)^T − F^t‖_σ + λ‖W^t‖_{2,1}.

Hence, the objective value of optimization problem (5) decreases monotonically in each iteration until the algorithm converges. The proof is completed.

B. Computational Complexity

As shown in Algorithm 2, the alternating optimization algorithm for the proposed OGE-SFS consists of two steps, i.e., adaptive neighbor assignment and the re-weighted algorithm for label prediction. We analyze the computational cost of these two steps respectively. The main cost of the first step lies in calculating the predicted label distances between each pair of videos, with complexity O(n^2 c). Note that this cost is much smaller than that of the traditional graph Laplacian in the original feature space, O(n^2 d), since c ≪ d in many practical applications. During the label prediction step with a fixed similarity matrix, the major computational cost lies in the matrix inversion required to update W and F. Specifically, let P = X N X^T + λ R^{−1} in R^{d×d}, with computational cost O(nd^2), and Q = X N F in R^{d×c}, with cost O(ndc); the updating rule for W can be rewritten as W = P^{−1} Q. Note that the computational cost of the matrix inversion is O(d^3). Fortunately, this expensive inversion can be avoided [37]. Since updating W is equivalent to solving the optimization problem min_W Tr(W^T P W) − 2Tr(W^T Q), the optimal W can be obtained iteratively through gradient descent, i.e., W^{t+1} = W^t − α(P W^t − Q). In this way, the cost of the inversion reduces to O(T d^2 c), where T is the number of iterations. As a result, the total cost of updating W becomes O(nd^2) + O(T d^2), since c ≪ d and c ≪ n in practice. The same strategy can also be used for updating F.
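The inversion-free update mentioned above can be sketched directly; the step-size rule is a standard choice of ours and is not specified in the paper:

```python
import numpy as np

def solve_W_iteratively(P, Q, W0, T=100):
    """Minimize Tr(W^T P W) - 2 Tr(W^T Q) by gradient descent instead of W = P^{-1} Q.
    Each step costs O(d^2 c), so T steps cost O(T d^2 c) versus O(d^3) for inversion."""
    W = W0.copy()
    alpha = 1.0 / np.linalg.norm(P, 2)  # spectral-norm step size keeps the iteration stable
    for _ in range(T):
        W -= alpha * (P @ W - Q)        # gradient of the quadratic is 2(PW - Q), folded into alpha
    return W
```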
V. EXPERIMENTS

In this section, extensive experiments over several real-world video datasets are conducted to evaluate the performance of the proposed semi-supervised feature selection method on video semantic recognition related tasks, including video concept detection (VCD) and human action recognition (HAR).

A. Experimental Data

To evaluate the proposed algorithm, we briefly introduce the following five datasets and summarize their details in Table I.

Columbia Consumer Video dataset (CCV) [42]: This dataset, collected for visual content recognition, consists of 9,317 YouTube videos over 20 semantic categories, where the manual annotation was conducted on the Amazon MTurk platform. In the experiments, we directly use space-time interest point (STIP) features with a 5,000-dimensional bag-of-visual-words (BoW) representation as the low-level features of the videos [42].

Kodak consumer video dataset (Kodak) [43]: This set consists of 1,358 consumer video clips and 1,873 YouTube video clips, from which 3,590 key-frames were extracted and annotated in a rigorous fashion. All the key-frames belonging to the annotated concepts are used in the experiments. Following [2], we extract SIFT points and represent each key-frame as a 1,000-dimensional BoW histogram based on the clustered visual dictionary.

CareMedia dataset (CareMedia) [44]: This dataset records 15 geriatric patients' activities in the public spaces of a nursing home. We use a subset of the entire dataset, consisting of 3,913 video sequences recorded by one camera in the dining room. In the experiments, a 1,000-dimensional MoSIFT feature is extracted to represent each video sequence, for its consideration of both local appearance and local motion [45].

KTH action dataset (KTH) [46]: This video dataset consists of 2,391 video sequences covering 6 types of human actions. Following the STIP descriptor in [2], we represent each video sequence as a 1,000-dimensional STIP feature by concatenating several histograms from a space-time grid defined on the patch, generalizing the SIFT descriptor to space-time.

HMDB dataset: This dataset contains 7,000 manually annotated clips with 51 action categories, extracted from a variety of sources ranging from digitized movies to YouTube. Following the STIP descriptor in [2], we represent each video as a 1,000-dimensional STIP feature.

TABLE I: Dataset description. [Some entries were lost in extraction.]
Dataset    Task  # samples  # features  # training  # categories  # selected features
CCV        VCD   9,317      5,000       4,659       20            {1,000, 1,500, ..., 4,500, 5,000}
Kodak      VCD   3,590      1,000       2,000       22            {100, 200, ..., 900, 1,000}
CareMedia  HAR   3,913      1,000       1,000       5             {100, 200, ..., 900, 1,000}
KTH        HAR   2,391      1,000       1,000       6             {100, 200, ..., 900, 1,000}
HMDB51     HAR   7,000      1,000       (lost)      51            {100, 200, ..., 900, 1,000}

B. Experimental Settings

For the CCV dataset, we adopt the standard partition which splits the entire dataset into 4,659 training videos and 4,658 testing videos. Since there is no standard split for the other datasets, we randomly sample a certain number of videos as the training data; the remaining videos of each dataset serve as the corresponding testing data. The detailed splits for each dataset are included in Table I. The number of selected features for each dataset is also reported in Table I with regard to the particular feature size. To study the performance of semi-supervised feature selection, we set the ratio of labeled videos among the sampled training data to different values from {1%, 5%, 10%, 25%, 50%, 100%}.

To illustrate the effectiveness and superiority of our model's classification performance, we compare the proposed OGE-SFS with the following state-of-the-art methods. Note that All-Fea serves as a baseline without feature selection; SBMLR and FSNM are supervised feature selection methods; LUDLP and ASL perform semi-supervised classification with all features; the others are semi-supervised feature selection methods.

All features [All-Fea]: The original high-dimensional features without feature selection, used as a baseline in the experiments.

Sparse Multinomial Logistic Regression via Bayesian ℓ1 Regularization [47] [SBMLR]: It can be applied to feature selection by exploiting sparsity using a Laplace prior.

Feature Selection via Joint ℓ2,1-norm Minimization [48] [FSNM]: It employs joint ℓ2,1-norm minimization on both the loss function and the regularization to realize feature selection across all data points.

Learning from Labeled and Unlabeled Data with Label Propagation [49] [LUDLP]: This semi-supervised classification method propagates labels through the dataset along high-density areas defined by the unlabeled data.

Large-Scale Adaptive Semi-Supervised Learning [37] [ASL]: Instead of using graph information, this semi-supervised classification method suppresses the weights of boundary unlabeled points adaptively.
Locality Sensitive Semi-Supervised Feature Selection [50] [LSDF]: This is a semi-supervised feature selection approach based on within-class and between-class graph construction.

Discriminative Semi-Supervised Feature Selection via Manifold Regularization [24] [FS-Manifold]: It selects features by maximizing the classification margin between different classes while simultaneously exploiting the data geometry through manifold regularization.

Structural Feature Selection with Sparsity [25] [SFSS]: This semi-supervised feature selection algorithm incorporates joint feature selection and semi-supervised learning into a single framework. Correlations between different features are taken into consideration.

Semi-Supervised Feature Selection via Spline Regression [2] [S²FS²R]: The discriminative information of the labeled training videos and the local geometric structure of all the training videos are preserved by the combined semi-supervised scatters.

For fair comparison, the number of neighbors is set to five for all of the graph-based competitors. We repeat the training data sampling procedure 50 times and report the average performance with the corresponding variance for all of the datasets, where the performance of each repetition is achieved by tuning each method's regularization parameters with a grid-search strategy over the set {10^{-3}, 10^{-2}, ..., 10^3}. Specifically, we set the adaptive loss parameter to σ = 0.1 in the proposed model. With the selected features, we report classification results based on a χ²-kernel SVM, for its excellent performance on human action recognition [51]. In this paper, we leverage mean average precision (mAP) as the evaluation metric due to its stability and discriminating capability for classification.

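For reference, evaluation in this setting can be reproduced with off-the-shelf components. The snippet below wires a precomputed χ² kernel into an SVM and computes average precision on synthetic stand-in data; all sizes and labels are toy values, and per-class APs would be averaged into mAP:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score
from sklearn.metrics.pairwise import chi2_kernel

# Toy stand-ins for selected BoW features (nonnegative, as chi2 requires).
rng = np.random.default_rng(0)
Xtr, Xte = rng.random((80, 50)), rng.random((40, 50))
ytr, yte = rng.integers(0, 2, 80), rng.integers(0, 2, 40)

Ktr = chi2_kernel(Xtr, Xtr)            # precomputed chi-square kernel
Kte = chi2_kernel(Xte, Xtr)
clf = SVC(kernel="precomputed").fit(Ktr, ytr)
scores = clf.decision_function(Kte)
print("AP:", average_precision_score(yte, scores))   # mAP = mean of per-class APs
```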
C. Experimental Results

To study the influence of different ratios of labeled training data on classification performance, we set the ratio to various values in {1%, 5%, 10%, 25% and 50%} in the semi-supervised learning and report the mAP values for all of the datasets in Table II to Table VI, respectively, where the best results are highlighted in bold.

TABLE II: Comparison on the CCV dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; rows: All-Fea, SBMLR, FSNM, LUDLP, ASL, LSDF, FS-Manifold, SFSS, S²FS²R, OGE-SFS.]

TABLE III: Comparison on the Kodak dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

TABLE IV: Comparison on the CareMedia dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

TABLE V: Comparison on the KTH dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

From these tables, we reach the following observations and analysis:

1) All-Fea shows the worst performance among the feature selection based methods over all of the datasets. It is therefore necessary to exploit dimensionality reduction techniques to relieve the curse of dimensionality and mitigate the effects of noise.

2) For each method, the performance on each dataset improves as the ratio of labeled training videos increases. Since the discriminative information is usually encoded in the qualified labels, which are typically expensive to obtain, it is necessary to develop semi-supervised learning for video semantic recognition.

3) Semi-supervised classification methods, with or without feature selection, commonly perform better than supervised methods, owing to the intrinsic structure exploited from the unlabeled videos. We also observe that semi-supervised feature selection methods perform much better than supervised feature selection methods even with fewer unlabeled videos.

4) In general, semi-supervised feature selection algorithms (i.e., S²FS²R and OGE-SFS) outperform semi-supervised classification algorithms (i.e., LUDLP and ASL) on all the datasets, which confirms the necessity of feature selection for video semantic recognition tasks.

5) For each dataset, the proposed OGE-SFS consistently outperforms all the compared semi-supervised feature selection methods for varying numbers of labeled training videos, thanks to the adaptive neighbor assignment and adaptive loss measurement. Indeed, OGE-SFS learns the optimal similarity matrix adaptively and captures the intrinsic manifold structure of the original feature space more exactly.

TABLE VI: Comparison on the HMDB dataset (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; same rows as Table II.]

To ensure that the improvement of the proposed algorithm in terms of feature selection is statistically significant, a non-parametric pair-wise Wilcoxon test is conducted on the five datasets. The experimental results are reported in Table VII to Table XI, where the victories are highlighted in bold. We compute the p-value of the proposed algorithm against each other algorithm at a fixed level of significance α. From the results shown in Table VII to Table XI, we observe that the proposed algorithm achieves statistically significant improvements.

TABLE VII: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on CCV.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  2e-11  5e-11  2e-11  3e-11  1e-11
SBMLR         4e-11  3e-11  5e-11  4e-11  2e-11  3e-11
FSNM          4e-11  1e-11  5e-11  3e-11  2e-11  2e-11
LSDF          5e-11  3e-11  2e-11  3e-11  4e-11  2e-11
FS-Manifold   3e-11  3e-11  4e-11  3e-11  2e-11  4e-11
SFSS          2e-11  4e-11  2e-11  7e-11  3e-11  3e-11
S²FS²R        2e-11  3e-11  4e-11  2e-11  2e-11  4e-11

TABLE VIII: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on Kodak.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  4e-11  2e-11  3e-11  2e-11  4e-11
SBMLR         2e-11  4e-11  4e-11  3e-11  2e-11  4e-11
FSNM          2e-11  4e-11  3e-11  4e-11  3e-11  2e-11
LSDF          3e-11  2e-11  4e-11  2e-11  4e-11  2e-11
FS-Manifold   4e-11  3e-11  2e-11  3e-11  2e-11  3e-11
SFSS          5e-11  3e-11  1e-11  3e-11  3e-11  1e-11
S²FS²R        3e-11  5e-11  2e-11  5e-11  3e-11  2e-11

TABLE IX: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on CareMedia.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  2e-11  3e-11  5e-11  4e-11  2e-11
SBMLR         5e-11  3e-11  4e-11  2e-11  5e-11  4e-11
FSNM          4e-11  3e-11  4e-11  6e-11  3e-11  2e-11
LSDF          4e-11  3e-11  4e-11  5e-11  4e-11  3e-11
FS-Manifold   4e-11  3e-11  2e-11  3e-11  2e-11  3e-11
SFSS          4e-11  2e-11  5e-11  4e-11  3e-11  2e-11
S²FS²R        2e-11  3e-11  2e-11  4e-11  2e-11  3e-11

TABLE X: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on KTH.
Method        1%     5%     10%    25%    50%    100%
All-Fea       4e-11  3e-11  5e-11  3e-11  2e-11  5e-11
SBMLR         4e-11  2e-11  4e-11  3e-11  3e-11  4e-11
FSNM          3e-11  4e-11  5e-11  4e-11  3e-11  5e-11
LSDF          2e-11  5e-11  3e-11  4e-11  5e-11  3e-11
FS-Manifold   3e-11  2e-11  4e-11  4e-11  2e-11  5e-11
SFSS          4e-11  3e-11  2e-11  4e-11  5e-11  3e-11
S²FS²R        3e-11  2e-11  2e-11  3e-11  3e-11  4e-11

TABLE XI: p-values of the pair-wise Wilcoxon test w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data on HMDB.
Method        1%     5%     10%    25%    50%    100%
All-Fea       2e-11  4e-11  4e-11  2e-11  3e-11  3e-11
SBMLR         5e-11  4e-11  3e-11  2e-11  4e-11  5e-11
FSNM          4e-11  2e-11  3e-11  4e-11  2e-11  4e-11
LSDF          3e-11  3e-11  4e-11  2e-11  3e-11  3e-11
FS-Manifold   4e-11  3e-11  2e-11  4e-11  3e-11  3e-11
SFSS          4e-11  2e-11  3e-11  4e-11  2e-11  4e-11
S²FS²R        4e-11  3e-11  3e-11  4e-11  4e-11  5e-11

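The significance test itself is a one-liner with scipy; the mAP arrays below are hypothetical stand-ins for the 50 repetitions described above:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
map_ogesfs = 0.30 + 0.02 * rng.random(50)    # hypothetical mAP of OGE-SFS over 50 resamplings
map_baseline = 0.27 + 0.02 * rng.random(50)  # hypothetical mAP of a compared method
stat, p = wilcoxon(map_ogesfs, map_baseline, alternative="greater")
print(f"Wilcoxon p-value: {p:.2e}")          # a small p-value rejects "no improvement"
```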
D. Convergence Analysis

Fig. 2: Convergence curves of the proposed algorithm over all of the datasets (panels: CCV, Kodak, CareMedia, KTH, HMDB). [Plots lost in extraction.]

In the previous section, we verified the convergence of the proposed alternating optimization algorithm in theory. For practical applications, without loss of generality, we plot the convergence curves of the proposed algorithm in Figure 2 with the regularization parameters γ and λ fixed at 1 for all of the datasets. Thanks to the sparsity assumption on the adaptive similarity matrix S as well as the convex formulation in the variables W, b and F, the proposed alternating optimization algorithm converges to a local minimum within no more than 20 iterations on all of the datasets; in particular, fewer than 10 iterations are required on the CCV, Kodak and CareMedia datasets. As a result, the proposed algorithm is efficient and shows fast convergence in practice.

E. Sensitivity Analysis

Fig. 3: Sensitivity analysis of the regularization parameters γ and λ with 10% labeled training data for all datasets (mAP surfaces over the grid-search range of γ and λ; panels: CCV, Kodak, CareMedia, KTH, HMDB). [Plots lost in extraction.]

To demonstrate the robustness of the proposed algorithm, we study the influence of the parameters γ and λ with 10% labeled training data on all of the datasets. Figure 3 shows the joint sensitivity of the two regularization parameters λ and γ with respect to the performance of video semantic recognition, with the adaptive loss parameter fixed at σ = 0.1 and the size of the adaptive neighborhood at K = 5. We observe from the results on the five datasets that, on one hand, the proposed method is robust to different values of the regularization parameters λ and γ over a wide range; on the other hand, there seems to be no common rule to identify the best parameters for all applications, because of the varying properties of the datasets.

We also conduct experiments to test the influence of the number of nearest neighbors under the experimental settings of Subsection B. Table XII shows the performance of the proposed model with respect to different numbers of nearest neighbors, i.e., K = 5, 10, 20. From the experimental results, we can see that the best performance is generally obtained with K = 5 over all of the datasets.

TABLE XII: Influence of K (mAP ± STD) w.r.t. 1%, 5%, 10%, 25%, 50% and 100% labeled training data. [Numeric entries lost in extraction; rows: OGE-SFS(5), OGE-SFS(10), OGE-SFS(20) for each of CCV, Kodak, CareMedia, KTH and HMDB.]
For a more detailed discussion of the influence of K on local graph-based models, interested readers may refer to [52], [53].

VI. CONCLUSION

In this paper, we proposed a novel semi-supervised feature analysis framework for video semantic recognition that integrates adaptive optimal similarity matrix learning into the procedure of feature selection. This method alleviates the sensitivity of the model performance to the input affinity matrix and captures the intrinsic manifold structure of the original feature space through adaptive neighbor assignment. Additionally, an adaptive loss function is adopted to enhance the model's robustness to noise and outliers. We exploit an efficient alternating optimization algorithm to solve the proposed challenging problem. The convergence and complexity of the proposed algorithm are analyzed in theory. Both the theoretical analysis and the experimental performance illustrate the effectiveness and superiority of the proposed approach on varying tasks of video semantic recognition. Note that we focus on the case where each data point belongs to only one category with a single view. In future work, it would be interesting to involve adaptive neighbor assignment in semi-supervised feature selection in the multi-label [23], [31] and multi-view [54], [55] cases, as well as in other interesting applications, e.g., co-saliency detection [56], [57].

REFERENCES

[1] X. Chang, Y.-L. Yu, Y. Yang, and E. P. Xing, "Semantic pooling for complex event analysis in untrimmed videos," IEEE Trans. Pattern Anal. Mach. Intell.
[2] Y. Han, Y. Yang, Y. Yan, Z. Ma, N. Sebe, and X. Zhou, "Semisupervised feature selection via spline regression for video semantic recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2, pp. 252-264, 2015.
[3] M. Luo, F. Nie, X. Chang, Y. Yang, A. Hauptmann, and Q. Zheng, "Avoiding optimal mean robust PCA/2DPCA with non-greedy ℓ1-norm maximization," in IJCAI, 2016.
[4] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Springer Science & Business Media, 2012.
[5] J. G. Dy and C. E. Brodley, "Feature selection for unsupervised learning," J. Mach. Learn. Res., vol. 5, 2004.
[6] Z. Zhao, X. He, D. Cai, L. Zhang, W. Ng, and Y. Zhuang, "Graph regularized feature selection with data reconstruction," IEEE Trans. Knowl. Data Eng., vol. 28, no. 3, 2016.
[7] H. Elghazel and A. Aussem, "Feature selection for unsupervised learning using random cluster ensembles," in ICDM, 2010.
[8] H. Arai, C. Maung, K. Xu, and H. Schweitzer, "Unsupervised feature selection by heuristic search with provable bounds on suboptimality," in AAAI, 2016.
[9] P. Mitra, C. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, 2002.
[10] H.-L. Wei and S. A. Billings, "Feature subset selection and ranking for data dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, 2007.
[11] J. Xu, B. Tang, H. He, and H. Man, "Semisupervised feature selection based on relevance and redundancy criteria," IEEE Trans. Neural Netw. Learn. Syst., to appear.
[12] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, "A multimedia retrieval framework based on semi-supervised ranking and relevance feedback," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, 2012.
[13] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, 2005.
[14] J. Lee Rodgers and W. A. Nicewander, "Thirteen ways to look at the correlation coefficient," The American Statistician, vol. 42, no. 1, 1988.
[15] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[17] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, Carnegie Mellon University, 2005.
[18] X. Chang and Y. Yang, "Semi-supervised feature analysis by mining correlations among multiple tasks," IEEE Trans. Neural Netw. Learn. Syst., 2016.
[19] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, 2006.
[20] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. MIT Press, 2006.
[21] J. Lafferty and L. Wasserman, "Statistical analysis of semi-supervised regression," in NIPS, 2007.
[22] Z. Zhao and H. Liu, "Semi-supervised feature selection via spectral analysis," in SDM, 2007.
[23] A. Alalga, K. Benabdeslem, and N. Taleb, "Soft-constrained Laplacian score for semi-supervised multi-label feature selection," Knowl. Inf. Syst., vol. 47, no. 1, 2016.
[24] Z. Xu, I. King, M. R. Lyu, and R. Jin, "Discriminative semi-supervised feature selection via manifold regularization," IEEE Trans. Neural Netw., vol. 21, no. 7, 2010.
[25] Z. Ma, F. Nie, Y. Yang, J. R. Uijlings, N. Sebe, and A. G. Hauptmann, "Discriminating joint feature analysis for multimedia data understanding," IEEE Trans. Multimedia, vol. 14, no. 6, 2012.
[26] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the surprising behavior of distance metrics in high dimensional space." Springer, 2001.
[27] S. Xiang, F. Nie, C. Zhang, and C. Zhang, "Nonlinear dimensionality reduction with local spline embedding," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, 2009.

[28] C. Shi, Q. Ruan, G. An, and R. Zhao, "Hessian semi-supervised sparse feature selection based on ℓ2,1/2-matrix norm," IEEE Trans. Multimedia, vol. 17, no. 1, pp. 16-28, 2015.
[29] K. I. Kim, F. Steinke, and M. Hein, "Semi-supervised regression using Hessian energy with an application to semi-supervised dimensionality reduction," in NIPS, 2009.
[30] O. Chapelle, B. Schölkopf, and A. Zien, A Discussion of Semi-Supervised Learning and Transduction. MIT Press, 2006.
[31] X. Chang, F. Nie, Y. Yang, and H. Huang, "A convex formulation for semi-supervised multi-label feature selection," in AAAI, 2014.
[32] Z. Xu, I. King, M. R.-T. Lyu, and R. Jin, "Discriminative semi-supervised feature selection via manifold regularization," IEEE Trans. Neural Netw., vol. 21, no. 7, 2010.
[33] J. Ren, Z. Qiu, W. Fan, H. Cheng, and S. Y. Philip, "Forward semi-supervised feature selection," in PAKDD, 2008.
[34] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, "Decision trees for hierarchical multi-label classification," Mach. Learn., vol. 73, no. 2, 2008.
[35] F. Nie, H. Wang, H. Huang, and C. H. Ding, "Adaptive loss minimization for semi-supervised elastic embedding," in IJCAI, 2013.
[36] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," IEEE Trans. Knowl. Data Eng., vol. 20, no. 1, 2008.
[37] D. Wang, F. Nie, and H. Huang, "Large-scale adaptive semi-supervised learning via unified inductive and transductive model," in ACM SIGKDD, 2014.
[38] F. Nie, H. Huang, C. Ding, D. Luo, and H. Wang, "Robust principal component analysis with non-greedy ℓ1-norm maximization," in IJCAI, 2011.
[39] C. Ding, D. Zhou, X. He, and H. Zha, "R1-PCA: rotational invariant ℓ1-norm principal component analysis for robust subspace factorization," in ICML, 2006.
[40] H. Liu, Y. Liu, and F. Sun, "Robust exemplar extraction using structured sparse coding," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 8, 2015.
[41] F. Nie, X. Wang, and H. Huang, "Clustering and projected clustering with adaptive neighbors," in ACM KDD, 2014.
[42] Y. Jiang, G. Ye, S. Chang, D. P. W. Ellis, and A. C. Loui, "Consumer video understanding: a benchmark database and an evaluation of human and machine performance," in ICMR, 2011.
[43] A. C. Loui, J. Luo, S. Chang, D. Ellis, W. Jiang, L. S. Kennedy, K. Lee, and A. Yanagawa, "Kodak's consumer video benchmark data set: concept definition and annotation," in ACM SIGMM MIR, 2007.
[44] [Online]. Available:
[45] M.-y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," Carnegie Mellon Univ., Tech. Rep. CMU-CS.
[46] C. Schüldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in ICPR, 2004.
[47] G. C. Cawley, N. L. C. Talbot, and M. Girolami, "Sparse multinomial logistic regression via Bayesian L1 regularisation," in NIPS, 2006.
[48] F. Nie, H. Huang, X. Cai, and C. H. Q. Ding, "Efficient and robust feature selection via joint ℓ2,1-norms minimization," in NIPS, 2010.
[49] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," Carnegie Mellon University, Tech. Rep. CMU-CALD-02-107, 2002.
[50] J. Zhao, K. Lu, and X. He, "Locality sensitive semi-supervised feature selection," Neurocomputing, vol. 71, no. 10-12, 2008.
[51] S. Wang, Y. Yang, Z. Ma, X. Li, C. Pang, and A. G. Hauptmann, "Action recognition by exploring data distribution and feature correlation," in CVPR, 2012.
[52] M. Maier, U. V. Luxburg, and M. Hein, "Influence of graph construction on graph-based clustering measures," in NIPS, 2008.
Hein, Influence of graph construction on graph-based clustering measures, in NIPS, 008. [53] M. H. Rohban and H. R. Rabiee, Supervised neighborhood graph construction for semi-supervised classification, Pattern Recognition, vol. 45, no. 4, pp , 01. [54] S. Sun, Multi-view laplacian support vector machines, in International Conference on Advanced Data Mining and Applications, 011, pp. 09. [55] A. Iosifidis, A. Tefas, and I. Pitas, Regularized extreme learning machine for multi-view semi-supervised action recognition, Neurocomputing, vol. 145, pp. 50 6, 014. [56] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, Detection of co-salient objects by looking deep and wide, International Journal of Computer Vision, vol. 10, no., pp. 15 3, 016. [57] D. Zhang, J. Han, J. Han, and L. Shao, Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining, IEEE Trans. Neural Netw. Learning Syst., vol. 7, no. 6, pp , 016. Minnan Luo received the Ph. D. degree from the Department of Computer Science and Technology, Tsinghua University, China, in 014. Currently, she is an Assistant Professor in the School of Electronic and Information Engineering at Xi an Jiaotong University. She was a Post-Doctoral Research with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Her research interests include machine learning and optimization, crossmedia retrieval and fuzzy system. Xiaojun Chang received the PhD degree in computer science from Centre for Quantum Computation and Intelligent Systems (QCIS), University of Technology Sydney, Australia in 016. He has been working as a visiting student at the language technology institute of Carnegie Mellon University. His main research interests include machine learning, data mining and computer vision. Liqiang Nie received the B.E. degree from the Xi an Jiaotong University and the Ph.D. degree from the National University of Singapore in 009 and 013, respectively. He is currently a Research Fellow with the School of Computing, National University of Singapore. His current research interests include information retrieval and healthcare analytics. Yi Yang received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 010. He is currently an associate professor with University of Technology Sydney, Australia. He was a Post-Doctoral Research with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. His current research interest include machine learning and its applications to multimedia content analysis and computer vision, such as multimedia indexing and retrieval, surveillance video analysis and video semantics understanding. Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universitat Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University, USA, in He is currently with the Department of Computer Science and the Language Technologies Institute, CMU. His research interests include natural language processing, speech understanding and synthesis, video analysis, and machine learning. Qinghua Zheng received the B.S. degree in computer software in 1990, the M.S. degree in computer organization and architecture in 1993, and the Ph.D. degree in system engineering in 1997 from Xi an Jiaotong University, China. He was a postdoctoral researcher at Harvard University in 00. 
He is currently a professor at Xi'an Jiaotong University and the dean of the Department of Computer Science. His research areas include computer network security, intelligent e-learning theory and algorithms, multimedia e-learning, and trustworthy software.
