Feature Selection in a Kernel Space


Bin Cao, Peking University, Beijing, China
Dou Shen, Hong Kong University of Science and Technology, Hong Kong
Jian-Tao Sun, Microsoft Research Asia, 49 Zhichun Road, Beijing, China
Qiang Yang, Hong Kong University of Science and Technology, Hong Kong
Zheng Chen, Microsoft Research Asia, 49 Zhichun Road, Beijing, China

Abstract

We address the problem of feature selection in a kernel space to select the most discriminative and informative features for classification and data analysis. This is a difficult problem because the dimension of a kernel space may be infinite. In the past, little work has been done on feature selection in a kernel space. To solve this problem, we derive a basis set in the kernel space as a first step for feature selection. Using the basis set, we then extend the margin-based feature selection algorithms that are proven effective even when many features are dependent. The selected features form a subspace of the kernel space, in which different state-of-the-art classification algorithms can be applied for classification. We conduct extensive experiments over real and simulated data to compare our proposed method with four baseline algorithms. Both theoretical analysis and experimental results validate the effectiveness of our proposed method.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

1. Introduction

Finding a proper representation of data from their original features is a fundamental problem in machine learning. Generally speaking, not all original features are beneficial for classification tasks. Some of the features, which can even be noise, will hurt the classification performance. Finding a proper representation is essential for removing the noisy features or for deriving new features from the original space, so that we only keep the information necessary for classification and data analysis. Solutions for this problem can be broadly classified into two groups: feature extraction (such as PCA (Jolliffe, 2002) and LDA (Fukunaga, 1990)) and feature selection (such as IG, MI (Guyon & Elisseeff, 2003), and Relief (Kira & Rendell, 1992)). Traditional linear feature selection and extraction operations are conducted in the original input space, and thus cannot handle nonlinear relationships in the data well. For example, the principal components of the features may be nonlinearly related to the input variables, and the data of different categories may not be separable by a hyperplane. To tackle this problem, kernel methods are introduced by mapping the data from an original space (or an input space) to a kernel space using a specially designed mapping function. Among these methods, Kernel Principal Component Analysis (KPCA) (Schölkopf & Smola, 2002) was proposed to find nonlinear principal components.
However, as an unsupervised learning method, KPCA does not take the category information into consideration. Kernel Fisher Discriminant Analysis (KFDA) (Mika et al., 1999a) and Generalized Discriminant Analysis (GDA) (Baudat & Anouar, 2000) can make use of category information. However, the number of features extracted by these methods is limited by the number of categories, which is at most n - 1 if the category count is n. In practice, it is not uncommon that more than n - 1 features are useful for classification. In addition, these kernelized methods have high computational complexity, since they rely on the computation of the SVD (Horn & Johnson, 1985), which requires at least O(N^2 k) time to our best knowledge, where N is the number of data points and k is the number of eigenvectors. Intuitively, we could adapt information gain (IG) or mutual information (MI) methods to the kernel space. However, the mapping functions from the original space to the kernel space are usually implicit. Even if the function is explicit, the dimension of the kernel space may be infinite. Furthermore, feature dependency often prevents us from considering the features of a kernel space individually, as IG or MI would.

Figure 1: Illustrating the XOR data distribution. The left sub-figure shows the data distribution in the original space, where the data are not linearly separable. The middle sub-figure shows the data distribution in the kernel space, where the data are linearly separable. The rightmost sub-figure shows the data distribution in a kernel subspace, where the data are linearly separable in a space of low dimension.

In this paper, we present a novel method for feature selection directly in the kernel space, without moving back to the original feature space as in (Bradley & Mangasarian, 1998; Liang & Zhao, 2006). To do this, we first construct a basis set in the kernel space. We then extend the margin-based feature-selection methods to select the best bases for classification. Finally, we classify the test data in the kernel space with the selected basis set directly. Compared to methods such as KPCA, KFDA and GDA, the computational complexity of our method is reduced significantly to O(N^2). We conduct extensive experiments over the IDA data sets to compare our method with four state-of-the-art baselines. Theoretical analysis and experimental results validate the effectiveness of our proposed method.

2. Related Works

Traditional feature selection methods such as Information Gain and the χ2-test consider the contribution of each feature to the classification task independently. They have been proven effective in many applications (Guyon & Elisseeff, 2003; Yang & Pedersen, 1997). However, they do not work well when features are not independent of each other, as in the XOR problem shown in the leftmost sub-figure of Figure 1. Recently, margin-based feature selection or feature weighting methods have been proposed, including ReliefF (Kononenko, 1994), Simba (Gilad-Bachrach et al., 2004) and I-Relief (Sun & Li, 2006). They are derived from the basic Relief algorithm (Kira & Rendell, 1992), which weights each feature so as to maximize the margin. The margin of a data point x used in these methods is defined in terms of the distances from x to its nearest same-labeled point (the nearhit) and to its nearest differently-labeled point (the nearmiss). The advantage of margin-based feature selection methods is that the optimal weights of the features can be estimated by considering global information. Thus these methods work well even for data with dependent features (Gilad-Bachrach et al., 2004).
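As a small illustration of the point made by Figure 1, the following sketch (entirely our own, not from the paper) generates XOR-style data and shows that the explicit feature map of a degree-2 polynomial kernel makes them separable by a single coordinate, while no hyperplane separates them in the original two dimensions. The data generator and kernel choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four XOR-style clusters: label +1 in quadrants (+,+) and (-,-), label -1 otherwise.
centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
labels = np.array([1, 1, -1, -1])
X = np.vstack([c + 0.2 * rng.standard_normal((50, 2)) for c in centers])
y = np.repeat(labels, 50)

def poly2_map(X):
    """Explicit feature map of the degree-2 polynomial kernel K(x, x') = (x . x')^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

# In the mapped space a single coordinate (sqrt(2) * x1 * x2) already separates the classes,
# whereas no hyperplane in the original two dimensions can do so.
Z = poly2_map(X)
print("mapped coordinate separates the classes:", np.all(np.sign(Z[:, 2]) == y))
```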
The concept of margins has been widely used in machine learning over the past decades to measure a model's generalization ability. Our work is also related to kernel methods for feature selection (Bradley & Mangasarian, 1998; Liang & Zhao, 2006). However, most of the previous work in this area uses kernels to help select features in the original space. In contrast, our goal is to select features in the kernel space, because feature selection in the original space cannot change the non-separable nature of the data. Feature selection in a kernel space can handle this problem, as illustrated in the rightmost sub-figure of Figure 1. Kernel-based feature-extraction methods have also been designed to address the nonlinear projection problem. Kernel Principal Component Analysis (Schölkopf & Smola, 2002) was proposed to find principal components in high-dimensional feature spaces. KPCA is shown to be effective in many tasks such as image de-noising (Mika et al., 1999b). However, KPCA does not consider the category information. As a result, features extracted by KPCA may be irrelevant to a classification task. To tackle this problem, kernel-based LDA methods such as KFDA (Mika et al., 1999a) and GDA (Baudat & Anouar, 2000) have been proposed. These methods extract features that are discriminative and beneficial for classification in a kernel space. However, the number of features they extract is limited by the number of categories.

(Weston et al., 2003) discussed the problem of kernel feature selection when the mapping function between the original space and the kernel space can be explicitly expressed. This work is limited to certain types of kernels, such as polynomial kernels; our method does not have this restriction and can deal with any kernel. (Baudat & Anouar, 2003) proposed a method for feature-vector selection using data selection. However, similar to KPCA, the method does not find discriminative feature vectors. In (Niijima & Kuhara, 2006), Niijima proposed a wrapper method that utilizes kernel-based classifiers for gene-subset selection. However, the wrapper method faces the problem of high computational complexity. In (Wu et al., 2005), the authors proposed Sparse Large Margin Classifiers. They find that sparse SVM classifiers in a kernel space usually have better generalization ability. However, their work is restricted to SVM classifiers, whereas our method can be applied to different classifiers.

3. Feature Selection in a Kernel Space

Let D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N} \subset \mathbb{R}^I \times \{\pm 1\} denote a training dataset, where I is the data dimensionality. K(x, x') is the kernel function with an implicit mapping function φ(x) = z, where x ∈ R^I and z ∈ F, with F a high-dimensional space. Generally speaking, the features in the kernel space are not assumed to be independent. Therefore, feature selection methods that consider each feature individually are unlikely to work well in a kernel space. However, a margin-based feature-selection method can handle the feature-dependency problem successfully, as explored in (Gilad-Bachrach et al., 2004). In the following, we give a brief introduction to the basic margin-based feature-selection method Relief (Kononenko, 1994; Sun & Li, 2006), and then present our algorithm to extend Relief to a kernel space.

3.1. Margin Based Feature Selection (Weighting) Methods

Margin-based feature-selection methods such as ReliefF, Simba and I-Relief have been developed based on the idea of Relief, which uses a weighting-based method for feature selection. The weight of each feature reflects its ability to discriminate different categories. Relief tries to find weights by maximizing the margins between data of different categories. The margin of a data point x used in Relief is defined by the distances to the nearhit NH(x) and the nearmiss NM(x) of the target data point, which are the nearest points with the same and with a different label, respectively. The margin of the whole training dataset is the sum of the margins of all data points, which leads to the following optimization problem:

\max_{w} \sum_{n=1}^{|D|} \Big( D_w\big(x^{(n)}, NM(x^{(n)})\big) - D_w\big(x^{(n)}, NH(x^{(n)})\big) \Big) \quad \text{s.t. } \|w\|_2^2 = 1, \; w \geq 0

where D_w(x, x') = \sum_{i=1}^{I} w_i |x_i - x'_i| is the weighted distance and w_i is the weight of the i-th feature. Thus, the feature weighting problem is converted into an optimization problem that maximizes the above objective. The solution to this problem is the weight vector w = (m)_+ / ||(m)_+||_2, where

m = \sum_{n=1}^{|D|} \Big( \big| x^{(n)} - NM(x^{(n)}) \big| - \big| x^{(n)} - NH(x^{(n)}) \big| \Big)

is the margin vector (with the absolute value taken element-wise) and (m_i)_+ = max(m_i, 0). We refer the reader to (Kononenko, 1994; Sun & Li, 2006) for details of Relief.
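A minimal sketch of the Relief-style weighting just described (our own NumPy rendering under the definitions above, not the authors' code): for every training point we locate its nearhit and nearmiss, accumulate the margin vector m, and return (m)_+ normalized to unit length.

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief-style feature weights: w = (m)_+ / ||(m)_+||_2, where the margin
    vector m sums |x - nearmiss(x)| - |x - nearhit(x)| (element-wise) over all points."""
    n, d = X.shape
    m = np.zeros(d)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # never pick the point itself as a neighbour
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        nearhit = same[np.argmin(dist[i, same])]   # nearest point with the same label
        nearmiss = diff[np.argmin(dist[i, diff])]  # nearest point with a different label
        m += np.abs(X[i] - X[nearmiss]) - np.abs(X[i] - X[nearhit])
    m_plus = np.maximum(m, 0.0)
    norm = np.linalg.norm(m_plus)
    return m_plus / norm if norm > 0 else m_plus

# Toy usage: the informative first feature should receive almost all of the weight.
rng = np.random.default_rng(1)
y = np.repeat([1, -1], 50)
X = np.column_stack([y + 0.3 * rng.standard_normal(100),   # informative feature
                     rng.standard_normal(100)])            # pure noise feature
print(relief_weights(X, y))
```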
3.2. Margin Based Feature Selection in a Kernel Space

Now we discuss how to extend Relief to a kernel space. Note that Relief calculates the weight of each feature by w_i = e^(i) · (m)_+ / ||(m)_+||_2, where e^(i) is the i-th basis vector of the Euclidean space and m is the margin vector. Similarly, we can derive the weight of a feature in the kernel space by w_i = e^(i) · (m')_+ / ||(m')_+||_2, where e^(i) is now a basis vector of the kernel space and m' is the margin vector in the kernel space. Recall that the margin defined in Relief is determined by the distance defined in a space. We can therefore extend Relief to a kernel space by using the distance function induced by the corresponding kernel function, as shown in Equation (1); the nearhit and nearmiss points are found in the kernel space based on this distance:

D_K(x, x') = \langle \phi(x) - \phi(x'), \phi(x) - \phi(x') \rangle = K(x, x) + K(x', x') - 2K(x, x')    (1)

where φ(x) is the mapping function corresponding to the kernel K(x, x'). When we want to calculate the distance in the weighted kernel space, we need to use the weighted kernel K_w, defined as

K_w(x, x') = \sum_i w_i \phi_i(x) \phi_i(x')

where φ_i(x) is the i-th feature of φ(x). In the kernel space, the index i may range to infinity. Since we do not have φ_i(x) explicitly, we cannot calculate the above equation directly.
However, if we have an orthogonal basis set {η^(i)} in the kernel space, then the function K_w(x, x') can be calculated as

K_w(x, x') = \sum_i w_i \langle \phi(x), \eta^{(i)} \rangle \langle \phi(x'), \eta^{(i)} \rangle

Given a data set D, we let the bases of the kernel space be expressed by combinations of the data in the high-dimensional space, that is, η^(i) = \sum_j \alpha_{ij} \phi(x^{(j)}). We then have

K_w(x, x') = \sum_{i=1}^{|D|} w_i \big( \phi(x) \cdot \eta^{(i)} \big) \big( \phi(x') \cdot \eta^{(i)} \big) = \sum_{i=1}^{|D|} w_i \Big( \sum_j \alpha_{ij} K(x, x^{(j)}) \Big) \Big( \sum_j \alpha_{ij} K(x', x^{(j)}) \Big)

We should point out that when the data are not sufficient or the dimension of the kernel space is infinite, not all the bases of the kernel space can be expressed by combinations of the data. Therefore this method only yields an approximation K'_w(x, x') of K_w(x, x'). However, the following theorem shows that the approximation is reasonable.

Theorem 1. Let S be the space spanned by the available high-dimensional data. If φ(x') ∈ S, then K_w(x, x') = K'_w(x, x').

Proof. K_w(x, x') = K'_w(x, x') + \sum_{i=|D|+1}^{\infty} w_i (\phi(x) \cdot \eta^{(i)}) (\phi(x') \cdot \eta^{(i)}), where the η^(i) with i > |D| are bases of the orthogonal complement of S. Since φ(x) is contained in S, η^(i) · φ(x) = 0 for i > |D|. Hence K'_w(x, x') = K_w(x, x').

If φ(x') ∉ S, the difference between K'_w(x, x') and K_w(x, x') is controlled by φ(x') − φ_S(x'), where φ_S(x') is the projection of φ(x') onto S. If we assume that the training and test data are sampled from the same distribution, then this term can be regarded as extremely small and the approximation is reasonable.

3.3. Finding the Basis Set

From Section 3.2 we can see that, to extend Relief to a kernel space, a critical issue is to find a basis set in the kernel space. In this section, we propose to use the Kernel Gram-Schmidt Process to find an orthogonal basis set. Given a data set D in the original space and a kernel function K(x, x') with an implicit mapping function φ(x), the mapped data in the kernel space are D' = {φ(x^(i)) | i = 1, ..., |D|}. Let µ = (φ(x^(1)), φ(x^(2)), ..., φ(x^(|D|)))^T and assume that the rank of D' is |D|. The Kernel Gram-Schmidt Process, given as Algorithm 1, finds a subset of an orthogonal basis set in the kernel space.

Algorithm 1 Kernel Gram-Schmidt Process
Input: data x^(i), i = 1, ..., N
Output: an orthogonal set of basis vectors
for i = 1 to N do
  v^(i) = φ(x^(i))
  for j = 1 to i - 1 do
    v^(i) ← v^(i) − ⟨φ(x^(i)), v^(j)⟩ v^(j)
  end for
  Normalize: v^(i) ← v^(i) / ||v^(i)||
end for
Output: basis set {v^(i)}

As output, each basis vector is a linear combination of the data in the high-dimensional space, that is, v^(i) = \sum_j \alpha_{ij} \phi(x^{(j)}). The mixture weights α^(i) = (α_{i1}, α_{i2}, ..., α_{i|D|}) are calculated as

\alpha^{(i)} = \frac{ e^{(i)} - \sum_{k=1}^{i-1} \big( \sum_j \alpha_{kj} K(x^{(i)}, x^{(j)}) \big) \alpha^{(k)} }{ \big\| \big( e^{(i)} - \sum_{k=1}^{i-1} \big( \sum_j \alpha_{kj} K(x^{(i)}, x^{(j)}) \big) \alpha^{(k)} \big) \, \mu \big\| }

where e^(i) is the coefficient vector whose i-th element equals one and whose other elements equal zero, and the norm in the denominator is taken in the kernel space (the norm of the corresponding combination of mapped data). When the input data are mapped into a kernel space, the dimensionality of the mapped data is usually very high and may even be infinite. We do not expect to find all the bases of the kernel space. In fact, the number of orthogonal bases obtained by the Gram-Schmidt Process is the rank of the mapped data set {φ(x)}, which also equals the numerical rank of the kernel matrix. Besides the Kernel Gram-Schmidt Process, other methods such as Kernel PCA can also be used to find a basis set in a kernel space.
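As a concrete illustration, here is a minimal coefficient-form implementation of the Kernel Gram-Schmidt Process above, together with the approximate weighted kernel it enables (our own sketch, assuming each basis vector is stored only through its coefficient vector α over the mapped training points, so that ⟨φ(x), v⟩ = Σ_j α_j K(x, x_j)):

```python
import numpy as np

def kernel_gram_schmidt(K, tol=1e-10):
    """Kernel Gram-Schmidt in coefficient form.

    K: (n, n) kernel (Gram) matrix of the training data.  Returns A of shape (r, n):
    row i holds the coefficients alpha_i with v_i = sum_j A[i, j] * phi(x_j), and the
    v_i are orthonormal in the kernel space (A @ K @ A.T is the identity).  r is the
    numerical rank of K.
    """
    n = K.shape[0]
    basis = []
    for i in range(n):
        a = np.zeros(n)
        a[i] = 1.0                        # start from phi(x_i), i.e. e^(i) in coefficients
        for b in basis:
            proj = b @ K[:, i]            # <phi(x_i), v> = sum_j b_j K(x_j, x_i)
            a -= proj * b                 # subtract the projection onto v
        sq_norm = a @ K @ a               # squared kernel-space norm of sum_j a_j phi(x_j)
        if sq_norm > tol:                 # skip (numerically) dependent vectors
            basis.append(a / np.sqrt(sq_norm))
    return np.array(basis)

def weighted_kernel(A, w, kx, kx2):
    """Approximate K_w(x, x') = sum_i w_i <phi(x), v_i> <phi(x'), v_i>, given
    kx[j] = K(x, x_j) and kx2[j] = K(x', x_j)."""
    return np.sum(w * (A @ kx) * (A @ kx2))

# Toy check with an RBF kernel: the recovered basis is orthonormal in the kernel space.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
K = np.exp(-sq / 2.0)
A = kernel_gram_schmidt(K)
print(np.allclose(A @ K @ A.T, np.eye(A.shape[0]), atol=1e-6))
```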
Since the orthogonal basis sets found by different methods are all orthogonal in the kernel subspace S, and since the number of bases equals the rank of the high-dimensional data set {φ(x)}, the basis sets found by the Gram-Schmidt Process or by other methods are equivalent under an orthogonal transformation. We now show that the choice of basis set does not affect the distance measure learned in the kernel space.

Theorem 2. Let {e^(i)} and {e'^(i)} be two different orthogonal basis sets of a kernel space which are equivalent under an orthogonal transformation. Then a weighting transformation under {e'^(i)} is also a weighting transformation under {e^(i)}.

Proof. Suppose e'^(i) = \sum_j \xi_{ij} e^{(j)}. Then

\phi_w(x) = \sum_i w_i \langle \phi(x), e'^{(i)} \rangle = \sum_i w_i \Big\langle \phi(x), \sum_j \xi_{ij} e^{(j)} \Big\rangle = \sum_j w'_j \langle \phi(x), e^{(j)} \rangle, \qquad \text{where } w'_j = \sum_i w_i \xi_{ij}.

Theorem 3. If the orthogonal basis sets of a kernel space are equivalent under an orthogonal transformation, then the optimal K_w(x, x') learned under different basis sets are equivalent.

Proof. Suppose {e^(i)} and {e'^(i)} are two different orthogonal basis sets of the kernel space which are equivalent under an orthogonal transformation. Based on Theorem 2, the weighting problem under {e'^(i)} can be converted to the weighting problem under {e^(i)}. Therefore there exists a K_w(x, x') under {e^(i)} equivalent to the K_w(x, x') under {e'^(i)}. Since the optimal K_w(x, x') is unique, the above conclusion can be drawn.

From Theorem 3, we can see that no matter what particular method is used to find the basis set in a kernel space, the optimal distance K_w(x, x') remains the same. However, although the weighted feature-space kernels K_w(x, x') are all equivalent, when we discretize the computed weights to binary values (zero or one) for feature selection, the results may still differ. We plan to investigate this difference in future work. In this paper, we use the Kernel Gram-Schmidt Process and Kernel PCA to find the orthogonal basis set in the kernel space.

3.4. Feature-Selection Algorithms

Our feature-selection framework is summarized in Algorithm 2, where different basis-computing methods yield different algorithms. When Kernel PCA is applied to find the kernel basis set, we call the algorithm FSKSPCA. When the Kernel Gram-Schmidt Process is used, we call the algorithm FSKSGP.

Algorithm 2 FSKS
Input: training data x^(i), labels y^(i)
Output: selected features in the kernel space
Step 1: Construct a basis set by either the Kernel Gram-Schmidt Process or Kernel PCA
Step 2: Calculate the weights w_i by Kernel Relief
Step 3: Rank the implicit features by w_i and select features based on the rank
Step 4: Project the data into the learned subspace
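The following is a compact, runnable sketch of the four steps of Algorithm 2, assuming the Kernel PCA route for step 1 (i.e., the FSKSPCA variant) and the basic Relief update of Section 3.1 for step 2; the function names, the toy data and all parameter choices are ours, not the authors'. The returned matrix B maps the kernel column of any point onto the selected subspace, so test data can then be classified there with any classifier.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def fsks_fit(K, y, n_select):
    """Sketch of the FSKS pipeline: basis construction, kernel Relief weighting, selection.

    K: (n, n) kernel matrix of the training data; y: labels in {+1, -1}.
    Returns the indices of the selected implicit features and a matrix B that maps a
    kernel column k(x) = [K(x, x_1), ..., K(x, x_n)] onto the selected coordinates.
    """
    n = K.shape[0]
    # Step 1: orthonormal basis of the kernel subspace via eigendecomposition (Kernel PCA route).
    lam, U = np.linalg.eigh(K)
    keep = lam > 1e-10
    A = (U[:, keep] / np.sqrt(lam[keep])).T        # row i: coefficients alpha_i of basis vector v_i
    Z = K @ A.T                                    # implicit features: Z[m, i] = <phi(x_m), v_i>
    # Step 2: kernel Relief, with nearhit/nearmiss found under the kernel-induced distance (Eq. 1).
    d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
    np.fill_diagonal(d2, np.inf)
    m = np.zeros(Z.shape[1])
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        nearhit = same[np.argmin(d2[i, same])]
        nearmiss = diff[np.argmin(d2[i, diff])]
        m += np.abs(Z[i] - Z[nearmiss]) - np.abs(Z[i] - Z[nearhit])
    w = np.maximum(m, 0.0)
    w = w / (np.linalg.norm(w) + 1e-12)
    # Steps 3-4: rank the implicit features by weight and keep the projection onto the best ones.
    selected = np.argsort(w)[::-1][:n_select]
    return selected, A[selected]

# Toy usage: project training data into the selected kernel subspace.
rng = np.random.default_rng(3)
X = rng.standard_normal((60, 4))
y = np.sign(X[:, 0] * X[:, 1])
y[y == 0] = 1                                      # guard against exact zeros
K = rbf_gram(X)
selected, B = fsks_fit(K, y, n_select=10)
Z_train = K @ B.T                                  # data in the selected subspace, ready for any classifier
```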
3.5. Discussion

Feature selection in the kernel space has a very close relationship with SVM regularization methods such as (Wu et al., 2005). We share the interest in finding a discriminating subspace of the feature space F. The difference is that our method finds an explicit expression of the subspace and removes the dependence on a particular classification algorithm.

4. Experiments

In this section, we conduct extensive experiments to compare our proposed feature-selection method with some state-of-the-art baselines. The experimental results validate the effectiveness of our method.

4.1. Case Study: Discriminative Power

The iris-flower data set (Fisher, 1936) is widely used to test feature selection and feature extraction methods (Baudat & Anouar, 2000; Yan et al., 2005). The data set contains 150 examples in 3 categories. The original feature space of the data has four dimensions: sepal length, sepal width, petal length and petal width. We use this dataset to illustrate that our method, by extending Relief to kernel spaces, can effectively find discriminative features for classification. The result of applying our method to the iris data is shown in Figure 2. In this experiment, an RBF kernel with σ = 1 is used. We construct 150 implicit features in the kernel space, and they are ranked according to their weights. Features that are top-ranked are expected to contain more discriminative information. This is validated by the two sub-figures in Figure 2. The left sub-figure shows the data projected on the subspace constructed by the first and second top-ranked features, whereas the right sub-figure shows the data projected on the subspace of the ninth and tenth features. From Figure 2, it is clear that the first two features are much more discriminative for classification than the ninth and tenth features.

Figure 2: Iris data in different kernel subspaces

4.2. Case Study: Robustness

Kernel selection has always been regarded as a critical issue in machine learning. If the kernel function is not appropriately selected for a given task, one will generally obtain poor performance. However, our feature-selection method has a natural way of automatically correcting such an improper selection: even when a kernel function is not appropriately selected, or the kernel parameters are not properly chosen, our method can still perform well. This idea is validated in the following experiment. We still use the iris dataset, with a Sigmoid kernel defined by K(x, x') = tanh(λ x · x' + 1), and vary the parameter λ over a range of values. From Figure 3, we can see that the Sigmoid kernel is not appropriate for classifying this dataset, since the performance of SVM with this kernel is very poor compared to the classifier using the RBF kernel. However, our method can still achieve good results by selecting a small number of features, where the classification result using the Sigmoid kernel is comparable to the performance achieved by the SVM using the RBF kernel.

Figure 3: Plotting error rates against λ (SVM with the Sigmoid kernel and SVM with the RBF kernel)

4.3. Tests on Classification Performance

4.3.1. Data Sets and Baselines

We next perform a series of classification experiments over two categories of datasets: a simulated dataset and a set of benchmark datasets from the IDA Benchmark repository. The data are given in predefined splits into training and test samples. We use four datasets, german, diabetis, heart and thyroid, in our experiments. The information on the datasets is shown in Table 1, in which #Best is the number of features at which the best performance is achieved. The results reported in this section are all average values over the different training and test splits.

Table 1: IDA data sets used in the experiments (#Train, #Test, #Features, #Basis and #Best for german, diabetis, heart and thyroid)

In order to verify the effectiveness of our methods, we selected several state-of-the-art baselines for our comparison:

1. Generalized Discriminant Analysis (GDA) (Baudat & Anouar, 2000);
2. Kernel Fisher Discriminant Analysis (KFDA) (Mika et al., 1999a);
3. Support Vector Machine (SVM);
4. Kernel K-Nearest Neighbor (KKNN).

GDA and KFDA can generate at most n - 1 features if the total number of categories is n. In the following experiments, we use n - 1 features for GDA and KFDA uniformly. KKNN is the K-Nearest Neighbor algorithm based on the distance function induced by the kernel function; in the experiments, K is kept fixed.

4.3.2. Experiment on Simulated Data

We first generated simulated data as shown in Figure 4. In this experiment, we use the RBF kernel with σ = 1. The rank of the mapped data in the kernel space is 5, which corresponds to the number of implicit features in the kernel space. Given the number of selected features, we map the data into the corresponding subspace of the kernel space and then calculate a distance ratio of the data, defined as

r = \frac{ \sum_{i,j:\, y^{(i)} = y^{(j)}} \mathrm{Distance}(x^{(i)}, x^{(j)}) }{ \sum_{i,j} \mathrm{Distance}(x^{(i)}, x^{(j)}) }

The numerator is the sum of the distances between within-class data pairs, while the denominator is the sum of the distances between all data pairs. The smaller r is, the easier it is for the data to be separated.
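A minimal sketch of the distance-ratio computation r defined above (our own code; Z is assumed to hold the data projected on the selected implicit features, for instance as produced by the FSKS sketch earlier):

```python
import numpy as np

def distance_ratio(Z, y):
    """r = (sum of pairwise distances within the same class) / (sum of all pairwise distances).
    A smaller r means the classes are easier to separate in the space holding Z."""
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    same_class = (y[:, None] == y[None, :])
    return dist[same_class].sum() / dist.sum()

# Toy usage: a subspace that separates the classes yields a smaller ratio than pure noise.
rng = np.random.default_rng(4)
y = np.repeat([1, -1], 40)
Z_good = np.column_stack([3.0 * y + 0.5 * rng.standard_normal(80),
                          rng.standard_normal(80)])
Z_noise = rng.standard_normal((80, 2))
print(distance_ratio(Z_good, y), distance_ratio(Z_noise, y))
```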
The relationship between the distance ratio and the number of implicit features is shown in Figure 5. The dashed line shows the ratio in the full kernel space, where we consider all the features. The solid line shows the ratio in the subspace of the kernel space. From Figure 5, we can see that the first few implicit features reduce the distance ratio.
This means that the first few features contain much discriminative information for classification. However, when more implicit features are added, the distance ratio increases, which indicates that the newly added features are noisy and more or less irrelevant for our classification task. When all implicit features are considered, we return to the full kernel space, where the distance ratio is the worst. This experiment also indicates that even when the data lie in a relatively low-dimensional subspace of the kernel space, feature selection is still needed to remove the noisy features.

Figure 4: Data in the original space
Figure 5: Distance ratio against the number of selected features

4.3.3. Experiments on IDA Data

We test our feature selection algorithm on the four IDA datasets using two classifiers: SVM and KKNN. The results using SVM are shown in Figures 6, 8, 10 and 12, and the results using the KKNN classifier are shown in Figures 7, 9, 11 and 13. In these figures, the X-axis is the number of selected features and the Y-axis is the predictive error of the classifier. From the experimental results shown in Figures 6 to 13, we can see that our method achieves performance comparable, if not better, to the other state-of-the-art algorithms. Often, our method achieves the best result when only a small number of features are selected. When using SVM as the classifier, our methods reduce the error by more than 1% and 5% on the thyroid data set and the heart data set, respectively, compared to the other baselines. When using KKNN as the classifier, the improvement is over 5% on average. Considering that the computational complexity of our method is O(N^2) while those of GDA and KFDA are O(N^2 k), the performance of our method is encouraging. Another observation from the experimental results is that in most cases the best performance of GDA and KFDA is similar; the method whose first few features are better also incurs a much higher computational cost because of the SVD computation.

Figure 6: Classification result of german by SVM
Figure 7: Classification result of german by KKNN
Figure 8: Classification result of diabetis by SVM
Figure 9: Classification result of diabetis by KKNN
Figure 10: Classification result of thyroid by SVM
Figure 11: Classification result of thyroid by KKNN
Figure 12: Classification result of heart by SVM
Figure 13: Classification result of heart by KKNN
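For completeness, here is a minimal sketch of a kernel K-nearest-neighbor classifier of the kind described as the KKNN baseline in Section 4.3.1, based on the kernel-induced distance of Equation (1). This is our own rendering; the kernel, the value of K and all names are illustrative assumptions rather than the authors' setup.

```python
import numpy as np

def kknn_predict(K_train, K_cross, k_test_diag, y_train, k=5):
    """Kernel K-nearest-neighbor prediction with the kernel-induced distance
    d^2(x, x') = K(x, x) + K(x', x') - 2 K(x, x') of Equation (1).

    K_train: (n, n) training kernel matrix; K_cross: (m, n) values K(test_i, train_j);
    k_test_diag: (m,) values K(test_i, test_i); y_train: labels in {+1, -1}.
    """
    d2 = k_test_diag[:, None] + np.diag(K_train)[None, :] - 2 * K_cross
    neighbors = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[neighbors]
    return np.sign(votes.sum(axis=1) + 1e-12)      # majority vote, ties broken toward +1

def rbf(A, B, sigma=1.0):
    sq = np.sum((A[:, None] - B[None, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

# Toy usage with an RBF kernel (for which K(x, x) = 1 on the test diagonal).
rng = np.random.default_rng(5)
Xtr = rng.standard_normal((100, 2))
ytr = np.sign(Xtr[:, 0])
Xte = rng.standard_normal((20, 2))
print(kknn_predict(rbf(Xtr, Xtr), rbf(Xte, Xtr), np.ones(20), ytr, k=5))
```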

5. Conclusions and Future Work

In this paper, we have presented a novel method for feature selection in a kernel space. By utilizing the Kernel Gram-Schmidt Process, we construct an orthogonal basis set in the kernel space. We then extend margin-based feature-selection methods to the kernel space, so that we can select a subspace of the kernel space which contains the most discriminative information for classification. We have illustrated the advantages of our method over simulated and real data, comparing our methods against several state-of-the-art baselines. The experimental results show that our methods can achieve comparable or better performance than the best baseline while having a lower computational complexity. In the future, we plan to further explore the relationship between our method and SVM regularization methods. We also want to theoretically estimate the error bound when we apply feature selection in a kernel space.

Acknowledgments

We would like to thank Jianmin Wu for the valuable discussions. Dou Shen and Qiang Yang are supported by a grant from Hong Kong RGC.

References

Aha, D. W. (1990). A study of instance-based algorithms for supervised learning tasks: Mathematical, empirical, and psychological evaluations. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine.
Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12.
Baudat, G., & Anouar, F. (2003). Feature vector selection and projection using kernels. Neurocomputing, 55.
Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. ICML '98.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego, CA, USA: Academic Press Professional, Inc.
Gilad-Bachrach, R., Navot, A., & Tishby, N. (2004). Margin based feature selection - theory and algorithms. ICML '04.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3.
Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge University Press.
Jolliffe, I. T. (2002). Principal component analysis. Springer Verlag.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. ML '92.
Kononenko, I. (1994). Estimating attributes: analysis and extensions of Relief. ECML-94: Proceedings of the European Conference on Machine Learning.
Liang, Z., & Zhao, T. (2006). Feature selection for linear support vector machines. ICPR '06.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999a). Fisher discriminant analysis with kernels.
Mika, S., Schölkopf, B., Smola, A. J., Müller, K.-R., Scholz, M., & Rätsch, G. (1999b). Kernel PCA and de-noising in feature spaces. Advances in Neural Information Processing Systems 11. MIT Press.
Niijima, S., & Kuhara, S. (2006). Gene subset selection in kernel-induced feature space. Pattern Recognition Letters, 27.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: The MIT Press.
Sun, Y., & Li, J. (2006). Iterative RELIEF for feature weighting. ICML '06.
Weston, J., Elisseeff, A., Schölkopf, B., & Tipping, M. (2003). Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3.
Wu, M., Schölkopf, B., & Bakir, G. (2005). Building sparse large margin classifiers. ICML '05.
Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., Fan, W., & Ma, W.-Y. (2005). OCFS: optimal orthogonal centroid feature selection for text categorization. SIGIR '05.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. ICML '97.


More information

Discriminate Analysis

Discriminate Analysis Discriminate Analysis Outline Introduction Linear Discriminant Analysis Examples 1 Introduction What is Discriminant Analysis? Statistical technique to classify objects into mutually exclusive and exhaustive

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Kernel Combination Versus Classifier Combination

Kernel Combination Versus Classifier Combination Kernel Combination Versus Classifier Combination Wan-Jui Lee 1, Sergey Verzakov 2, and Robert P.W. Duin 2 1 EE Department, National Sun Yat-Sen University, Kaohsiung, Taiwan wrlee@water.ee.nsysu.edu.tw

More information

Hyperspectral Data Classification via Sparse Representation in Homotopy

Hyperspectral Data Classification via Sparse Representation in Homotopy Hyperspectral Data Classification via Sparse Representation in Homotopy Qazi Sami ul Haq,Lixin Shi,Linmi Tao,Shiqiang Yang Key Laboratory of Pervasive Computing, Ministry of Education Department of Computer

More information

Conflict Graphs for Parallel Stochastic Gradient Descent

Conflict Graphs for Parallel Stochastic Gradient Descent Conflict Graphs for Parallel Stochastic Gradient Descent Darshan Thaker*, Guneet Singh Dhillon* Abstract We present various methods for inducing a conflict graph in order to effectively parallelize Pegasos.

More information

Feature Ranking Using Linear SVM

Feature Ranking Using Linear SVM JMLR: Workshop and Conference Proceedings 3: 53-64 WCCI2008 workshop on causality Feature Ranking Using Linear SVM Yin-Wen Chang Chih-Jen Lin Department of Computer Science, National Taiwan University

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min

More information

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology

More information

Tensor Sparse PCA and Face Recognition: A Novel Approach

Tensor Sparse PCA and Face Recognition: A Novel Approach Tensor Sparse PCA and Face Recognition: A Novel Approach Loc Tran Laboratoire CHArt EA4004 EPHE-PSL University, France tran0398@umn.edu Linh Tran Ho Chi Minh University of Technology, Vietnam linhtran.ut@gmail.com

More information

Query By Committee Made Real

Query By Committee Made Real Query By Committee Made Real Ran Gilad-Bachrach Amir Navot Naftali Tishby School of Computer Science and Engineering Interdisciplinary Center for Neural Computation The Hebrew University, Jerusalem, Israel.

More information

Kernel Discriminant Analysis and information complexity: advanced models for micro-data mining and micro-marketing solutions

Kernel Discriminant Analysis and information complexity: advanced models for micro-data mining and micro-marketing solutions Data Mining VII: Data, Text and Web Mining and their Business Applications 115 Kernel Discriminant Analysis and information complexity: advanced models for micro-data mining and micro-marketing solutions

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions

More information