Feature Selection in a Kernel Space


Bin Cao, Peking University, Beijing, China
Dou Shen, Hong Kong University of Science and Technology, Hong Kong
Jian-Tao Sun, Microsoft Research Asia, 49 Zhichun Road, Beijing, China
Qiang Yang, Hong Kong University of Science and Technology, Hong Kong
Zheng Chen, Microsoft Research Asia, 49 Zhichun Road, Beijing, China

Abstract

We address the problem of feature selection in a kernel space to select the most discriminative and informative features for classification and data analysis. This is a difficult problem because the dimension of a kernel space may be infinite. In the past, little work has been done on feature selection in a kernel space. To solve this problem, we derive a basis set in the kernel space as a first step for feature selection. Using the basis set, we then extend the margin-based feature selection algorithms that are proven effective even when many features are dependent. The selected features form a subspace of the kernel space, in which different state-of-the-art classification algorithms can be applied for classification. We conduct extensive experiments over real and simulated data to compare our proposed method with four baseline algorithms. Both theoretical analysis and experimental results validate the effectiveness of our proposed method.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

1. Introduction

Finding a proper representation of data from their original features is a fundamental problem in machine learning. Generally speaking, not all original features are beneficial for classification tasks. Some of the features, which can even be noise, will hurt the classification performance. Finding a proper representation is essential for removing the noisy features or for deriving new features from the original space, so that we only keep the information necessary for classification and data analysis. Solutions for this problem can be broadly classified into two groups: feature extraction (such as PCA (Jolliffe, 2002) and LDA (Fukunaga, 1990)) and feature selection (such as IG, MI (Guyon & Elisseeff, 2003), and Relief (Kira & Rendell, 1992)). Traditional linear feature selection and extraction operations are conducted in the original input space, and thus cannot handle nonlinear relationships in the data well. For example, the principal components of the features may be nonlinearly related to the input variables, and the data of different categories may not be separable by a hyperplane. To tackle this problem, kernel methods are introduced by mapping the data from an original space (or an input space) to a kernel space using a specially designed mapping function. Among these methods, Kernel Principal Component Analysis (KPCA) (Schölkopf & Smola, 2002) was proposed to find nonlinear principal components.
However, as an unsupervised learning method, KPCA does not take the category information into consideration. Kernel Fisher Discriminant Analysis (KFDA) (Mika et al., 1999a) and Generalized Discriminant Analysis (GDA) (Baudat & Anouar, 2000) can make use of category information. However, the number of features extracted by these methods is limited by the number of categories, which is at most n - 1 if the category count is n. In practice, it is not uncommon that more than n - 1 features are useful for classification. In addition, these kernelized methods have high computational complexity, since they rely on the computation of the SVD (Horn & Johnson, 1985), which requires at least O(N^2 k) time to our best knowledge, where N is the number of data points and k is the number of eigenvectors. Intuitively, we could adapt information gain (IG) or mutual information (MI) methods to the kernel space. However, the mapping functions from the original space to the kernel space are usually implicit. Even if the function is explicit, the dimension of the kernel space may be infinite. Furthermore, feature dependency often prevents us from considering the features of a kernel space individually, as IG or MI would.

Figure 1: Illustrating the XOR data distribution. The left sub-figure shows the data distribution in the original space, where the data are not linearly separable. The middle sub-figure shows the data distribution in the kernel space, where the data are linearly separable. The rightmost sub-figure shows the data distribution in a kernel subspace, where the data are linearly separable in a space of low dimension.

In this paper, we present a novel method for feature selection directly in the kernel space, without moving back to the original feature space as in (Bradley & Mangasarian, 1998; Liang & Zhao, 2006). To do this, we first construct a basis set in the kernel space. We then extend the margin-based feature-selection methods to select the best bases for classification. Finally, we classify the test data in the kernel space with the selected basis set directly. Compared to methods such as KPCA, KFDA and GDA, the computational complexity of our method is reduced significantly to O(N^2). We conduct extensive experiments over the IDA data sets to compare our method with four state-of-the-art baselines. Theoretical analysis and experimental results validate the effectiveness of our proposed method.

2. Related Works

Traditional feature selection methods such as Information Gain and the χ2-test consider the contribution of each feature to the classification task independently. They have been proven effective in many applications (Guyon & Elisseeff, 2003; Yang & Pedersen, 1997). However, they do not work well when features are not independent of each other, as in the XOR problem shown in the leftmost sub-figure of Figure 1. Recently, margin-based feature selection or feature weighting methods have been proposed, including ReliefF (Kononenko, 1994), Simba (Gilad-Bachrach et al., 2004) and I-Relief (Sun & Li, 2006). They are derived from the basic Relief algorithm (Kira & Rendell, 1992), which weights each feature so as to maximize the margin. The margin of a data point x used in these methods is defined in terms of the distances from x to its nearest same-labeled point (the nearhit) and to its nearest differently-labeled point (the nearmiss). The advantage of margin-based feature selection methods is that the optimal weights of the features can be estimated by considering global information. Thus these methods work well even for data with dependent features (Gilad-Bachrach et al., 2004).
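As a small illustration of the point made by Figure 1, the following sketch (entirely our own, not from the paper) generates XOR-style data and shows that the explicit feature map of a degree-2 polynomial kernel makes them separable by a single coordinate, while no hyperplane separates them in the original two dimensions. The data generator and kernel choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four XOR-style clusters: label +1 in quadrants (+,+) and (-,-), label -1 otherwise.
centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
labels = np.array([1, 1, -1, -1])
X = np.vstack([c + 0.2 * rng.standard_normal((50, 2)) for c in centers])
y = np.repeat(labels, 50)

def poly2_map(X):
    """Explicit feature map of the degree-2 polynomial kernel K(x, x') = (x . x')^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

# In the mapped space a single coordinate (sqrt(2) * x1 * x2) already separates the classes,
# whereas no hyperplane in the original two dimensions can do so.
Z = poly2_map(X)
print("mapped coordinate separates the classes:", np.all(np.sign(Z[:, 2]) == y))
```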
The concept of margins has been widely used in machine learning over the past decades to measure a model's generalization ability. Our work is also related to kernel methods for feature selection (Bradley & Mangasarian, 1998; Liang & Zhao, 2006). However, most of the previous work in this area uses kernels to help select features in the original space. In contrast, our goal is to select features in the kernel space, because feature selection in the original space cannot change the non-separable nature of the data. Feature selection in a kernel space can handle this problem, as illustrated in the rightmost sub-figure of Figure 1. Kernel-based feature-extraction methods have also been designed to address the nonlinear projection problem. Kernel Principal Component Analysis (Schölkopf & Smola, 2002) was proposed to find principal components in high-dimensional feature spaces. KPCA is shown to be effective in many tasks such as image de-noising (Mika et al., 1999b). However, KPCA does not consider the category information. As a result, features extracted by KPCA may be irrelevant to a classification task. To tackle this problem, kernel-based LDA methods such as KFDA (Mika et al., 1999a) and GDA (Baudat & Anouar, 2000) have been proposed. These methods extract features that are discriminative and beneficial for classification in a kernel space. However, the number of features they extract is limited by the number of categories.

(Weston et al., 2003) discussed the problem of kernel feature selection when the mapping function between the original space and the kernel space can be explicitly expressed. This work is limited to certain types of kernels, such as polynomial kernels; our method does not have this restriction and can deal with any kernel. (Baudat & Anouar, 2003) proposed a method for feature-vector selection using data selection. However, similar to KPCA, the method does not find discriminative feature vectors. In (Niijima & Kuhara, 2006), Niijima proposed a wrapper method that utilizes kernel-based classifiers for gene-subset selection. However, the wrapper method faces the problem of high computational complexity. In (Wu et al., 2005), the authors proposed Sparse Large Margin Classifiers. They find that sparse SVM classifiers in a kernel space usually have better generalization ability. However, their work is restricted to SVM classifiers, whereas our method can be applied to different classifiers.

3. Feature Selection in a Kernel Space

Let D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N} \subset \mathbb{R}^I \times \{\pm 1\} denote a training dataset, where I is the data dimensionality. K(x, x') is the kernel function with an implicit mapping function φ(x) = z, where x ∈ R^I and z ∈ F, with F a high-dimensional space. Generally speaking, the features in the kernel space are not assumed to be independent. Therefore, feature selection methods that consider each feature individually are unlikely to work well in a kernel space. However, a margin-based feature-selection method can handle the feature-dependency problem successfully, as explored in (Gilad-Bachrach et al., 2004). In the following, we give a brief introduction to the basic margin-based feature-selection method Relief (Kononenko, 1994; Sun & Li, 2006), and then present our algorithm to extend Relief to a kernel space.

3.1. Margin Based Feature Selection (Weighting) Methods

Margin-based feature-selection methods such as ReliefF, Simba and I-Relief have been developed based on the idea of Relief, which uses a weighting-based method for feature selection. The weight of each feature reflects its ability to discriminate different categories. Relief tries to find weights by maximizing the margins between data of different categories. The margin of a data point x used in Relief is defined by the distances to the nearhit NH(x) and the nearmiss NM(x) of the target data point, which are the nearest points with the same and with a different label, respectively. The margin of the whole training dataset is the sum of the margins of all data points, which leads to the following optimization problem:

\max_{w} \sum_{n=1}^{|D|} \Big( D_w\big(x^{(n)}, NM(x^{(n)})\big) - D_w\big(x^{(n)}, NH(x^{(n)})\big) \Big) \quad \text{s.t. } \|w\|_2^2 = 1, \; w \geq 0

where D_w(x, x') = \sum_{i=1}^{I} w_i |x_i - x'_i| is the weighted distance and w_i is the weight of the i-th feature. Thus, the feature weighting problem is converted into an optimization problem that maximizes the above objective. The solution to this problem is the weight vector w = (m)_+ / ||(m)_+||_2, where

m = \sum_{n=1}^{|D|} \Big( \big| x^{(n)} - NM(x^{(n)}) \big| - \big| x^{(n)} - NH(x^{(n)}) \big| \Big)

is the margin vector (with the absolute value taken element-wise) and (m_i)_+ = max(m_i, 0). We refer the reader to (Kononenko, 1994; Sun & Li, 2006) for details of Relief.
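A minimal sketch of the Relief-style weighting just described (our own NumPy rendering under the definitions above, not the authors' code): for every training point we locate its nearhit and nearmiss, accumulate the margin vector m, and return (m)_+ normalized to unit length.

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief-style feature weights: w = (m)_+ / ||(m)_+||_2, where the margin
    vector m sums |x - nearmiss(x)| - |x - nearhit(x)| (element-wise) over all points."""
    n, d = X.shape
    m = np.zeros(d)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # never pick the point itself as a neighbour
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        nearhit = same[np.argmin(dist[i, same])]   # nearest point with the same label
        nearmiss = diff[np.argmin(dist[i, diff])]  # nearest point with a different label
        m += np.abs(X[i] - X[nearmiss]) - np.abs(X[i] - X[nearhit])
    m_plus = np.maximum(m, 0.0)
    norm = np.linalg.norm(m_plus)
    return m_plus / norm if norm > 0 else m_plus

# Toy usage: the informative first feature should receive almost all of the weight.
rng = np.random.default_rng(1)
y = np.repeat([1, -1], 50)
X = np.column_stack([y + 0.3 * rng.standard_normal(100),   # informative feature
                     rng.standard_normal(100)])            # pure noise feature
print(relief_weights(X, y))
```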
3.2. Margin Based Feature Selection in a Kernel Space

Now we discuss how to extend Relief to a kernel space. Note that Relief calculates the weight of each feature by w_i = e^(i) · (m)_+ / ||(m)_+||_2, where e^(i) is the i-th basis vector of the Euclidean space and m is the margin vector. Similarly, we can derive the weight of a feature in the kernel space by w_i = e^(i) · (m')_+ / ||(m')_+||_2, where e^(i) is now a basis vector of the kernel space and m' is the margin vector in the kernel space. Recall that the margin defined in Relief is determined by the distance defined in a space. We can therefore extend Relief to a kernel space by using the distance function induced by the corresponding kernel function, as shown in Equation (1); the nearhit and nearmiss points are found in the kernel space based on this distance:

D_K(x, x') = \langle \phi(x) - \phi(x'), \phi(x) - \phi(x') \rangle = K(x, x) + K(x', x') - 2K(x, x')    (1)

where φ(x) is the mapping function corresponding to the kernel K(x, x'). When we want to calculate the distance in the weighted kernel space, we need to use the weighted kernel K_w, defined as

K_w(x, x') = \sum_i w_i \phi_i(x) \phi_i(x')

where φ_i(x) is the i-th feature of φ(x). In the kernel space, the index i may range to infinity. Since we do not have φ_i(x) explicitly, we cannot calculate the above equation directly.
However, if we have an orthogonal basis set {η^(i)} in the kernel space, then the function K_w(x, x') can be calculated as

K_w(x, x') = \sum_i w_i \langle \phi(x), \eta^{(i)} \rangle \langle \phi(x'), \eta^{(i)} \rangle

Given a data set D, we let the bases of the kernel space be expressed by combinations of the data in the high-dimensional space, that is, η^(i) = \sum_j \alpha_{ij} \phi(x^{(j)}). We then have

K_w(x, x') = \sum_{i=1}^{|D|} w_i \big( \phi(x) \cdot \eta^{(i)} \big) \big( \phi(x') \cdot \eta^{(i)} \big) = \sum_{i=1}^{|D|} w_i \Big( \sum_j \alpha_{ij} K(x, x^{(j)}) \Big) \Big( \sum_j \alpha_{ij} K(x', x^{(j)}) \Big)

We should point out that when the data are not sufficient or the dimension of the kernel space is infinite, not all the bases of the kernel space can be expressed by combinations of the data. Therefore this method only yields an approximation K'_w(x, x') of K_w(x, x'). However, the following theorem shows that the approximation is reasonable.

Theorem 1. Let S be the space spanned by the available high-dimensional data. If φ(x') ∈ S, then K_w(x, x') = K'_w(x, x').

Proof. K_w(x, x') = K'_w(x, x') + \sum_{i=|D|+1}^{\infty} w_i (\phi(x) \cdot \eta^{(i)}) (\phi(x') \cdot \eta^{(i)}), where the η^(i) with i > |D| are bases of the orthogonal complement of S. Since φ(x) is contained in S, η^(i) · φ(x) = 0 for i > |D|. Hence K'_w(x, x') = K_w(x, x').

If φ(x') ∉ S, the difference between K'_w(x, x') and K_w(x, x') is controlled by φ(x') − φ_S(x'), where φ_S(x') is the projection of φ(x') onto S. If we assume that the training and test data are sampled from the same distribution, then this term can be regarded as extremely small and the approximation is reasonable.

3.3. Finding the Basis Set

From Section 3.2 we can see that, to extend Relief to a kernel space, a critical issue is to find a basis set in the kernel space. In this section, we propose to use the Kernel Gram-Schmidt Process to find an orthogonal basis set. Given a data set D in the original space and a kernel function K(x, x') with an implicit mapping function φ(x), the mapped data in the kernel space are D' = {φ(x^(i)) | i = 1, ..., |D|}. Let µ = (φ(x^(1)), φ(x^(2)), ..., φ(x^(|D|)))^T and assume that the rank of D' is |D|. The Kernel Gram-Schmidt Process, given as Algorithm 1, finds a subset of an orthogonal basis set in the kernel space.

Algorithm 1 Kernel Gram-Schmidt Process
Input: data x^(i), i = 1, ..., N
Output: an orthogonal set of basis vectors
for i = 1 to N do
  v^(i) = φ(x^(i))
  for j = 1 to i - 1 do
    v^(i) ← v^(i) − ⟨φ(x^(i)), v^(j)⟩ v^(j)
  end for
  Normalize: v^(i) ← v^(i) / ||v^(i)||
end for
Output: basis set {v^(i)}

As output, each basis vector is a linear combination of the data in the high-dimensional space, that is, v^(i) = \sum_j \alpha_{ij} \phi(x^{(j)}). The mixture weights α^(i) = (α_{i1}, α_{i2}, ..., α_{i|D|}) are calculated as

\alpha^{(i)} = \frac{ e^{(i)} - \sum_{k=1}^{i-1} \big( \sum_j \alpha_{kj} K(x^{(i)}, x^{(j)}) \big) \alpha^{(k)} }{ \big\| \big( e^{(i)} - \sum_{k=1}^{i-1} \big( \sum_j \alpha_{kj} K(x^{(i)}, x^{(j)}) \big) \alpha^{(k)} \big) \, \mu \big\| }

where e^(i) is the coefficient vector whose i-th element equals one and whose other elements equal zero, and the norm in the denominator is taken in the kernel space (the norm of the corresponding combination of mapped data). When the input data are mapped into a kernel space, the dimensionality of the mapped data is usually very high and may even be infinite. We do not expect to find all the bases of the kernel space. In fact, the number of orthogonal bases obtained by the Gram-Schmidt Process is the rank of the mapped data set {φ(x)}, which also equals the numerical rank of the kernel matrix. Besides the Kernel Gram-Schmidt Process, other methods such as Kernel PCA can also be used to find a basis set in a kernel space.
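As a concrete illustration, here is a minimal coefficient-form implementation of the Kernel Gram-Schmidt Process above, together with the approximate weighted kernel it enables (our own sketch, assuming each basis vector is stored only through its coefficient vector α over the mapped training points, so that ⟨φ(x), v⟩ = Σ_j α_j K(x, x_j)):

```python
import numpy as np

def kernel_gram_schmidt(K, tol=1e-10):
    """Kernel Gram-Schmidt in coefficient form.

    K: (n, n) kernel (Gram) matrix of the training data.  Returns A of shape (r, n):
    row i holds the coefficients alpha_i with v_i = sum_j A[i, j] * phi(x_j), and the
    v_i are orthonormal in the kernel space (A @ K @ A.T is the identity).  r is the
    numerical rank of K.
    """
    n = K.shape[0]
    basis = []
    for i in range(n):
        a = np.zeros(n)
        a[i] = 1.0                        # start from phi(x_i), i.e. e^(i) in coefficients
        for b in basis:
            proj = b @ K[:, i]            # <phi(x_i), v> = sum_j b_j K(x_j, x_i)
            a -= proj * b                 # subtract the projection onto v
        sq_norm = a @ K @ a               # squared kernel-space norm of sum_j a_j phi(x_j)
        if sq_norm > tol:                 # skip (numerically) dependent vectors
            basis.append(a / np.sqrt(sq_norm))
    return np.array(basis)

def weighted_kernel(A, w, kx, kx2):
    """Approximate K_w(x, x') = sum_i w_i <phi(x), v_i> <phi(x'), v_i>, given
    kx[j] = K(x, x_j) and kx2[j] = K(x', x_j)."""
    return np.sum(w * (A @ kx) * (A @ kx2))

# Toy check with an RBF kernel: the recovered basis is orthonormal in the kernel space.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
K = np.exp(-sq / 2.0)
A = kernel_gram_schmidt(K)
print(np.allclose(A @ K @ A.T, np.eye(A.shape[0]), atol=1e-6))
```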
Since the orthogonal basis sets found by different methods are all orthogonal in the kernel subspace S, and since the number of bases equals the rank of the high-dimensional data set {φ(x)}, the basis sets found by the Gram-Schmidt Process or by other methods are equivalent under an orthogonal transformation. We now show that the choice of basis set does not affect the distance measure learned in the kernel space.

Theorem 2. Let {e^(i)} and {e'^(i)} be two different orthogonal basis sets of a kernel space which are equivalent under an orthogonal transformation. Then a weighting transformation under {e'^(i)} is also a weighting transformation under {e^(i)}.

Proof. Suppose e'^(i) = \sum_j \xi_{ij} e^{(j)}. Then

\phi_w(x) = \sum_i w_i \langle \phi(x), e'^{(i)} \rangle = \sum_i w_i \Big\langle \phi(x), \sum_j \xi_{ij} e^{(j)} \Big\rangle = \sum_j w'_j \langle \phi(x), e^{(j)} \rangle, \qquad \text{where } w'_j = \sum_i w_i \xi_{ij}.

Theorem 3. If the orthogonal basis sets of a kernel space are equivalent under an orthogonal transformation, then the optimal K_w(x, x') learned under different basis sets are equivalent.

Proof. Suppose {e^(i)} and {e'^(i)} are two different orthogonal basis sets of the kernel space which are equivalent under an orthogonal transformation. Based on Theorem 2, the weighting problem under {e'^(i)} can be converted to the weighting problem under {e^(i)}. Therefore there exists a K_w(x, x') under {e^(i)} equivalent to the K_w(x, x') under {e'^(i)}. Since the optimal K_w(x, x') is unique, the above conclusion can be drawn.

From Theorem 3, we can see that no matter what particular method is used to find the basis set in a kernel space, the optimal distance K_w(x, x') remains the same. However, although the weighted feature-space kernels K_w(x, x') are all equivalent, when we discretize the computed weights to binary values (zero or one) for feature selection, the results may still differ. We plan to investigate this difference in future work. In this paper, we use the Kernel Gram-Schmidt Process and Kernel PCA to find the orthogonal basis set in the kernel space.

3.4. Feature-Selection Algorithms

Our feature-selection framework is summarized in Algorithm 2, where different basis-computing methods yield different algorithms. When Kernel PCA is applied to find the kernel basis set, we call the algorithm FSKSPCA. When the Kernel Gram-Schmidt Process is used, we call the algorithm FSKSGP.

Algorithm 2 FSKS
Input: training data x^(i), labels y^(i)
Output: selected features in the kernel space
Step 1: Construct a basis set by either the Kernel Gram-Schmidt Process or Kernel PCA
Step 2: Calculate the weights w_i by Kernel Relief
Step 3: Rank the implicit features by w_i and select features based on the rank
Step 4: Project the data into the learned subspace
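The following is a compact, runnable sketch of the four steps of Algorithm 2, assuming the Kernel PCA route for step 1 (i.e., the FSKSPCA variant) and the basic Relief update of Section 3.1 for step 2; the function names, the toy data and all parameter choices are ours, not the authors'. The returned matrix B maps the kernel column of any point onto the selected subspace, so test data can then be classified there with any classifier.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def fsks_fit(K, y, n_select):
    """Sketch of the FSKS pipeline: basis construction, kernel Relief weighting, selection.

    K: (n, n) kernel matrix of the training data; y: labels in {+1, -1}.
    Returns the indices of the selected implicit features and a matrix B that maps a
    kernel column k(x) = [K(x, x_1), ..., K(x, x_n)] onto the selected coordinates.
    """
    n = K.shape[0]
    # Step 1: orthonormal basis of the kernel subspace via eigendecomposition (Kernel PCA route).
    lam, U = np.linalg.eigh(K)
    keep = lam > 1e-10
    A = (U[:, keep] / np.sqrt(lam[keep])).T        # row i: coefficients alpha_i of basis vector v_i
    Z = K @ A.T                                    # implicit features: Z[m, i] = <phi(x_m), v_i>
    # Step 2: kernel Relief, with nearhit/nearmiss found under the kernel-induced distance (Eq. 1).
    d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
    np.fill_diagonal(d2, np.inf)
    m = np.zeros(Z.shape[1])
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        nearhit = same[np.argmin(d2[i, same])]
        nearmiss = diff[np.argmin(d2[i, diff])]
        m += np.abs(Z[i] - Z[nearmiss]) - np.abs(Z[i] - Z[nearhit])
    w = np.maximum(m, 0.0)
    w = w / (np.linalg.norm(w) + 1e-12)
    # Steps 3-4: rank the implicit features by weight and keep the projection onto the best ones.
    selected = np.argsort(w)[::-1][:n_select]
    return selected, A[selected]

# Toy usage: project training data into the selected kernel subspace.
rng = np.random.default_rng(3)
X = rng.standard_normal((60, 4))
y = np.sign(X[:, 0] * X[:, 1])
y[y == 0] = 1                                      # guard against exact zeros
K = rbf_gram(X)
selected, B = fsks_fit(K, y, n_select=10)
Z_train = K @ B.T                                  # data in the selected subspace, ready for any classifier
```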
3.5. Discussion

Feature selection in the kernel space has a very close relationship with SVM regularization methods such as (Wu et al., 2005). We share the interest in finding a discriminating subspace of the feature space F. The difference is that our method finds an explicit expression of the subspace and removes the dependence on a particular classification algorithm.

4. Experiments

In this section, we conduct extensive experiments to compare our proposed feature-selection method with some state-of-the-art baselines. The experimental results validate the effectiveness of our method.

4.1. Case Study: Discriminative Power

The iris-flower data set (Fisher, 1936) is widely used to test feature selection and feature extraction methods (Baudat & Anouar, 2000; Yan et al., 2005). The data set contains 150 examples in 3 categories. The original feature space of the data has four dimensions: sepal length, sepal width, petal length and petal width. We use this dataset to illustrate that our method, by extending Relief to kernel spaces, can effectively find discriminative features for classification. The result of applying our method to the iris data is shown in Figure 2. In this experiment, an RBF kernel with σ = 1 is used. We construct 150 implicit features in the kernel space, and they are ranked according to their weights. Features that are top-ranked are expected to contain more discriminative information. This is validated by the two sub-figures in Figure 2. The left sub-figure shows the data projected on the subspace constructed by the first and second top-ranked features, whereas the right sub-figure shows the data projected on the subspace of the ninth and tenth features. From Figure 2, it is clear that the first two features are much more discriminative for classification than the ninth and tenth features.

Figure 2: Iris data in different kernel subspaces

4.2. Case Study: Robustness

Kernel selection has always been regarded as a critical issue in machine learning. If the kernel function is not appropriately selected for a given task, one will generally obtain poor performance. However, our feature-selection method has a natural way of automatically correcting such an improper selection: even when a kernel function is not appropriately selected, or the kernel parameters are not properly chosen, our method can still perform well. This idea is validated in the following experiment. We still use the iris dataset, with a Sigmoid kernel defined by K(x, x') = tanh(λ x · x' + 1), and vary the parameter λ over a range of values. From Figure 3, we can see that the Sigmoid kernel is not appropriate for classifying this dataset, since the performance of SVM with this kernel is very poor compared to the classifier using the RBF kernel. However, our method can still achieve good results by selecting a small number of features, where the classification result using the Sigmoid kernel is comparable to the performance achieved by the SVM using the RBF kernel.

Figure 3: Plotting error rates against λ (SVM with the Sigmoid kernel and SVM with the RBF kernel)

4.3. Tests on Classification Performance

4.3.1. Data Sets and Baselines

We next perform a series of classification experiments over two categories of datasets: a simulated dataset and a set of benchmark datasets from the IDA Benchmark repository. The data are given in predefined splits into training and test samples. We use four datasets, german, diabetis, heart and thyroid, in our experiments. The information on the datasets is shown in Table 1, in which #Best is the number of features at which the best performance is achieved. The results reported in this section are all average values over the different training and test splits.

Table 1: IDA data sets used in the experiments (#Train, #Test, #Features, #Basis and #Best for german, diabetis, heart and thyroid)

In order to verify the effectiveness of our methods, we selected several state-of-the-art baselines for our comparison:

1. Generalized Discriminant Analysis (GDA) (Baudat & Anouar, 2000);
2. Kernel Fisher Discriminant Analysis (KFDA) (Mika et al., 1999a);
3. Support Vector Machine (SVM);
4. Kernel K-Nearest Neighbor (KKNN).

GDA and KFDA can generate at most n - 1 features if the total number of categories is n. In the following experiments, we use n - 1 features for GDA and KFDA uniformly. KKNN is the K-Nearest Neighbor algorithm based on the distance function induced by the kernel function; in the experiments, K is kept fixed.

4.3.2. Experiment on Simulated Data

We first generated simulated data as shown in Figure 4. In this experiment, we use the RBF kernel with σ = 1. The rank of the mapped data in the kernel space is 5, which corresponds to the number of implicit features in the kernel space. Given the number of selected features, we map the data into the corresponding subspace of the kernel space and then calculate a distance ratio of the data, defined as

r = \frac{ \sum_{i,j:\, y^{(i)} = y^{(j)}} \mathrm{Distance}(x^{(i)}, x^{(j)}) }{ \sum_{i,j} \mathrm{Distance}(x^{(i)}, x^{(j)}) }

The numerator is the sum of the distances between within-class data pairs, while the denominator is the sum of the distances between all data pairs. The smaller r is, the easier it is for the data to be separated.
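A minimal sketch of the distance-ratio computation r defined above (our own code; Z is assumed to hold the data projected on the selected implicit features, for instance as produced by the FSKS sketch earlier):

```python
import numpy as np

def distance_ratio(Z, y):
    """r = (sum of pairwise distances within the same class) / (sum of all pairwise distances).
    A smaller r means the classes are easier to separate in the space holding Z."""
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    same_class = (y[:, None] == y[None, :])
    return dist[same_class].sum() / dist.sum()

# Toy usage: a subspace that separates the classes yields a smaller ratio than pure noise.
rng = np.random.default_rng(4)
y = np.repeat([1, -1], 40)
Z_good = np.column_stack([3.0 * y + 0.5 * rng.standard_normal(80),
                          rng.standard_normal(80)])
Z_noise = rng.standard_normal((80, 2))
print(distance_ratio(Z_good, y), distance_ratio(Z_noise, y))
```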
The relationship between the distance ratio and the number of implicit features is shown in Figure 5. The dashed line shows the ratio in the full kernel space, where we consider all the features. The solid line shows the ratio in the subspace of the kernel space. From Figure 5, we can see that the first few implicit features reduce the distance ratio.
This means that the first few features contain much discriminative information for classification. However, when more implicit features are added, the distance ratio increases, which indicates that the newly added features are noisy and more or less irrelevant for our classification task. When all implicit features are considered, we return to the full kernel space, where the distance ratio is the worst. This experiment also indicates that even when the data lie in a relatively low-dimensional subspace of the kernel space, feature selection is still needed to remove the noisy features.

Figure 4: Data in the original space
Figure 5: Distance ratio against the number of selected features

4.3.3. Experiments on IDA Data

We test our feature selection algorithm on the four IDA datasets using two classifiers: SVM and KKNN. The results using SVM are shown in Figures 6, 8, 10 and 12, and the results using the KKNN classifier are shown in Figures 7, 9, 11 and 13. In these figures, the X-axis is the number of selected features and the Y-axis is the predictive error of the classifier. From the experimental results shown in Figures 6 to 13, we can see that our method achieves performance comparable, if not better, to the other state-of-the-art algorithms. Often, our method achieves the best result when only a small number of features are selected. When using SVM as the classifier, our methods reduce the error by more than 1% and 5% on the thyroid data set and the heart data set, respectively, compared to the other baselines. When using KKNN as the classifier, the improvement is over 5% on average. Considering that the computational complexity of our method is O(N^2) while those of GDA and KFDA are O(N^2 k), the performance of our method is encouraging. Another observation from the experimental results is that in most cases the best performance of GDA and KFDA is similar; the method whose first few features are better also incurs a much higher computational cost because of the SVD computation.

Figure 6: Classification result of german by SVM
Figure 7: Classification result of german by KKNN
Figure 8: Classification result of diabetis by SVM
Figure 9: Classification result of diabetis by KKNN
Figure 10: Classification result of thyroid by SVM
Figure 11: Classification result of thyroid by KKNN
Figure 12: Classification result of heart by SVM
Figure 13: Classification result of heart by KKNN
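For completeness, here is a minimal sketch of a kernel K-nearest-neighbor classifier of the kind described as the KKNN baseline in Section 4.3.1, based on the kernel-induced distance of Equation (1). This is our own rendering; the kernel, the value of K and all names are illustrative assumptions rather than the authors' setup.

```python
import numpy as np

def kknn_predict(K_train, K_cross, k_test_diag, y_train, k=5):
    """Kernel K-nearest-neighbor prediction with the kernel-induced distance
    d^2(x, x') = K(x, x) + K(x', x') - 2 K(x, x') of Equation (1).

    K_train: (n, n) training kernel matrix; K_cross: (m, n) values K(test_i, train_j);
    k_test_diag: (m,) values K(test_i, test_i); y_train: labels in {+1, -1}.
    """
    d2 = k_test_diag[:, None] + np.diag(K_train)[None, :] - 2 * K_cross
    neighbors = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[neighbors]
    return np.sign(votes.sum(axis=1) + 1e-12)      # majority vote, ties broken toward +1

def rbf(A, B, sigma=1.0):
    sq = np.sum((A[:, None] - B[None, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

# Toy usage with an RBF kernel (for which K(x, x) = 1 on the test diagonal).
rng = np.random.default_rng(5)
Xtr = rng.standard_normal((100, 2))
ytr = np.sign(Xtr[:, 0])
Xte = rng.standard_normal((20, 2))
print(kknn_predict(rbf(Xtr, Xtr), rbf(Xte, Xtr), np.ones(20), ytr, k=5))
```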

5. Conclusions and Future Work

In this paper, we have presented a novel method for feature selection in a kernel space. By utilizing the Kernel Gram-Schmidt Process, we construct an orthogonal basis set in the kernel space. We then extend margin-based feature-selection methods to the kernel space, so that we can select a subspace of the kernel space which contains the most discriminative information for classification. We have illustrated the advantages of our method over simulated and real data, comparing our methods against several state-of-the-art baselines. The experimental results show that our methods can achieve comparable or better performance than the best baseline while having a lower computational complexity. In the future, we plan to further explore the relationship between our method and SVM regularization methods. We also want to theoretically estimate the error bound when we apply feature selection in a kernel space.

Acknowledgments

We would like to thank Jianmin Wu for the valuable discussions. Dou Shen and Qiang Yang are supported by a grant from Hong Kong RGC.

References

Aha, D. W. (1990). A study of instance-based algorithms for supervised learning tasks: Mathematical, empirical, and psychological evaluations. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine.
Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12.
Baudat, G., & Anouar, F. (2003). Feature vector selection and projection using kernels. Neurocomputing, 55.
Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. ICML '98.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego, CA, USA: Academic Press Professional, Inc.
Gilad-Bachrach, R., Navot, A., & Tishby, N. (2004). Margin based feature selection - theory and algorithms. ICML '04.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3.
Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge University Press.
Jolliffe, I. T. (2002). Principal component analysis. Springer Verlag.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. ML '92.
Kononenko, I. (1994). Estimating attributes: analysis and extensions of Relief. ECML-94: Proceedings of the European Conference on Machine Learning.
Liang, Z., & Zhao, T. (2006). Feature selection for linear support vector machines. ICPR '06.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999a). Fisher discriminant analysis with kernels.
Mika, S., Schölkopf, B., Smola, A. J., Müller, K.-R., Scholz, M., & Rätsch, G. (1999b). Kernel PCA and de-noising in feature spaces. Advances in Neural Information Processing Systems 11. MIT Press.
Niijima, S., & Kuhara, S. (2006). Gene subset selection in kernel-induced feature space. Pattern Recognition Letters, 27.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: The MIT Press.
Sun, Y., & Li, J. (2006). Iterative RELIEF for feature weighting. ICML '06.
Weston, J., Elisseeff, A., Schölkopf, B., & Tipping, M. (2003). Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3.
Wu, M., Schölkopf, B., & Bakir, G. (2005). Building sparse large margin classifiers. ICML '05.
Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., Fan, W., & Ma, W.-Y. (2005). OCFS: optimal orthogonal centroid feature selection for text categorization. SIGIR '05.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. ICML '97.


More information

Discriminate Analysis

Discriminate Analysis Discriminate Analysis Outline Introduction Linear Discriminant Analysis Examples 1 Introduction What is Discriminant Analysis? Statistical technique to classify objects into mutually exclusive and exhaustive

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Kernel Combination Versus Classifier Combination

Kernel Combination Versus Classifier Combination Kernel Combination Versus Classifier Combination Wan-Jui Lee 1, Sergey Verzakov 2, and Robert P.W. Duin 2 1 EE Department, National Sun Yat-Sen University, Kaohsiung, Taiwan wrlee@water.ee.nsysu.edu.tw

More information

Hyperspectral Data Classification via Sparse Representation in Homotopy

Hyperspectral Data Classification via Sparse Representation in Homotopy Hyperspectral Data Classification via Sparse Representation in Homotopy Qazi Sami ul Haq,Lixin Shi,Linmi Tao,Shiqiang Yang Key Laboratory of Pervasive Computing, Ministry of Education Department of Computer

More information

Conflict Graphs for Parallel Stochastic Gradient Descent

Conflict Graphs for Parallel Stochastic Gradient Descent Conflict Graphs for Parallel Stochastic Gradient Descent Darshan Thaker*, Guneet Singh Dhillon* Abstract We present various methods for inducing a conflict graph in order to effectively parallelize Pegasos.

More information

Feature Ranking Using Linear SVM

Feature Ranking Using Linear SVM JMLR: Workshop and Conference Proceedings 3: 53-64 WCCI2008 workshop on causality Feature Ranking Using Linear SVM Yin-Wen Chang Chih-Jen Lin Department of Computer Science, National Taiwan University

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min

More information

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology

More information

Tensor Sparse PCA and Face Recognition: A Novel Approach

Tensor Sparse PCA and Face Recognition: A Novel Approach Tensor Sparse PCA and Face Recognition: A Novel Approach Loc Tran Laboratoire CHArt EA4004 EPHE-PSL University, France tran0398@umn.edu Linh Tran Ho Chi Minh University of Technology, Vietnam linhtran.ut@gmail.com

More information

Query By Committee Made Real

Query By Committee Made Real Query By Committee Made Real Ran Gilad-Bachrach Amir Navot Naftali Tishby School of Computer Science and Engineering Interdisciplinary Center for Neural Computation The Hebrew University, Jerusalem, Israel.

More information

Kernel Discriminant Analysis and information complexity: advanced models for micro-data mining and micro-marketing solutions

Kernel Discriminant Analysis and information complexity: advanced models for micro-data mining and micro-marketing solutions Data Mining VII: Data, Text and Web Mining and their Business Applications 115 Kernel Discriminant Analysis and information complexity: advanced models for micro-data mining and micro-marketing solutions

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions

More information