OBJECT CLASSIFICATION USING SUPPORT VECTOR MACHINES WITH KERNEL-BASED DATA PREPROCESSING


Image Processing & Communications, vol. 21, no. 3, pp. 45-54
DOI: /ipc

KRZYSZTOF ADAMIAK, PIOTR DUCH, KRZYSZTOF ŚLOT
Institute of Applied Computer Science, Lodz University of Technology

Abstract. The paper explores the possibility of improving Support Vector Machine-based classification performance by introducing an input data dimensionality reduction step. Feature extraction by means of two different kernel methods is considered: kernel Principal Component Analysis (kPCA) and Supervised kernel Principal Component Analysis. It is hypothesized that an input domain transformation aimed at emphasizing between-class differences would facilitate classification. Experiments performed on three different datasets show that one can benefit from the proposed approach, as it provides lower variability in classification performance at similar, high recognition rates.

1 Introduction

The main objective of the paper is to explore whether the introduction of data preprocessing may improve the classification performance of Support Vector Machine (SVM) classifiers [3]. SVM classification is considered a state-of-the-art method that outperforms other existing data classification approaches in several tasks. The core of the SVM concept is a search for a decision hyperplane that maximizes the between-class margin in a high-dimensional feature space, which hosts projections of the original samples. This hyperplane corresponds to an optimal nonlinear decision surface in the original problem domain. Calculations in high-dimensional spaces are made implicitly, by using kernel functions that operate on the original samples. The research summarized in this paper was aimed at checking whether appropriate data preprocessing can improve SVM-based classification accuracy.
Research on SVM classification usually does not assume any data preprocessing: support vectors are determined from raw input data. We hypothesize that an appropriate transformation of raw samples could facilitate further classification, as it can emphasize discriminative properties of class distributions and suppress irrelevant ones. We propose to perform feature extraction as an initial data preprocessing step, prior to classification, so that the SVM operates on appropriately transformed samples. For feature extraction we propose to use two nonlinear methods: kernel Principal Component Analysis (denoted henceforth as kPCA) [15] and Supervised kernel Principal Component Analysis (SkPCA) [1]. Both approaches use a concept of kernel-based processing similar to the one used by SVM. However, the criteria underlying kernel-based dimensionality reduction differ from those of SVM, so the two steps of the proposed procedure are not necessarily correlated.

To verify the proposed concept, a series of experiments involving three different, publicly available pattern recognition datasets has been performed. We have shown that preprocessing is beneficial, as it significantly reduces sensitivity to a non-optimal choice of SVM parameters. Classification rates in kernel-transformed feature spaces also remain comparable with the SVM-only approach when other classification strategies are used, which has been shown for the k-nn method.

The structure of the paper is as follows. Key concepts of the proposed classification method (support vector machines, kPCA and SkPCA) are briefly explained in Section 2. Section 3 provides details of the proposed procedure and Section 4 summarizes the experimental results.

2 Related Work

Support Vector Machine classification is a well-known concept that has been extensively presented in numerous publications [3, 5]. An impressive number of its successful applications in engineering [12], image and signal analysis [8], object detection [11] and bioinformatics [13] has also been reported. SVM derives a decision function $f(x)$ of the form:

$$f(x) = \operatorname{signum}\Big( \sum_i \alpha_i y_i K(x, x_i) + b \Big) \tag{1}$$

where $K(\cdot,\cdot)$ is a kernel function, the summation is made over support vectors $x_i$ with weights $\alpha_i$ and responses $y_i$, and $b$ is a threshold.
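The decision function (1) can be illustrated with a few lines of Python. This is only a minimal sketch: the support vectors, weights and threshold below are hand-picked, hypothetical values rather than the output of any training procedure, and the Gaussian kernel with `gamma=0.5` is an arbitrary choice made for the example.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate f(x) = signum(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    score = b
    for alpha, y, sv in zip(alphas, labels, support_vectors):
        score += alpha * y * kernel(x, sv)
    # A score of exactly zero is assigned to the positive class here.
    return 1 if score >= 0 else -1

# Hypothetical, hand-picked model (not a trained SVM).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [-1, 1]
b = 0.0

near_positive = svm_decision((1.9, 2.1), support_vectors, alphas, labels, b)
near_negative = svm_decision((0.1, -0.1), support_vectors, alphas, labels, b)
```

Only the support vectors enter the sum, which is why evaluation cost depends on their number and not on the size of the training set.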
The expression (1) is a solution to a constrained minimization problem, which, in the case of the so-called soft-margin SVM [3], can be expressed as:

$$\min J(w, b) = \frac{1}{2}\|w\|^2 + C \sum_k \xi_k \tag{2}$$

subject to:

$$y_k (w^T x_k + b) \ge 1 - \xi_k, \quad k = 1 \ldots n \tag{3}$$

where $w$ is the separating hyperplane vector, $n$ is the total number of samples, $\xi_k$ are slack variables and $C$ is a parameter that controls the balance between two objectives: margin maximization and misclassification penalty. Commonly used kernel functions include radial basis, sigmoid, polynomial and linear ones. They introduce additional parameters that, together with the parameter $C$ of equation (2), need to be carefully chosen to provide good classification performance.

Development of various kernel methods for data classification and processing gained momentum after the success of the SVM concept [2, 7, 10, 14, 17]. In particular, several kernel-based data preprocessing methods were proposed, including kernel Principal Component Analysis (kPCA) and its supervised version, SkPCA.

Kernel Principal Component Analysis, proposed in [15], extends the classical Principal Component Analysis concept and produces nonlinear directions of the maximum scatter that exists among samples. As in the case of SVM, the problem is solved in a high-dimensional space, and kernels provide a means for making the relevant computations feasible. The objective of kPCA is to find directions of maximum variability for samples $x_i$ projected to a high-dimensional space using some transformation $\Phi(\cdot)$ (i.e. $X_i = \Phi(x_i)$), that is, to find eigenvectors $V = [v_0, v_1, \ldots]$ of the projection covariance matrix:

$$(X - M)(X - M)^T V = \Lambda V \tag{4}$$

where $M$ is a matrix of mean-valued vectors $m$, computed for projections in the high-dimensional space, and $\Lambda$ is a diagonal matrix of eigenvalues. As eigenvectors lie in a subspace defined by the projected samples:

$$v_i = \sum_{j=0}^{n-1} \alpha_j^i (X_j - m) = (X - M)\,a_i,$$

premultiplying equation (4) by the term $(X - M)^T$ yields an alternative formulation of the eigenproblem:

$$(X - M)^T (X - M) A = \Lambda A \tag{5}$$

where $A = [a_0, a_1, \ldots]$ comprises vectors of coefficients that become a solution to the modified eigenproblem. Observe that only dot products are involved in the computations of the eigenproblem (5), so they can be replaced by kernels. Introducing a Gram matrix with elements $G_{i,j} = \hat{K}(x_i, x_j)$, where $\hat{K}$ is some kernel function centered in the high-dimensional space, one can rewrite (5) in a compact form:

$$G A = \Lambda A \tag{6}$$

A solution to (5), which can be computed for a reasonable number of samples, defines directions of the maximum variability in the high-dimensional space and can be used directly for projecting unknown samples:

$$(\Phi(z) - m)^T v_i = (\Phi(z) - m)^T (X - M)\,a_i = [\hat{K}(z, x_0), \ldots, \hat{K}(z, x_{n-1})]\,a_i \tag{7}$$

As can be seen, projections onto each eigenvector $v_i$ can be determined in the original, low-dimensional space, using kernel operations and the computed coefficient vectors $a_i$.

The last concept of interest is a supervised version of kPCA, SkPCA, introduced in [1]. The idea is to use the Hilbert-Schmidt Independence Criterion (HSIC) [16] as the objective function to be maximized. HSIC measures the level of cross-covariance between samples and their labels:

$$C_{x,y} = \mathrm{E}(X - m_x)(Y - m_y)^T = \mathrm{E}(XH)(YH)^T \tag{8}$$

where $X$ is a matrix of input samples with a mean vector $m_x$, $Y$ is a matrix of labels with their mean $m_y$, and $H$ is a centering matrix. HSIC uses the Hilbert-Schmidt norm, which, in essence, aggregates the squared entries of the cross-covariance (8).
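Returning briefly to kPCA, the eigenproblem (6) and the projection rule (7) can be sketched with NumPy as follows. This is a sketch under stated assumptions, not the authors' implementation: the Gaussian kernel, the helper names `rbf_gram`, `kpca_fit` and `kpca_transform`, the rescaling of the coefficient vectors so that each direction has unit norm, and the random data are all choices made for this illustration.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """Kernel matrix exp(-gamma * ||a - b||^2) for rows a of A and b of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_fit(X, gamma, n_components):
    """Solve G A = Lambda A for the centered Gram matrix G, cf. (6)."""
    n = X.shape[0]
    K = rbf_gram(X, X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    G = H @ K @ H                              # kernel centered in feature space
    eigvals, eigvecs = np.linalg.eigh(G)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, A = eigvals[order], eigvecs[:, order]
    A = A / np.sqrt(lam)                       # assumed scaling: unit-norm directions v_i
    return {"X": X, "K": K, "gamma": gamma, "A": A, "lam": lam}

def kpca_transform(model, Z):
    """Project samples Z using centered kernel values, cf. (7)."""
    K, A = model["K"], model["A"]
    Kz = rbf_gram(Z, model["X"], model["gamma"])
    # Center the test kernel with the training-set means.
    Kz_c = Kz - K.mean(axis=0)[None, :] - Kz.mean(axis=1)[:, None] + K.mean()
    return Kz_c @ A

# Illustrative data only (random, not one of the paper's datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
model = kpca_fit(X, gamma=0.5, n_components=2)
projected = kpca_transform(model, X)
```

The Gram matrix has one row and column per training sample, so the cost of the eigendecomposition grows quickly with the sample count.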
It can be shown that HSIC can be expressed as:

$$\mathrm{HSIC} = k\,\operatorname{tr}(C_{x,y} C_{x,y}^T) \tag{9}$$

where tr denotes the trace and $k$ is a scaling factor. As the criterion (9) involves dot products, one can introduce kernels on input samples, $K = [k(x_i, x_j)]$, and on labels, $L = [l(y_i, y_j)]$, and rewrite the criterion in the form:

$$\mathrm{HSIC} = k\,\operatorname{tr}(KHLH) \tag{10}$$

The objective of the SkPCA procedure is to find a transformation matrix $U$ of the original samples $x$ that maximizes the criterion (10). As is the case for linear feature extraction with PCA and its supervised versions, the kernelized supervised approach outperforms kPCA [1]. Therefore, this method became the primary focus of the presented research.

3 SVM classification with input data preprocessing

Four different data classification schemes have been considered in the reported research. The first one was simple SVM classification performed on raw input data, whereas the remaining ones included a feature extraction step, performed by either kPCA or SkPCA, followed by either SVM or k-nn classification of the projected samples. In every case, appropriate parameter selection procedures were run to find the optimal values of the classification procedure parameters. A grid-search algorithm, which iteratively narrows down the search domain around the best performing parameter set (proposed in [4]), was used for the considered kernel methods. Search parameters included the constant $C$ of the SVM objective function (2) and the parameters of the adopted kernel functions.

Four commonly used kernels that are parametrized with a single variable were used in the research. The simplest one, a linear kernel of the form:

$$k(x_i, x_j) = x_i^T x_j \tag{11}$$

was primarily used as an indicator of classification problem complexity. The second one is a polynomial kernel:

$$k(x_i, x_j) = (x_i^T x_j + 1)^d \tag{12}$$

with a parameter $d$, which represents the polynomial's degree. The third kernel was a sigmoid kernel (hyperbolic tangent):

$$k(x_i, x_j) = \tanh(\alpha\, x_i^T x_j + \beta) \tag{13}$$

with two parameters, controlling the slope ($\alpha$) and shift ($\beta$). Finally, the last kernel was Gaussian, defined as:

$$k(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right) \tag{14}$$

The last classification scenario involved a k-nn method performed on transformed samples; it was introduced to assess whether high recognition rates can also be achieved using this simple classification approach.

4 Experimental evaluation of the strategies

Three pattern recognition datasets were used for evaluation of the proposed data classification schemes. The first one was the Glass identification dataset (available at [9]), the second one was the Leaves identification set (also available at [9]) and the last one was the INRIA pedestrian detection dataset (available at [6]). Basic properties of the datasets are presented in Tab. 1 (from the Leaves dataset only classes with at least 48 examples were used). The former two datasets contain labeled feature vectors, derived for objects from multiple classes.

Tab. 1: Datasets used in experiments (columns: Name, Classes, Samples, Attributes; rows: GLASS, LEAVES, Pedestrian)
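For reference, the four kernels (11)-(14) of Section 3 can be written directly in Python. The sketch below uses only the standard library; the default parameter values (`d=2`, `alpha=0.1`, `beta=0.0`, `gamma=1.0`) are placeholders chosen for illustration and are not values reported in the paper.

```python
import math

def dot(x, z):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """Equation (11): x^T z."""
    return dot(x, z)

def polynomial_kernel(x, z, d=2):
    """Equation (12): (x^T z + 1)^d."""
    return (dot(x, z) + 1.0) ** d

def sigmoid_kernel(x, z, alpha=0.1, beta=0.0):
    """Equation (13): tanh(alpha * x^T z + beta)."""
    return math.tanh(alpha * dot(x, z) + beta)

def gaussian_kernel(x, z, gamma=1.0):
    """Equation (14): exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def gram_matrix(samples, kernel):
    """Kernel values for every pair of samples, as a list of rows."""
    return [[kernel(x, z) for z in samples] for x in samples]

# Illustrative samples only.
samples = [(1.0, 2.0), (3.0, 4.0), (0.0, 0.0)]
G_linear = gram_matrix(samples, linear_kernel)
```

Any of these functions can be passed to a routine that expects a kernel, which keeps the choice of kernel separate from the rest of the procedure.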
In the case of the INRIA pedestrian dataset, samples are images (see Fig. 1) supplemented with coordinates of bounding boxes that contain persons (if persons are present in an image). Therefore, an additional feature extraction procedure needs to be executed. First, for positive examples (i.e. those that contain people), regions of interest were extracted based on the provided bounding box coordinates. These regions were subsequently scaled to a uniform size of 128 rows by 64 columns and were used as a basis for feature vector derivation. To represent objects, histograms of gradients (HoG), which proved to be one of the best visual object descriptors, were used. HoG was derived for all non-overlapping 8x8 pixel blocks. As a result, every sample was represented by a 256-element feature vector (16 x 8 = 128 blocks x 2 components of a mean gradient within a block). Negative examples were produced by random sampling of images without persons, using the same procedure.

The INRIA pedestrian dataset comprises a large number of examples, making kernel-based preprocessing procedures computationally infeasible (matrices of sizes of tens of thousands by tens of thousands are involved). Therefore, several classification experiments on randomly selected, one-thousand-element subsets of the whole dataset,


were performed. Data classification experiments for all three scenarios were run in a five-fold cross-validation scheme. Additionally, classification experiments were repeated twenty times and their results were averaged.

The first step of the experiments was concerned with selection of the optimal parameters used in classification. Grid search was iteratively performed in parameter spaces comprising the misclassification penalty (C, see (2)) and a corresponding kernel parameter: either γ (for the RBF kernel), α (for the sigmoid kernel) or d (for polynomial kernels). Sample grid search results for SVM classification with the RBF kernel, performed on the GLASS database, are shown in Fig. 2. Consecutive iterations are repeated over a subdomain around the best performing region of the previous step (four steps are depicted).

The first objective of the experiments was to evaluate the minimum dimensionality of the derived feature spaces that is necessary for ensuring high classification rates. Results, summarized in Fig. 3, show that for SkPCA-based feature extraction, classification performance stabilizes after just a few principal components are adopted (the exact number depends on the database and varies from two, for the pedestrian dataset, to four, for the LEAVES and GLASS datasets). For kPCA-based feature extraction, no clear threshold value exists and many more components are required to provide high classification rates (from seven components for the pedestrian dataset to 48 components for the GLASS dataset). This means that if minimum distance or probabilistic approaches are to be used as a subsequent classification strategy, SkPCA is much more attractive, as it produces compact feature spaces that can prevent the curse of dimensionality problem.

Classification performance of the considered kernel-based strategies has been summarized in Fig. 4. Separate plots are provided for different datasets.
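The iterative grid search described earlier in this section (a coarse grid over C and a kernel parameter, then repeated refinement around the best point) can be sketched as below. This is an assumed, simplified variant and not the procedure of [4]: the search runs on log10 scales, each refinement keeps one grid cell on either side of the current best point, and `toy_score` is a hypothetical stand-in for a cross-validated recognition rate.

```python
import math

def refine_grid_search(score, log_c_range, log_g_range, steps=5, iterations=4):
    """Maximize score(C, gamma) on a log10 grid that shrinks around the best point."""
    (c_lo, c_hi), (g_lo, g_hi) = log_c_range, log_g_range
    best = None  # (score, log10 C, log10 gamma)
    for _ in range(iterations):
        c_step = (c_hi - c_lo) / (steps - 1)
        g_step = (g_hi - g_lo) / (steps - 1)
        for i in range(steps):
            for j in range(steps):
                log_c, log_g = c_lo + i * c_step, g_lo + j * g_step
                s = score(10.0 ** log_c, 10.0 ** log_g)
                if best is None or s > best[0]:
                    best = (s, log_c, log_g)
        # Next subdomain: one grid cell on each side of the current best point.
        c_lo, c_hi = best[1] - c_step, best[1] + c_step
        g_lo, g_hi = best[2] - g_step, best[2] + g_step
    return 10.0 ** best[1], 10.0 ** best[2], best[0]

def toy_score(C, gamma):
    """Hypothetical smooth surface peaking at C = 10^1.3, gamma = 10^-2.2."""
    return -((math.log10(C) - 1.3) ** 2 + (math.log10(gamma) + 2.2) ** 2)

best_C, best_gamma, best_score = refine_grid_search(
    toy_score, log_c_range=(-2.0, 4.0), log_g_range=(-5.0, 1.0)
)
```

In practice `score` would train and validate a classifier for each parameter pair, so the number of grid points per iteration directly sets the computational cost.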
For each case, three different procedures were executed: SVM classification of raw data and two methods involving SVM classification of preprocessed data. For the purpose of reduced feature space derivation, the RBF kernel was used both in the case of kPCA and SkPCA. SVM classification was made using four different kernel types (linear, polynomial, RBF and sigmoid). To test classification sensitivity to a non-optimal choice of parameters, a thorough parameter selection procedure, involving four iterations of the grid search procedure, was made only in the case of the RBF kernel. In the two remaining cases, for polynomial and sigmoid kernels, only coarse values were derived using a single-iteration search. For the INRIA pedestrian dataset, one hundred randomly selected subsets were drawn and processed in a five-fold classification scheme, and the results were averaged.

Fig. 1: Sample images from the annotated INRIA pedestrian database: positive examples, i.e. image regions containing humans (top), and negative examples (middle). Four regions of interest containing persons (extracted from positive examples) and background (negative examples) with superimposed gradient information (bottom)

Fig. 2: Grid search procedure for SVM classification parameter derivation (four consecutive iterations are shown from top to bottom). Misclassification penalty C and RBF kernel parameter γ form the search domain; validation performance is shown using a color map provided on the right

Fig. 3: Classification performance versus number of selected principal components for kPCA and SkPCA analysis for the GLASS database (upper plots) and for the INRIA database (lower plots). SVM with sigmoid, RBF, linear and polynomial kernels is used in classification

Fig. 4: Classification results for the considered datasets and the adopted methods: SVM-only and SVM in kPCA- and SkPCA-derived feature spaces (both kPCA and SkPCA were using the RBF kernel, whereas different kernels were tested for SVM)

Fig. 5: k-nn classification results in kPCA- and SkPCA-derived feature spaces for the GLASS dataset. SVM classification results are shown for comparison (the RBF kernel was used with 3 different gamma parameters)

As can be seen from Fig. 4, the simplest dataset is the pedestrian detection one, where the highest classification rates are obtained. Moreover, as linear SVM classification yields very good results, the categories seem to be almost linearly separable. One can also observe a drop in SVM classification performance when a sigmoid kernel with coarsely chosen parameters is used. Glass identification was the most difficult dataset. Here, the sensitivity of classification to parameter fine-tuning is clearly revealed. Performance of SVM classification on raw data drops by between 4% and 8%, whereas, if supplied with SkPCA preprocessing, it stays within a 3% range. This sensitivity can be seen even better for the LEAVES dataset. Without fine-tuning of SVM parameters, classification performance drops by 20% for the nonlinear sigmoid kernel. It can also be seen that decision surfaces for this dataset are clearly nonlinear.

The last part of the experiments was concerned with performance evaluation of k-nn classification in feature spaces

derived using kPCA and SkPCA. Results for the GLASS dataset, presented in Fig. 5, show that one can achieve classification performance comparable to SVM.

5 Conclusions

The presented paper confirms that SVM classification, although it can achieve very high rates, is quite sensitive to fine-tuning of classification parameters. This can become a problem in several real-world contexts, especially in the presence of outliers or bad examples. Also, in the case of big data analysis, SVM classifier derivation needs to be based on random subsets of reasonable size, so fine-tuning of parameters that would be suitable for the entire population becomes questionable. The paper proposes to consider data preprocessing by means of a kernel-based dimensionality reduction step prior to classification as a possible means of handling this problem. It has been shown that adopting such a step noticeably reduces recognition sensitivity to a non-optimal parameter choice, while maintaining high recognition rates. Of the two considered nonlinear feature extraction strategies, kernel Principal Component Analysis and its supervised version, the latter provides better recognition performance. Moreover, SkPCA, as opposed to kPCA, results in derivation of a low-dimensional space (at most four-dimensional for the considered datasets), which is desirable for avoiding the curse of dimensionality problem. One needs to bear in mind that PCA-based data preprocessing is computationally expensive, which may prevent applications of the proposed concept in time-critical pattern recognition tasks.

References

[1] Barshan, E., Ghodsi, A., Azimifar, Z., Jahromi, M.Z. (2011). Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7)
[2] Baudat, G., Anouar, F. (2003). Feature vector selection and projection using kernels. Neurocomputing, 55(1)
[3] Burges, C.J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2)
[4] Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine learning, 46(1-3)
[5] Cristianini, N., Shawe-Taylor, J. (2000). An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press
[6] Dalal, N., Triggs, B. (2005, June). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, CVPR IEEE Computer Society Conference on (Vol. 1). IEEE
[7] Hofmann, T., Schölkopf, B., Smola, A.J. (2008). Kernel methods in machine learning. The annals of statistics
[8] Kim, K.I., Jung, K., Park, S.H., Kim, H.J. (2002). Support vector machines for texture classification. IEEE transactions on pattern analysis and machine intelligence, 24(11)
[9] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California. School of Information and Computer Science, 213
[10] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K. R. (1999, August). Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE
[11] Murase, H., Nayar, S.K. (1995). Visual learning and recognition of 3-D objects from appearance. International journal of computer vision, 14(1), 5-24
[12] Müller, K. R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V. (1999). Using support vector machines for time series prediction. Advances in kernel methods-support vector learning
[13] Rangwala, H., Karypis, G. (2005). Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23)
[14] Schölkopf, B., Smola, A.J. (2002). Learning with Kernels. MIT Press, Cambridge, MA
[15] Schölkopf, B., Smola, A., Müller, K.R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5)
[16] Song, L., Smola, A., Gretton, A., Bedo, J., Borgwardt, K. (2012). Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May)
[17] Wang, M., Sha, F., Jordan, M. I. (2010). Unsupervised kernel dimension reduction. In Advances in Neural Information Processing Systems
