Nonparametric feature discriminant analysis for high dimension
Wissal Drira and Faouzi Ghorbel
GRIFT Research Group, CRISTAL Laboratory, National School of Computer Sciences, University of Manouba, Tunisia

Abstract - A method for the linear discrimination of nonparametric binary classification problems is presented. It searches for the discriminant direction which maximizes the generalized Patrick-Fischer distance between the projected class-conditional densities. The theoretical background is introduced, together with a new estimator of the Patrick-Fischer distance based on orthogonal functions that yields both a scalar and a multivariate extractor. The application of this method to the classification of some binary real data sets leads to better results than those based on traditional linear discriminant analysis (LDA) and on the recursive kernel estimator of the Patrick-Fischer distance.

Keywords: dimension reduction, Patrick-Fischer distance, linear discriminant analysis, feature extraction, classification

1 Introduction

In order to design a pattern recognition system properly, it is necessary to consider the problems of feature extraction and dimension reduction. The number of features needed to successfully perform a given recognition task clearly depends on the discriminatory quality of the chosen features. A suitable approach to feature extraction consists of generating a set of features which tends to maximize the separation between classes. If a suitable transformation, applied to the patterns of two or more populations, generates feature patterns that exhibit both an increase in the measured separation between populations and a reduction in dimension, these features can be interpreted as representative of the dissimilarities between the populations. It is well known that there exist two main families of criteria for discriminant analysis.
First, there are criteria based on dispersion matrices, which are expressed only in terms of moments of order at most two (LDA). They have the advantage of fast convergence rates for their estimators, but they carry only part of the statistical dispersion information [4]. The second family defines criteria from the probability density functions, whether conditional or mixture. The separation of pattern classes has been considered from the point of view of a linear transformation which maximizes a distance between two probability densities, such as the Chernoff, Kolmogorov or Patrick-Fischer distance. In spite of its theoretical interest, this approach remains of limited use in practice, because no explicit estimator of these distances is available in the nonparametric case. In [1], Patrick and Fischer proposed a nonparametric solution based on probability density functions. In the same paper they introduced a kernel estimate of the Patrick-Fischer distance, used for binary classification. It is important to note that this approach considers only the scalar extractor; the multivariate case is handled by a recursive procedure that applies the scalar extractor method repeatedly [3]. In this paper, we introduce a new estimate of the Patrick-Fischer distance based on orthogonal functions. This estimate is suited to both scalar and multivariate dimension reduction. The paper is organized as follows. Section 2 recalls some distances between conditional probability density functions suggested in the literature for the discriminant analysis of binary classification. The proposed estimates of the Patrick-Fischer distance are introduced in Section 3. Section 4 presents simulations illustrating the performance of the proposed estimates and applies the orthogonal estimator of the Patrick-Fischer distance to classification on real data sets.
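As a concrete reminder of the first, moment-based family, the classical two-class Fisher/LDA direction w proportional to S_w^{-1}(m_1 - m_2) can be sketched as follows. This is an illustrative plain-Python implementation for 2D data, not code from the paper; it uses only first- and second-order moments, which is exactly the limitation discussed above.

```python
import math

def mean(xs):
    # Component-wise sample mean of a list of 2D points.
    n = len(xs)
    return [sum(x[k] for x in xs) / n for k in range(len(xs[0]))]

def scatter(xs, m):
    # 2x2 within-class scatter: sum over samples of (x - m)(x - m)^T.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - m[0], x[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def lda_direction(class1, class2):
    # Fisher direction w = Sw^{-1} (m1 - m2), normalized to unit length.
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m1[0] - m2[0], m1[1] - m2[1]]
    w = [inv[0][0] * dm[0] + inv[0][1] * dm[1],
         inv[1][0] * dm[0] + inv[1][1] * dm[1]]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]
```

For two isotropic clusters separated along one axis, the returned direction aligns with the axis joining the class means, as expected of a purely second-order criterion.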
2 Formulation

It is well known that the most suitable criteria for discriminant analysis are defined from distances between probability density functions, whether conditional or mixture. We cite here the most important ones. The quantity

$d_C = -\log \int_{\mathbb{R}^D} \left[\pi_1 f_1(x)\right]^{s} \left[\pi_2 f_2(x)\right]^{1-s} dx, \qquad 0 < s < 1,$

is known as the Chernoff distance between the two probability densities of the observation vector conditioned on classes 1 and 2, with prior probabilities $\pi_1$ and $\pi_2$.
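To make such density distances concrete, here is a small sketch that evaluates the Chernoff distance between two univariate Gaussian class-conditional densities by direct numerical integration. The priors are omitted and s = 1/2 is used (the Bhattacharyya special case); the integration bounds and grid size are arbitrary choices of this illustration, not part of the paper.

```python
import math

def gauss(x, mu, sigma):
    # Univariate Gaussian density.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def chernoff(mu1, s1, mu2, s2, s=0.5, lo=-12.0, hi=12.0, n=6000):
    # Chernoff distance -log( integral of f1^s * f2^(1-s) ) by midpoint rule.
    # Priors omitted; s = 0.5 gives the Bhattacharyya distance.
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += (gauss(x, mu1, s1) ** s) * (gauss(x, mu2, s2) ** (1.0 - s)) * h
    return -math.log(total)
```

For equal variances the Bhattacharyya distance between N(mu1, sigma^2) and N(mu2, sigma^2) is (mu1 - mu2)^2 / (8 sigma^2), which gives a handy sanity check on the integration.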
The Kolmogorov distance has a particular conceptual significance since it admits a direct link with the Bayes probability of error:

$d_K = \int_{\mathbb{R}^D} \left| \pi_1 f_1(x) - \pi_2 f_2(x) \right| dx.$

Despite its theoretical interest, its use remains limited in practice. The Patrick-Fischer distance admits the following expression:

$d_{PF} = \left( \int_{\mathbb{R}^D} \left( \pi_1 f_1(x) - \pi_2 f_2(x) \right)^2 dx \right)^{1/2}.$

Generally, all these distances are linked to the probability of classification error through lower and upper bounds. Minimizing this error thus formulates an ideal criterion for discriminant analysis, which is difficult to apply in practice in the nonparametric case, especially in high dimensions, due to the complexity of the required algorithms. When the laws of the conditional observation vectors of the classes are known, a certain number of these distances can be estimated or approximated analytically; in the nonparametric case, however, this task is not easy. The distance best suited to such developments is the one defined by Patrick and Fischer. These distances assume a binary classification context, i.e. a problem with two classes.

3 Multivariate reduction by an estimator of the Patrick-Fischer distance

As indicated above, this class of methods, qualified as nonparametric, relies on an estimator of the Patrick-Fischer distance obtained via nonparametric estimators of the probability density functions. The method of orthogonal functions is a primary technique with at least two main advantages in the context of discriminant analysis. On one hand, its mathematical formulation allows a certain ease of analytical calculation in the multivariate problem. On the other hand, its ability to adapt to the topological nature of the supports of the densities to be estimated provides a way to avoid the Gibbs phenomenon. This last remark is crucial for extending this approach.
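The orthogonal-functions machinery can be sketched for a scalar variable as follows. This illustration assumes the Fourier basis phi_m(x) = exp(2*pi*i*m*x) on [0, 1] and omits the class priors (both simplifications of this sketch, not part of the paper's formulation): the density estimate is a truncated series with empirical coefficients, and by orthonormality (Parseval) the squared L2 distance between two estimated densities reduces to a sum over coefficient differences.

```python
import cmath

def fourier_coeffs(sample, M):
    # Empirical Fourier coefficients a_m = (1/N) * sum_i conj(phi_m(X_i))
    # for the basis phi_m(x) = exp(2*pi*i*m*x) on [0, 1].
    n = len(sample)
    return {m: sum(cmath.exp(-2j * cmath.pi * m * x) for x in sample) / n
            for m in range(-M, M + 1)}

def density_estimate(x, coeffs):
    # Truncated series estimate f_hat(x) = sum_m a_m * phi_m(x).
    return sum(a * cmath.exp(2j * cmath.pi * m * x)
               for m, a in coeffs.items()).real

def pf_squared(sample1, sample2, M=5):
    # Plug-in squared Patrick-Fischer-type distance between the two
    # class-conditional densities: by Parseval, the integral of
    # (f1_hat - f2_hat)^2 equals sum_m |a_m^(1) - a_m^(2)|^2.
    # Priors are omitted in this illustration.
    c1, c2 = fourier_coeffs(sample1, M), fourier_coeffs(sample2, M)
    return sum(abs(c1[m] - c2[m]) ** 2 for m in range(-M, M + 1))
```

The truncation parameter M plays the smoothing role discussed in the text: small M over-smooths, large M lets sampling noise through.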
In this section, we introduce a linear multivariate extractor through a new estimator of the Patrick-Fischer distance expressed in the d-dimensional reduced space. The orthogonal-function estimator of the joint probability density of a random vector V is written as

$\hat f(v) = \sum_{\|m\| \le M} \hat a_m \, \phi_m(v), \qquad \hat a_m = \frac{1}{N} \sum_{i=1}^{N} \overline{\phi_m(V_i)},$

where $\hat a_m$ is the estimator of the Fourier coefficient, $\{V_i,\ i = 1, \dots, N\}$ designates a supervised learning sample distributed according to the conditional random vector of dimension d, and M is the truncation parameter, which acts as a smoothing factor. The estimator of the Patrick-Fischer distance using orthogonal functions can be expressed as a finite sum of generalized kernels of the method introduced by Parzen, with

$K_M(y, z) = \sum_{\|m\| \le M} \phi_m(y)\, \overline{\phi_m(z)}$

the generalized kernel built from the scalar one. Using the orthogonality of the basis functions, we have

$K_M(y, z) = \int K_M(x, y)\, K_M(x, z)\, dx.$

Replacing the various quantities in the expression of the Patrick-Fischer distance by their estimators, we obtain the following generalization of the scalar estimator of the PFD:

$\hat d_{PF}^{\,2} = \frac{1}{N_1^2} \sum_{i,j} K_M(X_i, X_j) + \frac{1}{N_2^2} \sum_{i,j} K_M(Y_i, Y_j) - \frac{2}{N_1 N_2}\, \mathrm{Re} \sum_{i,j} K_M(X_i, Y_j),$

which is an unbiased estimator of the squared Patrick-Fischer distance. The estimator in the reduced space is expressed as a function of a linear transformation W in $\mathbb{R}^D$:
$\hat d_{PF}^{\,2}(W) = \frac{1}{N_1^2} \sum_{i,j} K_M(\langle W, X_i \rangle, \langle W, X_j \rangle) + \frac{1}{N_2^2} \sum_{i,j} K_M(\langle W, Y_i \rangle, \langle W, Y_j \rangle) - \frac{2}{N_1 N_2}\, \mathrm{Re} \sum_{i,j} K_M(\langle W, X_i \rangle, \langle W, Y_j \rangle),$

where $\langle W, V \rangle$ represents the scalar product of two vectors V and W of the space $\mathbb{R}^D$ and $\mathrm{Re}(z)$ is the real part of a complex number z. The criterion for multivariate dimension reduction corresponding to these estimates is

$W^{*} = \arg\max_{W} \hat d_{PF}^{\,2}(W).$

This expression does not admit an analytical solution, but a numerical optimization method can reach a maximum. Unlike the iterative algorithm presented in [3], which uses the discriminant information carried by the successive marginal conditional distributions, this estimator is global in both its definition and its optimization step. It therefore carries all the discriminant statistical information in the reduced space. In the simulation section, we show the superiority of the global algorithm over the iterative one.

4 Performance evaluation

4.1 Simulation studies

In this section, the performance of the proposed estimate of the Patrick-Fischer distance (OPF) is tested for binary classification and compared to that of LDA and of the bivariate case generalized by the recursive procedure using the scalar extractor method with a kernel estimate of the Patrick-Fischer distance (R1D-KPF) [2, 3]. The experiments (Figure 1) are synthetic, with Gaussian classes (Example 1), bimodal uniform classes (Examples 2 and 3) and mixtures of Gaussian classes (Example 4). The 2D extracted subspaces illustrated in Figure 1 (a), (b) and (c) yield quite different results for LDA, R1D-KPF and OPF, respectively. The objective of these experiments is to show that when the data do not follow a Gaussian distribution, or even when the classes are Gaussian but have similar class-conditional means or different class-conditional covariances (heteroscedastic conditions), the traditional LDA method fails to find the optimal projection subspace. In addition, the subspace extracted by R1D-KPF does not give the best projection either, in particular when the reduced dimension d > 1, because of the iterative nature of its optimization step.
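Since the criterion admits no analytical maximizer, the numerical search for the best projection can be sketched in a toy 2D-to-1D setting with a coarse grid over projection angles. All choices below are assumptions of this illustration, not the paper's procedure: a Fourier-kernel plug-in estimate with M = 3, rescaling of the projected samples into [0.2, 0.8] (so the periodic Fourier basis does not wrap the two ends of the axis onto each other), and plain grid search as the optimizer.

```python
import cmath
import math

def pf_sq(xs, ys, M=3):
    # Fourier-series plug-in estimate of the squared PF distance on [0, 1].
    def K(a, b):
        return sum(cmath.exp(2j * cmath.pi * m * (a - b)).real
                   for m in range(-M, M + 1))
    n1, n2 = len(xs), len(ys)
    return (sum(K(a, b) for a in xs for b in xs) / (n1 * n1)
            + sum(K(a, b) for a in ys for b in ys) / (n2 * n2)
            - 2.0 * sum(K(a, b) for a in xs for b in ys) / (n1 * n2))

def project(points, theta):
    # 1D projection of 2D points onto w = (cos theta, sin theta).
    return [math.cos(theta) * p[0] + math.sin(theta) * p[1] for p in points]

def rescale(z1, z2):
    # Map both projected samples into [0.2, 0.8] to avoid periodic wrap-around.
    lo, hi = min(z1 + z2), max(z1 + z2)
    span = (hi - lo) or 1.0
    f = lambda v: 0.2 + 0.6 * (v - lo) / span
    return [f(v) for v in z1], [f(v) for v in z2]

def best_direction(c1, c2, steps=90):
    # Grid search over projection angles for the criterion max_W d_PF(W).
    best_score, best_theta = -1.0, 0.0
    for k in range(steps):
        theta = math.pi * k / steps
        z1, z2 = rescale(project(c1, theta), project(c2, theta))
        score = pf_sq(z1, z2)
        if score > best_score:
            best_score, best_theta = score, theta
    return best_theta
```

For two clusters separated along the x-axis but heavily overlapping in y, the search should return an angle near 0 (or near pi, the same axis with reversed sign).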
The proposed method, on the other hand, performs well, as expected, because it takes the conditional probability density function of each class into account.

4.2 Experiments with real data sets

Classification experiments were performed using five data sets taken from the UCI Repository of machine learning databases [6], which come from a variety of applications. These data sets, labeled (a) to (e), have various numbers of attributes D and various sample sizes N for the binary classification problem (see Table 1). In order to determine all three transformations properly, problems related to near-singular covariance matrices should be avoided. Such problems can be solved by performing a PCA on the training set of each of the five data sets, where only the principal components with an eigenvalue bigger than one millionth of the total variance are kept [7]. For data set (e), the number of test instances is given in Table 1 as designated by its donors, so the transformation matrices W were estimated from the training data, which were then transformed to a subspace of appropriate dimension. For all other data sets, k-fold cross-validation (CV) was used (Table 1). We estimated the misclassification rate on test samples in order to compare the different dimension reduction methods in the d-dimensional reduced feature space. The classification error is estimated empirically with the K-Nearest Neighbors, Linear and Quadratic classifiers, chosen because they stay close to the assumption that most of the relevant information lies in the first- and second-order central moments, i.e., the means and the (co)variances [7]. The per-data-set performances of the three reduction techniques are then compared. To this end, per classifier, data set and dimension d, the mean estimated classification error over the multiple runs (N_it = 10) is determined (see Table 2). This gives a final estimate of the classification error for the respective settings.
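The error-estimation protocol can be sketched, in stripped-down form, as a k-fold cross-validated nearest-neighbour misclassification rate. This is an illustration only: the PCA preprocessing, the linear and quadratic classifiers, and the multiple runs described above are omitted.

```python
import math
import random

def knn_predict(train, labels, x, k=3):
    # Majority vote among the k training points closest to x.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def cv_error(data, labels, k_folds=5, k=3, seed=0):
    # k-fold cross-validated misclassification rate: each fold is held out
    # in turn and classified with a kNN trained on the remaining folds.
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k_folds] for i in range(k_folds)]
    errors = 0
    for fold in folds:
        train_idx = [i for i in idx if i not in fold]
        tr = [data[i] for i in train_idx]
        tl = [labels[i] for i in train_idx]
        for i in fold:
            if knn_predict(tr, tl, data[i], k) != labels[i]:
                errors += 1
    return errors / len(data)
```

In the real experiments this error would be measured in the d-dimensional reduced space, i.e. on the data after projection by the estimated transformation W.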
The overall optimal error rate over all transforms is typeset in bold and marked with a superscript *. Transforms whose classification errors are statistically indistinguishable from those of the optimal transformation are also written in bold. To compare the results, we used a signed rank test with the level of significance set to 0.01 [5]. Table 2 also gives the Mean Classification Error (MCE) obtained when no dimension reduction is performed, denoted FULL.

Table 1. Data set description. (The numeric values of D, PC, N and the number of folds did not survive extraction.)

Data set                 | Label | D | PC | N | Validation
Breast cancer            | (a)   |   |    |   | fold
Liver disorders          | (b)   |   |    |   | fold
Diabetes                 | (c)   |   |    |   | fold
Diagnostic breast cancer | (d)   |   |    |   | fold
Heart                    | (e)   |   |    |   |
Figure 1. Various 3D binary-class examples where LDA fails. The class probabilities are uniform, i.e., π1 = π2. The optimal 2D subspace according to different feature extraction methods: (a) LDA, (b) R1D-KPF, and (c) OPF.

Table 2. Observed MCE for the five data sets (a) to (e) for the reduced dimensions d = 1 and d = 2, using the three mentioned classifiers (K-Nearest Neighbors, Linear and Quadratic) and the three reduction techniques LDA, R1D-Kernel Patrick-Fischer and Orthogonal Patrick-Fischer, together with the FULL (no reduction) baseline. (The numeric entries of the table did not survive extraction.)
We start with two general observations. First, the quadratic classifier in general gives better results for most of the data sets. This may indicate that, in most data sets, there is indeed separation information present in the second-order moments of the class distributions. Second, the average error rates after reduction to d = 1 or d = 2 remain, in general, smaller than those in the full space, confirming that a gain in performance can be achieved by reducing the dimensionality of the problem. Also note that the average error rates of the PF method compare favorably to those of the other techniques for the considered subspace dimensions (d = 1, 2). This advantage seems to correlate with the difficulty of the classification problem. In particular, for the linear and quadratic classifiers, PF is uniformly (over all d) superior to the other methods. When using the nearest neighbor classifier, the proposed Patrick-Fischer criterion as well as LDA rank better than R1D-KPF. For the quadratic and linear classifiers, the optimal results were provided by R1D-KPF and OPF, with the best overall performance significantly different from the best performance of the LDA technique. Note that the performance of LDA is seriously limited by the constraint d < K (the number of classes, here equal to two).

5 Conclusion

In this paper, a new method for dimensionality reduction is proposed. Its novelty lies in the use of a new estimate of the Patrick-Fischer distance based on an orthogonal Fourier series expansion. The simulations and the real data set experiments show that the suggested method increases the separability between the classes projected onto the reduced space consistently better than the well-known LDA method and the kernel estimator of the Patrick-Fischer distance.
Since the results given by the proposed method are very promising and could serve as an efficient step before a classification process, our future work will concentrate on evaluating the effectiveness of this method by studying the classification accuracy of a Bayesian classifier in terms of probability of error.

6 References

[1] E.A. Patrick and F.P. Fisher. Nonparametric feature selection. IEEE Trans. on Inf. Theory, vol. IT-15, 1969.
[2] A. Hillion, P. Masson and C. Roux. A nonparametric approach to linear feature extraction; application to classification of binary synthetic textures. 9th ICPR.
[3] W. Drira and F. Ghorbel. Classification in face recognition by multiclass probabilistic discriminant analysis. 16th IEEE Mediterranean Electrotechnical Conference MELECON 2012, Hammamet, March 2012.
[4] W. Drira, W. Neji and F. Ghorbel. Dimension reduction by an orthogonal series estimate of the probabilistic dependence measure. International Conference on Pattern Recognition Applications and Methods ICPRAM 2012, Portugal, February 2012.
[5] F. Ghorbel, S. Derrode and O. Alata. Récentes avancées en reconnaissance de formes statistique. First edition, Arts Pi, Tunisia.
[6] R.A. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, vol. 7.
[7] P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall International, London.
[8] P.M. Murphy and D.W. Aha. UCI Repository of Machine Learning Databases.
[9] J.A. Rice. Mathematical Statistics and Data Analysis. Second ed. Belmont: Duxbury Press.
[10] M. Loog, R.P.W. Duin and R. Haeb-Umbach. Multiclass Linear Dimension Reduction by Weighted Pairwise Fisher Criteria. IEEE Transactions on PAMI, vol. 23, no. 7.
[11] Z. Nenadic. Information Discriminant Analysis: Feature Extraction with an Information-Theoretic Objective. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8.
[12] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6).
[13] K. Fukunaga and J.M. Mantock. Nonparametric data reduction. IEEE Trans. Pattern Anal. and Machine Intell., PAMI-6, 1984.
[14] M.E. Aladjem. Linear discriminant analysis for two classes via removal of classification structure. IEEE Trans. Pattern Anal. Mach. Intell., vol. 19.
[15] J.T. Tou and R.C. Gonzales. Pattern Recognition Principles. Addison-Wesley, 1974.
Linear Discriminant Analysis for 3D Face Recognition System 3.1 Introduction Face recognition and verification have been at the top of the research agenda of the computer vision community in recent times.
More informationOn Kernel Density Estimation with Univariate Application. SILOKO, Israel Uzuazor
On Kernel Density Estimation with Univariate Application BY SILOKO, Israel Uzuazor Department of Mathematics/ICT, Edo University Iyamho, Edo State, Nigeria. A Seminar Presented at Faculty of Science, Edo
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007,
More information( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components
Review Lecture 14 ! PRINCIPAL COMPONENT ANALYSIS Eigenvectors of the covariance matrix are the principal components 1. =cov X Top K principal components are the eigenvectors with K largest eigenvalues
More informationData Complexity in Pattern Recognition
Bell Laboratories Data Complexity in Pattern Recognition Tin Kam Ho With contributions from Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law, Erinija Pranckeviciene, Albert Orriols-Puig,
More informationOverview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010
INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationA review of data complexity measures and their applicability to pattern classification problems. J. M. Sotoca, J. S. Sánchez, R. A.
A review of data complexity measures and their applicability to pattern classification problems J. M. Sotoca, J. S. Sánchez, R. A. Mollineda Dept. Llenguatges i Sistemes Informàtics Universitat Jaume I
More information10-701/15-781, Fall 2006, Final
-7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly
More informationNonparametric Clustering of High Dimensional Data
Nonparametric Clustering of High Dimensional Data Peter Meer Electrical and Computer Engineering Department Rutgers University Joint work with Bogdan Georgescu and Ilan Shimshoni Robust Parameter Estimation:
More informationClustering Using Elements of Information Theory
Clustering Using Elements of Information Theory Daniel de Araújo 1,2, Adrião Dória Neto 2, Jorge Melo 2, and Allan Martins 2 1 Federal Rural University of Semi-Árido, Campus Angicos, Angicos/RN, Brasil
More informationAN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS
AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS H.S Behera Department of Computer Science and Engineering, Veer Surendra Sai University
More informationFeature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate
More informationNon-Parametric Modeling
Non-Parametric Modeling CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Non-Parametric Density Estimation Parzen Windows Kn-Nearest Neighbor
More informationFace Recognition Based On Granular Computing Approach and Hybrid Spatial Features
Face Recognition Based On Granular Computing Approach and Hybrid Spatial Features S.Sankara vadivu 1, K. Aravind Kumar 2 Final Year Student of M.E, Department of Computer Science and Engineering, Manonmaniam
More informationIMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur
IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS Kirthiga, M.E-Communication system, PREC, Thanjavur R.Kannan,Assistant professor,prec Abstract: Face Recognition is important
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14
More informationIntroduction to Mobile Robotics
Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,
More informationNEAREST-INSTANCE-CENTROID-ESTIMATION LINEAR DISCRIMINANT ANALYSIS (NICE LDA) Rishabh Singh, Kan Li (Member, IEEE) and Jose C. Principe (Fellow, IEEE)
NEAREST-INSTANCE-CENTROID-ESTIMATION LINEAR DISCRIMINANT ANALYSIS (NICE LDA) Rishabh Singh, Kan Li (Member, IEEE) and Jose C. Principe (Fellow, IEEE) University of Florida Department of Electrical and
More informationFuzzy Bidirectional Weighted Sum for Face Recognition
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 447-452 447 Fuzzy Bidirectional Weighted Sum for Face Recognition Open Access Pengli Lu
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More informationDimensionality Reduction using Hybrid Support Vector Machine and Discriminant Independent Component Analysis for Hyperspectral Image
Dimensionality Reduction using Hybrid Support Vector Machine and Discriminant Independent Component Analysis for Hyperspectral Image Murinto 1, Nur Rochmah Dyah PA 2 1,2 Department of Informatics Engineering
More informationATINER's Conference Paper Series COM
Athens Institute for Education and Research ATINER ATINER's Conference Paper Series COM2012-0049 A Multi-Level Hierarchical Biometric Fusion Model for Medical Applications Security Sorin Soviany, Senior
More informationTrade-offs in Explanatory
1 Trade-offs in Explanatory 21 st of February 2012 Model Learning Data Analysis Project Madalina Fiterau DAP Committee Artur Dubrawski Jeff Schneider Geoff Gordon 2 Outline Motivation: need for interpretable
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Category Representation 2014-2015 Jakob Verbeek, November 28, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15
More informationFeature Selection for Image Retrieval and Object Recognition
Feature Selection for Image Retrieval and Object Recognition Nuno Vasconcelos et al. Statistical Visual Computing Lab ECE, UCSD Presented by Dashan Gao Scalable Discriminant Feature Selection for Image
More informationFisher Distance Based GA Clustering Taking Into Account Overlapped Space Among Probability Density Functions of Clusters in Feature Space
Fisher Distance Based GA Clustering Taking Into Account Overlapped Space Among Probability Density Functions of Clusters in Feature Space Kohei Arai 1 Graduate School of Science and Engineering Saga University
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationBayesian Estimation for Skew Normal Distributions Using Data Augmentation
The Korean Communications in Statistics Vol. 12 No. 2, 2005 pp. 323-333 Bayesian Estimation for Skew Normal Distributions Using Data Augmentation Hea-Jung Kim 1) Abstract In this paper, we develop a MCMC
More informationPredictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA
Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems
More informationMTTTS17 Dimensionality Reduction and Visualization. Spring 2018 Jaakko Peltonen. Lecture 11: Neighbor Embedding Methods continued
MTTTS17 Dimensionality Reduction and Visualization Spring 2018 Jaakko Peltonen Lecture 11: Neighbor Embedding Methods continued This Lecture Neighbor embedding by generative modeling Some supervised neighbor
More informationLearning from High Dimensional fmri Data using Random Projections
Learning from High Dimensional fmri Data using Random Projections Author: Madhu Advani December 16, 011 Introduction The term the Curse of Dimensionality refers to the difficulty of organizing and applying
More informationWhat is machine learning?
Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship
More informationLocality Preserving Projections (LPP) Abstract
Locality Preserving Projections (LPP) Xiaofei He Partha Niyogi Computer Science Department Computer Science Department The University of Chicago The University of Chicago Chicago, IL 60615 Chicago, IL
More informationECONOMIC DESIGN OF STATISTICAL PROCESS CONTROL USING PRINCIPAL COMPONENTS ANALYSIS AND THE SIMPLICIAL DEPTH RANK CONTROL CHART
ECONOMIC DESIGN OF STATISTICAL PROCESS CONTROL USING PRINCIPAL COMPONENTS ANALYSIS AND THE SIMPLICIAL DEPTH RANK CONTROL CHART Vadhana Jayathavaj Rangsit University, Thailand vadhana.j@rsu.ac.th Adisak
More informationIMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING
SECOND EDITION IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING ith Algorithms for ENVI/IDL Morton J. Canty с*' Q\ CRC Press Taylor &. Francis Group Boca Raton London New York CRC
More informationHomework. Gaussian, Bishop 2.3 Non-parametric, Bishop 2.5 Linear regression Pod-cast lecture on-line. Next lectures:
Homework Gaussian, Bishop 2.3 Non-parametric, Bishop 2.5 Linear regression 3.0-3.2 Pod-cast lecture on-line Next lectures: I posted a rough plan. It is flexible though so please come with suggestions Bayes
More informationUsing a genetic algorithm for editing k-nearest neighbor classifiers
Using a genetic algorithm for editing k-nearest neighbor classifiers R. Gil-Pita 1 and X. Yao 23 1 Teoría de la Señal y Comunicaciones, Universidad de Alcalá, Madrid (SPAIN) 2 Computer Sciences Department,
More information