CHAPTER 5

ENHANCED FUZZY ROUGHSET BASED FEATURE SELECTION TECHNIQUE USING DIFFERENTIAL EVOLUTION

5.1 Data Reduction
    5.1.1 Dimensionality Reduction
5.2 Feature Transformation
    5.2.1 Principal Component Analysis
    5.2.2 Kernel Principal Component Analysis
5.3 Feature Selection
    5.3.1 Fuzzy Roughset Feature Selection
    5.3.2 Differential Evolution
5.4 Proposed FRFS with Differential Evolution
5.5 Experimental Results
    5.5.1 Feature Transformation
    5.5.2 Feature Selection
5.6 Chapter Summary

5.1 DATA REDUCTION

Data reduction is an important technique that plays a major role in the context of data mining. It helps to amalgamate or aggregate the required information in high dimensional datasets into useful and manageable information chunks. Data reduction techniques reduce a large dataset to a smaller one while preserving the integrity between the original and reduced datasets. In the classification process, a classifier model that learns from the reduced dataset can produce better results than a classifier model that learns from the original dataset. The main advantage of using data reduction in classification is that it makes the learning process faster and more accurate. The four data reduction strategies, graphically represented in Figure 5.1, are:

Data Compression
Numerosity Reduction
Dimensionality Reduction
Concept Hierarchy Generation

Data compression (Campos, 2000) reduces the size of an original file by assigning shorter bit codes to frequently occurring data values in the file. There are two kinds of data compression algorithms, lossy and lossless compression, and they are often used in image processing, signal processing and time series analysis. In numerosity reduction (Han and Kamber, 2006), the original dataset is replaced by an alternative, smaller data representation. These techniques are based on parametric or nonparametric models that can estimate the actual dataset; parametric techniques store only the model parameters instead of the original dataset. Sampling is a typical numerosity reduction technique used for mining patterns.

Dimensionality reduction is a data reduction technique that helps to detect and remove redundant, irrelevant and weakly relevant attributes in the dataset. As the dimension of a dataset increases, the data becomes sparser in its space. So, the main objective of dimensionality reduction is to find an optimal subset of attributes that improves the accuracy of the classification algorithm and removes the sparsity of the dataset. Some of the encoding mechanisms used to reduce the dimension of a dataset are PCA, LDA, SVD, KPCA, etc. Concept hierarchy generation (Han and Kamber, 2006) reduces the original dataset by replacing low level concepts with higher level concepts. Though a few details are lost by this generalization, it gives meaningful patterns that are easier to interpret.

Figure 5.1 Taxonomy of Data Reduction techniques

Data reduction techniques can be applied to various real time applications together with data mining techniques to predict results accurately. The datasets taken for experimentation may contain redundant, irrelevant or weakly relevant features which spoil the classification principle of support vector machines and increase the computational complexity of the classification algorithm through unnecessary calculations. This curse of dimensionality is a main impediment in data mining and machine learning algorithms. So, dimensionality reduction techniques are a prerequisite in this research work to enhance the results, provide better visualization and minimize time as well as memory. The different dimensionality reduction techniques used in this research work are discussed in the following section.

5.1.1 Dimensionality Reduction

Dimensionality reduction is used to find a feasible subset of features that is adequate to describe the actual dataset. It iteratively identifies and removes irrelevant information to produce the feasible subset of features. In the literature, different sophisticated dimensionality reduction techniques have been developed to overcome three emerging problems in the classification process: classifier complexity, model accuracy and comprehensibility of induced concepts. Dimensionality reduction techniques (Jensen, 2005) fall into three categories, given below:

Feature Transformation
Feature Selection
Feature Extraction

Feature transformation is a dimensionality reduction technique that projects the original high dimensional dataset into a lower dimensional space using algebraic functions and finds a feasible solution in continuous space. Feature selection algorithms are used to find an optimal subset of feature vectors according to a given objective function in a discrete space.

Feature selection improves the learning accuracy of the classification process by removing irrelevant features. Feature extraction (Ripley, 1996) is a powerful dimensionality reduction technique that is generally used to estimate and construct linear combinations of continuous features in the dataset which have good discriminatory power between the class labels. Feature selection and feature transformation are the two dimensionality reduction techniques required in this research work to reduce the complexity of high dimensional datasets and improve the performance of the SVM classifier.

5.2 FEATURE TRANSFORMATION

In dimensionality reduction, feature transformation maps high dimensional datasets to a lower dimensional space such that locality and geometric structure are preserved. Feature transformation techniques are divided into two categories: linear transformation and nonlinear transformation. Linear dimensionality reduction determines the structure of a given dataset and its internal relationships using Euclidean distance. Linear techniques like Principal Component Analysis (PCA), singular value decomposition and factor analysis are based on second-order statistics and use a covariance matrix for the transformation. Nonlinear dimensionality reduction recovers useful and meaningful sub-manifolds from high dimensional datasets, and helps to understand and visualize the recovered sub-manifolds of complex real time datasets. Kernel Principal Component Analysis (KPCA), diffusion maps and Sammon's mapping are some of the nonlinear dimensionality reduction techniques. They are comparatively simple and easy to code, as they involve classical matrix calculations.

Nowadays most real time problems are nonlinear with high dimensional datasets and cannot be solved by existing linear dimensionality reduction techniques, so nonlinear dimensionality reduction techniques have been introduced. At the same time, linear dimensionality reduction techniques are not obsolete, because many applications still use them to solve high dimensional problems. From the literature study, it is identified that PCA and KPCA are the well-known feature transformation methods that can be used in this research framework to improve classifier performance. In this section, PCA and KPCA are discussed.

5.2.1 Principal Component Analysis (Lei and Govindaraju, 2005)

PCA is a linear dimensionality reduction technique based on unsupervised learning. It transforms the original high dimensional dataset into a new, low dimensional space. In the new space, it maximizes the variance of the projected feature values and minimizes the reconstruction error. PCA transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. It is a statistical technique that determines the key variables in a dataset that explain the differences between the observed feature values. This technique simplifies the analysis and visualization of high dimensional datasets without loss of information. After centering the data for each feature vector, the principal components are ascertained by the eigenvalue decomposition of a correlation or covariance matrix, or by singular value decomposition of the data matrix. Here, PCA based on the covariance matrix is chosen for dimensionality reduction because the variance of the feature values is very high compared to the correlation matrix. For heterogeneous feature values, PCA based on the correlation matrix is preferred; similarly, singular value decomposition can be used for numerical feature values to improve accuracy. Eigenvectors are often used to assess how much a principal component represents the data: if the eigenvalue of a principal component is higher, then the component is more representative of the data.

The goal of principal component analysis is to compute a meaningful basis that expresses a noisy dataset in a consistent way. PCA based on covariance is explained as follows.

PCA based on Covariance

Let χ = {x_1, x_2, x_3, ..., x_N} be the training dataset, where each x_i is a training vector, N is the size of the training set and d is the dimensionality of the input vectors. Using linear PCA, the maximum dimension of the projected subspace is min(N, d). Let X = [x_1 x_2 ... x_N] denote the matrix containing the training vectors. PCA finds the eigenvectors of the covariance matrix by solving the equation

XX^T e = λe                                                        (5.1)

where e is an eigenvector and λ an eigenvalue. Using the Karhunen-Loeve method (Kirby and Sirovich, 1987), pre-multiplying both sides by X^T, equation (5.1) can be written as

Kα = λα                                                            (5.2)

where K = X^T X and α = X^T e. K is referred to as the inner product matrix of the training samples, since K_ij = (x_i · x_j). This is a standard eigenvalue problem which can be solved for α and λ. From (5.2), e = Xα (after normalization) is obtained. The projections on the first q eigenvectors (corresponding to the largest q eigenvalues) constitute the feature vector. For a test vector x, the principal component y corresponding to eigenvector e is given by

y = e^T x = (Xα)^T x = α^T X^T x = Σ_i α_i (x_i · x)               (5.3)

where x_i · x denotes the inner product of vectors x_i and x. Most modern methods for nonlinear dimensionality reduction find their theoretical and algorithmic roots in PCA, which is known for its robustness. Because PCA computes the mathematically optimal linear projection, it is sensitive to outliers in the data, which produce large reconstruction errors. Hence it is common practice to remove outliers before performing PCA.
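To make the above procedure concrete, the following is a minimal Python/NumPy sketch of covariance-based PCA in the spirit of equations (5.1)-(5.3). The function and variable names (pca_covariance, n_components) are illustrative and are not taken from the thesis implementation.

import numpy as np

def pca_covariance(X, n_components):
    """Covariance-based PCA; X has shape (N, d) with one training vector per row."""
    mean = X.mean(axis=0)
    X_centered = X - mean                          # center each feature
    cov = X_centered.T @ X_centered / len(X)       # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition, ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order], mean

def pca_transform(x, components, mean):
    """Project a vector x onto the leading principal components (cf. equation 5.3)."""
    return (x - mean) @ components

# Usage (illustrative): project a data matrix X onto its first two principal components.
# components, variances, mean = pca_covariance(X, n_components=2)
# y = pca_transform(X[0], components, mean)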

5.2.2 Kernel Principal Component Analysis (Schölkopf et al. 1999)

KPCA is a nonlinear dimensionality reduction technique that is used for nonlinear feature transformation, denoising, statistical estimation, visualization, classification and prediction in real time problems. Nonlinearity is introduced using the kernel trick, which is the core concept of SVMs. In the kernel trick, data samples are nonlinearly mapped into a higher dimensional reproducing kernel Hilbert space (RKHS), called the feature space F, by

Φ : R^d → F,  x ↦ Φ(x)                                            (5.4)

The dimension of the feature space F can be very large, even infinite. In order to avoid complex calculations in the feature space F, kernel functions are used. By defining a kernel function in the feature space F, optimization problems are transformed into dual optimization problems in R^n. Hence, the computational complexity depends largely on the number of data samples used in this method. In the case of SVM, there are many optimization techniques, such as chunking and sequential minimal optimization, which produce sparse solutions. Thus, a limited number of data samples, called support vectors, are stored for further calculations. When a new input x is evaluated, the kernel function is evaluated between x and the support vectors.

KPCA using Kernel Gram Matrix

Let x_1, ..., x_n be d-dimensional data samples. The correlation operator of the transformed samples Φ(x_1), ..., Φ(x_n) is estimated by

R_Φ = (1/n) Σ_{i=1}^{n} Φ(x_i) ⊗ Φ(x_i)                           (5.5)

where a ⊗ b denotes the operator that satisfies (a ⊗ b)c = ⟨c, b⟩ a for all c. The data are then projected onto the r-dimensional eigenspace spanned by U = [u_1 ... u_r]^T, and the projection operator is P_KPCA = U^T U, where {u_i}_{i=1}^{r} is a set of eigenvectors of R_Φ.

Since the correlation operator R_Φ : F → F is huge, a trick is used to obtain the eigenvectors u_i. Let S : R^n → F be the operator

S = [Φ(x_1) ... Φ(x_n)] = Σ_{i=1}^{n} Φ(x_i) e_i^T                 (5.6)

where {e_i}_{i=1}^{n} is the standard basis of R^n, and let A* denote the adjoint of an operator A. Since R_Φ = (1/n) SS*, the eigenvalue decomposition satisfies

P_KPCA = S ( Σ_{i=1}^{r} (1/λ_i) v_i v_i^T ) S*                    (5.7)

where ^T denotes the transpose of a vector or a matrix, and λ_i and v_i are the eigenvalues and eigenvectors of the kernel Gram matrix K_x, defined by (K_x)_ij = k(x_i, x_j). For an input vector x, its projection norm is given by

||P_KPCA Φ(x)||^2 = Σ_{i=1}^{r} (1/λ_i) (v_i^T S* Φ(x))^2          (5.8)

where S* Φ(x) = [k(x, x_1), ..., k(x, x_n)]^T ∈ R^n is called the empirical kernel map of x. Consequently, KPCA requires the eigenvalue decomposition of K_x ∈ R^{n x n} in the learning phase, and n evaluations of the kernel function in the test phase.

KPCA as a natural extension of PCA

KPCA is the solution of the following optimization problem:

min_{X : F → F} Σ_{i=1}^{n} ||Φ(x_i) − X Φ(x_i)||^2   subject to  rank(X) ≤ r,  N(X) ⊇ R(S)^⊥     (5.9)

where R(S) denotes the range (image) of the operator S and N(X) denotes the null space (kernel) of the operator X. From the above description, the null space and range constraints can be ignored in an input space R^d, since the data samples span R^d when the number of samples is adequate. But in an infinite or very high dimensional feature space F, the number of data samples is much smaller than the number of dimensions. So, using an appropriate kernel function for nonlinear dimensionality reduction, the classification performance is improved when compared to linear transformation and multivariate analysis.
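The computation in equations (5.5)-(5.8) can be sketched in a few lines of NumPy. The sketch below assumes a Gaussian kernel and, for brevity, omits the centering of the Gram matrix that many KPCA implementations apply; the function and parameter names are illustrative and not those of the thesis implementation.

import numpy as np

def gaussian_kernel(a, b, c):
    """k(a, b) = exp(-c * ||a - b||^2), the Gaussian kernel used in Section 5.5."""
    return np.exp(-c * np.sum((a - b) ** 2))

def kpca_fit(X, r, c):
    """Eigendecompose the kernel Gram matrix K_x (learning phase, cf. eq. 5.7)."""
    n = len(X)
    K = np.array([[gaussian_kernel(X[i], X[j], c) for j in range(n)] for i in range(n)])
    eigvals, eigvecs = np.linalg.eigh(K)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:r]         # keep the r largest components
    return eigvals[order], eigvecs[:, order]

def kpca_projection_norm(x, X, eigvals, eigvecs, c):
    """Squared projection norm of a new point x (test phase, cf. eq. 5.8)."""
    emp = np.array([gaussian_kernel(x, xi, c) for xi in X])   # empirical kernel map S* Phi(x)
    coords = eigvecs.T @ emp                                  # v_i^T S* Phi(x)
    return np.sum(coords ** 2 / eigvals)

# Usage (illustrative), with the kernel parameter c = 1 / (2 * sigma**2) of Section 5.5.1:
# lam, V = kpca_fit(X, r=X.shape[1], c=1.0)
# norm = kpca_projection_norm(X[0], X, lam, V, c=1.0)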

5.3 FEATURE SELECTION

Feature selection aims to find an optimal feature subset for a given problem domain such that the accuracy obtained from the original dataset is retained. Unlike other dimensionality reduction techniques, it preserves the original meaning of the features after reduction and improves the performance of a classifier. The usefulness of a selected feature is established by its redundancy and relevancy. If a selected feature of a dataset is predictive of its decision variable, then it is said to be relevant. Similarly, if selected features are highly correlated with other features, then they are called redundant features. From these criteria, it follows that a good feature subset contains feature vectors that are correlated with the decision variables and uncorrelated with each other. In dimensionality reduction, feature selection techniques are broadly classified into three types (Jensen, 2005). They are:

Filter approach
Wrapper approach
Embedded approach

Feature selection algorithms that perform the selection process separately, without the involvement of any learning algorithm, are called filter approaches. In this approach, irrelevant features are filtered out before using an induction algorithm. This technique can be applied to most real world problems since it is not tied to a particular induction process. Feature selection algorithms that are coupled with the learning algorithm to select the subset of features are called wrapper approaches. In this method, the selection process is based on the approximate accuracy obtained from the induction process.

The embedded approach is similar to the wrapper approach, but the selection process is built into the classifier model; the search takes place in the combined space of hypotheses and feature subsets. Though wrapper and embedded feature selection algorithms perform better and produce improved results, they are computationally expensive to execute and split the dataset into a large number of feature vectors. This disadvantage is due to the use of the learning algorithm in the evaluation of feature subsets. Also, when dealing with high dimensional datasets, wrapper and embedded methods encounter a major problem, namely an infeasible search for a subset of the given dataset. So, it is better to use filter based feature selection approaches to improve the accuracy and reduce the complexity of the proposed classification framework.

5.3.1 Fuzzy Rough Set based Feature Selection (Chen et al. 2012)

From previous work, rough set theory has been proved to be a successful filter based feature selection technique that performs well in data reduction and can be applied to many real time problems. The three main aspects of rough set theory are as follows:

Hidden facts in the dataset are analyzed
No additional information about the data is required
Minimal knowledge is represented

In real time applications, there are many cases where the feature values are crisp and real valued, and most traditional feature selection algorithms fail to perform well on them. To overcome this issue, the actual dataset is discretized before constructing a new dataset using crisp values. However, the degree of membership of the feature values to the discretized values is not examined, and this leads to an inadequacy. So, there is a clear need for feature selection techniques that can reduce datasets with real valued and crisp attributes.

Fuzzy set theory and the concept of fuzzification have emerged to provide an effective solution for real valued features. Fuzzy sets allow feature values to belong to more than one class label with different degrees of membership, and model the vagueness in the dataset; that is, they enable reasoning about the dataset under uncertainty. To handle both the vagueness and the indiscernibility in feature values, fuzzy and rough set theory are combined to remove uncertainty in datasets. Fuzzy rough set theory is an extended version of crisp rough set theory. It takes degrees of membership within the range [0, 1], which gives higher flexibility compared to crisp rough sets, which deal only with zero or full set membership. A fuzzy rough set is described by two fuzzy sets: the lower and the upper approximation.

FRQUICKREDUCT(C, D)
  C, the set of all conditional features; D, the set of decision features
  R ← {}; γ_best ← 0; γ_prev ← 0
  do
    T ← R
    γ_prev ← γ_best
    for all x ∈ (C − R)
      if γ_{R∪{x}}(D) > γ_T(D)
        T ← R ∪ {x}
        γ_best ← γ_T(D)
    R ← T
  until γ_best == γ_prev
  return R

Figure 5.2 Fuzzy Rough QUICKREDUCT algorithm
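A minimal Python sketch of the greedy loop in Figure 5.2 is given below. It assumes that a fuzzy rough dependency function gamma(subset, decision) is available, for example one built from a fuzzy similarity relation and the fuzzy lower approximation; the names are illustrative and are not taken from the thesis code.

def fr_quickreduct(conditional_features, gamma, decision):
    """Greedy fuzzy rough QUICKREDUCT following Figure 5.2.

    conditional_features: iterable of feature identifiers (the set C)
    gamma: callable gamma(subset, decision) -> dependency degree in [0, 1]
    decision: the decision feature D
    """
    reduct = set()
    best = prev = 0.0
    while True:
        trial = set(reduct)
        prev = best
        for x in set(conditional_features) - reduct:
            candidate = reduct | {x}
            if gamma(candidate, decision) > gamma(trial, decision):
                trial = candidate
                best = gamma(trial, decision)
        reduct = trial
        if best == prev:          # dependency no longer increases: stop
            return reduct

# Usage (illustrative): features are column indices, gamma is user supplied.
# reduct = fr_quickreduct(range(n_features), gamma=my_fuzzy_rough_dependency, decision=y)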

Fuzzy Rough Feature Selection (FRFS) can effectively reduce discrete and real valued noisy attributes without any user supplied information. In addition, this technique applies to both classification and regression problems with continuous or nominal input values. The information required to partition the fuzzy sets for the feature vectors is obtained automatically from the dataset. The search mechanism can also be replaced by other mechanisms such as swarm intelligence, ant colony optimization and others. In FRFS, FRQuickreduct is the basic algorithm developed to find a minimal subset of feature vectors; it is represented in Figure 5.2. It uses the fuzzy rough dependency function γ to select feature values and add them to the reduct candidate. If adding any feature value to the reduct candidate fails to increase the degree of dependency, then the FRQuickreduct algorithm stops at that iteration. The FRQuickreduct algorithm evaluates reduct candidates over the possible subsets of feature values, but it lacks comprehensiveness. It starts the iteration with an empty set and adds feature values one by one, subject to the constraint that the fuzzy rough dependency should increase or else reach its maximum value for the actual dataset. Thus, the dependency of each feature value is ascertained using the FRQuickreduct algorithm and a feasible candidate is chosen. However, this algorithm is not guaranteed to find a minimal subset of feature values. Sometimes the dependency function used to discriminate between reduct candidates leads to a non-minimal subset of features, so it is not feasible to predict the combination of feature values that produces an optimal reduct based on the dependency function alone. Though the obtained result is close to a minimal subset, it must still be reduced further to achieve good results. As discussed earlier, the FRQuickreduct algorithm can be improved by introducing a new search mechanism that optimizes the result, which is discussed in the next section.

5.3.2 Differential Evolution (DE)

A potential remedy for the FRQuickreduct algorithm is to combine it with an optimization technique that selects the most relevant and optimal feature subset, so that an accurate and robust model can be constructed for the classification process. From the state of the art, differential evolution (Storn and Price, 1997) has outperformed particle swarm optimization and evolutionary algorithms, and the DE algorithm is regarded as one of the best techniques for real world multiobjective optimization problems over continuous domains. This population-based stochastic technique is simple, robust and converges quickly to an optimum value. Additionally, DE has a small number of parameters, and the same parameter settings can be used across different domains. Like other optimization techniques, DE has four main steps: initialization, mutation, recombination and selection. First, the population vectors are generated randomly. For each target vector in the population, a mutant vector is produced by selecting two random vectors from the population, taking their weighted difference and adding the result to a third random vector. The mutant vector is then crossed with the original (target) vector, and the result of this operation is known as the trial vector. The corresponding position in the population is then occupied either by the trial vector or by the original target vector, depending on the fitness function, so as to achieve higher accuracy. In the differential evolution algorithm, mutation is considered the most important step; it has a simple coding mechanism and is easy to implement.
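The four DE steps can be summarized in a short NumPy sketch. This is a minimal illustration of classical DE/rand/1/bin, assuming a user supplied fitness function to be maximized; the parameter names (F, CR, pop_size) are conventional DE notation rather than settings taken from the thesis.

import numpy as np

def differential_evolution(fitness, bounds, pop_size=20, F=0.5, CR=0.9, generations=100):
    """Classical DE/rand/1/bin maximizing `fitness` over the box `bounds` (d x 2 array)."""
    rng = np.random.default_rng(0)
    d = len(bounds)
    low, high = bounds[:, 0], bounds[:, 1]
    pop = rng.uniform(low, high, size=(pop_size, d))           # initialization
    scores = np.array([fitness(p) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)]
            mutant = np.clip(a + F * (b - c), low, high)        # mutation: weighted difference
            cross = rng.random(d) < CR
            cross[rng.integers(d)] = True                       # ensure at least one crossed gene
            trial = np.where(cross, mutant, pop[i])             # recombination
            trial_score = fitness(trial)
            if trial_score > scores[i]:                         # selection (maximization)
                pop[i], scores[i] = trial, trial_score
    best = np.argmax(scores)
    return pop[best], scores[best]

# Usage (illustrative): maximize a simple function over [-5, 5]^2.
# x_best, f_best = differential_evolution(lambda x: -np.sum(x**2), np.array([[-5, 5], [-5, 5]]))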

5.4 PROPOSED FRFS with Differential Evolution

To improve the performance and overcome the disadvantages of the Fuzzy Rough Set Feature Selection technique, the FRQuickreduct algorithm is modified with the differential evolution based optimization technique, giving FRFSDE. The pseudocode for FRFSDE is given in Figure 5.3. This technique enables fast convergence towards an optimal feature subset of the original dataset.

FRFSDE QUICKREDUCT(C, D)
  C, the set of all conditional features; D, the set of decision features
  Randomly select the position and velocity of the particles
  Initialize the population
  Evaluate the objective and fitness values
  Find the optimal feature subset as the global best
  Repeat
    Create a new feature subset
    Apply the greedy selection strategy
    Evaluate the fitness and probability values
    If the feature subset dominates the feature set, then the feature subset replaces the set
    If the feature set dominates the subset, then the feature subset is discarded
    Otherwise, the feature subset is added to the population
    Determine the best feature subset
    Memorize the best optimum feature subset
  Until the stopping criterion is satisfied

Figure 5.3 Fuzzy Roughset Feature Selection with Differential Evolution

In the FRFSDE algorithm, the conditional features and the decision feature are taken as input. A candidate is generated randomly for each feature vector in the parent population, within lower and upper approximation bounds. Candidates are generated randomly by the particles in the population. Simultaneously, an empty feature subset is created, and a greedy selection strategy is used with the fuzzy rough dependency function. If the parent dominates the candidate, then the candidate is discarded; if the candidate dominates the parent, then the parent is replaced by the candidate; otherwise, the candidate is added to the population. The objective function of the candidate is evaluated using the fitness function. This is repeated until an optimal feature subset of the original dataset is derived.
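As one illustration of how DE and the fuzzy rough dependency can be coupled, the sketch below encodes a candidate as a real-valued vector that is thresholded into a feature mask and scored by the dependency function. This is only a plausible encoding, assuming the differential_evolution and gamma helpers sketched earlier; it is not the thesis implementation.

import numpy as np

def frfsde_fitness(candidate, gamma, decision, threshold=0.5):
    """Map a real-valued DE candidate to a feature subset and score it.

    A feature is selected when its component exceeds `threshold`; the score is the
    fuzzy rough dependency of the selected subset, minus a small penalty per feature
    so that smaller subsets are preferred when dependency values tie.
    """
    subset = {i for i, v in enumerate(candidate) if v > threshold}
    if not subset:
        return 0.0
    return gamma(subset, decision) - 0.001 * len(subset)

# Usage (illustrative), reusing the DE sketch from Section 5.3.2:
# bounds = np.array([[0.0, 1.0]] * n_features)
# best_vec, best_score = differential_evolution(
#     lambda v: frfsde_fitness(v, my_fuzzy_rough_dependency, y), bounds)
# selected = [i for i, v in enumerate(best_vec) if v > 0.5]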

5.5 EXPERIMENTAL RESULTS

Support vector machines do not support automated internal detection of relevant features, and hence it is necessary to apply dimensionality reduction techniques that reduce the feature set of the original dataset. In feature transformation, PCA and KPCA are used to transform the original features into new feature values. In feature selection, FRFS and the proposed FRFSDE are used to select an optimal subset of features. On the benchmark datasets, the performance of the feature transformation techniques decreased slightly due to uncertainty in the feature labels and high computational complexity. To overcome the shortcomings of the feature transformation techniques and improve the performance of the adopted classifier, feature selection techniques have been reviewed and implemented.

Table 5.1 Brief Summary of Benchmark Datasets taken for Feature Transformation

Data Sets      Size    Attribute
Iris
Liver
Heart
Wine
Abalone
Pentagon

A brief summary of the datasets used in the feature transformation experiments is given in Table 5.1. The feature selection techniques are experimented with all the benchmark and synthetic datasets detailed in Chapter 2. The performance of the feature selection and feature transformation techniques is compared using different metrics.

5.5.1 Feature Transformation

PCA for Synthetic and UCI Data Sets

In this section, Pentagon, Iris, Liver and Heart are the four datasets used to validate the methods. To evaluate the method, three visualization tools are used: scatter plots, Pareto analysis and 3D plots (Hussain et al. 2011), as depicted below. A scatter plot (Pozdnoukhov et al. 2009) is often employed to identify potential associations between two variables. The Pareto chart (Deb and Saxena, 2005; Wright and Manic 2010; Hussain et al. 2011) is basically a descending bar graph that shows the frequencies of occurrences or relative sizes of either:

the various categories of all problems encountered, in order to determine which of the existing problems occur most frequently, or
the various causes of a particular problem, in order to determine which of the causes of that problem arise most frequently.

Figures 5.4 (a, b, c), 5.5 (a, b, c), 5.6 (a, b, c) and 5.7 (a, b, c) show the principal component analysis of the Pentagon, Iris, Liver and Heart datasets with a scatter plot, Pareto chart and 3D plot. The PCA technique shows its feasibility by reducing the dimension of the dataset. Figure 5.4 (a) shows the scatter plot with principal component 1 on the X axis and principal component 2 on the Y axis, and illustrates the divergence of the data when using PCA on the Pentagon dataset. Figure 5.4 (b) illustrates the Pareto chart for the synthetic dataset with the variance and the principal components as its dimensions. Figure 5.4 (c) shows the 3D view of the principal components. Figures 5.5, 5.6 and 5.7 show the performance of PCA on the Iris, Liver and Heart datasets.
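The scatter plot and Pareto chart used in these figures can be reproduced with a short matplotlib sketch. The sketch below assumes a data matrix X and reuses the illustrative pca_covariance helper from Section 5.2.1; it is not the plotting code used for the figures.

import numpy as np
import matplotlib.pyplot as plt

def plot_pca_views(X, title="PCA"):
    """Scatter plot of the first two principal components and a Pareto chart of variance."""
    comps, var, mu = pca_covariance(X, n_components=X.shape[1])   # helper from Section 5.2.1
    scores = (X - mu) @ comps
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(scores[:, 0], scores[:, 1], s=10)
    ax1.set_xlabel("Principal component 1")
    ax1.set_ylabel("Principal component 2")
    ax1.set_title(f"{title}: scatter plot")
    ratio = var / var.sum()                                        # explained variance ratio
    ax2.bar(range(1, len(ratio) + 1), ratio)                       # descending bars (Pareto chart)
    ax2.plot(range(1, len(ratio) + 1), np.cumsum(ratio), marker="o")
    ax2.set_xlabel("Principal component")
    ax2.set_ylabel("Variance explained")
    ax2.set_title(f"{title}: Pareto chart")
    plt.tight_layout()
    plt.show()

# Usage (illustrative): plot_pca_views(X_iris, title="Iris")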

Figure 5.4 (a) Scatter plot for Pentagon dataset
Figure 5.4 (b) Pareto Chart for Pentagon dataset
Figure 5.4 (c) 3D Plot for Pentagon dataset
Figure 5.5 (a) Scatter plot for Iris dataset

Figure 5.5 (b) Pareto Chart for Iris dataset
Figure 5.5 (c) 3D Plot for Iris dataset
Figure 5.6 (a) Scatter plot for Liver dataset
Figure 5.6 (b) Pareto Chart for Liver dataset

Figure 5.6 (c) 3D Plot for Liver dataset
Figure 5.7 (a) Scatter plot for Heart dataset
Figure 5.7 (b) Pareto Chart for Heart dataset
Figure 5.7 (c) 3D Plot for Heart dataset

KPCA for Synthetic and UCI Data Sets

In this section, four datasets, i.e. Pentagon, Iris, Wine and Abalone, are used to validate the results. Figures 5.8, 5.9, 5.10 and 5.11 show KPCA for the four datasets, with the data distribution and the contour lines of different projection norms for the chosen parameter values (Huang et al. 2009).

Here, the Gaussian kernel is used, with the rank r set to the input dimension of the dataset and the kernel parameter c set to 1/(2σ²), where σ² is the variance of the elements. Since the number of samples on the left side is smaller than that on the right side, the contour lines are biased towards the right side. This indicates that KPCA outperforms PCA. From the experimental results, it is proposed that KPCA can be used on high dimensional datasets before SVM classification, since KPCA is an extension of PCA that uses a kernel function to give a better transformation for kernel based classifiers. PCA is the root of all dimensionality reduction techniques, and it can also be used for lower dimensional data before classification. Even though KPCA performs better, the removed feature labels cannot be identified in feature transformation techniques, because PCA and KPCA transform the original dataset into derived feature values at a high computational cost. This is the major drawback of feature transformation, and it can be overcome by feature selection techniques.

Figure 5.8 Kernel PCA for Pentagon dataset

Figure 5.9 Kernel PCA for Iris dataset
Figure 5.10 Kernel PCA for Wine dataset

Figure 5.11 Kernel PCA for Abalone dataset

5.5.2 Feature Selection

In this section, two feature selection techniques, FRFS and FRFSDE, are implemented for binary and multiclass datasets. Generally, the goodness of a feature selection technique is defined by information theoretic measures. To make these techniques more reliable and to compare their performance, four important information theoretic measures are considered. They are as follows:

i. Fuzzy Entropy

The fuzzy entropy of an attribute subset R is defined by the equation

E(R) = Σ_{F ∈ U/R} ( |F| / Σ_{Y ∈ U/R} |Y| ) H(F)                  (5.10)

where R is an attribute subset, F ranges over the fuzzy subsets in the partition U/R, H(F) is the fuzzy entropy of the fuzzy subset F, D is the set of classes and U is the nonempty finite set of objects.

ii. Mutual Information

The mutual information between two random variables X and Y is given by the equation

I(X; Y) = H(X) − H(X|Y)                                            (5.11)

where H(X) is the entropy and H(X|Y) is the conditional entropy.

iii. Information Gain

The information gain IG(S, A) is given by the equation

IG(S, A) = Entropy(S) − Σ_{v ∈ values(A)} ( |S_v| / |S| ) Entropy(S_v)       (5.12)

where A is an attribute, S is the set of samples and S_v is the subset of S for which A takes the value v.

iv. Conditional Entropy

The conditional entropy H(Y|X) is given by the equation

H(Y|X) = Σ_{x ∈ X} p(x) H(Y|X = x)                                 (5.13)

where X and Y are two random variables and p(x) is the probability of X = x.

Entropy is an information theoretic measure of the uncertainty of a random variable. Thus, the feature values with the lowest entropy are well suited for the classification process, because these feature values are the most informative. As such, feature values with minimum fuzzy entropy and conditional entropy are considered informative feature subsets. Here, the averages of the entropy and mutual information of the selected feature values are computed for comparison.
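A minimal Python sketch of the crisp measures in equations (5.11)-(5.13), computed from discrete feature and class arrays, is given below; the fuzzy entropy of equation (5.10) additionally requires the fuzzy partition U/R and is omitted here. The helper names are illustrative.

import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a discrete array of labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(labels, feature):
    """H(Y|X) = sum_x p(x) H(Y|X=x), cf. eq. (5.13)."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    total = 0.0
    for x in np.unique(feature):
        mask = feature == x
        total += mask.mean() * entropy(labels[mask])
    return total

def mutual_information(labels, feature):
    """Mutual information I(X;Y) = H(Y) - H(Y|X), equivalent to eq. (5.11) by symmetry."""
    return entropy(labels) - conditional_entropy(labels, feature)

def information_gain(labels, feature):
    """IG(S,A) for a single discrete attribute A, cf. eq. (5.12)."""
    return entropy(labels) - conditional_entropy(labels, feature)

# Usage (illustrative): score each selected feature column against the class labels.
# scores = [mutual_information(y, X[:, j]) for j in selected_features]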

Similarly, for information gain and mutual information, the values for the selected feature subsets should be maximal. Mutual information measures how much information these subsets contribute to making a correct classification; if the mutual information of a selected feature subset reaches its maximum level, the subset is a perfect indicator of the class labels. Information gain selects the features that have the highest discrimination and helps in finding an optimal subset of features. The performance of FRFS and FRFSDE is compared using the above information theoretic measures, and the results are depicted in Table 5.2.

Table 5.2 Performance of FRFS and FRFSDE techniques for Binary and Multiclass datasets

Datasets         FRFS                            FRFSDE
                 FME    FMMI   MIG    CENT       FME    FMMI   MIG    CENT
Iris
Liver
Heart
Diabetes
Breast Cancer
Hepatitis
Ripley
Glass
E-Coli
Wine
Balance Scale
Lenses
Pentagon

From Table 5.2, it is inferred that the proposed FRFSDE outperforms FRFS by increasing the information gain and mutual information and decreasing the fuzzy and conditional entropy. Thus the obtained feature subsets are optimal and feasible subsets that can be used further in the classification process. Table 5.3 lists the datasets, the number of instances, the number of features, and the subsets of features selected using FRFS and FRFSDE.

Table 5.3 Brief summary and size reduction of Binary and Multiclass datasets using FRFS and FRFSDE

Datasets        No. of     No. of    FRFS            FRFS          FRFSDE        FRFSDE
                Instances  Features  Order           No. of Feat.  Order         No. of Feat.
Iris                                 {3,4,1}         3             {3,4}         2
Liver                                {5,4,2,1,3}     5             {5,4,2}       3
Heart                                {5,1,4,13}      4             {1,4,5}       3
Diabetes                             {2,3,8,13}      5             {3,13,8}      3
Breast Cancer                        {1,2,5,8}       4             {2,5,8}       2
Hepatitis                            {17,14,12,16}   4             {12,14,16}    3
Ripley                               {2,1}           2             {2,1}         2
Glass                                {4,3,6,7,8}     5             {4,3,6}       3
E-Coli                               {6,1,3,2}       4             {1,2,6}       3
Wine                                 {1,2,7,13}      4             {7,13}        2
Balance Scale                        {1,2,3,4}       4             {1,2}         2
Lenses          24         4         {4,3,1}         3             {4,3}         2
Pentagon        99         2         {1,2}           2             {1,2}         2

From Table 5.3, it is deduced that an optimal subset of features is derived using the proposed FRFSDE technique when compared to the FRFS technique. Thus, it is suggested that the proposed FRFSDE technique be used in the research framework to improve the classifier performance and reduce the dimensionality.

5.6 CHAPTER SUMMARY

This chapter presented the dimensionality reduction techniques in detail and the need for them from different perspectives. In the classification process, there may be feature values that are common to more than one class and contribute little information to the results. Removing the features that carry minimal information can improve the consistency and performance of a classifier; it also avoids the curse of dimensionality and overfitting. Feature values with higher information content increase the performance of the classification process and at the same time reduce the size of the classifier model, which leads to efficient memory usage. Reducing the feature values may seem instinctively wrong, but the experiments show that the results are either retained or improved. Here, the modified Fuzzy Rough Set Feature Selection with Differential Evolution (FRFSDE) outperforms the existing fuzzy rough set based feature selection. It proves to be a feasible technique and can be used in the research framework to improve performance.


More information

Kernel Principal Component Analysis: Applications and Implementation

Kernel Principal Component Analysis: Applications and Implementation Kernel Principal Component Analysis: Applications and Daniel Olsson Royal Institute of Technology Stockholm, Sweden Examiner: Prof. Ulf Jönsson Supervisor: Prof. Pando Georgiev Master s Thesis Presentation

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

Data mining with sparse grids using simplicial basis functions

Data mining with sparse grids using simplicial basis functions Data mining with sparse grids using simplicial basis functions Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Part of the work was supported within the project 03GRM6BN

More information

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Decision Tree CE-717 : Machine Learning Sharif University of Technology Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

CS 195-5: Machine Learning Problem Set 5

CS 195-5: Machine Learning Problem Set 5 CS 195-5: Machine Learning Problem Set 5 Douglas Lanman dlanman@brown.edu 26 November 26 1 Clustering and Vector Quantization Problem 1 Part 1: In this problem we will apply Vector Quantization (VQ) to

More information

Features: representation, normalization, selection. Chapter e-9

Features: representation, normalization, selection. Chapter e-9 Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features

More information

BENCHMARKING ATTRIBUTE SELECTION TECHNIQUES FOR MICROARRAY DATA

BENCHMARKING ATTRIBUTE SELECTION TECHNIQUES FOR MICROARRAY DATA BENCHMARKING ATTRIBUTE SELECTION TECHNIQUES FOR MICROARRAY DATA S. DeepaLakshmi 1 and T. Velmurugan 2 1 Bharathiar University, Coimbatore, India 2 Department of Computer Science, D. G. Vaishnav College,

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Supervised Learning: K-Nearest Neighbors and Decision Trees

Supervised Learning: K-Nearest Neighbors and Decision Trees Supervised Learning: K-Nearest Neighbors and Decision Trees Piyush Rai CS5350/6350: Machine Learning August 25, 2011 (CS5350/6350) K-NN and DT August 25, 2011 1 / 20 Supervised Learning Given training

More information

CSE 258 Lecture 5. Web Mining and Recommender Systems. Dimensionality Reduction

CSE 258 Lecture 5. Web Mining and Recommender Systems. Dimensionality Reduction CSE 258 Lecture 5 Web Mining and Recommender Systems Dimensionality Reduction This week How can we build low dimensional representations of high dimensional data? e.g. how might we (compactly!) represent

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information

May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch

May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch 12.1, 9.1 May 8, CODY Machine Learning for finding oil,

More information

Nonlinear projections. Motivation. High-dimensional. data are. Perceptron) ) or RBFN. Multi-Layer. Example: : MLP (Multi(

Nonlinear projections. Motivation. High-dimensional. data are. Perceptron) ) or RBFN. Multi-Layer. Example: : MLP (Multi( Nonlinear projections Université catholique de Louvain (Belgium) Machine Learning Group http://www.dice.ucl ucl.ac.be/.ac.be/mlg/ 1 Motivation High-dimensional data are difficult to represent difficult

More information

A Hybrid Feature Selection Algorithm Based on Information Gain and Sequential Forward Floating Search

A Hybrid Feature Selection Algorithm Based on Information Gain and Sequential Forward Floating Search A Hybrid Feature Selection Algorithm Based on Information Gain and Sequential Forward Floating Search Jianli Ding, Liyang Fu School of Computer Science and Technology Civil Aviation University of China

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Forward Feature Selection Using Residual Mutual Information

Forward Feature Selection Using Residual Mutual Information Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 Overview The goals of analyzing cross-sectional data Standard methods used

More information

FEATURE GENERATION USING GENETIC PROGRAMMING BASED ON FISHER CRITERION

FEATURE GENERATION USING GENETIC PROGRAMMING BASED ON FISHER CRITERION FEATURE GENERATION USING GENETIC PROGRAMMING BASED ON FISHER CRITERION Hong Guo, Qing Zhang and Asoke K. Nandi Signal Processing and Communications Group, Department of Electrical Engineering and Electronics,

More information

Advanced Machine Learning Practical 1: Manifold Learning (PCA and Kernel PCA)

Advanced Machine Learning Practical 1: Manifold Learning (PCA and Kernel PCA) Advanced Machine Learning Practical : Manifold Learning (PCA and Kernel PCA) Professor: Aude Billard Assistants: Nadia Figueroa, Ilaria Lauzana and Brice Platerrier E-mails: aude.billard@epfl.ch, nadia.figueroafernandez@epfl.ch

More information

Modelling and Visualization of High Dimensional Data. Sample Examination Paper

Modelling and Visualization of High Dimensional Data. Sample Examination Paper Duration not specified UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE Modelling and Visualization of High Dimensional Data Sample Examination Paper Examination date not specified Time: Examination

More information

Deep Generative Models Variational Autoencoders

Deep Generative Models Variational Autoencoders Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Convexization in Markov Chain Monte Carlo

Convexization in Markov Chain Monte Carlo in Markov Chain Monte Carlo 1 IBM T. J. Watson Yorktown Heights, NY 2 Department of Aerospace Engineering Technion, Israel August 23, 2011 Problem Statement MCMC processes in general are governed by non

More information

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017 CPSC 340: Machine Learning and Data Mining Kernel Trick Fall 2017 Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week. Digression: the other

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

Feature Selection Using Principal Feature Analysis

Feature Selection Using Principal Feature Analysis Feature Selection Using Principal Feature Analysis Ira Cohen Qi Tian Xiang Sean Zhou Thomas S. Huang Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Urbana,

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017 CPSC 340: Machine Learning and Data Mining Multi-Dimensional Scaling Fall 2017 Assignment 4: Admin 1 late day for tonight, 2 late days for Wednesday. Assignment 5: Due Monday of next week. Final: Details

More information

CSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction

CSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction CSE 255 Lecture 5 Data Mining and Predictive Analytics Dimensionality Reduction Course outline Week 4: I ll cover homework 1, and get started on Recommender Systems Week 5: I ll cover homework 2 (at the

More information

Sparse and large-scale learning with heterogeneous data

Sparse and large-scale learning with heterogeneous data Sparse and large-scale learning with heterogeneous data February 15, 2007 Gert Lanckriet (gert@ece.ucsd.edu) IEEE-SDCIS In this talk Statistical machine learning Techniques: roots in classical statistics

More information

A Stochastic Optimization Approach for Unsupervised Kernel Regression

A Stochastic Optimization Approach for Unsupervised Kernel Regression A Stochastic Optimization Approach for Unsupervised Kernel Regression Oliver Kramer Institute of Structural Mechanics Bauhaus-University Weimar oliver.kramer@uni-weimar.de Fabian Gieseke Institute of Structural

More information