Empirical Comparison of Bagging-based Ensemble Classifiers


Ren Ye, P.N. Suganthan
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
{re0003ye,

Abstract: This paper empirically compares four bagging-based ensemble classifiers, namely the ensemble adaptive neuro-fuzzy inference system (ANFIS), the ensemble support vector machine (SVM), the ensemble extreme learning machine (ELM) and the random forest. The comparison of these four ensemble classifiers is novel because it has not been reported in the existing literature. The classifiers are evaluated on thirteen binary-class datasets, and the empirical results show that the ensemble methods employed in the four ensemble classifiers boost the testing accuracy by 1-5% on average over their base classifiers. In addition, the testing accuracy can be improved by increasing the number of base classifiers. The empirical results also show that the bagging SVM is the most favorable ensemble classifier among them.

I. INTRODUCTION

Classification is a common problem in machine learning: new observations (samples) are assigned to groups based on a training set whose identifications (labels) are provided (supervised learning) [1] or not provided (unsupervised learning, not discussed in this paper) [2]. Classification is applied in various areas such as medicine, manufacturing, bioinformatics and finance. There are several well-developed tools for classification, such as Bayesian classifiers, K-nearest neighbors, decision trees, fuzzy neural networks (FNN) and support vector machines (SVM). Designing a classifier involves two phases: (i) supervised learning and (ii) evaluation of the generalization performance. However, due to the variance in the training data, the bias of the attributes, etc., different accuracies may be obtained with different classifiers, or with the same classifier under different parameters. In other words, there is no universal classifier for all types of data distributions [1].

In order to improve the performance of classifiers, ensemble learning can be used. Ensemble learning uses multiple models to achieve a better predictive performance than could be obtained from any of the constituent models [3]. There are two categories of ensemble learning: competitive ensemble learning and cooperative ensemble learning [4]. In competitive ensemble learning, the base classifiers work individually on one instance of a problem and vote for the final decision, whereas in cooperative ensemble learning the base classifiers work associatively with each other and aggregate their decisions into the final decision. The advantage of ensemble learning is the improvement of accuracy [5]. The key determinant of better accuracy and generalization is ensemble diversity. There are mainly three strategies for achieving ensemble diversity: (i) kernel diversity, (ii) parameter diversity and (iii) data diversity [4].

Kernel diversity: A kernel function maps the data space to a non-linear space so that a linear decision boundary in that space can classify the data. Kernel functions play an important role in SVM and ELM, so varying the kernel functions across ensemble members diversifies the base classifiers.

Parameter diversity: Different parameters vary the performance of classifiers. For example, the parameters that affect the performance of a fuzzy neural network (FNN) are the membership functions, the number of nodes, the rules, the aggregation methods, etc.;
for SVM, the parameters that affect performance are the margin parameter and the parameters associated with the kernel functions.

Data diversity: For a finite dataset, different data perturbations may lead to significantly different classification configurations. Data diversity reduces the bias and variance and increases the generalization capability of the classifiers. Cross-validation (CV) is a commonly used method in machine learning to avoid overfitting [6]. Other advanced data diversity algorithms such as bootstrap aggregation (bagging) [7] and boosting [8] are widely used in ensemble methods.

An ensemble classifier can make use of one or more of the strategies mentioned above in its actual implementation. This paper focuses on data diversity, especially on bagging-based ensemble classifiers. Bagging is a method that generates multiple datasets for a classifier and uses the classification results on each dataset to obtain an aggregated classifier. Bagging-based ensemble learning can therefore be categorized as a competitive learning and data diversity strategy.

In the literature, a few bagging classifiers have been reported. The original bagging algorithm was applied to classification trees [7]; the bagged classification tree improved the performance of a single classification tree significantly in terms of error reduction on eight datasets. In [9], the bagging algorithm was applied to a Bayesian predictor, and in [10] to a tree-based ranking predictor.

The empirical results showed that, with bagging, both predictors outperformed the original predictors on most of the datasets.

An improved bagging-based classification tree, called the random forest, is reported in [2]. Random forest introduces an additional mode of randomness to bagging, thereby increasing the data diversity. In addition to bootstrapping the training data, which is the standard procedure in bagging, random forest changes the construction of the classification trees: the usual way of constructing a classification tree is to split each node at the best split among all attributes, whereas random forest splits each node using the best split among a randomly chosen subset of attributes [11]. The authors claimed that random forest yields good accuracy and robustness.

Bagging can also be applied to neural networks and FNNs [12], [13]. In [13], an ensemble fuzzy classifier was created with bagging and adaptive neuro-fuzzy inference systems (ANFIS). Twenty datasets were evaluated and the bagging ANFIS outperformed the original ANFIS on most of the datasets in terms of accuracy.

The support vector machine (SVM) is a machine learning tool for classification as well as regression [14]. The advantage of SVM is that it can operate in a high-dimensional linear attribute space while yielding a non-linear classifier in the input space, by appropriately choosing a kernel function. The kernel function enables SVM to perform well on high-dimensional datasets. However, choosing a kernel function and its parameters usually requires an exhaustive search [5], and improperly chosen kernel functions and parameters reduce the accuracy of the SVM. Two variations of bagging-based ensemble SVMs were reported in [15], [16]. A low-bias bagged SVM was reported in [16]: the authors argued that the bagging algorithm in the learning phase should aim to minimize bias, even at the cost of increased variance, and therefore modified bagging into the lobag algorithm, which first estimates the SVM parameters that minimize the bias and then, based on the chosen parameters, creates bags for the SVM classifiers. In [15], a double SVM bagging was presented. The authors modified the conventional bagging SVM into a two-phase classification: the first phase is an SVM classification and the second phase is a decision tree classification. Unlike conventional bagging SVM classifiers, the reported double SVM bagging algorithm used the out-of-bag samples for training and combined the bagged samples into a probability matrix. In the testing phase, a decision tree was employed and the final decision was voted by the confidence levels. Both ensemble SVMs were reported to improve performance. An ensemble SVM based on cooperative ensemble learning was presented in [1]. The data diversity was realized by iteratively learning from the training data and adding the samples misclassified in the previous iteration as support vectors in the next iteration. The accuracy was determined by aggregating the decisions of all iterations. However, the authors noticed that for certain datasets there was no convergence but oscillation, and this oscillation problem remains unsolved.

The extreme learning machine (ELM) is an effective machine learning algorithm based on the single-hidden-layer feedforward neural network (SLFN) architecture [17]. It is a fast algorithm and can provide good generalization. Several ensemble ELMs have been developed in the literature.
A parameter-diversity-based ensemble ELM was reported in [18]. The ELM classifiers were created by k-fold CV: for each fold there were a few training iterations in which the parameters w and b of the output-neuron feature space H were tuned for better performance, and the best-performing parameters were saved for each ELM classifier. The testing results were decided by a majority vote. Another parameter-diversity-based ensemble ELM was presented in [19], where the ensembles were created by assigning different initial weight values to different ELM classifiers, each then trained with ELM, and the testing result was aggregated by majority voting. Both ensemble ELMs [18], [19] outperformed the original ELM in classification. A bagging-based ensemble ELM regressor was developed in [20]: the bagging algorithm bootstrapped the training dataset into several sub-training datasets, an ELM regressor was applied to each sub-training dataset, and the error rates were calculated and stored as voting weights; the testing results were decided by a weighted-vote combination method. This paper uses a similar algorithm, but with an ELM classifier, for an empirical study on binary classification problems.

The above-mentioned classifiers belong to the FNN, kernel machine and decision tree families. This paper studies four typical classifiers from these categories: ANFIS (FNN), SVM (kernel machine), ELM (FNN and kernel machine) and random forest (decision trees). The objective of this paper is to evaluate four bagging-based ensemble classifiers: the bagging ANFIS, the bagging SVM, the bagging ELM and the random forest. The performance of these four bagging ensemble methods is compared. The evaluation is based on thirteen UCI binary-class datasets. To the authors' knowledge, the comparison of these four ensemble classifiers has not been studied in the existing literature.

The remainder of the paper is organized as follows: Section II elaborates the details of the bagging algorithm and the four classifiers; Section III details the ensemble SVM, the ensemble ANFIS, the ensemble ELM and the random forest; Section IV discusses and compares the empirical results; and finally Section V concludes the paper.

II. CLASSIFIERS

A. Preliminary Definition

Let $D = \{x_i\}_{i=1}^{n}$ be a dataset with $n$ samples, where each sample has $d$ attributes. Let $T$ be a training set and $S$ a testing set, where $D = T + S$. There is an array of identification labels $y \in \{-1, 1\}^{n}$ mapped to $D$, and therefore $y = y_T + y_S$, where $y_T$ and $y_S$ correspond to the labels of $T$ and $S$ respectively. The general idea of a classifier is to find $\hat{y}_S$ based on a function $\hat{y}_S = t(S, y_S)$, where the function $t(\cdot)$ is determined by a learning function $l(T, y_T)$.
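As a minimal illustration of the $D = T + S$ partition defined above, the following sketch performs a stratified 70:30 split, the same ratio used in the experiments of Section III. scikit-learn is assumed, and all variable names and the synthetic data are illustrative only, not part of the original study.

```python
# A minimal sketch of partitioning D into a training set T and a testing set S.
# The 70:30 ratio matches the experimental setup in Section III; scikit-learn
# and the synthetic data are assumptions made for illustration.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))               # D: n = 100 samples, d = 4 attributes
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels y in {-1, +1}

# Stratified 70:30 split into training set T and testing set S
X_T, X_S, y_T, y_S = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
```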

Fig. 1: A Typical ANFIS Architecture

B. ANFIS

The Adaptive Neuro-Fuzzy Inference System (ANFIS) [21] is an FNN structure based on the TSK fuzzy inference system model [22]. It processes the data in a fuzzy space, and its adaptive and learning abilities are realized by the ANN structure. The architecture of a typical ANFIS is shown in Fig. 1: there are six layers (a six-layer feed-forward neural network). The example ANFIS architecture has two inputs $x_1, x_2$ and one output $y$. Each input is fuzzified by two fuzzy sets $A_1, A_2, B_1, B_2$, corresponding to four membership functions.

Layer 1 is the input layer. No specific function is applied:
$$O_i^1 = x_i \quad (1)$$

Layer 2 is the fuzzification layer. Inputs from Layer 1 are converted into fuzzy numbers by the membership functions $\mu(\cdot)$, which are usually chosen to be non-linear bell-shaped functions:
$$O_i^2 = \mu_{A_i}(x_i) \quad (2)$$

Layer 3 is the rule layer. Each fuzzified input undergoes an IF-THEN rule and the antecedents are combined by multiplication (T-norm):
$$O_i^3 = \mu_{A_i}(x_1)\,\mu_{B_i}(x_2) \quad (3)$$

Layer 4 is the normalization layer. The outputs of the rule layer (the firing strengths, i.e., the weights) are normalized by their sum (S-norm):
$$O_i^4 = \frac{O_i^3}{\sum_j O_j^3} \quad (4)$$

Layer 5 is the defuzzification layer. It forms a series of linear equations based on the inputs and the outputs of the normalization layer:
$$O_i^5 = O_i^4 f_i = O_i^4 (p_i x_1 + q_i x_2 + r_i) \quad (5)$$

Layer 6 is the output layer. The linear equations are summed to form a single linear equation, whose output is the output of the ANFIS:
$$O^6 = \sum_i O_i^5 \quad (6)$$

The learning algorithm of ANFIS comprises two parts: the back-propagation gradient descent method commonly used in ANNs, and the least-squares estimator commonly used in linear statistics. Gradient descent is mainly used for updating the antecedent parameters, and the least-squares estimator is mainly used for updating the consequent parameters. Users can choose different combinations of learning algorithms, such as gradient descent only, or gradient descent plus one pass of the least-squares estimator, based on the design objectives (speed vs. accuracy). However, ANFIS has a fixed architecture for a particular dataset: the learning algorithm can only change the parameters of the membership functions and the coefficients of the output linear equations, but not the fuzzy rules. This limitation reduces the adaptability of ANFIS for complex datasets with a large number of attributes and samples. In addition, the output of ANFIS is a floating-point number, so it is rounded to the nearest value of $y$ for the final decision.
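To make the layer equations (1)-(6) concrete, the following is a minimal NumPy sketch of a single ANFIS forward pass for the two-input example architecture. It is only an illustrative sketch, not the authors' implementation (the paper uses the Matlab Fuzzy Logic Toolbox); the bell-shaped membership parameters, the consequent coefficients and the four-rule grid partition are all assumptions made for this example.

```python
import numpy as np

def bell_mf(x, a, b, c):
    # Generalized bell membership function, a common choice for mu(.) in Eq. (2)
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

def anfis_forward(x1, x2, mf_params, consequents):
    """One forward pass through the six ANFIS layers of Eqs. (1)-(6).

    mf_params:   bell parameters (a, b, c) for fuzzy sets A1, A2, B1, B2
    consequents: array of shape (n_rules, 3) holding (p_i, q_i, r_i) per rule
    """
    # Layer 1: inputs pass through unchanged, Eq. (1)
    # Layer 2: fuzzification, Eq. (2)
    mu_A = [bell_mf(x1, *mf_params[k]) for k in ("A1", "A2")]
    mu_B = [bell_mf(x2, *mf_params[k]) for k in ("B1", "B2")]
    # Layer 3: rule firing strengths via the product T-norm, Eq. (3)
    # (here all four combinations of A_i and B_j form the rule base)
    w = np.array([mu_A[i] * mu_B[j] for i in range(2) for j in range(2)])
    # Layer 4: normalize the firing strengths, Eq. (4)
    w_bar = w / w.sum()
    # Layer 5: TSK consequents f_i = p_i*x1 + q_i*x2 + r_i, Eq. (5)
    f = consequents @ np.array([x1, x2, 1.0])
    # Layer 6: weighted sum of the rule outputs, Eq. (6)
    return float(np.sum(w_bar * f))

# Illustrative, untuned parameters; a real ANFIS learns these by gradient
# descent (antecedents) and least squares (consequents).
mf_params = {"A1": (1.0, 2.0, -1.0), "A2": (1.0, 2.0, 1.0),
             "B1": (1.5, 2.0, -1.0), "B2": (1.5, 2.0, 1.0)}
consequents = np.array([[0.5, -0.2, 0.1],
                        [0.3,  0.4, -0.1],
                        [-0.2, 0.6, 0.0],
                        [0.1,  0.1, 0.2]])
print(anfis_forward(0.4, -0.7, mf_params, consequents))
```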
C. SVM

Fig. 2: A Linearly Separable SVM Case

$d$-dimensional samples can be mapped into a $d$-dimensional hyperspace. The basic idea of SVM is to draw the best hyperplane in this hyperspace to separate the samples according to their corresponding labels $y$. The hyperplane is optimal when the margin of separation between samples with different labels is maximized, and the samples sitting on the margins are called support vectors [14]. Suppose the samples are linearly separable; then a hyperplane in the hyperspace can be defined as
$$w x + b = 0 \quad (7)$$
where $w$ is a vector perpendicular to the hyperplane and $b$ is a bias.

For a linearly separable case, the two classes are separated by $w x + b = 0$ (the solid line in Fig. 2), and there is also a region bounded by $w x + b = 1$ and $w x + b = -1$ (the two dashed lines in Fig. 2) between the two classes, so that a margin is created. The distance between the two margins is $\frac{2}{\|w\|}$, so the best hyperplane is obtained by solving
$$\min \; \frac{1}{2} w^T w \quad (8)$$
subject to $y_i (w x_i + b) \geq 1$.

For linearly non-separable samples, certain samples will be misclassified. They are represented by non-negative slack variables $\xi_i$ such that
$$y_i (w x_i + b) \geq 1 - \xi_i. \quad (9)$$
The optimization problem therefore becomes
$$\min \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad (10)$$
subject to $y_i (w \phi(x_i) + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, where $C$ is the penalty factor that weights the misclassification and $\phi(\cdot)$ is a nonlinear function mapping the input to a higher-dimensional space. If $C$ is large, the solution converges towards that of the optimal separating hyperplane, with little tolerance of misclassification; if $C$ is small, the solution converges to one where margin maximization dominates [23]. The optimization problem (10) can be represented more formally in its Lagrangian dual form:
$$\max_{\alpha} \; L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \quad (11)$$
subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, where the term $\phi(x_i)^T \phi(x_j)$ can be replaced by a kernel function $K(x_i, x_j)$. The advantage of the kernel function is that it saves the computational effort of computing the higher-dimensional mapping of $x_i$ and $x_j$ separately: the inner product of $x_i$ and $x_j$ is computed first and then mapped to the specific higher-dimensional space. There are several popular kernel functions, such as the polynomial, radial basis function (RBF) and sigmoid kernels; the user can choose a proper kernel with proper kernel parameters for a specific dataset.
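As an illustration of the soft-margin, kernelized formulation of Eqs. (10)-(11), the short sketch below trains an RBF-kernel SVM. The paper itself uses LIBSVM [27]; scikit-learn's SVC (which wraps LIBSVM) is assumed here as an equivalent stand-in, and the synthetic dataset and the values of C and gamma are arbitrary, not the tuned values of Table II.

```python
# Hedged sketch: an RBF-kernel soft-margin SVM as in Eqs. (10)-(11).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=2.0, gamma=0.25)   # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("number of support vectors:", clf.n_support_.sum())
```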

D. ELM

ELM has a neural-network-like structure, but it is advantageous over a conventional neural network because the hidden layer need not be neuron-like and need not be tuned [17]. The output function of an ELM binary classifier is a function of the hidden-layer feature map $h(\cdot)$ and the output weights $\beta$:
$$f_L(x) = \mathrm{sign}(h(x)\beta) \quad (12)$$
This function is equivalent to mapping a $d$-dimensional input to an $L$-dimensional hidden-layer feature space $h$ and then scaling by the output weights of the $L$ hidden neurons. The output weights $\beta$ are determined by the hidden-layer output matrix $H = [h(x_1); \ldots; h(x_{n_T})]$ and the predefined labels $y_T$:
$$\beta = \begin{cases} (H^T H)^{-1} H^T y_T, & \text{if } H^T H \text{ is nonsingular,} \\ H^T (H H^T)^{-1} y_T, & \text{if } H H^T \text{ is nonsingular.} \end{cases} \quad (13)$$
To minimize both the training error $\|H\beta - y_T\|$ and the norm of the output weights $\|\beta\|$, the Lagrange multiplier method is used, which finally yields the binary ELM classifier
$$f(x) = \mathrm{sign}\!\left( h(x) H^T \left( \tfrac{I}{C} + H H^T \right)^{-1} y_T \right) \quad (14)$$
ELM can easily be adapted to multi-class classification and regression. The advantage of ELM is its fast training speed, owing to its simple neural-network-like structure.
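The following is a minimal NumPy sketch of Eqs. (12)-(14): a random sigmoid hidden layer whose output weights are solved in closed form with the regularization term I/C. It is an assumed illustration, not the reference implementation from [17], and the values of L, C and the synthetic data are arbitrary.

```python
import numpy as np

def elm_train(X_T, y_T, L=100, C=1.0, seed=0):
    """Train a binary ELM: random sigmoid hidden layer, then solve the
    regularized least-squares problem behind Eq. (14)."""
    rng = np.random.default_rng(seed)
    d = X_T.shape[1]
    W = rng.uniform(-1, 1, size=(d, L))        # random input weights (not tuned)
    b = rng.uniform(-1, 1, size=L)             # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X_T @ W + b)))   # hidden-layer output matrix H
    # beta = H^T (I/C + H H^T)^{-1} y_T, cf. Eq. (14)
    n = H.shape[0]
    beta = H.T @ np.linalg.solve(np.eye(n) / C + H @ H.T, y_T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.sign(H @ beta)                   # Eq. (12): f(x) = sign(h(x) beta)

# Tiny usage example with synthetic labels in {-1, +1}
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] - X[:, 2] > 0, 1.0, -1.0)
W, b, beta = elm_train(X[:150], y[:150], L=50, C=10.0)
acc = np.mean(elm_predict(X[150:], W, b, beta) == y[150:])
print("test accuracy:", acc)
```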
E. Bagging

Usually $T$ is small and has a large variance. The bootstrap phase in bagging samples data with replacement from $T$ according to the uniform probability distribution and creates $B$ independent datasets (bags) $T^{(1)}, \ldots, T^{(B)}$. The purpose of making $B$ bags is to reduce the variance of the original dataset $T$. The aggregation phase in bagging combines the decisions drawn from each bag by voting or other equivalent methods to reduce bias [7], [16]. Therefore, bagging can reduce variance and at the same time reduce bias. The schematic diagram of bagging is shown in Fig. 3.

Fig. 3: A Schematic Diagram of Bagging (the training data is sampled with replacement into bags $1 \ldots B$; each bag trains a classifier; the classifiers' predictions on the testing data are combined by a majority vote into the aggregated prediction)

The reasons why bagging outperforms a single classifier are listed in [24]. There are three reasons: statistical, computational and expressive. The statistical reason is that learning with bagging aggregates different possible hypotheses to approximate the best hypothesis, which is advantageous over committing to a single hypothesis. The computational reason is that the learning algorithm in bagging has less chance of being trapped in a local minimum than a single classifier. The expressive reason is that the bagging classifier expands the search space to a richer hypothesis space, which improves the expressiveness over a single classifier.

F. Random Forest

Random forest is developed from decision trees. The basic idea of a decision tree is to classify a sample by recursive binary partition: at each node of the tree, a test is applied to the input sample and a decision (the branch to follow next) is made by evaluating the outcome of the test. The procedure terminates at a leaf node (no further branches), where the final classification is made. Decision trees are fast, easy to understand and insensitive to missing data [25]. Random forest improves on decision trees by bagging and the random subspace method: a random forest consists of a set of decision tree classifiers where each classifier is trained on independent, identically distributed random vectors and the final decision is aggregated by a majority vote [2]. The advantage of using bagging for creating the decision trees is to increase the diversity of the base classifiers, and at each node the random forest employs the random subspace method for node splitting, which adds one more degree of randomness to the base classifiers. Therefore, the author claimed that random forest is a promising ensemble method [2].
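For reference, the sketch below fits a random forest with scikit-learn. The paper itself uses the randomForest R package and its Matlab port [11], [28], so this is only an assumed equivalent: n_estimators plays the role of the number of bagged trees and max_features="sqrt" the random subspace of attributes considered at each split. Parameter values are arbitrary.

```python
# Hedged stand-in for the random forest of Section II.F (the paper uses the
# randomForest R package / its Matlab port); scikit-learn is used here only
# for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=80,       # number of bagged trees (the "bags" B)
    max_features="sqrt",   # random subspace of attributes tried at each split
    random_state=0,
)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```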

III. ENSEMBLE CLASSIFIERS

This section presents the detailed algorithms of the bagging ANFIS, the bagging SVM, the bagging ELM and the random forest. The generalized algorithm of the bagging-based ensemble classifier is shown in Fig. 4. There are four nested loops in the algorithm: the outer loop repeats the classification process to obtain an average testing result; the first inner loop trains a classification model for each bag; and the second and third inner loops find the best parameters for the single classifier based on k-fold CV. The testing phase includes a majority vote to obtain the final decision. For the bagging ANFIS, the best-performing ANFIS model is obtained by k-fold CV only, because ANFIS learns and updates its parameters by back-propagation and the least-squares estimator. For the bagging SVM, the penalty parameter $C$ and the kernel parameter $\gamma$ (for the RBF kernel) are tuned by a grid search. For the bagging ELM (sigmoid activation), the parameter $C$ and the number of hidden-layer nodes $L$ are tuned by a grid search as well. In order not to overfit each base classifier, the parameters are searched once, before bootstrapping; if the grid search were carried out for each base classifier after bootstrapping, the base classifiers would very likely be overfitted to the training data. The random forest is pre-compiled from an R package [11]. Several parameters of that program can be tuned, but the only parameter tuned in the experiments is the number of bags, so that the comparison against the other three ensemble classifiers is fair.

The number of stratified testing runs $S$ is set to 50 in the empirical analysis; the number of bags $B$ is set to 20, 50 and 80 for three different experiments; and the ratio of training data to testing data is $n_T : n_S = 70\% : 30\%$. The number of training epochs of ANFIS is set to 10. The grid search of the bagging SVM has two dimensions: $C \in \{2^i, i = -5, -4, \ldots, 5\}$ and $\gamma \in \{2^i, i = -4, -3, \ldots, 4\}$. The grid search of the bagging ELM also has two dimensions: $C \in \{2^i, i = -5, -4, \ldots, 5\}$ and $L \in \{10, 20, \ldots, 1000\}$.

Inputs: D: dataset; y: labels; B: number of bags; S: number of stratified testing runs;
        C, σ, L: parameters of the classifier
FOR st = 1 TO S DO {
    Bootstrapping phase:
        Randomly PARTITION D, y into T, y_T and S, y_S, where n_T : n_S = 70% : 30%;
        BOOTSTRAP T, y_T into T^(i), y_T^(i), i = 1...B;
    Training phase:
        FOR each parameter setting DO {
            FOR each fold of the k-fold CV on T DO {
                M = Train(T, y_T, parameters); } }
        SAVE the best-performing parameters P_best;
        FOR i = 1 TO B DO {
            M^(i) = Train(T^(i), y_T^(i), P_best); }
    Testing phase:
        FOR i = 1 TO B DO {
            ŷ_S^(i) = Test(M^(i), S); }
    Aggregation phase:
        Majority vote over ŷ_S^(i), i = 1...B, to obtain ŷ_S;
        A^(st) = Accuracy(ŷ_S, y_S); }
RETURN the mean accuracy A and its standard deviation σ_A;

Fig. 4: The Bagging-based Ensemble Classification Algorithm
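The following Python sketch mirrors the structure of Fig. 4 for a single stratified run, using a scikit-learn SVC as the base classifier. It is an assumed illustration rather than the authors' Matlab/LIBSVM implementation; the parameter grid matches the ranges stated above, the synthetic dataset is a stand-in, and the labels here are {0, 1} rather than {-1, 1}.

```python
# Illustrative single-run implementation of the Fig. 4 algorithm with an SVM
# base classifier (scikit-learn is assumed; the paper's own code is Matlab/LIBSVM).
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=15, random_state=0)

# Partition D into T (70%) and S (30%), stratified on the labels
X_T, X_S, y_T, y_S = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Parameter search once, before bootstrapping (k-fold CV on the whole training set)
param_grid = {"C": [2.0 ** i for i in range(-5, 6)],
              "gamma": [2.0 ** i for i in range(-4, 5)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_T, y_T)
best = search.best_estimator_

# Bootstrapping and training phases: B bags, one base classifier per bag
B = 20
models = []
for i in range(B):
    idx = rng.integers(0, len(X_T), size=len(X_T))   # sample with replacement
    models.append(clone(best).fit(X_T[idx], y_T[idx]))

# Testing and aggregation phases: majority vote over the B predictions
votes = np.stack([m.predict(X_S) for m in models])   # shape (B, n_S)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)      # majority vote for {0, 1} labels
print("aggregated test accuracy:", np.mean(y_hat == y_S))
```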
IV. EMPIRICAL ANALYSIS

The bagging ANFIS, the bagging SVM, the bagging ELM and the random forest were evaluated using the thirteen UCI binary classification datasets [26] listed in Table I.

The ANFIS algorithm is the standard release in the Matlab Fuzzy Logic Toolbox based on [21]; the SVM algorithm is built from the libsvm package [27]; the ELM algorithm is obtained from [17]; and the random forest algorithm is obtained from [28], which is converted from the R package reported in [11]. In order to speed up the computation of ANFIS, three attributes were chosen for each dataset for the ANFIS and bagging ANFIS evaluations. The attributes were chosen based on one-pass single-attribute ANFIS classification, ranked by ascending classification error.

TABLE I: Description of Datasets
Dataset (No. of Samples, No. of Attributes): australian (credit), breast (wisconsin), bright, credit, diabetes, dimdata, heart, hepatitis, ionosphere, liver, mammo(graphic), mushroom, pima

The tuned parameters for SVM and ELM, averaged over 5-fold CV, are shown in Table II. The tuned parameters of SVM did not vary much across the folds of CV, whereas the tuned parameters of ELM, especially L, had a large standard deviation across the folds.

TABLE II: Parameters of SVM and ELM (μ[σ] over 5-fold CV)
Dataset      SVM log2 C   SVM log2 γ   ELM log2 C   ELM L
australian   0[0]         -2[0]        -1.8[3.56]   364[376.94]
breast       0.2[0.45]    -2[0]        3.8[0.84]    650[409.39]
bright       3.8[1.79]    -2[0]        2.6[2.79]    880[188.28]
credit       0.2[0.45]    -2[0]        0[3.08]      424[310.45]
diabetes     0[0]         -1.8[0.45]   1.4[2.88]    348[341.20]
dimdata      1[0]         -2[0]        0[1.87]      730[236.43]
heart        1.2[2.17]    -2[0]        -2.8[3.49]   586[386.43]
hepatitis    5[0]         -0.4[2.19]   1.8[3.35]    294[191]
ionosphere   4[2.24]      -2[0]        1.6[1.82]    590[389.55]
liver        1.2[0.84]    -1.2[0.84]   0.8[5.31]    500[445.93]
mammo        2.2[1.3]     -2[0]        4.4[1.34]    412[470.61]
mushroom     5[0]         2[0]         -2[2.92]     514[439.29]
pima         0.2[0.45]    -2[0]        -0.6[2.97]   258[394.23]
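One plausible way to obtain mean[std] entries like those in Table II is to tune (C, gamma) separately on each of 5 folds and report the spread across the folds; the sketch below does this for the SVM over the paper's stated grid (log2 C in [-5, 5], log2 gamma in [-4, 4]). scikit-learn stands in for the Matlab/LIBSVM tooling, and the bundled breast-cancer dataset stands in for the UCI data, so this is an assumed reading of the procedure rather than the authors' script.

```python
# Hedged sketch of a per-fold grid search whose spread resembles Table II.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

param_grid = {"C": [2.0 ** i for i in range(-5, 6)],      # log2 C in [-5, 5]
              "gamma": [2.0 ** i for i in range(-4, 5)]}  # log2 gamma in [-4, 4]

log2_C, log2_gamma = [], []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
    search.fit(X[train_idx], y[train_idx])
    log2_C.append(np.log2(search.best_params_["C"]))
    log2_gamma.append(np.log2(search.best_params_["gamma"]))

print("log2 C     : %.1f[%.2f]" % (np.mean(log2_C), np.std(log2_C, ddof=1)))
print("log2 gamma : %.1f[%.2f]" % (np.mean(log2_gamma), np.std(log2_gamma, ddof=1)))
```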
The base classifier accuracies (averaged over the 80 base classifiers) before the majority vote are shown in Table III. Compared with the aggregated accuracies shown in Table VI, the average base classifier accuracy is generally 1-5% lower. For example, the improvements in the average testing accuracy for ionosphere from before to after aggregation are 8.03%, 2.54%, 2.26% and 7.80% for the bagging ANFIS, bagging SVM, bagging ELM and random forest, respectively. In terms of diversity, the standard deviations of the base classifier accuracies are generally greater than unity; for example, the standard deviations of the base classifiers' testing accuracies on ionosphere are 1.94, 2.27, 3.35 and 3.15 for the bagging ANFIS, bagging SVM, bagging ELM and random forest, respectively. Together, these two tables show that bagging indeed increases accuracy by aggregating the base classifiers' decisions.

TABLE III: Base Classifier Accuracy (%, μ[σ]) of Four Ensemble Classifiers with 80 Bags
Dataset      Bagging ANFIS   Bagging SVM    Bagging ELM    Random Forest
australian   83.29[0.49]     84.36[1.67]    85.20[2.41]    79.96[4.43]
breast       92.77[2.00]     95.87[1.45]    95.01[1.26]    93.76[1.23]
bright       97.10[0.38]     99.03[0.31]    99.06[0.38]    97.59[0.77]
credit       85.12[1.69]     82.96[2.32]    81.55[2.53]    79.44[3.68]
diabetes     73.86[1.22]     74.67[2.30]    72.98[3.40]    67.51[3.35]
dimdata      88.34[0.48]     95.53[0.51]    95.60[0.54]    92.01[0.88]
heart        80.68[2.04]     77.27[5.23]    80.86[5.14]    76.63[4.37]
hepatitis    80.20[4.45]     79.21[1.85]    78.09[4.70]    76.36[5.98]
ionosphere   81.02[1.94]     92.15[2.27]    85.76[3.35]    85.17[3.15]
liver        59.92[3.55]     68.17[4.11]    67.21[4.19]    61.72[5.44]
mammo        81.40[1.61]     82.20[1.42]    81.03[2.09]    78.30[2.00]
mushroom     74.69[0.75]     88.90[0.75]    88.99[0.76]    83.04[3.11]
pima         75.14[1.34]     74.79[2.21]    75.93[2.58]    68.36[2.75]

Tables IV, V and VI show the testing accuracies of the four ensemble classifiers with 20, 50 and 80 bags, respectively. The testing accuracies are averaged over 50 stratified testing runs. Generally, the testing accuracies of the four bagging-based ensemble classifiers increase as the number of bags increases, because the data diversity is related to the number of base classifiers [24]. For the same number of bags, the bagging SVM is the most favorable of the four ensemble classifiers: with 20 bags it has the best predicted accuracy on 7 of the 13 datasets, with 50 bags on 6 of 13, and with 80 bags on 8 of 13. The second most favorable ensemble classifier is the random forest: with 20 bags it has the best predicted accuracy on 3 of the 13 datasets, with 50 bags on 5 of 13, and with 80 bags on 4 of 13. The four ensemble classifiers are also ranked by their testing accuracies on each dataset for 20, 50 and 80 bags, and the overall rank is calculated by summing the individual ranks. The overall rank of the four ensemble classifiers across the different numbers of bags is: bagging SVM < random forest < bagging ELM < bagging ANFIS.
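As a concrete check of the ranking procedure just described, the following sketch ranks the four ensembles on each dataset of Table IV (20 bags) and sums the per-dataset ranks, reproducing the rank totals reported there. SciPy is assumed; the accuracy matrix is copied from Table IV.

```python
# Rank-sum computation for the 20-bag results (Table IV): rank 1 is the best
# mean testing accuracy on a dataset, and the overall rank is the column sum.
import numpy as np
from scipy.stats import rankdata

# Columns: bagging ANFIS, bagging SVM, bagging ELM, random forest (mean accuracy, %)
acc_20_bags = np.array([
    [86.73, 85.57, 85.19, 86.66],   # australian
    [96.07, 96.40, 96.25, 96.51],   # breast
    [98.95, 99.32, 99.20, 99.12],   # bright
    [84.54, 84.77, 85.14, 86.52],   # credit
    [76.22, 76.61, 75.53, 75.75],   # diabetes
    [94.92, 96.23, 95.71, 95.44],   # dimdata
    [78.36, 81.89, 79.89, 81.07],   # heart
    [76.00, 82.30, 80.04, 82.00],   # hepatitis
    [88.13, 94.57, 88.48, 93.30],   # ionosphere
    [65.96, 69.34, 67.61, 69.90],   # liver
    [83.19, 82.53, 81.75, 81.88],   # mammo
    [78.66, 89.10, 89.12, 88.45],   # mushroom
    [75.02, 76.06, 75.56, 75.43],   # pima
])

ranks = np.vstack([rankdata(-row, method="min") for row in acc_20_bags])
print("rank sums (ANFIS, SVM, ELM, RF):", ranks.sum(axis=0))   # expected: 44 21 36 29
```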

TABLE IV: Testing Accuracy (%, μ[σ](rank)) of Four Ensemble Classifiers with 20 Bags
Dataset      Bagging ANFIS      Bagging SVM        Bagging ELM        Random Forest
australian   86.73[2.23] (1)    85.57[1.99] (3)    85.19[2.33] (4)    86.66[1.82] (2)
breast       96.07[1.05] (4)    96.40[1.09] (2)    96.25[1.24] (3)    96.51[1.12] (1)
bright       98.95[0.29] (4)    99.32[0.29] (1)    99.20[0.29] (2)    99.12[0.33] (3)
credit       84.54[2.18] (4)    84.77[2.16] (3)    85.14[2.32] (2)    86.52[2.07] (1)
diabetes     76.22[1.90] (2)    76.61[2.27] (1)    75.53[2.57] (4)    75.75[2.14] (3)
dimdata      94.92[0.49] (4)    96.23[0.51] (1)    95.71[0.57] (2)    95.44[0.56] (3)
heart        78.36[3.74] (4)    81.89[2.61] (1)    79.89[2.88] (3)    81.07[2.89] (2)
hepatitis    76.00[4.21] (4)    82.30[4.05] (1)    80.04[4.82] (3)    82.00[4.14] (2)
ionosphere   88.13[2.91] (4)    94.57[1.77] (1)    88.48[3.04] (3)    93.30[1.84] (2)
liver        65.96[3.55] (4)    69.34[4.22] (2)    67.61[4.19] (3)    69.90[3.64] (1)
mammo        83.19[1.89] (1)    82.53[2.05] (2)    81.75[1.80] (4)    81.88[1.86] (3)
mushroom     78.66[1.12] (4)    89.10[0.50] (2)    89.12[0.52] (1)    88.45[0.77] (3)
pima         75.02[2.54] (4)    76.06[2.07] (1)    75.56[2.85] (2)    75.43[2.41] (3)
Rank (sum)   (44)               (21)               (36)               (29)

TABLE V: Testing Accuracy (%, μ[σ](rank)) of Four Ensemble Classifiers with 50 Bags
Dataset      Bagging ANFIS      Bagging SVM        Bagging ELM        Random Forest
australian   87.10[1.78] (2)    86.25[1.91] (3)    86.19[2.07] (4)    87.52[2.05] (1)
breast       95.83[1.23] (4)    96.55[1.05] (3)    96.62[1.07] (2)    96.73[1.10] (1)
bright       98.83[0.32] (4)    99.35[0.23] (1)    99.24[0.26] (2)    99.15[0.28] (3)
credit       84.45[2.35] (4)    85.50[2.14] (3)    85.74[2.17] (2)    86.91[1.85] (1)
diabetes     76.83[2.33] (2)    77.30[2.42] (1)    76.33[3.07] (4)    76.52[2.31] (3)
dimdata      94.99[0.49] (4)    96.31[0.35] (1)    95.96[0.42] (2)    95.77[0.48] (3)
heart        77.77[3.56] (4)    81.02[3.65] (1)    79.84[4.19] (3)    80.52[3.97] (2)
hepatitis    75.26[4.79] (4)    82.22[3.31] (2)    81.87[4.23] (3)    82.70[3.83] (1)
ionosphere   88.25[2.42] (4)    94.82[1.92] (1)    88.29[2.36] (3)    93.35[2.01] (2)
liver        67.22[3.65] (4)    70.02[4.22] (2)    69.38[4.55] (3)    70.87[3.83] (1)
mammo        83.53[1.78] (1)    82.61[1.85] (2)    81.84[2.08] (4)    81.85[1.75] (3)
mushroom     79.00[0.82] (4)    89.21[0.59] (2)    89.27[0.55] (1)    88.91[0.70] (3)
pima         75.37[2.38] (4)    76.38[2.47] (1)    75.97[2.89] (2)    75.73[2.61] (3)
Rank (sum)   (45)               (23)               (35)               (27)

TABLE VI: Testing Accuracy (%, μ[σ](rank)) of Four Ensemble Classifiers with 80 Bags
Dataset      Bagging ANFIS      Bagging SVM        Bagging ELM        Random Forest
australian   86.33[2.10] (2)    85.52[2.04] (4)    85.52[1.92] (3)    86.78[2.06] (1)
breast       95.91[1.01] (4)    96.56[1.12] (3)    96.58[1.10] (2)    96.77[0.95] (1)
bright       98.86[0.37] (4)    99.38[0.24] (1)    99.27[0.23] (2)    99.17[0.32] (3)
credit       84.35[1.88] (4)    85.08[1.81] (3)    85.27[1.98] (2)    86.94[1.64] (1)
diabetes     76.92[2.41] (3)    77.36[1.92] (1)    76.65[2.07] (4)    76.98[1.64] (2)
dimdata      94.84[0.54] (4)    96.29[0.48] (1)    96.00[0.52] (2)    95.72[0.49] (3)
heart        78.82[3.64] (4)    81.57[3.13] (1)    80.32[3.64] (3)    80.89[3.05] (2)
hepatitis    75.13[4.08] (4)    83.61[3.17] (1)    81.30[3.83] (3)    82.48[3.61] (2)
ionosphere   89.05[2.53] (3)    94.69[1.89] (1)    88.02[2.89] (4)    92.97[2.06] (2)
liver        66.91[4.13] (4)    69.96[4.20] (3)    70.95[4.24] (2)    71.83[4.43] (1)
mammo        83.48[2.08] (1)    82.59[2.06] (2)    82.03[2.17] (3)    81.90[1.84] (4)
mushroom     79.04[0.68] (4)    89.27[0.52] (1)    89.25[0.57] (2)    88.90[0.57] (3)
pima         75.86[2.14] (4)    77.03[1.92] (1)    76.04[2.41] (3)    76.10[1.97] (2)
Rank (sum)   (45)               (23)               (35)               (27)

V. CONCLUSION

This paper has discussed four bagging-based ensemble classifiers, namely the ensemble ANFIS, the ensemble SVM, the ensemble ELM and the random forest. The comparison of these four ensemble classifiers is novel because it has not been reported in the existing literature. The ensemble classifiers have been evaluated on thirteen UCI binary-class datasets with different numbers of bags (20, 50 and 80). The empirical results have shown that all four ensemble classifiers return high testing accuracy. The effect of data diversity is reflected in the comparison between the base classifier accuracies and the ensemble classifier accuracies. In addition, the effect of data diversity is also reflected in the number of bags, because the results show that the testing accuracies increase as the number of base classifiers increases.
Among the four ensemble classifiers, the ensemble SVM has been identified as the most favorable ensemble classifier and the random forest as the second most favorable one; the bagging ELM and the bagging ANFIS rank third and fourth, respectively.

REFERENCES

[1] M. Bhardwaj, T. Gupta, T. Grover, and V. Bhatnagar, "An efficient classifier ensemble using SVM," in International Conference on Methods and Models in Computer Science, Dec. 2009.
[2] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[3] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study," Journal of Artificial Intelligence Research, vol. 11, pp. 169-198, 1999.
[4] L. Yu, S. Wang, and K. K. Lai, "Investigation of diversity strategies in SVM ensemble learning," in International Conference on Natural Computation, vol. 7, Oct. 2008.
[5] R. Lei, X. Kong, and X. Wang, "An ensemble SVM using entropy-based attribute selection," in Chinese Control and Decision Conference, May 2010.
[6] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in International Joint Conference on Artificial Intelligence, vol. 2, 1995.
[7] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[8] Y. Freund and R. E. Schapire, "A short introduction to boosting," Journal of Japanese Society for Artificial Intelligence, vol. 14, no. 5, pp. 771-780, Sept. 1999.
[9] S. Kurogi and K. Harashima, "Improving generalization performance of bagging ensemble via Bayesian approach," in IEEE International Symposium on Computational Intelligence in Robotics and Automation, Mar. 2009.

[10] S. Clémençon, M. Depecker, and N. Vayatis, "Bagging ranking trees," in International Conference on Machine Learning and Applications, Dec. 2009.
[11] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2/3, pp. 18-22, Dec. 2002.
[12] R. Chen and J. Yu, "An improved bagging neural network ensemble algorithm and its application," in International Conference on Natural Computation, Aug. 2007.
[13] J. Canul-Reich, L. Shoemaker, and L. O. Hall, "Ensembles of fuzzy classifiers," in IEEE International Fuzzy Systems Conference, July 2007.
[14] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[15] F. M. Zaman and H. Hirose, "Double SVMbagging: A new double bagging with support vector machine," Engineering Letters, vol. 17, no. 2, 2009.
[16] G. Valentini and T. G. Dietterich, "Low bias bagged support vector machines," in International Conference on Machine Learning, 2003.
[17] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multi-class classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. PP, no. 99, pp. 1-17, Oct. 2011.
[18] N. Liu and H. Wang, "Ensemble based extreme learning machine," IEEE Signal Processing Letters, vol. 17, no. 8, pp. 754-757, Aug. 2010.
[19] Y. Liu, X. Xu, and C. Wang, "Simple ensemble of extreme learning machine," in International Congress on Image and Signal Processing, Oct. 2009.
[20] H. Tian and B. Meng, "A new modeling method based on bagging ELM for day-ahead electricity price prediction," in IEEE International Conference on Bio-Inspired Computing: Theories and Applications, Sept. 2010.
[21] J.-S. R. Jang, "ANFIS: Adaptive-network-based fuzzy inference system," IEEE Transactions on Systems, Man and Cybernetics, vol. 23, no. 3, pp. 665-685, May 1993.
[22] M. Sugeno and G. Kang, "Structure identification of fuzzy model," Fuzzy Sets and Systems, vol. 28, pp. 15-33, 1988.
[23] J. Suykens and C. Alzate, "Support vector machines and kernel methods: new approaches in unsupervised learning," Tutorial, WCCI 2010, July 2010.
[24] T. G. Dietterich, "Machine-learning research: Four current directions," The AI Magazine, vol. 18, no. 4, pp. 97-136, 1997.
[25] W.-Y. Loh, "Classification and regression trees," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 14-23, Jan. 2011.
[26] UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml
[27] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 1-27, 2011. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[28] A. Jaiantilal, "Classification and regression by randomForest-Matlab," 2009. [Online]. Available: randomforest-matlab


More information

Univariate and Multivariate Decision Trees

Univariate and Multivariate Decision Trees Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015!1 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function

More information

Support Vector Machines + Classification for IR

Support Vector Machines + Classification for IR Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines

More information

NEAREST-INSTANCE-CENTROID-ESTIMATION LINEAR DISCRIMINANT ANALYSIS (NICE LDA) Rishabh Singh, Kan Li (Member, IEEE) and Jose C. Principe (Fellow, IEEE)

NEAREST-INSTANCE-CENTROID-ESTIMATION LINEAR DISCRIMINANT ANALYSIS (NICE LDA) Rishabh Singh, Kan Li (Member, IEEE) and Jose C. Principe (Fellow, IEEE) NEAREST-INSTANCE-CENTROID-ESTIMATION LINEAR DISCRIMINANT ANALYSIS (NICE LDA) Rishabh Singh, Kan Li (Member, IEEE) and Jose C. Principe (Fellow, IEEE) University of Florida Department of Electrical and

More information

Combine the PA Algorithm with a Proximal Classifier

Combine the PA Algorithm with a Proximal Classifier Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information

Bayesian model ensembling using meta-trained recurrent neural networks

Bayesian model ensembling using meta-trained recurrent neural networks Bayesian model ensembling using meta-trained recurrent neural networks Luca Ambrogioni l.ambrogioni@donders.ru.nl Umut Güçlü u.guclu@donders.ru.nl Yağmur Güçlütürk y.gucluturk@donders.ru.nl Julia Berezutskaya

More information

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

Statistical foundations of machine learning

Statistical foundations of machine learning Statistical foundations of machine learning INFO-F-422 Gianluca Bontempi Machine Learning Group Computer Science Department mlg.ulb.ac.be Some algorithms for nonlinear modeling Feedforward neural network

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines CS 536: Machine Learning Littman (Wu, TA) Administration Slides borrowed from Martin Law (from the web). 1 Outline History of support vector machines (SVM) Two classes,

More information

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 Overview The goals of analyzing cross-sectional data Standard methods used

More information

Data-driven Kernels for Support Vector Machines

Data-driven Kernels for Support Vector Machines Data-driven Kernels for Support Vector Machines by Xin Yao A research paper presented to the University of Waterloo in partial fulfillment of the requirement for the degree of Master of Mathematics in

More information

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION Sandeep Kaur 1, Dr. Sheetal Kalra 2 1,2 Computer Science Department, Guru Nanak Dev University RC, Jalandhar(India) ABSTRACT

More information

Probabilistic Approaches

Probabilistic Approaches Probabilistic Approaches Chirayu Wongchokprasitti, PhD University of Pittsburgh Center for Causal Discovery Department of Biomedical Informatics chw20@pitt.edu http://www.pitt.edu/~chw20 Overview Independence

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Applied Statistics for Neuroscientists Part IIa: Machine Learning

Applied Statistics for Neuroscientists Part IIa: Machine Learning Applied Statistics for Neuroscientists Part IIa: Machine Learning Dr. Seyed-Ahmad Ahmadi 04.04.2017 16.11.2017 Outline Machine Learning Difference between statistics and machine learning Modeling the problem

More information

An Ensemble of Classifiers using Dynamic Method on Ambiguous Data

An Ensemble of Classifiers using Dynamic Method on Ambiguous Data An Ensemble of Classifiers using Dynamic Method on Ambiguous Data Dnyaneshwar Kudande D.Y. Patil College of Engineering, Pune, Maharashtra, India Abstract- The aim of proposed work is to analyze the Instance

More information

Image Compression: An Artificial Neural Network Approach

Image Compression: An Artificial Neural Network Approach Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and

More information

Combined Weak Classifiers

Combined Weak Classifiers Combined Weak Classifiers Chuanyi Ji and Sheng Ma Department of Electrical, Computer and System Engineering Rensselaer Polytechnic Institute, Troy, NY 12180 chuanyi@ecse.rpi.edu, shengm@ecse.rpi.edu Abstract

More information

Transactions on Information and Communications Technologies vol 16, 1996 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 16, 1996 WIT Press,  ISSN Comparative study of fuzzy logic and neural network methods in modeling of simulated steady-state data M. Järvensivu and V. Kanninen Laboratory of Process Control, Department of Chemical Engineering, Helsinki

More information

Rule Based Learning Systems from SVM and RBFNN

Rule Based Learning Systems from SVM and RBFNN Rule Based Learning Systems from SVM and RBFNN Haydemar Núñez 1, Cecilio Angulo 2 and Andreu Català 2 1 Laboratorio de Inteligencia Artificial, Universidad Central de Venezuela. Caracas, Venezuela hnunez@strix.ciens.ucv.ve

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 12 Combining

More information

Lecture 20: Bagging, Random Forests, Boosting

Lecture 20: Bagging, Random Forests, Boosting Lecture 20: Bagging, Random Forests, Boosting Reading: Chapter 8 STATS 202: Data mining and analysis November 13, 2017 1 / 17 Classification and Regression trees, in a nut shell Grow the tree by recursively

More information

CS294-1 Final Project. Algorithms Comparison

CS294-1 Final Project. Algorithms Comparison CS294-1 Final Project Algorithms Comparison Deep Learning Neural Network AdaBoost Random Forest Prepared By: Shuang Bi (24094630) Wenchang Zhang (24094623) 2013-05-15 1 INTRODUCTION In this project, we

More information

Bioinformatics - Lecture 07

Bioinformatics - Lecture 07 Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles

More information

Rough Set Approach to Unsupervised Neural Network based Pattern Classifier

Rough Set Approach to Unsupervised Neural Network based Pattern Classifier Rough Set Approach to Unsupervised Neural based Pattern Classifier Ashwin Kothari, Member IAENG, Avinash Keskar, Shreesha Srinath, and Rakesh Chalsani Abstract Early Convergence, input feature space with

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information