COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES


SANTHOSH PATHICAL, GURSEL SERPEN
Electrical Engineering and Computer Science Department, University of Toledo, Toledo, OH, 43606, USA
E-MAIL: santhosh.pathical@utoledo.edu, gursel.serpen@utoledo.edu

Abstract:
This paper presents an empirical comparison of three subsampling techniques for random subspace ensemble classifiers. A version of the random subspace ensemble designed to address the challenges of high dimensional classification, called the random subsample ensemble, was evaluated within the voting combiner framework under three different sampling methods: random sampling without replacement, random sampling with replacement, and random partitioning. The random subsample ensemble was instantiated with three different base learners, namely C4.5, k-nearest neighbor, and naïve Bayes, and tested on five high-dimensional benchmark data sets in machine learning. The simulation results helped ascertain the optimal sampling technique for the ensemble, which turned out to be sampling without replacement.

Keywords:
Random subsampling; curse of dimensionality; ensemble classification; random subspace

1. Introduction

The need for classification in high dimensional feature spaces arises in contemporary fields such as biomedicine, finance, satellite imagery, customer relationship management, and network intrusion detection [1]-[4]. Classifying high dimensional datasets is a challenging task for several reasons. The machine learning algorithms used for classification have computational complexity that is exponential in the number of dimensions [2]. The number of labeled training samples required for supervised classification also increases exponentially with the dimensionality [5]. Sparsity of data points in higher dimensions makes the learning task very difficult, if not impossible.
High dimensionality also causes scalability problems for machine learning algorithms [3]. This inability to scale manifests as substantial computational cost, which translates into indefinitely long training periods or, in the worst case, the non-learnability of the classification task associated with the dataset. This adverse effect is known as the curse of dimensionality [1], [6].

To address the challenges associated with algorithm scalability, data sparsity, and information loss due to the curse of dimensionality, the authors in [7] presented a novel application of random subspace ensembles. The proposed ensemble, called the random subsample ensemble (RSE), generated random lower dimensional projections/subspaces of the original high dimensional feature space. Each of the subspaces constituted a classification subtask much smaller than the original. The predictions of classifiers developed on these subtasks were aggregated within an ensemble framework to produce the final classification outcome. Accordingly, one can note that RSE employs a divide-and-conquer methodology to deal with the adverse effects of high dimensionality.

RSE was presented as an alternative to feature selection and feature transformation methods for managing high dimensionality; among the latter, for instance, principal component analysis and the K-L transform suffer from drawbacks such as information loss and lack of interpretability. RSE makes the high dimensional learning task scalable by using lower dimensional projections, and the same projections inherently address the sparsity of data points in the high dimensional feature space. Empirical testing on five high dimensional datasets in [7] established that RSE is well positioned to address the problems of high dimensionality.
In this paper, we explore the effect of the sampling technique on the performance of RSE by comparing its performance results under three different sampling techniques. Section 2 introduces the classification approach based on the random subsample ensemble and the sampling techniques. Section 3 presents and discusses the simulation results along with the comparison of the three sampling techniques. Finally, conclusions are presented in Section 4.

2. Random Subsample Ensemble

The rationale behind the random subsample ensemble (RSE) is to break down a complex high dimensional problem into several lower-dimensional sub-problems, and thereby conquer the computational complexity of the original problem. The high dimensional feature space is projected onto a set of lower dimensions by selecting random feature subsets, or subsamples, from the original set. The lower-dimensional projections can be generated using a variety of techniques based on random subsampling. The three specific methods used to generate the random subsamples of the original feature space in this paper are: (1) random sampling without replacement, (2) random sampling with replacement, and (3) random mutually exclusive partitioning.

Random sampling without replacement generates subsamples in which a selected feature is unique within a subsample; however, subsamples may overlap, so the same feature might exist in more than one subsample. Random sampling with replacement allows a single feature to be repeated both within a subsample and across the set of subsamples. The subsamples generated by random partitioning are mutually exclusive, i.e. a selected feature is unique within a subsample and across the set of subsamples. For all three techniques each feature subsample may have the same number of features, although this is another adjustable parameter.

As an example, consider the original high-dimensional feature space F = {x_1, x_2, ..., x_k}. The feature space F is randomly projected onto N d-dimensional subspaces f_i = {x_1, x_2, ..., x_d} for i = 1 to N, with d << k. The values for d (cardinality of the subsample set) and N (number of subsamples) are likely to be problem-dependent and require empirical search and/or theoretical analysis for optimality.
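The three subsampling schemes can be sketched as follows. This is an illustrative implementation operating on feature indices 0..k-1; the function names and the use of Python's random module are ours, not the paper's (which used WEKA):

```python
import random

def subsample_without_replacement(k, d, n, seed=0):
    """n subsamples of d distinct feature indices each: a feature is
    unique within a subsample but may recur across subsamples."""
    rng = random.Random(seed)
    return [rng.sample(range(k), d) for _ in range(n)]

def subsample_with_replacement(k, d, n, seed=0):
    """A feature may repeat both within and across subsamples."""
    rng = random.Random(seed)
    return [[rng.randrange(k) for _ in range(d)] for _ in range(n)]

def random_partition(k, n, seed=0):
    """Mutually exclusive partition: every feature appears exactly once
    across the n subsamples; sizes differ by at most one when n does
    not divide k."""
    rng = random.Random(seed)
    idx = list(range(k))
    rng.shuffle(idx)
    base, extra = divmod(k, n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(idx[start:start + size])
        start += size
    return parts
```

Note that for partitioning the subsample size is dictated by k and n, whereas the two sampling schemes take d and n as free parameters, which matches the parameterization used in the simulation study below.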
The d-dimensional feature sets are generated either by randomly sampling d features N times from the original feature set without/with replacement, or by randomly partitioning the feature space into N d-dimensional subsamples; hence one of the following holds, respectively:

    ∪_{i=1}^{N} f_i ⊆ {x_1, x_2, ..., x_k},    (1)

    ∪_{i=1}^{N} f_i = {x_1, x_2, ..., x_k}.    (2)

Random partitioning selects every original feature exactly once across the set of subsamples, which tends to minimize the potential information loss in terms of individual features. While random sampling (with or without replacement) does not guarantee selection of all the features, the cardinality of the feature subsamples and the number of subsamples can be manipulated to ensure that the majority of the original features are selected in at least one subsample.

Each lower dimensional subsample of the feature space is used to train a base learner of the random subsample ensemble. Each base learner is trained on the entire original set of training data instances, which has the advantage that all the classifiers are trained with full class representation [8]. Once the base learners are trained, their predictions are combined to obtain the final ensemble prediction. The architecture of the random subsample ensemble (RSE) is depicted in Fig. 1.

Figure 1. Conceptual architecture of RSE

The results of simulations with RSE on a set of high dimensional datasets presented in [7] helped affirm the effectiveness of RSE as an approach for high dimensional classification. The sampling technique used to instantiate the RSE ensembles for those particular simulations was random sampling without replacement.
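As a back-of-envelope check on the coverage claim (our own calculation, not from the paper): under sampling without replacement, a fixed feature is missed by a single d-sized subsample with probability 1 - d/k, so the expected fraction of features appearing in at least one of N independent subsamples is 1 - (1 - d/k)^N:

```python
def expected_coverage(k, d, n):
    """Expected fraction of the k features selected in at least one of
    n subsamples of size d, each drawn independently without replacement
    (a fixed feature is missed by one subsample with prob. 1 - d/k)."""
    return 1.0 - (1.0 - d / k) ** n

# With 10% subsamples (d/k = 0.1), about 44 subsamples already cover
# roughly 99% of the features in expectation, since 0.9**44 < 0.01.
```

This is why modest subsample counts suffice to select the majority of features at least once, even for small subsample sizes.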
It was established, within the context of this sampling technique, that a small subsample size combined with a high number of subsamples (and hence a high number of base learners) allowed RSE to perform on par with other popular ensembles; the small subsample size is critical in this case for effectively addressing the curse of dimensionality.

Ensemble classification methods similar to RSE include Attribute Bagging, Input Decimation, Random Subspace, Multiple Feature Subset (MFS), and Classification Ensembles from Random Partitions, found respectively in [8], [9], [10], [11], [12]. Each of these methods subsamples or partitions the feature space in some form, but except for MFS, to an extent, none of them explores the subsampling process itself. With MFS, the author uses both sampling with and without replacement to subsample the feature space and compares the resulting ensembles on classification error rates, but finds no significant difference between the two sampling techniques. However, since MFS is a specialized classification algorithm intended to improve the performance of the nearest neighbor algorithm, the context, and consequently the applicability and generality, of that study is very limited.

3. Simulation study

The aim of the simulation study is to profile the effect of feature space subsampling techniques on the performance of the random subsample ensemble (RSE). The three sampling techniques entail sampling without replacement, sampling with replacement, and mutually exclusive partitioning.

The datasets employed and their characteristics are presented in Table 1. The datasets are from the UCI Machine Learning Repository. Madelon (Made) is a two-class artificial dataset with continuous input variables. With the Isolet (Isol) dataset the goal is to predict which letter name was spoken by human subjects, while the Multiple Features (Mu Fe) dataset consists of features of handwritten numerals ('0' through '9') extracted from a collection of Dutch utility maps. The Internet Ads (In Ad) dataset represents a set of possible advertisements on Internet pages. Dexter (Dext) is a text classification problem in a bag-of-words representation.

Two parameters need to be set for sampling without replacement and sampling with replacement, namely the subsample size and the number of subsamples. For partitioning, however, only one of the two parameters needs to be set, as the other is dictated by the one that is set.
For example, if the number of subsamples is set to 4, then the subsample size would be 25% of the original feature set size, and vice versa. Accordingly, different values for the cardinality of the feature subsamples and the number of subsamples were explored for each dataset for both sampling without and with replacement. The subsample size/cardinality was varied from 5% to 50% of the original feature count in increments of 5%. The subsample count parameter, which directly corresponds to the number of base learners/classifiers, was explored for 3, 5, 7, 9, 11, 15, 19 and 25. In the case of partitioning, the values selected for the number of partitions (which is the same as the classifier count) were 3, 5, 7, 9, 11, 15, 19 and 25; accordingly, the subsample sizes were dictated as approximately 33%, 20%, 14%, 11%, 9%, 7%, 5% and 4%, respectively.

Table 1. Data sets

  Dataset             Feature Count   Training Instances   Testing Instances
  Madelon                   500              2600                  -
  Isolet                    617              6238                1559
  Multiple Features         649              2000                  -
  Internet Ads             1558              3279                  -
  Dexter                  19999               600                  -

Simulation experiments were conducted using the algorithm implementations in the WEKA environment (version 3.5.8) [13]. The machine learning algorithms experimented with as base learners for RSE are C4.5, naïve Bayes (nb), and k-nearest neighbor (knn). Default parameter values in WEKA were used for the learning algorithms, which is a common practice in the literature [14], [15]. In all of the simulation experiments, ten-fold cross validation was used for base learners in the absence of a test set. The performance metrics recorded are prediction accuracy, SAR, and cpu time. SAR is measured as the average of the squared error, accuracy, and area under the ROC curve.
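The SAR metric just described can be sketched for a binary problem as follows. We assume the Caruana-style combination SAR = (accuracy + AUC + (1 - RMSE)) / 3; the helper names and the hand-rolled rank-based AUC are our own illustration, not the WEKA implementation:

```python
import numpy as np

def _auc(y_true, p):
    """Area under the ROC curve via the Mann-Whitney rank statistic;
    tied scores receive their average rank."""
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)
    for v in np.unique(p):          # average ranks over tied scores
        mask = p == v
        ranks[mask] = ranks[mask].mean()
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def sar(y_true, p, threshold=0.5):
    """SAR = (accuracy + AUC + (1 - RMSE)) / 3 for a binary problem,
    assuming this is the combination the paper refers to."""
    y_true = np.asarray(y_true)
    p = np.asarray(p, dtype=float)
    acc = np.mean((p >= threshold).astype(int) == y_true)
    rmse = np.sqrt(np.mean((p - y_true) ** 2))
    return (acc + _auc(y_true, p) + (1.0 - rmse)) / 3.0
```

A perfect probabilistic classifier scores 1.0 on all three components and hence a SAR of 1.0, while each component penalizes a different failure mode (thresholded errors, ranking errors, and poor calibration, respectively).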
The cpu time was calculated from the processor time taken by the process of an ensemble/learning algorithm as implemented within WEKA, measured by the time utility on the Solaris OS (version 9) platform. All base classifiers within a specific instantiation of RSE were of the same type, i.e. one of C4.5, knn or nb only. Consequently, for each dataset there are 80 possible combinations of subsampling percentage vs. base classifier count to be explored for sampling without and with replacement. For partitioning there are eight instantiations of RSE, one for each base classifier count. The ensemble architecture chosen for modeling RSE was voting (average of the posterior probabilities), given its implementation simplicity, low computational overhead, and proven track record for a variety of applications as reported in the literature.

The prediction accuracy results of RSE for the two random sampling techniques (not the partitioning) across the different combinations of subsampling percentages and numbers of base classifiers, using C4.5 as the base learning algorithm on the Dexter dataset, are shown in Tables 2 and 3. The results of the simulations for partitioning are shown in Table 4. The entire scenario presented in Tables 2, 3, and 4 was repeated for the other two learning algorithms, namely knn and naïve Bayes, and for the three performance metrics, namely prediction accuracy, SAR, and cpu time, although those results are not presented here due to space limitations. Similar evaluations, again omitted due to space limitations, were carried out for the other four datasets and the three base learning algorithms, recording values for all three performance metrics.

Table 2. Accuracy results (%) for the Dexter dataset across subsampling percentage vs. base classifier count for sampling without replacement using C4.5

  Subsamp. %                  Classifier Count
                3     5     7     9    11    15    19    25
  5          80.2  83.0  83.2  84.2  83.7  85.5  87.7  86.7
  10         75.5  89.0  89.7  88.7  90.3  89.5  90.2  91.5
  15         80.8  87.8  90.0  90.7  90.7  92.3  92.5  93.3
  20         88.0  88.3  90.0  92.0  91.5  93.5  94.7  92.3
  25         88.3  89.3  88.5  92.0  91.8  91.7  93.5  94.2
  30         85.3  89.2  90.5  92.3  92.3  93.2  93.7  93.2
  35         85.0  88.2  90.5  91.5  91.5  92.5  93.5  93.7
  40         88.2  88.8  90.7  90.5  93.0  93.0  93.0  92.0
  45         86.5  89.7  89.0  90.8  92.2  92.2  92.5  91.7
  50         89.3  89.5  91.7  90.3  92.3  93.5  92.8  93.2

Table 3. Accuracy results (%) for the Dexter dataset across subsampling percentage vs. base classifier count for sampling with replacement using C4.5

  Subsamp. %                  Classifier Count
                3     5     7     9    11    15    19    25
  5          67.7  78.8  84.2  86.0  84.3  89.3  85.0  89.2
  10         75.5  85.3  86.7  86.5  89.8  90.7  91.2  93.2
  15         82.8  84.5  87.5  87.2  88.8  90.8  91.3  93.7
  20         86.2  87.3  89.2  89.5  88.8  89.0  93.3  92.7
  25         84.3  88.0  88.0  90.0  91.3  91.7  93.7  94.0
  30         89.7  90.5  89.0  92.3  92.3  92.3  94.0  93.3
  35         87.7  89.7  91.5  92.2  93.2  92.8  92.7  92.8
  40         86.2  89.8  91.0  92.0  91.2  92.2  92.8  92.7
  45         88.3  90.5  92.8  90.8  92.5  92.8  91.8  93.5
  50         89.5  89.8  89.7  92.0  92.3  93.7  91.0  92.7

Table 4. Accuracy results (%) for the Dexter dataset for different base classifier counts for partitioning using C4.5

  Classifier Count   Accuracy
  3  (33%)             88.5
  5  (20%)             88.8
  7  (14%)             91.2
  9  (11%)             90.7
  11 (9%)              89.2
  15 (7%)              91.5
  19 (5%)              90.3
  25 (4%)              90.3

The comparative performance analysis procedure was divided into two parts.
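Before turning to the comparison itself, the voting combiner used throughout, which averages the base learners' posterior class probabilities and predicts the argmax, can be sketched as follows (a generic illustration in numpy, not WEKA's implementation):

```python
import numpy as np

def soft_vote(prob_list):
    """Average the (n_instances, n_classes) posterior-probability arrays
    produced by the base learners, then predict the most probable class."""
    avg = np.mean(np.stack(prob_list), axis=0)
    return avg.argmax(axis=1), avg

# Three base learners (each trained on its own feature subsample),
# two instances, two classes:
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])
p2 = np.array([[0.6, 0.4], [0.4, 0.6]])
p3 = np.array([[0.7, 0.3], [0.1, 0.9]])
labels, avg = soft_vote([p1, p2, p3])  # labels -> [0, 1]
```

Averaging posteriors rather than taking a majority over hard labels lets confident base learners outweigh uncertain ones, at essentially no extra computational cost.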
In the first part, sampling without replacement and sampling with replacement are compared using the Wilcoxon signed-ranks test [16]. The second part compares the three techniques using three-dimensional plots of the established performance results of the RSE instantiations for each sampling technique. The comparative analysis had to be divided into two parts because it was not suitable to compare all three sampling techniques together with a formal statistical procedure, given the much smaller sample of performance values for random partitioning compared to the other two techniques (8 vs. 80).

For the Wilcoxon signed-ranks test, a single sample consists of the performance values of RSE recorded for a single metric, for the sampling technique under consideration, for a single base learning algorithm, on a single dataset. The simulation experiments were conducted on five datasets using three learning algorithms, recording three performance metrics. Thus, there are 45 pairs of samples for the two sampling techniques, without and with replacement. Each pair of samples was compared using a single Wilcoxon signed-ranks test, with the hypothesis that sampling without replacement performs better than sampling with replacement. The tests were conducted with α = 0.05. The values of the z-statistic for each test are shown in Table 5. With α = 0.05, the observed value of z is significant if it is greater than 1.645. The cases where sampling without replacement performs significantly better than sampling with replacement are shown in bold in the table, while the cases where it performs significantly worse are shown in italics. The case with NA as its entry is the one for which the number of non-zero pair differences is too small to be approximated by a normal distribution.
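The z-statistic reported in Table 5 can be reproduced with a normal-approximation Wilcoxon signed-ranks test along these lines (our own plain-numpy sketch; scipy.stats.wilcoxon provides an off-the-shelf equivalent):

```python
import numpy as np

def wilcoxon_z(a, b):
    """One-sided Wilcoxon signed-ranks z via the normal approximation:
    a positive z favours sample a over paired sample b. Zero differences
    are dropped; tied absolute differences receive average ranks."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0]
    n = len(d)
    absd = np.abs(d)
    order = np.argsort(absd)
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    for v in np.unique(absd):       # average ranks over ties
        m = absd == v
        ranks[m] = ranks[m].mean()
    w_plus = ranks[d > 0].sum()     # sum of ranks of positive differences
    mu = n * (n + 1) / 4
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mu) / sigma
```

At α = 0.05 for the one-sided test, z > 1.645 would indicate that sample a is significantly better, mirroring the decision rule applied to Table 5.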
In that case the value of W (the Wilcoxon statistic) is 12, which is greater than the tabulated critical value, and hence there is no statistically significant difference between the two techniques.

Table 5 shows that for the accuracy metric, seven z scores, covering all three algorithms and four of the five datasets (Multiple Features excluded), favor random subsampling without replacement. In four tests, z scores favor random subsampling with replacement; these four scores also cover all three algorithms but only three of the five datasets (Multiple Features and Internet Ads excluded). For the remaining four cases, the tests indicate no statistically significant difference.

Analysis for the SAR performance metric indicates that there is hardly any statistically significant difference in performance due to the two subsampling schemes: in ten of the 15 cases, z scores indicate no statistically significant difference. For four z scores, random subsampling without replacement claims statistically significant superiority in performance; this happens for all three learning algorithms but only two datasets, Madelon and Internet Ads. In only one case, knn for Dexter, is there a statistically significant difference in favor of random subsampling with replacement. It is also possible to suggest that there is no apparent bias toward either sampling technique with respect to either the accuracy or the SAR metric when the learning algorithm is knn.

Overall, the statistical tests show that the performance values for sampling without replacement are somewhat, though not to a large degree, better than those for sampling with replacement at a statistically significant level.

Table 5. Values of the z-statistic for the Wilcoxon signed-rank tests comparing sampling without replacement against sampling with replacement

  Algo.  Metric       Made    Isol   Mu Fe   In Ad    Dext
  C4.5   Acc.         3.72   -1.96   -1.62    4.51    1.42
         SAR          2.91   -0.37    0.38    1.88   -1.19
         cpu time     2.58    1.33    2.20    4.29   -0.24
  knn    Acc.         3.84   -2.23   -1.21    2.31   -1.74
         SAR          1.75   -0.37    0.00   -0.32   -1.66
         cpu time    -6.12   -5.20   -7.77    7.60   -4.82
  nb     Acc.        -1.86    2.09   -0.25    3.58    3.44
         SAR         -1.58    1.62     NA     2.38    0.59
         cpu time     6.74    5.93    6.96    4.57    1.20

The time complexity, or cost, appears to have been noticeably affected by the choice of subsampling scheme. In eight of the fifteen cases, the measurements suggest that random subsampling without replacement is more affordable at a statistically significant level. Interestingly, in four cases there are statistically significant differences in favor of random subsampling with replacement, and all four of those cases are for the knn algorithm.

A single sample for random partitioning has 8 data points, while for the other two techniques the sample size is 80. Due to this disparity in sample sizes, it was decided not to use a formal statistical comparison procedure to compare partitioning against the other two sampling techniques.
Instead, three-dimensional scatter plots of the performance values of the three sampling techniques were employed to facilitate visual comparison; overall, there are 45 individual cases for comparison. Sample plots depicting the performance of the three sampling techniques on the Dexter dataset, with C4.5 as the learning algorithm, on all three performance measures are shown in Figures 2 through 4. The green dots (lighter gray in the non-color version) represent the performance values for sampling without replacement, the orange asterisks (darker gray in the non-color version) represent the values for sampling with replacement, and the blue diamonds represent the values for random partitioning.

Referring to the rest of the plots, which could not be included here due to space limitations, there are 12 cases where the two sampling techniques, with and without replacement, appear to be significantly better than random partitioning. On the Madelon dataset, the two techniques have higher values for prediction accuracy and SAR for the C4.5 and knn algorithms. They are better than random partitioning on accuracy for C4.5 on the Isolet dataset, and better than partitioning on both accuracy and SAR for all three algorithms on the Internet Ads dataset. Finally, the SAR values for sampling without and with replacement are higher than those for partitioning for C4.5 on Dexter. Random partitioning is better than the other two sampling techniques on the SAR measure for nb on the Dexter dataset. In all the other cases for accuracy and SAR, the performances of the three techniques are comparable. In conclusion, the random sampling with/without replacement duo has a lead over the random partitioning technique.

Figure 2. Accuracy values for the three sampling techniques using C4.5 as the learning algorithm on the Dexter dataset

For all the cpu time results, partitioning appears to be better than the other two techniques.
However, this is expected, since for partitioning there are no combinations of higher classifier counts with higher subsampling percentages; these are precisely the combinations that incur the highest cpu time costs for the other two sampling techniques.

4. Conclusions

Three sampling techniques were explored for their effect on the performance of the random subsample ensemble, which is a variant of the random subspace ensembles in the literature. A simulation study was performed on high dimensional datasets from the UCI repository to establish the optimal sampling technique amongst sampling without replacement, sampling with replacement, and mutually exclusive partitioning. The results indicate better performance of the ensemble for the sampling without replacement method over the other two techniques within the context of the simulation study.

Figure 3. SAR values for the three sampling techniques using C4.5 as the learning algorithm on the Dexter dataset

Figure 4. cpu time values for the three sampling techniques using C4.5 as the learning algorithm on the Dexter dataset

References

[1] D. L. Donoho, High dimensional data analysis: The curses and blessings of dimensionality, Aide-memoire of an American Mathematical Society lecture, Math Challenges of the 21st Century, 2000.
[2] L. Yu, H. Liu, Feature selection for high-dimensional data: A fast correlation-based filter solution, Proc. 20th ICML, 2003, pp. 856-863.
[3] R. Caruana, N. Karampatziakis, A. Yessenalina, An Empirical Evaluation of Supervised Learning in High Dimensions, Proc. 25th ICML, 2008, pp. 96-103.
[4] Y. Zhao, Y. Chen, X. Zhang, A Novel Ensemble Approach for Cancer Data Classification, Proc. 4th Intl. Symposium on Neural Networks, LNCS, vol. 4492, Springer, 2007, pp. 1211-1220.
[5] O. Maimon, L. Rokach, Improving Supervised Learning by Feature Decomposition, Foundations of Information and Knowledge Systems, 2002, pp. 178-196.
[6] R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, 1961.
[7] G. Serpen, S. Pathical, Classification in High-Dimensional Feature Spaces: Random Subsample Ensemble, Proc. ICMLA, 2009, pp. 740-745.
[8] R. Bryll, R. Gutierrez-Osuna, F. Quek, Attribute Bagging: Improving Accuracy of Classifier Ensembles by Using Random Feature Subsets, Pattern Recognition, vol. 36, 2003, pp. 1291-1302.
[9] N. C. Oza, K. Tumer, Input Decimation Ensembles: Decorrelation through Dimensionality Reduction, Proc. Intl. Workshop on Multiple Classifier Systems, LNCS, vol. 2096, Springer, 2001, pp. 238-247.
[10] T. K. Ho, The Random Subspace Method for Constructing Decision Forests, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, 1998, pp. 832-844.
[11] S. Bay, Combining Nearest Neighbor Classifiers through Multiple Feature Subsets, Proc. 15th ICML, 1998, pp. 37-45.
[12] H. Ahn, H. Moon, M. J. Fazzari, N. Lim, J. Chen, R. Kodell, Classification by Ensembles from Random Partitions of High-Dimensional Data, Computational Statistics and Data Analysis, vol. 51, 2007, pp. 6166-6179.
[13] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, 2005.
[14] S. Dzeroski, B. Zenko, Is Combining Classifiers with Stacking Better than Selecting the Best One?, Machine Learning, vol. 54, 2004, pp. 255-273.
[15] J. J. Rodriguez, L. I. Kuncheva, C. J. Alonso, Rotation Forest: A New Classifier Ensemble Method, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, 2006, pp. 1619-1630.
[16] F. Wilcoxon, Individual Comparisons by Ranking Methods, Biometrics Bulletin, vol. 1, 1945, pp. 80-83.