Ensemble-based Feature Selection Criteria


Terry Windeatt 1, Matthew Prior 1, Niv Effron 2, Nathan Intrator 2

1 Centre for Vision, Speech and Signal Proc (CVSSP), University of Surrey, Guildford, Surrey, United Kingdom GU2 7XH
2 School of Computer Science, Tel-Aviv University, Ramat-Aviv 69978, Israel
[t.windeatt,m.prior]@surrey.ac.uk, nivefron@gmail.com, nin@post.tau.ac.il

Abstract. Recursive Feature Elimination (RFE) combined with feature ranking is an effective technique for eliminating irrelevant features when the feature dimension is large, but it is difficult to distinguish between relevant and redundant features. The usual method of determining when to stop eliminating features is based on either a validation set or cross-validation techniques. In this paper, we present feature selection criteria based on out-of-bootstrap (OOB) and class separability, both computed on the training set, thereby obviating the need for validation. The RFE method described in this paper uses a two-class neural network classifier, and the ranking of features is based on the magnitude of neural network weights. This approach is compared experimentally with a noisy bootstrapped version of Fisher's Linear Discriminant (FLD) to rank features. The techniques are extended to multi-class problems using the Error-Correcting Output Coding (ECOC) method. Experimental investigation on artificial and natural benchmark data demonstrates the effectiveness of these criteria in selecting optimal number of features and classifier complexity. Furthermore, the known location of influential features in the simulated data permits the use of the ROC (Receiver Operating Characteristic) curve to demonstrate the performance of RFE.

Keywords: RFE, ECOC, Multiple Classifiers, feature selection.

1 Introduction

Consider a supervised learning problem, in which training patterns consist of a large number of features, many of which are suspected to be irrelevant to the classification problem at hand. To reduce dimensionality, a decision needs to be taken whether to select or extract features.
One of the most popular general purpose feature extraction techniques is Principal Component Analysis (PCA), which is a mapping or projection on to the principal directions and is an effective method of feature space reduction. It is particularly important to reduce the number of features for small sample size problems (Section 3). In general, feature extraction techniques make use of all the original features when mapping to new features. However, over-fitting may result if the dimension space is high. Furthermore, it may not be successful due to complex class distributions [1]. Finally, feature extraction methods are difficult to interpret in terms of the importance of original features. This loss of interpretability is one of the key reasons why feature selection is preferred in many data mining and bio-informatics applications. In this paper, only feature selection methods are considered.

Feature selection has received attention for many years from researchers in the fields of pattern recognition, machine learning and statistics. Exhaustive enumeration of all subsets of features is impractical except for a few features. Various greedy algorithms have been developed to find the best subset of features. A popular approach is to rank features according to a suitable criterion, and various strategies are possible once the ranked feature set is obtained. For example, a fixed number of features may be selected to design a classifier, or alternatively a threshold may be set on the ranking criterion to determine the number of features. Selecting the optimal number of features with respect to generalization performance normally requires a validation set or cross-validation techniques.

In this paper, we present feature selection criteria based on out-of-bootstrap (OOB) and class separability. They are computed on the training set, so that validation is not required. The separability measure defined in equation (4) was proposed in [2] for selecting optimal base classifier complexity. The main contribution of this paper is to propose stopping criteria based on class separability and OOB estimate when RFE is applied to an ensemble.

The paper is organized as follows. Section 2 describes the relevant concepts for ensemble methods, including the OOB estimate and the Error-Correcting Output Coding (ECOC) method for solving multi-class problems. Section 3 discusses feature ranking methods and Recursive Feature Elimination (RFE). Section 4 describes the datasets used for experimentation as well as the ROC curve for characterizing the performance on simulated data, for which the location of influential features is known. Experimental results are given in Section 5, showing that optimal number of features and base classifier complexity may be selected without need for validation.

2 Ensemble Methods

The Multiple Classifier System (MCS), or committee/ensemble approach, has emerged over recent years to address the practical problem of designing classification systems with improved accuracy and efficiency. The aim is to design a composite system that outperforms individual classifiers by pooling together the decisions of all classifiers.
The rationale is that it may be more difficult to optimise the design of a single complex classifier than to optimise the design of a combination of relatively simple (base) classifiers. In this paper, we assume a simple parallel MCS architecture with homogeneous base classifiers. For two-class classifiers the combining rule is majority vote, while for multi-class the decision rule is defined in equation (7).

Injecting randomness into the MCS framework has been found to be a good strategy for improving generalisation performance. Random perturbations have been shown to be useful in pattern space (Bootstrapping), feature space (Random Subspace Method, RSM [3]), class labels (Error-Correcting Output Coding, ECOC) as well as in base classifiers themselves. Of these four types of random perturbation methods, all are used in this paper except RSM. ECOC is described in Section 2.2, and random weights are used to initialize neural network base classifiers in Section 5.

Bootstrapping [4] is a popular ensemble technique and implies that if $\mu$ training patterns are randomly sampled with replacement, a fraction $(1 - 1/\mu)^{\mu} \approx 37\%$ are removed, with remaining patterns occurring one or more times. The out-of-bootstrap (OOB) estimate uses the patterns left out. The individual base classifier OOB should be distinguished from the ensemble OOB. For the ensemble OOB, all training patterns contribute to the estimate, but the only participating classifiers for each pattern are those that have not been used with that pattern for training (that is, approximately thirty-seven percent of classifiers). Note that OOB gives a biased estimate of the absolute value of generalization error [5], but in this paper, the estimate of the absolute value is not important. The OOB estimate for the ECOC ensemble is given in Section 2.2.

Selecting parameters for MCS design should ideally be carried out using only the training set, but this is usually difficult and results in a biased choice. Model selection from training data is known to require a built-in assumption, since realistic learning problems are in general ill-posed [6]. The assumption in this paper is that base classifier complexity and number of features may be selected using a bootstrap estimate and that patterns left out of the bootstrap may be used to determine optimal values with respect to generalization error. A potential problem with the bootstrap is that each base classifier sees only approximately sixty-three percent of the training set. It is shown experimentally in Section 5 that the reduced number of training patterns does not lead to an inaccurate estimate of the optimal values but may lead to an inaccurate estimate of the absolute value of generalization error. Note that the bootstrap estimate does not require any assumptions regarding underlying probability distributions.

2.1 Diversity and Class Separability

Attempts to understand the effectiveness of the MCS framework have prompted the development of various measures. The Margin concept was used originally to help explain Boosting and Support Vector Machines. Bias and Variance are concepts from regression theory that have motivated modified definitions for 0/1 loss function for characterising Bagging and other ensemble techniques. Various diversity measures have been studied with the intention of determining whether they correlate with ensemble accuracy. However, the question of whether the information available from any of these measures can be used to assist MCS design is open. Most commonly, MCS parameters are set with the help of either a validation set or cross-validation techniques [7].
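As a point of comparison with validation-based tuning, the ensemble OOB estimate described in Section 2 can be sketched as follows. This is a minimal illustration for two-class labels, not the authors' implementation: the function names, the data layout (`predictions[b][m]`, `in_bag[b]`) and the tie-breaking rule are assumptions made for the example.

```python
import random

def bootstrap_indices(n_patterns, rng):
    """Draw one bootstrap replicate: n_patterns indices sampled with replacement."""
    return [rng.randrange(n_patterns) for _ in range(n_patterns)]

def ensemble_oob_error(labels, predictions, in_bag):
    """Ensemble OOB error rate for two-class labels in {0, 1}.

    predictions[b][m] is the decision of classifier b for pattern m.
    in_bag[b] is the set of pattern indices used to train classifier b.
    Each pattern is voted on only by classifiers that did not train on it.
    """
    errors, counted = 0, 0
    for m, target in enumerate(labels):
        votes = [predictions[b][m] for b in range(len(predictions))
                 if m not in in_bag[b]]
        if not votes:
            continue  # pattern appeared in every bootstrap sample
        # Majority vote; a tie is resolved as class 0 (an arbitrary assumption).
        majority = 1 if 2 * sum(votes) > len(votes) else 0
        errors += int(majority != target)
        counted += 1
    return errors / counted if counted else float("nan")

# Fraction of patterns left out of one bootstrap replicate of 1000 patterns,
# to compare with the analytic value (1 - 1/mu)^mu.
rng = random.Random(0)
mu = 1000
sample = bootstrap_indices(mu, rng)
left_out_fraction = 1 - len(set(sample)) / mu
analytic_fraction = (1 - 1 / mu) ** mu
```

The base classifier OOB differs only in that each classifier is scored separately on its own left-out patterns and the resulting error rates are averaged.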
Diversity measures have received much attention recently since it is recognized that diversity among base classifiers is a necessary condition for improvement in ensemble performance. However, there is no general agreement about how to quantify the notion of diversity among a set of classifiers. Diversity measures can be categorised into two types [8], pair-wise and non-pair-wise. In order to apply pair-wise measures to finding overall diversity of a set of classifiers it is necessary to average over the set. Non-pair-wise measures attempt to measure diversity of a set of classifiers directly, based for example on variance, entropy or proportion of classifiers that fail on randomly selected patterns. The main difficulty with diversity measures is the so-called accuracy-diversity dilemma. As explained in [9], as base classifiers approach the highest levels of accuracy, diversity must decrease, so that it is expected that there will be a trade-off between diversity and accuracy.

The Diversity/Accuracy Dilemma leads us to expect that ensemble performance may not be optimized when each individual classifier is optimized [10]. There has been no convincing theory or experimental study to suggest that there exists any measure that can reliably predict generalisation error of an ensemble. However, in [11] the OOB estimate was used to tune diversity via early-stopping of a neural network ensemble.

Classical class separability measures refer to the ability to predict separation of patterns into classes using original features and rely on a Gaussian assumption [12]. In [2] a class separability measure is proposed for MCS that is based on a binary feature representation, in which each pattern is represented by its binary MCS classifier decisions. It is restricted to two-class problems and results in a binary-to-binary mapping. The problem with applying classical class separability measures is that the implicit Gaussian assumption is not appropriate for this mapping [13].

Let there be $\mu$ patterns with the label $\omega_m$ given to each pattern $x_m$ where $m = 1, \ldots, \mu$. In an MCS framework, the mth pattern may be represented by the B-dimensional vector formed from the B base classifier decisions given by

$$x_m = (\xi_{m1}, \xi_{m2}, \ldots, \xi_{mB}), \quad \xi_{mi}, \omega_m \in \{0,1\}, \quad i = 1, \ldots, B \qquad (1)$$

In equation (1), $\omega_m = f(x_m)$ where f is the unknown binary-to-binary mapping from classifier decisions to target label. Following [8], the notation in equation (1) is modified so that the classifier decision is 1 if it agrees with the target label and 0 otherwise

$$x_m = (y_{m1}, y_{m2}, \ldots, y_{mB}), \quad y_{mi}, \omega_m \in \{0,1\}, \quad y_{mi} = 1 \text{ iff } \xi_{mi} = \omega_m \qquad (2)$$

Pairwise diversity measures, such as Q statistic, Correlation Coefficient, Double Fault and Disagreement measures [8], take no account of class assigned to a pattern. In contrast, class separability [14] is computed between classifier decisions (equation (2)) over pairs of patterns of opposite class, using four counts defined by the logical AND ($\wedge$) operator

$$N_n^{ab} = \sum_{m:\, \omega_m \neq \omega_n} \; \sum_{j=1}^{B} \psi_{nj}^{a} \wedge \psi_{mj}^{b}, \quad a, b \in \{0,1\}, \quad \psi^{1} = y, \; \psi^{0} = \bar{y} \qquad (3)$$

The nth pattern for a two-class problem is assigned

$$\sigma_n = \frac{1}{K_\sigma} \left( \frac{N_n^{11}}{\sum_{m=1}^{\mu} N_m^{11}} - \frac{N_n^{00}}{\sum_{m=1}^{\mu} N_m^{00}} \right) \qquad (4)$$

where $K_\sigma = \frac{N_n^{11}}{\sum_{m=1}^{\mu} N_m^{11}} + \frac{N_n^{00}}{\sum_{m=1}^{\mu} N_m^{00}}$.

The motivation for $\sigma_n$ comes from estimation of the first order spectral coefficients [2] of the binary-to-binary mapping defined in equation (1). Each pattern is compared with all patterns of the other class, and the number of jointly correctly ($N^{11}$) and incorrectly ($N^{00}$) classified patterns are counted. Note that a classifier that correctly classifies one pattern but incorrectly classifies the other does not contribute. The two terms in equation (4) represent the relative positive and negative evidence that the pattern comes from the target class.
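The counting behind equations (3) and (4) can be made concrete with a short sketch. It follows one reading of those equations: for each pattern, the joint counts are accumulated over all patterns of the opposite class and over all classifiers, each count is normalised by its total over all patterns, and the normalising constant is taken to be the sum of the two normalised terms. The function name and the handling of zero denominators are assumptions of this example.

```python
def separability_coefficients(correct, labels):
    """Per-pattern coefficient sigma_n under one reading of equations (3)-(4).

    correct[n][j] is 1 if classifier j classifies pattern n correctly, else 0.
    labels[n] in {0, 1} is the target class of pattern n.
    """
    mu = len(labels)
    n11 = [0] * mu  # classifiers jointly correct on n and an opposite-class pattern
    n00 = [0] * mu  # classifiers jointly incorrect on n and an opposite-class pattern
    for n in range(mu):
        for m in range(mu):
            if labels[m] == labels[n]:
                continue  # only pairs of opposite class are compared
            for y_n, y_m in zip(correct[n], correct[m]):
                n11[n] += y_n & y_m
                n00[n] += (1 - y_n) & (1 - y_m)
    total11, total00 = sum(n11), sum(n00)
    sigmas = []
    for n in range(mu):
        pos = n11[n] / total11 if total11 else 0.0  # relative positive evidence
        neg = n00[n] / total00 if total00 else 0.0  # relative negative evidence
        k = pos + neg                               # assumed form of K_sigma
        sigmas.append((pos - neg) / k if k else 0.0)
    return sigmas

# Four patterns, three classifiers: rows are patterns, columns are classifiers.
correct = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1]]
labels = [0, 0, 1, 1]
sigmas = separability_coefficients(correct, labels)
```

A classifier that is right on one pattern of a pair and wrong on the other adds to neither count, which matches the remark in the text that such classifiers do not contribute.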
We sum over patterns with positive coefficient to produce a single number between −1 and +1 that represents the separability of a set of patterns

$$\sigma = \sum_{n=1}^{\mu} \sigma_n, \quad \sigma_n > 0 \qquad (5)$$

In our experiments in Section 4 we use the Q diversity measure, as recommended in [8]. Diversity Q between ith and jth classifiers is defined as

$$Q_{ij} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}} \qquad (6)$$

where $N^{ab} = \sum_{m=1}^{\mu} \psi_{mi}^{a} \wedge \psi_{mj}^{b}$ with a, b, $\psi$ defined in equation (3). The mean is taken over B base classifiers, $\bar{Q} = \frac{2}{B(B-1)} \sum_{i=1}^{B-1} \sum_{j=i+1}^{B} Q_{ij}$.

2.2 Error-Correcting Output Coding (ECOC)

There are several motivations for decomposing a multi-class problem into complementary two-class problems. The decomposition means that attention can be focused on developing an effective technique for the two-class classifier, without having to consider explicitly the design of the multi-class case. This is useful, for example with MLPs, when two-class classifiers do not naturally scale up to multi-class. Also, it is hoped that the parameters of a base classifier run several times may be easier to determine than a complex classifier run once, and perhaps facilitate faster and more efficient solutions. Finally, solving different two-class sub-problems repeatedly with random perturbation may help to reduce error in the original problem.

The ECOC method [15] is an example of distributed output coding [16], in which a pattern is assigned to the class that is closest to a corresponding code word. Rows of the ECOC matrix act as the code words, and are designed using error-correcting principles to provide some error insensitivity with respect to individual classification errors. The original motivation for encoding multiple classifiers using an error-correcting code was based on the idea of modelling the prediction task as a communication problem, in which class information is transmitted over a channel. Errors introduced into the process arise from various stages of the learning algorithm, including features selected and finite training sample. From error-correcting theory, we know that a matrix designed to have d bits error-correcting capability implies that there is a minimum Hamming distance 2d+1 between any pair of code words. Assuming each bit is transmitted independently, it is then possible to correct a received pattern having d or fewer bits in error, by assigning the pattern to the code word closest in Hamming distance.
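The relationship between minimum Hamming distance and error correction, and the rule of assigning a received pattern to the closest code word, can be illustrated with a small sketch. The 4-class, 7-bit code matrix below is a made-up example chosen only so that the distances are easy to verify by hand; it is not a code used in the paper.

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which two equal-length code words differ."""
    return sum(x != y for x, y in zip(a, b))

def min_hamming_distance(code):
    """Smallest pairwise Hamming distance between rows of a code matrix."""
    return min(hamming(a, b) for a, b in combinations(code, 2))

def decode(code, received):
    """Index of the code word closest to the received bits (ties: lowest index)."""
    return min(range(len(code)), key=lambda k: hamming(code[k], received))

# Hypothetical code matrix: one row (code word) per class.
code = [
    (0, 0, 0, 0, 0, 0, 0),
    (1, 1, 1, 1, 0, 0, 0),
    (1, 1, 0, 0, 1, 1, 0),
    (0, 0, 1, 1, 1, 1, 0),
]

d_min = min_hamming_distance(code)
correctable = (d_min - 1) // 2       # largest d with 2d + 1 <= d_min

received = (1, 1, 0, 0, 1, 0, 0)     # code word of class 2 with one bit flipped
decoded_class = decode(code, received)
```

For binary vectors the Hamming distance coincides with the L1 distance, so the same `decode` function also describes distance-based decoding of hard base classifier decisions.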
Clearly, from this perspective it is desirable to use a matrix containing code words having high minimum Hamming distance between any pair.

To solve a multi-class problem in the ECOC framework we need a set of codes to decompose the original problem, a suitable two-class base classifier, and a decision-making framework. For a K-class problem, each row of the K x B binary ECOC matrix Z acts as a code word for each class. Each of the B columns of Z partitions the training data into two super-classes according to the value of the corresponding binary element. To classify pattern $x_m$, it is applied to the B trained base classifiers forming vector $[x_{m1}, x_{m2}, \ldots, x_{mB}]$ where $x_{mj}$ is the output of the jth base classifier. The L1 norm distance $L_i$ (where $i = 1, \ldots, K$) between output vector and code word for each class is computed

$$L_i = \sum_{j=1}^{B} \left| Z_{ij} - x_{mj} \right| \qquad (7)$$

and $x_m$ is assigned to the class $\omega_m$ corresponding to closest code word. Pattern $x_m$ is classified using only those classifiers that are in the set $OOB_m$, defined as the set of classifiers for which $x_m$ is OOB. For the OOB estimate, the summation in equation (7) is therefore modified to $j \in OOB_m$.

In [17] it is shown that any variation in Hamming distance between pairs of code words will reduce the effectiveness of the combining strategy. In [18] it is shown that maximising the minimum Hamming distance between code words implies minimising upper bounds on generalisation error. In classical coding theory, theorems on error-correcting codes guarantee a reduction in the noise in a communication channel, but the assumption is that errors are independent. When applied to machine learning the situation is more complex, in that error correlation depends on the data set, base classifier as well as the code matrix Z. In the original ECOC approach [15], heuristics were employed to maximise the distance between the columns of Z to reduce error correlation. Random codes, provided that they are long enough, have frequently been employed with almost as good performance [17]. It would seem to be a matter of individual interpretation whether long random codes may be considered to approximate required error-correcting properties. In this paper, a random code matrix with near equal split of classes (approximately equal number of 1's in each column) is chosen, as proposed in [19].

3 Feature Ranking

The aim of feature selection is to find a feature subset from the original set of features such that an induction algorithm that is run on data containing only those features generates a classifier that has the highest possible accuracy [20]. Typically with tens of features in the original set, an exhaustive search is computationally prohibitive. Indeed the problem is known to be NP-hard [20], and a greedy search scheme is required. However, some recent problems such as those in gene selection and text categorization require feature selection to be applied to hundreds and thousands of features.
For these problems, classical feature selection schemes are not greedy enough, and filter, wrapper and embedded approaches have been developed [21].

One-dimensional feature ranking methods consider each feature in isolation and rank the features according to a scoring function, but are disadvantaged by implicit orthogonality assumptions [21]. They are very efficient but in general have been shown to be inferior to multi-dimensional methods [21] that consider all features simultaneously. A feature scoring method is a function Score(j), where $j = 1, \ldots, p$ is a feature, for which higher scores usually indicate more influential features. One-dimensional functions ignore all p−1 remaining features whereas a multi-dimensional scoring function considers correlations with remaining features. Four one-dimensional scoring functions are described in [22] and compared with the noisy bootstrap (Section 3) and other regularization techniques.

The issue of feature relevance, redundancy and irrelevance has been explicitly addressed in many papers. As noted in [23] it is possible to think up examples for which two features may appear irrelevant by themselves but be relevant when considered together. Also adding redundant features can provide the desirable effect of noise reduction. It thus appears necessary to do more than just consider individual features by themselves as with one-dimensional methods.

The most important problem arises from the relatively small number of patterns relative to the number of features. In Pattern Recognition this is known as the small sample size problem, that is when the number of patterns is less than or of comparable size to the number of features [1]. It means that there is a risk of the classifier over-fitting the data, and thereby capturing unwanted idiosyncrasies. A popular way to avoid this is to utilize simple, for example linear, classifiers.

Feature ranking problems have received much attention in the literature. However, there has been relatively little work devoted to handling feature ranking explicitly in the context of MCS. Most previous work has focused on determining feature subsets to combine, but differ in the way the subsets are chosen. The Random Subspace Method (RSM) [3] is the best known method, and it was shown that a random choice of feature subset (allowing a single feature to be in more than one subset) improves performance for high-dimensional problems. In [1], forward feature and random (without replacement) selection methods are used to sequentially determine disjoint optimal subsets. In [24], feature subsets are chosen based on how well a feature correlates with a particular class. Ranking subsets of randomly chosen features before combining was reported in [25].

3.1 Ranking by MLP weights

The equation for the output O of a single output single hidden-layer MLP, assuming sigmoid activation function S, is given by

$$O = \sum_{j} S\left( \sum_{i} x_i W^{1}_{ij} \right) W^{2}_{j} \qquad (8)$$

where i, j are the input and hidden node indices, $x_i$ is input feature, $W^1$ is the first layer weight matrix and $W^2$ is the output weight vector.
In [26], a local feature selection gain $w_i$ is derived from equation (8)

$$w_i = \sum_{j} \left| W^{1}_{ij} W^{2}_{j} \right| \qquad (9)$$

The feature ranking strategy that uses equation (9) will subsequently be referred to as mod-. This product of weights strategy has been found in general not to give a reliable feature ranking [27]. However, when used with RFE it is only required to find the least relevant features. We have not experimented with any more sophisticated strategies based on sensitivity analysis [28].

3.2 Ranking by Noisy Bootstrap

Fisher's criterion measures the separation between two sets of patterns in a direction w, and is defined for the projected patterns as the difference in means normalized by the averaged variance

$$J(w) = \frac{\left| m_1 - m_2 \right|^{2}}{\sigma_1^{2} + \sigma_2^{2}} \qquad (10)$$
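The two ranking quantities introduced so far, the weight-product gain of equation (9) and Fisher's criterion of equation (10), are both short computations. The sketch below assumes that the gain is the sum over hidden nodes of the absolute product of first-layer and output weights, and that the criterion uses the squared difference of projected means over the sum of the projected (population) variances. The function names and toy weights are illustrative only.

```python
from statistics import mean, pvariance

def weight_product_scores(w1, w2):
    """Per-feature gain: sum over hidden nodes j of |w1[i][j] * w2[j]|.

    w1[i][j] is the weight from input i to hidden node j.
    w2[j] is the weight from hidden node j to the single output.
    """
    return [sum(abs(row[j] * w2[j]) for j in range(len(w2))) for row in w1]

def fisher_criterion(proj1, proj2):
    """Fisher's criterion for two sets of patterns already projected onto w."""
    m1, m2 = mean(proj1), mean(proj2)
    return (m1 - m2) ** 2 / (pvariance(proj1) + pvariance(proj2))

# Toy MLP with three inputs and two hidden nodes (made-up weights).
w1 = [[1.0, -2.0], [0.5, 0.5], [0.0, 0.0]]
w2 = [2.0, -1.0]
scores = weight_product_scores(w1, w2)
# Features ordered from most to least influential under this score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Two well-separated projected classes.
j_value = fisher_criterion([0.0, 2.0], [4.0, 6.0])
```

When the score is used inside RFE, only the tail of `ranking` matters, since each recursion removes the lowest-scoring features.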

FLD is defined as the linear function for which J(w) is maximized. It is convenient to re-write J(w) as

$$J(w) = \frac{w^{T} S_B w}{w^{T} S_W w} \qquad (11)$$

where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix. The objective of FLD is to find the transformation matrix w* that maximises J(w) in equation (10), and w* is known to be the solution of the following eigenvalue problem $S_B - S_W \Lambda = 0$ where $\Lambda$ is a diagonal matrix whose elements are the eigenvalues of matrix $S_W^{-1} S_B$. Since in practice $S_W$ is nearly always singular, dimensionality reduction is required. Typically this is performed by Principal Components Analysis (PCA) before solving the eigenvalue problem, but as noted in Section 1, that is not appropriate for our intended application.

The idea behind the noisy bootstrap is to estimate the noise in the data and extend the training set by re-sampling with simulated noise. Therefore, the number of patterns may be increased by using a re-sampling rate greater than 100 percent, thereby solving the small sample size problem. The noise model assumes a multi-variate Gaussian distribution with zero mean and diagonal covariance matrix. The reason for assuming a diagonal matrix is that there are generally insufficient number of patterns to make a reliable estimate of any correlations between features. For each class, the standard deviation of each feature is used for the diagonal entry. Two parameters to tune are the noise added γ and the sample to feature ratio s2f. Following [22] we set for our experiments γ = 0.25 and s2f = 10.

Originally, the noisy bootstrap was combined with Fisher's criterion to produce a 1-dimensional feature score [29], and then subsequently with the modulus of the FLD weights. In Section 5 we will refer to the feature ranking strategy as the noisy bootstrap and assume that it incorporates the weight ranking defined by w* in equation (10).
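The re-sampling step of the noisy bootstrap can be sketched as below. The text does not spell out how γ and s2f enter the procedure, so two assumptions are made and should be read as such: the noise added to feature j of a pattern from class ω has standard deviation γ times the standard deviation of feature j within class ω, and the number of re-sampled patterns is s2f times the number of features. The function name is hypothetical.

```python
import random
from statistics import pstdev

def noisy_bootstrap(patterns, labels, gamma=0.25, s2f=10, rng=None):
    """Re-sample patterns with replacement and add zero-mean Gaussian noise.

    Assumptions of this sketch: noise std = gamma * per-class feature std
    (diagonal covariance), and s2f * n_features patterns are generated.
    """
    rng = rng or random.Random()
    n_features = len(patterns[0])
    n_out = s2f * n_features
    # Per-class standard deviation of every feature (the diagonal entries).
    by_class = {}
    for x, w in zip(patterns, labels):
        by_class.setdefault(w, []).append(x)
    stds = {w: [pstdev(col) for col in zip(*xs)] for w, xs in by_class.items()}
    new_patterns, new_labels = [], []
    for _ in range(n_out):
        i = rng.randrange(len(patterns))  # sample a pattern with replacement
        w = labels[i]
        noisy = [v + rng.gauss(0.0, gamma * s)
                 for v, s in zip(patterns[i], stds[w])]
        new_patterns.append(noisy)
        new_labels.append(w)
    return new_patterns, new_labels

# Four patterns with two features become 20 noisy patterns (s2f = 10).
pats = [[0.0, 1.0], [1.0, 3.0], [4.0, 5.0], [6.0, 9.0]]
labs = [0, 0, 1, 1]
xs, ys = noisy_bootstrap(pats, labs, rng=random.Random(1))
```

The enlarged set can then be passed to an FLD solver, whose weight magnitudes provide the feature scores.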
3.3 Recursive Feature Elimination (RFE)

RFE is a simple algorithm [20] and operates recursively as follows:

1) Rank the features according to a suitable feature ranking method
2) Identify and remove the r least ranked features

If r ≥ 2, which is usually desirable from an efficiency viewpoint, this produces a feature subset ranking. The main advantage of RFE is that the only requirement to be successful is that at each recursion the least ranked subset does not contain a strongly relevant feature [23]. However, the choice of when to stop eliminating features is difficult and normally requires a validation set or cross-validation techniques.

4 Datasets

The artificial data is two-class according to [22], which is intended to simulate a problem in gene selection. One thousand patterns are generated per class using a diagonal p x p covariance matrix that is estimated from the colon data [31]. The difference between the two classes is in the first $2 p_d$ features. Class $\omega_1$ has all zero mean features whereas $\omega_2$ has the first $p_d$ features set to c > 0 and the next $p_d$ features set to −c, with remaining features zero mean. Therefore $2 p_d$ are influential features, with remaining features for both classes zero mean. For our experiments number of training patterns = 50 (2.5%), 40, [...]; dimension p = 100, 500; $p_d$ = 25; c = [...].

The main advantage of this simulated data is that we know the potentially influential features, so that it is possible to rate the feature rankings. If we assume that features are ordered with highest score indicating more influential features, consider what happens as we reduce the score threshold to include more features. Each included feature can be labeled as true positive or false positive, and we can plot a ROC (Receiver Operating Characteristic) curve, that is true positives versus false positives as the threshold is changed. The area under the ROC curve is a single number used to indicate the trade-off between sensitivity and specificity. The assumption for plotting the area under the ROC curve is that, at each recursive step, we reduce the threshold just enough to include the next subset of features. In our context, higher area indicates a better feature ranking with respect to the location of the influential features.

Natural two-class and multi-class benchmark problems have been selected from [32] and [33] and are shown in Table 1. For datasets with missing values the scheme suggested in [32] is used. For RFE testing in Section 5 the original features are normalized to mean 0 std 1 and the number of features increased to one hundred by adding noisy features (Gaussian std 0.25).

Table 1: Benchmark datasets showing numbers of patterns (#pat), classes (#class), continuous (#con) and discrete (#dis) features
Datasets: cancer, card, credita, dermatology, diabetes, ecoli, glass, heart, iris, ion, segment, soybean, vehicle, vote, vowel, wave, yeast
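The area under the ROC curve used in this section to rate a feature ranking can be computed directly from the scores and the known set of influential features. The sketch below uses the standard pairwise-comparison form of the area (the fraction of influential/non-influential pairs that the scores order correctly, counting ties as one half); the function name and the toy scores are illustrative.

```python
def ranking_auc(scores, influential):
    """Area under the ROC curve of a feature ranking.

    scores[j] is the score of feature j (higher = more influential).
    influential is the set of indices of the truly influential features.
    """
    pos = [s for j, s in enumerate(scores) if j in influential]
    neg = [s for j, s in enumerate(scores) if j not in influential]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Five features, of which the first three are influential.
scores = [0.9, 0.8, 0.3, 0.7, 0.1]
area = ranking_auc(scores, influential={0, 1, 2})

# A ranking that places every influential feature first has area 1.
perfect = ranking_auc([5, 4, 3, 2, 1], influential={0, 1, 2})
```

An area of 0.5 corresponds to a ranking that is no better than a random ordering of the features.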

5 Experimental Evidence

All experiments use random training/testing splits, and the results are reported as mean over ten runs. Two-class problems are split 20/80 (20% training) and use 100 base classifiers. Multi-class problems are also split 20/80 but use 200 base classifiers, one for each two-class decomposition, described in Section 2.2.

The purpose of the initial experiment is to determine generalization performance as the number of hidden nodes and number of training epochs of multi-layer perceptron (MLP) base classifiers are systematically varied. Each node-epoch combination is repeated ten times with the same number of nodes and epochs used for each MLP. All other parameters of the base classifier MLPs are fixed at the same values over all runs. The number of hidden nodes is varied over 2-16 and number of training epochs over 1-69 (log scale). Random perturbation of the MLP base classifiers is caused by different starting weights on each run, combined with one hundred percent bootstrapped training patterns. The experiment is performed with one hundred single hidden-layer MLP base classifiers, using the Levenberg-Marquardt training algorithm with default parameters ($\mu_{init}$ = 0.001, $\mu_{dec}$ = 0.1, $\mu_{inc}$ = 10).

Figure 1: Mean test error rates, OOB estimates, measures σ, Q for Diabetes 20/80 with [2,4,8,16] nodes. Panels: (a) Base Test, (b) Ensemble Test, (c) Base OOB, (d) Ensemble OOB (error rates % versus number of epochs); (e) σ, (f) Q (coefficient versus number of epochs).

Figure 1 shows Diabetes 20/80, a dataset that is known to over-fit with Boosting and other methods. Figure 1 (a) (b) shows base classifier and ensemble test error rates, (c) (d) the base classifier and ensemble OOB estimates and (e) (f) the measures σ, Q defined in equations (5) and (6) for various node-epoch combinations. It may be seen that σ and base classifier OOB are good predictors of base classifier test error rates as base classifier complexity is varied. The correlation between σ and test error was thoroughly investigated in [10], showing high values of correlation that were significant (95% confidence when compared with random chance). In [10] it was also shown that bootstrapping did not significantly change the ensemble error rates, actually improving them slightly on average. The class separability measure σ shows that the base classifier test error rates are optimized when the number of epochs is chosen to maximize class separability. Furthermore, at the optimal number of epochs Q shows that diversity is minimized. It appears that base classifiers starting from random weights increase correlation (reduce diversity) as complexity is increased and peaks as the classifier starts to over-fit the data. A possible explanation of the over-fitting behavior is that classifiers produce different fits of those patterns in the region where classes are overlapped [10].

Note from Figure 1 that the ensemble is more resistant to over-fitting than base classifier for epochs greater than 7, and the ensemble OOB accurately predicts this trend. This experiment was performed for all the datasets, and in general the ensemble test error was found to be more resistant to over-fitting for both two-class and multi-class datasets. Figure 2 shows similar curves to Figure 1 averaged over all multi-class datasets. Based on these results 8 hidden nodes was chosen, with 7 epochs for two-class, 20 epochs for multi-class and 10 epochs for artificial data.

Figure 2: Mean test error rates, OOB estimates, measures σ, Q over ten multi-class 20/80 datasets with [2,4,8,16] nodes. Panels: (a) Base Test, (b) Ensemble Test, (c) Base OOB, (d) Ensemble OOB (error rates % versus number of epochs); (e) σ, (f) Q (coefficient versus number of epochs).
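The diversity measure Q plotted in panel (f) of Figures 1 and 2 is the pairwise Q statistic of equation (6), averaged over all pairs of base classifiers. A minimal sketch follows; the function names and the convention of returning 0 when the denominator vanishes are assumptions of the example.

```python
from itertools import combinations

def q_statistic(yi, yj):
    """Q statistic between two classifiers.

    yi[m] and yj[m] are 1 if the respective classifier is correct on
    pattern m and 0 otherwise.
    """
    n11 = sum(a & b for a, b in zip(yi, yj))              # both correct
    n00 = sum((1 - a) & (1 - b) for a, b in zip(yi, yj))  # both wrong
    n10 = sum(a & (1 - b) for a, b in zip(yi, yj))        # only i correct
    n01 = sum((1 - a) & b for a, b in zip(yi, yj))        # only j correct
    den = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / den if den else 0.0

def mean_q(correct):
    """Mean Q over all classifier pairs; correct[m][j] as in equation (2)."""
    columns = list(zip(*correct))  # one tuple of decisions per classifier
    pairs = list(combinations(range(len(columns)), 2))
    return sum(q_statistic(columns[i], columns[j]) for i, j in pairs) / len(pairs)

# Five patterns, three classifiers; classifiers 0 and 2 behave identically.
correct = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 1]]
q01 = q_statistic([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
q_bar = mean_q(correct)
```

Q close to +1 indicates classifiers that tend to be right and wrong on the same patterns, that is low diversity, which is the sense in which diversity is said to be minimized at the optimal number of epochs.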

Figure 3 shows RFE with noisy bootstrap feature ranking for two-class artificial data with one hundred features. The recursive step size is chosen using a logarithmic scale to start at 100 and finish at 2 features with minimum step size of 1. Both base classifier OOB and σ are seen to correlate well with base classifier test error. Similarly, ensemble OOB achieves minimum error at the same number of features as ensemble test error, with the exception of 1% sample size (20 patterns). We have not investigated whether increasing the number of classifiers improves the estimate for small sample size. Note from Figure 3 (a) (b) (c) (d) that the OOB estimate is generally a poor indicator of absolute generalization error.

Figure 4 (a) shows a typical ROC curve defined in Section 4 at 100 features. For RFE, it is difficult to compare feature subsets when there is different number of features. Therefore, we keep a list of sorted features, putting the feature subset that has been eliminated on each recursion at the end of the list. At each recursive step, we then have 100 sorted features. The result is shown in Figure 4 (b), indicating that RFE consistently improves the feature ranking. The experiment was repeated without applying RFE, that is the feature ordering obtained at 100 features is used for each feature reduction. The difference in test error rates between the two is shown in Figure 5, demonstrating that RFE makes a large difference to error rates below 20 features.

To determine the effect of RFE on a range of two-class and multi-class problems, RFE was applied to the datasets shown in Table 1. For each dataset, the number of features is increased to 100 by adding noisy features, as explained in Section 4. The RFE curves (not shown) appeared similar to Figure 3, achieving a minimum at the number of features predicted by OOB. For two-class problems, there was no significant difference between noisy bootstrap and mod-. The mean over all features and all datasets is shown in Table 2.
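The RFE loop with a logarithmic step schedule and a running list of sorted features, as used for Figures 3 and 4, might look like the sketch below. The number of steps in the schedule and the `rank_fn` callback are assumptions; in a real experiment `rank_fn` would retrain the ensemble on the active features and return a score per feature, whereas the toy callback here returns fixed scores.

```python
def log_schedule(start, stop, steps):
    """Decreasing feature counts on a logarithmic scale, minimum step size 1."""
    counts = [start]
    for k in range(1, steps + 1):
        target = round(start * (stop / start) ** (k / steps))
        # Remove at least one feature per step, never go below stop.
        target = max(stop, min(target, counts[-1] - 1))
        if target < counts[-1]:
            counts.append(target)
    return counts

def rfe(n_features, rank_fn, schedule):
    """Recursive Feature Elimination returning a full sorted feature list.

    rank_fn(active) maps a list of active feature indices to a dict
    {feature index: score}, higher scores meaning more influential.
    Eliminated subsets are placed at the end of the list, most recently
    eliminated first, so the list always contains all n_features.
    """
    active = list(range(n_features))
    eliminated = []
    for n_keep in schedule[1:]:
        scores = rank_fn(active)  # re-rank using only the active features
        order = sorted(active, key=lambda j: scores[j], reverse=True)
        active, dropped = order[:n_keep], order[n_keep:]
        eliminated = dropped + eliminated
    return active + eliminated

schedule = log_schedule(100, 2, 10)

# Toy ranking function: lower feature index means higher score.
full_ranking = rfe(6, lambda active: {j: 10 - j for j in active}, [6, 4, 2])
```

Because the returned list always has the full length, rankings obtained at different recursion depths can be compared with a single measure such as the area under the ROC curve.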
For comparison, using original features (Table 1) the mean error rate over all 20/80 problems was 14.1% for two-class and 17.8% for multi-class.

A potential problem with bootstrapping is that each base classifier sees only approximately 63% of the training patterns. To determine the effect of the reduced sample size, the RFE experiment for artificial data was repeated without bootstrapping. The minimum error achieved was 0.5% compared with 1.2% with bootstrapping. However, the number of features at which the OOB and the test error started to rise did not change. For seven two-class problems with 100 features, the mean best error rate was 13.4% compared with 13.7% with bootstrapping.

6 Conclusion

It is shown in this paper that classifier complexity and number of features may be selected using an out-of-bootstrap (OOB) error estimate. The base classifier OOB estimate achieves a minimum when the estimate of class separability reaches a maximum. The method is extended to multi-class problems using ECOC, and is seen to be less sensitive to over-fitting when the number of features is reduced below the optimal number. Both noisy bootstrap with Linear Discriminant and the modulus of neural network weights provide a good feature ranking criterion. However, for large number of features it is better to combine with RFE to recursively remove irrelevant features.

Table 2: Mean best error rates (%) for artificial data (2.5/97.5), seven two-class problems (20/80), ten multi-class problems (20/80)
Columns: artificial 100 feats, artificial 500 feats, two-class 100 feats, multi-class 100 feats
Rows: mod- RFE, noisyboot RFE

Figure 3: Mean test error rates, OOB estimates, measures σ, Q for RFE with noisy bootstrap feature ranking, artificial data and [1,1.5,2,2.5]% training patterns. Panels: (a) Base Test, (b) Ensemble Test, (c) Base OOB, (d) Ensemble OOB (error rates % versus number of features); (e) σ, (f) Q (coefficient versus number of features).

Figure 4: (a) typical ROC curve at 100 features (b) area under ROC curve for 100 sorted features using RFE. Axes: (a) True Positive against False Positive; (b) Coefficient against Number of Features.

Figure 5: Test error rates with RFE minus test error rates without RFE, artificial data [1,1.5,2,2.5] % training patterns. Panels: (a) Base Test, (b) Ensemble Test. Axes: Error Rates % against number of features.
