Decision Support Systems

Decision Support Systems 50 (2010) 93–102. Contents lists available at ScienceDirect. Journal homepage: www.elsevier.

The data complexity index to construct an efficient cross-validation method

Der-Chiang Li a,*, Yao-Hwei Fang b, Y.M. Frank Fang c

a Department of Industrial and Information Management, National Cheng Kung University, Taiwan
b Division of Biostatistics and Bioinformatics, National Health Research Institutes, Taiwan
c Geographic Information System Research Center, Feng Chia University, Taiwan

* Corresponding author. E-mail addresses: lidc@mail.ncku.edu.tw (D.-C. Li), yhfang@nhri.org.tw (Y.-H. Fang), frankfang@gis.tw (Y.M.F. Fang).

Article history: Received 20 January 2009. Received in revised form 31 March 2010. Accepted 1 July 2010. Available online 3 July 2010.

Keywords: Binary classification problem; Cross-validation; Data complexity

Abstract

Cross-validation is a widely used model evaluation method in data mining applications. However, it usually takes a lot of effort to determine the appropriate parameter values, such as training data size and the number of experiment runs, to implement a validated evaluation. This study develops an efficient cross-validation method called Complexity-based Efficient (CBE) cross-validation for binary classification problems. CBE cross-validation establishes a complexity index, called the CBE index, by exploring the geometric structure and noise of data. The CBE index is used to calculate the optimal training data size and the number of experiment runs to reduce model evaluation time when dealing with computationally expensive classification data sets. One simulated and three real data sets are employed to validate the performance of the proposed method, while the validation methods compared are repeated random sub-sampling validation and K-fold cross-validation. The results show that CBE cross-validation, repeated random sub-sampling validation and K-fold cross-validation have similar validation performance, except that the training time required for CBE cross-validation is indeed lower than that for the other two methods. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

In data mining applications, researchers generally use cross-validation to evaluate the learned classification model [11]. However, this usually requires considerable computational cost. With K-fold cross-validation, for example, the number of experiment runs must increase when parameter K increases, making the training computationally expensive [1]. Specifically, a fraction (K − 1)/K of the data is theoretically needed for learning a classification model in each run, and when the data size is very large this makes computation expensive [1]. In another common scenario, repeated random sub-sampling validation is usually repeated 30 or 50 times for model evaluation [3]. However, if the data structure is simple or uniform, this is many more repetitions than needed, and the procedure is inefficient.

Our research develops an effective cross-validation procedure, called Complexity-based Efficient (CBE) cross-validation, for binary classification problems. The CBE cross-validation method can be used to calculate the optimal training data size and the number of experiment runs to reduce model validation time. The CBE cross-validation procedure systematically establishes a non-linear data complexity index (defined in Section 3), called the CBE index, by exploring the geometric structure and noise of data. The density-based clustering algorithm (DBSCAN) is used to discover the geometric structure and noise, while the between-distance and within-distance of the clusters found are used as the factors of the CBE index. Based on this, this research develops an efficient CBE cross-validation procedure to calculate the optimal training data size and number of experiment runs.

The rest of this paper is organized as follows. The literature review is given in Section 2, while the detailed procedure of the proposed method is described in Section 3.
One simulated and three real data sets are used to illustrate the CBE cross-validation model in Section 4, and Section 5 contains the conclusion and discussion of our research.

2. Literature review

In this section we review the concept of linear data complexity (the definition is explained in Section 3), the geometric structure and noise of data, and existing cross-validation methods.

2.1. Linear data complexity

For linear data complexity, the index used to detect the level of data complexity is Fisher's discriminant ratio f [1,10]:

\[ f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \tag{1} \]

where $\mu_1$, $\mu_2$, $\sigma_1^2$, and $\sigma_2^2$ are the means and variances of the two classes in a data set, respectively. f is specific to the case of one feature dimension.

For a multidimensional problem, the maximum f over all the feature dimensions is used to describe the problem. For problems with multidimensional features, Li and Fang proposed a Purity Level (PL) to measure linear data complexity [15]. The parameters of the index are defined as follows:

n: the number of data points.
k: the number of dimensions of the data (k ≥ 2).
$A_{ij}^{+}$, $A_{ij}^{-}$: the value of the j-th dimension of the i-th data point in the positive and negative classes, respectively.
$\bar{A}_{j}^{+}$, $\bar{A}_{j}^{-}$: the average value of the j-th dimension of the data in the positive and negative classes, respectively.
$A_j^{\max}$, $A_j^{\min}$: the maximum and the minimum values of the j-th dimension, respectively.

Using the parameters listed above, the Purity Level is set as:

\[
\text{Purity Level} =
\frac{\displaystyle\sum_{i=1}^{n}\left(\sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(\frac{A_{ij}^{+}-\bar{A}_{j}^{-}}{A_j^{\max}-A_j^{\min}}\right)^{2}}+\sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(\frac{A_{ij}^{-}-\bar{A}_{j}^{+}}{A_j^{\max}-A_j^{\min}}\right)^{2}}\right)}
{\displaystyle\sum_{i=1}^{n}\left(\sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(\frac{A_{ij}^{+}-\bar{A}_{j}^{+}}{A_j^{\max}-A_j^{\min}}\right)^{2}}+\sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(\frac{A_{ij}^{-}-\bar{A}_{j}^{-}}{A_j^{\max}-A_j^{\min}}\right)^{2}}\right)}
\tag{2}
\]

where the numerator is the sum of the between-class distance of the whole data set, and the denominator is the sum of the within-class distance of the whole data set. The results show that the smaller the PL value, the higher the linear data complexity, and vice versa. However, neither Fisher's discriminant ratio nor PL considers the geometric structure and noise of data.
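As a small illustration of Eq. (1) and of taking the maximum over feature dimensions, the following sketch may help. It is not taken from the paper: the function names are hypothetical, and the use of population variances (rather than sample variances) is an assumption.

```python
from statistics import fmean, pvariance


def fisher_ratio(pos, neg):
    """Fisher's discriminant ratio (Eq. 1) for a single feature.

    Assumption: population variances are used; both classes must not
    have zero variance at the same time.
    """
    mu1, mu2 = fmean(pos), fmean(neg)
    return (mu1 - mu2) ** 2 / (pvariance(pos) + pvariance(neg))


def max_fisher_ratio(pos_rows, neg_rows):
    """Maximum single-feature ratio over all feature dimensions."""
    k = len(pos_rows[0])
    return max(
        fisher_ratio([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(k)
    )


# Toy data: feature 0 separates the classes, feature 1 does not.
positive = [[1.0, 0.0], [3.0, 1.0]]
negative = [[5.0, 0.0], [7.0, 1.0]]
best = max_fisher_ratio(positive, negative)
```

In this toy case the ratio is driven entirely by feature 0, whose class means differ, while feature 1 has identical class means and contributes a ratio of zero.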
2.2. The concept of geometric structure and noise of data

Rubinov [1] discussed the relationship between classes and clusters in data sets, and examined the distribution of classes within the obtained clusters. He found that some characteristics link data points more strongly than the classes they belong to. We thus believe that the geometric structure of data is an essential characteristic for classifying data sets.

In a study on the effect of noise in data processing, Lee et al. [14] combined the fuzzy adaptive resonance theory and the general regression neural network into a hybrid model, which assisted the removal of noise embedded in training data in order to improve the classification ability. Han et al. [9] proposed a revised Expectation-Maximization (EM) algorithm to discover and remove noise to improve the one-against-the-rest method in binary text classification. Cao et al. [] proposed a data preprocessing method for training data to remove noise or outliers, and used the remaining data to obtain the decision function. However, the drawback of this method is that it is difficult to remove noise and outliers without the assistance of problem domain knowledge.

2.3. Common types of cross-validation method

Cross-validation is a model evaluation method that is better than residual analysis. The weakness of residual evaluation is that it gives no indication of how well the learner will do when it is used to make predictions for unseen data. One way to overcome this problem is to leave out part of the data points from the data set when training a classifier, so that when training is finished the removed data are used to test the performance of the model. This is the basic idea behind the model evaluation method called cross-validation [4]. Two widely used such methods, repeated random sub-sampling validation and K-fold cross-validation, are described below.

2.3.1. Repeated random sub-sampling validation

This method randomly splits a data set into training and validation data sets and then repeats this procedure several times.
For each split, the classifier is trained with the training data and validated with the validation data. The results from each split can be averaged. This method is usually applied in small sample learning cases that use a small amount of training data to learn the model and a large amount of validation data to validate it [16,17].

2.3.2. K-fold cross-validation

In K-fold cross-validation, the original sample is partitioned into K partitions. One partition is then used as the validation data for testing the model, and the remaining K − 1 partitions are used as the training data. The cross-validation process is then repeated K times, with each of the K partitions used as the validation data exactly once. The K results from the folds can be averaged to produce a single estimation [4]. The advantage of this method over the repeated random sub-sampling validation method is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used by researchers.

3. Proposed method

With binary classification problems, data complexity is defined as the level of complexity for separating data into classes. When the data complexity is high, the data are hard to classify. Complexities can be subdivided into linear and non-linear cases: linear data complexity is the level of difficulty in separating the data using a linear hyperplane, while non-linear data complexity is the level of difficulty in separating the data using a non-linear hyperplane. Taking the XOR problem as an example, we usually use a non-linear hyperplane to separate the data rather than a linear one. This research focuses on finding an effective way to classify data by calculating the non-linear data complexity for high dimensional classification problems. We develop the CBE index by improving the Purity Level (PL) method [15], and consider the geometric structure and noise of data to precisely measure the level of non-linear separability.
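The XOR example can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings (four Gaussian blobs at (±1, ±1), a fixed random seed, hypothetical helper names); it shows that the two class centroids nearly coincide and that each centroid lies far from every sample of its own class, which is why an index built on class means carries little information here.

```python
import math
import random
from statistics import fmean

random.seed(0)


def blob(cx, cy, n=100, sd=0.1):
    """n two-dimensional points scattered around the centre (cx, cy)."""
    return [(random.gauss(cx, sd), random.gauss(cy, sd)) for _ in range(n)]


# XOR layout: each class consists of two diagonally opposite clusters.
positive = blob(1, 1) + blob(-1, -1)
negative = blob(1, -1) + blob(-1, 1)


def centroid(points):
    return (fmean(p[0] for p in points), fmean(p[1] for p in points))


def nearest_distance(center, points):
    """Distance from center to the closest point of the given set."""
    return min(math.dist(center, p) for p in points)


# Distance between the two class centroids (close to zero).
centroid_gap = math.dist(centroid(positive), centroid(negative))
# Distance from the positive centroid to its nearest positive sample.
gap_to_own_data = nearest_distance(centroid(positive), positive)
```

Because the class centroids fall in the empty region between the clusters, a per-cluster description of the data is needed, which is the motivation for the cluster-based index developed in this section.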
We then use the CBE index to form a sample size determination method and so develop an efficient CBE cross-validation method that improves computational efficiency. The proposed Complexity-based Efficient (CBE) index is described in detail in subsection 3.1, and the proposed CBE cross-validation is described in subsection 3.2.

3.1. CBE index

Research on pattern recognition suffers from the uncertainty concerning the match between knowledge and a problem, due to the strong dependence of classifying performance on available data. In other words, the accuracy of a classifier is highly dependent on the data characteristics [10]. Unfortunately, this uncertainty often remains because of a lack of understanding of the full data characteristics [1], and this situation also occurs in model validation. Therefore, in this work we consider more descriptors, such as the geometric structure and noise of data, to further understand the data characteristics with the goal of improving validation efficiency.

The CBE index relies heavily on the realization of the data's geometric structure because, in our experience, when the center of the data belonging to a class is not located in the data cluster (such as with the XOR problem in Fig. 1), it is not reasonable to use a linear index, such as an F-test statistic or purity level, to measure the data complexity. We thus develop the non-linear CBE index to find multiple centers according to the geometric structures of data. In doing so, we calculate the centers of the data clusters so that the centers are located in the data. Note that the linear index concept is a special case of the non-linear one when there is only one cluster in each class.

Fig. 1. The structure of an XOR problem.

To discover the geometric structure and noise of data, researchers usually rely on prior knowledge, although this is experience oriented and inconclusive []. This research thus proposes a non-linear data complexity index, the CBE index, to systematically and precisely reflect the geometric structure and noise of data. This study uses the density-based clustering (DBSCAN) algorithm to discover the geometric structure and noise of data, and from these to find the complexity level of separating data into classes, as explained below.

3.1.1. DBSCAN algorithm

DBSCAN is a clustering algorithm suitable for large data sets with high dimensionality [7]. DBSCAN gathers high density data together as clusters, and the shape of each cluster can be arbitrary. The algorithm finds the clusters and then deletes data that do not belong to any of them. It searches for clusters by checking the surroundings of each data point within a scope called the ε-neighborhood. If the ε-neighborhood of a data point contains more than a certain pre-defined number (MinPts) of other data points, a cluster with this data point (called the core object) is created; otherwise, the data point is treated as noise, which will eventually be deleted. DBSCAN iteratively collects directly density-reachable data (data within the ε-neighborhood of a core object) until no new data can be added to any cluster, and this may involve merging some clusters. We apply the DBSCAN algorithm to each class to detect the geometric structure and noise of data in binary classification. Table 1 shows the DBSCAN algorithm pseudo code.

Table 1. The pseudo code of the DBSCAN algorithm.

A default radius ε can be obtained by considering the fraction of objects to be selected (k/m) and the volume V [6].
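The clustering procedure described in this subsection can be sketched in a few lines. This is a minimal, quadratic-time illustration and not the pseudo code of Table 1; the function names are hypothetical, and counting a point as part of its own ε-neighborhood when comparing against `min_pts` is an assumption.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # points[i] is a core object: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core object
    return labels


# Toy data for one class: two dense groups and one isolated point.
samples = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 0)]
labels = dbscan(samples, eps=0.5, min_pts=3)
```

In the binary setting described above, such a routine would be called once on the positive samples and once on the negative samples, each with its own radius and MinPts value, and the points labelled as noise would be removed before any distances are computed.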
We extend this concept to binary classification and suppose that n is the dimension of the data, k is the number of MinPts, Γ is the gamma function, $m^{+}$ and $m^{-}$ are the amounts of data in the positive and negative classes, respectively, and $V^{+} = \prod_{j} \mathrm{range}(x_j^{+})$ and $V^{-} = \prod_{j} \mathrm{range}(x_j^{-})$ for $j = 1, \ldots, n$ are the data ranges in the positive and negative classes, respectively. The following are the formulas for $\varepsilon^{+}$ and $\varepsilon^{-}$ for the positive and negative classes, respectively:

\[ \varepsilon^{+} = \sqrt[n]{\frac{(k/m^{+})\,V^{+}\,\Gamma(n/2+1)}{\sqrt{\pi^{n}}}} \tag{3} \]

\[ \varepsilon^{-} = \sqrt[n]{\frac{(k/m^{-})\,V^{-}\,\Gamma(n/2+1)}{\sqrt{\pi^{n}}}} \tag{4} \]

Daszykowski et al. proposed a default MinPts calculation formula [5]. We extend this formula to binary classification and define:

\[ \mathrm{MinPts}^{+} = \mathrm{integer}\!\left(\frac{m^{+}}{25}\right), \text{ for the positive class} \tag{5} \]

\[ \mathrm{MinPts}^{-} = \mathrm{integer}\!\left(\frac{m^{-}}{25}\right), \text{ for the negative class} \tag{6} \]

For a data set with numerous data points of positive and negative classes ($m^{+}$ or $m^{-}$), we suggest that $\mathrm{MinPts}^{+}$ or $\mathrm{MinPts}^{-}$ be equal to […].

3.1.2. The calculation of the CBE index

This research uses the CBE index to depict the level of non-linear data complexity. The CBE index of binary classification can be regarded as the relative distance of clusters discovered by the DBSCAN algorithm for each class, and it is found as follows.

Let $X = \{X_1, \ldots, X_N\}$ be a data set that includes positive samples $X^{+} = \{X_1^{+}, \ldots, X_{n^{+}}^{+}\}$ and negative samples $X^{-} = \{X_1^{-}, \ldots, X_{n^{-}}^{-}\}$, where $n^{+} + n^{-} = N$. Let $C^{+} = \{C_1^{+}, \ldots, C_{|C^{+}|}^{+}\}$ be a set that consists of $|C^{+}|$ positive clusters, $C^{-} = \{C_1^{-}, \ldots, C_{|C^{-}|}^{-}\}$ be a set consisting of $|C^{-}|$ negative clusters, $d(X_i, X_j)$ be the distance between $X_i$ and $X_j$, and $C_i^{+} = \{X_1^{+,i}, \ldots, X_{m_i^{+}}^{+,i}\}$ be the i-th positive cluster, where $m_i^{+}$ is the number of positive samples in the i-th cluster, and $i = 1, \ldots, |C^{+}|$. Similarly, let $C_i^{-} = \{X_1^{-,i}, \ldots, X_{m_i^{-}}^{-,i}\}$ be the i-th negative cluster, where $m_i^{-}$ is the number of negative samples in the i-th cluster, and $i = 1, \ldots, |C^{-}|$.

We first calculate the minimum average distance between a pair of clusters which belong to different classes as Min_Bet:

\[ \mathrm{Min\_Bet} = \min_{\substack{k = 1, \ldots, |C^{+}| \\ l = 1, \ldots, |C^{-}|}} \left\{ \frac{\sum_{i=1}^{m_k^{+}} \sum_{j=1}^{m_l^{-}} d\!\left(X_i^{+,k}, X_j^{-,l}\right)}{m_k^{+}\, m_l^{-}} \right\} \tag{7} \]

A large value of Min_Bet indicates that the data are widely scattered and easy to classify.
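Eq. (7) can be sketched as follows, assuming the clusters of each class have already been found (for example by DBSCAN) and noise has been removed. The function names and the use of Euclidean distance for d are assumptions made for illustration.

```python
import math
from itertools import product


def average_between(cluster_a, cluster_b):
    """Average distance over all pairs with one sample from each cluster."""
    total = sum(math.dist(a, b) for a, b in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))


def min_bet(pos_clusters, neg_clusters):
    """Eq. (7): smallest average distance between a positive and a negative cluster."""
    return min(
        average_between(p, q) for p in pos_clusters for q in neg_clusters
    )


# One-dimensional toy clusters, written as tuples so math.dist applies.
pos_clusters = [[(0.0,), (2.0,)]]
neg_clusters = [[(5.0,)], [(9.0,), (11.0,)]]
value = min_bet(pos_clusters, neg_clusters)
```

Here the positive cluster is on average 4 units from the first negative cluster and 9 units from the second, so the closer pair determines Min_Bet.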

We then calculate the average distance within all clusters of the positive class as:

\[ \mathrm{Within}^{+} = \sum_{k=1}^{|C^{+}|} \frac{\sum_{i=1}^{m_k^{+}} \sum_{j=1}^{m_k^{+}} d\!\left(X_i^{+,k}, X_j^{+,k}\right)}{m_k^{+}\left(m_k^{+} - 1\right)} \tag{8} \]

and for all clusters of the negative class as:

\[ \mathrm{Within}^{-} = \sum_{k=1}^{|C^{-}|} \frac{\sum_{i=1}^{m_k^{-}} \sum_{j=1}^{m_k^{-}} d\!\left(X_i^{-,k}, X_j^{-,k}\right)}{m_k^{-}\left(m_k^{-} - 1\right)} \tag{9} \]

If the value of the average distance within all clusters of a class ($\mathrm{Within}^{+}$ and $\mathrm{Within}^{-}$) is small, it means that the samples in these clusters congregate closely. The calculation of the CBE index is defined as follows:

\[ \text{CBE index} = \frac{\mathrm{Min\_Bet}}{\left(\mathrm{Within}^{+} + \mathrm{Within}^{-}\right) / \left(|C^{+}| + |C^{-}|\right)} \tag{10} \]

The determination of the CBE index takes three steps:

Step 1: Normalize the data. Because the dimensions may have different units, the data is normalized before calculating the CBE index.
Step 2: Discover the geometric structure and noise of data. Use the DBSCAN algorithm in the binary classes with the suggested parameter settings, $\varepsilon^{+}$, $\varepsilon^{-}$, $\mathrm{MinPts}^{+}$, and $\mathrm{MinPts}^{-}$, to detect the geometric structure and remove the data noise.
Step 3: Calculate the CBE index. Calculate Min_Bet, $\mathrm{Within}^{+}$, and $\mathrm{Within}^{-}$ to obtain the CBE index.

The CBE index has the following properties:

(1) 0 ≤ CBE index < ∞.
(2) The smaller the CBE index is, the higher the data complexity is.
(3) The larger the CBE index is, the lower the data complexity is.

3.2. CBE cross-validation method

We apply the CBE index to develop the CBE cross-validation method, where we first randomly select a certain small proportion (for example, 5%) of samples as the training data and calculate the CBE index. This process is repeated 30 times to calculate the average $\bar{X}_{CBE}$ and the standard deviation $S_{CBE}$.
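For reference, Eqs. (8) to (10) of Section 3.1.2 can be sketched as below. The sketch assumes that the clusters are already available, that every cluster holds at least two samples (otherwise the denominator of Eqs. (8) and (9) is zero), and that d is the Euclidean distance; the function names and the toy value passed as Min_Bet are hypothetical.

```python
import math
from itertools import permutations


def average_within(cluster):
    """Average distance over all ordered pairs of distinct samples in a cluster."""
    m = len(cluster)
    total = sum(math.dist(a, b) for a, b in permutations(cluster, 2))
    return total / (m * (m - 1))


def within(clusters):
    """Eqs. (8) and (9): sum of the per-cluster average distances of one class."""
    return sum(average_within(c) for c in clusters)


def cbe_index(min_bet_value, pos_clusters, neg_clusters):
    """Eq. (10): Min_Bet divided by the mean within-cluster distance."""
    n_clusters = len(pos_clusters) + len(neg_clusters)
    mean_within = (within(pos_clusters) + within(neg_clusters)) / n_clusters
    return min_bet_value / mean_within


# One-dimensional toy clusters; 4.0 stands in for a previously computed Min_Bet.
pos_clusters = [[(0.0,), (2.0,)]]
neg_clusters = [[(5.0,), (6.0,)], [(9.0,), (11.0,)]]
index = cbe_index(4.0, pos_clusters, neg_clusters)
```

The pairs with i = j in Eqs. (8) and (9) contribute zero distance, so summing over ordered pairs of distinct samples gives the same numerator.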
In order to achieve a stable CBE index for the optimal training data size N, this process is iterated while increasing the proportion of the training data and checking the difference of $\bar{X}_{CBE}$ as:

\[ \text{WHEN } \bar{X}_{CBE}^{\,n\%} - \bar{X}_{CBE}^{\,(n+1)\%} < 0.01,\ \text{THEN } N = \max\{\text{no. of } n\% \text{ samples},\ \text{no. of } 10\% \text{ samples}\} \tag{11} \]

When the difference falls below 0.01, we consider the structure of the training data to be stable, and use this training data size as the optimal one. Here 0.01 is only an empirical suggestion, and 10% is likewise an empirical lower limit on the sample size.

For the number of experiment runs, we repeat the process 30 times to calculate the average and standard deviation of CBE. Note that the sample distribution of the CBE index will converge to a normal distribution according to the Central Limit Theorem (CLT) [3], and the average $\bar{X}_{CBE}^{\,n\%}$ and standard deviation $S_{CBE}^{\,n\%}$ at the optimal training data size are used to calculate the number of experiment runs. The number of experiment runs K is determined as:

\[ \text{CALCULATE } \left(\frac{Z_{\alpha/2}\, S_{CBE}^{\,n\%}}{0.05\, \bar{X}_{CBE}^{\,n\%}}\right)^{2} = k;\ \text{THEN } \max\{k, 5\} = K \tag{12} \]

where α is the significance level, $Z_{\alpha/2}$ is the value with α/2 in the tail of the cumulative standard Normal distribution, and $0.05\,\bar{X}_{CBE}^{\,n\%}$ is set as the desired margin of error. The lower limit of 5 runs is again our suggestion.

4. Experiment

In this section, we use one simulated and three real data sets to verify the performance of the Complexity-based Efficient (CBE) cross-validation method. In the simulation experiments, a support vector machine (SVM) [1], a Back-propagation Network (BPN) [,0], and a Naive Bayes Classifier (NBC) [4] are used as the classification tools, while for the three real data sets, only SVM is used. To find the relationship between the CBE index and classification accuracy, we randomly select 10% of the total samples and calculate the CBE index with the $\varepsilon^{+}$, $\varepsilon^{-}$, $\mathrm{MinPts}^{+}$, and $\mathrm{MinPts}^{-}$ suggested in Section 3 to measure the relationship for all data sets.
This process is repeated 10 times, where SVM, BPN, and NBC are used as the classifiers with the resubstitution method (all available data are used for training and testing) [13]. To implement the CBE cross-validation, we randomly select a small proportion of the data as the training set (such as 5%), and calculate the CBE indexes. This procedure is repeated 30 times. The training data size is gradually increased, and we calculate the average and the standard deviation of the CBE index in order to find the optimal training data size and the number of experiment runs.

4.1. Simulated data experiments

This research uses the Parametric Equation of a Hypersphere [16], briefly introduced below, to generate simulated data. The n-hypersphere (often simply called the n-sphere) is a generalization of an object with n dimensions in $R^{n}$ (the circle and sphere are called the two-sphere and three-sphere, respectively). The n-sphere centered at the origin can therefore be defined as a set of points $(x_1, x_2, \ldots, x_n)$ such that:

\[ x_1^2 + x_2^2 + \cdots + x_n^2 = r^2 \tag{13} \]

Table 2. The CBE index and classification accuracies of the three classifiers for the simulated data sets with 1% noise, where "Average no. of noisy samples found" is the average number of noisy samples found and "Average no. of clusters found" is the average number of clusters found by the DBSCAN algorithm. Rows: CBE index; Average no. of noisy samples found; Average no. of clusters found (pos., neg.); Accuracy of SVM; Accuracy of BPN; Accuracy of NBC.

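The hypersphere can be specified by parametric equations as:

\[
\begin{cases}
x_1 = r \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{n-1} \\
x_2 = r \sin\theta_1 \sin\theta_2 \cdots \cos\theta_{n-1} \\
x_3 = r \sin\theta_1 \sin\theta_2 \cdots \cos\theta_{n-2} \\
x_4 = r \sin\theta_1 \sin\theta_2 \cdots \cos\theta_{n-3} \\
\quad\vdots \\
x_{n-1} = r \sin\theta_1 \cos\theta_2 \\
x_n = r \cos\theta_1
\end{cases}
\tag{14}
\]

where r is the radius and $\theta_1, \theta_2, \ldots, \theta_{n-1} \in [0, \pi]$ are the angles of the hypersphere. The formula of parametric equations is not unique, but must satisfy the identity $x_1^2 + x_2^2 + \cdots + x_n^2 = 1$.

We consider the two-cluster condition in each class and insert noise into the data. We generate 808 five-dimensional data points (404 positive and 404 negative samples) following the Parametric Equation of a Hypersphere [16]. In the positive class, the data is generated into two clusters. One is:

\[ \begin{cases} x_1 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \sin\theta_4 \\ x_2 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_4 \\ x_3 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_3 \\ x_4 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_2 \\ x_5 = 0.7 + \cos\theta_1 \end{cases},\quad 0 \le \theta \le \pi \tag{15} \]

and the other is:

\[ \begin{cases} x_1 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \sin\theta_4 \\ x_2 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_4 \\ x_3 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_3 \\ x_4 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_2 \\ x_5 = 0.7 + \cos\theta_1 \end{cases},\quad 0 \le \theta \le \pi \tag{16} \]

In the negative class, the data is generated into two clusters too. One is:

\[ \begin{cases} x_1 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \sin\theta_4 \\ x_2 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_4 \\ x_3 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_3 \\ x_4 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_2 \\ x_5 = 0.7 + \cos\theta_1 \end{cases},\quad 0 \le \theta \le \pi \tag{17} \]

and the other is:

\[ \begin{cases} x_1 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \sin\theta_4 \\ x_2 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_4 \\ x_3 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_3 \\ x_4 = 0.7 + \sin\theta_1 \sin\theta_2 \cdots \cos\theta_2 \\ x_5 = 0.7 + \cos\theta_1 \end{cases},\quad 0 \le \theta \le \pi \tag{18} \]

We then add 1% noise to each class by randomly selecting 4 samples and changing their class labels. Table 2 and Fig. 2 show the results of using the CBE index with the simulated data sets.

Table 3. The averages and standard deviations (SDs) of CBE indexes with increasing size of the training data sets for the simulated data set. (Bold value means the optimal data size.)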
Training data sizes: 40 (5%), 81 (10%), 121 (15%), 161 (20%), 202 (25%), 242 (30%), 283 (35%), 291 (36%), 299 (37%), 307 (38%). Rows: Average; SD.

Fig. 2. The relationship between classification accuracy and the CBE index with 1% noise.

Fig. 3. Relationship between training size and the CBE index with the simulated data set.

From the table and figure above we can see that when the value of CBE increases, the classification accuracies of SVM, BPN, and NBC also rise. There is thus a highly positive correlation between the CBE index and classification accuracy for the simulated data sets. To find the optimal training data size, we calculate various CBE indexes by increasing the training set size. Table 3 and Fig. 3 show the results of using the CBE index with various simulated data sets.

\[ \text{WHEN } \bar{X}_{CBE}^{\,37\%} - \bar{X}_{CBE}^{\,38\%} < 0.01,\ \text{THEN data size} = \max\{\text{no. of } 37\% \text{ samples},\ \text{no. of } 10\% \text{ samples}\} = 37\% \times 808 = 299 \tag{19} \]

We determine that the optimal training data size is 299, the size at which $\bar{X}_{CBE}$ decreases by less than 0.01, and consider the geometric structure of the optimal training data to be stable. To find the optimal number of experiment runs for the simulated data set, we use the statistics at the optimal sample size:

\[ \text{CALCULATE } \left(\frac{Z_{\alpha/2} \times 0.138}{0.05 \times 2.129}\right)^{2} = 6.46;\ \text{THEN } \max\{6.456, 5\} = 6 \tag{20} \]

In the simulated data set, with a significance level α = 0.05 and a margin of error of […], the optimal number of training data is 299, and the optimal number of experiment runs is six. We use repeated random sub-sampling validation (with 533 (66%) training data points, 275 (34%) testing data points, experiment repeated 30 times) to validate that our CBE cross-validation (with 299 (37%) training data points, 509 (63%) testing data points, experiment repeated six times) is efficient. The average and standard deviation of the SVM accuracy with the repeated random sub-sampling validation are 7.36 and 1.044, respectively; and those of the CBE cross-validation are […] and […]. The performances of the two cross-validation methods thus have insignificant differences (the P-value is 0.15, using the independent t-test). The average training time of the repeated random sub-sampling validation is […] = 6.7 s, and that of the CBE cross-validation is […] = 3.1 s.

We also use five-fold cross-validation to validate that our CBE cross-validation is efficient. The average and standard deviation of the SVM accuracy with five-fold cross-validation are […] and 1.141, respectively. The performances of the cross-validation methods have insignificant differences (the P-value is 0.13, using the independent t-test), and the average training time of the five-fold cross-validation is […] = 5.94 s.

In addition, when we use 10% of the total data (the lower bound of the training data size) as the training data, and five experiment runs (the lower bound of the experiment runs), the average and standard deviation of the SVM accuracy are […] and […].563, with a significant difference (lower) compared to CBE cross-validation (the P-value ≪ 0.01, using the independent t-test). The average training time is 0.32 × 5 = 1.6 s. Since validation effectiveness is the basic concern of researchers, the CBE cross-validation is thus considered to be better than the cross-validation using the lower bound of the training data size and experiment runs, and so it is an efficient and effective method.

4.2. Real data experiment

This research uses two medical data sets, Pima Indians Diabetes and Haberman's Survival, and one business data set, Australian Credit Approval, in the experiment. The Pima Indians Diabetes data set consists of 768 data points with eight numeric dimensions (attributes), and it is a two-class data set with target values denoted by 0 and 1. The class value 1 means tested positive for diabetes, and the class value 0 means tested negative. The Haberman's Survival data set consists of 306 data points with three numeric dimensions, and it is a two-class data set that records the survival status of breast cancer patients.

Table 4. Properties of the three data sets.

| Data set | No. of dimensions | No. of samples | No. of classes |
| Pima Indians diabetes | 8 | 768 | 2 |
| Haberman's survival | 3 | 306 | 2 |
| Australian credit approval | 14 | 690 | 2 |
The Australian Credit Approval data set consists of 690 data points with 14 dimensions, of which six are numerical and eight are categorical, and it is a two-class data set. Table 4 summarizes the sample characteristics of the three data sets, which were all downloaded from the UCI repository. The results of the experiment for the three data sets are shown in the following subsections.

4.2.1. The Pima data set

The relationship between the CBE indexes and classification accuracies is shown in Table 5 and Fig. 4. From the table and figure we can see that when the value of CBE decreases, the classification accuracy of the SVM also falls. There is thus a highly positive correlation between the CBE index and classification accuracy for the Pima data set. Table 6 and Fig. 5 show the experimental results of CBE cross-validation for the Pima data set.

Fig. 4. Relationship between CBE indexes and accuracies with the Pima data set (correlation coefficient = 0.773).

Table 6. The averages and standard deviations (SDs) of CBE indexes with increasing size of the training data sets for the Pima data set. (Bold value means the optimal data size.) Training data sizes: 38 (5%), 46 (6%), 54 (7%), 61 (8%), 70 (9%), 77 (10%), 84 (11%), 92 (12%), 100 (13%), 108 (14%). Rows: Average; SD.

\[ \text{WHEN } \bar{X}_{CBE}^{\,13\%} - \bar{X}_{CBE}^{\,14\%} < 0.01,\ \text{THEN data size} = \max\{\text{no. of } 13\% \text{ samples},\ \text{no. of } 10\% \text{ samples}\} = 13\% \times 768 = 100 \tag{21} \]

We determine the optimal training data size to be 100, and consider the geometric structure of the optimal training data to be stable. We use the statistics at the optimal sample size to calculate the optimal number of experiment runs with the Pima data set as:

\[ \text{CALCULATE } \left(\frac{Z_{\alpha/2} \times 0.026}{0.05 \times 1.128}\right)^{2} = 0.815;\ \text{THEN } \max\{0.815, 5\} = 5 \tag{22} \]

where α = 0.05 is the significance level, and (0.05 × 1.128) = 0.0564 is the desired margin of error. We thus determine the optimal number of experiment runs to be five.
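The two rules applied above, the stopping rule of Eq. (11) and the run-count rule of Eq. (12), can be sketched as follows. This is only an illustration: the function names and the table of mean CBE indexes are hypothetical, comparing the absolute difference with the tolerance is an assumption, and rounding k to the nearest integer is inferred from the worked examples in the text rather than stated by it.

```python
from statistics import NormalDist


def optimal_training_size(mean_cbe_by_pct, n_total, tol=0.01, floor_pct=10):
    """Eq. (11): first tabulated percentage whose mean CBE index differs from
    the next tabulated one by less than tol, bounded below by floor_pct."""
    pcts = sorted(mean_cbe_by_pct)
    for p, p_next in zip(pcts, pcts[1:]):
        # Assumption: the absolute difference is compared with tol.
        if abs(mean_cbe_by_pct[p] - mean_cbe_by_pct[p_next]) < tol:
            return max(round(p / 100 * n_total), round(floor_pct / 100 * n_total))
    return None  # the index has not stabilised over the tabulated sizes


def number_of_runs(sd_cbe, mean_cbe, alpha=0.05, rel_margin=0.05, min_runs=5):
    """Eq. (12): K = max{(z * S / (rel_margin * mean))^2, min_runs}."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    k = (z * sd_cbe / (rel_margin * mean_cbe)) ** 2
    # Assumption: k is rounded to the nearest integer before taking the maximum.
    return max(round(k), min_runs)


# Hypothetical mean CBE indexes keyed by training percentage (illustrative only).
means = {35: 2.20, 36: 2.16, 37: 2.13, 38: 2.125}
size = optimal_training_size(means, n_total=808)
runs = number_of_runs(sd_cbe=0.138, mean_cbe=2.129)
```

With these illustrative inputs the stopping rule selects 37% of 808 samples, and the run-count rule yields a value above the lower limit of five. A small standard deviation relative to the mean, as in the Pima case, makes k fall below five, so the lower limit applies.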
We then use repeated random sub-sampling validation (with 507 (66%) training data points, 261 (34%) testing data points, experiment repeated 30 times) to validate that our CBE cross-validation (with 100 (13%) training data points, 668 (87%) testing data points, experiment repeated five times) is efficient. The average and standard deviation of the SVM accuracy with the repeated random sub-sampling validation are […] and 1.743, respectively; and those of the CBE cross-validation are […] and […].044. The performances of the two cross-validation methods have insignificant differences (the P-value is 0.043, using the independent t-test). The average training time of the repeated random sub-sampling validation is […] = 7.3 s and the average training time of the CBE cross-validation is 1.9 × 5 = 6.[…] s.

We also use five-fold cross-validation to validate that our CBE cross-validation is efficient. The average and standard deviation of the SVM accuracy with the five-fold cross-validation are 75.4 and 1.74, respectively. The performances of the two cross-validation methods have insignificant differences (the P-value is 0.05, using the independent t-test). The average training time of the five-fold cross-validation is […] = 15.[…] s.

In addition, when we use 10% of the total data (the lower bound of the training data size) as the training data with five experiment runs (the lower bound of the experiment runs), the average and standard deviation of the SVM accuracy are […] and […].94, and it has significant differences with CBE cross-validation (the P-value = 0.055, using the independent t-test). The average training time is […] = […].4 s. The CBE cross-validation is better than the cross-validation using the lower bounds of the training data size and experiment runs. Therefore, CBE cross-validation is considered an efficient and effective method.

Table 5. Pima data set with 77 selected samples as the training data (default MinPts = 3). Columns: Accuracy; CBE index.

Fig. 5. Relationship between training size and CBE index with the Pima data set.

4.2.2. The Haberman data set

The relationship between the CBE indexes and classification accuracies is shown in Table 7 and Fig. 6. From the table and figure we can see that when the value of CBE decreases, the classification accuracy of the SVM also falls. There is thus a highly positive correlation between the CBE index and classification accuracy for this data set. Table 8 and Fig. 7 show the results of CBE cross-validation for the Haberman data set.

\[ \text{WHEN } \bar{X}_{CBE}^{\,33\%} - \bar{X}_{CBE}^{\,34\%} < 0.01,\ \text{THEN data size} = \max\{\text{no. of } 33\% \text{ samples},\ \text{no. of } 10\% \text{ samples}\} = 33\% \times 306 = 101 \tag{23} \]

Using the above equation, we determine the optimal training data size to be 101, and consider the geometric structure of the optimal training data to be stable. By a similar procedure, the optimal number of experiment runs is:

\[ \text{CALCULATE } \left(\frac{Z_{\alpha/2} \times 0.019}{0.05 \times 1.132}\right)^{2} = 9.973;\ \text{THEN } \max\{9.973, 5\} \approx 10 \tag{24} \]

where α = 0.05 is the significance level, and (0.05 × 1.132) = 0.0566 is the desired margin of error. We thus determine the optimal number of experiment runs to be 10. We then use repeated random sub-sampling validation (with 204 (66%) training data points, 104 (34%) testing data points, experiment repeated 30 times) to validate that our CBE cross-validation (with 101 (33%) training data points, 205 (67%) testing data points, experiment repeated 10 times) is efficient.
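The two baseline schemes used in these comparisons differ only in how the index sets are drawn. The sketch below generates the splits for both; it is an illustration with hypothetical function names and a fixed seed, not the code used in the experiments, and the classifier training step is left out.

```python
import random


def repeated_subsampling(n, train_fraction, repeats, seed=0):
    """Yield (train, validation) index lists from independent random splits."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


def k_fold(n, k, seed=0):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Settings mirroring the text: a 66%/34% split repeated 30 times, and 10 folds.
sub_splits = list(repeated_subsampling(306, 0.66, 30))
fold_splits = list(k_fold(306, 10))
```

Repeated sub-sampling trains 30 models on roughly two thirds of the data each, whereas K-fold trains K models on a fraction (K − 1)/K of the data each, which is the cost that the CBE procedure tries to reduce by choosing a smaller training size and fewer runs.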
The average and standard deviation of the SVM accuracy with the repeated random sub-sampling validation are […] and 3.19, respectively, and those with the CBE cross-validation are […] and .04. The performances of the two validation methods do not differ significantly (the P-value is 0.379, using the independent t-test). The average training time of the repeated random sub-sampling validation is […] = 9.9 s, while that of the CBE cross-validation is […] = .3 s.

Fig. 6. Relationship between CBE indexes and accuracies with the Haberman data set (correlation coefficient = 0.7).

We then use 10-fold cross-validation to validate that our CBE cross-validation is efficient. The average and standard deviation of the SVM accuracy with the 10-fold cross-validation are […] and .16, respectively. The performances of the two validation methods do not differ significantly (the P-value is 0.075, using the independent t-test). The average training time of the 10-fold cross-validation is […] = 5.1 s. In addition, when we use 10% of the total data (the lower bound of the training data size) as the training data with five experiment runs, the average and standard deviation of the SVM accuracy are […] and 3.641, and the result differs significantly from that of the CBE cross-validation (the P-value ≪ 0.01, using the independent t-test). The average training time is 0.18 × 5 = 0.9 s. Considering validation effectiveness, the CBE cross-validation is thus again better than the cross-validation using the lower bounds of training data size and experiment runs. Therefore, CBE cross-validation is an efficient and effective method.

The Australian credit approval data set

First, to analyse only the numerical independent variables, we delete the categorical independent variables X1, X4, X8, X9, X11, and X12, and delete the records that have missing values. The relationship between the CBE indexes and classification accuracies is shown in Table 9 and Fig. 8. From the table and figure we can see a highly positive correlation between the CBE index and classification accuracy.
Table 7. Haberman data set with 31 samples selected as the training data (default MinPts = […]). Columns: Accuracy, CBE index.
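The independent t-test comparisons reported for each data set can be reproduced from summary statistics alone (mean, standard deviation, number of runs). The following is a minimal sketch, assuming Welch's unequal-variance form of the test; the text only says "independent t-test", so the variance assumption, the function names, and the sample figures below are all illustrative rather than taken from the study.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def welch_t_test(mean1, sd1, n1, mean2, sd2, n2, steps=2000):
    """Two-sided Welch t-test from summary statistics; returns (t, df, p)."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Simpson's rule (steps must be even) for the density over [0, |t|].
    upper = abs(t)
    h = upper / steps
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    p = max(0.0, 1.0 - 2.0 * area)
    return t, df, p


# Hypothetical accuracies: 30 sub-sampling runs versus 5 runs on a small training set.
t, df, p = welch_t_test(76.0, 1.7, 30, 75.5, 2.0, 5)
print(f"t = {t:.3f}, df = {df:.1f}, p = {p:.3f}")
```

A P-value above the chosen significance level is read as "no significant difference" between the two validation schemes, which is how the comparisons in this section are interpreted.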

Table 8. The averages and standard deviations (SD) of CBE indexes with increasing size of the training data set for the Haberman data set (bold value means the optimal data size). Training data sizes: 42 (14%), 52 (17%), 61 (20%), 70 (23%), 73 (24%), 77 (25%), 80 (26%), 83 (27%), 86 (28%), 89 (29%), 92 (30%), 95 (31%), 97 (32%), 101 (33%), 104 (34%). Rows: Average, SD.

Fig. 7. Relationship between training size and the CBE index with the Haberman data set.

Table 9. Australian data set with 1,90 samples selected as training data (default MinPts = 3). Columns: Accuracy, CBE index.

Table 10 and Fig. 9 show the results of CBE cross-validation for the Australian data set.

WHEN the threshold criterion on $X^{CBE}_{42\%}$ and $X^{CBE}_{43\%}$ is $< 0.01$, THEN data size $= \max\{\text{no. of 42\% samples},\ \text{no. of 10\% samples}\} = 42\% \times 690 = 290$  (5)

By a similar procedure, we determine the optimal number of training data points to be 290, and measure the optimal number of experiment runs as:

CALCULATE $\left(\frac{Z_{\alpha/2} \times 0.0838}{0.05 \times 2.231}\right)^{2} = 2.17$; THEN $\max\{2.17,\ 5\} = 5$  (6)

where $\alpha = 0.05$ is the significance level, and $(0.05 \times 2.231) = 0.1116$ is the desired margin of error. We determine the optimal number of experiment runs to be five.

Again, we use repeated random sub-sampling validation (with 455 (66%) training data points, 235 (34%) testing data points, experiment repeated 30 times) to validate that our CBE cross-validation (with 290 (42%) training data points, 400 (58%) testing data points, experiment repeated five times) is efficient. The average and standard deviation of the SVM accuracy with the repeated random sub-sampling validation are […] and 1.30, respectively, and those with the CBE cross-validation are […] and […]. The performances of the two validation methods do not differ significantly (the P-value is 0.305, using the independent t-test). The average training time of the repeated random sub-sampling validation is […] = 54.9 s, and that of the CBE cross-validation is 1.84 × 5 = 9.2 s.
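The two selection steps applied to each data set (training data size, then number of experiment runs) can be sketched as follows. This is only an illustration under stated assumptions: the threshold criterion is taken to be the absolute change in the average CBE index between X% and (X+1)% of the data, and the run count is taken to be the usual normal-approximation sample-size formula rounded up; neither form is confirmed by the text here. The function names and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def optimal_training_size(avg_cbe_by_percent, n_samples, threshold=0.01, lower_percent=10):
    """Return (percent, size) for the first X% whose average CBE index differs
    from the (X+1)% value by less than `threshold`, floored at `lower_percent`.
    Returns None if no consecutive pair of percentages meets the criterion."""
    percents = sorted(avg_cbe_by_percent)
    for current, following in zip(percents, percents[1:]):
        if following - current != 1:
            continue  # assumed criterion compares X% with (X+1)% only
        change = abs(avg_cbe_by_percent[current] - avg_cbe_by_percent[following])
        if change < threshold:
            size = max(round(current * n_samples / 100),
                       round(lower_percent * n_samples / 100))
            return current, size
    return None


def experiment_runs(std_dev, margin_of_error, alpha=0.05, min_runs=5):
    """Return (raw estimate, runs): (Z_{alpha/2} * s / E)^2, rounded up,
    with a lower bound of `min_runs`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    raw = (z * std_dev / margin_of_error) ** 2
    return raw, max(math.ceil(raw), min_runs)


# Hypothetical average CBE indexes keyed by training percentage.
avg_cbe = {40: 0.62, 41: 0.60, 42: 0.585, 43: 0.58}
print(optimal_training_size(avg_cbe, n_samples=690))
print(experiment_runs(std_dev=0.1, margin_of_error=0.05))
```

Keeping the lower bounds (10% of the data, five runs) as explicit parameters makes it easy to see when they, rather than the computed values, decide the outcome.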
When we use five-fold cross-validation to validate CBE cross-validation, the average and standard deviation of the SVM accuracy with the five-fold cross-validation are 79. and 1.351, respectively. Thus, the performances of the two validation methods do not differ significantly (the P-value is 0.333, using the independent t-test). The average training time of the five-fold cross-validation is 1.98 × 5 = 9.9 s. In addition, using 10% of the total data (the lower bound of the training data size) as training data with five experiment runs, the average and standard deviation of the SVM accuracy are […] and .169, which differ significantly from those of the CBE cross-validation (the P-value ≪ 0.01, using the independent t-test). The average training time is […] = 7.5 s. Similarly, the CBE cross-validation is better than the cross-validation using the lower bounds of the training data size and experiment runs. Therefore, CBE cross-validation is an efficient and effective method.

Fig. 8. Relationship between CBE indexes and accuracies with the Australian data set (correlation coefficient = 0.9).

Discussion of CBE index for various data characteristics

In this subsection, we apply sensitivity analysis to the calculation of the CBE index using unbalanced classes, dimensions, and sample sizes of a data set as the attributes.

Unbalanced class

Nguyen and Yonggwan proposed that the accuracy of classifiers goes down as the unbalanced level increases. Specifically, they used

SVM as the classification tool and found that it was affected by the unbalanced effect [19]. In our experiments, we first consider the unbalanced class characteristic of a data set with the same data structure. We generate data sets by fixing the positive sample size and increasing the negative sample size, and the results are shown in Table 11. Table 11 shows that the higher the unbalanced level, the higher the data complexity and the lower the CBE index.

Table 10. The averages and standard deviations (SD) of CBE indexes with increasing size of the training data set for the Australian data set (bold value means the optimal data size). Training data sizes: 3 (10%), 138 (20%), 173 (25%), 207 (30%), 242 (35%), 276 (40%), 283 (41%), 290 (42%), 296 (43%). Rows: Average, SD.

Dimensions

For a fixed sample size, adding dimensions will degrade the performance (high data complexity) of a classifier if the number of training data points is small relative to the number of dimensions [4]. For the second characteristic, a fixed sample size of 50 is used. When increasing the number of dimensions with the same data structure, given that the number of training data is smaller than the number of dimensions in the experiments, the results are obtained and shown in Table 12. Table 12 shows that when the dimensions are high, the data complexity is also high, while the CBE index is low.

Sample size

For the third characteristic in our experiments, we use the same sample sizes for both classes, and these are increasing with the same structure. The results are shown in Table 13. Table 13 shows that when the samples of both classes increase, the data complexity stays the same, as does the CBE index.

5. Conclusion and discussions

Our research develops an efficient and effective cross-validation method called Complexity-based Efficient (CBE) cross-validation.
The CBE cross-validation uses the CBE index (calculated by exploring the data's geometric structure and noise) to precisely discover the data's characteristics and its non-linear complexity, in order to help understand the data set. We also employ the CBE index to calculate the optimal training data size and number of experiment runs. CBE cross-validation aims to reduce model evaluation time when a complex and computationally expensive classifier is used. We expect that when we apply CBE cross-validation to real binary data sets, we can use the proposed method to find the optimal training data and the number of experiment runs, to help researchers develop more precise classification tools with less evaluation time. Thus this work can assist researchers in developing new classification tools.

Table 11. Sensitivity analysis of the CBE index for unbalanced data sets. Columns: Positive samples, Negative samples, MinPts, CBE index. Rows: five cases.

Table 12. Sensitivity analysis of the CBE index for various data dimensions. Columns: No. of dimensions, MinPts, CBE index. Rows: five cases.

Table 13. Sensitivity analysis of the CBE index for various sample sizes of both classes. Columns: Positive samples, Negative samples, MinPts, CBE index. Rows: five cases.

The threshold criterion on $X^{CBE}_{n\%}$ and $X^{CBE}_{(n+1)\%}$, the lower bound sample size of 0.01, and the lower bound of experiment runs of five are empirical values, for which we hope to find theoretical values in future studies. With regard to the setting of the lower bounds, we consider that when the number of data is large, we do not want to use too few data for the analysis, even though the data is easy to classify, because the information lost could be significant, and it would thus be very difficult to convince decision makers intuitively. Besides, when we use these lower limits, we are indicating that about 40% of the whole data have the chance to be selected as the training data ($1 - (1 - 10\%)^{5} \approx 40\%$).
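The "about 40%" figure for the lower bounds (10% of the data, five runs) can be checked with a one-line calculation. The sketch below assumes each run draws its training set independently and uniformly at random, which the text does not state explicitly; the function name is hypothetical.

```python
def coverage(train_fraction, runs):
    """Probability that a given sample lands in at least one training set,
    assuming independent random draws of the training set in every run."""
    return 1 - (1 - train_fraction) ** runs


print(f"{coverage(0.10, 5):.1%}")
```

Under the same assumption, raising either the training fraction or the number of runs increases the share of the data that has a chance of being used for training.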
As to the experiment being repeated 30 times, we consider that the CBE distribution will normally converge to a normal distribution when n is large. As a matter of convenience, we thus use 30 times to approximate a normal distribution. In fact, one may need to use a Q-Q plot to check whether the statistic (accuracy) does in fact follow a normal distribution.

Fig. 9. Relationship between training size and CBE index of the Australian data set.
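A numeric companion to the Q-Q plot mentioned above is the correlation between the sorted accuracies and the corresponding standard normal quantiles: values close to 1 indicate that the points of the Q-Q plot lie near a straight line. The following is a minimal sketch; the plotting position (i + 0.5)/n, the function name, and the sample accuracies are assumptions made for illustration, and no cut-off value is implied by the study.

```python
from statistics import NormalDist, mean


def qq_correlation(values):
    """Pearson correlation between sorted values and standard normal quantiles."""
    n = len(values)
    ordered = sorted(values)
    # Theoretical quantiles at the plotting positions (i + 0.5) / n.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(theoretical), mean(ordered)
    sxy = sum((x - mx) * (y - my) for x, y in zip(theoretical, ordered))
    sxx = sum((x - mx) ** 2 for x in theoretical)
    syy = sum((y - my) ** 2 for y in ordered)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical accuracies (%) from repeated runs of one validation scheme.
accuracies = [74.1, 75.3, 76.0, 73.8, 75.9, 77.2, 74.6, 75.0, 76.4, 75.5]
print(round(qq_correlation(accuracies), 3))
```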


CBE cross-validation is a binary classification validation method. However, multi-class classification problems are very common in both studies and real-world applications. Therefore, the study of CBE cross-validation with multiple classes is also considered as one direction for future research.

References

[1] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[2] L.J. Cao, H.P. Lee, W.K. Chong, Modified support vector novelty detector using training data with outliers, Pattern Recognition Letters 24 (2003).
[3] G. Casella, R.L. Berger, Statistical Inference, second edition, Duxbury, 2002.
[4] R. Clarke, H.W. Ressom, A. Wang, J. Xuan, M.C. Liu, E.A. Gehan, Y. Wang, The properties of high-dimensional data spaces: implications for exploring gene and protein expression data, Nature Reviews Cancer 8 (1) (2008).
[5] M. Daszykowski, B. Walczak, D.L. Massart, Looking for natural patterns in data: Part 1. Density-based approach, Chemometrics and Intelligent Laboratory Systems 56 (2) (2001) 83–92.
[6] M. Daszykowski, B. Walczak, D.L. Massart, Representative subset selection, Analytica Chimica Acta 468 (2002).
[7] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, Portland, 1996.
[8] M.T. Hagan, H.B. Demuth, M. Beale, Neural Network Design, Thomson, Singapore.
[9] H. Han, Y. Ko, J. Seo, Using the revised EM algorithm to remove noisy data for improving the one-against-the-rest method in binary text classification, Information Processing and Management 43 (5) (2007).
[10] T.K. Ho, A data complexity analysis of comparative advantages of decision forest constructors, Pattern Analysis and Applications 5 (2002).
[11] M.Y. Hu, M. Shanker, G.P. Zhang, M.S. Hung, Modeling consumer situational choice of long distance communication with neural networks, Decision Support Systems 44 (4) (2008).
[12] V.N. Vapnik, The Nature of Statistical Learning Theory, second edition, Springer, New York, 2000.
[13] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-Interscience, 2003.
[14] E.W.M. Lee, Y.Y. Lee, C.P. Lim, C.Y. Tang, Application of a noisy data classification technique to determine the occurrence of flashover in compartment fires, Advanced Engineering Informatics 20 (2006) 213–222.
[15] D.C. Li, Y.H. Fang, An algorithm to cluster data for efficient classification of support vector machines, Expert Systems with Applications 34 (2008).
[16] D.C. Li, Y.H. Fang, A non-linearly virtual sample generation technique using cluster discovery and parametric equations of hypersphere, Expert Systems with Applications 36 (2009).
[17] D.C. Li, C.W. Yeh, T.I. Tsai, Y.H. Fang, Susan C. Hu, Acquiring knowledge with limited experience, Expert Systems 24 (3) (2007).
[18] E.B. Mansilla, On classifier domains of competence, Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), 2004.
[19] H.V. Nguyen, W. Yonggwan, Classification of unbalanced medical data with weighted Regularized Least Squares, Proceedings of the Frontiers in the Convergence of Bioscience and Information Technologies (IEEE), 2007.
[20] S. Piramuthu, M.J. Shaw, J.A. Gentry, A classification approach using multi-layered neural networks, Decision Support Systems 11 (5) (1994).
[21] A.M. Rubinov, N.V. Soukhorkova, J. Ugon, Classes and clusters in data analysis, European Journal of Operational Research 173 (2006).
[22] C. Schaffer, Technical note: selecting a classification method by cross-validation, Machine Learning 13 (1993).
[23] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, 1st edition, Pearson Addison Wesley, Boston, 2006.
[24] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second edition, Morgan Kaufmann, Amsterdam, 2005.

Der-Chiang Li is a Distinguished Professor in the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He received his Ph.D. degree at the Department of Industrial Engineering at Lamar University, Beaumont, Texas, USA, in 1985. As a research professor, his current interest concentrates on learning with small data sets.

Yao-Hwei Fang is a postdoctoral fellow in the Division of Biostatistics and Bioinformatics, National Health Research Institutes. He is working at the laboratory for statistical analysis of human genetics. He received his Ph.D. at the Department of Industrial and Information Management at National Cheng Kung University, Taiwan, in 2009.

Y.M. Frank Fang obtained his Ph.D. degree from the Department of Civil and Hydraulic Engineering, Feng Chia University, in 2006. Before he joined the Department of Civil and Hydraulic Engineering of Feng Chia University (FCU) in 2006, he worked as a postdoctoral researcher in the Geographic Information Systems Research Center, Feng Chia University. Currently, Assistant Professor Fang is Chief Researcher of the Geographic Information Systems Research Center, FCU. His research interests include disaster monitoring and civil engineering.
