A Statistical Model Selection Strategy Applied to Neural Networks

Joaquín Pizarro, Elsa Guerrero, Pedro L. Galindo
joaquin.pizarro@uca.es  elsa.guerrero@uca.es  pedro.galindo@uca.es
Dpto. Lenguajes y Sistemas Informáticos e Inteligencia Artificial
Grupo Sistemas Inteligentes de Computación
Universidad de Cádiz - SPAIN

Abstract

In statistical modelling, an investigator must often choose a suitable model from a collection of viable candidates. There is no consensus in the research community on how such a comparative study should be performed in a methodologically sound way. The ranking of several methods is usually performed by means of a selection criterion, which assigns a score to every model based on some underlying statistical principles. The fitted model that is favoured is the one corresponding to the minimum (or maximum) score. Statistical significance testing can extend this method. However, when enough pairwise tests are performed, the multiplicity effect appears, which can be taken into account by considering multiple comparison procedures. Existing comparison procedures can roughly be categorized as analytical or resampling based. This paper describes a resampling-based multiple comparison technique. The method is illustrated on the estimation of the number of hidden units for feed-forward neural networks.

1. Introduction

Many model selection algorithms have been proposed in the literature of various research communities. The existing comparison procedures can roughly be categorized as analytical or resampling based. Analytical approaches require certain assumptions about the underlying statistical model. Resampling-based methods involve much more computation, but they remove the risk of making faulty statements due to unsatisfied assumptions [4]. With the computer power currently available, this does not seem to be an obstacle. The standard methods of model selection include classical hypothesis testing, maximum likelihood [2], the Bayes method [6], cross-validation [7] and Akaike's information criterion [1].
Although there is active debate within the research community regarding the best method for comparison, statistical model selection is a reasonable approach [5]. We aim at determining which of two models is better on average. One way to define "on average" is to consider the performance of these algorithms averaged over all the training sets that might be drawn from the underlying distribution. Obviously, we have only a limited sample of data, and a direct approach is to divide the available data into a training set and a disjoint test set. However, the relative performance can depend on the particular training and test sets.
One way to improve this estimate is to repeatedly partition the data into disjoint training and test sets and to take the mean of the test-set errors over these different experiments. The standard t-test for the difference between two sample means is not a valid strategy, since the errors are estimated from the same test sample and are therefore highly correlated. A paired-sample t-test should be used instead. However, when more than two models are compared, paired t-tests must be extended to multiple comparison strategies. The first idea that comes to mind is to test each possible difference with a paired t-test. The problem with this approach is that the probability of making at least one Type I error increases with the number of tests made. This phenomenon is called selection bias. A general method for dealing with selection bias that is useful in most situations is the Bonferroni multiple comparisons procedure. The Bonferroni approach is a follow-up analysis to the ANOVA method and is based on the following result: if c comparisons are to be made, each with confidence coefficient (1 - alpha/c), then the overall probability of making one or more Type I errors is at most alpha. However, the proper application of the ANOVA procedure requires certain assumptions to be satisfied, i.e., all k populations are approximately normal with equal variances. Residual analysis can be applied to determine whether these assumptions are satisfied to a reasonable degree. Other procedures, such as Tukey and Tukey-Kramer, may be more powerful in certain sampling situations. In the following sections, we describe statistical techniques applied to model selection, including significance testing, pairwise comparison and multiple comparison strategies. Then, we justify the use of analysis of variance as a valid strategy for comparing different output error means, which allows us to estimate the optimum number of hidden units in feed-forward neural networks. Finally, the results of a computer simulation for an actual learning task are discussed.

2. Strategy description
We will describe our strategy in terms of a classification task performed by feed-forward neural networks. It is assumed that there exists a set X of possible data points, called the population. There also exists some target function, f, that classifies each x in X into one of K classes. Without loss of generality, it is assumed that K = 2, although none of the results in this paper depend on this assumption, since our only concern will be whether an example is classified correctly or incorrectly. A set of competing models is generated; they differ in the number of hidden units. The misclassification error over the population X is computed for each model, and statistical tests are used to decide which of the competing models are better. Dietterich [3] studied different statistical tests for comparing supervised classification learning algorithms and the sources of variation that a good statistical test should control. In our method, these sources of variation are controlled as follows:

Selection of the training data and test data. The same training data set and test data set are used to train and test all the competing models. A two-fold cross-validation method is performed, since in a k-fold cross-validation method (k > 2)
each pair of training sets shares a high ratio of the samples. This overlap may prevent the statistical test from obtaining a good estimate of the amount of variation that would be observed if each training set were completely independent of previous training sets.

Internal randomness in the learning algorithm. The learning algorithm in each competing model must be executed several times, and consequently several misclassification errors are generated. It is necessary to choose one. If the minimum of these values were taken, this would be the best case, and we would think we are near the global minimum of the error function. But this would be a bad choice for a statistical test, because an extreme case would have been chosen. To avoid extreme cases, the maximum and minimum misclassification errors are eliminated and the average of the remaining errors is calculated. We are trying to determine how the model behaves on average, rather than just considering the minimum error. Furthermore, we must account for the variation arising from the selection of the test data and of the training data, so the above process is repeated several times. At the beginning of each iteration, the training and test sets are randomly determined. At the end of this process, the misclassification error mean is calculated. The strategy is summarized as follows:

For v := 1 to V (30 times)
    Randomly select the training and test sets, both of the same size.
    For h := model 1 to model H
        For r := 1 to R
            Train model h.
            Error(r) = misclassification error.
        End
        Error_Model(v, h) = Average(Error)
    End
End

We recommend at least 30 misclassification error samples in order to guarantee that the results are approximately normally distributed. The goal of our strategy is to compare different models and to determine, by analysing the mean and the variance of each of them, whether differences among the models exist. When comparing more than two means, a test of differences is needed. An exploratory/descriptive analysis must be the first step.
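As a concrete illustration, the resampling loop above can be sketched in Python. This is a minimal sketch, not the authors' implementation: the model list, the synthetic train_and_score function and all numeric settings are hypothetical placeholders standing in for real network training.

```python
import numpy as np

rng = np.random.default_rng(42)

V, R = 30, 5            # 30 resampling iterations, R runs per model
models = [2, 4, 8]      # hypothetical hidden-unit counts to compare

def train_and_score(hidden_units, train_idx, test_idx):
    """Placeholder for training a network with the given number of
    hidden units and returning its misclassification error on the
    test set; here a synthetic noisy error is returned instead."""
    base = 0.3 / hidden_units + 0.05
    return base + 0.01 * rng.standard_normal()

n = 200                 # hypothetical total sample size
error_model = np.zeros((V, len(models)))
for v in range(V):
    # Random two-fold split: equal-sized training and test sets.
    perm = rng.permutation(n)
    train_idx, test_idx = perm[: n // 2], perm[n // 2:]
    for h, hidden in enumerate(models):
        runs = np.array([train_and_score(hidden, train_idx, test_idx)
                         for _ in range(R)])
        # Drop the best and worst run to avoid extreme cases,
        # then average the remaining errors.
        trimmed = np.sort(runs)[1:-1]
        error_model[v, h] = trimmed.mean()

# One misclassification-error mean per competing model.
means = error_model.mean(axis=0)
```

The matrix error_model plays the role of Error_Model(v, h) in the pseudocode: one row per random split, one column per competing model, ready for the analysis of variance described next.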
A univariate analysis of the interval variable by the grouping variable helps to understand the distribution and indicates whether a parametric test is appropriate. Both the parametric test for differences (ANOVA) and the nonparametric test (Kruskal-Wallis) are ways of performing an analysis of variance. These tests look at how much variation or spread there is in each sub-group. The more within-group variation there is in each sub-group, the more difficult it will be to assert positively that there is a difference between the group means. There are some questions to be answered:

1. Are the populations distributed according to a Gaussian distribution? While this assumption is not too important with large sample sizes, it is important with small sample sizes (especially with unequal sample sizes). This assumption has
been tested using the Kolmogorov-Smirnov method, and we have always found that the results follow a Gaussian distribution.

2. Do the populations have the same standard deviations? This assumption is not very important when all the models have the same (or almost the same) number of error samples, but it is very important when this number differs. In our method the number of error samples is the same for all the models.

3. Are the data unmatched? We have to compare the differences among group means with the pooled standard deviations of the groups. In our experiment the data are matched.

4. Are the differences between each value and the group mean independent? This assumption is in practice difficult to test; we must think about the experimental design. As the sources of variation have been taken into account, we assume these differences are independent.

In our method, the assumptions required to use the ANOVA test have been met. Since a large number of competing models is compared, the Bonferroni correction is applied to deal with selection bias. The null hypothesis is usually rejected; in other words, the variation among misclassification error means is significantly greater than expected by chance. Groups of models whose misclassification error means are not significantly different are then estimated. To do this, the models are sorted by misclassification error mean. Two models i and j are not significantly different if

    |ȳ_i − ȳ_j| ≤ t_{1−α/(2c)} · s_VNE · √(1/n_i + 1/n_j),   i, j = 1, …, M

where M is the number of models, n_i is the number of data for model i, ȳ_i and ȳ_j are the error means for models i and j, and t_{1−α/(2c)} is the Student t quantile with n − M degrees of freedom (n being the total number of observations). c is the Bonferroni correction (the number of comparisons), α is the statistical significance, and

    s²_VNE = ( Σ_{i=1}^{M} Σ_{h=1}^{n_i} (y_{ih} − ȳ_i)² ) / (n − M)

is the within-sample variance. In the group with the least misclassification error mean, the model with the fewest hidden units is selected (Occam's razor criterion). We have assumed that the goal is to find a network having the best generalization performance.
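The grouping rule above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the error samples are synthetic stand-ins (not the paper's data), scipy is assumed to be available, and the global ANOVA check is included before the Bonferroni-corrected pairwise threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic misclassification-error samples: one column per model,
# 30 values each (hypothetical stand-ins, not the paper's data).
errors = np.column_stack([
    0.062 + 0.01 * rng.standard_normal(30),
    0.064 + 0.01 * rng.standard_normal(30),
    0.148 + 0.01 * rng.standard_normal(30),
])
n_per, M = errors.shape        # samples per model, number of models
n = n_per * M                  # total number of observations
alpha = 0.1
c = M * (M - 1) // 2           # number of pairwise comparisons

# Global ANOVA: is there any difference among the model means at all?
f_stat, anova_p = stats.f_oneway(*errors.T)

means = errors.mean(axis=0)
# Within-sample (pooled) variance s^2_VNE with n - M degrees of freedom.
s2_vne = ((errors - means) ** 2).sum() / (n - M)
# Bonferroni-corrected two-sided Student t quantile.
t_crit = stats.t.ppf(1 - alpha / (2 * c), df=n - M)
threshold = t_crit * np.sqrt(s2_vne) * np.sqrt(1.0 / n_per + 1.0 / n_per)

# Two models fall in the same group when their error means differ
# by no more than the threshold.
same_group = {(i, j): abs(means[i] - means[j]) <= threshold
              for i in range(M) for j in range(i + 1, M)}
```

With well-separated means (as for the third synthetic model here), the ANOVA rejects the null hypothesis and the pairwise threshold then separates that model from the others, mirroring the procedure described above.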
This is usually the most difficult part of any pattern recognition problem, and it is the one which typically limits the practical application of neural networks. In some cases, however, other criteria might also be important. For instance, speed of operation on a serial computer will be governed by the size of the network, and we might be prepared to trade some generalization capability in return for a smaller network. It is desirable to consider a set of several competing models simultaneously, compare them and come to a decision on which to retain. We have therefore been concerned primarily with the choice of a model from a set of competing models, rather than with the decision of whether or not a new model with more hidden units should be used.
3. Simulation results

Let us consider the problem of determining the number of hidden units in a feed-forward neural network for a classification task. Let us define a data set where each input vector has been labelled as belonging to one of two classes, C1 and C2. Figure 1 shows the input patterns. The sample size is N1 = 270 data points of class C1 and N2 = 270 of class C2. In the simulation study, we consider multi-layer perceptrons having two layers of weights with full connectivity between adjacent layers: one linear output unit, M sigmoid (logistic, tanh, arctan, etc.) hidden units and no direct input-output connections. The only aspect of the architecture which remains to be specified is the number M of hidden units, and so we train a set of networks (models) having a range of values of M.

[Figure 1. Sample data distribution]

The results of the simulation study are given in Table 1. Two models are in the same group if the difference between their means is less than 0.04973 (statistical significance 0.1). Thus, from the group of models with the least error mean (headed by 7 hidden units), the model with 4 hidden units could be selected.

Table 1. Simulation Results

Hidden Units   Error Mean   Models not significantly different
7              0.06139      7 6 9 10 8 5 4
6              0.06278      7 6 9 10 8 5 4
9              0.06417      7 6 9 10 8 5 4
10             0.06546      7 6 9 10 8 5 4
8              0.06593      7 6 9 10 8 5 4
5              0.07398      7 6 9 10 8 5 4
4              0.08630      7 6 9 10 8 5 4
3              0.14731      3
1              0.27870      1 2
2              0.27880      1 2
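The final selection step, choosing the simplest member of the best group, can be sketched directly; the group below is the first row of Table 1.

```python
# Models in the group with the lowest error mean (first row of Table 1),
# identified by their number of hidden units.
best_group = [7, 6, 9, 10, 8, 5, 4]

# Occam's razor: among statistically indistinguishable models, keep
# the one with the fewest hidden units.
selected = min(best_group)
print(selected)  # -> 4
```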
If the number of models to compare is increased, the results show that four hidden units is a good selection; that is, there is no statistically significant difference among the error means of neural network architectures with four or more hidden units. The same results are obtained when the number of data points is increased.

4. Conclusions

An alternative method for model selection has been proposed, where no distributional assumptions about the data are needed. Our goal has been to show that, in a finite set of models, it is possible to find a subset whose error mean differences are not significant with respect to the smallest. Our statistical testing procedure has been designed to avoid dependences and randomness, in order to be able to obtain sample data from different models under the same circumstances. After collecting data from a completely randomized design, the sample data means are analysed. The way to determine whether a difference exists between the population means is to examine the spread (or variation) between the sample means, and to compare it to a measure of variability within the samples. The greater the difference between these variations, the greater the evidence of a difference between the means. A statistical test procedure has been used to estimate groups of models whose differences among misclassification error means are not significantly greater than expected by chance. This study shows how statistical methods can be employed for the specification of neural network architectures. Although the simulation study presented is encouraging, it is only a first step. More experience has to be gained through further simulation with different underlying models, sample sizes and noise levels.

References

1. H. Akaike, A New Look at the Statistical Model Identification, IEEE Transactions on Automatic Control, 1974, AC-19:716-723.
2. C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
3. T. G. Dietterich, Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 1998, Vol. 10, no. 7, pp.
1895-1923.
4. A. Feelders & W. Verkooijen, On the Statistical Comparison of Inductive Learning Methods, Learning from Data: Artificial Intelligence and Statistics V, Springer-Verlag, 1996, pp. 271-279.
5. T. Mitchell, Machine Learning, WCB/McGraw-Hill, 1997.
6. G. Schwarz, Estimating the Dimension of a Model, The Annals of Statistics, 1978, Vol. 6, pp. 461-464.
7. M. Stone, Cross-validatory Choice and Assessment of Statistical Predictions (with discussion), Journal of the Royal Statistical Society, 1974, Series B, 36, 111-147.