A Statistical Model Selection Strategy Applied to Neural Networks


Joaquín Pizarro, Elsa Guerrero, Pedro L. Galindo
joaquin.pizarro@uca.es elsa.guerrero@uca.es pedro.galindo@uca.es
Dpto. Lenguajes y Sistemas Informáticos e Inteligencia Artificial
Grupo Sistemas Inteligentes de Computación
Universidad de Cádiz - SPAIN

Abstract

In statistical modelling, an investigator must often choose a suitable model from a collection of viable candidates. There is no consensus in the research community on how such a comparative study should be performed in a methodologically sound way. The ranking of several methods is usually performed by means of a selection criterion, which assigns a score to every model based on some underlying statistical principles; the fitted model that is favoured is the one corresponding to the minimum (or maximum) score. Statistical significance testing can extend this method. However, when enough pairwise tests are performed, the multiplicity effect appears, which can be taken into account by considering multiple comparison procedures. The existing comparison procedures can roughly be categorized as analytical or resampling based. This paper describes a resampling-based multiple comparison technique. The method is illustrated on the estimation of the number of hidden units for feed-forward neural networks.

1. Introduction

Many model selection algorithms have been proposed in the literature of various research communities. The existing comparison procedures can roughly be categorized as analytical or resampling based. Analytical approaches require certain assumptions about the underlying statistical model. Resampling-based methods involve much more computation, but they remove the risk of making faulty statements due to unsatisfied assumptions [4]. With the computer power currently available, this does not seem to be an obstacle. The standard methods of model selection include classical hypothesis testing, maximum likelihood [2], the Bayes method [6], cross-validation [7] and Akaike's information criterion [1].
Although there is active debate within the research community regarding the best method for comparison, statistical model selection is a reasonable approach [5]. We aim at determining which of two models is better on average. A way to define "on average" is to consider the performance of these algorithms averaged over all the training sets that might be drawn from the underlying distribution. Obviously, we have only a limited sample of data, and a direct approach is to divide the available data into a training set and a disjoint test set. However, the relative performance can be dependent on the training and test sets.

One way to improve this estimate is to repeatedly partition the data into disjoint training and test sets and to take the mean of the test set errors over these different experiments. The standard t-test for testing the difference between two sample means is not a valid strategy, since the errors are estimated from the same test sample and are, therefore, highly correlated. A paired-sample t-test should be used instead. However, when more than two models are compared, paired t-tests should be extended to multiple comparison strategies. The first idea that comes to mind is to test each possible difference with a paired t-test. The problem with this approach is that the probability of making at least one Type I error increases with the number of tests made. This phenomenon is called selection bias. A general method to deal with selection bias that is useful in most situations is the Bonferroni multiple comparisons procedure. The Bonferroni approach is a follow-up analysis to the ANOVA method and is based on the following result: if c comparisons are to be made, each with confidence coefficient (1 - α/c), then the overall probability of making one or more Type I errors is at most α. However, the proper application of the ANOVA procedure requires certain assumptions to be satisfied, i.e., all k populations are approximately normal with equal variances. Residual analysis can be applied to determine whether these assumptions are satisfied to a reasonable degree. Other procedures, such as Tukey and Tukey-Kramer, may be more powerful in certain sampling situations. In the following sections, we describe statistical techniques applied to model selection, including significance testing, pairwise comparison and multiple comparison strategies. Then, we justify the use of analysis of variance as a valid strategy to compare different output error means, which allows us to estimate the optimum number of hidden units in feed-forward neural networks. Finally, the results of a computer simulation for an actual learning task are discussed.
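As an illustration of the pairwise comparison machinery described above, the following minimal Python sketch (our own illustration, not from the paper; the function names are invented) computes the paired-sample t statistic for two models evaluated on the same resampled test sets, and the Bonferroni-corrected per-comparison significance level under the assumption that all pairwise comparisons among the candidate models are tested:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    # Paired-sample t statistic for two models whose errors were
    # estimated on the same resampled test sets (hence correlated):
    # the test is applied to the per-split differences.
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

def bonferroni_alpha(alpha, n_models):
    # Per-comparison significance level when every pair among
    # n_models models is tested: c = n_models-choose-2 comparisons,
    # each run at alpha / c.
    c = n_models * (n_models - 1) // 2
    return alpha / c
```

For example, with 5 candidate models and an overall significance level of 0.05, each of the 10 pairwise tests would be run at 0.005.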
2. Strategy description

We will describe our strategy in terms of a classification task performed by feed-forward neural networks. It is assumed that there exists a set X of possible data points, called the population. There also exists some target function, f, that classifies x ∈ X into one of K classes. Without loss of generality, it is assumed that K = 2, although none of the results in this paper depend on this assumption, since our only concern will be whether an example is classified correctly or incorrectly. A set of competing models is generated; they differ in the number of hidden units. Misclassification errors on the population X are computed for each model, and statistical tests are used to decide which of the competing models are better. Dietterich [3] studied different statistical tests for comparing supervised classification learning algorithms and the sources of variation that a good statistical test should control. In our method, these sources of variation are controlled as follows:

Selection of the training data and test data. The same training data set and test data set are used to train and test all the competing models. A two-fold cross-validation method is performed, since in a k-fold cross-validation method (k > 2)

each pair of training sets shares a high ratio of the samples. This overlap may prevent the statistical test from obtaining a good estimate of the amount of variation that would be observed if each training set were completely independent of previous training sets.

Internal randomness in the learning algorithm. The learning algorithm in each competing model must be executed several times, and consequently several misclassification errors are generated. It is necessary to choose one. If the minimum of these values were taken, this would be the best case, and we would think we are near the global minimum of the error function. But this would be a bad selection for a statistical test, because an extreme case was chosen. To avoid extreme cases, the maximum and minimum misclassification errors are eliminated and the average of the remaining errors is calculated. We are trying to determine how the model behaves, so we focus on the error samples on average rather than just considering the minimum error. Furthermore, we must account for the variation from the selection of the test data and from the selection of the training data, so the above process is repeated several times. At the beginning of each iteration, the training and test sets are randomly determined. At the end of this process, the misclassification error mean is calculated. The strategy is summarized as follows:

For v := 1 to V (30 times)
    Randomly select the training and test sets, both of the same size.
    For h := model 1 to model H
        For r := 1 to R
            Train model h.
            Error(r) := misclassification error.
        End
        Error_Model(v, h) := Average(Error)
    End
End

We recommend at least 30 misclassification error samples in order to guarantee that the results are distributed according to a normal distribution. The goal of our strategy is to compare different models and to determine, by analysing the mean and the variance of each of them, whether differences among the models exist. When comparing more than two means, a test of differences is needed. An exploratory/descriptive analysis must be the first step.
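The resampling loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `train_and_eval` is a hypothetical callback (not part of the paper) that trains candidate model h on the training set, evaluates it on the test set, and returns the misclassification error of that run.

```python
import random
from statistics import mean

def trimmed_mean_error(errors):
    # Discard the best and worst run to avoid extreme cases (as the
    # strategy prescribes), then average the rest; assumes R >= 3 runs.
    return mean(sorted(errors)[1:-1])

def collect_error_samples(data, train_and_eval, n_models, V=30, R=5):
    # Returns error_model[v][h]: one averaged misclassification error
    # per random split v and candidate model h.
    error_model = []
    for v in range(V):
        shuffled = random.sample(data, len(data))
        half = len(shuffled) // 2          # train and test of equal size
        train, test = shuffled[:half], shuffled[half:]
        row = []
        for h in range(n_models):
            runs = [train_and_eval(h, train, test) for _ in range(R)]
            row.append(trimmed_mean_error(runs))
        error_model.append(row)
    return error_model
```

With V = 30 splits this yields, for each model, the 30 or more error samples that the text recommends before assuming normality.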
A univariate analysis of the interval variable by the grouping variable helps to understand the distribution and indicates whether a parametric test is appropriate. Both the parametric test for differences (ANOVA) and the nonparametric test (Kruskal-Wallis) are ways to perform an analysis of variance. These tests look at how much variation or spread there is in each sub-group. The more within-group variation there is in each sub-group, the more difficult it will be to state positively that there is a difference between the group means. There are some questions to be answered:

1. Are the populations distributed according to a Gaussian distribution? While this assumption is not too important with large sample sizes, it is important with small sample sizes (especially with unequal sample sizes). This assumption has

been tested using the method of Kolmogorov and Smirnov, and we have always found that the results follow a Gaussian distribution.

2. Do the populations have the same standard deviations? This assumption is not very important when all the models have the same (or almost the same) number of error samples, but it is very important when this number differs. In our method, the number of error samples is the same for all the models.

3. Are the data unmatched? We have to compare the differences among group means with the pooled standard deviations of the groups. In our experiment the data are matched.

4. Are the differences between each value and the group mean independent? This assumption is in practice difficult to test; we must think about the experimental design. As the sources of variation have been taken into account, we assume these differences are independent.

In our method, the assumptions required by the ANOVA test have been met. Since a large number of competing models is compared, the Bonferroni correction is applied to deal with selection bias. The null hypothesis is usually rejected; in other words, the variation among misclassification error means is significantly greater than expected by chance. Thus, groups of models whose misclassification error means are not significantly different are estimated. To do this, the models are sorted by misclassification error mean. Two models i and j are not significantly different if

|ȳ_i − ȳ_j| ≤ t_{1−α/(2c)} · s_VNE · √(1/n_i + 1/n_j),   i, j = 1, …, M

where M is the number of models, n_i is the number of data points for model i, ȳ_i and ȳ_j are the means for models i and j, t is the Student t quantile with n − M degrees of freedom, c is the Bonferroni correction (the number of comparisons), α is the statistical significance level, and

s²_VNE = ( Σ_{i=1}^{M} Σ_{h=1}^{n_i} (y_{ih} − ȳ_i)² ) / (n − M)

is the within-sample variation. From the group with the smallest misclassification error mean, the model with the fewest hidden units is selected (Occam's razor criterion). We have assumed that the goal is to find a network having the best generalization performance.
This is usually the most difficult part of any pattern recognition problem, and it is the one which typically limits the practical application of neural networks. In some cases, however, other criteria might also be important. For instance, the speed of operation on a serial computer will be governed by the size of the network, and we might be prepared to trade some generalization capability in return for a smaller network. It is desirable to consider a set of several competing models simultaneously, compare them, and come to a decision on which to retain. We have therefore been concerned primarily with the choice of a model from a set of competing models rather than with the decision of whether or not a new model with more hidden units should be used.
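The grouping rule of Section 2 can be sketched as follows. This is our own minimal illustration, not the authors' code: `within_sample_std` computes the pooled within-group standard deviation s_VNE, and the decision function applies the threshold from the text, with the Bonferroni-corrected Student t quantile `t_crit` assumed to be looked up externally (e.g. from a t table).

```python
import math

def within_sample_std(samples):
    # Pooled within-group standard deviation s_VNE over M models,
    # where samples[i] is the list of error samples for model i.
    M = len(samples)
    n_total = sum(len(group) for group in samples)
    means = [sum(group) / len(group) for group in samples]
    ss = sum((y - m) ** 2
             for group, m in zip(samples, means)
             for y in group)
    return math.sqrt(ss / (n_total - M))

def not_significantly_different(mean_i, mean_j, n_i, n_j, s_vne, t_crit):
    # Decision rule from the text: the two means are grouped together
    # when |mean_i - mean_j| <= t_crit * s_vne * sqrt(1/n_i + 1/n_j),
    # with t_crit the Student t quantile at 1 - alpha/(2c).
    threshold = t_crit * s_vne * math.sqrt(1.0 / n_i + 1.0 / n_j)
    return abs(mean_i - mean_j) <= threshold
```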

3. Simulation results

Let us consider the problem of determining the number of hidden units in a feed-forward neural network for a classification task. Let us define a data set where each input vector has been labelled as belonging to one of two classes, C1 and C2. Figure 1 shows the input patterns. The sample size is N1 = 270 data points of class C1 and N2 = 270 of class C2. In the simulation study, we consider multi-layer perceptrons having two layers of weights with full connectivity between adjacent layers: one linear output unit, M sigmoid (logistic, tanh, arctan, etc.) hidden units and no direct input-output connections. The only aspect of the architecture which remains to be specified is the number M of hidden units, and so we train a set of networks (models) having a range of values of M.

[Figure 1. Sample data distribution]

The results of the simulation study are given in Table 1. Two models are in the same group if the difference between their means is less than 0.04973 (statistical significance 0.1). Thus, from the group of models with the smallest error mean (7 hidden units), the model with 4 hidden units could be selected.

Table 1. Simulation results

Hidden Units   Error Mean   Models not significantly different
7              0.06139      7 6 9 10 8 5 4
6              0.06278      7 6 9 10 8 5 4
9              0.06417      7 6 9 10 8 5 4
10             0.06546      7 6 9 10 8 5 4
8              0.06593      7 6 9 10 8 5 4
5              0.07398      7 6 9 10 8 5 4
4              0.08630      7 6 9 10 8 5 4
3              0.14731      3
1              0.27870      1 2
2              0.27880      1 2
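The grouping reported in Table 1 can be reproduced mechanically once the threshold (0.04973 in the text) is fixed: for each model, collect the models whose error means lie within the threshold of its own. A minimal sketch, using the error means from Table 1 as input (this is our illustration, not the authors' code):

```python
def group_models(error_means, threshold):
    # error_means: {model_id: mean misclassification error}.
    # For each model, list (sorted by increasing error mean) the models
    # whose means differ from it by less than `threshold`, reproducing
    # the "models not significantly different" column of Table 1.
    return {
        m: sorted((m2 for m2, e2 in error_means.items()
                   if abs(e - e2) < threshold),
                  key=error_means.get)
        for m, e in error_means.items()
    }
```

Feeding in the ten error means of Table 1 with threshold 0.04973 yields the group 7 6 9 10 8 5 4 for the best model, {3} alone, and {1, 2} together, matching the table.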

If the number of models to compare is increased, the results show that four hidden units is a good selection; that is, there is no statistically significant difference among the error means of neural network architectures with four or more hidden units. The same results are obtained when the number of data points is increased.

4. Conclusions

An alternative method for model selection has been proposed, in which no distributional assumptions about the data are needed. Our goal has been to show that, in a finite set of models, it is possible to find a subset whose error mean differences are not significant with respect to the smallest. Our statistical testing procedure has been designed to avoid dependences and randomness, in order to obtain sample data from different models under the same circumstances. After collecting data from a completely randomized design, the sample means are analysed. The way to determine whether a difference exists between the population means is to examine the spread (or variation) between the sample means and to compare it to a measure of variability within the samples. The greater the difference in the variations, the greater the evidence of a difference between them. A statistical test procedure has been used to estimate groups of models whose differences among the misclassification error means are not significantly greater than expected by chance. This study shows how statistical methods can be employed for the specification of neural network architectures. Although the simulation study presented is encouraging, this is only a first step. More experience has to be gained through further simulation with different underlying models, sample sizes and noise levels.

References

1. H. Akaike, "A New Look at the Statistical Model Identification", IEEE Transactions on Automatic Control, 1974, AC-19, pp. 716-723.
2. C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
3. T. G. Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms", Neural Computation, 1998, Vol. 10, No. 7, pp. 1895-1923.
4. A. Feelders and W. Verkooijen, "On the Statistical Comparison of Inductive Learning Methods", Learning from Data: Artificial Intelligence and Statistics V, Springer-Verlag, 1996, pp. 271-279.
5. T. Mitchell, Machine Learning, WCB/McGraw-Hill, 1997.
6. G. Schwarz, "Estimating the Dimension of a Model", The Annals of Statistics, 1978, Vol. 6, pp. 461-464.
7. M. Stone, "Cross-validatory Choice and Assessment of Statistical Prediction (with discussion)", Journal of the Royal Statistical Society, Series B, 1974, 36, pp. 111-147.