Toward a Method of Selecting Among Computational Models of Cognition


Psychological Review, 2002, Vol. 109, No. 3. Copyright 2002 by the American Psychological Association, Inc.

Toward a Method of Selecting Among Computational Models of Cognition

Mark A. Pitt, In Jae Myung, and Shaobo Zhang, The Ohio State University

The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. This article introduces a method of selecting among mathematical models of cognition known as minimum description length, which provides an intuitive and theoretically well-grounded understanding of why one model should be chosen. A central but elusive concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in 3 areas of cognitive modeling: psychophysics, information integration, and categorization.

How should one choose among competing theoretical explanations of data? This question is at the heart of the scientific enterprise, regardless of whether verbal models are being tested in an experimental setting or computational models are being evaluated in simulations. A number of criteria have been proposed to assist in this endeavor, summarized nicely by Jacobs and Grainger (1994). They include (a) plausibility (are the assumptions of the model biologically and psychologically plausible?); (b) explanatory adequacy (is the theoretical explanation reasonable and consistent with what is known?); (c) interpretability (do the model and its parts, e.g., parameters, make sense? are they understandable?); (d) descriptive adequacy (does the model provide a good description of the observed data?); (e) generalizability (does the model predict well the characteristics of data that will be observed in the future?); and (f) complexity (does the model capture the phenomenon in the least complex, i.e., simplest, possible manner?). The relative importance of these criteria may vary with the types of models being compared. For example, verbal models are likely to be scrutinized on the first three criteria just as much as the last three to thoroughly evaluate the soundness of the models and their assumptions.

Author note: Mark A. Pitt, In Jae Myung, and Shaobo Zhang, Department of Psychology, The Ohio State University. Portions of this work were presented at the 40th annual meeting of the Psychonomic Society, Los Angeles, California, November 18-22, 1999, and at the 31st and 32nd annual meetings of the Society for Mathematical Psychology, Nashville, Tennessee (August 6-9, 1998) and Santa Cruz, California (July 29-August 1, 1999), respectively. We thank D. Bamber, R. Golden, and Andrew Hanson for their valuable comments and attention to detail in reading earlier versions of this article. Mark A. Pitt and In Jae Myung contributed equally to the article, so order of authorship should be viewed as arbitrary. The section Three Application Examples is based on Shaobo Zhang's doctoral dissertation submitted to the Department of Psychology at The Ohio State University. In Jae Myung and Mark A. Pitt were supported by National Institute of Mental Health Grant MH. Correspondence concerning this article should be addressed to Mark A. Pitt or In Jae Myung, Department of Psychology, The Ohio State University, 1885 Neil Avenue Mall, Columbus, Ohio. E-mail: pitt.2@osu.edu or myung.1@osu.edu
Computational models, on the other hand, may have already satisfied the first three criteria to a certain level of acceptability earlier in their evolution, leaving the last three criteria to be the primary ones on which they are evaluated. This emphasis on the latter three can be seen in the development of quantitative methods designed to compare models on these criteria. These methods are the topic of this article.

In the last two decades, interest in mathematical models of cognition and other psychological processes has increased tremendously. We view this as a positive sign for the discipline, for it suggests that this method of inquiry holds considerable promise. Among other things, a mathematical instantiation of a theory provides a test bed in which researchers can examine the detailed interactions of a model's parts with a level of precision that is not possible with verbal models. Furthermore, through systematic evaluation of its behavior, an accurate assessment of a model's viability can be obtained. The goal of modeling is to infer the structural and functional properties of a cognitive process from behavioral data that were thought to have been generated by that process. At its most basic level, then, a mathematical model is a set of assumptions about the structure and functioning of the process.

The adequacy of a model is first assessed by measuring its ability to reproduce human data. If it does so reasonably well, then the next step is to compare its performance with competing models. It is imperative that the model selection method that is used to select among competing models accurately measures how well each model approximates the mental process. Above all else, the method must be valid. Otherwise, the purpose of modeling is undermined. One runs the risk of choosing a model that in actuality is a poor approximation of the underlying process of interest, leading researchers astray. The potential severity of this problem should make it clear that sound methodology is not only integral to but also necessary for theoretical advancement. In short, model selection methods must be as sophisticated and robust as the models themselves.

In this article, we introduce a new quantitative method of model selection. It is theoretically well grounded and provides a clear

understanding of why one model should be chosen over another. The purpose of the article is to provide a good conceptual understanding of the problem of model selection and the solution being advocated. Consequently, only the most important (and new) technical advances are discussed. A more thorough treatment of the mathematics can be found in other sources (Myung, Balasubramanian, & Pitt, 2000; Myung, Kim, & Pitt, 2000; Myung & Pitt, 1997, 1998). After introducing the problem of model selection and identifying model complexity as a key property of a model that must be considered by any selection method, we introduce an intuitive statistical tool that assists in understanding and measuring complexity. Next, we develop a quantitative measure of complexity within the mathematics of differential geometry and show how it is incorporated into a powerful model selection method known as minimum description length (MDL). Finally, application examples of MDL and the complexity measure are provided by comparing models in three areas of cognitive psychology: psychophysics, information integration, and categorization.

Generalizability Instead of Goodness of Fit

Model selection in psychology has largely been limited to a single criterion to measure the accuracy with which a set of models describes a mental process: goodness of fit (GOF). The model that fits a particular set of observed data the best (i.e., accounts for the most variance) is considered superior because it is presumed to approximate most closely the mental process that generated the data. Typical measures of GOF include the root mean squared error (RMSE), which is the square root of the sum of squared deviations between observed and predicted data divided by the number of data points fitted, and the maximum likelihood, which is the probability of obtaining the observed data maximized with respect to the model's parameter values. GOF as a selection criterion is attractive because it appears to measure exactly what one wants to know: How well does the model mimic human behavior? In addition, the GOF measure is easy to calculate.

GOF is a necessary and important component of model selection: Data are the only link to the underlying cognitive process, so a model's ability to describe the output from this process must be considered in model selection. However, model selection based solely on GOF can lead to erroneous results and the choice of an inferior model. Just because a model fits data well does not necessarily imply that the regularity one seeks to capture in the data is well approximated by the model (Roberts & Pashler, 2000). Properties of the model itself can enable it to provide a good fit to the data for reasons that have nothing to do with the model's approximation to the cognitive process (Myung, 2000). Two of these properties are the number of parameters in the model and its functional form (i.e., the way in which the model's parameters and data are combined in the model equation). Together they contribute to a model's complexity, which refers to the flexibility inherent in a model that enables it to fit diverse patterns of data.¹ The following simulation example demonstrates the independent contribution of these two properties to GOF.

Three models were compared on their ability to fit data. Model M1 (defined in Table 1) generated the data to fit, and is therefore considered the true model.
Table 1
Goodness of Fit and Generalizability Measures of Three Models Differing in Complexity

                      M1 (true model)   M2           M3
Goodness of fit       2.68 (0%)         2.49 (31%)   2.41 (69%)
Generalizability      2.99 (52%)        3.08 (28%)   3.14 (20%)

Note. Each cell contains the average root mean squared error of the fit of each model to the data and, in parentheses, the percentage of samples (out of 1,000) in which that particular model fitted the data best. The three models were as follows: M1: y = ln(x + a) + error; M2: y = b ln(x + a) + error; and M3: y = b x^a + error. The error was normally distributed with M = 0 and SD = 3. Samples were generated from model M1 using a = 1 on the same 6 points for x, which ranged from 1 to 6 in increments of 1.

Model M2 differed from M1 only in having one additional parameter, two instead of one; note that their functional forms are the same. Model M3 had the same number of parameters as M2, but a different functional form (a is an exponent of x rather than an additive component). Parameters were chosen for each of the three models to give the best fit to 1,000 randomly generated samples of data from the model M1. Each model's mean fit to the samples is shown in the first row of Table 1 along with the percentage of time that particular model provided a better fit than its two competitors.

As can be seen, M2 and M3, with one more parameter than M1, always provided a better fit to the data than M1. Because the data were generated by M1, M2 and M3 must have overfitted the data beyond what is necessary to capture the underlying regularity. Otherwise, one would have expected M1 to fit its own data best at least some of the time. After all, M1 generated the data! The improved fit of M2 and M3 occurred because the extra parameter, b, in these two models enabled them to absorb random error (i.e., nonsystematic variation) in the data. Absorption of these random fluctuations is the only means by which M2 and M3 could have fitted the data better than the true model, M1. Note also that M3 provided a better fit than M2. This improvement in fit must be due to functional form rather than the number of parameters, because these two models differ only in how the data (x) and the two parameters (a and b) are combined in the model equation.

This example demonstrates clearly that GOF alone is inadequate as a model selection criterion. Because a model's complexity is not evaluated by the method, the model capable of absorbing the most variation in the data, regardless of its source, will be chosen. Frequently this will be the most complex model. The simulation also highlights the point that model selection is particularly difficult in psychology, and in other social sciences, precisely because random error is present in the data. Although this noise can be minimized (in the experimental design), it cannot be eliminated, so in any given data set, variation due to the cognitive process and variation due to random error are entangled, posing a significant obstacle to identifying the best model. To get around this problem, model selection must be based, instead, on a different criterion, that of generalizability.

¹ Cutting, Bruno, Brady, and Moore (1992) used the term scope, which is similar to our definition of complexity. They proposed to measure the scope by assessing "a model's ability to account for all possible data functions, where those functions are generated by a reasonably large sample of random data sets" (p. 364).
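To make the logic of the simulation concrete, the sketch below (ours, not the authors' code) regenerates the design described in the note to Table 1: samples are drawn from M1, each of the three models is fitted by least squares, and each model is then scored both on the sample it was fitted to (goodness of fit) and, with its parameters held fixed, on a fresh sample from M1 (the generalizability measure discussed in the next section). Exact values will differ from Table 1, but the ordering of the models should be similar.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.arange(1.0, 7.0)                                   # the 6 design points, x = 1..6

def m1(x, a):    return np.log(np.clip(x + a, 1e-12, None))      # true model: y = ln(x + a)
def m2(x, a, b): return b * np.log(np.clip(x + a, 1e-12, None))  # adds one parameter
def m3(x, a, b): return b * x ** a                               # same k as M2, different form

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

models = {"M1": (m1, [1.0]), "M2": (m2, [1.0, 1.0]), "M3": (m3, [1.0, 1.0])}
fit_err = {m: [] for m in models}
gen_err = {m: [] for m in models}
fit_wins = {m: 0 for m in models}
gen_wins = {m: 0 for m in models}

for _ in range(1000):
    y_cal = m1(x, 1.0) + rng.normal(0, 3, x.size)     # sample used for fitting (a = 1, SD = 3)
    y_new = m1(x, 1.0) + rng.normal(0, 3, x.size)     # fresh sample from the same process
    fit_s, gen_s = {}, {}
    try:
        for name, (f, p0) in models.items():
            theta, _ = curve_fit(f, x, y_cal, p0=p0, maxfev=20000)
            fit_s[name] = rmse(y_cal, f(x, *theta))   # goodness of fit (Table 1, row 1)
            gen_s[name] = rmse(y_new, f(x, *theta))   # generalizability (row 2), parameters fixed
    except RuntimeError:
        continue                                      # skip the rare sample where a fit fails
    for name in models:
        fit_err[name].append(fit_s[name])
        gen_err[name].append(gen_s[name])
    fit_wins[min(fit_s, key=fit_s.get)] += 1
    gen_wins[min(gen_s, key=gen_s.get)] += 1

for name in models:
    print(name, round(float(np.mean(fit_err[name])), 2), fit_wins[name],
          round(float(np.mean(gen_err[name])), 2), gen_wins[name])
```

The point of the sketch is only to show the two scoring steps side by side; any least-squares routine could be substituted for curve_fit.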

The goal of generalizability is to predict the statistics of new, as yet unseen, samples generated by the mental process being studied. The rationale underlying the criterion is that the model should be chosen that fits all samples best, not the model that provides the best fit to one particular sample. Only when this condition is met can one be sure a model is accurately capturing the underlying process, not also the idiosyncrasies (i.e., random error) of a particular data sample. More formally, generalizability can be defined in terms of a discrepancy function that measures the expected error in predicting future data given the model of interest (Linhart & Zucchini, 1986; also see their work for a discussion of the theoretical underpinnings of generalizability).

The results of a second simulation illustrate the superiority of generalizability as a model selection criterion. After each of the data samples was fitted in the first simulation, the parameters of the three models were fixed, and generalizability was assessed by fitting the models to another 1,000 samples of data generated from M1. The average fits are shown in the second row of Table 1. As can be seen, poor generalizability is the cost of overfitting a specific sample of data. Not only are average fits now worse for M2 and M3 than for M1, but these two models provided the best fit to the second sample much less often than M1.

Generalizability should be preferred over GOF because it does a better job of capturing the general trend in the data and ignoring random variation. The difference between these two selection criteria is shown in Figure 1. Dots in the panel represent observed data points. Lines are the functions generated by two models varying in complexity. The simpler model (thick line) captures the general trend in the data. If new data points (+) are added to the sample, fit will remain similar. The more complex model (thin line) not only captures the general trend in the data, but also captures many of the idiosyncrasies of each observation in the data set, which will cause fit to drop when additional observations are added to the sample. Generalizability would favor the simple model, which fits with our intuitions. GOF, on the other hand, would favor the complex model.

Figure 1. Illustration of the trade-off between goodness of fit and generalizability. An observed data set (dots) was fitted to a simple model (thick line) and a complex model (thin line). New observations are shown by the plus symbol.

The goal of model selection, then, should be to maximize generalizability. This turns out to be quite difficult in practice, because the relationship between complexity and generalizability is not as straightforward as that between complexity and GOF. These differences are illustrated in Figure 2. Model complexity is represented along the horizontal axis and any fit index on the vertical axis, where larger values indicate a better fit (e.g., percent variance accounted for), with the two functions representing the two selection criteria. As was demonstrated in the first simulation, as complexity increases, so does GOF. Generalizability will also increase positively with complexity, but only up to the point where the model is sufficiently complex to capture the regularities in the data caused by the cognitive process being modeled. Any additional complexity will cause a drop in generalizability, because after that point the model will begin to capture random error, not just the underlying process. The difference between the GOF and generalizability curves represents the amount of overfitting that can occur.
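The qualitative shape of the two curves in Figure 2 is easy to reproduce with a toy illustration of our own (not taken from the article): polynomials of increasing degree are fitted to a noisy sample from a fixed curve, and while fit to that sample improves monotonically with degree, error on a fresh sample typically reaches its minimum at a moderate degree and then worsens as the extra flexibility starts absorbing noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
true_mean = np.sin(2 * np.pi * x)                         # the underlying regularity
y_obs = true_mean + rng.normal(0, 0.3, x.size)            # sample used for fitting
y_new = true_mean + rng.normal(0, 0.3, x.size)            # future sample from the same process

for degree in range(10):                                  # complexity here = polynomial degree
    coeffs = np.polyfit(x, y_obs, degree)
    pred = np.polyval(coeffs, x)
    gof = np.sqrt(np.mean((y_obs - pred) ** 2))           # keeps improving with degree
    gen = np.sqrt(np.mean((y_new - pred) ** 2))           # typically best at a moderate degree
    print(degree, round(float(gof), 3), round(float(gen), 3))
```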
Only by taking complexity into account can a selection method accurately measure a model's generalizability. The task before the modeling community has been to develop an accurate and complete measure of model complexity, being sensitive not only to the number of parameters in the model but also to its functional form. Another way to interpret the preceding discussion is that the trademark of a good model is its ability to satisfy the two opposing selection pressures of GOF and complexity, with the end result being good generalizability. These two pressures can be thought of as the two edges of Occam's razor: A model must be complex enough to capture the underlying regularity yet simple enough to avoid overfitting the data sample and thus losing generalizability. In this regard, model selection methods should be evaluated on their success in implementing Occam's razor. The selection method that we introduce in this article, MDL, achieves this goal. Before we describe this method, we review prior approaches to model selection.

Prior Approaches to Model Selection

We begin this section with a formal definition of a model. From a statistical standpoint, data are a sample generated from a true but unknown probability distribution, which is the regularity underlying the data. A statistical model is defined as a collection of probability distributions defined on experimental data and indexed by the model's parameter vector, whose values range over the parameter space of the model. If the model contains as a special case the probability distribution that generated the data (i.e., the true model), then the model is said to be correctly specified; otherwise it is misspecified. Formally, define y = (y1, ..., yN) as a vector of values of the dependent variable, θ = (θ1, ..., θk) as the parameter vector of the model, and f(y|θ) as the likelihood function as a function of the parameter. N is the number of observations and k is the number of parameters. Often it is possible to write y as a sum of a deterministic component plus random error:

y = g(θ, x) + e.   (1)

In the equation, x = (x1, ..., xN) is a vector of an independent variable x, and e = (e1, ..., eN) is the random error vector from a probability distribution with a mean of zero.

Figure 2. Illustration of the relationship between goodness of fit and generalizability as a function of model complexity (Myung & Pitt, 2001). From Stevens' Handbook of Experimental Psychology (p. 449, Figure 11.4), by J. Wixted (Editor), 2001, New York: Wiley. Copyright 2001 by Wiley. Adapted with permission.

Quite often the mean function g(θ, x) itself is taken to define a mathematical model. However, the specification of the error distribution must be included in the definition of a model. Additional parameters may be introduced in the model to specify the shape of the error distribution (e.g., normal). Often its shape is determined by the experimental task or design. For example, consider a recognition memory experiment in which the participant is required to respond old or new to a set of pictures presented across a series of independent trials, with the number of correct responses recorded as the dependent variable. Suppose that a two-parameter model assumes that the probability of a correct response follows a logistic function of the time lag (xi), for condition i (i = 1, ..., N), between initial exposure and recognition test, in the form of g(θ1, θ2, xi) = [1 + θ1 exp(−θ2 xi)]⁻¹. In this case, the dependent variable yi will be binomially distributed with probability g(θ1, θ2, xi) and n the number of binomial trials, so the shape of the error function is completely specified by the experimental task.

Six representative selection methods currently in use are shown in Table 2. They are the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), the root mean squared deviation (RMSD), the information-theoretic measure of complexity (ICOMP; Bozdogan, 1990), cross-validation (CV; Stone, 1974), and Bayesian model selection (BMS; Kass & Raftery, 1995; Myung & Pitt, 1997). Each of these methods assesses a model's generalizability by combining a measure of GOF with a measure of complexity. Each prescribes that the model that minimizes the given criterion be chosen. That is, the smaller the criterion value of a model, the better the model generalizes.² A fuller discussion of these methods can be found in Myung, Forster, and Browne (2000; see also Linhart & Zucchini, 1986).

AIC and BIC are the two most commonly used selection methods. The first term, −2 ln f(y|θ̂), is a maximum likelihood measure of GOF and the second term, involving k, is a measure of complexity that is sensitive to the number of parameters in the model. As the number of parameters increases, so does the criterion. In BIC, the rate of increase is modified by the log of the sample size, n.³ RMSD uses RMSE as the measure of GOF and also takes into account the number of parameters through k.⁴ These three measures, AIC, BIC, and RMSD, are all sensitive only to one aspect of complexity, number of parameters, but insensitive to functional form. This is clearly inadequate because, as demonstrated in Table 1, the functional form of a model influences generalizability. ICOMP is an improvement on this shortcoming. Its second and third terms together represent a complexity measure that takes into account the effects of parameter sensitivity through trace(Σ) and parameter interdependence through det(Σ), which, according to Li, Lewandowsky, and DeBrunner (1996), are two principal components of the functional form that contribute to model complexity. However, ICOMP is also problematic because it is not invariant under reparameterization of the model, in particular under nonlinear forms of reparameterization.⁵
² The model selection methods discussed in the present article do not require the assumption that the models being compared are correct or nested. (A model is said to be correct if there is a parameter value of the model that yields the probability distribution that has generated the observed data sample. A model is said to be nested within another model if the former can be reduced to a special case of the latter by setting one or more of its parameters to fixed values.) On the other hand, the generalized likelihood ratio test based on the G² or chi-square statistics (e.g., Bishop, Fienberg, & Holland, 1975), which are often used to compare two models, assumes that the models are nested and, further, that the reduced model is correct. When these assumptions are met, both types of selection methods should perform similarly. However, the methods should not be viewed as interchangeable because their goals differ. The selection methods presented in this article were designed to identify the model that generalizes best in some defined sense. The generalized likelihood ratio test, in contrast, is a null hypothesis significance test in which the hypothesis that the reduced model is correct is tested given a prescribed level of the Type 1 error rate (i.e., α). Accordingly, the model chosen under this test may not necessarily be the one that generalizes best.

³ Sample size n refers to the number of independent data samples (more accurately, errors, i.e., the ei's) drawn from the same probability distribution. Data size is the number of observed data points that are being fitted to evaluate a model and that may come from different probability distributions, although from the same probability family. Often, the sample size is equal to the data size. A case in point is a linear regression model, yi = θxi + ei (i = 1, ..., N), where ei ~ N(0, σ²). Note that the errors, the ei's, are independent and identically distributed according to the normal probability distribution with mean zero and variance σ². On the other hand, if it is assumed that each ei is normally distributed with zero mean but with a different value of the variance, that is, ei ~ N(0, σi²) (i = 1, ..., N), then the sample size, n, will now be equal to 1 whereas the data size, N, remains unchanged.

⁴ The RMSD defined in Table 2 differs from the RMSD that has often been used in the psychological literature (e.g., Friedman, Massaro, Kitzis, & Cohen, 1995), where it is defined as RMSD = √(SSE/N), in which (N − k) is replaced by N, and therefore does not take into account the number of parameters. This form of RMSD is nothing more than RMSE. As such, it is not appropriate to use as a method of model selection, especially when comparing models that differ in the number of parameters.

⁵ Reparameterization refers to transforming the parameters of a model so that it becomes another, behaviorally equivalent, model. For example, a one-parameter exponential model with normal error, y = e^(−ηx) + N(0, σ²), is a reparameterization of another model: y = θ^x + N(0, σ²). The latter is obtained from the former by defining a new parameter, θ, as θ = e^(−η). Whenever two models are related to each other through reparameterization, they become equivalent in the sense that both will fit any given data set identically, albeit with different parameter values. Statistically speaking, they are indistinguishable from one another.
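For concreteness, the helper below (our illustration, not code from the article) computes the three count-based criteria listed in Table 2 for a least-squares fit, under one common convention that the errors are i.i.d. normal with variance estimated from the residuals; whether that error variance is itself counted in k is a modeling choice left to the user.

```python
import numpy as np

def count_based_criteria(y, yhat, k):
    """AIC, BIC, and RMSD (Table 2) for a model with k parameters fitted by
    least squares, assuming i.i.d. normal error with sigma^2 = SSE / n."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    n = y.size
    sse = float(np.sum((y - yhat) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * sse / n) + 1)   # maximized log-likelihood
    return {"AIC": -2 * loglik + 2 * k,
            "BIC": -2 * loglik + k * np.log(n),
            "RMSD": float(np.sqrt(sse / (n - k)))}
```

Because AIC and BIC differ only in their penalty terms (2k versus k ln n), they always rank two models with the same number of parameters identically, a point the article returns to when choosing which criteria to carry into the application examples.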

Table 2
Six Prior Model Selection Methods

Akaike information criterion (AIC): AIC = −2 ln f(y|θ̂) + 2k
Bayesian information criterion (BIC): BIC = −2 ln f(y|θ̂) + k ln n
Root mean square deviation (RMSD): RMSD = √[SSE/(N − k)]
Information-theoretic measure of complexity (ICOMP): ICOMP = −ln f(y|θ̂) + (k/2) ln[trace(Σ(θ̂))/k] − (1/2) ln det(Σ(θ̂))
Cross-validation (CV): CV = −ln f(y_Val|θ̂_Cal)
Bayesian model selection (BMS): BMS = −ln ∫ f(y|θ) π(θ) dθ

Note. y = data sample of size n; θ̂ = parameter value that maximizes the likelihood function f(y|θ); k = number of parameters; SSE = sum of the squared deviations between observed and predicted data; N = the number of data points fitted; Σ = covariance matrix of the parameter estimates; y_Val = validation sample of observed data; θ̂_Cal = maximum likelihood parameter estimate for a calibration sample; ln = the natural logarithm of base e; π(θ) = the prior probability density function of the parameter.

In CV, the observed data are divided into two subsamples of equal sizes, calibration and validation. The former is used to estimate the best-fitting parameter values of a model. The parameters are then fixed to these values and used by the model to fit the validation sample, yielding a model's CV index. CV is an easy-to-use, heuristic method of estimating a model's generalizability (for a brief tutorial, see Myung & Pitt, 2001). The emphasis on generalizability makes it reasonable to suppose that CV somehow takes into account the effects of functional form. If, how, and how well it does this is not clear, however.

BMS is a model selection method motivated from Bayesian inference. As such, the method chooses models based on the posterior probability of a model given the data. Calculation of the posterior probability requires the specification of the parameter prior density, π(θ), creating the possibility that model selection will depend on the choice of the prior density. As with CV, complexity in BMS is elusive. The integral form of the measure indicates that BMS takes into account functional form and the number of parameters, but how this is achieved is not entirely clear. It can be shown that BIC performs equivalently to BMS as a large sample approximation.

It is important to note that these selection criteria are themselves sample estimates of a true but unknown population parameter (i.e., generalizability in the population), and thus their values can change from sample to sample. Under the model selection procedure described above, however, one is forced to choose one model no matter how small the difference is among models, even when the models are virtually equivalent in their approximation of the underlying regularity. One solution to this dilemma is to conduct a statistical test, before applying the model selection procedure, to decide if two given models provide equally good descriptions of the underlying process. Golden (2000) proposed such a methodology, in which one can determine whether a subset of models are equally good approximations of the cognitive process.⁶ If the number of comparisons is not small, however, it can be difficult to control experiment-wise error. The preceding selection methods represent important progress in tackling the model selection problem. All have shortcomings that limit their usefulness to various degrees.
The complexity measure in AIC, BIC, and RMSD is incomplete, and the other three are either not invariant under reparameterization (ICOMP) or lack a clear complexity measure (CV, BMS). The remainder of this article is devoted to the development and testing of a model selection approach that overcomes these limitations. We begin by showing that differential geometry provides a theoretically well-justified and intuitive framework for understanding complexity and model selection in general.

Model Complexity: A Distributional Approach

We begin the discussion of complexity with a graphical definition of the term, intended to clarify what it means for a model to be complex. Depicted in the top panel in Figure 3 is the set of all data patterns that are possible given a particular experimental design. Every point in this multidimensional data space represents a particular data pattern in terms of a probability distribution, such as the shape of a frequency distribution of response times. All models occupy a section, or multiple sections, of data space, being able to fit a subset of the possible data patterns that could be observed. It is equally appropriate to think of data space as the universe of all models under consideration, because every model will occupy a region of this space, large or small.

⁶ This is a null hypothesis significance test, which, as an extension of Wilks's generalized likelihood ratio test, tests the null hypothesis that all models under consideration fit the data equally well. This test, unlike the generalized likelihood ratio test, is applicable to comparing non-nested and misspecified models for a wide range of discrepancy functions, including the ones with penalty terms, such as AIC, BIC, and MDL. In the standard model selection procedure using these criteria, one is forced to decide between two models under comparison. This test allows for a third decision: that both models are equally good or there is not enough evidence yet for choosing one model over the other.

The amount of space occupied by a model is positively related to its complexity. A simple model (Ma) will occupy a small region of data space because it assumes a specific structure in the data, which will manifest itself as a relatively narrow range of similar data patterns. This idea is illustrated in the lower left panel. When one of these few patterns occurs, the model will fit the data well; otherwise, it will fit poorly. Simple models are easily falsifiable, requiring a small minimum number of data points outside of their region of data space to disprove the model. In contrast, a complex model (Mb) will occupy a larger portion of data space. Complex models do not assume a single structure in the data. Rather, the structure changes as a function of the parameter values of the model. A slight change in a parameter's value can produce a dramatic change in the model's structure. Such chameleon-like behavior enables complex models to be finely tuned to fit a wide range of data patterns. This is illustrated in the lower right panel. Overly complex models are of questionable worth because their ability to fit such a diverse set of data patterns can make them difficult to falsify. In general, a complex model is one with many parameters and a (powerful) nonlinear equation for combining parameters. Complexity is dichotomized in this example for illustrative purposes only. It is more accurate to think of it as a continuum, as depicted in Figure 2.

Figure 3. The top panel depicts regions in data space occupied by two models, Ma (simple model) and Mb (complex model), with the range of data patterns that can be generated by each model in the lower panels.

Although the examples in Figure 3 are hypothetical, the graphical depiction of mathematical models in this way is not merely illustrative. Response surface analysis (RSA) is a statistical tool that, as in Figure 3, yields graphical representations of models for comparing their relative complexities. In addition, it serves as an informative starting point for the derivation of an elegant quantitative measure of complexity.

Response Surface Analysis

RSA is a method for studying geometric relations among responses generated by a mathematical model, often used in nonlinear regression (Bates & Watts, 1988). For a model with k parameters and N observations, the response surface is defined as a k-dimensional surface, formed by all possible response vectors that the model can describe. The response surface is embedded in an N-dimensional data space, which is the set of all possible response vectors that could be generated independently of a model. The response surface is a hyperplane for a linear model but may be curved when the model is nonlinear. The effects of model complexity on model fit are easily visible when models are compared in the space of response surfaces. This is shown in the following example. See Myung, Kim, and Pitt (2000) for a more detailed discussion.

Consider the following one-parameter power model:

y = t^(−θ) (power model),   (2)

where y is the response probability (e.g., proportion correct), t is a presentation or retention interval greater than 1, and θ (> 0) is a parameter. Suppose that y is measured at two different time intervals, t1 and t2. Given two fixed values of t1 and t2, the response surface is a line or a curve in a two-dimensional data space composed of (y_t1, y_t2), created by plotting the y values at t1 against the corresponding y values at t2 for the full range of the parameter, similar to phase plots in dynamical systems research (Kelso, 1995). In essence, a model is represented graphically as a plot of y_t1 versus y_t2 in data space.
For example, for the parameter θ = 1, the y value at t1 = 2 is obtained as y_t1 = (t1)^(−θ) = (2)^(−1) = 0.500. Similarly, the y value at t2 = 8 is obtained as y_t2 = (t2)^(−θ) = (8)^(−1) = 0.125. These two values are then represented as a single point (0.500, 0.125) on the (y_t1, y_t2) plane. Additional points are obtained by varying the parameter over its full range (i.e., 0 < θ < ∞) to form a continuous curve, which is called the response curve of a model, shown in the middle panel of Figure 4. The equation that describes this relationship can be derived analytically as follows:

y_t2 = y_t1^(ln t2 / ln t1).   (3)

Note that the parameter θ has been removed from the equation. The model is now parameter free, having been redefined as the relationship between two y values instead of a parameter and a y value. Each point on the response curve describes the relationship between two y values that are themselves described perfectly by a power function. Similarly, the response curves for the following one-parameter models can be obtained and are graphed in the adjacent panels of Figure 4:

y = 1 − θt (linear model)
y = [1.102 sin(5θt/12) + 1]/2 (black-hole model).   (4)

RSA provides two valuable insights into model complexity. First, RSA makes the meaning of complexity tangible. The response curve of a model represents a complete visual description of the model (i.e., all of the data patterns it can describe). The curve is the model. Any point that falls on the curve can be perfectly fit by the model. Thus, RSA clearly reveals what patterns of data a model can describe and what patterns it cannot.
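A quick numerical check (illustrative, not from the article) of Equations 2 and 3: every value of θ maps to a point (y_t1, y_t2) = (2^−θ, 8^−θ) on the power model's response curve, θ = 1 gives the point (0.500, 0.125) mentioned above, and every point on the curve satisfies the parameter-free relation of Equation 3.

```python
import numpy as np

t1, t2 = 2.0, 8.0
theta = np.linspace(0.01, 10.0, 2000)                      # sweep of the parameter
y1, y2 = t1 ** -theta, t2 ** -theta                        # response curve of y = t**(-theta)

print(t1 ** -1.0, t2 ** -1.0)                              # theta = 1 gives (0.5, 0.125)
assert np.allclose(y2, y1 ** (np.log(t2) / np.log(t1)))    # Equation 3 holds at every point
```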

For example, the response curve of the linear model reveals that the model can describe only those (y_t1, y_t2) points satisfying the equation y_t2 = 4y_t1 − 3 (0.75 ≤ y_t1 ≤ 1), and no others.

Figure 4. Response curves of three one-parameter models that have the same number of parameters but differ in functional form, each obtained for t1 = 2 and t2 = 8.

Second, the contributions of functional form to model complexity become evident when models are compared in RSA space. All three models in Figure 4 have one parameter, but their response curves differ greatly, indicating that their functional forms must also differ. This observation leads to an intuitive measure of model complexity: Given that the response surface of a model represents the collection of all possible data patterns that the model can describe, one could define a natural complexity measure as the total length of the model's response curve. For example, for the three response curves in Figure 4, one can conclude that the black-hole model is most complex with its line length of 25.74, followed by the power model (length = 1.50), and the linear model (length = 1.03). Dunn (2000) presents another RSA-based complexity measure.
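The curve lengths quoted above are easy to approximate numerically. The sketch below (ours; the black-hole model is omitted because its reported length depends on the parameter range used in the original figure) discretizes each response curve for t1 = 2 and t2 = 8 over a parameter range that keeps y between 0 and 1, and sums the segment lengths; the results fall near the reported values of 1.50 and 1.03, with small differences attributable to the exact parameter range assumed.

```python
import numpy as np

t = np.array([2.0, 8.0])                                    # t1 and t2

def curve_length(g, thetas):
    """Polyline length of the response curve theta -> (g(theta, t1), g(theta, t2))."""
    pts = np.array([g(th, t) for th in thetas])             # points along the curve
    return float(np.sum(np.sqrt(np.sum(np.diff(pts, axis=0) ** 2, axis=1))))

power = lambda th, t: t ** -th                              # y = t**(-theta)
linear = lambda th, t: 1.0 - th * t                         # y = 1 - theta*t

print(curve_length(power, np.linspace(0.0, 50.0, 100001)))  # power model, roughly 1.5
print(curve_length(linear, np.linspace(0.0, 0.125, 1001)))  # linear model, roughly 1.03
```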
Despite the possible ways of quantifying complexity within RSA, any such measure would be incomplete because it would not take into account the stochastic nature of the process underlying the data. That is, RSA ignores random variation in the data. The response curves in Figure 4 depict the three models without an error term. Recall that data represent a sample from an unknown probability distribution, the shape of which must be specified by the model. A complete measure of complexity must take into account the distributional characteristics of a model (e), not only that of the mean function, that is, g(θ, x) in Equation 1. Only the latter is considered in RSA. Thus, any RSA metric would yield only an approximate measure of complexity. To incorporate random error into a complexity measure requires that RSA be extended into a space of probability distributions, to which we now turn.

Differential Geometric Approach to Model Complexity

In this section we show that differential geometry, a branch of mathematics, provides a theoretically well-justified and intuitive measure of model complexity. A more technically rigorous presentation of the topic can be found in Myung, Balasubramanian, and Pitt (2000). Within differential geometry, a model forms a geometric object known as a Riemannian manifold that is embedded in the space of all probability distributions (Amari, 1983, 1985; Rao, 1945). As in the data space depicted in Figure 3, every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hypervolume in which similar distributions are mapped to nearby points, as illustrated in Figure 5.

Figure 5. The space of probability distributions forms a manifold on which similar distributions are mapped to nearby points.

Earlier, we defined complexity as that characteristic of a model that enables it to fit a wide range of data patterns. In a geometric context, this translates into an inherent characteristic of a model that enables it to describe a wide range of probability distributions. Models that are able to describe more distributions should be more complex. Model complexity would therefore seem to be related to the number of probability distributions that a model can generate. This intuition immediately runs into trouble: The number of all such distributions is uncountably infinite, making the value indeterminable. Or is it? Given that not all distributions are equally similar to one another, one solution is to count only distinguishable distributions. That is, if two or more probability distributions on a model's manifold are sufficiently similar to one another to be statistically indistinguishable, they are counted as one distribution, with a cluster of such distributions occupying a local neighborhood on the manifold. This procedure yields a countably infinite set of distinguishable distributions, the size of which is a natural measure of complexity. More precisely, two probability distributions should be considered indistinguishable if one is mistaken for the other

even in the presence of an infinite amount of data. A measure of volume that counts only distinguishable distributions must be devised to achieve this goal. The following mental exercise shows how this can be done. Draw data from one distribution, which is indexed by a specific parameter, say θp, in the model, and ask how well one can guess whether the data came from θp rather than from a nearby θq. The ability to distinguish between these distributions increases with the amount of available data. However, it can be shown that for any fixed amount of data there is a little ellipsoid around θp where the probability of error in the guessing game is large. In other words, within this ellipsoid, distributions are not very distinguishable in the statistical sense. To count distinguishable distributions, one should then tile the model manifold with such ellipsoids, counting one distribution for each ellipsoid. This procedure turns the manifold into an ellipsoid-covered lattice with a distinguishable distribution at each lattice point. Then the limit of infinite sample size should be taken so that the ellipsoids of indistinguishability shrink and the associated lattice becomes finer, forming a continuum in the limit. Taking this limit recovers a continuum measure that counts only distinguishable distributions. When this computation is carried out, the number of distinguishable distributions turns out to be equal to dθ {det[I(θ)]}^(1/2), where I(θ) is the Fisher information matrix of a sample of size 1, det(I) is the determinant of the matrix I, and dθ the infinitesimal parameter volume (see Footnote 7 for a definition of the Fisher information matrix; see also Schervish, 1995). The number of all distinguishable probability distributions that a model can generate or describe is obtained by integrating dθ {det[I(θ)]}^(1/2) over the entire parameter manifold as follows:

V_M = ∫ dθ √det I(θ),   (5)

where the subscript M denotes a particular model under consideration. This measure is known as the Riemannian volume in differential geometry. A highly desirable property of the volume measure is that it is invariant under reparameterization. This property is an outgrowth of models being represented as manifolds in the space of all probability distributions. In this context, the parameters of a model simply index the collection of distributions a model describes. The choice of the parameters themselves is irrelevant. The manifold is the model, which will never change, regardless of how the model is specified in an equation (see Equation 10 and accompanying text). The Riemannian volume makes good sense as a complexity measure. Because complexity is related to the volume of a model in the space of probability distributions, the measure of volume should count only different, or distinguishable, distributions, and not the coordinate volume (∫ dθ) of the manifold. The Riemannian volume, therefore, is a direct function of the number of distinguishable distributions that a model can generate, with a complex model generating more distributions than a simple model.
Relation to RSA

The differential geometric approach to model complexity is similar to RSA in that a mathematical model is viewed as a geometric shape embedded in a hyperdimensional space, albeit different spaces (probability distributions vs. response vectors). This correspondence is not accidental, because the differential geometric approach is a logical extension of RSA. To understand the connection between the two, think of model selection as an inference game: The goal is to determine, out of a set of probability distributions that index data patterns, which model is most likely to have generated a data sample drawn from an unknown probability distribution. Referring back to Equation 1, the main yardstick used in this selection process is the likelihood function, f(y|θ). The value of the likelihood function depends upon not only the mean function, g(θ, x), but also the distributional characteristics of the error term (e). Any justifiable measure of complexity should take into account these two factors. RSA considers only the first term, whereas the differential geometric approach considers both.

To see how the two approaches are related quantitatively, consider the response curve of a one-parameter model, such as the power model in Figure 4 (middle panel). The RSA measure of complexity in this model is the total length of the response curve, which in essence measures the number of data points along the curve. In the differential geometric approach, counting is carried out with the additional knowledge of the local distinguishability of data points along the curve. This difference is illustrated in Figure 6 for the one-parameter power model. The response curve is split into segments of different lengths, with the points within each segment being statistically indistinguishable. Note that distinguishability is not uniform along the curve. Points in the middle region are less distinguishable than those at either end.

Figure 6. The power model's response curve from Figure 4 divided into local regions of indistinguishability (i.e., the points within each region are statistically indistinguishable).

In fact, for any one-parameter model of observed data that follows a binomial probability distribution, one can derive formal expressions for these two measures of complexity as follows (see Appendix A for a complete derivation):

RSA: length L_M = ∫ dθ √{ Σ_{q=1}^{N} [dg(θ, x_q)/dθ]² };
Differential geometry: volume V_M = ∫ dθ √{ Σ_{q=1}^{N} (1 / {g(θ, x_q)[1 − g(θ, x_q)]}) [dg(θ, x_q)/dθ]² }.   (6)

In the equations, it is assumed that the observed data, y_q, are distributed binomially, Bin[n, g(θ, x_q)], with sample size n and probability g(θ, x_q). Note that the two measures are identical except for the additional term, 1/{g(θ, x_q)[1 − g(θ, x_q)]}, in the differential geometric complexity measure. This extra term takes into account local distinguishability and is equal to det[I(θ)] in Equation 5.

MDL Method of Model Selection

Thus far in the article we have introduced a measure of model complexity. Although it is useful for comparing the relative complexities of models, as will be shown below, by itself the measure is insufficient as a model selection method. What is missing is a measure of how well the model fits the data (i.e., a measure of GOF). MDL, a model selection method from algorithmic coding theory in computer science (Grünwald, 2000; Rissanen, 1983, 1996), combines both of these measures. The MDL approach to model selection was developed within the domain of information theory, where the goal of model selection is to choose the model that permits the greatest compression of data in its description. The assumption underlying the approach is that regularities or patterns in data imply redundancy. The more the data can be compressed by extracting this redundancy, the more we learn about the underlying regularities governing the cognitive process of interest. The full form of the measure is shown below. The first term is the GOF measure, and the second and third together form the intrinsic complexity of the model (Rissanen, 1996):

MDL = −ln f(y|θ̂) + (k/2) ln(n/2π) + ln ∫ dθ √det I(θ),   (7)

where y = (y1, ..., yn) is a data sample of size n, θ̂ is the maximum likelihood parameter estimate, ln is the natural logarithm of base e, and I(θ) is the Fisher information matrix defined earlier.⁷ The integration of the third term is taken over the parameter space defined by the model. As with prior selection methods, MDL prescribes that the model that minimizes the criterion should be chosen, the assumption being that such a model has extracted the most redundancy (i.e., regularity) in the data and thus should generalize best. In practice, the criterion represents the shortest length of computer code (measured in bits of information) necessary to describe the data given the model. The shorter the code, the greater the amount of regularity in the data that the model uncovered. The soundness of MDL as a model selection criterion has been well documented by Li and Vitányi (1997), who showed that there is a close relationship between minimizing MDL and achieving good generalizability. From a decision-theoretic perspective, MDL selects the one model, among a set of competing models, that minimizes the expected error in predicting future data, in which the prediction error is measured using a logarithmic discrepancy function (Rissanen, 1999; Yamanishi, 1998).

It turns out that minimization of MDL corresponds to maximization of the posterior probability within the Bayesian statistics framework (i.e., BMS). Balasubramanian (1997) showed that the MDL criterion can be derived as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density. This connection between the two suggests that choosing the model that gives the shortest description of the observed data is essentially equivalent to choosing the model that is most likely true in the sense of probability theory (see Theorem 1 of Vitányi & Li, 2000).
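As an illustration of how Equations 6 and 7 can be put to work, the sketch below (ours, not the authors' code) computes the MDL criterion for the one-parameter power model y = t^(−θ) when each observation is binomial with probability g(θ, t_q). The retention intervals, the number of trials, and the counts are hypothetical, and the number of Bernoulli trials per condition is treated as the sample size n, in line with Footnote 3.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import binom

t = np.array([2.0, 4.0, 8.0])        # hypothetical retention intervals
n = 50                               # hypothetical number of Bernoulli trials per interval
y = np.array([27, 14, 6])            # hypothetical counts of correct responses

g = lambda theta, t: t ** -theta                     # mean function of the power model
dg = lambda theta, t: -np.log(t) * t ** -theta       # its derivative with respect to theta

def neg_loglik(theta):                               # -ln f(y | theta), first term of Eq. 7
    return -np.sum(binom.logpmf(y, n, g(theta, t)))

def sqrt_det_I(theta):                               # sqrt(det I(theta)) from Equation 6
    p = g(theta, t)
    return np.sqrt(np.sum(dg(theta, t) ** 2 / (p * (1.0 - p))))

fit = minimize_scalar(neg_loglik, bounds=(1e-6, 20.0), method="bounded")
V_M, _ = quad(sqrt_det_I, 1e-8, 50.0, limit=200)     # Riemannian volume (Equations 5 and 6)
k = 1
mdl = fit.fun + 0.5 * k * np.log(n / (2 * np.pi)) + np.log(V_M)
print(round(fit.x, 3), round(fit.fun, 2), round(V_M, 2), round(mdl, 2))
```

Computing the same quantity for a competing one-parameter model and choosing the smaller value implements the selection rule described in the text.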
The theoretical link between BMS and MDL also suggests that they may perform similarly in practice. Barron and Cover (1991) showed that BMS and MDL are asymptotically equivalent given large sample sizes; that is, both will converge to the true model if the true model is correctly specified. On the other hand, if models are misspecified and sample size is relatively small, they can yield disparate results, especially depending on the form of the parameter prior density used in the calculation of BMS. Despite these similarities, MDL has at least one advantage over BMS: The complexity measure is well understood. As mentioned above, complexity and GOF are not easily disentangled in the integral form of BMS (Table 2). In contrast, a clear understanding of the complexity term in MDL is provided by its counterpart in differential geometry, the geometric complexity measure. This is described in detail in the following section.

The latter two terms of the MDL criterion (Equation 7) readily lend themselves to a differential geometric interpretation, which is related to the Riemannian volume measure presented earlier. Conceptually, model selection using MDL proceeds by choosing the model that best approximates the true model by counting the number of distinguishable distributions that come close to the true model. Proximity to the true model is assessed by f(y|θ). Within the differential geometric approach, this corresponds to a volume measure in the space of probability distributions. The following volume, under the assumption of large sample size, is shown to be a valid measure of proximity (Balasubramanian, 1997; Myung, Balasubramanian, & Pitt, 2000): C_M = (2π/n)^(k/2) h(θ̂), where k is the number of parameters in the model and h(θ̂) is a data-dependent factor that goes to 1 as n grows large (some additional conditions are required; see Balasubramanian, 1997). Essentially, C_M represents the Riemannian volume of a small ellipsoid around θ̂, within which the probability of the data, f(y|θ), is appreciable.

⁷ The Fisher information matrix I(θ) of the MDL criterion is defined as I_ij(θ) = −(1/n) E(∂² ln f(y|θ)/∂θ_i ∂θ_j) (i, j = 1, ..., k) for the data vector y = (y1, ..., yn), where the y_q's are sample values of random variables Y_q (q = 1, ..., n; see, e.g., Rissanen, 1996, Equation 7). Further, if the Y_q's are independently and identically distributed, then the above I(θ) reduces to the Fisher information matrix of sample size 1, that is, I_ij(θ) = −E(∂² ln f(y_q|θ)/∂θ_i ∂θ_j) (i, j = 1, ..., k) for any q.

As such, it measures the number of distinguishable distributions that come close to the truth, as measured by predicting the data y with relatively high probability. However, C_M alone is not an adequate measure of proximity because the total number of distinguishable distributions of a model (V_M, the Riemannian volume, Equation 5) must also be considered. Inclusion of this additional measure leads to a volume ratio, V_M/C_M, which penalizes models for having an unnecessarily large number of distinguishable distributions (V_M) or having relatively few distinguishable distributions close to the truth (C_M). Taking the log of this ratio gives

ln(V_M/C_M) = (k/2) ln(n/2π) + ln ∫ dθ √det I(θ) − ln h(θ̂).   (8)

The first and second terms are independent of the true distribution as well as the data, and therefore represent an intrinsic property of the model. Together they will be called the geometric complexity of the model, and are invariant under reparameterization of the model. As sample size increases, the third term, which is data dependent, becomes negligible. When this occurs, the geometric complexity is equal to the complexity penalty in the MDL criterion in Equation 7. It is also worth noting that the first term of the geometric complexity measure increases logarithmically with sample size n, whereas the second term is independent of n. An implication of this is that as n grows large, the effects of complexity due to functional form, reflected through I(θ), will gradually diminish compared to those due to the number of parameters (k). Thus, functional form effects will have their greatest impact on model selection when sample size is small. Because small samples are the norm in experiments in much of cognitive psychology, it is imperative that the selection method be sensitive to this property of a model.

Differential geometry provides many valuable insights into model complexity and model selection. One is a new explication of MDL. The MDL selection criterion can be rewritten as follows:

MDL = −ln [f(y|θ̂) / (V_M/C_M)] = −ln[normalized f(y|θ̂)].   (9)

This reinterpretation provides a clearer picture of what MDL does in model selection. It selects the model that gives the highest value of the maximum likelihood per the relative ratio of distinguishable distributions (V_M/C_M). We might call this the normalized maximum likelihood. From this perspective, the better model is the one with many distinguishable distributions close to the truth but few distinguishable distributions overall.

Perhaps the most important insight provided by differential geometry is an intuitive understanding of the meaning of complexity in MDL: It measures the minus log of the volume of the distinguishable distributions in a model relative to those close to the truth. In this regard, the size of a model manifold in the space of distributions is what matters when measuring complexity. A model's functional form and its number of parameters can be misleading indicators of complexity because they are simply the apparatus by which a collection of distributions defined by the model is indexed. The geometric approach to complexity presented here makes it clear that neither the parameterization nor the specific functional form used in indexing is relevant so long as the same collection of distributions is catalogued on the model manifold. For example, the following two models, although assuming different functional forms, are equivalent and equally complex in the geometric sense:

Model A: y = exp(θ1 x1 + θ2 x2) + error,
Model B: y = η1^x1 η2^x2 + error,   (10)

where the error has zero mean and follows the same distribution for both models. Here, the parameters of Model A are related to the parameters of Model B through ηi = exp(θi), i = 1, 2.
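A small numerical check (illustrative only) of the equivalence claimed for Equation 10: with ηi = exp(θi), the two mean functions produce exactly the same predictions for any inputs, so the two parameterizations index the same collection of distributions.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.3, -0.7])                    # parameters of Model A
eta = np.exp(theta)                              # eta_i = exp(theta_i), parameters of Model B
x1, x2 = rng.uniform(0, 2, 10), rng.uniform(0, 2, 10)

pred_a = np.exp(theta[0] * x1 + theta[1] * x2)   # Model A mean function
pred_b = eta[0] ** x1 * eta[1] ** x2             # Model B mean function
assert np.allclose(pred_a, pred_b)               # identical predictions for every input
```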
Three Application Examples

Geometric complexity and MDL constitute a powerful pair of model evaluation tools. When used together in model testing, a deeper understanding of the relationship between models can be gained. The first measure enables one to assess the relative complexities of the set of models under consideration. The second builds on the first by suggesting which model is preferable given the data in hand. The following simulations demonstrate the application of these methods in three areas of cognitive modeling: psychophysics, information integration, and categorization.

In each example, two competing models with the same number of parameters but different functional forms were fitted to data sets generated by each of these models (human data were not used). Of interest is the ability of each selection method to recover the model that generated the data. A good selection method should be able to discriminate between data generated by one model and data generated by the other. That is, it should be able to see through the random variation in the data sample and accurately infer whether the model being tested generated the data it is being fit to. Errors are a sign of overgeneralization and reveal a bias in the selection method, which could be toward either the more complex or the simpler model. The ideal pattern of data is one in which each model generalizes best only to data generated by itself, not to data generated by the competing model. In the 2 × 2 sections of Tables 3-5, this corresponds to a mean selection criterion measure that is lowest in the upper left and lower right quadrants, with perfect recovery rates (100%) in these cells as well.

Four selection methods were compared: AIC, ICOMP, CV, and MDL. Given the close relationship between MDL and BMS, the latter was not included in the comparison. BIC and RMSD were also not included because of their equivalence to AIC in the present testing conditions. AIC can be expressed with BIC as a term in the equation: AIC = BIC + k(2 − ln n). Consequently, both methods will yield the same results when models with the same number of parameters (i.e., equal k) are compared. RMSD will generally yield the same outcome as well.⁸ A fuller discussion of the three simulations can be found in Zhang (1999).

⁸ When comparing among models with the same number of parameters, model selection under RMSD will be the same as that under AIC and BIC when errors are normally distributed and have equal variances. This is because in such cases the sum of squares error (SSE) in RMSD is monotonically related to −ln f(y|θ) in AIC (and BIC), and hence minimization of SSE is equivalent to maximization of the likelihood function.


More information

Numerical Methods Lecture 6 - Curve Fitting Techniques

Numerical Methods Lecture 6 - Curve Fitting Techniques Numerical Methods Lecture 6 - Curve Fittig Techiques Topics motivatio iterpolatio liear regressio higher order polyomial form expoetial form Curve fittig - motivatio For root fidig, we used a give fuctio

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Sprig 2017 A secod course i data miig http://www.it.uu.se/edu/course/homepage/ifoutv2/vt17/ Kjell Orsbor Uppsala Database Laboratory Departmet of Iformatio Techology, Uppsala Uiversity,

More information

MATHEMATICAL METHODS OF ANALYSIS AND EXPERIMENTAL DATA PROCESSING (Or Methods of Curve Fitting)

MATHEMATICAL METHODS OF ANALYSIS AND EXPERIMENTAL DATA PROCESSING (Or Methods of Curve Fitting) MATHEMATICAL METHODS OF ANALYSIS AND EXPERIMENTAL DATA PROCESSING (Or Methods of Curve Fittig) I this chapter, we will eamie some methods of aalysis ad data processig; data obtaied as a result of a give

More information

The isoperimetric problem on the hypercube

The isoperimetric problem on the hypercube The isoperimetric problem o the hypercube Prepared by: Steve Butler November 2, 2005 1 The isoperimetric problem We will cosider the -dimesioal hypercube Q Recall that the hypercube Q is a graph whose

More information

Designing a learning system

Designing a learning system CS 75 Machie Learig Lecture Desigig a learig system Milos Hauskrecht milos@cs.pitt.edu 539 Seott Square, x-5 people.cs.pitt.edu/~milos/courses/cs75/ Admiistrivia No homework assigmet this week Please try

More information

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON Roberto Lopez ad Eugeio Oñate Iteratioal Ceter for Numerical Methods i Egieerig (CIMNE) Edificio C1, Gra Capitá s/, 08034 Barceloa, Spai ABSTRACT I this work

More information

Data Analysis. Concepts and Techniques. Chapter 2. Chapter 2: Getting to Know Your Data. Data Objects and Attribute Types

Data Analysis. Concepts and Techniques. Chapter 2. Chapter 2: Getting to Know Your Data. Data Objects and Attribute Types Data Aalysis Cocepts ad Techiques Chapter 2 1 Chapter 2: Gettig to Kow Your Data Data Objects ad Attribute Types Basic Statistical Descriptios of Data Data Visualizatio Measurig Data Similarity ad Dissimilarity

More information

Counting Regions in the Plane and More 1

Counting Regions in the Plane and More 1 Coutig Regios i the Plae ad More 1 by Zvezdelia Stakova Berkeley Math Circle Itermediate I Group September 016 1. Overarchig Problem Problem 1 Regios i a Circle. The vertices of a polygos are arraged o

More information

Performance Plus Software Parameter Definitions

Performance Plus Software Parameter Definitions Performace Plus+ Software Parameter Defiitios/ Performace Plus Software Parameter Defiitios Chapma Techical Note-TG-5 paramete.doc ev-0-03 Performace Plus+ Software Parameter Defiitios/2 Backgroud ad Defiitios

More information

Learning to Shoot a Goal Lecture 8: Learning Models and Skills

Learning to Shoot a Goal Lecture 8: Learning Models and Skills Learig to Shoot a Goal Lecture 8: Learig Models ad Skills How do we acquire skill at shootig goals? CS 344R/393R: Robotics Bejami Kuipers Learig to Shoot a Goal The robot eeds to shoot the ball i the goal.

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

1.2 Binomial Coefficients and Subsets

1.2 Binomial Coefficients and Subsets 1.2. BINOMIAL COEFFICIENTS AND SUBSETS 13 1.2 Biomial Coefficiets ad Subsets 1.2-1 The loop below is part of a program to determie the umber of triagles formed by poits i the plae. for i =1 to for j =

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Protected points in ordered trees

Protected points in ordered trees Applied Mathematics Letters 008 56 50 www.elsevier.com/locate/aml Protected poits i ordered trees Gi-Sag Cheo a, Louis W. Shapiro b, a Departmet of Mathematics, Sugkyukwa Uiversity, Suwo 440-746, Republic

More information

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana The Closest Lie to a Data Set i the Plae David Gurey Southeaster Louisiaa Uiversity Hammod, Louisiaa ABSTRACT This paper looks at three differet measures of distace betwee a lie ad a data set i the plae:

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

Normal Distributions

Normal Distributions Normal Distributios Stacey Hacock Look at these three differet data sets Each histogram is overlaid with a curve : A B C A) Weights (g) of ewly bor lab rat pups B) Mea aual temperatures ( F ) i A Arbor,

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

Octahedral Graph Scaling

Octahedral Graph Scaling Octahedral Graph Scalig Peter Russell Jauary 1, 2015 Abstract There is presetly o strog iterpretatio for the otio of -vertex graph scalig. This paper presets a ew defiitio for the term i the cotext of

More information

Consider the following population data for the state of California. Year Population

Consider the following population data for the state of California. Year Population Assigmets for Bradie Fall 2016 for Chapter 5 Assigmet sheet for Sectios 5.1, 5.3, 5.5, 5.6, 5.7, 5.8 Read Pages 341-349 Exercises for Sectio 5.1 Lagrage Iterpolatio #1, #4, #7, #13, #14 For #1 use MATLAB

More information

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method A ew Morphological 3D Shape Decompositio: Grayscale Iterframe Iterpolatio Method D.. Vizireau Politehica Uiversity Bucharest, Romaia ae@comm.pub.ro R. M. Udrea Politehica Uiversity Bucharest, Romaia mihea@comm.pub.ro

More information

Big-O Analysis. Asymptotics

Big-O Analysis. Asymptotics Big-O Aalysis 1 Defiitio: Suppose that f() ad g() are oegative fuctios of. The we say that f() is O(g()) provided that there are costats C > 0 ad N > 0 such that for all > N, f() Cg(). Big-O expresses

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

Data Structures and Algorithms. Analysis of Algorithms

Data Structures and Algorithms. Analysis of Algorithms Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

Data diverse software fault tolerance techniques

Data diverse software fault tolerance techniques Data diverse software fault tolerace techiques Complemets desig diversity by compesatig for desig diversity s s limitatios Ivolves obtaiig a related set of poits i the program data space, executig the

More information

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only Edited: Yeh-Liag Hsu (998--; recommeded: Yeh-Liag Hsu (--9; last updated: Yeh-Liag Hsu (9--7. Note: This is the course material for ME55 Geometric modelig ad computer graphics, Yua Ze Uiversity. art of

More information

Lecture 2: Spectra of Graphs

Lecture 2: Spectra of Graphs Spectral Graph Theory ad Applicatios WS 20/202 Lecture 2: Spectra of Graphs Lecturer: Thomas Sauerwald & He Su Our goal is to use the properties of the adjacecy/laplacia matrix of graphs to first uderstad

More information

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by

More information

Stone Images Retrieval Based on Color Histogram

Stone Images Retrieval Based on Color Histogram Stoe Images Retrieval Based o Color Histogram Qiag Zhao, Jie Yag, Jigyi Yag, Hogxig Liu School of Iformatio Egieerig, Wuha Uiversity of Techology Wuha, Chia Abstract Stoe images color features are chose

More information

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov Sortig i Liear Time Data Structures ad Algorithms Adrei Bulatov Algorithms Sortig i Liear Time 7-2 Compariso Sorts The oly test that all the algorithms we have cosidered so far is compariso The oly iformatio

More information

Convex hull ( 凸殻 ) property

Convex hull ( 凸殻 ) property Covex hull ( 凸殻 ) property The covex hull of a set of poits S i dimesios is the itersectio of all covex sets cotaiig S. For N poits P,..., P N, the covex hull C is the give by the expressio The covex hull

More information

Revisiting the performance of mixtures of software reliability growth models

Revisiting the performance of mixtures of software reliability growth models Revisitig the performace of mixtures of software reliability growth models Peter A. Keiller 1, Charles J. Kim 1, Joh Trimble 1, ad Marlo Mejias 2 1 Departmet of Systems ad Computer Sciece 2 Departmet of

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

Recursive Procedures. How can you model the relationship between consecutive terms of a sequence?

Recursive Procedures. How can you model the relationship between consecutive terms of a sequence? 6. Recursive Procedures I Sectio 6.1, you used fuctio otatio to write a explicit formula to determie the value of ay term i a Sometimes it is easier to calculate oe term i a sequece usig the previous terms.

More information

Alpha Individual Solutions MAΘ National Convention 2013

Alpha Individual Solutions MAΘ National Convention 2013 Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

ECE4050 Data Structures and Algorithms. Lecture 6: Searching

ECE4050 Data Structures and Algorithms. Lecture 6: Searching ECE4050 Data Structures ad Algorithms Lecture 6: Searchig 1 Search Give: Distict keys k 1, k 2,, k ad collectio L of records of the form (k 1, I 1 ), (k 2, I 2 ),, (k, I ) where I j is the iformatio associated

More information

SAMPLE VERSUS POPULATION. Population - consists of all possible measurements that can be made on a particular item or procedure.

SAMPLE VERSUS POPULATION. Population - consists of all possible measurements that can be made on a particular item or procedure. SAMPLE VERSUS POPULATION Populatio - cosists of all possible measuremets that ca be made o a particular item or procedure. Ofte a populatio has a ifiite umber of data elemets Geerally expese to determie

More information

Image Segmentation EEE 508

Image Segmentation EEE 508 Image Segmetatio Objective: to determie (etract) object boudaries. It is a process of partitioig a image ito distict regios by groupig together eighborig piels based o some predefied similarity criterio.

More information

Appendix A. Use of Operators in ARPS

Appendix A. Use of Operators in ARPS A Appedix A. Use of Operators i ARPS The methodology for solvig the equatios of hydrodyamics i either differetial or itegral form usig grid-poit techiques (fiite differece, fiite volume, fiite elemet)

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

OCR Statistics 1. Working with data. Section 3: Measures of spread

OCR Statistics 1. Working with data. Section 3: Measures of spread Notes ad Eamples OCR Statistics 1 Workig with data Sectio 3: Measures of spread Just as there are several differet measures of cetral tedec (averages), there are a variet of statistical measures of spread.

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

15 UNSUPERVISED LEARNING

15 UNSUPERVISED LEARNING 15 UNSUPERVISED LEARNING [My father] advised me to sit every few moths i my readig chair for a etire eveig, close my eyes ad try to thik of ew problems to solve. I took his advice very seriously ad have

More information

Fast Fourier Transform (FFT) Algorithms

Fast Fourier Transform (FFT) Algorithms Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

Designing a learning system

Designing a learning system CS 75 Itro to Machie Learig Lecture Desigig a learig system Milos Hauskrecht milos@pitt.edu 539 Seott Square, -5 people.cs.pitt.edu/~milos/courses/cs75/ Admiistrivia No homework assigmet this week Please

More information

A Novel Feature Extraction Algorithm for Haar Local Binary Pattern Texture Based on Human Vision System

A Novel Feature Extraction Algorithm for Haar Local Binary Pattern Texture Based on Human Vision System A Novel Feature Extractio Algorithm for Haar Local Biary Patter Texture Based o Huma Visio System Liu Tao 1,* 1 Departmet of Electroic Egieerig Shaaxi Eergy Istitute Xiayag, Shaaxi, Chia Abstract The locality

More information

1 Enterprise Modeler

1 Enterprise Modeler 1 Eterprise Modeler Itroductio I BaaERP, a Busiess Cotrol Model ad a Eterprise Structure Model for multi-site cofiguratios are itroduced. Eterprise Structure Model Busiess Cotrol Models Busiess Fuctio

More information

CS 683: Advanced Design and Analysis of Algorithms

CS 683: Advanced Design and Analysis of Algorithms CS 683: Advaced Desig ad Aalysis of Algorithms Lecture 6, February 1, 2008 Lecturer: Joh Hopcroft Scribes: Shaomei Wu, Etha Feldma February 7, 2008 1 Threshold for k CNF Satisfiability I the previous lecture,

More information

Our second algorithm. Comp 135 Machine Learning Computer Science Tufts University. Decision Trees. Decision Trees. Decision Trees.

Our second algorithm. Comp 135 Machine Learning Computer Science Tufts University. Decision Trees. Decision Trees. Decision Trees. Comp 135 Machie Learig Computer Sciece Tufts Uiversity Fall 2017 Roi Khardo Some of these slides were adapted from previous slides by Carla Brodley Our secod algorithm Let s look at a simple dataset for

More information

BAYESIAN WITH FULL CONDITIONAL POSTERIOR DISTRIBUTION APPROACH FOR SOLUTION OF COMPLEX MODELS. Pudji Ismartini

BAYESIAN WITH FULL CONDITIONAL POSTERIOR DISTRIBUTION APPROACH FOR SOLUTION OF COMPLEX MODELS. Pudji Ismartini Proceedig of Iteratioal Coferece O Research, Implemetatio Ad Educatio Of Mathematics Ad Scieces 014, Yogyakarta State Uiversity, 18-0 May 014 BAYESIAN WIH FULL CONDIIONAL POSERIOR DISRIBUION APPROACH FOR

More information

Case Studies in the use of ROC Curve Analysis for Sensor-Based Estimates in Human Computer Interaction

Case Studies in the use of ROC Curve Analysis for Sensor-Based Estimates in Human Computer Interaction Case Studies i the use of ROC Curve Aalysis for Sesor-Based Estimates i Huma Computer Iteractio James Fogarty Rya S. Baker Scott E. Hudso Huma Computer Iteractio Istitute Caregie Mello Uiversity Abstract

More information

Markov Chain Model of HomePlug CSMA MAC for Determining Optimal Fixed Contention Window Size

Markov Chain Model of HomePlug CSMA MAC for Determining Optimal Fixed Contention Window Size Markov Chai Model of HomePlug CSMA MAC for Determiig Optimal Fixed Cotetio Widow Size Eva Krimiger * ad Haiph Latchma Dept. of Electrical ad Computer Egieerig, Uiversity of Florida, Gaiesville, FL, USA

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 1 Computers ad Programs 1 Objectives To uderstad the respective roles of hardware ad software i a computig system. To lear what computer scietists

More information

New HSL Distance Based Colour Clustering Algorithm

New HSL Distance Based Colour Clustering Algorithm The 4th Midwest Artificial Itelligece ad Cogitive Scieces Coferece (MAICS 03 pp 85-9 New Albay Idiaa USA April 3-4 03 New HSL Distace Based Colour Clusterig Algorithm Vasile Patrascu Departemet of Iformatics

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

35 YEARS OF ADVANCEMENTS WITH THE COMPLEX VARIABLE BOUNDARY ELEMENT METHOD

35 YEARS OF ADVANCEMENTS WITH THE COMPLEX VARIABLE BOUNDARY ELEMENT METHOD N. J. DeMoes et al., It. J. Comp. Meth. ad Exp. Meas., Vol. 0, No. 0 (08) 3 35 YEARS OF ADVANCEMENTS WITH THE COMPLEX VARIABLE BOUNDARY ELEMENT METHOD Noah J. DeMoes, Gabriel T. Ba, Bryce D. Wilkis, Theodore

More information

Data Warehousing. Paper

Data Warehousing. Paper Data Warehousig Paper 28-25 Implemetig a fiacial balace scorecard o top of SAP R/3, usig CFO Visio as iterface. Ida Carapelle & Sophie De Baets, SOLID Parters, Brussels, Belgium (EUROPE) ABSTRACT Fiacial

More information

Parabolic Path to a Best Best-Fit Line:

Parabolic Path to a Best Best-Fit Line: Studet Activity : Fidig the Least Squares Regressio Lie By Explorig the Relatioship betwee Slope ad Residuals Objective: How does oe determie a best best-fit lie for a set of data? Eyeballig it may be

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

An Improved Shuffled Frog-Leaping Algorithm for Knapsack Problem

An Improved Shuffled Frog-Leaping Algorithm for Knapsack Problem A Improved Shuffled Frog-Leapig Algorithm for Kapsack Problem Zhoufag Li, Ya Zhou, ad Peg Cheg School of Iformatio Sciece ad Egieerig Hea Uiversity of Techology ZhegZhou, Chia lzhf1978@126.com Abstract.

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Descriptive Statistics Summary Lists

Descriptive Statistics Summary Lists Chapter 209 Descriptive Statistics Summary Lists Itroductio This procedure is used to summarize cotiuous data. Large volumes of such data may be easily summarized i statistical lists of meas, couts, stadard

More information

On (K t e)-saturated Graphs

On (K t e)-saturated Graphs Noame mauscript No. (will be iserted by the editor O (K t e-saturated Graphs Jessica Fuller Roald J. Gould the date of receipt ad acceptace should be iserted later Abstract Give a graph H, we say a graph

More information

Administrative UNSUPERVISED LEARNING. Unsupervised learning. Supervised learning 11/25/13. Final project. No office hours today

Administrative UNSUPERVISED LEARNING. Unsupervised learning. Supervised learning 11/25/13. Final project. No office hours today Admiistrative Fial project No office hours today UNSUPERVISED LEARNING David Kauchak CS 451 Fall 2013 Supervised learig Usupervised learig label label 1 label 3 model/ predictor label 4 label 5 Supervised

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

BASED ON ITERATIVE ERROR-CORRECTION

BASED ON ITERATIVE ERROR-CORRECTION A COHPARISO OF CRYPTAALYTIC PRICIPLES BASED O ITERATIVE ERROR-CORRECTIO Miodrag J. MihaljeviC ad Jova Dj. GoliC Istitute of Applied Mathematics ad Electroics. Belgrade School of Electrical Egieerig. Uiversity

More information

Empirical Validate C&K Suite for Predict Fault-Proneness of Object-Oriented Classes Developed Using Fuzzy Logic.

Empirical Validate C&K Suite for Predict Fault-Proneness of Object-Oriented Classes Developed Using Fuzzy Logic. Empirical Validate C&K Suite for Predict Fault-Proeess of Object-Orieted Classes Developed Usig Fuzzy Logic. Mohammad Amro 1, Moataz Ahmed 1, Kaaa Faisal 2 1 Iformatio ad Computer Sciece Departmet, Kig

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045 Oe Brookigs Drive St. Louis, Missouri 63130-4899, USA jaegerg@cse.wustl.edu

More information

Kernel Smoothing Function and Choosing Bandwidth for Non-Parametric Regression Methods 1

Kernel Smoothing Function and Choosing Bandwidth for Non-Parametric Regression Methods 1 Ozea Joural of Applied Scieces (), 009 Ozea Joural of Applied Scieces (), 009 ISSN 943-49 009 Ozea Publicatio Kerel Smoothig Fuctio ad Choosig Badwidth for No-Parametric Regressio Methods Murat Kayri ad

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

27 Refraction, Dispersion, Internal Reflection

27 Refraction, Dispersion, Internal Reflection Chapter 7 Refractio, Dispersio, Iteral Reflectio 7 Refractio, Dispersio, Iteral Reflectio Whe we talked about thi film iterferece, we said that whe light ecouters a smooth iterface betwee two trasparet

More information

1. Introduction o Microscopic property responsible for MRI Show and discuss graphics that go from macro to H nucleus with N-S pole

1. Introduction o Microscopic property responsible for MRI Show and discuss graphics that go from macro to H nucleus with N-S pole Page 1 Very Quick Itroductio to MRI The poit of this itroductio is to give the studet a sufficietly accurate metal picture of MRI to help uderstad its impact o image registratio. The two major aspects

More information

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a 4. [10] Usig a combiatorial argumet, prove that for 1: = 0 = Let A ad B be disjoit sets of cardiality each ad C = A B. How may subsets of C are there of cardiality. We are selectig elemets for such a subset

More information

Package popkorn. R topics documented: February 20, Type Package

Package popkorn. R topics documented: February 20, Type Package Type Pacage Pacage popkor February 20, 2015 Title For iterval estimatio of mea of selected populatios Versio 0.3-0 Date 2014-07-04 Author Vi Gopal, Claudio Fuetes Maitaier Vi Gopal Depeds

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

Task scenarios Outline. Scenarios in Knowledge Extraction. Proposed Framework for Scenario to Design Diagram Transformation

Task scenarios Outline. Scenarios in Knowledge Extraction. Proposed Framework for Scenario to Design Diagram Transformation 6-0-0 Kowledge Trasformatio from Task Scearios to View-based Desig Diagrams Nima Dezhkam Kamra Sartipi {dezhka, sartipi}@mcmaster.ca Departmet of Computig ad Software McMaster Uiversity CANADA SEKE 08

More information

ENGI 4421 Probability and Statistics Faculty of Engineering and Applied Science Problem Set 1 Descriptive Statistics

ENGI 4421 Probability and Statistics Faculty of Engineering and Applied Science Problem Set 1 Descriptive Statistics ENGI 44 Probability ad Statistics Faculty of Egieerig ad Applied Sciece Problem Set Descriptive Statistics. If, i the set of values {,, 3, 4, 5, 6, 7 } a error causes the value 5 to be replaced by 50,

More information

Ch 9.3 Geometric Sequences and Series Lessons

Ch 9.3 Geometric Sequences and Series Lessons Ch 9.3 Geometric Sequeces ad Series Lessos SKILLS OBJECTIVES Recogize a geometric sequece. Fid the geeral, th term of a geometric sequece. Evaluate a fiite geometric series. Evaluate a ifiite geometric

More information