Minimum Description Length Model Selection for Multinomial Processing Tree Models


Hao Wu (Ohio State University)
Jay I. Myung (Ohio State University)
William H. Batchelder (University of California, Irvine)

December 25, 2008

Word count of text: 4000
Running head: Model Selection
Corresponding author and address: Hao Wu, Department of Psychology, Ohio State University, 1835 Neil Avenue, Columbus, Ohio. Tel:

Abstract

Multinomial processing tree (MPT) modeling, since its introduction in the 1980s, has been widely and successfully applied as a statistical tool for measuring hypothesized latent cognitive processes in selected experimental paradigms. This paper concerns the problem of selecting among a set of scientifically plausible MPT models given observed data. The likelihood ratio test (LRT) is often employed in model selection for MPT models, but it comes with a number of assumptions and constraints that can limit its applicability in practice. In this paper, we introduce a minimum description length (MDL) based model selection approach for MPT modeling that overcomes some of the shortcomings of the LRT. The application of the MDL approach is demonstrated for a well-studied family of MPT models for the source-monitoring paradigm. The results indicate that applying MDL in model selection can lead to scientific conclusions about underlying cognitive functioning that differ from those based on the LRT.

1 Introduction

Multinomial processing tree (MPT) modeling is a statistical methodology for measuring latent cognitive capacities in selected experimental paradigms (Batchelder & Riefer, 1990, 1999). In such experiments, participants make categorical responses to a series of test items, and an MPT model parameterizes a subset of probability distributions over the response categories by specifying a processing tree designed to represent hypothesized cognitive steps in performing a cognitive task, such as memory encoding, storage, discrimination, inference, guessing, and retrieval. Since their introduction in the 1980s, MPT models have been successfully applied to modeling performance in a range of cognitive tasks in many areas, including memory, perception, and social psychology. This paper is concerned with the problem of selecting the best MPT model from a set of scientifically plausible MPT models that are available to account for a given set of data.
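To fix ideas about the G²-based likelihood ratio test discussed below, here is a minimal generic sketch (not code from the paper) of the test statistic for two nested models fit by maximum likelihood; the log-likelihood values in the example are hypothetical:

```python
def g_squared(lnL_general, lnL_restricted):
    """G^2 = 2 * (log-likelihood of the general model minus that of the
    restricted, nested model). Under the restricted (null) model, G^2 is
    asymptotically chi-square distributed with df equal to the difference
    in the number of free parameters."""
    if lnL_general < lnL_restricted:
        raise ValueError("a general model cannot fit worse than a model nested in it")
    return 2.0 * (lnL_general - lnL_restricted)

# Hypothetical fitted log-likelihoods. With one parameter freed, the
# statistic is compared against the chi-square critical value 3.84
# (df = 1, alpha = .05).
g2 = g_squared(-100.0, -103.5)  # 7.0, so the restricted model is rejected
```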
A researcher may entertain multiple scientific hypotheses about the underlying processes, each formulated as a distinct MPT model, and wish to determine which of the models best describes observed data in some defined sense. This is the model selection problem (Myung & Pitt, 1997). The G²-based likelihood ratio test (LRT) is often employed in model evaluation and selection for MPT models (e.g., Hu & Batchelder, 1994). The basis of this approach is to evaluate the adequacy of a model within the null hypothesis significance testing framework. Although the LRT is generally useful for model evaluation in this context, it has several limitations in its application to model selection with multiple models. First, it can only be used to test a pair of nested models with different numbers of parameters. For this reason, it cannot compare models with the same number of parameters but different functional forms, nor can it test inequality constraints on the parameters of a model. Furthermore, the LRT is a null hypothesis significance testing procedure, whose goal is to assess the descriptive adequacy of a given model, measured by the p-value. Consequently, the LRT does not necessarily help identify the best approximating model to the truth, the hallmark of model selection. This criterion is known as generalizability or predictive accuracy in statistics and refers to a model's ability to predict future, as yet unseen, data samples (Pitt, Myung & Zhang, 2002). Various model selection methods that measure a model's generalizability have been proposed. They include the Akaike Information Criterion (AIC: Akaike, 1973), the Bayesian Information Criterion (BIC: Schwarz, 1978), and Minimum Description Length (MDL: Grünwald, Myung & Pitt, 2005; Grünwald, 2007). [1] Among these, MDL presents itself as an attractive alternative to the LRT in MPT modeling because it overcomes the aforementioned limitations of the LRT. That is, MDL can be used for selecting among a set of multiple, nested or non-nested, models that might differ in model structure as well as in number of parameters. In this paper, we assess and demonstrate the applicability of the MDL approach for selecting among MPT models using a well-studied family of MPT models for the source-monitoring paradigm.

2 Minimum Description Length

The MDL principle originated from algorithmic coding theory in computer science, and it views statistical modeling as data compression (Grünwald, 2007). The best model is the one that compresses the data as tightly as possible. A model's ability to compress data is measured by the shortest code length with which the data can be coded with the help of the model. The resulting code length is related to generalizability: the shorter the code length, the better the model generalizes. The most widely used implementation of the MDL principle for model selection is the Fisher Information Approximation (FIA: Rissanen, 1996). For a model with probability density function f(y | θ) and parameter space Θ, it is defined as

FIA = -LML + C_FIA,   (1)

where

LML = ln f(y | θ̂(y)),
C_FIA = (S/2) ln(N / 2π) + ln ∫_Θ √|I(θ)| dθ.   (2)

LML is the natural logarithm of the maximized likelihood function and serves as a goodness-of-fit term, and C_FIA is the term that measures the complexity of the model. [2] In the equations, S is the number of parameters, N is the sample size, and I(θ) is the Fisher information matrix for sample size 1, defined as

I(θ)_ij = -E[∂² ln f(x₁ | θ) / (∂θᵢ ∂θⱼ)],   i, j = 1, ..., S.

[1] The reader is directed to two recent Journal of Mathematical Psychology special issues on model selection for discussion and example applications of these and other methods of model selection (Myung, Forster & Browne, 2000; Wagenmakers & Waldorp, 2006).
[2] C_FIA is derived as a large-sample approximation to the full, but much harder to compute, MDL complexity measure (Grünwald, 2007), and as such it can be inaccurate (see, e.g., Navarro, 2004). The reader should note that the G²-based LRT is also based on a large-sample approximation.

|I(θ)| denotes its determinant. A smaller criterion value indicates better generalizability, and thus the model that minimizes FIA should be chosen. The FIA complexity measure, C_FIA, takes into account not only the number of parameters (S) and the sample size (N) but also, importantly, functional form, captured through the Fisher information matrix. This is unlike AIC or BIC, which consider only the number of parameters as a measure of model complexity. On the same natural-log scale as FIA, the two criteria are defined as AIC = -LML + S and BIC = -LML + (S/2) ln N. The FIA has been successfully applied to addressing various model selection issues in cognitive modeling (e.g., Lee, 2001; Pitt, Myung & Zhang, 2002; Myung, Pitt & Navarro, 2007).

3 Model Selection for MPT Source Monitoring Models

In the remainder of the paper we illustrate the application of FIA to selecting among memory models of source monitoring (Batchelder & Riefer, 1990; Bayen, Murnane & Erdfelder, 1996). In a source monitoring study, participants are exposed to items from two sources (A and B) and are then required to categorize test items as coming from source A, source B, or as new (N). The data in any condition can be represented in a 3 × 3 table, where the rows are the test item types (A, B, N) and the columns are the corresponding response categories. All of the models discussed in this section are special cases of the model in Figure 1, known as the two-high-threshold source monitoring model (2HTM). The model in Figure 1 has three trees corresponding to the three test item types. It has parameters for detecting an item as old or new (the D parameters), parameters for discriminating the source of a detected old item (the d parameters), and guessing parameters (a, g, and b). In the case of MPT models, differences in functional form arise from different tree structures or different assignments of parameters or categories to the nodes of the tree.
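For concreteness, the category probabilities generated by the three trees of the 2HTM can be written out in a few lines. The expressions below follow the standard 2HTM tree equations and the parameter definitions given in the Figure 1 caption; treat them as an illustrative sketch rather than a transcription of the original figure:

```python
def two_htm_probs(D1, D2, D3, d1, d2, a, b, g):
    """Return P(respond "A", "B", "N") for source A items, source B items,
    and new items under the two-high-threshold source monitoring model."""
    source_a = (
        D1 * d1 + D1 * (1 - d1) * a + (1 - D1) * b * g,         # respond "A"
        D1 * (1 - d1) * (1 - a) + (1 - D1) * b * (1 - g),       # respond "B"
        (1 - D1) * (1 - b),                                     # respond "N"
    )
    source_b = (
        D2 * (1 - d2) * a + (1 - D2) * b * g,
        D2 * d2 + D2 * (1 - d2) * (1 - a) + (1 - D2) * b * (1 - g),
        (1 - D2) * (1 - b),
    )
    new = (
        (1 - D3) * b * g,
        (1 - D3) * b * (1 - g),
        D3 + (1 - D3) * (1 - b),
    )
    return source_a, source_b, new
```

Each row of the resulting 3 × 3 table sums to 1 for any admissible parameter values, and setting D3 = 0 yields a one-high-threshold special case.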
Batchelder & Riefer (1990) provided a general one-high-threshold source monitoring model (1HTM), which is a special case of the 2HTM obtained by requiring D3 = 0. They further defined seven MPT models (Batchelder & Riefer, 1990, Figure 2) obtained by equating the detection parameters (D1 = D2), the discrimination parameters (d1 = d2), and/or the guessing parameters (a = g). Figure 2 shows the C_FIA curves [3] for six of the eight models that are identified. [4] The first thing to note in the figure is that complexity is generally ordered according to the number of parameters. Note also that among models with the same number of parameters, complexity values can differ substantially from one another, sometimes by more than the complexity difference due to the difference in the number of parameters. A case in point is the three models 4, 5b, and 5c. When new items make up 50% of the test items, the complexity difference between models 5c and 5b is about 1.50, which is greater than the complexity difference (0.62) between models 5b and 4. In short, the results demonstrate that model complexity is determined not only by the number of parameters but also, importantly, by functional form (i.e., tree structure, parameter assignment, and category assignment), and sometimes even more so by the latter.

[3] The C_FIA values in this paper were calculated with a MATLAB program for arbitrary MPT models. Please contact the first author for more details.
[4] An MPT model is identified if different parameter values generate different category probabilities.

Bayen, Murnane & Erdfelder (1996, Appendix C) published source monitoring data from a 3 (distractor similarity) × 3 (source similarity) factorial experiment designed to select among different versions of the 2HTM. The goal was to experimentally manipulate conditions designed to affect the different processes represented by parameters in the model. In each condition of their study, participants listened to statements from two narrative sources (A and B) about a hypothetical missing person. They were then asked to categorize test words as coming from narrator A, narrator B, or as new (N). The data collected were summarized into nine 3 × 3 tables, one for each of the nine experimental conditions (Bayen, Murnane & Erdfelder, 1996, Appendix C). The purpose of their study was to evaluate the appropriateness of three different MPT models of source monitoring: the one-high-threshold four-parameter model (1HTM-4; model 4 in Batchelder & Riefer, 1990, Figure 2), the two-equal-high-threshold four-parameter model (2eHTM-4; model 4 in Bayen, Murnane & Erdfelder, 1996, Figure 4), and the low-threshold model. We will focus only on the 1HTM-4 and 2eHTM-4 models. Both models have four parameters (D, d, b, g) and are special cases of the 2HTM assuming D1 = D2 (= D), d1 = d2 (= d), and a = g. The difference is that in the 1HTM-4 there is no threshold for new items (D3 = 0), while in the 2eHTM-4 the threshold for new items is assumed equal to that of the old items (D3 = D1 = D2). Based on signal detection considerations, six experimental hypotheses, shown in Table 1, were constructed to test the validity of each model. For example, increasing distractor similarity should decrease old item detection (H1), increasing source similarity should decrease source discrimination (H4), and so on. To test the six hypotheses for each model, Bayen and colleagues relied on multiple pairwise LRTs.
Fixing one factor at a given level, an LRT of the null hypothesis of parameter equality across the three levels of the other factor was performed. In this way, a total of 18 LRTs were performed for each model. For the 2eHTM-4, all but one of the tests supported the six hypotheses; for the 1HTM-4, however, H2 was not supported. The authors therefore concluded that the 2eHTM was the more valid of the two models. The results from multiple LRTs, however clear-cut they may seem, should be taken with a grain of salt. In addition to the conceptual and technical weaknesses of the LRT discussed in Section 1, the following concerns should be noted. First, the results of the 18 tests, if taken as a single hypothesis, may not hold true, because the overall Type I error rate was not protected (with 18 independent tests each conducted at the .05 level, the familywise error rate can be as high as 1 - 0.95^18 ≈ .60). Furthermore, the parametric constraints of the 2eHTM-4 may not be realistic. In particular, the equality constraint on the three detection parameters, D1 = D2 = D3, seems unnecessarily stringent. Rather, the assumption that the detection parameter D3 for new items differs from D1 and D2 for old items seems more reasonable. In fact, the equality constraint was imposed not for theoretical reasons but primarily for a technical one, namely to make the model identifiable (Bayen, Murnane & Erdfelder, 1996, p. 206). We now analyze the same data set using FIA after making two modifications to the original MPT model of Bayen, Murnane & Erdfelder (1996) and to the way in which the data were used to evaluate the model. First, our analysis is based on the 2HTM with five parameters (D, D', d, b, g), where D (= D1 = D2) is the detection parameter for both source A and source B items and D' (= D3) is the detection parameter for new items. Since this model, 2HTM-5, is not identified in each separate condition of the 3 × 3 design, we devised a new model testing scheme that addresses this issue, which leads to the second modification: instead of fitting the same model repeatedly to the data from each condition, as done in Bayen, Murnane & Erdfelder (1996), we develop one super 2HTM-5 with relevant constraints on the parameters across treatments and evaluate this model just once against the entire data set. In particular, we use the equality constraints in H2 (for both D and D') and H3 for this purpose. With this specification, the model is identified, and the questionable constraint of D' = D within each treatment can therefore be removed. It is worth mentioning that we could also construct many such super models, nested or non-nested, each corresponding to a different combination of H1 and H4-H6 described earlier. With these changes implemented, we applied FIA to select among MPT models, and the results are compared to those based on AIC and BIC. A total of 15 models with various combinations of equality and inequality constraints on the parameters were constructed and compared simultaneously with one another using FIA. The results are presented in Tables 2 and 3. Note that the non-identifiable 2HTM-5 assumes five parameters for each of the nine conditions of the experimental design; however, each of the equality constraints in H2 (for both D and D') and H3 removes six parameters, resulting in an identified model with (5 × 9) - (6 × 3) = 27 parameters for the entire data set. The resulting model is denoted M0-0 in Table 2. This least constrained model, which nests all other models in the tables, serves as a baseline for the analysis described next. From model M0-0, the other 14 models were derived by incorporating additional equality constraints into the model.
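The parameter count of the super model can be checked with a few lines of arithmetic. The "six parameters removed per constraint" figure arises because each equality constraint ties together the 3 levels of one factor (removing 2 parameters) at each of the 3 levels of the other factor:

```python
params_per_condition = 5              # D, D', d, b, g in the 2HTM-5
n_conditions = 3 * 3                  # distractor x source similarity design
n_constraints = 3                     # H2 for D, H2 for D', and H3 for d
removed_per_constraint = 3 * (3 - 1)  # 2 parameters removed at each of 3 levels

n_free = (params_per_condition * n_conditions
          - n_constraints * removed_per_constraint)
print(n_free)  # 27, the parameter count of model M0-0
```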
They are grouped into three subgroups: (a) the 2HTMs in group M0 assume separate detection thresholds for old and new items; (b) the 1HTMs in group M1 assume no detection threshold for new items (i.e., D' = 0); and (c) the 2eHTMs in group M2 assume equal thresholds for old and new items (i.e., D' = D). Within each subgroup, multiple models are constructed by imposing additional equality constraints from H1 and H4-H6 to address relevant theoretical issues. A total of 14 such models are presented for evaluation. [5] Of these, models M1-3 and M2-3 are noteworthy, as they are fully compatible with H1-H6 as stated in Bayen, Murnane & Erdfelder (1996). Results from the FIA analyses of the 15 models, along with AICs and BICs, are summarized in Tables 2 and 3. It should be noted that hypotheses H1 (for both D and D'), H4, and H5 involve inequality constraints on the parameters when the corresponding equality constraints are not assumed. Both versions of each of the 15 models, with and without those inequality constraints, are included in the analysis. [6] Because the inequality constraints are not violated in this data set for any of the 15 models, both versions give the same LML, AIC, and BIC results; the only difference made by the inclusion of the inequality constraints is in the FIA results. The FIA result for the version of a model without inequality constraints is listed in the rows marked FIA, while that for the version with inequality constraints is listed in the rows marked FIAi.

[5] Models whose fit is very poor (-LML > 239) and models that lead to only a very small improvement in fit over a more constrained model (improvement in -LML of less than 1) are excluded, because they would not be selected by any of the model selection criteria. Model M0-4 is not identified; S = 16 denotes its effective number of parameters. The identification condition max_j d_j = 1 was employed to obtain LML and C_FIA.
[6] This applies to b only when H6 holds.

Among the four 1HTMs, M1-x (x = 1, ..., 4), FIA selects model M1-3 with inequality constraints. As for AIC and BIC, the former favors M1-3 whereas the latter favors M1-4. Among the five 2eHTM models, all three selection criteria choose model M2-3 within this class. Between the versions of this model with and without inequality constraints, FIA favors the former. In short, M1-3 and M2-3 with inequality constraints on the parameters turn out to be favored within their respective classes. This conclusion only partially agrees with that of Bayen, Murnane & Erdfelder (1996) based on multiple LRTs, which pointed to the 2eHTM M2-3 but not the 1HTM M1-3. In addition, because both M1-3 and M2-3 are consistent with the six hypotheses, our analysis does not support their conclusion that the 2eHTM describes the source monitoring data better than the 1HTM. When we compare all 15 models, including the five in group M0, all three criteria unanimously select the 2HTM M0-4. In particular, between the versions of model M0-4 with and without inequality constraints, FIA favors the one with the constraints. The final choice of this model as the best generalizing model of all those considered has two important implications for interpreting the Bayen, Murnane & Erdfelder (1996) data. First, the model does not make the assumption D' = D, casting doubt on the equality assumption made in Bayen, Murnane & Erdfelder (1996). Second, note that under model M0-4, both D and b are set to be the same for all treatments, but D' and d are allowed to vary with distractor and source similarity levels, respectively, in the desired direction. This implies that besides H2 and H3, which serve as the baseline, the data do not support H5, partially support H1, and fully support H4 and H6.
To summarize, the reanalysis of the Bayen, Murnane & Erdfelder (1996) data using FIA has led to interpretations of the data that are substantively different from those based on their LRTs. According to our conclusions, under the experimental manipulations both the 1HTM and the 2eHTM exhibit the pattern of parameter change specified in the six hypotheses, and there is no evidence favoring the 2eHTM over the 1HTM. Furthermore, neither D' = D nor D' = 0 is supported by our model selection approach. The chosen model has different thresholds for old and new items. In this model, both guessing parameters are insensitive to the experimental manipulations, [7] increasing distractor similarity decreases the threshold for new items but not for old items, and increasing source similarity decreases source discriminability.

[7] If models assuming equal g over the nine treatments are included in the comparison, FIA chooses the version of M0-4 with constant g and inequality constraints, with FIA = 201.5, lower than the second-lowest FIA value.

4 Conclusions

MDL's flexibility of application to virtually any model comparison situation that may arise in MPT modeling makes it an attractive alternative to the traditional LRT approach. It is important to note that MDL represents a model selection approach in which the models in contention are ranked by their generalizability or, equivalently, predictive accuracy, the hallmark of model selection. In the current application, hypotheses designed to test the validity of a model were specified in super models using various combinations of equality and inequality restrictions on the parameters. In this way, several potentially valid models could be compared using FIA rather than through a series of LRTs based on a null hypothesis testing approach. This approach may be especially useful for choosing a valid model to use in investigating a number of future data sets of interest. In the present study we focused our investigation on FIA because of its unique advantage in describing model complexity. Other methods of MPT model selection not discussed in the present paper include the Bayes factor (BF: Kass & Raftery, 1995) and the Population-Parameter Mapping (PPM) method (Chechile, 1998). Both methods originate in Bayesian statistics, and the PPM method in particular is unique in that it can handle misspecified models. Unlike FIA, however, neither of these methods lends itself to furnishing a stand-alone complexity measure. This is why we did not specifically include them in the present investigation. In conclusion, model selection lies at the core of the scientific inference process. Accordingly, a theoretically well-justified and widely applicable methodology can help advance science. We believe that MDL represents such a methodology, providing versatile yet powerful tools for assessing the validity of MPT models in a way that goes beyond the shortcomings of current methods such as AIC, BIC, and the LRT. One final note: statistical model selection techniques alone, however sophisticated, are not a panacea for all inference problems. Other, non-statistical means of model evaluation, such as plausibility, interpretability, and explanatory adequacy, are equally, if not more, important.
Statistical model selection would be most useful if it were combined with the judicious use of sound, if subjective, scientific judgment.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Reprinted in Kotz, S. and Johnson, N. L. (Eds.) (1992), Breakthroughs in Statistics. New York: Springer-Verlag.
Batchelder, W. H. and Riefer, D. M. (1990). Multinomial processing models of source monitoring. Psychological Review, 97.
Batchelder, W. H. and Riefer, D. M. (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6.
Bayen, U. J., Murnane, K. and Erdfelder, E. (1996). Source discrimination, item detection, and multinomial models for source monitoring. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22.
Chechile, R. A. (1998). A new method for estimating model parameters for multinomial data. Journal of Mathematical Psychology, 42.
Grünwald, P. (2007). The Minimum Description Length Principle. Cambridge, MA: MIT Press.
Grünwald, P., Myung, I. J., and Pitt, M. A. (2005). Advances in Minimum Description Length: Theory and Applications. Cambridge, MA: MIT Press.
Hu, X. and Batchelder, W. H. (1994). The statistical analysis of general processing tree models with the EM algorithm. Psychometrika, 59.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90.
Knapp, B. and Batchelder, W. H. (2004). Representing parametric order constraints in multi-trial applications of multinomial processing tree models. Journal of Mathematical Psychology.
Lawley, D. N. and Maxwell, A. E. (1971). Factor Analysis as a Statistical Method. New York: Elsevier.
Lee, M. D. (2001). On the complexity of additive clustering models. Journal of Mathematical Psychology, 45.
Myung, I. J., Forster, M. R. and Browne, M. W. (2000). Guest editors' introduction: Special issue on model selection. Journal of Mathematical Psychology, 44, 1-2.
Myung, I. J. and Pitt, M. A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4(1).
Myung, I. J., Pitt, M. A. and Navarro, D. J. (2007). Does response scaling cause the generalized context model to mimic a prototype model? Psychonomic Bulletin & Review, 14(6).
Navarro, D. J. (2004). A note on the applied use of MDL approximations. Neural Computation, 16.
Pitt, M. A., Myung, I. J. and Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109.
Rissanen, J. J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6.
Wagenmakers, E.-J. and Waldorp, L. (2006). Editors' introduction. Journal of Mathematical Psychology, 50.

Acknowledgements

This paper is based on Hao Wu's Master of Arts thesis submitted to the Ohio State University in July. It is supported in part by National Institute of Health Grant R01-MH57472 to JIM. Work on this paper by the third author was supported in part by a grant from the National Science Foundation (SES) to X. Hu and W. H. Batchelder (Co-PIs). We wish to thank Mark A. Pitt and Michael W. Browne for valuable feedback on the project. Correspondence concerning this article should be addressed to Hao Wu, Department of Psychology, Ohio State University, 1835 Neil Avenue, Columbus, OH. wu.498@osu.edu.

Table 1: Summary of the six hypotheses in Bayen, Murnane & Erdfelder (1996). Each parameter has two subscripts denoting the levels of the two experimental factors: the first subscript i denotes the level of the distractor similarity factor, and the second subscript j denotes the level of the source similarity factor. H, M, and L denote the High, Medium, and Low levels.

H1: D_Hj < D_Mj < D_Lj, for all j in {H, M, L}
H2: D_iH = D_iM = D_iL, for all i in {H, M, L}
H3: d_Hj = d_Mj = d_Lj, for all j in {H, M, L}
H4: d_iH < d_iM < d_iL, for all i in {H, M, L}
H5: b_Hj > b_Mj > b_Lj, for all j in {H, M, L}
H6: b_iH = b_iM = b_iL, for all i in {H, M, L}
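The six hypotheses can be encoded as simple predicates over a table of parameter estimates. The sketch below assumes, purely for illustration, that estimates are stored in dicts keyed by (distractor level, source level) pairs; the function name and data layout are hypothetical, not from the paper:

```python
LEVELS = ("H", "M", "L")

def check_hypotheses(D, d, b):
    """Evaluate H1-H6 of Table 1 on dicts mapping (i, j) level pairs to
    parameter estimates; returns a dict of booleans, one per hypothesis."""
    return {
        "H1": all(D[("H", j)] < D[("M", j)] < D[("L", j)] for j in LEVELS),
        "H2": all(D[(i, "H")] == D[(i, "M")] == D[(i, "L")] for i in LEVELS),
        "H3": all(d[("H", j)] == d[("M", j)] == d[("L", j)] for j in LEVELS),
        "H4": all(d[(i, "H")] < d[(i, "M")] < d[(i, "L")] for i in LEVELS),
        "H5": all(b[("H", j)] > b[("M", j)] > b[("L", j)] for j in LEVELS),
        "H6": all(b[(i, "H")] == b[(i, "M")] == b[(i, "L")] for i in LEVELS),
    }
```

In practice one would test these orderings statistically rather than compare point estimates directly, but the predicates make the direction of each hypothesis explicit.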

Table 2: Summary of model selection results for the source monitoring data from Bayen, Murnane & Erdfelder (1996, Appendix C). The total sample size is N = 11,232. The number of parameters S is listed for each model. The middle part of the table summarizes the equality constraints assumed by the models. The symbols are defined as follows: "s2" stands for "set to be the same across distractor similarity conditions only"; "s1" stands for "set to be the same across source similarity conditions only"; "s" stands for "set to be the same across all 9 conditions"; "v" stands for "allowed to vary across all 9 conditions". In the row for parameter D', "0" stands for "set to 0" and "D" stands for "fixed to parameter D in the same experimental condition". For the inequality constraints associated with FIAi, see the text. The table continues in Table 3.

Model   M0-0  M0-1  M0-2  M0-3  M0-4  M0-5
S
D       s1    s1    s     s1    s     s1
D'      s1    s1    s1    s     s1    s
b       v     s     v     v     s     s1
d       s2    s2    s2    s2    s2    s2
g       v     v     v     v     v     v
-LML
FIA: C_FIA, FIA
FIAi: C_FIA, FIA
BIC: C_BIC, BIC
AIC: C_AIC, AIC

Table 3: Continuation of Table 2.

Model   M1-1  M1-2  M1-3  M1-4  M2-1  M2-2  M2-3  M2-4  M2-5
S
D       s1    s     s1    s     s1    s     s1    s     s1
D'      0     0     0     0     D     D     D     D     D
b       v     v     s1    s1    v     v     s1    s1    s
d       s2    s2    s2    s2    s2    s2    s2    s2    s2
g       v     v     v     v     v     v     v     v     v
-LML
FIA: C_FIA, FIA
FIAi: C_FIA, FIA
BIC: C_BIC, BIC
AIC: C_AIC, AIC

Figure 1: The two-high-threshold multinomial processing tree model of source monitoring. Adapted from Bayen, Murnane & Erdfelder (1996, Figure 3). The parameters are defined as follows: D1 (detectability of source A items); D2 (detectability of source B items); d1 (source discriminability of source A items); d2 (source discriminability of source B items); a (guessing that a detected but nondiscriminated item belongs to source A); b (guessing "old" for a nondetected item); g (guessing that a nondetected item biased as old belongs to source A).

Figure 2: C_FIA complexity curves for six source monitoring models from Batchelder & Riefer (1990, Figure 2), plotted as a function of the percentage of new items for a fixed total sample size n_Old + n_New. Sample sizes for the two sources are the same. The six models differ from one another not only in the number of parameters but also in functional form. The number of parameters of each model is indicated by the first digit of the corresponding label in the legend.
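As a concrete, if much simpler than MPT, illustration of how a C_FIA value behind such curves can be computed, the sketch below evaluates the complexity term for a one-parameter Bernoulli model, where the Fisher information is I(θ) = 1/(θ(1 - θ)) and the integral of its square root over [0, 1] equals π exactly. This is a generic example, not the MATLAB program mentioned in footnote 3:

```python
import math

def c_fia_bernoulli(n, grid=1_000_000):
    """C_FIA = (S/2) ln(N / 2*pi) + ln integral of sqrt(det I(t)) dt for a
    single Bernoulli parameter t, where I(t) = 1 / (t * (1 - t)). The
    integral is approximated by a midpoint rule; its exact value is pi."""
    h = 1.0 / grid
    integral = h * sum(1.0 / math.sqrt(t * (1.0 - t))
                       for t in ((i + 0.5) * h for i in range(grid)))
    s = 1  # one free parameter
    return 0.5 * s * math.log(n / (2.0 * math.pi)) + math.log(integral)

def fia(lml, c_fia):
    """FIA = -LML + C_FIA; the model with the smaller value is preferred."""
    return -lml + c_fia
```

For N = 1000 the exact complexity term is (1/2) ln(1000/2π) + ln π ≈ 3.68, and the midpoint-rule value agrees to about two decimal places despite the integrable singularities at the endpoints.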


More information

Exam Advanced Data Mining Date: Time:

Exam Advanced Data Mining Date: Time: Exam Advanced Data Mining Date: 11-11-2010 Time: 13.30-16.30 General Remarks 1. You are allowed to consult 1 A4 sheet with notes written on both sides. 2. Always show how you arrived at the result of your

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Statistical matching: conditional. independence assumption and auxiliary information

Statistical matching: conditional. independence assumption and auxiliary information Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional

More information

Topics in Machine Learning-EE 5359 Model Assessment and Selection

Topics in Machine Learning-EE 5359 Model Assessment and Selection Topics in Machine Learning-EE 5359 Model Assessment and Selection Ioannis D. Schizas Electrical Engineering Department University of Texas at Arlington 1 Training and Generalization Training stage: Utilizing

More information

STATISTICS (STAT) Statistics (STAT) 1

STATISTICS (STAT) Statistics (STAT) 1 Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

CS 664 Flexible Templates. Daniel Huttenlocher

CS 664 Flexible Templates. Daniel Huttenlocher CS 664 Flexible Templates Daniel Huttenlocher Flexible Template Matching Pictorial structures Parts connected by springs and appearance models for each part Used for human bodies, faces Fischler&Elschlager,

More information

Hierarchical Mixture Models for Nested Data Structures

Hierarchical Mixture Models for Nested Data Structures Hierarchical Mixture Models for Nested Data Structures Jeroen K. Vermunt 1 and Jay Magidson 2 1 Department of Methodology and Statistics, Tilburg University, PO Box 90153, 5000 LE Tilburg, Netherlands

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

Dynamic Thresholding for Image Analysis

Dynamic Thresholding for Image Analysis Dynamic Thresholding for Image Analysis Statistical Consulting Report for Edward Chan Clean Energy Research Center University of British Columbia by Libo Lu Department of Statistics University of British

More information

Control Invitation

Control Invitation Online Appendices Appendix A. Invitation Emails Control Invitation Email Subject: Reviewer Invitation from JPubE You are invited to review the above-mentioned manuscript for publication in the. The manuscript's

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

COPULA MODELS FOR BIG DATA USING DATA SHUFFLING

COPULA MODELS FOR BIG DATA USING DATA SHUFFLING COPULA MODELS FOR BIG DATA USING DATA SHUFFLING Krish Muralidhar, Rathindra Sarathy Department of Marketing & Supply Chain Management, Price College of Business, University of Oklahoma, Norman OK 73019

More information

A ew Algorithm for Community Identification in Linked Data

A ew Algorithm for Community Identification in Linked Data A ew Algorithm for Community Identification in Linked Data Nacim Fateh Chikhi, Bernard Rothenburger, Nathalie Aussenac-Gilles Institut de Recherche en Informatique de Toulouse 118, route de Narbonne 31062

More information

COLOR FIDELITY OF CHROMATIC DISTRIBUTIONS BY TRIAD ILLUMINANT COMPARISON. Marcel P. Lucassen, Theo Gevers, Arjan Gijsenij

COLOR FIDELITY OF CHROMATIC DISTRIBUTIONS BY TRIAD ILLUMINANT COMPARISON. Marcel P. Lucassen, Theo Gevers, Arjan Gijsenij COLOR FIDELITY OF CHROMATIC DISTRIBUTIONS BY TRIAD ILLUMINANT COMPARISON Marcel P. Lucassen, Theo Gevers, Arjan Gijsenij Intelligent Systems Lab Amsterdam, University of Amsterdam ABSTRACT Performance

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing Samuel Coolidge, sam.r.coolidge@gmail.com Dan Simon, des480@nyu.edu Dennis Shasha, shasha@cims.nyu.edu Technical Report

More information

7. Collinearity and Model Selection

7. Collinearity and Model Selection Sociology 740 John Fox Lecture Notes 7. Collinearity and Model Selection Copyright 2014 by John Fox Collinearity and Model Selection 1 1. Introduction I When there is a perfect linear relationship among

More information

Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio

Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio Adela Ioana Tudor, Adela Bâra, Simona Vasilica Oprea Department of Economic Informatics

More information

GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS

GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS Dear Participant of the MScIS Program, If you have chosen to follow an internship, one of the requirements is to write a Thesis. This document gives you

More information

Polynomial Curve Fitting of Execution Time of Binary Search in Worst Case in Personal Computer

Polynomial Curve Fitting of Execution Time of Binary Search in Worst Case in Personal Computer Polynomial Curve Fitting of Execution Time of Binary Search in Worst Case in Personal Computer Dipankar Das Assistant Professor, The Heritage Academy, Kolkata, India Abstract Curve fitting is a well known

More information

Optimal Experimental Design for Model Discrimination

Optimal Experimental Design for Model Discrimination Optimal Experimental Design for Model Discrimination Jay I. Myung and Mark A. Pitt Ohio State University June 3, 2009 (in press in Psychological Review) Abstract Models of a psychological process can be

More information

IDENTIFICATION AND ELIMINATION OF INTERIOR POINTS FOR THE MINIMUM ENCLOSING BALL PROBLEM

IDENTIFICATION AND ELIMINATION OF INTERIOR POINTS FOR THE MINIMUM ENCLOSING BALL PROBLEM IDENTIFICATION AND ELIMINATION OF INTERIOR POINTS FOR THE MINIMUM ENCLOSING BALL PROBLEM S. DAMLA AHIPAŞAOĞLU AND E. ALPER Yıldırım Abstract. Given A := {a 1,..., a m } R n, we consider the problem of

More information

Spectral Clustering and Community Detection in Labeled Graphs

Spectral Clustering and Community Detection in Labeled Graphs Spectral Clustering and Community Detection in Labeled Graphs Brandon Fain, Stavros Sintos, Nisarg Raval Machine Learning (CompSci 571D / STA 561D) December 7, 2015 {btfain, nisarg, ssintos} at cs.duke.edu

More information

Learning Graph Grammars

Learning Graph Grammars Learning Graph Grammars 1 Aly El Gamal ECE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign Abstract Discovery of patterns in a given graph - in the form of repeated

More information

FACILITY LIFE-CYCLE COST ANALYSIS BASED ON FUZZY SETS THEORY Life-cycle cost analysis

FACILITY LIFE-CYCLE COST ANALYSIS BASED ON FUZZY SETS THEORY Life-cycle cost analysis FACILITY LIFE-CYCLE COST ANALYSIS BASED ON FUZZY SETS THEORY Life-cycle cost analysis J. O. SOBANJO FAMU-FSU College of Engineering, Tallahassee, Florida Durability of Building Materials and Components

More information

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Pallavi Baral and Jose Pedro Fique Department of Economics Indiana University at Bloomington 1st Annual CIRANO Workshop on Networks

More information

Dual-Frame Sample Sizes (RDD and Cell) for Future Minnesota Health Access Surveys

Dual-Frame Sample Sizes (RDD and Cell) for Future Minnesota Health Access Surveys Dual-Frame Sample Sizes (RDD and Cell) for Future Minnesota Health Access Surveys Steven Pedlow 1, Kanru Xia 1, Michael Davern 1 1 NORC/University of Chicago, 55 E. Monroe Suite 2000, Chicago, IL 60603

More information

Trajectory Compression under Network Constraints

Trajectory Compression under Network Constraints Trajectory Compression under Network Constraints Georgios Kellaris, Nikos Pelekis, and Yannis Theodoridis Department of Informatics, University of Piraeus, Greece {gkellar,npelekis,ytheod}@unipi.gr http://infolab.cs.unipi.gr

More information

10601 Machine Learning. Model and feature selection

10601 Machine Learning. Model and feature selection 10601 Machine Learning Model and feature selection Model selection issues We have seen some of this before Selecting features (or basis functions) Logistic regression SVMs Selecting parameter value Prior

More information

FSRM Feedback Algorithm based on Learning Theory

FSRM Feedback Algorithm based on Learning Theory Send Orders for Reprints to reprints@benthamscience.ae The Open Cybernetics & Systemics Journal, 2015, 9, 699-703 699 FSRM Feedback Algorithm based on Learning Theory Open Access Zhang Shui-Li *, Dong

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Separating Objects and Clutter in Indoor Scenes

Separating Objects and Clutter in Indoor Scenes Separating Objects and Clutter in Indoor Scenes Salman H. Khan School of Computer Science & Software Engineering, The University of Western Australia Co-authors: Xuming He, Mohammed Bennamoun, Ferdous

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

CPSC 532L Project Development and Axiomatization of a Ranking System

CPSC 532L Project Development and Axiomatization of a Ranking System CPSC 532L Project Development and Axiomatization of a Ranking System Catherine Gamroth cgamroth@cs.ubc.ca Hammad Ali hammada@cs.ubc.ca April 22, 2009 Abstract Ranking systems are central to many internet

More information

Notes and Comment. Bias in exponential and power function fits due to noise: Comment on Myung, Kim, and Pitt

Notes and Comment. Bias in exponential and power function fits due to noise: Comment on Myung, Kim, and Pitt Memory & Cognition 2003, 31 (4), 656-661 Notes and Comment Bias in exponential and power function fits due to noise: Comment on Myung, Kim, and Pitt SCOTT BROWN University of California, Irvine, California

More information

Workshop 8: Model selection

Workshop 8: Model selection Workshop 8: Model selection Selecting among candidate models requires a criterion for evaluating and comparing models, and a strategy for searching the possibilities. In this workshop we will explore some

More information

The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R

The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R Journal of Machine Learning Research (2013) Submitted ; Published The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R Haotian Pang Han Liu Robert Vanderbei Princeton

More information

Lecture 19. Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer

Lecture 19. Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer CS-621 Theory Gems November 21, 2012 Lecture 19 Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer 1 Introduction We continue our exploration of streaming algorithms. First,

More information

Identifying Layout Classes for Mathematical Symbols Using Layout Context

Identifying Layout Classes for Mathematical Symbols Using Layout Context Rochester Institute of Technology RIT Scholar Works Articles 2009 Identifying Layout Classes for Mathematical Symbols Using Layout Context Ling Ouyang Rochester Institute of Technology Richard Zanibbi

More information

Cognitive Principles and Reasoning Clusters in Multinomial Process Tree: A Case Study for Human Syllogistic Reasoning

Cognitive Principles and Reasoning Clusters in Multinomial Process Tree: A Case Study for Human Syllogistic Reasoning ognitive Principles and Reasoning lusters in Multinomial Process Tree: A ase Study for Human Syllogistic Reasoning Orianne Bargain Technische Universität Dresden April 11, 2018 Table of ontents Preliminaries

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Sung Chun Lee, Chang Huang, and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA sungchun@usc.edu,

More information

Subjective Randomness and Natural Scene Statistics

Subjective Randomness and Natural Scene Statistics Subjective Randomness and Natural Scene Statistics Ethan Schreiber (els@cs.brown.edu) Department of Computer Science, 115 Waterman St Providence, RI 02912 USA Thomas L. Griffiths (tom griffiths@berkeley.edu)

More information

SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION

SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION Kamil Zakwan Mohd Azmi, Zuwairie Ibrahim and Dwi Pebrianti Faculty of Electrical

More information

Model Comparison in Psychology

Model Comparison in Psychology Model Comparison in Psychology Jay I. Myung # and Mark A. Pitt Department of Psychology Ohio State University Columbus, OH 43210 {myung.1, pitt.2}@osu.edu # Corresponding author April 28, 2016 Word count:

More information

Instantaneously trained neural networks with complex inputs

Instantaneously trained neural networks with complex inputs Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2003 Instantaneously trained neural networks with complex inputs Pritam Rajagopal Louisiana State University and Agricultural

More information

A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing

A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing Asif Dhanani Seung Yeon Lee Phitchaya Phothilimthana Zachary Pardos Electrical Engineering and Computer Sciences

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

Analysis of missing values in simultaneous. functional relationship model for circular variables

Analysis of missing values in simultaneous. functional relationship model for circular variables Analysis of missing values in simultaneous linear functional relationship model for circular variables S. F. Hassan, Y. Z. Zubairi and A. G. Hussin* Centre for Foundation Studies in Science, University

More information

Information Criteria Methods in SAS for Multiple Linear Regression Models

Information Criteria Methods in SAS for Multiple Linear Regression Models Paper SA5 Information Criteria Methods in SAS for Multiple Linear Regression Models Dennis J. Beal, Science Applications International Corporation, Oak Ridge, TN ABSTRACT SAS 9.1 calculates Akaike s Information

More information

Distributed Detection in Sensor Networks: Connectivity Graph and Small World Networks

Distributed Detection in Sensor Networks: Connectivity Graph and Small World Networks Distributed Detection in Sensor Networks: Connectivity Graph and Small World Networks SaeedA.AldosariandJoséM.F.Moura Electrical and Computer Engineering Department Carnegie Mellon University 5000 Forbes

More information

A Bayesian approach to artificial neural network model selection

A Bayesian approach to artificial neural network model selection A Bayesian approach to artificial neural network model selection Kingston, G. B., H. R. Maier and M. F. Lambert Centre for Applied Modelling in Water Engineering, School of Civil and Environmental Engineering,

More information

Framework for replica selection in fault-tolerant distributed systems

Framework for replica selection in fault-tolerant distributed systems Framework for replica selection in fault-tolerant distributed systems Daniel Popescu Computer Science Department University of Southern California Los Angeles, CA 90089-0781 {dpopescu}@usc.edu Abstract.

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

Catering to Your Tastes: Using PROC OPTEX to Design Custom Experiments, with Applications in Food Science and Field Trials

Catering to Your Tastes: Using PROC OPTEX to Design Custom Experiments, with Applications in Food Science and Field Trials Paper 3148-2015 Catering to Your Tastes: Using PROC OPTEX to Design Custom Experiments, with Applications in Food Science and Field Trials Clifford Pereira, Department of Statistics, Oregon State University;

More information

Sampling informative/complex a priori probability distributions using Gibbs sampling assisted by sequential simulation

Sampling informative/complex a priori probability distributions using Gibbs sampling assisted by sequential simulation Sampling informative/complex a priori probability distributions using Gibbs sampling assisted by sequential simulation Thomas Mejer Hansen, Klaus Mosegaard, and Knud Skou Cordua 1 1 Center for Energy Resources

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Variable Selection - Chapters 8 in Dielman Artin Department of Statistical Science October 23, 2009 Outline Introduction 1 Introduction 2 3 4 Variable Selection Model

More information

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS Kyoungjin Park Alper Yilmaz Photogrammetric and Computer Vision Lab Ohio State University park.764@osu.edu yilmaz.15@osu.edu ABSTRACT Depending

More information

A Statistical Analysis of UK Financial Networks

A Statistical Analysis of UK Financial Networks A Statistical Analysis of UK Financial Networks J. Chu & S. Nadarajah First version: 31 December 2016 Research Report No. 9, 2016, Probability and Statistics Group School of Mathematics, The University

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Kosuke Imai Princeton University Joint work with Graeme Blair October 29, 2010 Blair and Imai (Princeton) List Experiments NJIT (Mathematics) 1 / 26 Motivation

More information

Novel Lossy Compression Algorithms with Stacked Autoencoders

Novel Lossy Compression Algorithms with Stacked Autoencoders Novel Lossy Compression Algorithms with Stacked Autoencoders Anand Atreya and Daniel O Shea {aatreya, djoshea}@stanford.edu 11 December 2009 1. Introduction 1.1. Lossy compression Lossy compression is

More information

Deep Generative Models Variational Autoencoders

Deep Generative Models Variational Autoencoders Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

The Cross-Entropy Method

The Cross-Entropy Method The Cross-Entropy Method Guy Weichenberg 7 September 2003 Introduction This report is a summary of the theory underlying the Cross-Entropy (CE) method, as discussed in the tutorial by de Boer, Kroese,

More information

COMPARING MODELS AND CURVES. Fitting models to biological data using linear and nonlinear regression. A practical guide to curve fitting.

COMPARING MODELS AND CURVES. Fitting models to biological data using linear and nonlinear regression. A practical guide to curve fitting. COMPARING MODELS AND CURVES An excerpt from a forthcoming book: Fitting models to biological data using linear and nonlinear regression. A practical guide to curve fitting. Harvey Motulsky GraphPad Software

More information

Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal

More information

AN APPROXIMATION APPROACH FOR RANKING FUZZY NUMBERS BASED ON WEIGHTED INTERVAL - VALUE 1.INTRODUCTION

AN APPROXIMATION APPROACH FOR RANKING FUZZY NUMBERS BASED ON WEIGHTED INTERVAL - VALUE 1.INTRODUCTION Mathematical and Computational Applications, Vol. 16, No. 3, pp. 588-597, 2011. Association for Scientific Research AN APPROXIMATION APPROACH FOR RANKING FUZZY NUMBERS BASED ON WEIGHTED INTERVAL - VALUE

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

Robust Signal-Structure Reconstruction

Robust Signal-Structure Reconstruction Robust Signal-Structure Reconstruction V. Chetty 1, D. Hayden 2, J. Gonçalves 2, and S. Warnick 1 1 Information and Decision Algorithms Laboratories, Brigham Young University 2 Control Group, Department

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

Exploring Econometric Model Selection Using Sensitivity Analysis

Exploring Econometric Model Selection Using Sensitivity Analysis Exploring Econometric Model Selection Using Sensitivity Analysis William Becker Paolo Paruolo Andrea Saltelli Nice, 2 nd July 2013 Outline What is the problem we are addressing? Past approaches Hoover

More information

An Approach to Identify the Number of Clusters

An Approach to Identify the Number of Clusters An Approach to Identify the Number of Clusters Katelyn Gao Heather Hardeman Edward Lim Cristian Potter Carl Meyer Ralph Abbey July 11, 212 Abstract In this technological age, vast amounts of data are generated.

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

Further Applications of a Particle Visualization Framework

Further Applications of a Particle Visualization Framework Further Applications of a Particle Visualization Framework Ke Yin, Ian Davidson Department of Computer Science SUNY-Albany 1400 Washington Ave. Albany, NY, USA, 12222. Abstract. Our previous work introduced

More information

One-mode Additive Clustering of Multiway Data

One-mode Additive Clustering of Multiway Data One-mode Additive Clustering of Multiway Data Dirk Depril and Iven Van Mechelen KULeuven Tiensestraat 103 3000 Leuven, Belgium (e-mail: dirk.depril@psy.kuleuven.ac.be iven.vanmechelen@psy.kuleuven.ac.be)

More information

Model selection Outline for today

Model selection Outline for today Model selection Outline for today The problem of model selection Choose among models by a criterion rather than significance testing Criteria: Mallow s C p and AIC Search strategies: All subsets; stepaic

More information

Probability Evaluation in MHT with a Product Set Representation of Hypotheses

Probability Evaluation in MHT with a Product Set Representation of Hypotheses Probability Evaluation in MHT with a Product Set Representation of Hypotheses Johannes Wintenby Ericsson Microwave Systems 431 84 Mölndal, Sweden johannes.wintenby@ericsson.com Abstract - Multiple Hypothesis

More information