PARAMETRIC MODEL SELECTION TECHNIQUES


GARY L. BECK, STEVE FROM, PH.D.

Abstract. Many parametric statistical models are available for modelling lifetime data. Given a data set of lifetimes, which may or may not be censored, which parametric model should be used to conduct statistical tests? In only a few cases can analytical expressions be found to answer this question in some optimal fashion. Various measures of discrepancy and other functionals of the distribution function will be considered for a finite number of competing parametric statistical models. Utilizing techniques developed by Linhart and Zucchini, survival data from pediatric patients who have received stem cell transplants will be analyzed to determine whether models fitted to random samples identify the actual model for the population.

The author would like to give special thanks to John Maloney, Ph.D., for his invaluable assistance in programming Maple.

1. Introduction

Probability models are useful for providing information about observations of seemingly random variables. In controlled settings, various parametric models may be chosen which appear to fit the data. However, it is highly unlikely that investigators will have a complete data set on which to base assumptions; they are therefore resigned to using a battery of tests to confirm how well the chosen model fits the observed pattern of data [1]. Linhart and Zucchini [2] lay the groundwork for more accurate means of model selection. In simple random sampling, observations may be regarded as independent and identically distributed random variables with a non-negative probability density function (pdf). This particular pdf, denoted f(x), may be regarded as the model which gave rise to the observations and is referred to as the operating model [3]. In practice, estimations about f(x) must be made

even though they are based on only a sample of the population. It is only in exceptional cases that sufficient information is available to identify the operating model, and it is rarer still to have a complete data set with which to do so. Therefore, it is important to have an understanding of the information under investigation in order to circumscribe a family of models to best represent the pdf. The size of the family of models is determined by the number of its independent parameters. These parameters are estimated from observations, and the accuracy of the estimates depends on the amount of data available relative to the number of parameters to be estimated; accuracy improves as either the sample size increases or the number of parameters decreases. Attempting to estimate too many parameters is referred to as overfitting, which leads to instability: repeated samples collected under comparable conditions lead to widely varying fitted models. Successful model fitting is consequently a matter of ensuring a data set sufficient to achieve a desired level of accuracy.

Traditionally, histograms have been used to summarize a set of data graphically. The behavior of the histogram may be used to estimate the probability density function, whether as a final estimate or as an intermediate estimate in the search for a smoother model. However, the properties of the histogram as an estimator of the pdf depend strongly on the sample size and on the number of intervals. As the number of intervals grows relative to the sample size, the histogram displays greater variability; the difficulty is in identifying whether these graphical variations represent the population as a whole or merely characteristics of the sample.

A common approach to fitting a model is to select the simplest approximating family which is consistent with the data. This is a matter of selecting from a catalog of models those which match the general features apparent in the data, features which may be recognized using histograms. It is assumed

that the family of models represents the true situation, an assumption which must then be tested by means of a hypothesis test to determine whether the model is consistent with specific aspects of the data. The advantage of this method is that relatively simple models may be selected to analyze the data. In so doing, the assumption is made that the chosen models are valid, and estimators are chosen and decisions made accordingly. The drawback is that assumptions about unknown parameters become the focus, when in reality the parameters are only of interest if they can be naturally and usefully interpreted in the context of the observations. Instead, Linhart and Zucchini [2] suggest using estimators that are robust to deviations from the assumed family selected for fitting. Their approach is to choose the family of models estimated to be the most appropriate under the circumstances, where the circumstances comprise the background assumptions, the sample size, and the specific requirements of the user.

Discrepancies

Before the performances of competing models can be compared, a measure for assessing fit, or the lack thereof, must be chosen. Such a measure of lack of fit is referred to as a discrepancy, denoted $\Delta(f, g_\theta)$. The discrepancy between the operating model and the best approximating model is referred to as the discrepancy due to approximation. It constitutes the lower bound for the discrepancy of models in the approximating family and does not depend on the data, the sample size, or the method of estimation employed. The discrepancy between the fitted model and the best approximating model is called the discrepancy due to estimation; it does depend on the sample values and changes from sample to sample, so the discrepancy due to estimation is a random variable. Finally, the overall discrepancy is defined as the discrepancy between the operating model and the fitted model, which takes both components into account. It is necessary to consider both when comparing approximating families of different complexity. The best model in a more complex family is typically closer to the

operating model than the best model in a simpler family. However, the fitted model in the complex family tends to be further from its best model than is the case in the simpler family. Thus, complex families have more potential but tend to perform below it. The overall discrepancy, as the sum of its two component discrepancies, therefore allows for an appropriate compromise between the two opposing properties [3]. All of this is possible when the actual operating model is known; in practice, this is rarely the case, and the discrepancies, though they exist, cannot be calculated. Since the operating model, and hence the overall discrepancy, is unknown, an estimator of the expected discrepancy, $E_F\,\Delta(f, g_{\hat\theta})$, is used instead; such an estimator is called a criterion. For a histogram density on $[0, 100]$ with $I$ equal intervals, the expected discrepancy given by Linhart and Zucchini is

(1.1) $E_F\,\Delta(\hat\theta) = E_F \int_0^{100} \big(f(x) - g^{(I)}_{\hat\theta}(x)\big)^2\,dx$

where $g^{(I)}_{\hat\theta}(x) = \frac{n_i I}{100n}$ for $\frac{100(i-1)}{I} < x \le \frac{100i}{I}$, $i = 1, 2, \ldots, I$, and $n_i$ is the frequency of the $i$th interval. Taking the expectation of this integral after suitable subdividing yields

(1.2) $E_F\,\Delta(\hat\theta) = \int_0^{100} f(x)^2\,dx + \frac{I}{100n}\Big(1 - (n+1)\sum_{i=1}^{I}\pi_i^2\Big)$

where $\pi_i = \int_{100(i-1)/I}^{100i/I} f(x)\,dx$. The first term does not depend on the approximating family and can therefore be ignored. The second term is the essential term, and an unbiased estimator of it, a criterion, is given by

(1.3) $\frac{I}{100n}\Big[1 - \frac{n+1}{n-1}\Big(\frac{1}{n}\sum_{i=1}^{I} n_i^2 - 1\Big)\Big]$

This procedure does not single out any one approximating family of models except on the basis of its criterion. Some situations merit a simple model and others a complex one, depending on the behavior of the data set.
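
Criterion (1.3) is simple to compute directly. The following is a minimal Python sketch, assuming (as in the age example below) that the observations lie in [0, 100] and reading the constant 100 in (1.3) as the length of the support; the data vector is a hypothetical placeholder, not the table from the text.

    import numpy as np

    def histogram_criterion(x, I, upper=100.0):
        # Unbiased criterion (1.3) for a histogram with I equal cells on [0, upper].
        n = len(x)
        counts, _ = np.histogram(x, bins=I, range=(0.0, upper))
        s = np.sum(counts.astype(float) ** 2)
        return (I / (upper * n)) * (1.0 - (n + 1) / (n - 1) * (s / n - 1.0))

    rng = np.random.default_rng(1)
    ages = rng.uniform(0, 100, size=200)        # hypothetical placeholder data
    for I in (2, 4, 6, 10, 25, 50):
        # The most appropriate I is the one with the smallest criterion value.
        print(I, histogram_criterion(ages, I))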

In some instances, it may also be possible to construct a test of the hypothesis that a particular approximating family has a smaller expected discrepancy than another. In most cases, however, such tests are difficult to construct, meriting reliance on simple comparisons of estimated expected discrepancies at a selected level of significance. Consider the random age distributions from the Federal Republic of Germany in 1974 (Statistisches Bundesamt, 1976, p. 58). Figures 1 and 2 display the distribution of ages in histogram format, with the data grouped into I = 10 and I = 50 intervals. The number of intervals determines the criterion value. Figure 3 shows the criterion values, of which the smallest occurs at approximately I = 6, thereby identifying that as the most appropriate number of intervals.

Table 1. Random Age Distributions, n=

This leads to the primary issue of model selection; namely, identifying an approximating family of models and constructing a discrepancy. The choice of family is typically determined by the type of analysis one intends to complete, whether it is hypothesis testing, regression analysis, analysis of variance, etc.

Figure 1. Age histogram, I = 10

The next step is to choose a discrepancy. For this, one must determine the use to which the fitted model will be put and which aspects of it must conform to the operating model.

Important Discrepancies

Discrepancies should be selected to match the objectives of the analysis [3]. A natural estimator to use with a particular discrepancy is the minimum discrepancy estimator, or minimum distance estimator: the estimator that minimizes the empirical discrepancy between the approximating model and the proposed operating model. The method of maximum likelihood estimation (MLE), developed by R. A. Fisher, selects as the desired probability distribution the one that makes the observed data most likely [1].

Figure 2. Age histogram, I = 50

Using MLE is an important general-purpose method for calculating discrepancies. MLEs are asymptotically normally distributed, of asymptotically minimum variance, and asymptotically unbiased (as n approaches infinity) [1]. The Kullback-Leibler discrepancy is notated

$\Delta_{KL}(f, g_\theta) = -E_F\,\log(g_\theta(x)) = -\int \log(g_\theta(x))\,f(x)\,dx$

where $g_\theta$ is the pdf characterizing the approximating family of models. The minimum discrepancy estimator associated with this discrepancy is the maximum likelihood estimator. This discrepancy focuses on the expected log-likelihood when the approximating model is $g_\theta$; the higher the expected log-likelihood, the better the model. Another possible discrepancy is the Cramér-von Mises discrepancy, which is notated

$\Delta_{CM}(\theta) = E_{G_\theta}\big(F(x) - G_\theta(x)\big)^2$

Figure 3. Graph of criterion values for the age data

For discrete or grouped data sets, the Pearson chi-squared or Neyman chi-squared discrepancies are suitable. They are notated, respectively,

$\Delta_P(\theta) = \sum_{x:\,g_\theta(x) \neq 0} \frac{(f(x) - g_\theta(x))^2}{g_\theta(x)}$

and

$\Delta_N(\theta) = \sum_{x:\,f(x) \neq 0} \frac{(f(x) - g_\theta(x))^2}{f(x)}$

where $f$ and $g_\theta$ are the pdfs characterizing the operating model and the approximating family. Discrepancies need not depend on every detail of the distribution; rather, they may be based on some specific aspect of the distributions. For example, in regression analysis, only certain expectations are of particular interest. Thus the method, albeit complex, is very flexible and applicable to any aspect of data analysis.
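
As an illustration, here is a minimal Python sketch of the two chi-squared discrepancies for grouped data; the cell-probability vectors f and g are hypothetical stand-ins for an operating model and an approximating model.

    import numpy as np

    def pearson_discrepancy(f, g):
        # Sum over cells with g(x) != 0, per the Pearson formula above.
        mask = g > 0
        return np.sum((f[mask] - g[mask]) ** 2 / g[mask])

    def neyman_discrepancy(f, g):
        # Same form, but weighted by the operating-model probabilities.
        mask = f > 0
        return np.sum((f[mask] - g[mask]) ** 2 / f[mask])

    f = np.array([0.20, 0.50, 0.30])   # operating cell probabilities (hypothetical)
    g = np.array([0.25, 0.45, 0.30])   # approximating cell probabilities
    print(pearson_discrepancy(f, g), neyman_discrepancy(f, g))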

Derivation of Criteria

Each operating model and discrepancy requires its own method of analysis. Having decided which approximating families are to be considered, which methods will be used to estimate the parameters, and which discrepancy will assess the fit, a criterion must be found, that is, an estimator of the expected discrepancy $E_F\,\Delta(\hat\theta)$. This expectation is taken with respect to the operating model F. The derivation can be straightforward or exceptionally complex (the appendix of Linhart and Zucchini [2] details the derivations of these criteria). When the expected discrepancy is too complex to derive exactly, asymptotic methods will sometimes work; i.e., one works with its limiting values as the sample size increases indefinitely. By approximating $\Delta(\hat\theta)$ by the first terms of its Taylor expansion about the point $\theta_0$, the expectation can then be calculated. Thus, as the sample size increases, the expected discrepancy approaches the form

(1.4) $E_F\,\Delta(\hat\theta) \approx \Delta(\theta_0) + \frac{K}{2n}$

where $K = \mathrm{tr}(\Omega^{-1}\Sigma)$, the trace of the product of two matrices.

Bootstrap methods provide a simple and effective means of circumventing the technical problems encountered in deriving expected discrepancies and their estimators. With this method, one generates repeated samples of size n from a fixed $F_n$, the empirical distribution function, which plays the role of the operating model. Each sample yields a new estimate $\hat\theta$ for the approximating family, and hence a new discrepancy value; the average of these converges to the expected discrepancy.
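
The following is a minimal sketch of the bootstrap idea for the Kullback-Leibler discrepancy with a normal approximating family. It follows the resampling scheme described above rather than Linhart and Zucchini's own implementation, and the input data are hypothetical.

    import numpy as np

    def kl_empirical(x, mu, var):
        # Delta_n(theta) = -(1/n) * sum(log g_theta(x_i)) for the normal pdf.
        return 0.5 * np.log(2 * np.pi * var) + np.mean((x - mu) ** 2) / (2 * var)

    def bootstrap_criterion(x, B=1000, seed=0):
        rng = np.random.default_rng(seed)
        n, total = len(x), 0.0
        for _ in range(B):
            xb = rng.choice(x, size=n, replace=True)   # a sample from F_n
            mu, var = xb.mean(), xb.var()              # MLEs on the bootstrap sample
            total += kl_empirical(x, mu, var)          # assess the fit against F_n
        return total / B                               # estimates the expected discrepancy

    x = np.random.default_rng(2).gamma(2.0, 10.0, size=50)   # hypothetical data
    print(bootstrap_criterion(x))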

Cross-validation is a technique to assess the validity of a statistical analysis. With this method, the data are subdivided into a calibration sample (of size n-m) used to fit the model and a validation sample (of size m) used to validate it, and the procedure of fitting and validating is repeated, once for each subdivision. There is a problem in deciding how to select m without limiting the number of observations available to fit the model. In practice one may use a small m and follow these steps for all possible calibration samples of size n-m: fit the model to the calibration sample, then estimate the expected discrepancy of the fitted model using the validation sample. The cross-validation criterion is the average over these repetitions of the estimates obtained in the second step.

As shown in Figures 1 and 2, histogram densities are universally applicable. However, lower discrepancies may be achieved by fitting smoother approximating densities that depend on fewer parameters. Histograms are primarily useful as a means of selecting approximating models; smoother models such as the normal, lognormal, and gamma provide more concise descriptions. Unless certain distributional properties of the estimator are available, it is not possible to derive exact expressions for the expected discrepancy. Since finite-sample distributions are usually too difficult to derive, one must rely on asymptotic, bootstrap, or cross-validatory methods. This study concentrated on asymptotic methods applied to a complete data set acquired from the University of Nebraska Medical Center.

In order to obtain asymptotic criteria, it is necessary to obtain a trace term, $\mathrm{tr}(\Omega^{-1}\Sigma)$, derived from a functional on a k-dimensional stochastic process. This may be estimated from the data using estimators for $\Omega$ and $\Sigma$. The criterion is then $\Delta_n(\hat\theta) + \frac{\mathrm{tr}(\Omega_n^{-1}\Sigma_n)}{n}$, wherein $\Omega_n$ and $\Sigma_n$ are the estimates so derived. Alternatively, if the operating model is a member of the approximating family, the trace term becomes significantly simpler: for a number of discrepancies it is simply a multiple of p, the number of free parameters in the approximating family. It should be noted that the approximations on which the simpler criteria are based will be inaccurate whenever the discrepancy due to approximation is large [2].

The Kullback-Leibler discrepancy is one of the most important general-purpose discrepancies. It is an essential part of the expected log-likelihood ratio and is related to entropy, a fundamental quantity in information theory. This discrepancy and its asymptotic criteria give rise to criteria for a number

of standard distributions. The discrepancy is notated

$\Delta(\theta) = \Delta(G_\theta, F) = -E_F\,\log(g_\theta(x))$

with empirical discrepancy

$\Delta_n(\theta) = \Delta(G_\theta, F_n) = -\frac{1}{n}\sum_{i=1}^{n}\log(g_\theta(x_i))$

The asymptotic criterion is notated

$\Delta_n(\hat\theta) + \frac{\mathrm{tr}(\Omega_n^{-1}\Sigma_n)}{n}$

and the simpler criterion is

$\Delta_n(\hat\theta) + \frac{p}{n}$

The minimum discrepancy estimator, which here is the maximum likelihood estimator, is

$\hat\theta = \operatorname{argmin}\{\Delta_n(\theta) : \theta \in \Theta\}$

Given this framework, Linhart and Zucchini have identified maximum likelihood estimators and criteria for a variety of probability distributions. For the purposes of this study, the normal, lognormal, and gamma distributions were used. Based on the methodology of Linhart and Zucchini (Appendix, [2]), the Kullback-Leibler criterion for each model is indicated below. Sample moments and sample moments about the sample mean are denoted $m_h[\,\cdot\,]$ and $m'_h[\,\cdot\,]$, $h = 1, 2, \ldots$; for example, $m_2[\log(x)] = \frac{1}{n}\sum_{i=1}^{n}\log^2(x_i)$ is the second sample moment of $\log(x)$. For the normal distribution, the criterion is

(1.5) $\frac{1 + \log(2\pi m'_2)}{2} + \frac{m'_4 + m'^2_2}{2n\,m'^2_2}$

with trace term equalling $\frac{m'_4 + m'^2_2}{2n\,m'^2_2}$. The estimators for the normal distribution are

$\hat\lambda = m'_2, \qquad \hat\mu = m_1$

It should be noted that $\hat\lambda$ is the same as $\hat\sigma^2$ for this distribution and for the lognormal distribution. For the lognormal distribution, the criterion is

(1.6) $m_1[\log(x)] + \frac{1 + \log(2\pi\,m'_2[\log(x)])}{2} + \frac{m'_4[\log(x)] + m'^2_2[\log(x)]}{2n\,m'^2_2[\log(x)]}$

with trace term equalling $\frac{m'_4[\log(x)] + m'^2_2[\log(x)]}{2n\,m'^2_2[\log(x)]}$. The maximum likelihood estimators for the lognormal are

$\hat\mu = m_1[\log(x)], \qquad \hat\lambda = m'_2[\log(x)]$

Finally, the criterion for the gamma distribution is notated

(1.7) $\log(\Gamma(\hat\nu)) - \hat\nu\big(\log(\hat\lambda) - 1\big) - (\hat\nu - 1)\,m_1[\log(x)] + \frac{\mathrm{tr}(\Omega_n^{-1}\Sigma_n)}{n}$

The trace term for the gamma distribution is the last term in equation (1.7), where

$\Omega_n = \begin{pmatrix} m_1^2/\hat\nu & m_1/\hat\nu \\ m_1/\hat\nu & \psi'(\hat\nu) \end{pmatrix}, \qquad \Sigma_n = \begin{pmatrix} m'_2 & m'_{11}[x, \log(x)] \\ m'_{11}[x, \log(x)] & m'_2[\log(x)] \end{pmatrix}$

The estimators for the gamma distribution satisfy $\hat\nu/\hat\lambda = m_1$ and $\log(\hat\lambda) - \psi(\hat\nu) = -m_1[\log(x)]$, where $\psi$ is the digamma function.
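
To make the three criteria concrete, the following is a minimal Python sketch of equations (1.5)-(1.7) under the reconstruction above (central moments for the normal and lognormal, rate parameterization for the gamma). It is an illustration rather than the authors' Fortran program, and the test data are hypothetical.

    import numpy as np
    from scipy.special import digamma, polygamma, gammaln
    from scipy.optimize import brentq

    def normal_criterion(x):
        n, v = len(x), x.var()                    # v = m2' (central second moment)
        m4 = np.mean((x - x.mean()) ** 4)         # m4'
        return (1 + np.log(2 * np.pi * v)) / 2 + (m4 + v**2) / (2 * n * v**2)

    def lognormal_criterion(x):
        lx = np.log(x)
        return lx.mean() + normal_criterion(lx)   # criterion (1.6)

    def gamma_criterion(x):
        n, lx = len(x), np.log(x)
        m1, m1log = x.mean(), lx.mean()
        d = np.log(m1) - m1log                    # >= 0 by Jensen's inequality
        nu = brentq(lambda v: digamma(v) - np.log(v) + d, 1e-8, 1e8)  # shape MLE
        lam = nu / m1                             # rate, from nu/lambda = m1
        delta = gammaln(nu) - nu * (np.log(lam) - 1) - (nu - 1) * m1log
        omega = np.array([[m1**2 / nu, m1 / nu], [m1 / nu, polygamma(1, nu)]])
        sigma = np.cov(x, lx, bias=True)          # [[m2', m11'], [m11', m2'[log x]]]
        return delta + np.trace(np.linalg.solve(omega, sigma)) / n

    x = np.random.default_rng(3).gamma(2.0, 10.0, size=20)   # hypothetical sample
    for name, c in [("normal", normal_criterion), ("lognormal", lognormal_criterion),
                    ("gamma", gamma_criterion)]:
        print(name, c(x))   # smallest criterion identifies the chosen model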

2. Methods

Data Collection

Data for this study were obtained from the University of Nebraska Medical Center Clinical Cancer Trials office. Pediatric patients who have undergone stem cell transplants are tracked by this office. Information pertaining to the date of diagnosis, date of treatment, and date of expiration is collected and stored in a Microsoft Access database. The Institutional Review Board granted this study an exemption because no subjects would be identifiable.

Analysis

The entire population of pediatric patients who have received stem cell transplants at the University of Nebraska Medical Center is 289, so it is possible to compare the operating model based on the entire population with repeated random samples of size 20. Using equation (1.3), the criterion for the data set was examined. Figures 4 and 5 show histograms of the survival data for I = 10 and I = 50 intervals. It was found that the criterion was smallest at I = 6, which is thus the optimal number of intervals for this histogram. Using the Kullback-Leibler criteria of the normal, lognormal, and gamma distributions, the study examines how well this method selects the true operating model, as determined by the mean square error. Since the population is known, the technique should select the correct best model from among the approximating models, where "correct best" is measured by the fraction of samples of size 20 in which certain functionals of the cumulative distribution function of the population are most closely estimated.

Let $Y_0$ be a positive real number and let $M = P(Y \le Y_0)$, where Y is a population value. For a given sample of n data points (n = 20), each competing model (normal, lognormal, gamma) provides an estimate of M; denote these by $\hat M_N$ (normal), $\hat M_L$ (lognormal), and $\hat M_G$ (gamma). Now let $\hat M_N(i)$ equal the value of $\hat M_N$ for sample number i; this is computed from the MLEs of the model parameters. Similar definitions hold for $\hat M_L(i)$ and $\hat M_G(i)$.
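
The model-based estimates of M are simply fitted CDFs evaluated at $Y_0$. Here is a minimal sketch, assuming $Y_0 = 50.0$ (the value used later in the text) and a hypothetical sample; note that scipy's gamma scale is the reciprocal of the rate parameter used above.

    import numpy as np
    from scipy import stats

    def m_hat(sample, y0=50.0):
        # M-hat for each family, computed from that family's MLEs.
        mu, sd = sample.mean(), sample.std()
        lmu, lsd = np.log(sample).mean(), np.log(sample).std()
        a, loc, scale = stats.gamma.fit(sample, floc=0)   # shape, loc fixed at 0, scale = 1/rate
        return {"normal":    stats.norm.cdf(y0, mu, sd),
                "lognormal": stats.lognorm.cdf(y0, s=lsd, scale=np.exp(lmu)),
                "gamma":     stats.gamma.cdf(y0, a, scale=scale)}

    sample = np.random.default_rng(4).gamma(2.0, 20.0, size=20)   # hypothetical
    print(m_hat(sample))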

Figure 4. Histogram of survival data, I = 10

Let $R_n$ equal the number of random samples of size n = 20 that are generated. The absolute errors are

$E_N(i) = |M - \hat M_N(i)|, \quad i = 1, 2, \ldots, R_n$

$E_L(i) = |M - \hat M_L(i)|, \quad i = 1, 2, \ldots, R_n$

$E_G(i) = |M - \hat M_G(i)|, \quad i = 1, 2, \ldots, R_n$

By generating 5,000 random samples of size n, approximate mean square errors (MSEs) are computed. The MSE equals the mean of the squares of the deviations from the target [4]; i.e.,

$MSE_N = \frac{1}{5000}\sum_{i=1}^{5000}\big(M - \hat M_N(i)\big)^2 = \frac{1}{5000}\sum_{i=1}^{5000} E_N(i)^2$

$MSE_L = \frac{1}{5000}\sum_{i=1}^{5000}\big(M - \hat M_L(i)\big)^2 = \frac{1}{5000}\sum_{i=1}^{5000} E_L(i)^2$

Figure 5. Histogram of survival data, I = 50

$MSE_G = \frac{1}{5000}\sum_{i=1}^{5000}\big(M - \hat M_G(i)\big)^2 = \frac{1}{5000}\sum_{i=1}^{5000} E_G(i)^2$

The actual best model is the one with the smallest MSE value among the normal, lognormal, and gamma models. Upon completion of these computations, the Kullback-Leibler criterion was calculated for each model to ascertain whether this method matched the determination based on the actual operating model. For this, the asymptotic Kullback-Leibler criterion values for the three models, notated $A_N(i)$, $A_L(i)$, and $A_G(i)$, were computed using equations (1.5), (1.6), and (1.7); the chosen parametric model corresponds to the smallest of these three values. For each sample of size n = 20, the asymptotic criteria were computed to determine a concluded best model. It should be noted that this method of comparing asymptotic criteria with MSEs is valid for this study only because the complete population is known.
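
The simulation as described reduces to a short loop. Below is a sketch of it, not the co-author's Fortran program; the population vector is a placeholder for the n = 289 survival times, which are not reproduced here, and samples of size 20 are drawn without replacement.

    import numpy as np
    from scipy import stats

    def mse_study(population, y0=50.0, n=20, reps=5000, seed=5):
        rng = np.random.default_rng(seed)
        M = np.mean(population <= y0)             # the true M from the population
        sq_err = {"normal": 0.0, "lognormal": 0.0, "gamma": 0.0}
        for _ in range(reps):
            s = rng.choice(population, size=n, replace=False)
            a, loc, scale = stats.gamma.fit(s, floc=0)
            est = {"normal":    stats.norm.cdf(y0, s.mean(), s.std()),
                   "lognormal": stats.lognorm.cdf(y0, s=np.log(s).std(),
                                                  scale=np.exp(np.log(s).mean())),
                   "gamma":     stats.gamma.cdf(y0, a, scale=scale)}
            for name, m_hat in est.items():
                sq_err[name] += (M - m_hat) ** 2
        # The actual best model is the one with the smallest MSE.
        return {name: total / reps for name, total in sq_err.items()}

    population = np.random.default_rng(6).gamma(1.5, 30.0, size=289)   # placeholder
    print(mse_study(population))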

3. Discussion

The entire population of pediatric stem cell recipients was included in this data set, n = 289. A co-author (SF) wrote a program in Fortran to generate random samples of size 20 from the data set and to calculate the mean square errors and asymptotic criteria. To ensure the accuracy of this methodology, 5,000 random samples were generated for each MSE and asymptotic criterion. Not all of these results can be presented in this report; Table 2 includes a representative sample of n = 20 from the data set.

Table 2. Random Sample of Survival Data of Stem Cell Recipients, n = 20

From this random sample, maximum likelihood estimators for the normal, lognormal, and gamma distributions were calculated. Based on these results, the Kullback-Leibler criterion, estimated standard deviation, and trace values were calculated. These results are shown in Table 3.

Table 3. Random Sample Criterion Results

              $\hat\mu$    $\hat\lambda$    K-L Criterion    Est. Std. Dev.    Trace
Normal
Lognormal
Gamma

From these results, it is evident that the lognormal distribution has the smallest asymptotic criterion, followed closely by the gamma distribution. According to the Kullback-Leibler criterion, the lognormal would

then be the best model to use to analyze the population. It is interesting to note that, for a sample of this size, much of the medical literature would assume a normal distribution. By this method, however, the normal distribution produced the largest criterion and would therefore be a less than optimal choice for this population.

The real data were used to obtain the actual value of $M = P(Y \le Y_0)$, where $Y_0$ is a given positive real number and Y is a population value, via

$M = \frac{\#\{\text{population values} \le Y_0\}}{N}$

where N = 289 is the population size. Based on the smallest MSE from 5,000 random samples of size n = 20, the best parametric model was selected. In calculating the mean square errors, $Y_0 = 50.0$ was used, giving $M = P(Y \le Y_0) =$

The MSEs for each distribution were as follows:

Normal =
Lognormal =
Gamma =

Judging by the MSEs over the entire population, the gamma distribution appeared to be the optimal distribution for analyzing the population, although given the close values of the lognormal and gamma MSEs, the gamma distribution is only slightly better. In both the asymptotic criterion and MSE comparisons, the normal distribution is clearly not an appropriate means of analysis. When all 5,000 random samples were analyzed, the model deemed most appropriate should have the smallest criterion value the majority of the time. For this simulation, the asymptotic criterion correctly picked the best model in 76.4% of the samples of size n = 20 at this value of $Y_0$. Notably, over the middle third of the distribution, the asymptotic criterion correctly

picked the best model. Where $Y_0$ is small or large, the asymptotic criterion does not appear to be the optimal selection method.

Figure 6. $Y_0$ versus percentage of correct selections

4. Conclusion

The Kullback-Leibler criterion seems to work well over the middle of this population distribution (see Figure 6). It appears that the normal distribution fit the left tail better when $Y_0$ was small and that the lognormal distribution fit the right tail better for large $Y_0$. This implies that another criterion, one which emphasizes a better fit in the tails, is needed. The small population size, and the fact that the MSEs were extremely close for the gamma and lognormal distributions where the above percentages were small, are also contributing factors to this phenomenon.

Limitations:

The population size for this study was relatively small, n = 289. To better test the Kullback-Leibler criterion, acquiring a much larger population would be ideal, allowing a greater range of random samples from which to draw. A serious limitation of this study was also the author's (GLB) inability to write Fortran code. This type of intensive random sampling and calculation necessitates the ability to generate unique code not found in standard software packages. It should also be noted that even acquiring the assistance of an expert programmer for the Maple software proved inadequate for generating the necessary results. For example, criteria calculated using equation (1.3) tended toward I = 10 for both the example and real data sets. Additional manual computations were done for the samples in Table 1 to ensure that Maple's output was accurate.

Future Considerations:

Given the tail effects observed in the random sampling, it is evident that a single parametric model is insufficient to describe this population as a whole. Rather than take the traditional approach of partitioning the data for analysis with the three best-fitting models, considering a hybrid probability distribution may prove a better approach for this analysis. Such a distribution would take the smaller and larger values into consideration so that a single distribution could be used to model the population. It may also be possible to obtain an even larger data set from the University of Nebraska Medical Center. With a larger sample size, some of the tail-effect phenomena may be diminished, and a more accurate assessment of the comparisons between the MSEs and the Kullback-Leibler criterion would be possible.

References

[1] Myung IJ. Maximum Likelihood Estimation. (Submitted for publication 11/21/01)

[2] Linhart H, Zucchini W. Model Selection. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, New York, 1986.

[3] Zucchini W. An Introduction to Model Selection. J Math Psych 2000; 44: 41-61.

[4] Battaglia GJ. Mean Square Error. AMP J Tech 1996; 6.

Department of Pediatrics, University of Nebraska Medical Center, Omaha, NE

Department of Mathematics, University of Nebraska at Omaha, Omaha, NE
