Sistemática Teórica. Hernán Dopazo. Biomedical Genomics and Evolution Lab. Lesson 03 Statistical Model Selection


1 Sistemática Teórica. Hernán Dopazo, Biomedical Genomics and Evolution Lab. Lesson 03: Statistical Model Selection. Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina, 2013.

2 Statistical Model Selection How many parameters does it take to fit an elephant? An a priori attractive procedure for selecting a model of evolution is the arbitrary use of complex, parameter-rich models. However, when using complex models, numerous parameters need to be estimated, which has several disadvantages. First, the analysis becomes computationally difficult and requires significant time. Second, as more parameters must be estimated from the same amount of data, more error is included in each estimate.

3 Statistical Model Selection Ideally, we should incorporate as much complexity as needed; that is, choose a model complex enough to explain the data but not so complex that it requires impractically long computations or very large data sets to obtain accurate estimates. This is the trade-off between bias and variance. The best-fit model of evolution for a particular data set can be selected through statistical testing: the fit of different models to the data can be contrasted through likelihood ratio tests (LRTs) or information criteria to select the best-fit model within a set of candidates.

4 Likelihood Function The likelihood computation will be explained in a later lesson. For now it is enough to compare two models through their maximized likelihoods L0 and L1: if log(L1) > log(L0), the data are more probable under model 1; the more probable the data under a model, the larger (more positive) its log-likelihood.

5 Likelihood-ratio tests (LRTs) The likelihood ratio test statistic is Δ = 2(ln L2 − ln L1), where L2 is the maximum likelihood under the more parameter-rich, complex model (i.e., the alternative hypothesis) and L1 is the maximum likelihood under the less parameter-rich, simple model (i.e., the null hypothesis). In other words, twice the log-likelihood difference between the null and alternative models is approximately χ² distributed, with degrees of freedom equal to the difference in the number of free parameters between the two models. For nested candidate models M0, M1, M2, M3, ..., the comparison M0 vs M1 tests 2(ℓ1 − ℓ0) against the χ² critical value with 1 degree of freedom at the 1% level (6.63); the comparison M0 vs M3 tests 2(ℓ3 − ℓ0) against the corresponding 1% critical value.
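As an illustration, a minimal Python sketch of such a nested LRT, assuming the two maximized log-likelihoods and the difference in free parameters are already known (all values below are hypothetical):

from scipy.stats import chi2

# Hypothetical maximized log-likelihoods of two nested models
lnL0 = -3245.8   # simple (null) model, e.g. JC69
lnL1 = -3228.1   # complex (alternative) model, e.g. HKY85
df = 4           # extra free parameters in the complex model (hypothetical)

delta = 2 * (lnL1 - lnL0)        # LRT statistic
p_value = chi2.sf(delta, df)     # upper-tail probability under chi-square with df degrees of freedom
critical = chi2.ppf(0.99, df)    # 1% critical value

print(f"2*(lnL1 - lnL0) = {delta:.2f}, p = {p_value:.4g}")
print("reject the simpler model" if delta > critical else "keep the simpler model")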

6 Nested Models

7 Hierarchical likelihood-ratio tests (hLRTs) The main steps to perform the hierarchical LRTs are as follows: 1. Estimate a tree from the data (the base tree). A neighbor-joining (NJ) tree is fast and will do fine; the base tree has been shown not to influence the model finally selected, as long as it is not a random tree. 2. Estimate the likelihoods of the candidate models for the given data set and the base tree. 3. Compare the likelihoods of the candidate models through a hierarchy of LRTs to select the best-fit model among the candidates. This is the approach implemented in Modeltest.

8 Hierarchical likelihood-ratio tests (hLRTs) Some problems with hLRTs: 1. Potential lack of a global optimum. 2. Dependence on the significance level. 3. Dependence on the starting model. 4. Dependence on the order of parameter addition/removal. 5. Estimation of P-values. 6. Burdensome to compare non-nested models. Run Modeltest!

9 Dynamical LRTs An alternative to using a predefined hierarchy of LRTs is to let the data themselves determine the order in which the hypotheses are tested. In this case, the hierarchy need not be the same for different data sets. The suggested algorithms, bottom-up / (top-down), proceed as follows: 1. Start with the simplest / (most complex) model and calculate its likelihood. This is the current model. 2. Calculate the likelihoods of the alternative / (null) models differing by one assumption and perform the corresponding nested LRTs. 3. If any hypotheses are / (are not) rejected, the alternative / (null) model corresponding to the LRT with the smallest / (largest) associated P-value becomes the current model. In the case of several equally small / (large) P-values, select the alternative / (null) model with the best likelihood. 4. Repeat steps 2 and 3 until the algorithm converges.
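A minimal Python sketch of the bottom-up variant, under strong simplifying assumptions: each candidate model is summarized by a pre-computed log-likelihood and parameter count, and a hypothetical neighbors table lists, for each model, the models that add exactly one assumption:

from scipy.stats import chi2

models = {            # name: (log-likelihood, number of free parameters) -- hypothetical values
    "JC":    (-3300.0, 0),
    "JC+G":  (-3270.0, 1),
    "HKY":   (-3260.0, 4),
    "HKY+G": (-3235.0, 5),
}
neighbors = {         # models reachable by adding one assumption -- hypothetical nesting
    "JC":    ["JC+G", "HKY"],
    "JC+G":  ["HKY+G"],
    "HKY":   ["HKY+G"],
    "HKY+G": [],
}

def lrt_pvalue(simple, complex_):
    lnL0, k0 = models[simple]
    lnL1, k1 = models[complex_]
    return chi2.sf(2 * (lnL1 - lnL0), k1 - k0)

current, alpha = "JC", 0.01
while True:
    tests = [(lrt_pvalue(current, alt), alt) for alt in neighbors[current]]
    rejected = [(p, alt) for p, alt in tests if p < alpha]
    if not rejected:
        break          # no alternative improved significantly on the current model
    # move to the alternative with the smallest P-value (ties broken by best likelihood)
    current = min(rejected, key=lambda t: (t[0], -models[t[1]][0]))[1]

print("selected model:", current)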

10 Akaike Information Criterion (AIC) A different approach to model selection is the simultaneous comparison of all competing models, in which the likelihood of each model is penalized by a function of the number of free parameters (K) in the model. The Akaike Information Criterion (AIC) is an asymptotically unbiased estimator of the Kullback-Leibler information quantity (Kullback and Leibler, 1951), which measures the expected distance between the true model and the estimated model. We can think of the AIC as the amount of information lost when we use, say, HKY85 to approximate the real process of molecular evolution; hence the model with the smallest AIC is preferred. An advantage of the AIC is that it can be used to compare both nested and non-nested models. It is computed as AIC = −2 ln L + 2K. When the sample size (n) is small compared with the number of parameters (n/K < 40), a corrected version, AICc = AIC + 2K(K + 1)/(n − K − 1), is recommended. Sample size is usually approximated by the total number of characters in the alignment, although other choices have been proposed: the number of taxa, the number of sites, the number of variable sites, the number of sites times the number of taxa, some function of the numbers of sites and taxa, or something else. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22.
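A small sketch of these two formulas, applied to hypothetical log-likelihoods and parameter counts, with the alignment length used as the sample size:

def aic(lnL, k):
    # AIC = -2 ln L + 2K
    return -2.0 * lnL + 2.0 * k

def aicc(lnL, k, n):
    # corrected AIC, recommended when n / K < 40
    return aic(lnL, k) + (2.0 * k * (k + 1)) / (n - k - 1)

n = 1200                                                     # alignment length (hypothetical)
candidates = {"JC+G": (-3270.0, 1), "HKY+G": (-3235.0, 5)}   # hypothetical lnL and K
scores = {name: aicc(lnL, k, n) for name, (lnL, k) in candidates.items()}
print(scores, "-> best:", min(scores, key=scores.get))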

11 Akaike Information Criterion (AIC) Run Modeltest & ProtTest!

12 Bayesian Information Criterion (BIC) In large data sets, both the LRT and the AIC are known to favour complex, parameter-rich models and to reject simpler models too often (Schwarz 1978). The Bayesian Information Criterion (BIC) penalizes parameter-rich models more severely. It is defined as BIC = −2 ln L + K ln n. Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to selecting the model with the maximum posterior probability. Again, both nested and non-nested models can be compared. Qualitatively, the LRT, AIC, and BIC are all mathematical formulations of the parsimony principle of model building: extra parameters are deemed necessary only if they bring about significant or considerable improvements in the fit of the model to the data; otherwise, simpler models with fewer parameters are preferred. However, in large data sets these criteria can differ markedly; for example, whenever the sample size n > 8, the BIC penalizes parameter-rich models far more severely than does the AIC. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6.
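A minimal sketch of the BIC and of how its per-parameter penalty (ln n) compares with the AIC's fixed penalty of 2, using hypothetical inputs:

import math

def bic(lnL, k, n):
    # BIC = -2 ln L + K ln n
    return -2.0 * lnL + k * math.log(n)

lnL, k, n = -3235.0, 5, 1200                 # hypothetical values
print("BIC =", round(bic(lnL, k, n), 2))
print("penalty per parameter: AIC = 2.0, BIC =", round(math.log(n), 2))  # ln n > 2 once n > 8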

13 Comparing Results with jModelTest 2

14 Comparing Results with jModelTest 2

15 Decision-Theory Selection (DT) Arguing that there is no guarantee that the best-fit model will produce the best estimate of the phylogeny, Minin et al. (2003) developed a novel approach that selects models based on their phylogenetic performance, measured as the expected error in branch-length estimates weighted by the BIC. Under this decision-theory (DT) framework, the best model is the one that minimizes a risk function: for model i, the risk is its expected branch-length error relative to the competing models, weighted by their BIC-based weights. Simulations suggest that models selected with this criterion result in slightly more accurate branch-length estimates than those obtained with hLRTs. Minin, V., Abdo, Z., Joyce, P., and Sullivan, J. (2003). Performance-based selection of likelihood models for phylogeny estimation. Systematic Biology, 52.
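A rough sketch of this risk score, assuming each candidate model i already has a BIC value and a vector B[i] of branch-length estimates on the same base tree, and using the Euclidean distance as the branch-length error (all numbers hypothetical):

import numpy as np

BIC = np.array([5120.0, 5105.0, 5102.0])        # hypothetical BIC scores of three models
B = np.array([[0.10, 0.20, 0.05],               # branch lengths estimated under model 0
              [0.11, 0.22, 0.05],               # ... under model 1
              [0.12, 0.21, 0.06]])              # ... under model 2 (hypothetical)

w = np.exp(-0.5 * (BIC - BIC.min()))
w /= w.sum()                                    # BIC-based approximate model weights

# risk of model i: branch-length distance to every model j, weighted by w[j]
risk = np.array([sum(w[j] * np.linalg.norm(B[i] - B[j]) for j in range(len(BIC)))
                 for i in range(len(BIC))])
print("risk scores:", risk.round(4), "-> best model:", int(risk.argmin()))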

16 Model Uncertainty The AIC, Bayesian, and DT methods can rank the models, allowing us to assess how confident we are in the model selected. For these measures we can report differences (Δ): for the ith model, the AIC (BIC, DT) difference is Δi = AICi − min AIC. As a rough rule of thumb, models having Δi within 1-2 of the best model have substantial support and should receive consideration; models having Δi within 3-7 of the best model have considerably less support; and models with Δi > 10 have essentially no support. From these differences we can obtain the relative weight of each model as wi = exp(−Δi/2) / Σr exp(−Δr/2). Since the weights sum to 1, it is easy to establish a 95% confidence set of models by summing the weights from largest to smallest until reaching 0.95 (or a similar level).
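A short sketch of these quantities for a handful of hypothetical AIC scores (the same code works for BIC scores):

import math

aic_scores = {"JC": 6610.0, "HKY": 6545.0, "HKY+G": 6520.0, "GTR+G": 6521.5}  # hypothetical

best = min(aic_scores.values())
delta = {m: a - best for m, a in aic_scores.items()}          # AIC differences
raw = {m: math.exp(-0.5 * d) for m, d in delta.items()}
total = sum(raw.values())
weights = {m: r / total for m, r in raw.items()}              # relative model weights (sum to 1)

# 95% confidence set: accumulate weights from largest to smallest until 0.95 is reached
conf_set, cum = [], 0.0
for m in sorted(weights, key=weights.get, reverse=True):
    conf_set.append(m)
    cum += weights[m]
    if cum >= 0.95:
        break
print(delta, weights, conf_set, sep="\n")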

17 Model Averaging Interestingly, the model weights allow us to obtain a model-averaged estimate of any parameter. For example, the model-averaged estimate of the relative substitution rate between A and C, using the model weights (w) for the M candidate models, is the weighted average of the estimates over the models that include that rate, divided by the sum of the weights of those models. Importance: it is possible to estimate the relative importance of any parameter by summing the weights across all models that include the parameter we are interested in. For example, the relative importance of the substitution rate between adenine and cytosine across all candidate models is simply the denominator above. It is also possible to build a model-averaged estimate of the phylogeny itself.
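A sketch of this computation for the A-C rate, assuming each candidate model records its weight and, when it estimates that rate, its estimate (all values hypothetical):

models = {
    # name:   (weight, estimate of rAC, or None if the model does not include it) -- hypothetical
    "JC":     (0.01, None),
    "HKY+G":  (0.30, None),
    "GTR":    (0.14, 1.25),
    "GTR+G":  (0.55, 1.31),
}

# relative importance of rAC = sum of weights over the models that include it (the denominator)
importance = sum(w for w, r in models.values() if r is not None)
averaged = sum(w * r for w, r in models.values() if r is not None) / importance

print(f"relative importance of r(A-C): {importance:.2f}")
print(f"model-averaged estimate of r(A-C): {averaged:.3f}")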

19 Model Averaging

20 Model Averaging Using true sequences

21 Model Averaged Phylogeny It is possible to build a model-averaged estimate of the phylogeny itself.

22 LRT of the Global Molecular Clock Assuming a molecular clock, any tree can be rooted at the midpoint of its longest path (the oldest lineage). Statistical methods of phylogenetic reconstruction can estimate the branch lengths of a tree either enforcing or not enforcing a molecular clock. Assuming the same topology and a single model of evolution, the nested hypotheses (clock vs. non-clock trees) can be evaluated with an LRT, since the two models differ only in their branch-length parameterization: under the clock the null model M0 has n − 1 branch parameters for n taxa, whereas the unconstrained alternative model M1 has 2n − 3 branch lengths. M0 vs M1: is 2(ln L1 − ln L0) > 9.22? The statistic is compared against a χ² distribution with df = (2n − 3) − (n − 1) = n − 2 (here 3) at the 1% level.
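A minimal sketch of this clock test, assuming the log-likelihoods of the clock-constrained and unconstrained fits on the same topology are already available (hypothetical values):

from scipy.stats import chi2

n_taxa = 5
lnL_clock = -4520.4   # null model: clock enforced, n - 1 branch parameters (hypothetical)
lnL_free  = -4513.9   # alternative: no clock, 2n - 3 branch lengths (hypothetical)

df = (2 * n_taxa - 3) - (n_taxa - 1)       # = n - 2
stat = 2 * (lnL_free - lnL_clock)
critical = chi2.ppf(0.99, df)              # 1% critical value
print(f"df = {df}, statistic = {stat:.2f}, 1% critical value = {critical:.2f}")
print("reject the clock" if stat > critical else "clock not rejected")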

( http://www.nematodes.org/teaching/tutorials/phylogenetics/bayesian_workshop/Bayesian%20mini conference.htm#_Toc145477467 ) Model selection criteria review: Posada D & Buckley TR (2004). Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology, 53.
