Sistemática Teórica. Hernán Dopazo. Biomedical Genomics and Evolution Lab. Lesson 03 Statistical Model Selection


1 Sistemática Teórica. Hernán Dopazo, Biomedical Genomics and Evolution Lab. Lesson 03: Statistical Model Selection. Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina, 2013.

2 Statistical Model Selection How many parameters does it take to fit an elephant? An a priori attractive procedure for selecting a model of evolution is the arbitrary use of complex, parameter-rich models. However, when using complex models, numerous parameters need to be estimated, which has several disadvantages. First, the analysis becomes computationally difficult and requires significant time. Second, as more parameters must be estimated from the same amount of data, more error is included in each estimate.

3 Statistical Model Selection Ideally, we should incorporate as much complexity as needed; that is, choose a model complex enough to explain the data but not so complex that it requires impractically long computations or very large data sets to obtain accurate estimates. This is the trade-off between bias and variance. The best-fit model of evolution for a particular data set can be selected through statistical testing: the fit of different models to the data can be contrasted through likelihood ratio tests (LRTs) or information criteria to select the best-fit model within a set of candidates.

4 Likelihood Function The likelihood computation will be explained in a later lesson. For now it is enough to compare two models through their maximized likelihoods L0 and L1: if log(L1) > log(L0), the data are more probable under model 1; the more probable the data under a model, the larger (more positive) its log-likelihood.

5 Likelihood-ratio tests (LRTs) The likelihood ratio test statistic is Δ = 2(ln L2 − ln L1), where L2 is the maximum likelihood under the more parameter-rich, complex model (i.e., the alternative hypothesis) and L1 is the maximum likelihood under the less parameter-rich, simple model (i.e., the null hypothesis). In other words, twice the log-likelihood difference between the null and alternative models is approximately χ² distributed, with degrees of freedom equal to the difference in the number of free parameters between the two models. For nested candidate models M0, M1, M2, M3, ..., the comparison M0 vs M1 tests 2(ℓ1 − ℓ0) against the χ² critical value with 1 degree of freedom at the 1% level (6.63); the comparison M0 vs M3 tests 2(ℓ3 − ℓ0) against the corresponding 1% critical value.
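As an illustration, a minimal Python sketch of such a nested LRT, assuming the two maximized log-likelihoods and the difference in free parameters are already known (all values below are hypothetical):

from scipy.stats import chi2

# Hypothetical maximized log-likelihoods of two nested models
lnL0 = -3245.8   # simple (null) model, e.g. JC69
lnL1 = -3228.1   # complex (alternative) model, e.g. HKY85
df = 4           # extra free parameters in the complex model (hypothetical)

delta = 2 * (lnL1 - lnL0)        # LRT statistic
p_value = chi2.sf(delta, df)     # upper-tail probability under chi-square with df degrees of freedom
critical = chi2.ppf(0.99, df)    # 1% critical value

print(f"2*(lnL1 - lnL0) = {delta:.2f}, p = {p_value:.4g}")
print("reject the simpler model" if delta > critical else "keep the simpler model")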

6 Nested Models

7 Hierarchical likelihood-ratio tests (hLRTs) The main steps to perform the hierarchical LRTs are as follows: 1. Estimate a tree from the data (the base tree). A neighbor-joining (NJ) tree is fast and will do fine; the base tree has been shown not to influence the model finally selected, as long as it is not a random tree. 2. Estimate the likelihoods of the candidate models for the given data set and the base tree. 3. Compare the likelihoods of the candidate models through a hierarchy of LRTs to select the best-fit model among the candidates. This is the approach implemented in Modeltest.

8 Hierarchical likelihood-ratio tests (hLRTs) Some problems with hLRTs: 1. Potential lack of a global optimum. 2. Dependence on the significance level. 3. Dependence on the starting model. 4. Dependence on the order of parameter addition/removal. 5. Estimation of P-values. 6. Burdensome to compare non-nested models. Run Modeltest!

9 Dynamical LRTs An alternative to using a predefined hierarchy of LRTs is to let the data themselves determine the order in which the hypotheses are tested. In this case, the hierarchy need not be the same for different data sets. The suggested algorithms, bottom-up / (top-down), proceed as follows: 1. Start with the simplest / (most complex) model and calculate its likelihood. This is the current model. 2. Calculate the likelihoods of the alternative / (null) models differing by one assumption and perform the corresponding nested LRTs. 3. If any hypotheses are / (are not) rejected, the alternative / (null) model corresponding to the LRT with the smallest / (largest) associated P-value becomes the current model. In the case of several equally small / (large) P-values, select the alternative / (null) model with the best likelihood. 4. Repeat steps 2 and 3 until the algorithm converges.
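A minimal Python sketch of the bottom-up variant, under strong simplifying assumptions: each candidate model is summarized by a pre-computed log-likelihood and parameter count, and a hypothetical neighbors table lists, for each model, the models that add exactly one assumption:

from scipy.stats import chi2

models = {            # name: (log-likelihood, number of free parameters) -- hypothetical values
    "JC":    (-3300.0, 0),
    "JC+G":  (-3270.0, 1),
    "HKY":   (-3260.0, 4),
    "HKY+G": (-3235.0, 5),
}
neighbors = {         # models reachable by adding one assumption -- hypothetical nesting
    "JC":    ["JC+G", "HKY"],
    "JC+G":  ["HKY+G"],
    "HKY":   ["HKY+G"],
    "HKY+G": [],
}

def lrt_pvalue(simple, complex_):
    lnL0, k0 = models[simple]
    lnL1, k1 = models[complex_]
    return chi2.sf(2 * (lnL1 - lnL0), k1 - k0)

current, alpha = "JC", 0.01
while True:
    tests = [(lrt_pvalue(current, alt), alt) for alt in neighbors[current]]
    rejected = [(p, alt) for p, alt in tests if p < alpha]
    if not rejected:
        break          # no alternative improved significantly on the current model
    # move to the alternative with the smallest P-value (ties broken by best likelihood)
    current = min(rejected, key=lambda t: (t[0], -models[t[1]][0]))[1]

print("selected model:", current)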

10 Akaike Information Criterion (AIC) A different approach to model selection is the simultaneous comparison of all competing models, in which the likelihood of each model is penalized by a function of the number of free parameters (K) in the model. The Akaike Information Criterion (AIC) is an asymptotically unbiased estimator of the Kullback-Leibler information quantity (Kullback and Leibler, 1951), which measures the expected distance between the true model and the estimated model. We can think of the AIC as the amount of information lost when we use, say, HKY85 to approximate the real process of molecular evolution; hence the model with the smallest AIC is preferred. An advantage of the AIC is that it can be used to compare both nested and non-nested models. It is computed as AIC = −2 ln L + 2K. When the sample size (n) is small compared with the number of parameters (n/K < 40), a corrected version, AICc = AIC + 2K(K + 1)/(n − K − 1), is recommended. Sample size is usually approximated by the total number of characters in the alignment, although other choices have been proposed: the number of taxa, the number of sites, the number of variable sites, the number of sites times the number of taxa, some function of the numbers of sites and taxa, or something else. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22.
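A small sketch of these two formulas, applied to hypothetical log-likelihoods and parameter counts, with the alignment length used as the sample size:

def aic(lnL, k):
    # AIC = -2 ln L + 2K
    return -2.0 * lnL + 2.0 * k

def aicc(lnL, k, n):
    # corrected AIC, recommended when n / K < 40
    return aic(lnL, k) + (2.0 * k * (k + 1)) / (n - k - 1)

n = 1200                                                     # alignment length (hypothetical)
candidates = {"JC+G": (-3270.0, 1), "HKY+G": (-3235.0, 5)}   # hypothetical lnL and K
scores = {name: aicc(lnL, k, n) for name, (lnL, k) in candidates.items()}
print(scores, "-> best:", min(scores, key=scores.get))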

11 Akaike Information Criterion (AIC) Run Modeltest & ProtTest!

12 Bayesian Information Criterion (BIC) In large data sets, both the LRT and the AIC are known to favour complex, parameter-rich models and to reject simpler models too often (Schwarz 1978). The Bayesian Information Criterion (BIC) penalizes parameter-rich models more severely. It is defined as BIC = −2 ln L + K ln n. Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to selecting the model with the maximum posterior probability. Again, both nested and non-nested models can be compared. Qualitatively, the LRT, AIC, and BIC are all mathematical formulations of the parsimony principle of model building: extra parameters are deemed necessary only if they bring about significant or considerable improvements in the fit of the model to the data; otherwise, simpler models with fewer parameters are preferred. However, in large data sets these criteria can differ markedly; for example, whenever the sample size n > 8, the BIC penalizes parameter-rich models far more severely than does the AIC. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6.
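A minimal sketch of the BIC and of how its per-parameter penalty (ln n) compares with the AIC's fixed penalty of 2, using hypothetical inputs:

import math

def bic(lnL, k, n):
    # BIC = -2 ln L + K ln n
    return -2.0 * lnL + k * math.log(n)

lnL, k, n = -3235.0, 5, 1200                 # hypothetical values
print("BIC =", round(bic(lnL, k, n), 2))
print("penalty per parameter: AIC = 2.0, BIC =", round(math.log(n), 2))  # ln n > 2 once n > 8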

13 Comparing Results with jModelTest 2

14 Comparing Results with jModelTest 2

15 Decision-Theory Selection (DT) Arguing that there is no guarantee that the best-fit model will produce the best estimate of the phylogeny, Minin et al. (2003) developed a novel approach that selects models based on their phylogenetic performance, measured as the expected error in branch-length estimates weighted by the BIC. Under this decision-theory (DT) framework, the best model is the one that minimizes a risk function: for model i, the risk is its expected branch-length error relative to the competing models, weighted by their BIC-based weights. Simulations suggest that models selected with this criterion result in slightly more accurate branch-length estimates than those obtained with hLRTs. Minin, V., Abdo, Z., Joyce, P., and Sullivan, J. (2003). Performance-based selection of likelihood models for phylogeny estimation. Systematic Biology, 52.
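A rough sketch of this risk score, assuming each candidate model i already has a BIC value and a vector B[i] of branch-length estimates on the same base tree, and using the Euclidean distance as the branch-length error (all numbers hypothetical):

import numpy as np

BIC = np.array([5120.0, 5105.0, 5102.0])        # hypothetical BIC scores of three models
B = np.array([[0.10, 0.20, 0.05],               # branch lengths estimated under model 0
              [0.11, 0.22, 0.05],               # ... under model 1
              [0.12, 0.21, 0.06]])              # ... under model 2 (hypothetical)

w = np.exp(-0.5 * (BIC - BIC.min()))
w /= w.sum()                                    # BIC-based approximate model weights

# risk of model i: branch-length distance to every model j, weighted by w[j]
risk = np.array([sum(w[j] * np.linalg.norm(B[i] - B[j]) for j in range(len(BIC)))
                 for i in range(len(BIC))])
print("risk scores:", risk.round(4), "-> best model:", int(risk.argmin()))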

16 Model Uncertainty The AIC, Bayesian, and DT methods can rank the models, allowing us to assess how confident we are in the model selected. For these measures we can report differences (Δ): for the ith model, the AIC (BIC, DT) difference is Δi = AICi − min AIC. As a rough rule of thumb, models having Δi within 1-2 of the best model have substantial support and should receive consideration; models having Δi within 3-7 of the best model have considerably less support; and models with Δi > 10 have essentially no support. From these differences we can obtain the relative weight of each model as wi = exp(−Δi/2) / Σr exp(−Δr/2). Since the weights sum to 1, it is easy to establish a 95% confidence set of models by summing the weights from largest to smallest until reaching 0.95 (or a similar level).
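A short sketch of these quantities for a handful of hypothetical AIC scores (the same code works for BIC scores):

import math

aic_scores = {"JC": 6610.0, "HKY": 6545.0, "HKY+G": 6520.0, "GTR+G": 6521.5}  # hypothetical

best = min(aic_scores.values())
delta = {m: a - best for m, a in aic_scores.items()}          # AIC differences
raw = {m: math.exp(-0.5 * d) for m, d in delta.items()}
total = sum(raw.values())
weights = {m: r / total for m, r in raw.items()}              # relative model weights (sum to 1)

# 95% confidence set: accumulate weights from largest to smallest until 0.95 is reached
conf_set, cum = [], 0.0
for m in sorted(weights, key=weights.get, reverse=True):
    conf_set.append(m)
    cum += weights[m]
    if cum >= 0.95:
        break
print(delta, weights, conf_set, sep="\n")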

17 Model Averaging Interestingly, the model weights allow us to obtain a model-averaged estimate of any parameter. For example, the model-averaged estimate of the relative substitution rate between A and C, using the model weights (w) for the M candidate models, is the weighted average of the estimates over the models that include that rate, divided by the sum of the weights of those models. Importance: it is possible to estimate the relative importance of any parameter by summing the weights across all models that include the parameter we are interested in. For example, the relative importance of the substitution rate between adenine and cytosine across all candidate models is simply the denominator above. It is also possible to build a model-averaged estimate of the phylogeny itself.
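A sketch of this computation for the A-C rate, assuming each candidate model records its weight and, when it estimates that rate, its estimate (all values hypothetical):

models = {
    # name:   (weight, estimate of rAC, or None if the model does not include it) -- hypothetical
    "JC":     (0.01, None),
    "HKY+G":  (0.30, None),
    "GTR":    (0.14, 1.25),
    "GTR+G":  (0.55, 1.31),
}

# relative importance of rAC = sum of weights over the models that include it (the denominator)
importance = sum(w for w, r in models.values() if r is not None)
averaged = sum(w * r for w, r in models.values() if r is not None) / importance

print(f"relative importance of r(A-C): {importance:.2f}")
print(f"model-averaged estimate of r(A-C): {averaged:.3f}")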

19 Model Averaging

20 Model Averaging Using true sequences

21 Model Averaged Phylogeny It is possible to build a model-averaged estimate of the phylogeny itself.

22 LRT of the Global Molecular Clock Assuming a molecular clock, any tree can be rooted at the midpoint of its longest path (the oldest lineage). Statistical methods of phylogenetic reconstruction can estimate the branch lengths of a tree either enforcing or not enforcing a molecular clock. Assuming the same topology and a single model of evolution, the nested hypotheses (clock vs. non-clock trees) can be evaluated with an LRT, since the two models differ only in their branch-length parameterization: under the clock the null model M0 has n − 1 branch parameters for n taxa, whereas the unconstrained alternative model M1 has 2n − 3 branch lengths. M0 vs M1: is 2(ln L1 − ln L0) > 9.22? The statistic is compared against a χ² distribution with df = (2n − 3) − (n − 1) = n − 2 (here 3) at the 1% level.
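A minimal sketch of this clock test, assuming the log-likelihoods of the clock-constrained and unconstrained fits on the same topology are already available (hypothetical values):

from scipy.stats import chi2

n_taxa = 5
lnL_clock = -4520.4   # null model: clock enforced, n - 1 branch parameters (hypothetical)
lnL_free  = -4513.9   # alternative: no clock, 2n - 3 branch lengths (hypothetical)

df = (2 * n_taxa - 3) - (n_taxa - 1)       # = n - 2
stat = 2 * (lnL_free - lnL_clock)
critical = chi2.ppf(0.99, df)              # 1% critical value
print(f"df = {df}, statistic = {stat:.2f}, 1% critical value = {critical:.2f}")
print("reject the clock" if stat > critical else "clock not rejected")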

( http://www.nematodes.org/teaching/tutorials/phylogenetics/bayesian_workshop/Bayesian%20mini conference.htm#_Toc145477467 ) Model selection criteria review: Posada D & Buckley TR (2004). Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology, 53.
