PHASE: a Software Package for Phylogenetics And Sequence Evolution


Version 1.1, April 24, 2003

Copyright 2002, 2003 by the University of Manchester. PHASE is distributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Howsun Jow and Vivek Gowri-Shankar
Bug reports: vivek.gowri-shankar@cs.man.ac.uk

Why is PHASE different from other phylogenetic programs?

This package is designed specifically for use with RNA sequences that have a conserved secondary structure, e.g., rRNA and tRNA. It is well known that compensatory substitutions occur in the paired regions of RNA secondary structures: substitutions occurring on one side of a pair are correlated with substitutions on the other side. Most phylogenetic programs assume that each site in a molecule evolves independently of the others, but this assumption is not valid for RNA genes. This package implements substitution models of sequence evolution that consider pairs of sites rather than single sites, along with the standard nucleotide substitution models in current use.

When an RNA molecule with a secondary structure is used in conjunction with an RNA substitution model, PHASE requires a structure-based alignment of the sequences, with the consensus secondary structure indicated in bracket and dot notation at the top of the alignment. We assume that you can provide this structure.

It is now commonplace to perform combined analyses of heterogeneous sequence data when nucleotides with different patterns of evolution are sequenced for a set of studied species. PHASE can use several substitution models simultaneously (for paired and/or unpaired sites), for instance when analysing protein coding genes or when both stems and loops of RNA genes are used.

PHASE provides a Markov Chain Monte Carlo sampler to generate large numbers of possible phylogenetic trees with probability proportional to their likelihood. This is a Bayesian statistical method that allows posterior probabilities to be generated for alternative trees and alternative clades. These posterior probabilities provide a sound statistical measure of support for alternative phylogenetic hypotheses, and they remove the need for bootstrapping. Where many alternative arrangements of a given set of species exist, it is possible to calculate posterior probabilities for all the alternative arrangements of these species in a convenient way. Standard Maximum Likelihood techniques for inferring the optimal tree with any of the DNA or RNA evolution models are also implemented.

The program's features include:
- Bayesian estimation of phylogenies and substitution model parameters
- standard ML search algorithms for inferring the optimal tree, with optional topology constraints
- 6, 7 and 16 state RNA models
- standard 4 state DNA models
- invariant and discrete gamma models for substitution rate heterogeneity between sites
- mixing of molecular data types in a single analysis

Journal publications:
- C. Hudelot, V. Gowri-Shankar, H. Jow, M. Rattray and P. Higgs. RNA-based Phylogenetic Methods: Application to Mammalian Mitochondrial RNA Sequences. Molecular Phylogenetics and Evolution (in press, 2003).
- H. Jow, C. Hudelot, M. Rattray and P. Higgs. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Molecular Biology and Evolution, 19(9) (2002).

Acknowledgements

Howsun Jow and Vivek Gowri-Shankar carried out this work as PhD students at Manchester University under the supervision of Magnus Rattray. We gratefully acknowledge contributions to the design, documentation and testing from Paul Higgs and Cendrine Hudelot. The PHASE software was developed as part of a BBSRC funded research project into RNA-based phylogenetic methods (investigators: Paul Higgs and Magnus Rattray).

Contents

Why is PHASE different from other phylogenetic programs?
Acknowledgements
Introduction
  How to read this manual?
  Acquiring and installing the software
    MS-Windows installation
    Unix-like system installation
  Description of programs in the PHASE package
    optimise and mlphase
    mcmcphase and consensus
    likelihood
    simulate
    analyser
  Running the programs
1 Using programs in the PHASE package
  1.1 Inputs/outputs in PHASE
    Data file format
    Control file format
    Tree file format
    Substitution model parameters file format
    Parameters displayed on the screen and output of each program
    Clade file format
  1.2 Control files
    Structure of the control files
    Datafile block
    Model block
  1.3 Using the programs in the PHASE package
    likelihood
    optimise
    simulate
    mlphase
    mcmcphase
    analyser
2 Elements of phylogenetic theory
  2.1 Phylogenetic trees
    Unrooted phylogenies
    String representation of a tree
    Branch lengths
  2.2 Nucleotide substitution models
    A Markov model of substitution
    Transition matrices
    Nucleotide substitution models implemented in PHASE
  2.3 Paired-site substitution models
    RNA secondary structure
    Theory of compensatory substitutions
    Base-paired substitution models implemented in PHASE
  2.4 Refinements to substitution models
    Invariant and discrete gamma models
    The MIXED model
  2.5 Bayesian phylogenetics
    Bayes theorem
    Markov chain Monte-Carlo (MCMC)
    Priors and proposals
    Pitfalls of Markov chain Monte-Carlo techniques
A Some examples of control files
  A.1 Control file for likelihood
  A.2 Control file for optimise
  A.3 Control file for simulate (1)
  A.4 Control file for simulate (2)
  A.5 Control file for mlphase (1)
  A.6 Control file for mlphase (2)
  A.7 Control file for mcmcphase (1)
  A.8 Control file for mcmcphase (2)
Bibliography

Introduction

How to read this manual?

People with a good background in phylogenetic inference might be interested only in the first chapter, which explains how to use PHASE. The second chapter contains a few elements of the theory of phylogenetic inference, with some valuable information about PHASE that can make technical details in the first chapter clearer. Experienced phylogeneticists might find it useful to read the paired-site substitution models section (2.3) to learn about RNA substitution models, and the Bayesian phylogenetics section (2.5) if they are not familiar with Markov Chain Monte-Carlo (MCMC) techniques.

Once you have read the short description of the programs in this introduction, you can try them straightaway with the examples provided. However, be warned that inferences using the mammals dataset of 69 species and the maximum likelihood inference with the primates (primates-rna-ml.control) require at least one day; use the other control files for a first try. The first chapter of this manual should be used as a reference only and to clarify obscure points about PHASE programs. The HTML version of these pages is probably more convenient for looking up information.

Acquiring and installing the software

PHASE is available for download for Windows and Unix/Linux platforms.

MS-Windows installation

Download the archive phase-1.1-mswin-exec.zip and decompress it into the directory of your choice, for instance c:\phase\. PHASE does not require any other installation procedure, so you can test the software straightaway with the provided example files.

Unix-like system installation

For Unix and Linux systems we recommend that you compile the program yourself. If the process fails and you cannot produce a working executable, you can try the precompiled Linux version in the archive phase-1.1-linux-i586-exec.tgz.

To compile the program yourself:
1. decompress and extract the archive into the directory of your choice:
     tar -xvvzf phase-1.1.tgz
2. enter the newly created phase-1.1 directory:
     cd phase-1.1
3. compile with the provided Makefile:
     make

We assume here that you have a recent version of the C++ compiler g++ on your platform. PHASE cannot be compiled with gcc v2.96 or older. You can check the gcc version installed on your system by typing g++ -v. You might want to (or might have to) edit the makefiles in order to adapt them to your specific system configuration. In that case please have a look at the readme file first.

PHASE uses the BLAS and LAPACK library routines. Generic versions of these mathematical libraries are built during the compilation process; if your system is equipped with optimised versions, you are strongly advised to modify the makefile to use them instead. The g77 compiler and the libg2c library are required, but they should already be present on your system.

Description of programs in the PHASE package

The PHASE (PHylogenetics and Sequence Evolution) package consists of two main programs, mlphase and mcmcphase:
- mlphase performs maximum-likelihood inference
- mcmcphase is a Bayesian phylogenetic inference program

There are five other smaller programs in the package:
- analyser checks the content of the molecular sequences
- likelihood computes the likelihood given a specified evolution model
- simulate generates sequences according to a specified evolution model
- optimise is a smaller version of mlphase without tree search capabilities
- consensus is used with mcmcphase to summarize the results of an MCMC run

Below we summarize the behaviour of these programs. Please refer to the first chapter to learn how to use them.

optimise and mlphase

The mlphase program is a maximum-likelihood phylogenetic inference program similar to dnaml in PHYLIP and baseml in PAML. It has a broad range of functionalities and can be used with a large number of evolutionary substitution models, including those which take into account the RNA secondary structure in the evolution of RNA sequences (see sections 2.2 and 2.3). The mlphase program has two main modes of operation:

1. Optimisation of user-defined trees: estimation of maximum likelihood (ML) branch lengths and, optionally, evolutionary model parameters, given a set of labelled molecular sequences, for a user-defined set of phylogenetic tree topologies.
2. Maximum-likelihood tree search: the program aims at finding the model (tree topology, associated branch lengths and, optionally, sequence evolution model parameters) that yields the highest likelihood. The user can choose among three topology search algorithms:
   - Simple exhaustive search: all the possible phylogenies are considered.
   - Branch and bound search: non-optimal phylogenies are rejected before evaluation.
   - Heuristic search via stepwise addition: greedy search for the best topology.

Constraints can be placed on the phylogenetic tree topologies that are considered during ML inference in order to reduce the search space and the computation time.
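To give a feel for why the exhaustive search is only practical for small datasets and why topology constraints help, recall the standard combinatorial result that n labelled species admit (2n-5)!! = 3 x 5 x ... x (2n-5) distinct unrooted bifurcating topologies. The Python sketch below computes this count; the function name is hypothetical and the code is an illustration, not part of PHASE.

```python
def count_unrooted_topologies(n_species: int) -> int:
    """Number of unrooted, fully bifurcating topologies for n labelled species."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    count = 1
    # A tree with k-1 species has 2k-5 branches, and the k-th species
    # can be attached to any one of them.
    for k in range(4, n_species + 1):
        count *= 2 * k - 5
    return count


for n in (4, 5, 10):
    print(n, count_unrooted_topologies(n))
```

With 10 species there are already more than two million topologies, which is why branch and bound or the stepwise-addition heuristic is usually preferred.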

The optimise program is a simpler version of mlphase and is provided for convenience. This program returns the ML branch lengths and ML evolutionary parameters of a fixed user-defined tree topology, for instance a consensus tree found with an MCMC run. This is equivalent to the first mode of mlphase with only one tree. The optimise program requires fewer parameters than mlphase; it is simpler to use and allows quick experimentation with different initial parameters when trapping in a local maximum of the likelihood is suspected.

mcmcphase and consensus

The mcmcphase program performs Bayesian phylogenetic inference. It uses a Markov Chain Monte Carlo algorithm to sample from the posterior probability distribution of phylogenetic tree topology, branch lengths and sequence evolution model parameters. For an explanation of Bayesian phylogenetics and a description of the MCMC sampling algorithms used in mcmcphase, please consult the Bayesian phylogenetics section (2.5) or Jow et al. (2002).

The consensus program is used to exploit the results of an MCMC run. This program produces two consensus models (using the mean and the median of the parameters in the sample) and can return consensus branch lengths for any supplied topology, e.g., a PHYLIP-style consensus tree, if similar topologies were sampled during the run.

likelihood

The likelihood program computes the likelihood of a phylogeny with respect to any of the implemented substitution models.

simulate

The simulate program generates molecular sequences according to a user-specified tree (topology and branch lengths) and substitution model (type and parameters).

analyser

At the moment analyser outputs basic statistics about a sequence data file. It can also be used to locate the sites with too many gaps, in case you decide to remove them. The analyser program can be quite useful to validate your secondary structure alignment and to set a maximum limit for the mismatch frequency at each site (see section 2.3.1).

Running the programs

Programs in the PHASE package are run from the command line under both Unix-like systems and MS-Windows systems. On Windows, you have to open an MS-DOS command window to use them: click on Run... in the Start menu and type cmd in the newly opened dialog box. You might have to type command instead of cmd depending on your MS-Windows version. Once the command window is open, move to the directory where you extracted the software; at the prompt you can type, for example, cd c:\phase\. You can then run any program of the PHASE package.

Run the programs by typing their name followed by the arguments they require. In most cases, PHASE's programs take one argument, which is the name of a control file (see section 1.2). You can type for instance:

    mcmcphase control\hiv-dna-mcmc-1.control

or

    optimise control/primates-rna-optimise-7a.control

After installation, if the examples are all present, these commands should work. Please note that the use of the \ or / character depends on your operating system. On Unix systems you might have to type ./ before the program name.
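When the programs are launched from a script rather than typed by hand, the platform differences noted above (path separator, optional ./ prefix) can be handled in one place. The following Python sketch is only an illustration: the helper name build_command is hypothetical, and it assumes the executable sits in the current directory on Unix-like systems.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath


def build_command(program, control_parts, windows=None):
    """Return [program, control-file] with separators suited to the platform.

    `windows=None` means "detect from the running interpreter".
    """
    if windows is None:
        windows = os.name == "nt"
    if windows:
        # Windows: backslash-separated path, program name typed as-is
        return [program, str(PureWindowsPath(*control_parts))]
    # Unix-like: forward slashes, and "./" assuming the program is in the
    # current directory (an assumption of this sketch)
    return ["./" + program, str(PurePosixPath(*control_parts))]


print(build_command("mcmcphase", ["control", "hiv-dna-mcmc-1.control"], windows=True))
print(build_command("optimise", ["control", "primates-rna-optimise-7a.control"], windows=False))
```

The returned list can be handed to subprocess.run from the Python standard library to start the program.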

1 - Using programs in the PHASE package

1.1 Inputs/outputs in PHASE

Data file format

All molecular sequence data used by the PHASE programs are stored in a common format. The data file format is similar to the PHYLIP data file format, with a few minor modifications. A data file is divided into four sections, two of which are optional. Comments can be included by preceding the commented lines with a hash (#) symbol; the entire commented line is ignored by the program. Taking a look at the example data files in the package (*.dna, *.rna, and *.mix in the data directory) will make the following explanations easier to understand.

File content

The first non-comment section of the data file is a single line containing:
1. the number of species
2. the length of the molecular sequences
3. a code, which can be either DNA for usual unpaired molecular sequences or RNA for base-paired molecular sequences. The purpose of this code is in fact to indicate whether a pairing mask (see below) is present in the data file, and you can use the code RNA even if some nucleotides are unpaired.

For example the line

    5 100 DNA

at the beginning of a data file indicates that there are five non-base-paired sequences of length 100 in the file. For convenience a third code, MIXED, can be used instead of RNA when the user is using a concatenation of RNA loops and stems (see section 2.3.1), but it should be avoided in other cases. More details on the specific meaning of the code MIXED are given in the class section below. The lines

    10 300 RNA

and

    10 300 MIXED

both indicate that there are ten sequences of length 300 in the file and that a pairing mask is associated with them.

Pairing mask

The second section of the data file is the pairing mask. This mask is only required when sequences contain some base-paired nucleotides (in that case the code should be RNA or MIXED). In the case of fully unpaired sequences, i.e., when the DNA code is used, the pairing mask must not be provided.

The pairing mask has the form of a mathematical expression made of round brackets. Corresponding brackets indicate that the bases at those positions in the sequence form a base-pair in the RNA secondary structure. Unpaired sites can be indicated with a dot (.) or a hyphen (-). For example, a sequence ACCAGAUGGU with a pairing mask (((.(.)))) indicates that the sequence is made of the base-pairs AU-CG-CG-GU and the unpaired sites A-A.
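The bracket matching described above can be sketched with a stack. The Python code below is a minimal illustration, not PHASE's own parser; the function name parse_pairing_mask is hypothetical and site indices are 0-based.

```python
def parse_pairing_mask(mask):
    """Return (pairs, unpaired) as 0-based site indices for a bracket/dot mask."""
    stack, pairs, unpaired = [], [], []
    for i, ch in enumerate(mask):
        if ch == "(":
            stack.append(i)                    # remember the opening side
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at site {i + 1}")
            pairs.append((stack.pop(), i))     # innermost open bracket pairs with i
        elif ch in ".-":
            unpaired.append(i)
        else:
            raise ValueError(f"unexpected character {ch!r} at site {i + 1}")
    if stack:
        raise ValueError(f"unmatched '(' at site {stack[-1] + 1}")
    return sorted(pairs), unpaired


seq = "ACCAGAUGGU"
pairs, unpaired = parse_pairing_mask("(((.(.))))")
paired_bases = [seq[i] + seq[j] for i, j in pairs]
unpaired_bases = [seq[i] for i in unpaired]
print(paired_bases, unpaired_bases)
```

For the sequence and mask of the example above, this yields the base-pairs AU, CG, CG, GU and the two unpaired A sites.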

Molecular sequences

The third section of the data file contains the molecular sequences. Indels (-) and ambiguities (purine (R), pyrimidine (Y), unknown (N or ?)) are allowed. Sequences can be written in one of two formats. The first is the non-interleaved format. This consists of an identifying label for each sequence followed by the whole sequence. An example is:

    2 16 DNA
    Mouse ACCGUGGU UCCAUAAA
    Rat   ACUGUGGC UCGAUAUA

There can be no spaces in the label, though the sequence itself can be formatted into blocks using multiple lines and spaces. An alternative way of specifying the sequences is the interleaved format, which enables the sequences to be split into homologous blocks. The non-interleaved example given above could equivalently be written:

    2 16 DNA
    Mouse ACCG
    Rat   ACUG
    UGGUUCCAUAAA
    UGGCUCGAUAUA

Notice that only the first interleaved block should contain labels. Subsequent interleaved blocks are assumed to have the same labels and to be in the same order.

Class section

The fourth section is optional and is used when performing a combined analysis of heterogeneous data sets (e.g., loops and stems of an RNA molecule, protein coding genes with three codon positions, or concatenated data of different genes with different evolutionary patterns). You can safely skip this section if you plan to study DNA sequences or RNA helices only (i.e., no "." in the pairing mask) with a single appropriate nucleotide or base-pair substitution model.

The aim of this section is to assign each nucleotide or pair to a class. Each class is expected to have a different pattern of evolution. This section consists of a sequence of integers which correspond to the class of each nucleotide.
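As an illustration of such a sequence of integers, the Python sketch below writes one class per codon position for a protein coding gene. It assumes that the alignment starts at a first codon position and contains no frame shift, and it writes the labels separated by spaces; the function name is hypothetical and the layout is not copied from the PHASE example files.

```python
def codon_position_classes(sequence_length):
    """One class label per site, cycling 1, 2, 3 over the codon positions."""
    return " ".join(str(site % 3 + 1) for site in range(sequence_length))


print(codon_position_classes(9))
```

Each site receives exactly one label, so the number of labels equals the sequence length.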
For instance, the class section of a protein coding gene may assign each of the three codon positions to its own class. When the data file contains a class section, programs in the PHASE package expect it to comply with the following set of rules:
- class labels are separated by a space
- classes are labelled from 1 to K, where K is the number of distinct classes
- the number of labels equals the length of the sequences
- when used in conjunction with a base-paired structure, the two components of a paired site are in the same class.

Since PHASE is specifically designed for the analysis of RNA sequences with secondary structure, the most common use of the class section should be the obvious separation of unpaired and base-paired sites into two distinct classes. The code MIXED can replace the code RNA to avoid this tiresome task and let PHASE know that it can simply use the provided pairing mask to build the class section (e.g., the mask (((.()))).. implies the classes 2 2 2 1 2 2 2 2 2 1 1). When the code MIXED is used the class section is optional, and the unpaired and paired sites are automatically attributed to classes 1 and 2 respectively. (When the class section is present, the code MIXED is equivalent to RNA: the user assignment prevails over the automatic one.)

Usually classes are used to determine the model of sequence evolution PHASE uses with each nucleotide. Each class in the data file is treated by its own model of nucleotide substitution during the phylogenetic inference. The models are defined later in the model section of the control file (see 1.2.3). Let us just point out here that if you use the MIXED type for your data with the automatic assignment, i.e., without the class section, you have to make sure that your first and second models are respectively a nucleotide substitution model and a base-pair substitution model when you declare your models of evolution. We will return to this point later on.

Control file format

Most programs in the package use a control file. The purpose of this file is to assign a specific task to the program, i.e., the sequences to analyse, the assumed substitution model, and other specific parameters. Control files are the key to using the software and two sections are devoted to them. Section 1.2 describes the structure of this file and the features common to many programs in the package. Section 1.3 presents the specific parameters for each program.

Tree file format

PHASE can output trees into a file, and sometimes the user has to provide a file which contains one or more trees. A tree file is simply a file with one or more phylogenies written in the computer-readable format described in the tree representation section (2.1.2).

Substitution model parameters file format

With a model parameters file, one can provide initial values for the parameters of the substitution models used. PHASE can also create a model parameters file to store the results concerning a substitution model after a run (these could be Maximum Likelihood Estimate (MLE) parameters or Mean Posterior Estimate (MPE) parameters).

Model parameters file content

The content of this file is highly dependent on the substitution model used and we cannot describe it in general terms. The fields used to assign a value to each parameter should be self-explanatory as long as you know the underlying substitution model.
You might need to have a look at the transition matrices section (2.2.2) to understand the PHASE concept of rate ratios in substitution models. Each "Rate ratio i" parameter in this file stands for the parameter α_i in the transition matrix of the corresponding model. Transition matrices for all implemented substitution models are given in the section "Nucleotide substitution models implemented in PHASE" for DNA models and in the section "Base-paired substitution models implemented in PHASE" for RNA models.

Producing a model parameters file

Model parameters files and control files share the same structural elements. Some examples can be found in the data directory (*.model). Although the content of a model parameters file is easy to understand when reading it, you might find it harder to produce your own file from scratch if you want to initialise a substitution model with specific values. The simulate program can generate a stub of this file for each model implemented in the PHASE package, and this skeleton can be modified easily to suit your needs. See the description of simulate in section 1.3 for details.

Parameters displayed on the screen and output of each program

Each program in the package outputs information on the screen, and writes one or more files to store the results permanently. The outputs are reviewed individually for each program in section 1.3. The content displayed on the screen is usually quite easy to understand, but you might be a bit confused by the parameters of substitution models. PHASE outputs two kinds of matrices on the screen:
- one rate ratios matrix R
- one transition matrix Q

These matrices are described in the transition matrices section (2.2.2). Other parameters have a straightforward meaning.

Clade file format

The user is allowed to specify some invariant clades to reduce the number of possible topologies when using mlphase. A clade file contains a list of monophyletic clades in Newick format (see section 2.1.2). All studied species must appear once (and only once) in the file, either alone or in a clade. Here is a simple clade file example for 6 species:

    (Specie5,Specie6);
    (Specie4,(Specie3,Specie2));
    Specie1;

1.2 Control files

Most programs in the PHASE package have their options set using a simple text file. We call this file the control file. Although the content of this file may differ for each program in the package, its structure remains the same. Some control files are provided as examples with the package (*.control in the control directory). The easiest and safest way to use PHASE is to copy one of these examples and to adapt it to your needs.

Structure of the control files

A control file contains logical blocks (e.g., DATAFILE block, MODEL block, ...) and control lines. Lines preceded by a hash (#) symbol are considered comments and ignored. Comments can be placed anywhere. A control line defines a parameter and gives it a value. It has the format:

    label = value

The order in which control lines are provided in the control file is not important, but they must appear in the right block. Note that PHASE is case sensitive: "Tree file" and "Tree File" are two different labels. At the moment no warning is issued if the user mistypes an optional parameter. Please check your control files against the provided examples, otherwise PHASE might miss some important parameters without you noticing it.

A block is a container. It contains control lines but can also contain other blocks.
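To make this structure concrete, here is a small Python sketch that reads control-file text into nested dictionaries. It is not PHASE's own parser: the function name parse_control is hypothetical, and the sketch assumes that a block NAME is opened by a line {NAME} and closed by a line {\NAME}, each alone on its line, and that comment lines start with #. Labels are kept exactly as typed, since they are case sensitive.

```python
def parse_control(text):
    """Parse control-file text into nested dicts: blocks hold label -> value."""
    root = {}
    stack = [("", root)]                       # (block name, block content)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                           # blank line or comment
        if line.startswith("{") and line.endswith("}"):
            name = line[1:-1]
            if name.startswith("\\"):          # closing tag {\NAME}
                if len(stack) == 1 or stack[-1][0] != name[1:]:
                    raise ValueError(f"line {lineno}: unexpected closing tag {line}")
                stack.pop()
            else:                              # opening tag {NAME}
                block = {}
                stack[-1][1][name] = block
                stack.append((name, block))
        elif "=" in line:
            label, value = line.split("=", 1)
            stack[-1][1][label.strip()] = value.strip()
        else:
            raise ValueError(f"line {lineno}: neither a tag nor a control line")
    if len(stack) != 1:
        raise ValueError(f"block {stack[-1][0]} is never closed")
    return root


example = r"""
# comments are ignored
{DATAFILE}
Data file = data/sequences.dna
Interleaved data file = yes
{\DATAFILE}
{MODEL}
Model = MIXED
{MODEL1}
Model = REV
{\MODEL1}
{\MODEL}
"""
cfg = parse_control(example)
print(cfg["DATAFILE"]["Data file"], cfg["MODEL"]["MODEL1"]["Model"])
```

Because PHASE gives no warning for a mistyped optional label, a structure like this could also be compared against one parsed from a provided example file to spot unknown labels before a long run.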
The block BLOCKNAME begins with the tag:

    {BLOCKNAME}

and ends with the tag:

    {\BLOCKNAME}

Tags must be alone on their line. By convention the names of blocks are all uppercase.

In the remainder of this document, parameters of the control files are colored depending on their status. Compulsory parameters are in red and you must provide a value for them. Optional parameters are in green and they do not need to appear in the control file; often, a default value is assumed for them. Some fields depend on the presence and/or values of other parameters, and their presence (or absence) is compulsory under certain conditions. These conditional parameters are in orange.

Datafile block

Almost all programs in the PHASE package require a DATAFILE block to parse the analysed sequences. As stated previously, the DATAFILE block begins with the tag {DATAFILE} alone on a line and ends with the tag {\DATAFILE} alone on a line. The DATAFILE block contains some necessary information which is not included in the data file itself (see section 1.1.1 for the format of this file). It contains the following control lines:

- Data file: the location of the molecular sequences file to be used.
    Data file = data/sequences.dna
- Interleaved data file: a yes/no option that specifies whether the molecular data is interleaved.
    Interleaved data file = yes
- Outgroup: the label of the outgroup sequence (see section 2.1.1). The inference techniques used in PHASE produce unrooted phylogenies and using an outgroup in your study is not required. However, PHASE requires this parameter to produce a unique Newick representation (2.1.2) for unrooted trees.
    Outgroup = Mole
- Heterogeneous data models: a yes/no parameter which specifies whether the data file contains a class section. The default value is no, and the class section of your data file will be ignored if you forget this field.
    Heterogeneous data models = yes

Model block

Most programs in the PHASE package require the specification of a substitution model for sequence evolution. This is the purpose of the MODEL block, which is delimited by the {MODEL} and {\MODEL} tags. It contains the name of the substitution model followed by parameters (and sometimes blocks) specific to the model (see section 2.2 for background information on substitution models of nucleotide evolution).

Simple substitution model

Depending on the data to be analysed, the PHASE package can be used with a wide variety of DNA substitution models or RNA-specific base-paired models (see the sections "Nucleotide substitution models implemented in PHASE" and "Base-paired substitution models implemented in PHASE" for a review of these models). The content of the MODEL block is the same for all these models and the parameters are:

- Model: the model's name; by convention it should be all upper case.
    Model = REV
  Nucleotide substitution models implemented include JC69, K80, HKY85, TN93 and REV. Base-paired substitution models implemented include RNA6A, RNA6B, RNA7A, RNA7D, RNA16A.
- Discrete gamma distribution of rates: the discrete gamma model (see section 2.4.1) can be used to account for among-site rate variation. Use yes/no values to turn this option on/off. When a discrete gamma model is used, PHASE expects the number of gamma categories to be specified. By default the discrete gamma model is not used.
    Discrete gamma distribution of rates = yes
- Number of gamma categories: when the discrete gamma model is used, you have to provide an integer to specify the desired number of discrete gamma categories.
    Number of gamma categories = 5
- Invariant sites: alternatively, or in conjunction with the discrete gamma model, the user can allow a proportion of sites to be invariant, i.e., with zero rate of evolution. The default value is no.
    Invariant sites = yes

Mixed model for combined analyses of heterogeneous data

To study heterogeneous sequences several models are required. The mixed model (see section 2.4.2) allows these models to work concurrently.

- Model: this field contains the name of the model, which is MIXED.
    Model = MIXED
- Number of models: the number of models used concurrently. If a class section was provided with the data file, then the number of models should be the same as the number of classes. If you used the code MIXED in your data file and did not provide a class section, then this parameter has to be set to 2 and the two models must be a DNA substitution model and a base-paired substitution model respectively.
    Number of models = 3
- {MODELi} block: each model used in the mixed model must be defined in its own block. If the number of models is n, then the MODEL block must contain n blocks whose names are MODEL1, MODEL2, ..., MODELn. The content of these blocks is the same as for a simple substitution model block.

    {MODEL}
    Model = MIXED
    Number of models = 3
      {MODEL1}
      Model = REV
      Invariant sites = yes
      {\MODEL1}
      {MODEL2}
      Model = RNA7A
      Discrete gamma distribution of rates = yes
      Number of gamma categories = 5
      {\MODEL2}
      {MODEL3}
      Model = RNA7D
      Invariant sites = no
      Discrete gamma distribution of rates = no
      {\MODEL3}
    {\MODEL}

1.3 Using the programs in the PHASE package

Each program in the PHASE package requires a specific control file, the content of which is described here. As in the previous section, compulsory parameters appear in red, optional parameters in green, and conditional parameters that depend on the others in orange.

likelihood

Using likelihood

The likelihood program is used to compute the likelihood of a model of evolution (i.e., tree + parameterised substitution model) given a set of studied sequences. To use likelihood, one has to provide a phylogeny for the taxa under investigation (i.e., topology and branch lengths) and a substitution model for nucleotide evolution with user-defined parameters. To use likelihood, type at the command line:

    likelihood likelihood-control-file

where likelihood-control-file is a valid control file for the likelihood program. For verification purposes likelihood prints the phylogenetic tree used on the screen before the likelihood value. Unlike most other PHASE programs, likelihood does not send any results to a file.

Control file for likelihood

An example of a valid control file for likelihood can be found in appendix A.1. In its control file, the likelihood program requires the specification of:
- a DATAFILE block: see the data file block section (1.2.2).

14 a MODEL block: see the model block section (1.2.3). Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (section 2.1.2), with branch lengths values. Tree file = data/mammals-consensus.tree Model parameters file: the name of the file containing parameter values for the model defined in the MODEL block above. Simulate can help you to produce this file. Model parameters file = data/mammals-consensus.model optimise Using optimise The program optimise is used to compute maximum-likelihood estimates (MLE) for the branch lengths and substitution model parameters of a given model of evolution (i.e., a fixed tree topology and a specified substitution model with free parameters). One can specify some initial values for branch lengths and substitution model parameters to speed-up the convergence or to detect trapping in local maxima of the likelihood function. To use optimise, type at the command-line: optimise optimise-control-file where optimise-control-file is a valid control file for the optimise program. When launched, optimise displays the initial tree and the initial likelihood on the screen and begins the optimisation. Once it is finished, the ML substitution model parameters are printed on the screen and saved in the.output file with the ML tree and the value of the maximum likelihood. The ML tree is also saved in the.tree file and a.model file (see input section 1.1.4) is created to store the MLE for the substitution model parameters. Control file for optimise An example of a valid control file for optimise can be found in appendix A.2. The control file of the optimise program must/may provide: a DATAFILE block: see the data file block section (1.2.2). a MODEL block: see the model block section (1.2.3). Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (see section 2.1.2) with optional initial branch lengths values. 
    Tree file = mammals69-mix-consensus.tree

Random seed: the integer value provided with this field is used to initialise the random number generator (used to draw random initial branch lengths if they are not provided).
    Random seed = 1

Starting model parameters file: the name of a file containing initial values for the parameters of the substitution model used. If this field is not provided, the analysed sequences are used to initialise the model.
    Starting model parameters file = data/hiv.model

Output file: the basename for the three files basename.tree, basename.model and basename.output. They contain the results generated by optimise.
    Output file = mammals69-mix-optimise

simulate

Using simulate

Simulate is used:

1. to generate examples of .model files for all the substitution models implemented in PHASE. A .model file (see section 1.1.4) is used to provide initial or fixed values for the model parameters to some programs in the package.
2. to generate molecular sequences that evolved from a random initial sequence according to a specified model of evolution, i.e., a phylogeny and a substitution model.

To use simulate, type at the command-line:
    simulate simulate-control-file
where simulate-control-file is a valid control file for the simulate program.

In its first mode of operation, simulate creates a single .model file, which you can then edit to supply your own initial values.

In its second mode of operation, simulate displays on screen the tree used to generate the sequences. This tree is either provided by the user or randomly created by the program; in the latter case the tree is saved in a file specified by the user. Finally, the likelihood of the generated sequences given the model is printed on the screen and simulate saves the sequences in a file specified by the user. The format of this file is described in the data file format section (1.1.1). If the MIXED model is used, heterogeneous sequences are generated in sequential order.

Control file for simulate

In appendix A.3 and A.4, example control files are provided for the first and the second mode of operation respectively. The control file of the simulate program must provide:

a MODEL block: see the model block section (1.2.3).

Retrieve the name of the model's parameters: a boolean field that states the user's aim. Use yes for the first mode of operation mentioned above and no for the second.
    Retrieve the name of the model's parameters = no

Model parameters file: if simulate is used to generate an example of a substitution model parameters file, the parameters are saved in a file with the name provided.
When simulate is used to generate sequences, the user must provide parameters for the substitution model, and they are read from the given file.
    Model parameters file = simulate.model

The following fields may be required when simulate is used to generate sequences.

Random seed: the integer value provided with this field is used to initialise the random number generator.
    Random seed = 1

Random tree and Tree file: simulate can either generate a random tree or use a supplied phylogeny. If Random tree is set to yes, simulate generates a random tree and saves it in the specified file. If Random tree is set to no, simulate reads the user tree from the specified file.
    Random tree = no
    Tree file = 8-species.tree

Number of species and Maximum branch length: when the Random tree field is set to yes, the user must provide the number of species and the maximum branch length in the generated tree.
    Number of species = 10
    Maximum branch length = .4

Number of symbols from class i: you have to specify the number of symbols (e.g., number of nucleotides or number of paired sites) you want to generate for each class in your final sequence.
    Number of symbols from class 1 = 100
    Number of symbols from class 2 = 100
    Number of symbols from class 3 = 100
    Number of symbols from class 4 = 500
    Number of symbols from class 5 =

Structure for the elements of class i: simulate can add a structure to the generated data file, in which case you have to specify the appropriate structure for the elements of each class.
    Structure for the elements of class 1 = .
    Structure for the elements of class 2 = .
    Structure for the elements of class 3 = .
    Structure for the elements of class 4 = .
    Structure for the elements of class 5 = ()

Data file type and Total length of the raw sequences: simulate produces an input file following the format defined in the data file format section (1.1.1). To produce this file, you must supply the type and the length that are written on its first line (see section 1.1.1). With the 5 classes described above:
    Data file type = RNA
    Total length of the raw sequences = 1400 #( *2)

Output file: the name of the file where generated sequences are saved.
    Output file = simulated-data/codons and rna.sequences

mlphase

Using mlphase

The mlphase program can be used:

1. to find the maximum likelihood estimates for branch lengths and, optionally, evolutionary model parameters for a user-defined set of topologies.
2. to find the phylogeny and, optionally, the evolutionary model parameters that yield the maximum likelihood. Three algorithms are provided for topology search:
    Simple exhaustive search
    Branch-and-bound exhaustive search
    Heuristic stepwise addition

In the first mode of operation, mlphase operates like optimise but several trees can be considered at once.

In the second mode of operation, when mlphase performs a branch-and-bound search or a simple exhaustive search, the ten phylogenies (and associated substitution model parameters) with the highest likelihood are returned. These two search algorithms return the best tree unless the optimisation process becomes trapped in a local optimum. The heuristic stepwise addition returns only one tree. It is less likely to find the optimal tree but remains computationally feasible with a larger number of taxa.
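
To give a sense of why the exhaustive searches only scale to a small number of taxa, the sketch below counts candidate topologies. It assumes unrooted, strictly bifurcating trees, for which the number of topologies on n labelled taxa is (2n-5)!!. The manual does not state which tree space mlphase enumerates, so the figures are indicative only, and the function name is purely illustrative.

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of distinct unrooted binary topologies for n_taxa labelled leaves.

    Assumption: unrooted, strictly bifurcating trees (not confirmed by the manual).
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1  # three taxa admit a single unrooted topology
    # The k-th taxon can be attached to any of the 2k-5 branches of the
    # current tree, which is also the move used by stepwise addition.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


if __name__ == "__main__":
    for n in (4, 6, 8, 10, 20):
        print(f"{n:2d} taxa: {count_unrooted_topologies(n)} topologies")
```

Under this assumption, ten taxa already give more than two million topologies, each of which needs its own branch-length optimisation. Stepwise addition, by contrast, evaluates only the 2k-5 attachment points at each step.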
Be warned that the optimiser may occasionally crash unexpectedly; you can change the initial values to work around this (hopefully rare) problem.

To reduce the search space and the computation time, constraints can be placed on the tree topologies considered during ML inference. With a clade file (see section 1.1.6) you can specify monophyletic clades that must be preserved during phylogenetic inference. The program then looks for an optimal topology consistent with these clade arrangements.

To use mlphase, type at the command-line:
    mlphase mlphase-control-file
where mlphase-control-file is a valid control file for mlphase.

The mlphase program saves the results of an inference in a single file. Results are also displayed on screen during the run.

Control file for mlphase

Please see the examples in appendix A.5 and A.6. These control files show the two main modes of operation. The control file of the mlphase program contains:

a DATAFILE block: see the data file block section (1.2.2).

a MODEL block: see the model block section (1.2.3).

a FUNCTION block that depends on the operating mode of mlphase (see below).

Random seed: the seed for the random number generator.
    Random seed = 13

Output file: the name of the file where the results are sent.
    Output file = results/hiv-mlphase.output

The FUNCTION block contains parameters specific to the mode of operation. At the moment, mlphase can Optimise user-defined phylogenetic trees or Search for ML topology.

When the user wants to optimise a set of defined trees, the FUNCTION block contains the following fields:

Function: the parameter that specifies the mode of operation.
    Function = Optimise user-defined phylogenetic trees

Trees file: the name of the file containing the phylogenies, i.e., a set of trees in the Newick format (section 2.1.2) with optional initial branch lengths.
    Trees file = primates.phylogenies

Number of trees: the number of trees in the previous file.
    Number of trees = 4

Optimise model parameters: set this field to no if the model parameters are to be considered fixed; set it to yes if you want to optimise them.
    Optimise model parameters = no

User's model parameters file: if the parameters are fixed, values must be provided for them. This field gives the name of the file containing the parameters for the model defined in the MODEL block. If a file is provided when it is not required, its content is used to initialise the parameters of the model before optimisation.
    User's model parameters file = data/hiv-rev.model

When looking for the ML tree, the FUNCTION block contains:

Function: the parameter that specifies the mode of operation.
    Function = Search for ML topology

Topology search: this field specifies the search algorithm used to determine the phylogenies with the highest likelihood. At the moment the search algorithms implemented are Simple exhaustive search, Branch-and-bound exhaustive search and Heuristic stepwise addition.
    Topology search = Heuristic stepwise addition

User defined monophyletic clades and Clade file: set the first field to yes if you want to constrain the search in the topology space. The second field is the name of your clade file (see section 1.1.6).
    User defined monophyletic clades = yes
    Clade file = primates.clades

Optimise model parameters: set this field to no if the model parameters are to be considered fixed; set it to yes if you want to optimise them.
    Optimise model parameters = yes

User's model parameters file: if the parameters are fixed, values must be provided for them. This field gives the name of the file containing the parameters for the model defined in the MODEL block.
    User's model parameters file = data/primates-rna7a.model
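
The field rules for the FUNCTION block can be checked before launching a long run. The following is a minimal sketch and not part of PHASE. It assumes that fields appear one per line in the form name = value and that # starts a comment (as the simulate example suggests); the delimiters of the blocks in a real control file are not handled. The helper names parse_fields and check_function_fields are hypothetical, and the spelling of the field User's model parameters file with a plain apostrophe is also an assumption.

```python
from __future__ import annotations


def parse_fields(text: str) -> dict[str, str]:
    """Collect `name = value` fields from a control-file fragment."""
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop assumed '#' comments
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        fields[name.strip()] = value.strip()
    return fields


def check_function_fields(fields: dict[str, str]) -> list[str]:
    """Return the problems found in the FUNCTION-block fields (empty if none)."""
    problems: list[str] = []
    function = fields.get("Function")
    if function == "Optimise user-defined phylogenetic trees":
        for name in ("Trees file", "Number of trees"):
            if name not in fields:
                problems.append(f"missing field: {name}")
    elif function == "Search for ML topology":
        if "Topology search" not in fields:
            problems.append("missing field: Topology search")
        constrained = fields.get("User defined monophyletic clades") == "yes"
        if constrained and "Clade file" not in fields:
            problems.append("missing field: Clade file")
    else:
        problems.append(f"unknown Function: {function!r}")
    # Fixed model parameters have to come from a user-supplied file.
    fixed = fields.get("Optimise model parameters") == "no"
    if fixed and "User's model parameters file" not in fields:
        problems.append("missing field: User's model parameters file")
    return problems


if __name__ == "__main__":
    fragment = """
    Function = Search for ML topology
    Topology search = Heuristic stepwise addition
    User defined monophyletic clades = yes
    Optimise model parameters = no
    """
    for problem in check_function_fields(parse_fields(fragment)):
        print(problem)
```

The example fragment deliberately omits the clade file and the model parameters file, so both are reported. Whether mlphase itself reports such omissions, and how, is not described in this section.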

1.3.5 mcmcphase

Using mcmcphase

The mcmcphase program performs Bayesian estimation of phylogenies (see section 2.5) and uses Markov chain Monte Carlo to produce large samples from the posterior probability density.

To use mcmcphase, type at the command-line:
    mcmcphase mcmcphase-control-file
where mcmcphase-control-file is a valid control file for the mcmcphase program.

The mcmcphase program saves the results of an inference in several files. Be warned that it might require a large amount of disk space for large studies (around 90 Mb for 70 species and samples).

.besttree and .bestmodel files: the phylogeny and the substitution model parameters of the best state visited (i.e., the state with the highest likelihood). This state is not necessarily one of the sampled configurations and may have been visited during the burnin period. The best configuration is not very important in an MCMC analysis, but a strange best state is a quick indication that something went wrong. The tree and the model can also be used as starting points for maximum likelihood inference.

.mp file: the file with the sampled parameters of the substitution model(s). Each sample occupies one line. The parameters are, in order:
    the proportion of invariant sites, if an invariant category is used (+I models)
    the gamma shape parameter (α), if the discrete gamma model is used (+dgx models)
    the frequencies of the states, as they appear in the substitution matrix
    the rate ratios
When a MIXED model is used, the substitution model parameters are printed sequentially. Except for the first model, each set of parameters is preceded by the average substitution rate of the model. The average substitution rate of the first model is always 1.0 and is therefore not reported.

.samples file: the sampled topologies. This file can be used with another phylogenetic package to produce a consensus tree.
To avoid wasting disk space, mcmcphase writes the sampled topologies using an index for each species, assigned according to the order in which the species appear in the data file.

.bl file: the branch lengths for the previous topologies (for use with other PHASE programs).

.output file: a file with similar content to the screen output.

.plot file: the evolution of the likelihood during the run. Sampling of these values starts at the beginning of the run, i.e., likelihood values are stored during the burnin too.

Using consensus

The consensus program is used to exploit the large sample of states produced by mcmcphase. It cannot yet build a consensus topology by itself and requires that tree from the user. Many phylogenetic programs can build a consensus tree from the sample of topologies produced by mcmcphase in the .samples file; you can use the consense program of PHYLIP, for instance.

To use consensus, type at the command-line:
    consensus mcmcphase-control-file consensus-topology-file
where mcmcphase-control-file is the control file that was used by the mcmcphase program to produce the results and consensus-topology-file is the file that contains the consensus topology. Since mcmcphase outputs the topologies using numbers instead of the names of the species, consensus expects the consensus topology to be given with numbers too.

The consensus program retrieves the model used and the location of the sample files from the control file. Two consensus substitution models are produced, using the mean and the median values of the sample respectively. The consensus topology is used to produce a consensus tree with branch lengths. Only the branch lengths of the sampled states whose topology is identical to the consensus topology are used. For each branch, the consensus length is simply the mean of these lengths. The consensus program cannot return a consensus tree if the consensus topology has never been visited. In such a case, we suggest you use optimise to produce ML branch lengths.

Control file for mcmcphase

Please see the examples provided in appendix A.7 and A.8. In the control file of mcmcphase one can/must have:

a DATAFILE block: see the data file block section (1.2.2).

a MODEL block: see the model block section (1.2.3).

a PERTURBATION block: control block for the mixing properties of mcmcphase (see below).

Random seed: the seed for the random number generator.
    Random seed = 1

Burnin iterations: the number of burnin cycles (i.e., cycles before the beginning of the sampling). During the burnin, only likelihood values are stored.
    Burnin iterations =

Sampling iterations: the number of cycles for sampling.
    Sampling iterations =

Sampling period: the number of cycles between the extraction of two consecutive samples.
    Sampling period = 20

Random start model parameters and User's starting model parameters file: to reduce the necessary burnin time, the chain can be initialised with user-specified model parameters. Otherwise the sequences are used to initialise the substitution model.
    Random start model parameters = no
    User's starting model parameters file = data/primates-rna7a.model

Random start tree and User's starting tree file: similarly, one can choose to initialise the chain randomly or with a user-defined topology. We do not encourage the use of an initial user-defined topology, but this option can be useful to quickly gain an idea of what results can be expected.
    Random start tree = yes
    User's starting tree file = this field is ignored in this case

Output file: the basename for all the output files (basename.besttree, basename.bestmodel, basename.mp, basename.samples, basename.bl, basename.output and basename.plot).
    Output file = results/hiv-dna

Output format: the format used for the topologies in the .samples file. It can be phylip (with a semicolon at the end) or bambe (without a semicolon).
    Output format = phylip

PERTURBATION block

The PERTURBATION block contains the mixing parameters used for the proposals. The following mixing parameters are relative to the branches:

Initial branch step proposal parameter: the initial standard deviation of the normal distribution used to modify the branch lengths. This proposal parameter is modified during the burnin.
    Initial branch step proposal parameter = 0.1

Branch length upper bound: the upper bound used for the uniform prior distribution of branch lengths.
    Branch length upper bound =
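
To illustrate how the two branch parameters interact, here is a toy Metropolis update for a single branch length. It is a sketch under stated assumptions rather than the sampler implemented in mcmcphase: the proposal is a normal perturbation with standard deviation step, the prior is uniform on (0, upper_bound], and proposals falling outside that range are simply rejected. The log-likelihood function, all names and all numeric values are hypothetical placeholders, and the tuning of the step during the burnin is not modelled.

```python
import math
import random


def update_branch(length, log_lik, step, upper_bound, rng):
    """One Metropolis step on a branch length; returns the new or unchanged length."""
    proposal = length + rng.gauss(0.0, step)  # symmetric normal proposal
    # Uniform prior on (0, upper_bound]: zero density outside, so reject.
    if proposal <= 0.0 or proposal > upper_bound:
        return length
    log_ratio = log_lik(proposal) - log_lik(length)
    # Accept with probability min(1, likelihood ratio); the flat prior cancels.
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return length


def toy_log_lik(t):
    """Placeholder log-likelihood peaked at t = 0.3 (not a phylogenetic likelihood)."""
    return -((t - 0.3) ** 2) / (2 * 0.05 ** 2)


if __name__ == "__main__":
    rng = random.Random(1)  # plays the role of the Random seed field
    t = 1.0
    for _ in range(2000):
        t = update_branch(t, toy_log_lik, step=0.1, upper_bound=10.0, rng=rng)
    print(f"final branch length: {t:.3f}")
```

A step that is too small makes the chain move slowly, while one that is too large causes most proposals to be rejected, which is presumably why the proposal parameter is adjusted during the burnin.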


More information

"PRINCIPLES OF PHYLOGENETICS" Spring Lab 1: Introduction to PHYLIP

PRINCIPLES OF PHYLOGENETICS Spring Lab 1: Introduction to PHYLIP Integrative Biology 200A University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS" Spring 2008 Lab 1: Introduction to PHYLIP What s due at the end of lab, or next Tuesday in class: 1. Print out

More information

Lesson 13 Molecular Evolution

Lesson 13 Molecular Evolution Sequence Analysis Spring 2000 Dr. Richard Friedman (212)305-6901 (76901) friedman@cuccfa.ccc.columbia.edu 130BB Lesson 13 Molecular Evolution In this class we learn how to draw molecular evolutionary trees

More information

Tutorial using BEAST v2.5.0 Introduction to BEAST2 Jūlija Pečerska, Veronika Bošková and Louis du Plessis

Tutorial using BEAST v2.5.0 Introduction to BEAST2 Jūlija Pečerska, Veronika Bošková and Louis du Plessis Tutorial using BEAST v2.5.0 Introduction to BEAST2 Jūlija Pečerska, Veronika Bošková and Louis du Plessis This is a simple introductory tutorial to help you get started with using BEAST2 and its accomplices.

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

7.36/7.91/20.390/20.490/6.802/6.874 PROBLEM SET 3. Gibbs Sampler, RNA secondary structure, Protein Structure with PyRosetta, Connections (25 Points)

7.36/7.91/20.390/20.490/6.802/6.874 PROBLEM SET 3. Gibbs Sampler, RNA secondary structure, Protein Structure with PyRosetta, Connections (25 Points) 7.36/7.91/20.390/20.490/6.802/6.874 PROBLEM SET 3. Gibbs Sampler, RNA secondary structure, Protein Structure with PyRosetta, Connections (25 Points) Due: Thursday, April 3 th at noon. Python Scripts All

More information

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Laboratory 3: Detecting selection

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Laboratory 3: Detecting selection Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Laboratory 3: Detecting selection Handed out: November 28 Due: December 14 Part 2. Detecting selection likelihood

More information

TreeCollapseCL 4 Emma Hodcroft Andrew Leigh Brown Group Institute of Evolutionary Biology University of Edinburgh

TreeCollapseCL 4 Emma Hodcroft Andrew Leigh Brown Group Institute of Evolutionary Biology University of Edinburgh TreeCollapseCL 4 Emma Hodcroft Andrew Leigh Brown Group Institute of Evolutionary Biology University of Edinburgh 2011-2015 This command-line Java program takes in Nexus/Newick-style phylogenetic tree

More information

Semi-Supervised Learning with Trees

Semi-Supervised Learning with Trees Semi-Supervised Learning with Trees Charles Kemp, Thomas L. Griffiths, Sean Stromsten & Joshua B. Tenenbaum Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 0139 {ckemp,gruffydd,sean s,jbt}@mit.edu

More information

An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics

An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Practical 1: Getting started in OpenBUGS Slide 1 An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Dr. Christian Asseburg Centre for Health Economics Practical 1 Getting

More information

Introduction to Computational Phylogenetics

Introduction to Computational Phylogenetics Introduction to Computational Phylogenetics Tandy Warnow The University of Texas at Austin No Institute Given This textbook is a draft, and should not be distributed. Much of what is in this textbook appeared

More information

1 Methods for Posterior Simulation

1 Methods for Posterior Simulation 1 Methods for Posterior Simulation Let p(θ y) be the posterior. simulation. Koop presents four methods for (posterior) 1. Monte Carlo integration: draw from p(θ y). 2. Gibbs sampler: sequentially drawing

More information

Bayesian Inference of Species Trees from Multilocus Data using *BEAST

Bayesian Inference of Species Trees from Multilocus Data using *BEAST Bayesian Inference of Species Trees from Multilocus Data using *BEAST Alexei J Drummond, Walter Xie and Joseph Heled April 13, 2012 Introduction We describe a full Bayesian framework for species tree estimation.

More information

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Basic Bioinformatics Workshop, ILRI Addis Ababa, 12 December 2017 Learning Objectives understand

More information

Lab 10: Introduction to RevBayes: Phylogenetic Analysis Using Graphical Models and Markov chain Monte Carlo By Will Freyman, edited by Carrie Tribble

Lab 10: Introduction to RevBayes: Phylogenetic Analysis Using Graphical Models and Markov chain Monte Carlo By Will Freyman, edited by Carrie Tribble IB200, Spring 2018 University of California, Berkeley Lab 10: Introduction to RevBayes: Phylogenetic Analysis Using Graphical Models and Markov chain Monte Carlo By Will Freyman, edited by Carrie Tribble

More information

Package markophylo. December 31, 2015

Package markophylo. December 31, 2015 Type Package Package markophylo December 31, 2015 Title Markov Chain Models for Phylogenetic Trees Version 1.0.4 Date 2015-12-31 Author Utkarsh J. Dang and G. Brian Golding Maintainer Utkarsh J. Dang

More information

A Statistical Test for Clades in Phylogenies

A Statistical Test for Clades in Phylogenies A STATISTICAL TEST FOR CLADES A Statistical Test for Clades in Phylogenies Thurston H. Y. Dang 1, and Elchanan Mossel 2 1 Department of Electrical Engineering and Computer Sciences, University of California,

More information

ALTree: Association and Localisation tests using haplotype phylogenetic Trees. Claire Bardel, Vincent Danjean, Pierre Darlu and Emmanuelle Génin

ALTree: Association and Localisation tests using haplotype phylogenetic Trees. Claire Bardel, Vincent Danjean, Pierre Darlu and Emmanuelle Génin ALTree: Association and Localisation tests using haplotype phylogenetic Trees Claire Bardel, Vincent Danjean, Pierre Darlu and Emmanuelle Génin March 17, 2006 Contents 1 Overview of the software 3 1.1

More information

Species Trees with Relaxed Molecular Clocks Estimating per-species substitution rates using StarBEAST2

Species Trees with Relaxed Molecular Clocks Estimating per-species substitution rates using StarBEAST2 Species Trees with Relaxed Molecular Clocks Estimating per-species substitution rates using StarBEAST2 Joseph Heled, Remco Bouckaert, Walter Xie, Alexei J. Drummond and Huw A. Ogilvie 1 Background In this

More information

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst Phylogenetics Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Phylogenetics phylum = tree phylogenetics: reconstruction of evolutionary

More information

Generation of distancebased phylogenetic trees

Generation of distancebased phylogenetic trees primer for practical phylogenetic data gathering. Uconn EEB3899-007. Spring 2015 Session 12 Generation of distancebased phylogenetic trees Rafael Medina (rafael.medina.bry@gmail.com) Yang Liu (yang.liu@uconn.edu)

More information

Systematics - Bio 615

Systematics - Bio 615 The 10 Willis (The commandments as given by Willi Hennig after coming down from the Mountain) 1. Thou shalt not paraphyle. 2. Thou shalt not weight. 3. Thou shalt not publish unresolved nodes. 4. Thou

More information

STEM-hy Tutorial Workshop on Molecular Evolution 2013

STEM-hy Tutorial Workshop on Molecular Evolution 2013 STEM-hy Tutorial Workshop on Molecular Evolution 2013 Getting started: To run the examples in this tutorial, you should copy the file STEMhy tutorial 2013.zip from the /class/shared/ directory and unzip

More information

On the Optimality of the Neighbor Joining Algorithm

On the Optimality of the Neighbor Joining Algorithm On the Optimality of the Neighbor Joining Algorithm Ruriko Yoshida Dept. of Statistics University of Kentucky Joint work with K. Eickmeyer, P. Huggins, and L. Pachter www.ms.uky.edu/ ruriko Louisville

More information

Phylogeographic inference in continuous space A hands-on practical

Phylogeographic inference in continuous space A hands-on practical Phylogeographic inference in continuous space A hands-on practical This chapter provides a step-by-step tutorial for reconstructing the spatial dynamics of the West Nile virus (WNV) invasion across North

More information

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT - Swarbhanu Chatterjee. Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins

More information

MCMC Methods for data modeling

MCMC Methods for data modeling MCMC Methods for data modeling Kenneth Scerri Department of Automatic Control and Systems Engineering Introduction 1. Symposium on Data Modelling 2. Outline: a. Definition and uses of MCMC b. MCMC algorithms

More information

Workshop Practical on concatenation and model testing

Workshop Practical on concatenation and model testing Workshop Practical on concatenation and model testing Jacob L. Steenwyk & Antonis Rokas Programs that you will use: Bash, Python, Perl, Phyutility, PartitionFinder, awk To infer a putative species phylogeny

More information

PhyloBayes-MPI. A Bayesian software for phylogenetic reconstruction using mixture models. MPI version

PhyloBayes-MPI. A Bayesian software for phylogenetic reconstruction using mixture models. MPI version PhyloBayes-MPI A Bayesian software for phylogenetic reconstruction using mixture models MPI version Nicolas Lartillot, Nicolas Rodrigue, Daniel Stubbs, Jacques Richer nicolas.lartillot@umontreal.ca Version

More information

CLC Phylogeny Module User manual

CLC Phylogeny Module User manual CLC Phylogeny Module User manual User manual for Phylogeny Module 1.0 Windows, Mac OS X and Linux September 13, 2013 This software is for research purposes only. CLC bio Silkeborgvej 2 Prismet DK-8000

More information

Lab 8 Phylogenetics I: creating and analysing a data matrix

Lab 8 Phylogenetics I: creating and analysing a data matrix G44 Geobiology Fall 23 Name Lab 8 Phylogenetics I: creating and analysing a data matrix For this lab and the next you will need to download and install the Mesquite and PHYLIP packages: http://mesquiteproject.org/mesquite/mesquite.html

More information

The GLMMGibbs Package

The GLMMGibbs Package The GLMMGibbs Package April 22, 2002 Version 0.5-1 Author Jonathan Myles and David Clayton Maintainer Jonathan Myles Depends R (>= 1.0) Date 2001/22/01 Title

More information

BIOL 848: Phylogenetic Methods MrBayes Tutorial Fall, 2010

BIOL 848: Phylogenetic Methods MrBayes Tutorial Fall, 2010 MrBayes Lab Note: This computer lab exercise was written by Paul O. Lewis. Paul has graciously allowed Mark Holder to use and modify the lab for the Summer Institute in Statistical Genetics and BIOL 848.

More information

Definitions. Matt Mauldin

Definitions. Matt Mauldin Combining Data Sets Matt Mauldin Definitions Character Independence: changes in character states are independent of others Character Correlation: changes in character states occur together Character Congruence:

More information

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Parsimony-Based Approaches to Inferring Phylogenetic Trees Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 www.biostat.wisc.edu/bmi576.html Mark Craven craven@biostat.wisc.edu Fall 0 Phylogenetic tree approaches! three general types! distance:

More information

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 28 th November 2007 OUTLINE 1 INFERRING

More information

Prior Distributions on Phylogenetic Trees

Prior Distributions on Phylogenetic Trees Prior Distributions on Phylogenetic Trees Magnus Johansson Masteruppsats i matematisk statistik Master Thesis in Mathematical Statistics Masteruppsats 2011:4 Matematisk statistik Juni 2011 www.math.su.se

More information

human chimp mouse rat

human chimp mouse rat Michael rudno These notes are based on earlier notes by Tomas abak Phylogenetic Trees Phylogenetic Trees demonstrate the amoun of evolution, and order of divergence for several genomes. Phylogenetic trees

More information

Main Reference. Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taking a Random Walk through Tree Space Genetics 2005

Main Reference. Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taking a Random Walk through Tree Space Genetics 2005 Stochastic Models for Horizontal Gene Transfer Dajiang Liu Department of Statistics Main Reference Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taing a Random Wal through Tree Space

More information

Issues in MCMC use for Bayesian model fitting. Practical Considerations for WinBUGS Users

Issues in MCMC use for Bayesian model fitting. Practical Considerations for WinBUGS Users Practical Considerations for WinBUGS Users Kate Cowles, Ph.D. Department of Statistics and Actuarial Science University of Iowa 22S:138 Lecture 12 Oct. 3, 2003 Issues in MCMC use for Bayesian model fitting

More information

DesignDirector Version 1.0(E)

DesignDirector Version 1.0(E) Statistical Design Support System DesignDirector Version 1.0(E) User s Guide NHK Spring Co.,Ltd. Copyright NHK Spring Co.,Ltd. 1999 All Rights Reserved. Copyright DesignDirector is registered trademarks

More information

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course)

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course) Tree Searching We ve discussed how we rank trees Parsimony Least squares Minimum evolution alanced minimum evolution Maximum likelihood (later in the course) So we have ways of deciding what a good tree

More information

STATISTICS (STAT) Statistics (STAT) 1

STATISTICS (STAT) Statistics (STAT) 1 Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

More information

PhyloType User Manual V1.4

PhyloType User Manual V1.4 PhyloType User Manual V1.4 francois.chevenet@ird.fr www.phylotype.org Screenshot of the PhyloType Web interface: www.phylotype.org (please contact the authors by e-mail for details or technical problems,

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

Lab 8: Molecular Evolution

Lab 8: Molecular Evolution Integrative Biology 200B University of California, Berkeley, Spring 2011 "Ecology and Evolution" by NM Hallinan, updated by Nick Matzke Lab 8: Molecular Evolution There are many different features of genes

More information

ALTree: Association and Localisation tests using haplotype phylogenetic Trees

ALTree: Association and Localisation tests using haplotype phylogenetic Trees ALTree: Association and Localisation tests using haplotype phylogenetic Trees Claire Bardel, Vincent Danjean, Pierre Darlu and Emmanuelle Génin Version 1.1.0 Contents 1 Introduction 3 1.1 What s new?.....................................

More information

One report (in pdf format) addressing each of following questions.

One report (in pdf format) addressing each of following questions. MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW1: Sequence alignment and Evolution Due: 24:00 EST, Feb 15, 2016 by autolab Your goals in this assignment are to 1. Complete a genome assembler

More information

BCLUST -- A program to assess reliability of gene clusters from expression data by using consensus tree and bootstrap resampling method

BCLUST -- A program to assess reliability of gene clusters from expression data by using consensus tree and bootstrap resampling method BCLUST -- A program to assess reliability of gene clusters from expression data by using consensus tree and bootstrap resampling method Introduction This program is developed in the lab of Hongyu Zhao

More information

BayesTraits V2. September Andrew Meade Mark Pagel

BayesTraits V2. September Andrew Meade Mark Pagel BayesTraits V2 September 2014 Andrew Meade (A.Meade@Reading.ac.uk) Mark Pagel (M.Pagel@Reading.ac.uk) 1 Table of Contents Major Changes from V1... 4 Disclaimer... 4 Introduction... 5 Methods and Approach...

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information