PHASE: a Software Package for Phylogenetics And Sequence Evolution

Version 1.1, April 24, 2003

Copyright 2002, 2003 by the University of Manchester.

PHASE is distributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Howsun Jow and Vivek Gowri-Shankar
Bug reports: vivek.gowri-shankar@cs.man.ac.uk

Why is PHASE different from other phylogenetic programs?

This package is designed specifically for use with RNA sequences that have a conserved secondary structure, e.g., rRNA and tRNA. It is well known that compensatory substitutions occur in the paired regions of RNA secondary structures: substitutions occurring on one side of a pair are correlated with substitutions on the other side. Most phylogenetic programs assume that each site in a molecule evolves independently of the others, but this assumption is not valid for RNA genes. This package implements substitution models that consider pairs of sites rather than single sites, along with the standard nucleotide substitution models in current use.

When an RNA molecule with a secondary structure is used in conjunction with an RNA substitution model, PHASE requires a structure-based alignment of the sequences, with the consensus secondary structure indicated in bracket and dot notation at the top of the alignment. We assume that you can provide this structure.

It is now commonplace to perform combined analyses of heterogeneous sequence data when nucleotides with different patterns of evolution are sequenced for a set of studied species. PHASE can use several substitution models simultaneously (for paired and/or unpaired sites), for instance when analysing protein coding genes or when both stems and loops of RNA genes are used.

PHASE provides a Markov Chain Monte Carlo sampler to generate large numbers of possible phylogenetic trees with probability proportional to their likelihood. This is a Bayesian statistical method that allows posterior probabilities to be generated for alternative trees and alternative clades. These posterior probabilities provide a sound statistical measure of support for alternative phylogenetic hypotheses, and they remove the need for bootstrapping.
Where many alternative arrangements of a given set of species exist, it is possible to calculate posterior probabilities for all the alternative arrangements of these species in a convenient way. Standard Maximum Likelihood techniques for inferring the optimal tree with any of the DNA or RNA evolution models are also implemented.

The program's features include:

- Bayesian estimation of phylogenies and substitution model parameters
- standard ML search algorithms for inferring the optimal tree with optional topology constraints
- 6, 7 and 16 state RNA models
- standard 4 state DNA models
- invariant and discrete gamma model for substitution rate heterogeneity between sites
- mixing of molecular data types in a single analysis

Journal publications:

- C. Hudelot, V. Gowri-Shankar, H. Jow, M. Rattray and P. Higgs. RNA-based Phylogenetic Methods: Application to Mammalian Mitochondrial RNA Sequences. Molecular Phylogenetics and Evolution (in press, 2003).
- H. Jow, C. Hudelot, M. Rattray and P. Higgs. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Molecular Biology and Evolution, 19(9) (2002).

Acknowledgements

Howsun Jow and Vivek Gowri-Shankar carried out this work as PhD students at Manchester University under the supervision of Magnus Rattray. We gratefully acknowledge contributions to the design, documentation and testing from Paul Higgs and Cendrine Hudelot. The PHASE software was developed as part of a BBSRC funded research project into RNA-based phylogenetic methods (investigators: Paul Higgs and Magnus Rattray).

Contents

Why is PHASE different from other phylogenetic programs?
Acknowledgements
Introduction
    How to read this manual?
    Acquiring and installing the software
        MS-Windows installation
        Unix-like system installation
    Description of programs in the PHASE package
        optimise and mlphase
        mcmcphase and consensus
        likelihood
        simulate
        analyser
    Running the programs
1 Using programs in the PHASE package
    1.1 Inputs/outputs in PHASE
        Data file format
        Control file format
        Tree file format
        Substitution model parameters file format
        Parameters displayed on the screen and output of each program
        Clade file format
    1.2 Control files
        Structure of the control files
        Datafile block
        Model block
    1.3 Using the programs in the PHASE package
        likelihood
        optimise
        simulate
        mlphase
        mcmcphase
        analyser
2 Elements of phylogenetic theory
    Phylogenetic trees
        Unrooted phylogenies
        String representation of a tree
        Branch lengths
    Nucleotide substitution models
        A Markov model of substitution
        Transition matrices
        Nucleotide substitution models implemented in PHASE
    Paired-site substitution models
        RNA secondary structure
        Theory of compensatory substitutions
        Base-paired substitution models implemented in PHASE
    Refinements to substitution models
        Invariant and discrete gamma models
        The MIXED model
    Bayesian phylogenetics
        Bayes theorem
        Markov chain Monte-Carlo (MCMC)
        Priors and proposals
        Pitfalls of Markov chain Monte-Carlo techniques
A Some examples of control files
    A.1 Control file for likelihood
    A.2 Control file for optimise
    A.3 Control file for simulate (1)
    A.4 Control file for simulate (2)
    A.5 Control file for mlphase (1)
    A.6 Control file for mlphase (2)
    A.7 Control file for mcmcphase (1)
    A.8 Control file for mcmcphase (2)
Bibliography

Introduction

How to read this manual?

People with a good background in phylogenetic inference might be interested only in the first chapter, which explains how to use PHASE. The second chapter contains a few elements of the theory of phylogenetic inference, together with information about PHASE that can make technical details in the first chapter clearer. Experienced phylogeneticists might find it useful to read the RNA substitution models section (2.3) to learn about RNA substitution models, and the Bayesian phylogenetics section (2.5) if they are not familiar with Markov Chain Monte-Carlo (MCMC) techniques.

Once you have read the short description of the programs in this introduction, you can try them straightaway with the examples provided. However, be warned that inferences using the mammals dataset of 69 species, and the maximum likelihood inference with the primates (primates-rna-ml.control), require at least one day. You should start with other control files instead.

The first chapter of this manual should be used as a reference only, to clarify obscure points about the PHASE programs. The HTML version of these pages is probably more appropriate for finding useful information.

Acquiring and installing the software

PHASE is available for download for Windows and Unix/Linux platforms.

MS-Windows installation

Download the archive phase-1.1-mswin-exec.zip and decompress it into the directory of your choice, for instance c:\phase\. PHASE does not require any other installation procedure, so you can test the software straightaway with the provided example files.

Unix-like system installation

For Unix and Linux systems you are recommended to compile the program yourself. However, if the process fails and you cannot produce a working executable, you can try the precompiled Linux version in the archive phase-1.1-linux-i586-exec.tgz.
To compile the program yourself:

- decompress and extract the archive into the directory of your choice
      tar -xvvzf phase-1.1.tgz
- enter the newly created phase-1.1 directory
      cd phase-1.1
- compile with the provided Makefile
      make

We assume here that you have a recent version of the default C++ compiler g++ on your platform. You cannot compile PHASE with gcc v2.96 or older. You can check the gcc version installed on your system by typing g++ -v. You might want to (or might have to) edit the makefiles in order to adapt them to your specific system configuration. In that case please have a look at the readme file first.

PHASE uses the BLAS and LAPACK library routines. Unless your system is equipped with optimised versions of these mathematical libraries, in which case you are strongly advised to modify the makefile, generic versions will be built during the compilation process. The g77 compiler and the libg2c library are required, but they should already be present on your system.

Description of programs in the PHASE package

The PHASE (PHylogenetics and Sequence Evolution) package consists of two main programs, mlphase and mcmcphase:

- mlphase performs maximum-likelihood inference
- mcmcphase is a Bayesian phylogenetic inference program

There are five other smaller programs in the package:

- analyser checks the content of the molecular sequences
- likelihood computes the likelihood given a specified evolution model
- simulate generates sequences according to a specified evolution model
- optimise is a smaller version of mlphase without tree search capabilities
- consensus is used with mcmcphase to summarize the results of an MCMC run

Below we summarize the behaviour of these programs. Please refer to the first chapter in order to learn how to use them.

optimise and mlphase

The mlphase program is a maximum-likelihood phylogenetic inference program similar to dnaml in PHYLIP and baseml in PAML. It has a broad range of functionalities and can be used with a large number of evolutionary substitution models, including those which take into account the RNA secondary structure in the evolution of RNA sequences (see sections 2.2 and 2.3). The mlphase program has two main modes of operation:

1. Optimisation of user-defined trees: estimation of maximum likelihood (ML) branch lengths and, optionally, evolutionary model parameters, given a set of labelled molecular sequences, for a user-defined set of phylogenetic tree topologies.

2. Maximum-likelihood tree search: the program aims at finding the model (tree topology, associated branch lengths and, optionally, sequence evolution model parameters) that yields the highest likelihood. The user can choose among the three available topology search algorithms:

   - Simple exhaustive search: all the possible phylogenies are considered.
   - Branch and bound search: non-optimal phylogenies are rejected before evaluation.
   - Heuristic search via stepwise addition: greedy search for the best topology.

Constraints can be placed on the phylogenetic tree topologies that are considered during ML inference in order to reduce the search space and the computation time.
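To see why the exhaustive search is only practical for small data sets, and why topology constraints help, it is useful to count the candidate topologies. The sketch below relies on the standard result that n labelled taxa admit (2n-5)!! distinct unrooted bifurcating trees (for n of at least 3). The function name is illustrative and is not part of PHASE.

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies for n labelled taxa.

    Uses the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5),
    which is valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count
```

Ten taxa already give 2,027,025 topologies, so branch and bound, stepwise addition or clade constraints quickly become preferable to exhaustive enumeration.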

The optimise program is a simpler version of mlphase and is provided for convenience. This program returns the ML branch lengths and ML evolutionary parameters of a fixed user-defined tree topology, for instance a consensus tree found with an MCMC run. This is equivalent to the first mode of mlphase with only one tree. The optimise program requires fewer parameters than mlphase; it is simpler to use and allows quick experiments with different initial parameters when entrapment in a local maximum of the likelihood is suspected.

mcmcphase and consensus

The mcmcphase program performs Bayesian phylogenetic inference. It uses a Markov Chain Monte Carlo algorithm to sample from the posterior probability distribution of phylogenetic tree topology, branch lengths and sequence evolution model parameters. For an explanation of Bayesian phylogenetics and a description of the MCMC sampling algorithms used in mcmcphase, please consult the Bayesian phylogenetics section (section 2.5) or Jow et al. (2002).

The consensus program is used to exploit the results of an MCMC run. This program produces two consensus models (using the mean and the median of the parameters in the sample) and can return consensus branch lengths for any supplied topology, e.g., a PHYLIP-style consensus tree, if similar topologies were sampled during the run.

likelihood

The likelihood program computes the likelihood of a phylogeny with respect to any implemented substitution model.

simulate

The simulate program generates molecular sequences according to a user-specified tree (topology and branch lengths) and substitution model (type and parameters).

analyser

At the moment analyser outputs basic statistics about a sequence data file. It can also be used to locate the sites with too many gaps, in case you decide to remove them. The analyser program can be quite useful to validate your secondary structure alignment and to set a maximum limit for the mismatch frequency at each site (see section 2.3.1).
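The idea of locating sites with too many gaps can be illustrated with a few lines of Python. This is a minimal sketch of the concept only: the function names and the 0.5 threshold are assumptions made for the example, and the code does not reproduce what analyser actually computes or prints.

```python
def gap_fraction_per_site(sequences):
    """Return, for each alignment column, the fraction of sequences with a gap ('-')."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n = len(sequences)
    return [sum(s[i] == "-" for s in sequences) / n for i in range(length)]


def gappy_sites(sequences, max_gap_fraction=0.5):
    """Return the 0-based indices of columns whose gap fraction exceeds the threshold."""
    fractions = gap_fraction_per_site(sequences)
    return [i for i, f in enumerate(fractions) if f > max_gap_fraction]
```

Called on the three aligned sequences AC-G, A--G and ACUG, gappy_sites reports only the third column (index 2), where two sequences out of three have a gap.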
Running the programs

Programs in the PHASE package are run from the command line under both Unix-like systems and MS-Windows systems. On Windows operating systems, you have to open an MS-DOS command window to use them. Click on Run... in the Start... menu and type cmd in the newly opened dialog box. You might have to type command instead of cmd depending on your MS-Windows version. Once the command window is open, move to the directory where you extracted the software. At the prompt you can type, for example, cd c:\phase\. You can then run any program of the PHASE package.

Run the programs by typing their name followed by the arguments they require. In most cases, PHASE's programs take one argument, which is the name of a control file (see section 1.2). You can type for instance:

    mcmcphase control\hiv-dna-mcmc-1.control

or:

    optimise control/primates-rna-optimise-7a.control

After installation, if the examples are all present, these commands should work.

Note: the use of the \ or / character depends on your operating system. On Unix systems you might have to type ./ before the program name.
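When the programs are launched from a script rather than typed by hand, the platform differences mentioned above (path separator, optional ./ prefix) can be handled in one place. The following is a minimal sketch; the helper name and its arguments are hypothetical, and it assumes the PHASE executables sit in the current directory on Unix-like systems.

```python
import ntpath
import os
import posixpath


def build_phase_command(program, control_dir, control_file, windows=None):
    """Build the argument list for a PHASE program and its control file.

    'windows' defaults to the platform the script runs on; pass True or
    False explicitly to force one convention.
    """
    if windows is None:
        windows = os.name == "nt"
    if windows:
        # Windows: backslash separator, program found in the current directory.
        return [program, ntpath.join(control_dir, control_file)]
    # Unix-like: forward slash separator and an explicit ./ prefix.
    return ["./" + program, posixpath.join(control_dir, control_file)]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory where the software was extracted.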

1 - Using programs in the PHASE package

1.1 Inputs/outputs in PHASE

Data file format

All molecular sequence data used by the PHASE programs are stored in a common format. The data file format is similar to the PHYLIP data file format, with a few minor modifications. A data file is divided into four sections, two of which are optional. Comments can be included by preceding the commented lines with a hash (#) symbol. The entire commented line is ignored by the program. Taking a look at the example data files in the package (*.dna, *.rna, and *.mix in the data directory) will make the following explanations easier to understand.

File content

The first non-comment section of the data file is a single line containing:

1. the number of species
2. the length of the molecular sequences
3. a code, which can be either DNA for usual unpaired molecular sequences or RNA for base-paired molecular sequences. The actual purpose of this code is to indicate whether a pairing mask (see below) is present in the data file, and you can use the code RNA even if some nucleotides are unpaired.

For example, the line

    5 100 DNA

at the beginning of a data file indicates that there are five non-base-paired sequences of length 100 in the file. For convenience a third code, MIXED, can be used instead of RNA when the data are a concatenation of RNA loops and stems (see section 2.3.1), but it should be avoided in other cases. More details on the specific meaning of the code MIXED are given in the class section below. The lines

    10 300 RNA

and

    10 300 MIXED

both indicate that there are ten sequences of length 300 in the file and that a pairing mask is associated with them.

Pairing mask

The second section of the data file is the pairing mask. This mask is only required when the sequences contain some base-paired nucleotides (in that case the code should be RNA or MIXED). For fully unpaired sequences, i.e., when the DNA code is used, the pairing mask must not be provided.
The pairing mask takes the form of a mathematical expression made of round brackets. Corresponding brackets indicate that the bases at those positions in the sequence form a base-pair in the RNA secondary structure. Unpaired sites can be indicated with a dot (.) or a hyphen (-). For example, a sequence ACCAGAUGGU with a pairing mask (((.(.)))) is made of the base-pairs AU, CG, CG and GU and of two unpaired sites, A and A.
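The bracket matching described above can be sketched with a simple stack. This is an illustrative reading of the mask notation, not the parser used inside PHASE; the function name and the 0-based indexing are choices made for the example.

```python
def parse_pairing_mask(mask):
    """Return (pairs, unpaired) as 0-based site indices for a bracket/dot mask.

    '(' and ')' delimit base-pairs; '.' and '-' mark unpaired sites.
    """
    stack, pairs, unpaired = [], [], []
    for i, ch in enumerate(mask):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((stack.pop(), i))
        elif ch in ".-":
            unpaired.append(i)
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs), unpaired
```

Applied to the example above, the mask (((.(.)))) yields the pairs (0, 9), (1, 8), (2, 7) and (4, 6) and the unpaired sites 3 and 5, which correspond to AU, CG, CG, GU and A, A in the sequence ACCAGAUGGU.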

Molecular sequences

The third section of the data file contains the molecular sequences. Indels (-) and ambiguities (purine (R), pyrimidine (Y), unknown (N or ?)) are allowed. Sequences can be written in one of two formats. The first is the non-interleaved format, which consists of an identifying label for each sequence followed by the whole sequence. An example is:

    2 16 DNA
    Mouse ACCGUGGU UCCAUAAA
    Rat   ACUGUGGC UCGAUAUA

There can be no spaces in the label, though the sequence itself can be formatted into blocks using multiple lines and spaces.

The alternative is the interleaved format, which enables the sequences to be split into homologous blocks. The non-interleaved example given above could equivalently be written:

    2 16 DNA
    Mouse ACCG
    Rat   ACUG

    UGGUUCCAUAAA
    UGGCUCGAUAUA

Notice that only the first interleaved block should contain labels. Subsequent interleaved blocks are assumed to have the same labels and to be in the same order.

Class section

The fourth section is optional and is used when performing a combined analysis of heterogeneous data sets (e.g., loops and stems of an RNA molecule, protein coding genes with three codon positions, or concatenated data of different genes with different evolutionary patterns). You can safely skip this section if you plan to study DNA sequences or RNA helices only (i.e., no dot in the pairing mask) with a single appropriate nucleotide or base-pair substitution model. The aim of this section is to assign each nucleotide or pair to a class. Each class is expected to have a different pattern of evolution. This section consists of a sequence of integers which correspond to the class of each nucleotide.
For instance, the class section of a protein coding gene may assign each of the three codon positions to its own class. When the data file contains a class section, programs in the PHASE package expect it to comply with the following set of rules:

- class labels are separated by a space
- classes are labelled from 1 to K, where K is the number of distinct classes
- the number of labels equals the length of the sequences
- when used in conjunction with a base-paired structure, the two components of a paired site are in the same class.

Since PHASE is specifically designed for the analysis of RNA sequences with secondary structure, the most common use of the class section should be the obvious separation of unpaired and base-paired sites into two distinct classes. The code MIXED can replace the code RNA to avoid this tiresome task: it lets PHASE know that it can simply use the provided pairing mask (e.g., (((.())))..) to build the class section. When the code MIXED is used the class section is optional, and the unpaired and paired sites are automatically attributed to classes 1 and 2 respectively. (When the class section is present, the code MIXED is equivalent to RNA: the user's assignment prevails over the automatic one.)

Usually classes are used to determine the model of sequence evolution that PHASE uses with each nucleotide. Each class in the data file is treated by its own model of nucleotide substitution during the

phylogenetic inference. The models are defined later in the model section of the control file (see 1.2.3). Let us just point out here that if you use the MIXED type for your data with the automatic assignment, i.e., without the class section, you have to make sure that your first and second models are respectively a nucleotide substitution model and a base-pair substitution model when you declare your models of evolution. We will return to this point later on.

Control file format

Most programs in the package use a control file. The purpose of this file is to assign a specific task to the program, i.e., the sequences to analyse, the assumed substitution model, and other specific parameters. Control files are the key to using the software and two sections are devoted to them. Section 1.2 describes the structure of this file and the features common to many programs in the package. Section 1.3 presents the specific parameters for each program.

Tree file format

PHASE can output trees into a file, and sometimes the user has to provide a file which contains one or more trees. A tree file is simply a file with one or more phylogenies written in the computer readable format described in the tree representation section (2.1.2).

Substitution model parameters file format

With a model parameters file, one can provide initial values for the parameters of the substitution models used. PHASE can also create a model parameters file to store the results concerning a substitution model after a run (these could be Maximum Likelihood Estimate (MLE) parameters or Mean Posterior Estimate (MPE) parameters).

Model parameters file content

The content of this file is highly dependent on the substitution model used and we cannot describe it in general terms. The fields used to assign a value to each parameter should be fairly self-explanatory as long as you know the underlying substitution model.
You might need to have a look at the transition matrices section (2.2.2) to understand the PHASE concept of rate ratios in substitution models. Each "Rate ratio i" parameter in this file stands for the parameter α_i in the transition matrix of the corresponding model. Transition matrices for all implemented substitution models are given in the section on nucleotide substitution models implemented in PHASE for DNA models, and in the section on base-paired substitution models implemented in PHASE for RNA models.

Producing a model parameters file

Model parameters files and control files share the same structural elements. Some examples can be found in the data directory (*.model). Although the content of a model parameters file is easy to understand when reading it, you might find it harder to produce your own file from scratch if you want to initialise a substitution model with specific values. It is possible to use the simulate program to generate a stub of this file for each model implemented in the PHASE package. This skeleton can be modified easily to suit your needs. See the description of simulate in section 1.3 for details.

Parameters displayed on the screen and output of each program

Each program in the package outputs information on the screen, and writes one or more files to store the results permanently. The outputs are reviewed individually for each program in section 1.3. The content displayed on the screen is usually quite easy to understand, but you might be a bit confused by the parameters of substitution models. PHASE outputs two kinds of matrices on the screen:

- one rate ratios matrix R

- one transition matrix Q

These matrices are described in the transition matrices section (2.2.2). Other parameters have a straightforward meaning.

Clade file format

The user is allowed to specify some invariant clades to reduce the number of possible topologies when using mlphase. A clade file contains a list of monophyletic clades in Newick format (see section 2.1.2). All studied species must appear once (and only once) in the file, either alone or in a clade. Here is a simple clade file example for 6 species:

    (Specie5,Specie6);
    (Specie4,(Specie3,Specie2));
    Specie1;

1.2 Control files

Most programs in the PHASE package have their options set using a simple text file. We call this file the control file. Although the content of this file may differ for each program in the package, its structure remains the same. Some control files are provided as examples with the package (*.control in the control directory). The easiest and safest way to use PHASE is to copy one of these examples and to adapt it to your needs.

Structure of the control files

A control file contains logical blocks (e.g., DATAFILE block, MODEL block, ...) and control lines. Lines preceded by a hash (#) symbol are considered comments and ignored. Comments can be placed anywhere. A control line defines a parameter and gives it a value. It has the format:

    label = value

The order in which control lines are provided in the control file is not important, but they must appear in the right block. Note that PHASE is case sensitive: Tree file and Tree File are two different labels. At the moment no warning is issued if the user mistypes an optional parameter. Please check your control files against the provided examples, otherwise PHASE might miss some important parameters without you noticing it.

A block is a container. It contains control lines but can also contain other blocks.
The block BLOCKNAME begins with the tag:

    {BLOCKNAME}

and ends with the tag:

    {\BLOCKNAME}

Each tag must be alone on its line. By convention the names of blocks are all uppercase.

In the remainder of this document, parameters of the control files are colored depending on their status. Compulsory parameters are in red and you must provide a value for them. Optional parameters are in green and they do not need to appear in the control file. Often, a default value will be assumed for optional parameters. Some fields are dependent on the presence and/or values of other parameters, and their presence (or absence) is compulsory under certain conditions. These conditional parameters are in orange.

Datafile block

Almost all programs in the PHASE package require a DATAFILE block to parse the analysed sequences. As stated previously, the DATAFILE block begins with the tag {DATAFILE} alone on a line and ends with the tag {\DATAFILE} alone on a line. The DATAFILE block contains some necessary

information which is not included in the data file itself (see section 1.1.1 for the format of this file); it contains the following control lines:

- Data file: the location of the molecular sequences file to be used.
      Data file = data/sequences.dna
- Interleaved data file: a yes/no option that specifies whether the molecular data is interleaved.
      Interleaved data file = yes
- Outgroup: the label of the outgroup sequence (see section 2.1.1). The inference techniques used in PHASE produce unrooted phylogenies and using an outgroup in your study is not required. However, PHASE requires this parameter to produce a unique Newick representation (2.1.2) for unrooted trees.
      Outgroup = Mole
- Heterogeneous data models: a yes/no parameter which specifies whether the data file contains a class section. The default value is no, and the class section of your data file will be ignored if you forget this field.
      Heterogeneous data models = yes

Model block

Most programs in the PHASE package require the specification of a substitution model for sequence evolution. This is the purpose of the MODEL block. The MODEL block is delimited by the {MODEL} and {\MODEL} tags. It contains the name of the substitution model followed by parameters (and sometimes blocks) specific to the model (see section 2.2 for background information on substitution models of nucleotide evolution).

Simple substitution model

Depending on the data to be analysed, the PHASE package can be used with a wide variety of DNA substitution models or RNA-specific base-paired models (see the sections on nucleotide and base-paired substitution models implemented in PHASE for a review of these models). The content of the MODEL block is the same for all these models and the parameters are:

- Model: the model's name; by convention it should be all upper case.
      Model = REV
  Nucleotide substitution models implemented include JC69, K80, HKY85, TN93 and REV. Base-paired substitution models implemented include RNA6A, RNA6B, RNA7A, RNA7D, RNA16A.
- Discrete gamma distribution of rates: the discrete gamma model (see section 2.4.1) can be used to account for among-site rate variation. Use yes/no values to turn this option on/off. When a discrete gamma model is used, PHASE expects the number of gamma categories to be specified. By default the discrete gamma model is not used.
      Discrete gamma distribution of rates = yes
- Number of gamma categories: when the discrete gamma model is used, you have to provide an integer to specify the desired number of discrete gamma categories.
      Number of gamma categories = 5
- Invariant sites: alternatively, or in conjunction with the discrete gamma model, the user can allow a proportion of sites to be invariant, i.e., with zero rate of evolution. The default value is no.
      Invariant sites = yes

Mixed model for combined analyses of heterogeneous data

To study heterogeneous sequences several models are required. The mixed model (see section 2.4.2) allows these models to work concurrently.

- Model: this field contains the name of the model, which is MIXED.
      Model = MIXED
- Number of models: the number of models used concurrently. If a class section was provided with the data file, then the number of models should be the same as the number of classes. If you used the flag MIXED in your data file and did not provide a class section, then this parameter has to be set to 2 and the two models must be a DNA substitution model and a base-paired substitution model respectively.
      Number of models = 3
- {MODELi} block: each model used in the mixed model must be defined in its own block. If the number of models is n, then the MODEL block must contain n blocks whose names are MODEL1, MODEL2, ..., MODELn. The content of these blocks is the same as for a simple substitution model block.

      {MODEL}
      Model = MIXED
      Number of models = 3
      {MODEL1}
      Model = REV
      Invariant sites = yes
      {\MODEL1}
      {MODEL2}
      Model = RNA7A
      Discrete gamma distribution of rates = yes
      Number of gamma categories = 5
      {\MODEL2}
      {MODEL3}
      Model = RNA7D
      Invariant sites = no
      Discrete gamma distribution of rates = no
      {\MODEL3}
      {\MODEL}

1.3 Using the programs in the PHASE package

Each program in the PHASE package requires a specific control file, the content of which is described here. As in the previous section, compulsory parameters appear in red, optional parameters in green, and conditional parameters that depend on the others in orange.

likelihood

Using likelihood

The likelihood program is used to compute the likelihood of a model of evolution (i.e., tree + parameterised substitution model) given a set of studied sequences. To use likelihood, one has to provide a phylogeny for the taxa under investigation (i.e., topology and branch lengths) and a substitution model for nucleotide evolution with user-defined parameters. To use likelihood, type at the command line:

    likelihood likelihood-control-file

where likelihood-control-file is a valid control file for the likelihood program.
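Because every program reads its settings from a control file, it can be handy to check a file's structure before a long run, all the more so since PHASE does not warn about mistyped optional labels. The sketch below follows the rules given in section 1.2 (hash comments, label = value lines, {NAME} and {\NAME} tags alone on a line, nested blocks). The function names are hypothetical, and this is not the parser used by PHASE itself.

```python
def parse_control(text):
    """Parse control-file text into nested dicts.

    Blocks become dicts keyed by block name; control lines become strings.
    Labels are kept exactly as written, since PHASE is case sensitive.
    """
    root = {}
    stack = [("", root)]  # (block name, dict currently being filled)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("{\\") and line.endswith("}"):
            name = line[2:-1]
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError(f"unexpected closing tag {line}")
            stack.pop()
        elif line.startswith("{") and line.endswith("}"):
            name = line[1:-1]
            block = {}
            stack[-1][1][name] = block
            stack.append((name, block))
        elif "=" in line:
            label, value = line.split("=", 1)
            stack[-1][1][label.strip()] = value.strip()
        else:
            raise ValueError(f"cannot parse line: {line}")
    if len(stack) != 1:
        raise ValueError(f"block {stack[-1][0]} is not closed")
    return root


def check_mixed_model(model):
    """Check that a MIXED MODEL block defines MODEL1..MODELn for its declared n."""
    if model.get("Model") != "MIXED":
        return True
    n = int(model["Number of models"])
    return all(f"MODEL{i}" in model for i in range(1, n + 1))
```

A typical use would be cfg = parse_control(open(path).read()), followed by check_mixed_model(cfg["MODEL"]) when the file contains a MODEL block.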
For verification purposes, likelihood prints the phylogenetic tree it used on the screen before the likelihood value. Unlike most other PHASE programs, likelihood does not send any results to a file.

Control file for likelihood

An example of a valid control file for likelihood can be found in appendix A.1. In its control file, the likelihood program requires the specification of:

- a DATAFILE block: see the data file block section (1.2.2).

- a MODEL block: see the model block section (1.2.3).
- Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (section 2.1.2), with branch length values.
      Tree file = data/mammals-consensus.tree
- Model parameters file: the name of the file containing parameter values for the model defined in the MODEL block above. The simulate program can help you to produce this file.
      Model parameters file = data/mammals-consensus.model

optimise

Using optimise

The optimise program is used to compute maximum-likelihood estimates (MLE) for the branch lengths and substitution model parameters of a given model of evolution (i.e., a fixed tree topology and a specified substitution model with free parameters). One can specify initial values for branch lengths and substitution model parameters to speed up the convergence or to detect trapping in local maxima of the likelihood function. To use optimise, type at the command line:

    optimise optimise-control-file

where optimise-control-file is a valid control file for the optimise program. When launched, optimise displays the initial tree and the initial likelihood on the screen and begins the optimisation. Once it is finished, the ML substitution model parameters are printed on the screen and saved in the .output file together with the ML tree and the value of the maximum likelihood. The ML tree is also saved in the .tree file, and a .model file (see section 1.1.4) is created to store the MLE for the substitution model parameters.

Control file for optimise

An example of a valid control file for optimise can be found in appendix A.2. The control file of the optimise program must or may provide:

- a DATAFILE block: see the data file block section (1.2.2).
- a MODEL block: see the model block section (1.2.3).
- Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (see section 2.1.2) with optional initial branch length values.
    Tree file = mammals69-mix-consensus.tree

Random seed: the integer value provided with this field is used to initialise the random number generator (used to draw random initial branch lengths if they are not provided).

    Random seed = 1

Starting model parameters file: the name of a file containing initial values for the parameters of the substitution model used. If this field is not provided, the analysed sequences are used to initialise the model.

    Starting model parameters file = data/hiv.model

Output file: the basename for the three files basename.tree, basename.model and basename.output. They contain the results generated by optimise.

    Output file = mammals69-mix-optimise

simulate

Using simulate

Simulate is used:

1. to generate examples of .model files for all the substitution models implemented in PHASE. A .model file (see section 1.1.4) is used to provide initial or fixed values for the model parameters to some programs in the package.
2. to generate molecular sequences which evolved from a random initial one according to a specified model of evolution, i.e., phylogeny and substitution model.

To use simulate, type at the command line:

    simulate simulate-control-file

where simulate-control-file is a valid control file for the simulate program.

In its first mode of operation, simulate creates a single .model file, which you can modify with your own initial values.

In its second mode of operation, simulate displays on screen the tree used to generate the sequences. This tree is either provided by the user or randomly created by the program; in the latter case the tree is saved in a file specified by the user. Finally, the likelihood of the generated molecular sequences given the model is printed on the screen and simulate saves the sequences in a file specified by the user. The format of this file is described in the data file format section (1.1.1). If the MIXED model described in section is used, heterogeneous sequences are generated in sequential order.

Control file for simulate

Appendices A.3 and A.4 provide example control files for the first and the second mode of operation respectively. The control file of the simulate program must provide:

a MODEL block: see the model block section (1.2.3).

Retrieve the name of the model's parameters: a boolean field to specify the user's aim. Use yes for the first mode of usage mentioned above and no for the second mode.

    Retrieve the name of the model's parameters = no

Model parameters file: if simulate is used to generate an example of a substitution model parameters file, the parameters are saved in a file having the name provided.
When simulate is used to generate sequences, the user must provide parameters for the substitution model, and they are read from the given file.

    Model parameters file = simulate.model

The following fields may be required when simulate is used to generate sequences.

Random seed: the integer value provided with this field is used to initialise the random number generator.

    Random seed = 1

Random tree and Tree file: simulate can either generate a random tree or use a supplied phylogeny. If Random tree is equal to yes then simulate generates a random tree and saves it in the specified file. If Random tree is equal to no then simulate parses the user tree from the specified file.

    Random tree = no
    Tree file = 8-species.tree

Number of species and Maximum branch length: when the Random tree field is set to yes, the user must provide the number of species and the maximum value for branch lengths in the generated tree.

    Number of species = 10
    Maximum branch length = .4

Number of symbols from class i: you have to specify the number of symbols (e.g., number of nucleotides or number of paired sites) you want to generate for each class in your final sequence.

    Number of symbols from class 1 = 100
    Number of symbols from class 2 = 100
    Number of symbols from class 3 = 100
    Number of symbols from class 4 = 500
    Number of symbols from class 5 =

Structure for the elements of class i: simulate can add a structure to the generated data file, in which case you have to specify the appropriate structure for the elements of each class.

    Structure for the elements of class 1 = .
    Structure for the elements of class 2 = .
    Structure for the elements of class 3 = .
    Structure for the elements of class 4 = .
    Structure for the elements of class 5 = ()

Data file type and Total length of the raw sequences: simulate produces an input file following the format defined in the data file format section (1.1.1). To produce this file, you have to specify the type and the length written in its first line (see section 1.1.1). With the 5 classes described above:

    Data file type = RNA
    Total length of the raw sequences = 1400 #( *2)

Output file: the name of the file where generated sequences are saved.

    Output file = simulated-data/codons and rna.sequences

mlphase

Using mlphase

The mlphase program can be used:

1. to find the Maximum Likelihood Estimates for branch lengths and, optionally, evolutionary model parameters for a user-defined set of topologies.
2. to find the phylogeny and, optionally, evolutionary model parameters that yield the maximum likelihood.

Three algorithms are provided for topology search:

    Simple exhaustive search
    Branch-and-bound exhaustive search
    Heuristic stepwise addition

In the first mode of operation, mlphase operates like optimise but several trees can be considered at once. In the second mode of operation, when mlphase performs a branch-and-bound search or an exhaustive search, the ten phylogenies (and associated substitution model parameters) with the highest likelihood are returned. These two search algorithms return the best tree unless they become trapped in local optima during the optimisation process. The heuristic stepwise addition returns only one tree. It is less likely to find the optimal tree but it is computationally feasible with a larger number of taxa.
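To give a sense of why the exhaustive searches are limited to small data sets, the sketch below counts the unrooted, fully bifurcating topologies for a given number of taxa using the standard (2n-5)!! formula. This is a general combinatorial result rather than anything taken from PHASE itself, and the function name is chosen here purely for illustration.

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees with n_taxa labelled leaves."""
    if n_taxa < 3:
        return 1  # with fewer than 3 taxa there is only one possible tree
    count = 1
    # The k-th taxon can be attached to any of the (2k - 5) branches
    # of a tree that already holds k - 1 taxa.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, count_unrooted_topologies(n))
```

With 10 taxa the count is already 2,027,025, which is why a heuristic such as stepwise addition becomes attractive as the number of taxa grows.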
Be warned that the optimiser might occasionally crash unexpectedly; changing the initial values can overcome that (hopefully rare) problem.

To reduce the search space and the computation time, constraints can be placed on the phylogenetic tree topologies considered during ML inference. With a clade file (see section 1.1.6) one can specify invariant monophyletic clade topologies which should be preserved during phylogenetic inference. The program will look for an optimal topology consistent with these clade arrangements.

To use mlphase, type at the command line:

    mlphase mlphase-control-file

where mlphase-control-file is a valid control file for mlphase. The mlphase program saves the results of an inference in a single file. Results are also displayed on screen during the run.

Control file for mlphase

Please see the examples in appendix A.5 and A.6. These control files show the two main modes of operation. The control file of the mlphase program contains:

a DATAFILE block: see the data file block section (1.2.2).

a MODEL block: see the model block section (1.2.3).

a FUNCTION block dependent on the operating mode of mlphase (see below).

Random seed: the seed for the random number generator.

    Random seed = 13

Output file: the name of the file where the results are sent.

    Output file = results/hiv-mlphase.output

The FUNCTION block contains specific parameters according to the mode of operation. At the moment, mlphase can Optimise user-defined phylogenetic trees or Search for ML topology.

When the user wants to optimise a set of defined trees, the FUNCTION block contains the following fields:

Function: the parameter to specify the mode of operation.

    Function = Optimise user-defined phylogenetic trees

Trees file: the name of the file containing the phylogenies, i.e., a set of trees in the Newick format (section 2.1.2) with optional initial branch length values.

    Trees file = primates.phylogenies

Number of trees: the user has to specify the number of trees in the previous file.

    Number of trees = 4

Optimise model parameters: set this field to no if the model parameters are to be considered fixed; set it to yes if you want to optimise them.

    Optimise model parameters = no

User's model parameters file: if the parameters are fixed, one must provide values for them. This field gives the name of the file containing the parameters for the model defined in the MODEL block. If the file is provided when it is not required, its content is used to initialise the parameters of the model before optimisation.

    User's model parameters file = data/hiv-rev.model

When looking for the ML tree, the FUNCTION block contains:

Function: the parameter to specify the mode of operation.

    Function = Search for ML topology

Topology search: this field specifies the search algorithm used to determine the phylogenies with the highest likelihood. At the moment the search algorithms implemented are Simple exhaustive search, Branch-and-bound exhaustive search and Heuristic stepwise addition.
    Topology search = Heuristic stepwise addition

User defined monophyletic clades and Clade file: set the first field to yes if you want to constrain the search in the topology space. The second field is the name of your clade file (see section 1.1.6).

    User defined monophyletic clades = yes
    Clade file = primates.clades

Optimise model parameters: set this field to no if the model parameters are to be considered fixed; set it to yes if you want to optimise them.

    Optimise model parameters = yes

User's model parameters file: if the parameters are fixed, one must provide values for them. This field gives the name of the file containing the parameters for the model defined in the MODEL block.

    User's model parameters file = data/primates-rna7a.model
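The dependency between Optimise model parameters and the user's model parameters file can be checked before launching a long run. The sketch below is a hypothetical helper, not part of PHASE. It assumes that fields appear as plain `name = value` lines as in the examples above, that `#` starts a comment, and that the field name is written with an apostrophe (`User's model parameters file`). It tests only the one rule stated in the text: fixed parameters require a parameters file.

```python
def parse_fields(text: str) -> dict:
    """Collect `name = value` pairs from control-file text (assumed layout)."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if "=" not in line:
            continue  # block markers and blank lines are skipped
        name, value = line.split("=", 1)
        fields[name.strip()] = value.strip()
    return fields


def missing_model_file(fields: dict) -> bool:
    """True when parameters are fixed but no parameters file is given."""
    fixed = fields.get("Optimise model parameters", "").lower() == "no"
    has_file = bool(fields.get("User's model parameters file"))
    return fixed and not has_file


example = """
Function = Optimise user-defined phylogenetic trees
Trees file = primates.phylogenies
Number of trees = 4
Optimise model parameters = no
User's model parameters file = data/hiv-rev.model
"""

if __name__ == "__main__":
    print(missing_model_file(parse_fields(example)))  # False
```

Keeping the parser and the rule separate makes it easy to add further checks, for example that Clade file is present when User defined monophyletic clades is set to yes.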

1.3.5 mcmcphase

Using mcmcphase

The mcmcphase program performs Bayesian estimation of phylogenies (see section 2.5) and uses Markov chain Monte Carlo to produce large samples from the posterior probability density.

To use mcmcphase, simply type at the command line:

    mcmcphase mcmcphase-control-file

where mcmcphase-control-file is a valid control file for the mcmcphase program.

The mcmcphase program saves the results of an inference in several files. Be warned that it might require a large amount of disk space for large studies (around 90 Mb for 70 species and samples).

.besttree and .bestmodel files: the phylogeny and the substitution model parameters of the best state visited (i.e., the state with the highest likelihood). This state is not necessarily one of the sampled configurations and might have been visited during the burnin period. The best configuration is not very important in an MCMC analysis, but a strange best state quickly indicates that something went wrong. The tree and the model can also be used as starting points in maximum likelihood inference.

.mp file: the file with the sampled parameters of the substitution model(s). Each sample occupies one line. The parameters are, in order:

    the proportion of invariant sites if an invariant category is used (+I models)
    the gamma shape parameter (α) if the discrete gamma model is used (+dgx models)
    the frequencies of the states as they appear in the substitution matrix
    the rate ratios

When a MIXED model is used, substitution model parameters are printed sequentially. Except for the first model, each set of parameters is preceded by the average substitution rate of the model. The average substitution rate for the first model is always 1.0 and therefore this value is not reported.

.samples file: the sampled topologies. This file can be used with another phylogenetic package to produce a consensus tree.
To avoid wasting disk space, mcmcphase writes the sampled topologies using an index for each species, assigned according to their order of appearance in the data file.

.bl file: the branch lengths for the previous topologies (for use with other PHASE programs).

.output file: a file with similar content to the screen output.

.plot file: the evolution of the likelihood during the run. Sampling of these values starts at the beginning of the run, i.e., likelihood values are stored during the burnin too.

Using consensus

The consensus program is used to exploit the large sample of states produced by mcmcphase. The program still lacks the ability to produce a consensus tree by itself and requires that tree from the user. Many phylogenetic programs can build a consensus tree from the sample of topologies produced by mcmcphase in the .samples file. You can use the consense program of PHYLIP, for instance.

To use consensus, simply type at the command line:

    consensus mcmcphase-control-file consensus-topology-file

where mcmcphase-control-file is the control file that was used by the mcmcphase program to produce the results and consensus-topology-file is the file which contains the consensus topology. Since mcmcphase outputs the topologies using numbers instead of the names of the species, consensus expects the consensus topology to be given with numbers too.

The consensus program retrieves the model used and the location of the sample files from the control file. Two consensus substitution models are produced, using respectively the mean and median values of the sample. The consensus topology is used to produce a consensus tree with branch lengths. The branch lengths of the states whose topology is identical to the consensus topology are used. For each branch, the consensus length is simply the mean value of all the lengths. The consensus program cannot return a consensus tree if the consensus topology has never been visited. In such a case, we suggest you use optimise to produce ML branch lengths.

Control file for mcmcphase

Please see the examples provided in appendix A.7 and A.8. In the control file of mcmcphase one can/must have:

a DATAFILE block: see the data file block section (1.2.2).

a MODEL block: see the model block section (1.2.3).

a PERTURBATION block: control block for the mixing properties of mcmcphase (see below).

Random seed: the seed for the random number generator.

    Random seed = 1

Burnin iterations: the number of burnin cycles (i.e., cycles before the beginning of the sampling). During the burnin, only likelihood values are stored.

    Burnin iterations =

Sampling iterations: the number of cycles for sampling.

    Sampling iterations =

Sampling period: the number of cycles between extraction of two consecutive samples.

    Sampling period = 20

Random start model parameters and User's starting model parameters file: to reduce the necessary burnin time, the chain can be initialised with some user-specified model parameters. Otherwise the sequences are used to initialise the substitution model.

    Random start model parameters = no
    User's starting model parameters file = data/primates-rna7a.model

Random start tree and User's starting tree file: similarly, one can choose to initialise the chain randomly or with a user-defined topology. We do not encourage the use of an initial user-defined topology, but this option can be useful to quickly gain an idea of what results can be expected.
    Random start tree = yes
    User's starting tree file = this field is ignored in this case

Output file: the basename for all the output files (basename.besttree, basename.bestmodel, basename.mp, basename.samples, basename.bl, basename.output and basename.plot).

    Output file = results/hiv-dna

Output format: the format used for the topologies in the .samples file; it can be phylip (with a semicolon at the end) or bambe (without semicolon).

    Output format = phylip

PERTURBATION block

The PERTURBATION block contains the mixing parameters used for the proposals. The following mixing parameters are relative to the branches:

Initial branch step proposal parameter: the initial standard deviation of the normal distribution used to modify the branch lengths. This proposal parameter is modified during the burnin.

    Initial branch step proposal parameter = 0.1

Branch length upper bound: the upper bound used for the uniform prior distribution of branch lengths.

    Branch length upper bound =
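To illustrate how the branch step proposal parameter and the branch length upper bound interact, the sketch below applies one Metropolis update to a single branch length: a normal perturbation with standard deviation `step`, rejected outright when the proposal falls outside the support of the uniform prior. It is a generic textbook sketch and not PHASE's actual code. The names `update_branch`, `step`, `upper_bound` and `toy_log_likelihood` are hypothetical, the toy likelihood is invented for the demonstration, the lower bound of 0 and the value 10.0 for the upper bound are assumptions, and the adjustment of the step during the burnin is not modelled.

```python
import math
import random


def update_branch(length, log_likelihood, step, upper_bound, rng):
    """One Metropolis update of a branch length (illustrative only)."""
    proposal = length + rng.gauss(0.0, step)  # symmetric normal proposal
    # Uniform prior on [0, upper_bound] (assumed): zero density outside it.
    if proposal < 0.0 or proposal > upper_bound:
        return length
    # Symmetric proposal and flat prior: accept with the likelihood ratio.
    log_ratio = log_likelihood(proposal) - log_likelihood(length)
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return length


def toy_log_likelihood(length):
    """Invented stand-in for a real tree likelihood, peaked at 0.3."""
    return -((length - 0.3) ** 2) / (2 * 0.05 ** 2)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, like the Random seed field
    length = 0.1
    for _ in range(1000):
        length = update_branch(length, toy_log_likelihood, 0.1, 10.0, rng)
    print(round(length, 3))
```

A larger step lets the chain move faster but lowers the acceptance rate, which is one reason a sampler might tune this parameter during the burnin.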


B Y P A S S R D E G R
Version 1.0.2
March, 2008

Contents

Overview
Installation notes
Quick Start
Examples